pandas matrix calculation till the diagonal

First of all here is a profiling of your code. First all commands separately, and then as you posted it.

%timeit df.list_of_value.explode()
%timeit pd.get_dummies(s)
%timeit s.sum(level=0)
%timeit s.dot(s.T)
%timeit s.sum(1)
%timeit s2.div(s3)

The above profiling returned the following results:

Explode   : 1000 loops, best of 3: 201 µs per loop
Dummies   : 1000 loops, best of 3: 697 µs per loop
Sum       : 1000 loops, best of 3: 1.36 ms per loop
Dot       : 1000 loops, best of 3: 453 µs per loop
Sum2      : 10000 loops, best of 3: 162 µs per loop
Divide    : 100 loops, best of 3: 1.81 ms per loop

Running Your two lines together results in:

100 loops, best of 3: 5.35 ms per loop

Using a different approach relying less on the (sometimes expensive) functionality of pandas, the code I created takes just about a third of the time by skipping the calculation for the upper triangular matrix and the diagonal as well.

import numpy as np

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(df), len(df)))
for i in range(len(df)):
    d0 = set(df.iloc[i].list_of_value)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(df)):
        df2[j, i] = len(d0.intersection(df.iloc[j].list_of_value)) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(df))])

With df given as

df = pd.DataFrame(
    [[['a','b','c']],
     [['d','b','c']],
     [['a','b','c']],
     [['a','b','c']]],
     columns = ["list_of_value"])

the profiling for this code results in a running time of only 1.68ms.

1000 loops, best of 3: 1.68 ms per loop

UPDATE

Instead of operating on the entire DataFrame, just picking the Series that is needed gives a huge speedup.

Three methods to iterate over the entries in the Series have been tested, and all of them are more or less equal regarding the performance.

%%timeit df = pd.DataFrame([[['a','b','c']], [['d','b','c']], [['a','b','c']], [['a','b','c']]], columns = ["list_of_value"])
# %%timeit df = pd.DataFrame([[random.choices(list("abcdefghijklmnopqrstuvwxyz"), k = 15)] for _ in range(100)], columns = ["list_of_value"])

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(df), len(df)))

# get the Series from the DataFrame
dfl = df.list_of_value

for i, d0 in enumerate(dfl.values):
# for i, d0 in dfl.iteritems():  # in terms of performance about equal to the line above
# for i in range(len(dfl)): # slightly less performant than enumerate(dfl.values)
    d0 = set(d0)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(dfl)):
        df2[j, i] = len(d0.intersection(dfl.iloc[j])) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])

There are a lot of pitfalls with pandas. E.g. always access the rows of a DataFrame or Series via df.iloc[0] instead of df[0]. Both works but df.iloc[0] is much faster.

The timings for the first matrix with 4 elements each with a list of size 3 resulted in a speedup of about 3 times as fast.

1000 loops, best of 3: 443 µs per loop

And when using a bigger dataset I got far better results with a speedup of over 11:

# operating on the DataFrame
10 loop, best of 3: 565 ms per loop

# operating on the Series
10 loops, best of 3: 47.7 ms per loop

UPDATE 2

When not using pandas at all (during the calculation), you get another significant speedup. Therefore you simply need to convert the column to operate on into a list.

%%timeit df = pd.DataFrame([[['a','b','c']], [['d','b','c']], [['a','b','c']], [['a','b','c']]], columns = ["list_of_value"])
# %%timeit df = pd.DataFrame([[random.choices(list("abcdefghijklmnopqrstuvwxyz"), k = 15)] for _ in range(100)], columns = ["list_of_value"])

# convert the column of the DataFrame to a list
dfl = list(df.list_of_value)

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(dfl), len(dfl)))

for i, d0 in enumerate(dfl):
    d0 = set(d0)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(dfl)):
        df2[j, i] = len(d0.intersection(dfl[j])) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])

On the data provided in the question we only see a slightly better result compared to the first update.

1000 loops, best of 3: 363 µs per loop

But when using bigger data (100 rows with lists of size 15) the advantage gets obvious:

100 loops, best of 3: 5.26 ms per loop

Here a comparison of all the suggested methods:

+----------+-----------------------------------------+
|          | Using the Dataset from the question     |
+----------+-----------------------------------------+
| Question | 100 loops, best of 3: 4.63 ms per loop  |
+----------+-----------------------------------------+
| Answer   | 1000 loops, best of 3: 1.59 ms per loop |
+----------+-----------------------------------------+
| Update 1 | 1000 loops, best of 3: 447 µs per loop  |
+----------+-----------------------------------------+
| Update 2 | 1000 loops, best of 3: 362 µs per loop  |
+----------+-----------------------------------------+

Although this question is well answered I will show a more readable and also very efficient alternative:

from itertools import product
len_df = df.shape[0]
values = tuple(map(lambda comb: np.isin(*comb).sum() / len(comb[0]),
         product(df['list_of_value'], repeat=2)))

pd.DataFrame(index=df['id'],
             columns=df['id'],
             data=np.array(values).reshape(len_df, len_df))

id         0         1         2         3
id                                        
0   1.000000  0.666667  1.000000  1.000000
1   0.666667  1.000000  0.666667  0.666667
2   1.000000  0.666667  1.000000  1.000000
3   1.000000  0.666667  1.000000  1.000000

%%timeit
len_df = df.shape[0]
values = tuple(map(lambda comb: np.isin(*comb).sum() / len(comb[0]),
         product(df['list_of_value'], repeat=2)))

pd.DataFrame(index=df['id'],
             columns=df['id'],
             data=np.array(values).reshape(len_df, len_df))

850 µs ± 18.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
#convert the column of the DataFrame to a list
dfl = list(df.list_of_value)

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(dfl), len(dfl)))

for i, d0 in enumerate(dfl):
    d0 = set(d0)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(dfl)):
        df2[j, i] = len(d0.intersection(dfl[j])) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])

470 µs ± 79.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I am not inclined to change your first line, although I'm sure it could be faster, because it's not going to be the bottleneck as your data gets larger. But the second line could be, and is also extremely easy to improve:

Change this:

s.dot(s.T).div(s.sum(1))

To:

arr=s.values
np.dot( arr, arr.T ) / arr[0].sum()

That's just doing it in numpy instead of pandas, but often you'll get a huge speedup. On your small, sample data it will only speed up by 2x, but if you increase your dataframe from 4 rows to 400 rows, then I see a speedup of over 20x.

As an aside, I would be inclined to not worry about the triangular aspect of the problem, at least as far as speed. You have to make the code considerably more complex and you probably aren't even gaining any speed in a situation like this.

Conversely, if conserving storage space is important, then obviously retaining only the upper (or lower) triangle will cut your storage needs by slightly more than half.

(If you really do care about the triangular aspect for dimensionality numpy does have related functions/methods but I don't know them offhand and, again, it's not clear to me if it's worth the extra complexity in this case.)

Tags:

Python

Pandas