Optimizing the cartesian product between two Pandas DataFrames

Of all the alternatives tested, the one that gave me the best results was the following:

  1. The cartesian product of the rows of both DataFrames was built with itertools.product().

  2. Every pair of rows coming out of the two iterrows() iterators was processed by a Pool of parallel processes (using its map function); a small illustration of the pairing follows below.
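
To make step 1 concrete, here is a minimal sketch of what the product of the two iterrows() iterators yields; the tiny DataFrames are purely illustrative, not the df1/df2 from the question:

import itertools
import pandas as pd

# Hypothetical toy frames; the real df1 and df2 come from the question.
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})

# itertools.product yields every combination of one row from df1 and one row
# from df2; each element is a ((index, Series), (index, Series)) pair.
for (i, row1), (j, row2) in itertools.product(df1.iterrows(), df2.iterrows()):
    print(i, j, row1.values, row2.values)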

To squeeze out a little more performance, the compute_row_cython function was compiled with Cython, as advised in this section of the Pandas documentation:

In the cython_modules.pyx file:

from scipy.stats import pearsonr
import numpy as np

def compute_row_cython(row):
    # row is one ((index, Series), (index, Series)) pair from itertools.product
    (df1_key, df1_values), (df2_key, df2_values) = row
    cdef (double, double) pearsonr_res = pearsonr(df1_values.values, df2_values.values)
    return df1_key, df2_key, pearsonr_res[0], pearsonr_res[1]

Then I set up the setup.py:

from distutils.core import setup
from Cython.Build import cythonize

setup(name='Compiled Pearson',
      ext_modules=cythonize("cython_modules.pyx"))

Finally I compiled it with: python setup.py build_ext --inplace
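
As an optional sanity check (not part of the original answer), the compiled extension can be imported and called on a single row pair; the small DataFrames here are again just illustrative:

import itertools
import pandas as pd
from cython_modules import compute_row_cython  # built by the command above

# Tiny illustrative frames; the real df1 and df2 come from the question.
df1 = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 6.0, 5.0]])
df2 = pd.DataFrame([[2.0, 1.0, 3.0], [6.0, 5.0, 4.0]])

# Take the first (row of df1, row of df2) pair and run the compiled worker on it.
first_pair = next(itertools.product(df1.iterrows(), df2.iterrows()))
print(compute_row_cython(first_pair))  # -> (0, 0, pearson_r, p_value)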

The final code then looked like this:

import itertools
import multiprocessing
from cython_modules import compute_row_cython

NUM_CORES = multiprocessing.cpu_count() - 1

pool = multiprocessing.Pool(NUM_CORES)
# Map the Cython worker from cython_modules.pyx over every (row of df1, row of df2)
# pair; zip(*...) transposes the result tuples into four parallel sequences.
res = zip(*pool.map(compute_row_cython, itertools.product(df1.iterrows(), df2.iterrows())))
pool.close()
end_values = list(res)
pool.join()
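
As a small follow-up (not part of the timing above), the four parallel tuples left in end_values after the zip(*...) transpose can be packed into a results DataFrame; the column names below are just an assumption for illustration:

import pandas as pd

# end_values holds four parallel tuples: df1 indices, df2 indices,
# correlation coefficients and p-values (one entry per row pair).
df1_keys, df2_keys, correlations, p_values = end_values
results = pd.DataFrame({
    'df1_key': df1_keys,
    'df2_key': df2_keys,
    'pearson_r': correlations,
    'p_value': p_values,
})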

Neither Dask nor the merge-plus-apply approach gave me better results, not even after optimizing the apply with Cython. In fact, both of those alternatives ran into memory errors, and when implementing the solution with Dask I had to generate several partitions, which degraded performance because of the many I/O operations involved.

The solution with Dask can be found in my other question.