What's the fastest way to acces a Pandas DataFrame?

Don't do iloc/loc/chained-indexing. Using the NumPy interface alone increases speed by ~180x. If you further remove element access, we can bump this to 180,000x.

fp = np.empty(shape = (146611, 10))
fp.fill(np.nan)

fp = pd.DataFrame(fp)

# this confirms how slow data access is on my computer
%timeit for idx in range(0, len(fp)): fp.iloc[idx, 0] = idx

1 loops, best of 3: 3min 9s per loop

# this accesses the underlying NumPy array, so you can directly set the data
%timeit for idx in range(0, len(fp)): fp.values[idx, 0] = idx

1 loops, best of 3: 1.19 s per loop

This is because there's extensive code that goes in the Python layer for this fancing indexing, taking ~10µs per loop. Using Pandas indexing should be done to retrieve entire subsets of data, which you then use to do vectorized operations on the entire dataframe. Individual element access is glacial: using Python dictionaries will give you a > 180 fold increase in performance.

Things get a lot better when you access columns or rows instead of individual elements: 3 orders of magnitude better.

# set all items in 1 go.
%timeit fp[0] = np.arange(146611)
1000 loops, best of 3: 814 µs per loop

Moral

Don't try to access individual elements via chained indexing, loc, or iloc. Generate a NumPy array in a single allocation, from a Python list (or a C-interface if performance is absolutely critical), and then perform operations on entire columns or dataframes.

Using NumPy arrays and performing operations directly on columns rather than individual elements, we got a whopping 180,000+ fold increase in performance. Not too shabby.

Edit

Comments from @kushy suggest Pandas may have optimized indexing in certain cases since I originally wrote this answer. Always profile your own code, and your mileage may vary.

Alexander's answer was the fastest for me as of 2020-01-06 when using .is_numpy() instead of .values. Tested in Jupyter Notebook on Windows 10. Pandas version = 0.24.2

import numpy as np 
import pandas as pd
fp = np.empty(shape = (146611, 10))
fp.fill(np.nan)
fp = pd.DataFrame(fp)
pd.__version__ # '0.24.2'

def func1():
    # Asker badmax solution
    for idx in range(0, len(fp)): 
        fp.iloc[idx, 0] = idx

def func2():
    # Alexander Huszagh solution 1
    for idx in range(0, len(fp)):
        fp.to_numpy()[idx, 0] = idx

def func3():
    # user4322543 answer to
    # https://stackoverflow.com/questions/34855859/is-there-a-way-in-pandas-to-use-previous-row-value-in-dataframe-apply-when-previ
    new = []
    for idx in range(0, len(fp)):
        new.append(idx)
    fp[0] = new

def func4():
    # Alexander Huszagh solution 2
    fp[0] = np.arange(146611)

%timeit func1
19.7 ns ± 1.08 ns per loop (mean ± std. dev. of 7 runs, 500000000 loops each)
%timeit func2
19.1 ns ± 0.465 ns per loop (mean ± std. dev. of 7 runs, 500000000 loops each)
%timeit func3
21.1 ns ± 3.26 ns per loop (mean ± std. dev. of 7 runs, 500000000 loops each)
%timeit func4
24.7 ns ± 0.889 ns per loop (mean ± std. dev. of 7 runs, 50000000 loops each)

What's the fastest way to acces a Pandas DataFrame?

Tags:

Python

Pandas

Related

Recent Posts