pandas - find first occurrence

I found there is first_valid_index function for Pandas DataFrames that will do the job, one could use it as follows:

df[df.A!='a'].first_valid_index()

3

However, this function seems to be very slow. Even taking the first index of the filtered dataframe is faster:

df.loc[df.A!='a','A'].index[0]

Below I compare the total time(sec) of repeating calculations 100 times for these two options and all the codes above:

                      total_time_sec    ratio wrt fastest algo
searchsorted numpy:        0.0007        1.00
argmax numpy:              0.0009        1.29
for loop:                  0.0045        6.43
searchsorted pandas:       0.0075       10.71
idxmax pandas:             0.0267       38.14
index[0]:                  0.0295       42.14
first_valid_index pandas:  0.1181      168.71

Notice numpy's searchsorted is the winner and first_valid_index shows worst performance. Generally, numpy algorithms are faster, and the for loop does not do so bad, but it's just because the dataframe has very few entries.

For a dataframe with 10,000 entries where the desired entries are closer to the end the results are different, with searchsorted delivering the best performance:

                     total_time_sec ratio wrt fastest algo
searchsorted numpy:        0.0007       1.00
searchsorted pandas:       0.0076      10.86
argmax numpy:              0.0117      16.71
index[0]:                  0.0815     116.43
idxmax pandas:             0.0904     129.14
first_valid_index pandas:  0.1691     241.57
for loop:                  9.6504   13786.29

The code to produce these results is below:

import timeit

# code snippet to be executed only once 
mysetup = '''import pandas as pd
import numpy as np
df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
'''

# code snippets whose execution time is to be measured   
mycode_set = ['''
df[df.A!='a'].first_valid_index()
''']
message = ["first_valid_index pandas:"]

mycode_set.append( '''df.loc[df.A!='a','A'].index[0]''')
message.append("index[0]: ")

mycode_set.append( '''df.A.ne('a').idxmax()''')
message.append("idxmax pandas: ")

mycode_set.append(  '''(df.A.values != 'a').argmax()''')
message.append("argmax numpy: ")

mycode_set.append( '''df.A.searchsorted('a', side='right')''')
message.append("searchsorted pandas: ")

mycode_set.append( '''df.A.values.searchsorted('a', side='right')''' )
message.append("searchsorted numpy: ")

mycode_set.append( '''for index in range(len(df['A'])):
    if df['A'][index] != 'a':
        ans = index
        break
        ''')
message.append("for loop: ")

total_time_in_sec = []
for i in range(len(mycode_set)):
    mycode = mycode_set[i]
    total_time_in_sec.append(np.round(timeit.timeit(setup = mysetup,\
         stmt = mycode, number = 100),4))

output = pd.DataFrame(total_time_in_sec, index = message, \
                      columns = ['total_time_sec' ])
output["ratio wrt fastest algo"] = \
np.round(output.total_time_sec/output["total_time_sec"].min(),2)

output = output.sort_values(by = "total_time_sec")
display(output)

For the larger dataframe:

mysetup = '''import pandas as pd
import numpy as np
n = 10000
lt = ['a' for _ in range(n)]
b = ['b' for _ in range(5)]
lt[-5:] = b
df = pd.DataFrame({"A":lt,"B":[1]*n})
'''

idxmax and argmax will return the position of the maximal value or the first position if the maximal value occurs more than once.

use idxmax on df.A.ne('a')

df.A.ne('a').idxmax()

3

or the numpy equivalent

(df.A.values != 'a').argmax()

3

However, if A has already been sorted, then we can use searchsorted

df.A.searchsorted('a', side='right')

array([3])

Or the numpy equivalent

df.A.values.searchsorted('a', side='right')

3

Tags:

Python

Pandas