Faster way to accomplish this Pandas job than by using Apply for large data set?

My shot at the problem:

  • extract all object_a's under a Parent Name where i) there are >1 object_a's and; ii) the object_a has 0 ticks but the other object_a has >0 ticks. i.e. just the one with zero ticks
  • extract all object_b's under a Parent Name where i) there is >=1 object_a and; ii) the object_b has 0 ticks but the object_a has >0 ticks

My first impression when reading this is that the actual "Type" doesn't really matter, we just want an existing object_a with >0 Ticks for each group, and extract all the elements with 0 ticks, regardless of their type.

Considering that, my approach was first to create a new column to count the number of object_a ticks for any parent. If this number is >0, it means that at least 1 object_a exists with Ticks>0.

In [63]: df.groupby(['Parent Name']).apply(lambda x: x[x['Type'] == 'object_a']['Ticks'].sum())
Out[63]: 
Parent Name
3217863     2
4556421    34
dtype: int64

Let's now merge that into the original DataFrame...

In [64]: sumATicks = df.groupby(['Parent Name']).apply(lambda x: x[x['Type'] == 'object_a']['Ticks'].sum())

In [65]: merged = df.merge(pd.DataFrame(sumATicks).rename(columns={0: 'nbATicks'}), left_on='Parent Name', right_index=True)

In [66]: merged
Out[66]: 
       Type  Parent Name  Ticks  nbATicks
0  object_a      4556421     34        34
1  object_a      4556421      0        34
2  object_b      4556421      0        34
3  object_a      3217863      2         2
4  object_b      3217863      1         2

...and extract all the interesting rows, according to the criteria I stated above:

In [67]: merged[(merged['nbATicks'] > 0) & (merged['Ticks'] == 0)]
Out[67]: 
       Type  Parent Name  Ticks  nbATicks
1  object_a      4556421      0        34
2  object_b      4556421      0        34

Hopefully I didn't forget any fringe case...

Regarding the chunk problem, why don't you just load the whole csv file in memory ? If it's that big, you can try sorting by ParentName before processing, and splitting the chunks at relevant places.


Here comes my idea for the problem:

I think the first objective is easier because we only depend on rows with object_a. We can use transform to convert the conditions into boolean list:

df_1 = df.loc[df['Type']=='object_a']
object_a = df_1.loc[(df_1.groupby('Parent_Name')['Ticks'].transform(min)==0)&
                    (df_1.groupby('Parent_Name')['Ticks'].transform(max)>0)&
                    (a['Ticks']==0)
                   ]
Out[1]: 
       Type  Parent_Name  Ticks
1  object_a      4556421      0

For the second objective i create a list of Parent_Names meeting the requirements for object_a. In the next step isin is used to selected only the corresponding rows.

a_condition = df.loc[df['Type']=='object_a'].groupby('Parent_Name').sum()
a_condition = a_condition[a_condition>0].index

object_b = df.loc[(df['Type']=='object_b')&
                  (df['Ticks']==0)&
                  (df['Parent_Name'].isin(a_condition))
                 ]
Out[2]: 
       Type  Parent_Name  Ticks
2  object_b      4556421      0

In [35]: df
Out[32]: 
       Type         Parent Name   Ticks
0  object_a             4556421      34
1  object_a             4556421       0
2  object_b             4556421       0
3  object_a             3217863       2
4  object_b             3217863       1

Aggregate the data into tuples

In [33]: df1 = df.groupby(['Parent Name',
                           'Type']).agg(lambda x: tuple(x)).unstack(1)

In [34]: df1
Out[34]: 
                      Ticks         
Type               object_a object_b
       Parent Name                  
3217863                (2,)     (1,)
4556421             (34, 0)     (0,)

Build the Boolean mask for your case #1

In [35]: mask1 = df1.apply(lambda x: (len(x[0])>1) & ((x[0]).count(0)==1), 
                           axis=1)

In [36]: mask1
Out[36]: 
       Parent Name
3217863    False
4556421     True
dtype: bool

Build Boolean mask for your case #2

In [37]: mask2 = df1.apply(lambda x: ((len(x[0])>=1) & 
                                      (len(set(x[0]).difference([0]))>0) &
                                      (len(x[1])==1) & 
                                      (x[1][0]==0)),
                           axis=1)

In [38]: mask2
Out[38]: 
       Parent Name
3217863    False
4556421     True
dtype: bool

Get the result for case #1

In [39]: df1.loc[mask1, [('Ticks', 'object_a')]]
Out[39]: 
                      Ticks
Type               object_a
       Parent Name         
4556421             (34, 0)

Get the result for case #2

In [30]: df1.loc[mask2, [('Ticks', 'object_b')]]
Out[30]: 
                      Ticks
Type               object_b
       Parent Name         
4556421                (0,)

Tags:

Python

Pandas