Computing diffs within groups of a dataframe

wouldn't be just easier to do what yourself describe, namely

df.sort(['ticker', 'date'], inplace=True)
df['diffs'] = df['value'].diff()

and then correct for borders:

mask = df.ticker != df.ticker.shift(1)
df['diffs'][mask] = np.nan

to maintain the original index you may do idx = df.index in the beginning, and then at the end you can do df.reindex(idx), or if it is a huge dataframe, perform the operations on

df.filter(['ticker', 'date', 'value'])

and then join the two dataframes at the end.

edit: alternatively, ( though still not using groupby )

df.set_index(['ticker','date'], inplace=True)
df.sort_index(inplace=True)
df['diffs'] = np.nan 

for idx in df.index.levels[0]:
    df.diffs[idx] = df.value[idx].diff()

for

   date ticker  value
0    63      C   1.65
1    88      C  -1.93
2    22      C  -1.29
3    76      A  -0.79
4    72      B  -1.24
5    34      A  -0.23
6    92      B   2.43
7    22      A   0.55
8    32      A  -2.50
9    59      B  -1.01

this will produce:

             value  diffs
ticker date              
A      22     0.55    NaN
       32    -2.50  -3.05
       34    -0.23   2.27
       76    -0.79  -0.56
B      59    -1.01    NaN
       72    -1.24  -0.23
       92     2.43   3.67
C      22    -1.29    NaN
       63     1.65   2.94
       88    -1.93  -3.58

I know this is an old question, so I'm assuming this functionality didn't exist at the time. But for those with this question now, this solution works well:

df.sort_values(['ticker', 'date'], inplace=True)
df['diffs'] = df.groupby('ticker')['value'].diff()

In order to return to the original order, you can the use

df.sort_index(inplace=True)

# Make sure your data is sorted properly
df = df.sort_values(by=['group_var', 'value'])

# only take diffs where next row is of the same group
df['diffs'] = np.where(df.group_var == df.group_var.shift(1), df.value.diff(), 0)

Explanation: enter image description here


Ok. Lots of thinking about this, and I think this is my favorite combination of the solutions above and a bit of playing around. Original data lives in df:

df.sort(['ticker', 'date'], inplace=True)

# for this example, with diff, I think this syntax is a bit clunky
# but for more general examples, this should be good.  But can we do better?
df['diffs'] = df.groupby(['ticker'])['value'].transform(lambda x: x.diff()) 

df.sort_index(inplace=True)

This will accomplish everything I want. And what I really like is that it can be generalized to cases where you want to apply a function more intricate than diff. In particular, you could do things like lambda x: pd.rolling_mean(x, 20, 20) to make a column of rolling means where you don't need to worry about each ticker's data being corrupted by that of any other ticker (groupby takes care of that for you...).

So here's the question I'm left with...why doesn't the following work for the line that starts df['diffs']:

df['diffs'] = df.groupby[('ticker')]['value'].transform(np.diff)

when I do that, I get a diffs column full of 0's. Any thoughts on that?

Tags:

Python

Pandas