Does groupby in pandas create a copy of the data or just a view?

The groupby code in pandas is fairly complex, so it's hard to work this out from first principles. A quick test suggests that memory use grows as the data grows and that more groups means more memory, but it doesn't appear to be making a full copy or anything:

In [4]: import numpy as np

In [5]: import pandas as pd

In [6]: %load_ext memory_profiler

In [7]: df = pd.DataFrame(np.random.random((1000,5)))

In [8]: def ret_df(df):
   ...:     return df

In [9]: def ret_gb_df(df):
   ...:     return df, df.groupby(0).mean()

In [10]: %memit ret_df(df)
peak memory: 75.91 MiB, increment: 0.00 MiB

In [11]: %memit ret_gb_df(df)
peak memory: 75.96 MiB, increment: 0.05 MiB

In [12]: df = pd.DataFrame(np.random.random((100000,5)))

In [13]: %memit ret_df(df)
peak memory: 79.76 MiB, increment: -0.02 MiB

In [14]: %memit ret_gb_df(df)
peak memory: 94.88 MiB, increment: 15.12 MiB

In [15]: df = pd.DataFrame(np.random.random((1000000,5)))

In [16]: %memit ret_df(df)
peak memory: 113.98 MiB, increment: 0.01 MiB

In [17]: %memit ret_gb_df(df)
peak memory: 263.14 MiB, increment: 149.16 MiB

In [18]: df = pd.DataFrame(np.random.choice([0,1,2,3], (1000000, 5)))

In [19]: %memit ret_df(df)
peak memory: 95.34 MiB, increment: 0.00 MiB

In [20]: %memit ret_gb_df(df)
peak memory: 166.91 MiB, increment: 71.56 MiB
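
If you don't have memory_profiler installed, a rough cross-check is possible with the standard-library tracemalloc module. This is only a sketch: traced_peak_bytes is a helper name I made up for illustration, the frame sizes are arbitrary, and tracemalloc only counts allocations that are reported to it, so the numbers won't line up with what %memit prints.

```python
import tracemalloc

import numpy as np
import pandas as pd


def traced_peak_bytes(func):
    """Run func() and return the peak traced allocation in bytes (hypothetical helper)."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


rng = np.random.default_rng(0)

# Column 0 has only 4 distinct values, so there are at most 4 groups.
few_groups = pd.DataFrame(rng.integers(0, 4, size=(100_000, 5)))
# Column 0 is random floats, so nearly every row is its own group.
many_groups = pd.DataFrame(rng.random((100_000, 5)))

peak_few = traced_peak_bytes(lambda: few_groups.groupby(0).mean())
peak_many = traced_peak_bytes(lambda: many_groups.groupby(0).mean())

print(f"few groups:  {peak_few / 2**20:.2f} MiB peak")
print(f"many groups: {peak_many / 2**20:.2f} MiB peak")
```

Comparing the two printed peaks against the size of the input frame gives a feel for how much extra memory the aggregation needs in each case.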

I did a little more research on this since someone asked me to help them with this question, and the pandas source code has been revised somewhat since the accepted answer was written.

As far as I can tell from the source code:

Groupby exposes its groups through a Grouper object (i.e. Grouper.groups), and a Grouper is described as “a specification for a groupby instruction”.

Ok, so what does that mean?

“Groupers are ultimately index mappings.”
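
To make "index mappings" concrete, here is a small sketch using the public .groups and .indices attributes of a groupby object. The frame and the column names key and val are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"key": ["a", "b", "a", "c", "b"], "val": [1, 2, 3, 4, 5]}
)
gb = df.groupby("key")

# .groups maps each group label to the index labels of the rows in that group.
print(gb.groups)

# .indices maps each group label to the integer positions of those rows.
print(gb.indices)

# A single group can be rebuilt from the mapping alone.
rows_a = df.iloc[gb.indices["a"]]
print(rows_a)
```

Creating gb here does not aggregate anything yet; the mapping says which rows belong to which group, and the data itself is only pulled out when you ask for a group or an aggregate.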

I've always thought of this as meaning that groupby is creating a new object. It's not a full copy of the original dataframe, because you're performing selections and aggregations. So it's more like a transformation in that sense.

If your definition of a view is like this: "A view is nothing more than a SQL statement that is stored in the database with an associated name. A view is actually a composition of a table in the form of a predefined SQL query", then I'm wondering if what you're really asking is whether the groupby operation has to be re-applied each time you execute the same grouping on the same dataframe?

If that's what you're asking, I'd say the answer is no, it's not like a view, as long as you store the result of the grouping operation. The output object of a grouped dataframe or series is a (new) dataframe or series.
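
A quick way to check this yourself is to store an aggregate, change the original dataframe, and see that the stored result does not follow. This is just a sketch with a toy frame; key and val are made-up column names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a"], "val": [1.0, 2.0, 3.0]})

# Store the result of the grouping operation.
means = df.groupby("key")["val"].mean()

# Change the original data afterwards.
df.loc[0, "val"] = 100.0

# The stored result is a separate object and keeps its old values.
print(means)

# Re-running the groupby is what picks up the new data.
fresh = df.groupby("key")["val"].mean()
print(fresh)

# The stored result does not share a buffer with the source column.
shares = np.shares_memory(means.to_numpy(), df["val"].to_numpy())
print(shares)
```

So the stored result behaves like a snapshot rather than a SQL-style view: it is not recomputed when the underlying dataframe changes.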


I know the original question was about memory usage, but for people coming to this question looking for whether modifications to the group chunk affect the original dataframe, the pandas groupby user guide says:

Group chunks should be treated as immutable, and changes to a group chunk may produce unexpected results.
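
In practice, one way to stay on the safe side is to take an explicit copy of each chunk before changing it and then build a new dataframe from the pieces. This is only a sketch of that pattern, not something the user guide prescribes; the frame and the column names key and val are made up.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a"], "val": [1, 2, 3]})

pieces = []
for label, chunk in df.groupby("key"):
    # Work on an explicit copy so the chunk itself is never mutated.
    chunk = chunk.copy()
    chunk["val"] = chunk["val"] * 10
    pieces.append(chunk)

# Reassemble the modified pieces in the original row order.
scaled = pd.concat(pieces).sort_index()

print(df)      # original frame, untouched
print(scaled)  # new frame with the modified values
```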
