When does Pandas default to broadcasting Series and Dataframes?

What is happening is that pandas uses intrinsic data alignment. Pandas almost always aligns data on its labels, either the row index or the column headers. Here is a quick example:

import pandas as pd

s1 = pd.Series([1,2,3], index=['a','b','c'])
s2 = pd.Series([2,4,6], index=['a','b','c'])
s1 + s2
#Output as expected:
a    3
b    6
c    9
dtype: int64
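
To see that the addition pairs values by label rather than by position, here is a small sketch. The reordered index is my own illustration and is not taken from the question:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
# Same labels as s1, but listed in reverse order (illustrative values).
s2_reordered = pd.Series([2, 4, 6], index=['c', 'b', 'a'])

# Values are paired by label: a -> 1 + 6, b -> 2 + 4, c -> 3 + 2.
by_label = s1 + s2_reordered

# Dropping to NumPy arrays pairs values by position instead: 1 + 2, 2 + 4, 3 + 6.
by_position = s1.to_numpy() + s2_reordered.to_numpy()

print(by_label)
print(by_position)
```

If you really want positional arithmetic, converting to arrays with `.to_numpy()` is one way to opt out of alignment.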

Now, let's run a couple of other examples with different indexing:

s2 = pd.Series([2,4,6], index=['a','a','c'])
s1 + s2
#Output
a    3.0
a    5.0
b    NaN
c    9.0
dtype: float64

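With duplicated index labels, pandas pairs every occurrence of a label on the left with every occurrence of it on the right (a cartesian product). A label that exists in only one Series, like 'b' here, yields NaN, because NaN + value = NaN.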
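
The cartesian product is easier to spot when the label is duplicated on both sides. A minimal sketch with made-up values (the names `left`, `right` and `total` are mine):

```python
import pandas as pd

# Illustrative data: label 'a' appears twice in BOTH series.
left = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
right = pd.Series([10, 20, 30], index=['a', 'a', 'c'])

total = left + right

# Two 'a' rows on the left times two 'a' rows on the right gives four 'a' rows,
# plus one NaN row each for the unmatched labels 'b' and 'c'.
print(total)
print(len(total))
```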

And with no matching indexes:

s2 = pd.Series([2,4,6], index=['e','f','g'])
s1 + s2
#Output
a   NaN
b   NaN
c   NaN
e   NaN
f   NaN
g   NaN
dtype: float64
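
You can make the intermediate step visible with `Series.align`, which performs the same outer join on labels. The sketch below also shows `Series.add` with `fill_value=0`, which is one possible way to keep the unmatched values instead of getting NaN. Whether that is appropriate depends on your data:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([2, 4, 6], index=['e', 'f', 'g'])

# align() reindexes both series to the union of their labels.
left, right = s1.align(s2)
print(left)   # a, b, c keep their values; e, f, g are NaN
print(right)  # a, b, c are NaN; e, f, g keep their values

# fill_value=0 substitutes 0 for a label that is missing on one side only.
filled = s1.add(s2, fill_value=0)
print(filled)
```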

So, in your first example you are creating a pd.Series and a pd.DataFrame with default range indexes that match, hence the comparison happens as expected. In your second example, you are comparing the column headers ['cell2','cell3','cell4','cell5'] with the default range index. Nothing matches, so the result contains all 15 columns and every value is False, because a comparison with NaN returns False.

Bottom line: pandas compares each series value to the column whose title matches that value's index label. The indices in your second example are 0..10 and the column names are cell1..4, so no column name matches, and new columns just get appended. This essentially treats the series as a dataframe with the index as the column titles.
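
A rough sketch of that label matching, using made-up stand-ins for the question's data (a 4-column frame and an 11-element series; the values are my assumption). It calls `DataFrame.align` explicitly, because that is the step the comparison relies on, and as far as I know recent pandas releases raise an error instead of aligning implicitly for comparisons:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's objects, not the original values.
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['cell2', 'cell3', 'cell4', 'cell5'])
ser = pd.Series(range(11))  # default index 0..10

# Outer-join the column labels with the series index.
left, right = df.align(ser, axis=1)

# No label is shared, so every comparison involves a NaN on one side.
result = left > right
print(result.shape)             # 4 + 11 = 15 columns
print(result.to_numpy().any())  # False: nothing compares as greater
```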

You can actually see part of what pandas does in your first example if you make your series longer than the number of columns:

>>> import numpy as np
>>> import pandas as pd
>>> my_ser = pd.Series(np.random.randint(0, 100, size=20))
>>> my_df
    0   1   2   3   4
0   9  10  27  45  71
1  39  61  85  97  44
2  34  34  88  33   5
3  36   0  75  34  69
4  53  80  62   8  61
5   1  81  35  91  40
6  36  48  25  67  35
7  30  29  33  18  17
8  93  84   2  69  12
9  44  66  91  85  39
>>> my_ser
0     92
1     36
2     25
3     32
4     42
5     14
6     86
7     28
8     20
9     82
10    68
11    22
12    99
13    83
14     7
15    72
16    61
17    13
18     5
19     0
dtype: int64
>>> my_ser>my_df
      0      1      2      3      4      5      6      7      8      9   \
0   True   True  False  False  False  False  False  False  False  False
1   True  False  False  False  False  False  False  False  False  False
2   True   True  False  False   True  False  False  False  False  False
3   True   True  False  False  False  False  False  False  False  False
4   True  False  False   True  False  False  False  False  False  False
5   True  False  False  False   True  False  False  False  False  False
6   True  False  False  False   True  False  False  False  False  False
7   True   True  False   True   True  False  False  False  False  False
8  False  False   True  False   True  False  False  False  False  False
9   True  False  False  False   True  False  False  False  False  False

      10     11     12     13     14     15     16     17     18     19
0  False  False  False  False  False  False  False  False  False  False
1  False  False  False  False  False  False  False  False  False  False
2  False  False  False  False  False  False  False  False  False  False
3  False  False  False  False  False  False  False  False  False  False
4  False  False  False  False  False  False  False  False  False  False
5  False  False  False  False  False  False  False  False  False  False
6  False  False  False  False  False  False  False  False  False  False
7  False  False  False  False  False  False  False  False  False  False
8  False  False  False  False  False  False  False  False  False  False
9  False  False  False  False  False  False  False  False  False  False

Note what is happening: 92 is compared to the first column, so you get a single False, at 93. Then 36 is compared to the second column, and so on. If the length of your series matches the number of columns, you get the expected behavior.
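
Here is a small deterministic version of the matched case, with a few values borrowed from the output above. The names `by_label`, `by_position` and `by_row` are mine. When the series index equals the column labels, the result is what NumPy row broadcasting would give:

```python
import pandas as pd

df = pd.DataFrame([[9, 10, 27], [39, 61, 85], [93, 84, 2]])
ser = pd.Series([92, 36, 25])  # index 0..2 matches the columns 0..2

# ser[0] is compared with every value in column 0, ser[1] with column 1, ...
by_label = ser > df

# The same comparison done positionally with NumPy broadcasting.
by_position = ser.to_numpy() > df.to_numpy()

# To match the series against the ROW index instead, pass axis=0.
by_row = df.lt(ser, axis=0)

print(by_label)
print(by_row)
```

The `axis=0` variant is only meaningful when the series index lines up with the row labels, as it does here.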

But what happens when your series is longer? Well, pandas needs to append new fake columns to the data frame to continue the comparison. What are they filled with? I found no documentation at the time, but the extra columns come out as all False. That is consistent with alignment filling them with NaN, since any ordering comparison against NaN is False. Hence you get extra columns to match the series length, all False.
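
One way to check what the extra columns actually hold is to do the alignment by hand. This is a sketch with made-up values, using `DataFrame.align` as a stand-in for the implicit step (newer pandas versions may refuse to align implicitly in comparisons):

```python
import pandas as pd

df = pd.DataFrame([[9, 10, 27], [39, 61, 85]])  # columns 0..2
ser = pd.Series([92, 36, 25, 32, 42])           # index 0..4, two labels too many

left, right = df.align(ser, axis=1)
print(left)  # columns 3 and 4 now exist and hold NaN

greater = left > right
not_equal = left != right

print(greater[[3, 4]])    # all False: NaN > x is False
print(not_equal[[3, 4]])  # all True: NaN != x is True
```

The `!=` result is the giveaway: if the padding were a literal False fill, it would not flip to True there.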

But what about your example? You do not get 11 columns, but 4+11=15! Let's make another test:

>>> my_df = pd.DataFrame(np.random.randint(0, 100, size=100).reshape(10,10),columns=[chr(ord('a') + i) for i in range(10)])
>>> my_ser = pd.Series(np.random.randint(0, 100, size=10))
>>> (my_df>my_ser).shape
(10, 20)

This time we got the sum of the dimensions, 10+10=20, as the number of output columns!

What was the difference? Pandas compares each series index label with the matching column title. In your first example, the index of my_ser and the column titles of my_df matched, so it compared them. If there are extra columns, the above is what happens. If all the columns have names different from the series indices, then all the columns are extra, and you get your result. The same happens in my example, where the titles are now characters and the index is integers.
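
If the goal is a plain column-by-column comparison, two possible ways around the label mismatch are sketched below. The frame and the per-column thresholds in `values` are hypothetical; pick whichever option fits how your data is labelled:

```python
import numpy as np
import pandas as pd

# Hypothetical frame and one threshold per column.
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['cell2', 'cell3', 'cell4', 'cell5'])
values = [5, 5, 5, 5]

# Option 1: give the series the frame's own column labels so they match.
ser = pd.Series(values, index=df.columns)
labelled = df > ser

# Option 2: ignore labels and rely on NumPy broadcasting, then re-wrap.
positional = pd.DataFrame(df.to_numpy() > np.asarray(values),
                          index=df.index, columns=df.columns)

print(labelled)
print(labelled.equals(positional))
```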
