How do I count duplicate rows in a pandas DataFrame?

You can groupby on all the columns and call size; the index indicates the duplicate values:

In [28]:
df.groupby(df.columns.tolist()).size()

Out[28]:
one    three  two  
False  False  True     1
True   False  False    2
       True   True     1
dtype: int64
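The question's df isn't shown here; for reference, a small made-up frame like this (hypothetical data) would produce output of that shape:

import pandas as pd

# hypothetical data: the first two rows are exact duplicates
df = pd.DataFrame({
    'one':   [True, True, True, False],
    'three': [False, False, True, False],
    'two':   [False, False, True, True],
})

df.groupby(df.columns.tolist()).size()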

If you want a flat DataFrame with the count in a named column, reset the index:

df.groupby(df.columns.tolist()).size().reset_index(name='records')

   one  two  records
0    1    1        2
1    1    2        1
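On pandas 1.1 or later you can also use DataFrame.value_counts, which counts unique rows directly and is equivalent to the groupby/size approach (sorted by count, descending):

df.value_counts().reset_index(name='records')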

Specific to your question, as the others mentioned, a fast and easy way would be:

df.groupby(df.columns.tolist(), as_index=False).size()

If you'd like to count duplicates in particular column(s):

len(df['one']) - len(df['one'].drop_duplicates())

If you want to count duplicates in the entire DataFrame:

len(df) - len(df.drop_duplicates())
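As a quick sanity check of both counts, with a small made-up frame where rows 0 and 1 are exact duplicates:

import pandas as pd

df = pd.DataFrame({'one': [1, 1, 2], 'two': [1, 1, 2]})

len(df['one']) - len(df['one'].drop_duplicates())  # 1
len(df) - len(df.drop_duplicates())                # 1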

Or you can simply use DataFrame.duplicated(subset=None, keep='first'):

df.duplicated(subset='one', keep='first').sum()

where

subset : column label or sequence of labels (by default, use all of the columns)

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Mark duplicates as True except for the first occurrence.
  • last : Mark duplicates as True except for the last occurrence.
  • False : Mark all duplicates as True.
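The choice of keep changes what gets counted. With a hypothetical column holding [1, 1, 1, 2]:

df = pd.DataFrame({'one': [1, 1, 1, 2]})

df.duplicated(subset='one', keep='first').sum()  # 2: all but the first 1
df.duplicated(subset='one', keep='last').sum()   # 2: all but the last 1
df.duplicated(subset='one', keep=False).sum()    # 3: every 1 is marked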

I use:

used_features = [
    "one",
    "two",
    "three"
]

df['is_duplicated'] = df.duplicated(used_features)
df['is_duplicated'].sum()

which gives the count of duplicated rows, and then you can analyse them via the new column. I didn't see such a solution here.
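For instance, once the flag column exists you can pull the duplicated rows out for inspection (the names follow the snippet above):

# rows flagged as duplicates of an earlier row
dupes = df[df['is_duplicated']]

# or view each duplicate together with its first occurrence
df[df.duplicated(used_features, keep=False)]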

Tags: Python, Pandas