How to "select distinct" across multiple data frame columns in pandas?

To solve a similar problem, I'm using groupby:

print(f"Distinct entries: {len(df.groupby(['col1', 'col2']))}")

Whether that's appropriate depends on what you want to do with the result (in my case, I only wanted the equivalent of COUNT DISTINCT, as shown).
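As a rough sketch of that COUNT DISTINCT idea, here are two ways to get the number of distinct (col1, col2) pairs. The sample data is made up for illustration; only the column names come from the snippet above. One caveat: groupby drops rows with NaN in the key columns by default, so the two counts can differ when NaNs are present.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the answer above.
df = pd.DataFrame({"col1": [1, 2, 1, 2], "col2": [3, 4, 3, 5]})

# Number of distinct (col1, col2) pairs via groupby.
# Note: groupby excludes NaN keys by default (dropna=True).
n_groupby = df.groupby(["col1", "col2"]).ngroups

# Same count via drop_duplicates, which treats NaN as an ordinary value.
n_dedup = len(df[["col1", "col2"]].drop_duplicates())

print(f"Distinct entries: {n_groupby} (groupby), {n_dedup} (drop_duplicates)")
```

If you need the distinct rows themselves rather than just the count, drop_duplicates is probably the more direct choice.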


I've tried different solutions. First was:

import numpy as np
a_df = np.unique(df[['col1', 'col2']], axis=0)  # returns a NumPy array, not a DataFrame

This works well for non-object data. Another way, which also avoids the error raised for object-dtype columns, is to use drop_duplicates():

a_df=df.drop_duplicates(['col1','col2'])[['col1','col2']]

You can also use SQL to do this, but it was very slow in my case:

from pandasql import sqldf
q="""SELECT DISTINCT col1, col2 FROM df;"""
pysqldf = lambda q: sqldf(q, globals())
a_df = pysqldf(q)

You can use the drop_duplicates method to get the unique rows in a DataFrame:

In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})

In [30]: df
Out[30]:
   a  b
0  1  3
1  2  4
2  1  3
3  2  5

In [32]: df.drop_duplicates()
Out[32]:
   a  b
0  1  3
1  2  4
3  2  5

You can also provide the subset keyword argument if you only want to use certain columns to determine uniqueness. See the docstring.
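For example, here is a minimal sketch of the subset argument using the same frame as above. The variable names by_a and distinct_ab are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [3, 4, 3, 5]})

# Judge uniqueness on column 'a' only; by default the first
# occurrence of each value is kept (keep="first").
by_a = df.drop_duplicates(subset=["a"])

# Rough equivalent of SELECT DISTINCT a, b: select the columns,
# then drop duplicate rows and renumber the index.
distinct_ab = df[["a", "b"]].drop_duplicates().reset_index(drop=True)

print(by_a)
print(distinct_ab)
```

Note that subset keeps all columns of the surviving rows, so it answers "one row per value of a" rather than "distinct values of a". For the latter, select the columns first.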