# Remove duplicates based on combination of two columns in Pandas

## Remove duplicates based on combination of two columns in Pandas

Contents

Problem Description:

I need to delete duplicated rows based on combination of two columns (person1 and person2 columns) which have strings.
For example person1: ryan and person2: delta or person 1: delta and person2: ryan is same and provides the same value in messages column. Need to drop one of these two rows. Return the non duplicated rows as well.

``````Code to recreate df
df = pd.DataFrame({"": [0,1,2,3,4,5,6],
"person1": ["ryan", "delta", "delta", "delta","bravo","alpha","ryan"],
"person2": ["delta", "ryan", "alpha", "bravo","delta","ryan","alpha"],
"messages": [1, 1, 2, 3,3,9,9]})
``````
`````` df
person1 person2 messages
0   0   ryan    delta   1
1   1   delta   ryan    1
2   2   delta   alpha   2
3   3   delta   bravo   3
4   4   bravo   delta   3
5   5   alpha   ryan    9
6   6   ryan    alpha   9
``````

Answer df should be:

`````` finaldf
person1 person2 messages
0   0   ryan    delta   1
1   2   delta   alpha   2
2   3   delta   bravo   3
3   5   alpha   ryan    9
``````

## Solution – 1

Try as follows:

``````res = (df[~df.filter(like='person').apply(frozenset, axis=1).duplicated()]
.reset_index(drop=True))

print(res)

person1 person2  messages
0  0    ryan   delta         1
1  2   delta   alpha         2
2  3   delta   bravo         3
3  5   alpha    ryan         9
``````

Explanation

``````0     (ryan, delta)
1     (ryan, delta)
2    (alpha, delta)
3    (bravo, delta)
4    (bravo, delta)
5     (alpha, ryan)
6     (alpha, ryan)
dtype: object
``````

## Solution – 2

Here’s a less general approach than the one given by @ouroboros1, this only works for your two columns case

``````#make a Series of strings of min of p1/p2 concat to max of p1/p2
sorted_p1p2 = df[['person1','person2']].min(axis=1)+'_'+df[['person1','person2']].max(axis=1)

#subset to non-dup from the Series
dedup_df = df[~sorted_p1p2.duplicated()]
``````

## Solution – 3

You can put the two person columns in order within each row, then drop duplicates.

``````import pandas as pd

df = pd.DataFrame({"": [0,1,2,3,4,5,6],
"person1": ["ryan", "delta", "delta", "delta","bravo","alpha","ryan"],
"person2": ["delta", "ryan", "alpha", "bravo","delta","ryan","alpha"],
"messages": [1, 1, 2, 3,3,9,9]})

print(df)
swap = df['person1'] < df['person2']
df.loc[swap, ['person1', 'person2']] = df.loc[swap, ['person2', 'person1']].values

df = df.drop_duplicates(subset=['person1', 'person2'])

print(df)
``````

After the swap:

``````     person1 person2  messages
0  0    ryan   delta         1
1  1    ryan   delta         1
2  2   delta   alpha         2
3  3   delta   bravo         3
4  4   delta   bravo         3
5  5    ryan   alpha         9
6  6    ryan   alpha         9
``````

After dropping duplicates:

``````     person1 person2  messages
0  0    ryan   delta         1
2  2   delta   alpha         2
3  3   delta   bravo         3
5  5    ryan   alpha         9
``````
Rate this post
We use cookies in order to give you the best possible experience on our website. By continuing to use this site, you agree to our use of cookies.