
Remove duplicate rows from DataFrame in Pandas

Suppose you are working with a Pandas DataFrame in which some rows are exact copies of others, and you want to remove the copies. The code examples in this post show how to do that.
import pandas as pd

# create a dataframe
df = pd.DataFrame({
  "values": [1, 2, 1, 4, 2, 5],
  "alphabets": ["A", "B", "A", "C", "B", "D"],
  "numbers": [10, 20, 10, 40, 20, 50]
})
print(df)

# delete duplicate rows
df = df.drop_duplicates()

print(df)

Output

╒════╤══════════╤═════════════╤═══════════╕
│    │   values │ alphabets   │   numbers │
╞════╪══════════╪═════════════╪═══════════╡
│  0 │        1 │ A           │        10 │
├────┼──────────┼─────────────┼───────────┤
│  1 │        2 │ B           │        20 │
├────┼──────────┼─────────────┼───────────┤
│  2 │        1 │ A           │        10 │
├────┼──────────┼─────────────┼───────────┤
│  3 │        4 │ C           │        40 │
├────┼──────────┼─────────────┼───────────┤
│  4 │        2 │ B           │        20 │
├────┼──────────┼─────────────┼───────────┤
│  5 │        5 │ D           │        50 │
╘════╧══════════╧═════════════╧═══════════╛

╒════╤══════════╤═════════════╤═══════════╕
│    │   values │ alphabets   │   numbers │
╞════╪══════════╪═════════════╪═══════════╡
│  0 │        1 │ A           │        10 │
├────┼──────────┼─────────────┼───────────┤
│  1 │        2 │ B           │        20 │
├────┼──────────┼─────────────┼───────────┤
│  3 │        4 │ C           │        40 │
├────┼──────────┼─────────────┼───────────┤
│  5 │        5 │ D           │        50 │
╘════╧══════════╧═════════════╧═══════════╛

If you do not want to reassign the DataFrame, pass inplace=True to drop_duplicates(). The method then modifies df directly and returns None, so you can use

df.drop_duplicates(inplace=True)

in place of

df = df.drop_duplicates()
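As a minimal sketch of the difference, the snippet below reuses the DataFrame from the first example and captures the return value of the in-place call. The variable name result is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
  "values": [1, 2, 1, 4, 2, 5],
  "alphabets": ["A", "B", "A", "C", "B", "D"],
  "numbers": [10, 20, 10, 40, 20, 50]
})

# inplace=True changes df itself; the call returns None
result = df.drop_duplicates(inplace=True)

print(result)             # None
print(df.index.tolist())  # [0, 1, 3, 5]
```

Because the in-place call returns None, writing df = df.drop_duplicates(inplace=True) would replace the DataFrame with None.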

Remove duplicate rows from DataFrame using drop_duplicates() function

Pandas has a built-in method, drop_duplicates(), that removes duplicate rows from a DataFrame. By default, it compares all columns and keeps the first occurrence of each duplicated row, but you can restrict the comparison to specific columns with the subset parameter.

By default, the inplace parameter is False, which means drop_duplicates() returns a new DataFrame that you have to assign to a variable. To modify the original DataFrame instead, use inplace=True.
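To see which rows count as duplicates before removing anything, one option is the related duplicated() method, which this post does not otherwise cover. The sketch below uses a small made-up DataFrame and compares the default all-column check with a subset check.

```python
import pandas as pd

# small illustrative DataFrame (not taken from the examples in this post)
df = pd.DataFrame({
  "product": ["P1", "P2", "P3", "P1"],
  "price": [100, 130, 130, 100]
})

# True marks a row that repeats an earlier row across all columns
mask_all = df.duplicated()
# with subset, only the listed columns are compared
mask_price = df.duplicated(subset=["price"])

print(mask_all.tolist())    # [False, False, False, True]
print(mask_price.tolist())  # [False, False, True, True]

# for this data, filtering with the mask matches drop_duplicates()
print(df[~mask_all].equals(df.drop_duplicates()))  # True
```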

# Delete duplicate rows based on all columns

import pandas as pd

# create a dataframe
df = pd.DataFrame({
  "product": ["P1", "P2", "P2", "P1", "P3", "P3"],
  "price": [100, 130, 130, 100, 200, 200],
  "sales": [40, 100, 100, 40, 90, 90]
})
print(df)

# delete duplicate rows
df = df.drop_duplicates()

print(df)

Output

╒════╤═══════════╤═════════╤═════════╕
│    │ product   │   price │   sales │
╞════╪═══════════╪═════════╪═════════╡
│  0 │ P1        │     100 │      40 │
├────┼───────────┼─────────┼─────────┤
│  1 │ P2        │     130 │     100 │
├────┼───────────┼─────────┼─────────┤
│  2 │ P2        │     130 │     100 │
├────┼───────────┼─────────┼─────────┤
│  3 │ P1        │     100 │      40 │
├────┼───────────┼─────────┼─────────┤
│  4 │ P3        │     200 │      90 │
├────┼───────────┼─────────┼─────────┤
│  5 │ P3        │     200 │      90 │
╘════╧═══════════╧═════════╧═════════╛

╒════╤═══════════╤═════════╤═════════╕
│    │ product   │   price │   sales │
╞════╪═══════════╪═════════╪═════════╡
│  0 │ P1        │     100 │      40 │
├────┼───────────┼─────────┼─────────┤
│  1 │ P2        │     130 │     100 │
├────┼───────────┼─────────┼─────────┤
│  4 │ P3        │     200 │      90 │
╘════╧═══════════╧═════════╧═════════╛
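Note that the remaining rows keep their original index labels (0, 1 and 4 above). If consecutive labels are preferred, a possible approach is the ignore_index parameter, which is available in pandas 1.0 and later. The variable names kept and renumbered are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
  "product": ["P1", "P2", "P2", "P1", "P3", "P3"],
  "price": [100, 130, 130, 100, 200, 200],
  "sales": [40, 100, 100, 40, 90, 90]
})

# by default the original index labels are kept
kept = df.drop_duplicates()
print(kept.index.tolist())        # [0, 1, 4]

# ignore_index=True relabels the result 0, 1, ..., n-1
renumbered = df.drop_duplicates(ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

On older pandas versions, df.drop_duplicates().reset_index(drop=True) should give the same result.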

# Delete duplicate rows based on one column

We can pass the subset parameter to the drop_duplicates() function to compare only specific columns. Rows that repeat a value in those columns are removed, and the first occurrence is kept.

import pandas as pd

# create a dataframe
df = pd.DataFrame({
  "A": ["Roy", "Roy", "Luke", "Luke", "Luke"],
  "B": [1, 2, 3, 4, 5],
  "C": [10, 20, 30, 40, 50]
})
print(df)

# delete rows that repeat a value in column A
df = df.drop_duplicates(subset=["A"])

print(df)

Output

╒════╤══════╤═════╤═════╕
│    │ A    │   B │   C │
╞════╪══════╪═════╪═════╡
│  0 │ Roy  │   1 │  10 │
├────┼──────┼─────┼─────┤
│  1 │ Roy  │   2 │  20 │
├────┼──────┼─────┼─────┤
│  2 │ Luke │   3 │  30 │
├────┼──────┼─────┼─────┤
│  3 │ Luke │   4 │  40 │
├────┼──────┼─────┼─────┤
│  4 │ Luke │   5 │  50 │
╘════╧══════╧═════╧═════╛

╒════╤══════╤═════╤═════╕
│    │ A    │   B │   C │
╞════╪══════╪═════╪═════╡
│  0 │ Roy  │   1 │  10 │
├────┼──────┼─────┼─────┤
│  2 │ Luke │   3 │  30 │
╘════╧══════╧═════╧═════╛
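The output above keeps the first row for each value in column A. As a hedged sketch of two variations, the snippet below uses the keep parameter of drop_duplicates() to choose which occurrence survives, and passes more than one column to subset. The variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
  "A": ["Roy", "Roy", "Luke", "Luke", "Luke"],
  "B": [1, 2, 3, 4, 5],
  "C": [10, 20, 30, 40, 50]
})

# keep="last" retains the last row for each value in A
last_rows = df.drop_duplicates(subset=["A"], keep="last")
print(last_rows.index.tolist())  # [1, 4]

# keep=False drops every row whose A value appears more than once
no_repeats = df.drop_duplicates(subset=["A"], keep=False)
print(len(no_repeats))           # 0

# with two columns, a row is a duplicate only if the (A, B) pair repeats
by_pair = df.drop_duplicates(subset=["A", "B"])
print(len(by_pair))              # 5
```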