Remove duplicate rows from DataFrame in Pandas
Suppose you are working with a Pandas DataFrame that contains multiple rows and columns, some of the rows are identical, and you want to remove the repeats. You can do that with the code examples explained in this post.
import pandas as pd
# create a dataframe
df = pd.DataFrame({
    "values": [1, 2, 1, 4, 2, 5],
    "alphabets": ["A", "B", "A", "C", "B", "D"],
    "numbers": [10, 20, 10, 40, 20, 50]
})
print(df)
# delete duplicate rows
df = df.drop_duplicates()
print(df)
Output
╒════╤══════════╤═════════════╤═══════════╕
│ │ values │ alphabets │ numbers │
╞════╪══════════╪═════════════╪═══════════╡
│ 0 │ 1 │ A │ 10 │
├────┼──────────┼─────────────┼───────────┤
│ 1 │ 2 │ B │ 20 │
├────┼──────────┼─────────────┼───────────┤
│ 2 │ 1 │ A │ 10 │
├────┼──────────┼─────────────┼───────────┤
│ 3 │ 4 │ C │ 40 │
├────┼──────────┼─────────────┼───────────┤
│ 4 │ 2 │ B │ 20 │
├────┼──────────┼─────────────┼───────────┤
│ 5 │ 5 │ D │ 50 │
╘════╧══════════╧═════════════╧═══════════╛
╒════╤══════════╤═════════════╤═══════════╕
│ │ values │ alphabets │ numbers │
╞════╪══════════╪═════════════╪═══════════╡
│ 0 │ 1 │ A │ 10 │
├────┼──────────┼─────────────┼───────────┤
│ 1 │ 2 │ B │ 20 │
├────┼──────────┼─────────────┼───────────┤
│ 3 │ 4 │ C │ 40 │
├────┼──────────┼─────────────┼───────────┤
│ 5 │ 5 │ D │ 50 │
╘════╧══════════╧═════════════╧═══════════╛
If you do not want to reassign the DataFrame, you can pass inplace=True to the drop_duplicates() function. That is, you can use
df.drop_duplicates(inplace=True)
instead of
df = df.drop_duplicates()
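As a minimal sketch of the difference between the two forms, the snippet below runs both on the DataFrame from above. The names deduped, df_inplace and result are only illustrative and are not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    "values": [1, 2, 1, 4, 2, 5],
    "alphabets": ["A", "B", "A", "C", "B", "D"],
    "numbers": [10, 20, 10, 40, 20, 50],
})

# Reassignment form: a new DataFrame is returned, df itself is unchanged
deduped = df.drop_duplicates()

# In-place form: df_inplace is modified and the call returns None
df_inplace = df.copy()
result = df_inplace.drop_duplicates(inplace=True)

print(result)                      # None
print(deduped.equals(df_inplace))  # True
print(len(df), len(deduped))       # 6 4
```

Because the in-place call returns None, avoid combining the two forms: writing df = df.drop_duplicates(inplace=True) would leave df set to None.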
Remove duplicate rows from DataFrame using the drop_duplicates() function
Pandas has a built-in function, drop_duplicates(), that removes duplicate rows from a DataFrame. By default, it compares all columns when looking for duplicates, but you can restrict the comparison to specific columns with the subset parameter.
By default, the inplace parameter is False, which means drop_duplicates() returns a new DataFrame that you have to reassign or store in another variable. To modify the DataFrame directly instead, use inplace=True.
# Delete duplicate rows based on all columns
import pandas as pd
# create a dataframe
df = pd.DataFrame({
    "product": ["P1", "P2", "P2", "P1", "P3", "P3"],
    "price": [100, 130, 130, 100, 200, 200],
    "sales": [40, 100, 100, 40, 90, 90]
})
print(df)
# delete duplicate rows
df = df.drop_duplicates()
print(df)
Output
╒════╤═══════════╤═════════╤═════════╕
│ │ product │ price │ sales │
╞════╪═══════════╪═════════╪═════════╡
│ 0 │ P1 │ 100 │ 40 │
├────┼───────────┼─────────┼─────────┤
│ 1 │ P2 │ 130 │ 100 │
├────┼───────────┼─────────┼─────────┤
│ 2 │ P2 │ 130 │ 100 │
├────┼───────────┼─────────┼─────────┤
│ 3 │ P1 │ 100 │ 40 │
├────┼───────────┼─────────┼─────────┤
│ 4 │ P3 │ 200 │ 90 │
├────┼───────────┼─────────┼─────────┤
│ 5 │ P3 │ 200 │ 90 │
╘════╧═══════════╧═════════╧═════════╛
╒════╤═══════════╤═════════╤═════════╕
│ │ product │ price │ sales │
╞════╪═══════════╪═════════╪═════════╡
│ 0 │ P1 │ 100 │ 40 │
├────┼───────────┼─────────┼─────────┤
│ 1 │ P2 │ 130 │ 100 │
├────┼───────────┼─────────┼─────────┤
│ 4 │ P3 │ 200 │ 90 │
╘════╧═══════════╧═════════╧═════════╛
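To check which rows would be removed before dropping them, one option is the related duplicated() method, which returns a boolean Series. The sketch below reuses the product data from above; the variable name mask is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["P1", "P2", "P2", "P1", "P3", "P3"],
    "price": [100, 130, 130, 100, 200, 200],
    "sales": [40, 100, 100, 40, 90, 90],
})

# True marks a row that repeats an earlier row in every column
mask = df.duplicated()
print(mask.tolist())  # [False, False, True, True, False, True]

# Keeping only the unflagged rows gives the same result as drop_duplicates()
print(df[~mask].equals(df.drop_duplicates()))  # True
```

Rows 2, 3 and 5 are flagged, which matches the rows missing from the second output table above.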
# Delete duplicate rows based on one column
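We can pass the subset parameter to the drop_duplicates() function to remove rows based on specific columns. Rows are then treated as duplicates when they match in those columns only, and by default the first matching row is kept.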
import pandas as pd
# create a dataframe
df = pd.DataFrame({
    "A": ["Roy", "Roy", "Luke", "Luke", "Luke"],
    "B": [1, 2, 3, 4, 5],
    "C": [10, 20, 30, 40, 50]
})
print(df)
# delete duplicate rows
df = df.drop_duplicates(subset=["A"])
print(df)
Output
╒════╤══════╤═════╤═════╕
│ │ A │ B │ C │
╞════╪══════╪═════╪═════╡
│ 0 │ Roy │ 1 │ 10 │
├────┼──────┼─────┼─────┤
│ 1 │ Roy │ 2 │ 20 │
├────┼──────┼─────┼─────┤
│ 2 │ Luke │ 3 │ 30 │
├────┼──────┼─────┼─────┤
│ 3 │ Luke │ 4 │ 40 │
├────┼──────┼─────┼─────┤
│ 4 │ Luke │ 5 │ 50 │
╘════╧══════╧═════╧═════╛
╒════╤══════╤═════╤═════╕
│ │ A │ B │ C │
╞════╪══════╪═════╪═════╡
│ 0 │ Roy │ 1 │ 10 │
├────┼──────┼─────┼─────┤
│ 2 │ Luke │ 3 │ 30 │
╘════╧══════╧═════╧═════╛
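The output above keeps the first row for each value in column A. The sketch below shows how the keep parameter of drop_duplicates() changes which rows survive, and what happens when subset lists more than one column. The variable names last_rows, no_repeats and by_a_and_b are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["Roy", "Roy", "Luke", "Luke", "Luke"],
    "B": [1, 2, 3, 4, 5],
    "C": [10, 20, 30, 40, 50],
})

# keep="last" retains the final row for each value of A
last_rows = df.drop_duplicates(subset=["A"], keep="last")
print(last_rows.index.tolist())  # [1, 4]

# keep=False drops every row whose A value occurs more than once
no_repeats = df.drop_duplicates(subset=["A"], keep=False)
print(len(no_repeats))  # 0

# With several columns, rows are duplicates only if they match in all of them
by_a_and_b = df.drop_duplicates(subset=["A", "B"])
print(len(by_a_and_b))  # 5
```

Since every combination of A and B in this data is unique, the last call removes nothing.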