How to Remove Duplicated Data in Pandas: A Step-by-Step Guide
We've all been there: staring at a spreadsheet and realizing that some of the data looks eerily familiar. Upon closer inspection, you confirm your suspicion: there are duplicated rows, taking up precious storage space and potentially skewing your analysis.
Messy, duplicated data is a common issue, but thankfully one that's easy to address using the pandas library in Python. Pandas provides powerful methods like .duplicated() and .drop_duplicates() that allow you to identify and remove duplicate rows and columns in just a few lines of code.
In this article, we'll walk through exactly how to use these methods to clean up your data. We'll cover how to:
- Count the number of duplicated rows using .duplicated()
- Customize .duplicated() behavior using the keep and subset arguments
- Actually remove duplicated rows using .drop_duplicates()
- Handle duplicated columns by transposing with .T
- Avoid common pitfalls and follow best practices
Whether you're a pandas novice or simply need a refresher, this guide will equip you with the tools you need to deduplicate your data with confidence. Let's dive in!
Counting Duplicated Rows with .duplicated()
Before we start deleting data, it's important to verify what pandas considers a "duplicate" and how many duplicates are present. That's where the .duplicated() method comes in. When called on a DataFrame, .duplicated() returns a boolean Series indicating which rows are duplicates.
Let's look at an example. Consider the following DataFrame of kitchen utensils:
import pandas as pd
data = {
    'utensil': ['fork', 'knife', 'spoon', 'fork', 'spoon'],
    'material': ['steel', 'steel', 'steel', 'silver', 'silver']
}
df = pd.DataFrame(data)
print(df)
utensil material
0 fork steel
1 knife steel
2 spoon steel
3 fork silver
4 spoon silver
To identify duplicates, simply call .duplicated() on the DataFrame:
df.duplicated()
0 False
1 False
2 False
3 False
4 False
dtype: bool
Here, .duplicated() compares each row against all previous rows. A row is marked True only if an exact match is found (i.e. the values in every column are the same). The first instance of a repeated row is marked False, as it's considered the "original".
In our example, every row is marked False. The values 'fork' and 'spoon' each appear twice in the utensil column, but the material differs each time (steel versus silver), so no complete row is repeated.
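Because True counts as 1, summing the boolean Series gives the number of duplicated rows. The following is a minimal sketch on a small illustrative frame (demo is not the article's df; its last row repeats the first exactly, and the variable names are assumptions):

```python
import pandas as pd

# Illustrative data: row 3 is an exact copy of row 0.
demo = pd.DataFrame({
    'utensil': ['fork', 'knife', 'spoon', 'fork'],
    'material': ['steel', 'steel', 'steel', 'steel'],
})

mask = demo.duplicated()        # boolean Series, True for repeated rows
n_duplicates = int(mask.sum())  # True counts as 1, so the sum is the count
repeated_rows = demo[mask]      # boolean indexing shows the flagged rows

print(n_duplicates)
print(repeated_rows)
```

Inspecting repeated_rows before deleting anything is a cheap way to confirm that pandas agrees with you about what counts as a duplicate.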
Customizing .duplicated() with keep and subset
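By default, .duplicated() treats the first occurrence of a row as the original and marks every later occurrence as True. You can change this behavior using the keep argument:
- keep='first' (default): mark every occurrence except the first as a duplicate
- keep='last': mark every occurrence except the last as a duplicate
- keep=False: mark all occurrences as duplicates
For example, comparing on the utensil column only (via the subset argument, covered next):
df.duplicated(subset='utensil', keep='last')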
0 True
1 False
2 True
3 False
4 False
dtype: bool
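To see the three keep settings side by side, one option is to collect their results as columns of a single frame. This is only a sketch: demo is an illustrative one-column frame (so that whole rows repeat), and the column names keep_first, keep_last and keep_false are made up for the comparison.

```python
import pandas as pd

# Illustrative single-column frame: 'fork' and 'spoon' each appear twice.
demo = pd.DataFrame({'utensil': ['fork', 'knife', 'spoon', 'fork', 'spoon']})

comparison = pd.DataFrame({
    'utensil': demo['utensil'],
    'keep_first': demo.duplicated(keep='first'),  # later occurrences flagged
    'keep_last': demo.duplicated(keep='last'),    # earlier occurrences flagged
    'keep_false': demo.duplicated(keep=False),    # every occurrence flagged
})
print(comparison)
```

Only 'knife', which appears once, is False in all three columns.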
We can also choose to only consider certain columns when identifying duplicates by passing a column name or list of names to the subset argument:
df.duplicated(subset='utensil')
0 False
1 False
2 False
3 True
4 True
dtype: bool
This is helpful for focusing on the columns that matter for identifying "true" duplicates.
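Combining subset with keep=False is a handy way to pull out every member of each repeated group for inspection. A minimal sketch using the utensil data from above (the name groups and the choice to sort are assumptions made for readability):

```python
import pandas as pd

df = pd.DataFrame({
    'utensil': ['fork', 'knife', 'spoon', 'fork', 'spoon'],
    'material': ['steel', 'steel', 'steel', 'silver', 'silver'],
})

# keep=False flags every row whose utensil value occurs more than once.
in_repeated_group = df.duplicated(subset='utensil', keep=False)

# A stable sort places the members of each group next to each other
# while preserving their original order within the group.
groups = df[in_repeated_group].sort_values('utensil', kind='mergesort')
print(groups)
```

Seeing the rows together makes it easier to decide whether they are true duplicates or, as here, distinct items that merely share a name.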
Dropping Duplicated Rows with .drop_duplicates()
Now that we've identified which rows are duplicates, we can remove them using the intuitively named .drop_duplicates() method.
Calling .drop_duplicates() on a DataFrame returns a new DataFrame with the duplicated rows removed, keeping the first instance by default. It accepts the same subset argument as .duplicated(); here we compare on the utensil column, since that is where our data repeats:
df.drop_duplicates(subset='utensil')
utensil material
0 fork steel
1 knife steel
2 spoon steel
As with .duplicated(), we can customize this behavior with keep and subset:
df.drop_duplicates(keep='last', subset=['utensil'])
utensil material
1 knife steel
3 fork silver
4 spoon silver
This removes duplicates based on the 'utensil' column only and keeps the last instance of each duplicated value.
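Notice that the surviving rows keep their original index labels (1, 3 and 4). If a fresh 0-based index is preferred, .drop_duplicates() accepts ignore_index=True; the sketch below assumes pandas 1.0 or later, where that argument is available, and uses deduped as an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'utensil': ['fork', 'knife', 'spoon', 'fork', 'spoon'],
    'material': ['steel', 'steel', 'steel', 'silver', 'silver'],
})

# Same call as above, but the result is relabeled 0, 1, 2.
deduped = df.drop_duplicates(subset=['utensil'], keep='last', ignore_index=True)
print(deduped)
```

On older pandas versions, chaining .reset_index(drop=True) after .drop_duplicates() should give the same result.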
It's important to note that .drop_duplicates() does not modify the original DataFrame in place. To actually remove the rows from df, you need to either reassign the result or pass the inplace=True argument:
df = df.drop_duplicates(subset='utensil')
# or
df.drop_duplicates(subset='utensil', inplace=True)
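A quick way to convince yourself of this is to compare row counts with and without reassignment. A small sketch on the utensil data (rows_after_call and deduped are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'utensil': ['fork', 'knife', 'spoon', 'fork', 'spoon'],
    'material': ['steel', 'steel', 'steel', 'silver', 'silver'],
})

# The result is discarded here, so df itself is unchanged.
df.drop_duplicates(subset='utensil')
rows_after_call = len(df)

# Capturing the return value is what actually keeps the deduplicated data.
deduped = df.drop_duplicates(subset='utensil')

print(rows_after_call, len(deduped))
```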
Handling Duplicated Columns
In some cases, you may want to remove entire duplicated columns rather than just rows. .drop_duplicates() only compares rows, but we can easily get around this by temporarily transposing our DataFrame with the .T attribute.
Let's add a duplicate 'utensil' column to our example:
df['gadget'] = df['utensil']
print(df)
utensil material gadget
0 fork steel fork
1 knife steel knife
2 spoon steel spoon
To remove this duplicated column, we can chain .T, .drop_duplicates() and .T again:
df = df.T.drop_duplicates().T
print(df)
utensil material
0 fork steel
1 knife steel
2 spoon steel
Transposing switches our rows and columns, allowing .drop_duplicates() to identify the duplicated column values. We then transpose back to restore the original orientation.
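One caveat with the round trip: when columns have different types, transposing generally forces everything to the generic object dtype, and transposing back does not restore the original types. A possible alternative, sketched below, uses the transpose only to build a column mask and then selects from the original frame. The demo frame and its numeric count column are illustrative assumptions, added to make the dtype difference visible.

```python
import pandas as pd

# Illustrative frame with mixed dtypes; 'gadget' is a copy of 'utensil'.
demo = pd.DataFrame({
    'utensil': ['fork', 'knife', 'spoon'],
    'count': [4, 2, 6],
})
demo['gadget'] = demo['utensil']

# Round trip through .T: drops 'gadget', but 'count' usually comes back as object.
round_trip = demo.T.drop_duplicates().T

# Alternative: flag repeated columns via the transpose, select from the original.
column_mask = ~demo.T.duplicated()   # indexed by column label
selected = demo.loc[:, column_mask]  # original dtypes are preserved

print(round_trip.dtypes)
print(selected.dtypes)
```

Both approaches compare column values, not column names. Either way, transposing a very wide or very long frame can be slow, so it is worth timing on real data.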
Best Practices and Pitfalls
While extremely useful, .drop_duplicates() is not without its risks. Here are some best practices to keep in mind:
- Always check for duplicates with .duplicated() before removing them, to verify which rows will be affected.
- Be careful when using subset that you are only considering the columns that define a "true" duplicate for your use case.
- Avoid unintended results by remembering that .drop_duplicates() returns a new DataFrame: reassign the result (a new variable keeps the original available) or pass inplace=True to modify the original.
- Think carefully about when to drop duplicates in your workflow. If aggregating or analyzing data, you often want to deduplicate first to avoid double-counting.
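The first and third tips can be folded into a small wrapper that reports what it is about to remove and never touches its input. This is a hypothetical helper written for illustration (drop_duplicates_checked is not part of pandas):

```python
import pandas as pd

def drop_duplicates_checked(frame, subset=None, keep='first'):
    """Hypothetical helper: report flagged rows, then return a deduplicated copy."""
    mask = frame.duplicated(subset=subset, keep=keep)
    print(f"{int(mask.sum())} of {len(frame)} rows flagged as duplicates")
    # Selecting the unflagged rows mirrors what .drop_duplicates() would keep.
    return frame[~mask].copy()

df = pd.DataFrame({
    'utensil': ['fork', 'knife', 'spoon', 'fork', 'spoon'],
    'material': ['steel', 'steel', 'steel', 'silver', 'silver'],
})

clean = drop_duplicates_checked(df, subset='utensil')
print(clean)
```

Because the helper returns a copy, df still holds all five rows afterwards, which makes it easy to compare before and after.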
With these tips in mind, you can use .duplicated() and .drop_duplicates() with confidence!
Putting It All Together
Removing duplicate data is a critical part of the data cleaning process. Pandas makes this easy with the .duplicated() and .drop_duplicates() methods.
To recap, .duplicated() helps us identify which rows are duplicates, while .drop_duplicates() removes those duplicated rows. We can customize which duplicates to flag or remove using the keep and subset arguments. Duplicate columns can also be handled by first transposing the DataFrame with .T.
With the knowledge from this guide, you're well-equipped to tackle duplicated data in your own pandas DataFrames. Doing so can save storage space, speed up computations and improve the overall quality of your data.
Data cleaning is just the beginning when it comes to pandas. To continue your learning journey, check out the official pandas documentation, which is full of helpful guides and examples. You can also find a wealth of tutorials and articles on sites like Real Python, Towards Data Science and more.
Happy deduping!
