How to Remove Duplicated Data in Pandas: A Step-by-Step Guide

We've all been there: staring at a spreadsheet and realizing that some of the data looks eerily familiar. Upon closer inspection, you confirm your suspicion: there are duplicated rows, taking up precious storage space and potentially skewing your analysis.

Messy, duplicated data is a common issue, but thankfully one that's easy to address using the pandas library in Python. Pandas provides powerful methods like .duplicated() and .drop_duplicates() that allow you to identify and remove duplicate rows and columns in just a few lines of code.

In this article, we'll walk through exactly how to use these methods to clean up your data. We'll cover how to:

  • Count the number of duplicated rows using .duplicated()
  • Customize .duplicated() behavior using keep and subset arguments
  • Actually remove duplicated rows using .drop_duplicates()
  • Handle duplicated columns by transposing with .T
  • Avoid common pitfalls and follow best practices

Whether you're a pandas novice or simply need a refresher, this guide will equip you with the tools you need to deduplicate your data with confidence. Let's dive in!

Counting Duplicated Rows with .duplicated()

Before we start deleting data, it's important to verify what pandas considers a "duplicate" and how many duplicates are present. That's where the .duplicated() method comes in. When called on a DataFrame, .duplicated() returns a boolean Series indicating which rows are duplicates.

Let's look at an example. Consider the following DataFrame of kitchen utensils:

import pandas as pd

data = {
    'utensil': ['fork', 'knife', 'spoon', 'fork', 'spoon'],
    'material': ['steel', 'steel', 'steel', 'silver', 'silver']
}

df = pd.DataFrame(data)

print(df)
  utensil material
0    fork    steel
1   knife    steel
2   spoon    steel
3    fork   silver
4   spoon   silver

To identify duplicates, simply call .duplicated() on the DataFrame:

df.duplicated()
0    False
1    False
2    False
3    False
4    False
dtype: bool

Here, .duplicated() scans through each row, comparing it against all previous rows. If an exact match is found (i.e. all values are the same), it marks that row as True. The first instance of a duplicated row is marked False, as it's considered the "original".

In our example, every row is marked False. Rows 3 and 4 repeat the 'fork' and 'spoon' values from rows 0 and 2, but their material is 'silver' rather than 'steel', so they are not exact matches. Flagging them requires comparing on the utensil column alone, using the subset argument covered in the next section.
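
Because True counts as 1, summing the boolean Series gives the number of flagged rows. Here is a minimal sketch using the same utensil data; the variable names are only illustrative. With this data no row matches another in every column, so the full-row count is 0, while comparing on the utensil column alone flags 2 rows.

```python
import pandas as pd

df = pd.DataFrame({
    "utensil": ["fork", "knife", "spoon", "fork", "spoon"],
    "material": ["steel", "steel", "steel", "silver", "silver"],
})

# True counts as 1, so summing the boolean Series counts the flagged rows.
full_row_dupes = int(df.duplicated().sum())
utensil_dupes = int(df.duplicated(subset="utensil").sum())

print(full_row_dupes)  # 0: no row repeats another in every column
print(utensil_dupes)   # 2: the second fork and the second spoon
```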

Customizing .duplicated() with keep and subset

By default, .duplicated() treats the first occurrence of a row as the original and marks every later occurrence as True. You can change this behavior using the keep argument:

  • keep='first' (default): Mark every occurrence except the first as duplicated
  • keep='last': Mark every occurrence except the last as duplicated
  • keep=False: Mark all occurrences as duplicated

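For example, comparing on the 'utensil' column alone (via the subset argument, described below) so that the repeated forks and spoons count as duplicates:

df.duplicated(subset='utensil', keep='last')
0     True
1    False
2     True
3    False
4    False
dtype: bool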
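
It can help to see the three keep options side by side. The sketch below builds a small comparison table from the same utensil data. It compares on the 'utensil' column so that the sample actually contains repeats, and the column names keep_first, keep_last and keep_false are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "utensil": ["fork", "knife", "spoon", "fork", "spoon"],
    "material": ["steel", "steel", "steel", "silver", "silver"],
})

# One column per keep option, all judged on 'utensil' alone.
comparison = pd.DataFrame({
    "utensil": df["utensil"],
    "keep_first": df.duplicated(subset="utensil", keep="first"),
    "keep_last": df.duplicated(subset="utensil", keep="last"),
    "keep_false": df.duplicated(subset="utensil", keep=False),
})

print(comparison)
```

The knife row is never flagged because it appears only once, while keep=False flags both forks and both spoons.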

We can also choose to only consider certain columns when identifying duplicates by passing a column name or list of names to the subset argument:

df.duplicated(subset='utensil')
0    False
1    False
2    False
3     True
4     True
dtype: bool

This is helpful for focusing on the columns that matter for identifying "true" duplicates.
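
A boolean Series like this can also be used as a mask to pull out the flagged rows for inspection. The sketch below is one way to do it, assuming the same utensil data. keep=False is used so that every member of a duplicate group is shown, and the sort is only there to place matching rows next to each other.

```python
import pandas as pd

df = pd.DataFrame({
    "utensil": ["fork", "knife", "spoon", "fork", "spoon"],
    "material": ["steel", "steel", "steel", "silver", "silver"],
})

# Flag every row whose utensil appears more than once.
mask = df.duplicated(subset="utensil", keep=False)

# Select the flagged rows; a stable sort keeps the original order within each group.
groups = df[mask].sort_values("utensil", kind="mergesort")

print(groups)
```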

Dropping Duplicated Rows with .drop_duplicates()

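Now that we've seen how duplicates are flagged, we can remove them using the intuitively-named .drop_duplicates() method.

Calling .drop_duplicates() on a DataFrame returns a new DataFrame with the duplicated rows removed, keeping the first instance by default. Our example has no rows that match in every column, so a bare df.drop_duplicates() would return all five rows unchanged. Restricting the comparison to 'utensil' shows the effect:

df.drop_duplicates(subset='utensil')
  utensil material
0    fork    steel
1   knife    steel
2   spoon    steel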

As with .duplicated(), we can customize this behavior with keep and subset:

df.drop_duplicates(keep='last', subset=['utensil'])
  utensil material
1   knife    steel
3    fork   silver
4   spoon   silver

This removes duplicates based on the 'utensil' column only and keeps the last instance of each duplicated value.
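
Notice that the surviving rows keep their original index labels (1, 3 and 4). If a fresh 0-based index is preferable, one option is the ignore_index argument. The sketch below assumes pandas 1.0 or later, where this argument is available, and the name deduped is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "utensil": ["fork", "knife", "spoon", "fork", "spoon"],
    "material": ["steel", "steel", "steel", "silver", "silver"],
})

# Keep the last row per utensil and renumber the result from 0.
deduped = df.drop_duplicates(subset=["utensil"], keep="last", ignore_index=True)

print(list(deduped.index))          # [0, 1, 2]
print(deduped["utensil"].tolist())  # ['knife', 'fork', 'spoon']
```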

It's important to note that .drop_duplicates() does not modify the original DataFrame in place. To actually remove the rows from df, you need to either reassign the result or use the inplace=True argument:

df = df.drop_duplicates()
# or 
df.drop_duplicates(inplace=True)
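
To see the difference between the two approaches, here is a small sketch on the same utensil data. It compares on 'utensil' so that some rows are actually dropped; the names cleaned, working and returned are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "utensil": ["fork", "knife", "spoon", "fork", "spoon"],
    "material": ["steel", "steel", "steel", "silver", "silver"],
})

# Without reassignment, df itself is left untouched.
cleaned = df.drop_duplicates(subset="utensil")
print(len(df), len(cleaned))  # 5 3

# With inplace=True the frame is modified and the call returns None.
working = df.copy()
returned = working.drop_duplicates(subset="utensil", inplace=True)
print(returned)      # None
print(len(working))  # 3
```

Because the in-place call returns None, combining both styles (df = df.drop_duplicates(inplace=True)) would replace df with None, so pick one or the other.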

Handling Duplicated Columns

In some cases, you may want to remove entire duplicated columns rather than just rows. .drop_duplicates() only compares rows and has no option for columns, but we can get around this by temporarily transposing our DataFrame with the .T attribute.

Let's add a 'gadget' column that duplicates 'utensil'. Since none of the rows were exact duplicates, df still has all five rows at this point:

df['gadget'] = df['utensil']

print(df)
  utensil material gadget
0    fork    steel   fork
1   knife    steel  knife
2   spoon    steel  spoon
3    fork   silver   fork
4   spoon   silver  spoon

To remove this duplicated column, we can chain .T, .drop_duplicates() and .T again:

df = df.T.drop_duplicates().T

print(df)
  utensil material
0    fork    steel
1   knife    steel
2   spoon    steel
3    fork   silver
4   spoon   silver

Transposing switches our rows and columns, allowing .drop_duplicates() to identify the duplicated column values. We then transpose back to restore the original orientation.
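
One caveat worth knowing: when a DataFrame mixes text and numeric columns, the round trip through .T leaves every column with the generic object dtype. A possible alternative is to use the transposed frame only to find the duplicated columns and then select from the original. The sketch below assumes a hypothetical numeric 'count' column, added purely to show the dtype difference.

```python
import pandas as pd

df = pd.DataFrame({
    "utensil": ["fork", "knife", "spoon"],
    "count": [4, 2, 6],  # hypothetical numeric column, for illustration only
})
df["gadget"] = df["utensil"]

# Double transpose: drops 'gadget', but 'count' comes back as object dtype.
via_transpose = df.T.drop_duplicates().T

# Column mask: flag duplicated columns on the transposed frame,
# then select the remaining columns from the original frame.
via_mask = df.loc[:, ~df.T.duplicated()]

print(via_transpose.dtypes)
print(via_mask.dtypes)
```

Both versions still transpose the data once, so on very large frames either approach may be slow. Also note that columns are matched by their values, not by their names.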

Best Practices and Pitfalls

While extremely useful, .drop_duplicates() is not without its risks. Here are some best practices to keep in mind:

  1. Always check for duplicates with .duplicated() before removing to verify which rows will be affected.
  2. Be careful when using subset that you are only considering the columns that define a "true" duplicate for your use case.
  3. Avoid unintended data loss by assigning the result of .drop_duplicates() to a new variable, which leaves the original DataFrame intact. Use inplace=True only when you are sure you no longer need the dropped rows.
  4. Think carefully about when to drop duplicates in your workflow. If aggregating or analyzing data, you often want to deduplicate first to avoid double-counting.
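
The first three tips can be folded into one small routine. The helper below is hypothetical (it is not part of pandas) and is only a sketch of one way to do it: it returns the cleaned data together with the rows that were dropped, and never touches the input.

```python
import pandas as pd


def drop_duplicates_with_report(frame, subset=None, keep="first"):
    """Hypothetical helper: return (cleaned, dropped) without modifying frame."""
    # Tip 1: look at what will be removed before removing it.
    dropped = frame[frame.duplicated(subset=subset, keep=keep)]
    # Tip 3: the result is a new DataFrame; the input stays intact.
    cleaned = frame.drop_duplicates(subset=subset, keep=keep)
    return cleaned, dropped


df = pd.DataFrame({
    "utensil": ["fork", "knife", "spoon", "fork", "spoon"],
    "material": ["steel", "steel", "steel", "silver", "silver"],
})

# Tip 2: choose subset deliberately; here a duplicate means "same utensil".
cleaned, dropped = drop_duplicates_with_report(df, subset=["utensil"])

print(f"{len(dropped)} of {len(df)} rows dropped")  # 2 of 5 rows dropped
```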

With these tips in mind, you can use .duplicated() and .drop_duplicates() with confidence!

Putting It All Together

Removing duplicate data is a critical part of the data cleaning process. Pandas makes this easy with the .duplicated() and .drop_duplicates() methods.

To recap, .duplicated() helps us identify which rows are duplicates, while .drop_duplicates() removes those duplicated rows. We can customize which duplicates to flag or remove using the keep and subset arguments. Duplicate columns can also be handled by first transposing the DataFrame with .T.

With the knowledge from this guide, you're well-equipped to tackle duplicated data in your own pandas DataFrames. Doing so can save storage space, speed up computations and improve the overall quality of your data.

Data cleaning is just the beginning when it comes to pandas. To continue your learning journey, check out the official pandas documentation, which is full of helpful guides and examples. You can also find a wealth of tutorials and articles on sites like Real Python, Towards Data Science and more.

Happy deduping!
