🧯 Pandas Removing Duplicates – Clean Up Repeated Data for Accurate Results
🧲 Introduction – Why Remove Duplicates?
Duplicate rows can skew results, inflate counts, and corrupt analyses. Whether from data entry errors, file merges, or API pulls, duplicates need to be detected and removed. Pandas offers fast and flexible methods to identify and drop duplicates at the row or column level.
🎯 In this guide, you’ll learn:
- Detect duplicate rows and entries
- Remove full or partial duplicates
- Keep specific occurrences (first or last)
- Apply duplicate logic to selected columns
🔍 1. Detect Duplicate Rows
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
'Age': [25, 30, 25, 35],
'Score': [90, 85, 90, 88]
})
print(df.duplicated())
👉 Output:
0 False
1 False
2 True
3 False
dtype: bool
✔️ .duplicated() returns a Boolean Series—True for rows that are exact duplicates of previous rows.
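By default, `.duplicated()` marks only the later copies of a row. If you want to flag every member of a duplicate group, including the first occurrence, you can pass `keep=False` — a small sketch using the same DataFrame as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 25, 35],
    'Score': [90, 85, 90, 88]
})

# keep=False marks *all* members of a duplicate group as True,
# not just the second and later occurrences
mask = df.duplicated(keep=False)
print(mask.tolist())  # [True, False, True, False]
```

This is handy when you want to inspect both copies of a duplicated record side by side before deciding which one to keep.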
🧹 2. Remove Duplicate Rows
df_no_dup = df.drop_duplicates()
✔️ Drops rows that are fully duplicated across all columns.
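Continuing with the sample DataFrame above, note that the dropped row leaves a gap in the index — row 2 disappears but rows keep their original labels:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 25, 35],
    'Score': [90, 85, 90, 88]
})

# Drop fully duplicated rows; the surviving rows keep their original index labels
df_no_dup = df.drop_duplicates()
print(len(df_no_dup))              # 3
print(list(df_no_dup.index))       # [0, 1, 3]
```

Chain `.reset_index(drop=True)` afterwards if you want a clean 0..n-1 index.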
🎯 3. Remove Duplicates by Specific Columns
df_no_dup = df.drop_duplicates(subset=['Name'])
✔️ Drops duplicate rows based on the Name column only. Keeps the first occurrence by default.
👉 Output:
Name Age Score
0 Alice 25 90
1 Bob 30 85
3 Charlie 35 88
🧾 4. Keep the Last Occurrence Instead
df_last = df.drop_duplicates(subset=['Name'], keep='last')
✔️ Keeps the last entry when duplicates are found.
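This matters when later rows carry fresher data. A small sketch (the second Alice record here is a hypothetical "newer" entry, not from the sample above) showing which rows survive:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 26, 35],    # second Alice row: an assumed newer record
    'Score': [90, 85, 91, 88]
})

# keep='last' retains the final entry per Name; row order is preserved
df_last = df.drop_duplicates(subset=['Name'], keep='last')
print(list(df_last.index))           # [1, 2, 3]
print(df_last.loc[2, 'Age'])         # 26 — the newer Alice record survives
```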
🔄 5. Mark Duplicates Without Dropping
df['is_duplicate'] = df.duplicated()
✔️ Adds a column flag so you can analyze duplicates before removing.
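Once the flag column exists, you can filter on it to review the duplicates before dropping anything — for example:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 25, 35],
    'Score': [90, 85, 90, 88]
})

df['is_duplicate'] = df.duplicated(subset=['Name', 'Age', 'Score'])

# Inspect the flagged rows before deciding what to remove
dupes = df[df['is_duplicate']]
print(dupes['Name'].tolist())  # ['Alice']
```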
🧱 6. Remove Duplicates In-Place
df.drop_duplicates(inplace=True)
✔️ Modifies the original DataFrame without needing reassignment.
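One common gotcha worth noting: with `inplace=True` the method returns `None`, so don't reassign the result back to your variable:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2]})

# With inplace=True the DataFrame is modified and the call returns None
result = df.drop_duplicates(inplace=True)
print(result)    # None
print(len(df))   # 2 — the duplicate row was removed in place
```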
⚙️ 7. Customize Duplicate Logic (e.g., Case-Insensitive)
df['Name'] = df['Name'].str.lower()
df.drop_duplicates(subset=['Name'], inplace=True)
✔️ Normalizes column values (like Alice vs alice) before checking for duplicates.
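The approach above permanently lowercases the `Name` column. If you would rather keep the original casing, one common alternative (a sketch, not from the text above) is to build the duplicate mask from a lowercased copy and filter with it:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'alice', 'Bob']})

# Build a case-insensitive duplicate mask without altering the stored values
mask = df['Name'].str.lower().duplicated()
df_ci = df[~mask]
print(df_ci['Name'].tolist())  # ['Alice', 'Bob'] — original casing preserved
```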
📌 Summary – Key Takeaways
Removing duplicates is essential for accurate aggregations, reporting, and machine learning. Pandas gives you the tools to drop, flag, or filter duplicates easily.
🔍 Key Takeaways:
- Use .duplicated() to find repeated rows
- .drop_duplicates() removes them—entire rows or by column
- keep='first' (default) vs keep='last' controls which record is retained
- Use inplace=True to apply changes directly
⚙️ Real-world relevance: Critical for sales reports, customer lists, transaction logs, and any data that comes from multiple sources.
❓ FAQs – Removing Duplicates in Pandas
❓ How do I check for duplicates in specific columns only?
df.duplicated(subset=['Email', 'Phone'])
❓ How do I keep the last occurrence of a duplicate?
df.drop_duplicates(keep='last')
❓ Can I remove duplicates but keep a flag for reference?
✅ Yes:
df['is_dup'] = df.duplicated()
❓ What if I want to remove duplicates case-insensitively?
Lowercase the values first:
df['Name'] = df['Name'].str.lower()
df.drop_duplicates(subset=['Name'])
❓ Does drop_duplicates() affect the original DataFrame?
❌ No — by default it returns a new DataFrame. The original is modified only if inplace=True is specified.