♻️ Handling Duplicated Data in Pandas – Detect, Manage, and Clean Repetitive Records
🧲 Introduction – Why Handle Duplicated Data?
Duplicated data can sneak into datasets through manual entry, data merges, system glitches, or data scraping. If left unchecked, it may lead to incorrect analytics, inflated counts, and data quality issues. Pandas provides powerful tools to detect, flag, remove, or isolate duplicates at both row and column levels.
🎯 In this guide, you’ll learn:
- How to identify duplicated rows and values
- How to drop duplicates selectively
- How to keep specific entries (first or last)
- How to flag and isolate duplicates for review
📥 1. Create a Sample DataFrame with Duplicates
import pandas as pd
df = pd.DataFrame({
    'ID': [101, 102, 103, 101, 104, 105, 101],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve', 'Alice'],
    'Score': [85, 90, 88, 85, 92, 87, 85]
})
print(df)
👉 Output:
    ID     Name  Score
0  101    Alice     85
1  102      Bob     90
2  103  Charlie     88
3  101    Alice     85
4  104    David     92
5  105      Eve     87
6  101    Alice     85
🔍 2. Detect Duplicate Rows
df.duplicated()
✔️ Returns a Boolean Series – True for rows that are exact duplicates of a previous one.
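👉 On the sample DataFrame above, this produces:
0    False
1    False
2    False
3     True
4    False
5    False
6     True
dtype: bool
Rows 3 and 6 are flagged because they repeat row 0 exactly.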
🧼 3. Remove Duplicate Rows
df_no_dup = df.drop_duplicates()
✔️ Drops exact duplicate rows across all columns (keeps first by default).
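👉 Output – rows 3 and 6 are gone:
    ID     Name  Score
0  101    Alice     85
1  102      Bob     90
2  103  Charlie     88
4  104    David     92
5  105      Eve     87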
🎯 4. Drop Duplicates Based on Specific Columns
df.drop_duplicates(subset=['ID', 'Name'])
✔️ Removes duplicates based only on selected columns (not the full row).
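On this sample the result is identical to step 3, because the duplicated rows agree in every column. The subset parameter matters when the remaining columns differ – a small hypothetical variant illustrates this:
# Hypothetical variant: same ID/Name pair, different Score values
df_var = pd.DataFrame({
    'ID': [101, 101],
    'Name': ['Alice', 'Alice'],
    'Score': [85, 95]
})
df_var.drop_duplicates()                       # keeps both rows (Scores differ)
df_var.drop_duplicates(subset=['ID', 'Name'])  # keeps only the first row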
🔄 5. Keep the Last Occurrence
df.drop_duplicates(subset=['ID'], keep='last')
✔️ Keeps the last occurrence of each duplicated ID.
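👉 Output – index 6 survives instead of index 0:
    ID     Name  Score
1  102      Bob     90
2  103  Charlie     88
4  104    David     92
5  105      Eve     87
6  101    Alice     85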
🏷️ 6. Flag Duplicates with Boolean Column
df['is_duplicate'] = df.duplicated(subset=['ID', 'Name'])
✔️ Adds a new column that flags duplicated rows for review instead of deleting them.
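👉 Output:
    ID     Name  Score  is_duplicate
0  101    Alice     85         False
1  102      Bob     90         False
2  103  Charlie     88         False
3  101    Alice     85          True
4  104    David     92         False
5  105      Eve     87         False
6  101    Alice     85          True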
📤 7. Extract Only the Duplicated Rows
dup_only = df[df.duplicated(subset=['ID', 'Name'], keep=False)]
✔️ With keep=False, every occurrence is marked as a duplicate, so this retrieves the original and the repeated rows alike.
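👉 Output (shown against the original df from step 1, before the is_duplicate column was added):
    ID   Name  Score
0  101  Alice     85
3  101  Alice     85
6  101  Alice     85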
🧾 8. Remove Duplicates In-Place
df.drop_duplicates(inplace=True)
✔️ Removes duplicates directly from the original DataFrame.
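⚠️ Note that with inplace=True the method returns None, so don't assign its result:
result = df.drop_duplicates(inplace=True)
print(result)  # None – the cleaned data now lives in df itself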
⚙️ 9. Detect Duplicates in a Single Column
df['ID'].duplicated()
✔️ Identifies repeated values in a single column like ID.
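👉 Run against the original df from step 1, this returns the same Boolean Series as step 2, since every repeated ID here belongs to a fully duplicated row:
0    False
1    False
2    False
3     True
4    False
5    False
6     True
Name: ID, dtype: bool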
📌 Summary – Key Takeaways
Handling duplicated data is critical for ensuring integrity in reports, summaries, and ML training sets. Pandas gives you precise control over how duplicates are detected, flagged, or removed.
🔍 Key Takeaways:
- Use .duplicated() to detect full or column-specific duplicates
- Use .drop_duplicates() to remove them (with control over which one to keep)
- Use keep=False to capture all duplicated versions
- Combine .duplicated() with logic to flag or isolate records
⚙️ Real-world relevance: Used in CRM cleanup, transaction logs, deduplication before joins, and data warehouse management.
❓ FAQs – Handling Duplicated Data in Pandas
❓ How do I keep only the last occurrence of each duplicate?
df.drop_duplicates(keep='last')
❓ Can I detect duplicates on just one column?
✅ Yes:
df['ID'].duplicated()
❓ How do I remove duplicates but keep the original DataFrame untouched?
drop_duplicates() returns a new DataFrame by default, so assign the result to a new variable:
df_cleaned = df.drop_duplicates()
❓ How do I remove duplicates across only selected columns?
Use the subset parameter:
df.drop_duplicates(subset=['Name'])
❓ Can I highlight duplicates instead of removing them?
✅ Yes. Create a flag:
df['duplicate_flag'] = df.duplicated(subset=['Name'])