4️⃣ 🧹 Pandas Data Cleaning & Preprocessing
Estimated reading: 3 minutes

♻️ Pandas Handling Duplicated Data – Detect, Manage, and Clean Repetitive Records


🧲 Introduction – Why Handle Duplicated Data?

Duplicated data can sneak into datasets through manual entry, data merges, system glitches, or data scraping. If left unchecked, it may lead to incorrect analytics, inflated counts, and data quality issues. Pandas provides powerful tools to detect, flag, remove, or isolate duplicates at both row and column levels.

🎯 In this guide, you’ll learn:

  • How to identify duplicated rows and values
  • How to drop duplicates selectively
  • How to keep specific entries (first or last)
  • How to flag and isolate duplicates for review

📥 1. Create a Sample DataFrame with Duplicates

import pandas as pd

df = pd.DataFrame({
    'ID': [101, 102, 103, 101, 104, 105, 101],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve', 'Alice'],
    'Score': [85, 90, 88, 85, 92, 87, 85]
})

print(df)

👉 Output:

    ID     Name  Score
0  101    Alice     85
1  102      Bob     90
2  103  Charlie     88
3  101    Alice     85
4  104    David     92
5  105      Eve     87
6  101    Alice     85

🔍 2. Detect Duplicate Rows

df.duplicated()

✔️ Returns a Boolean Series – True for rows that are exact duplicates of a previous one.
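For example, running this check on the sample DataFrame from step 1 flags rows 3 and 6, which repeat row 0 exactly:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [101, 102, 103, 101, 104, 105, 101],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve', 'Alice'],
    'Score': [85, 90, 88, 85, 92, 87, 85]
})

# Rows 3 and 6 are exact repeats of row 0, so they come back True
print(df.duplicated().tolist())
# → [False, False, False, True, False, False, True]
```

Note that the first occurrence of a repeated row is not marked `True` – only the later copies are.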


🧼 3. Remove Duplicate Rows

df_no_dup = df.drop_duplicates()

✔️ Drops exact duplicate rows across all columns (keeps first by default).
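On the sample data, this keeps the first copy of each repeated row and drops indices 3 and 6:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [101, 102, 103, 101, 104, 105, 101],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve', 'Alice'],
    'Score': [85, 90, 88, 85, 92, 87, 85]
})

df_no_dup = df.drop_duplicates()

# Only the first (101, 'Alice', 85) row survives
print(df_no_dup['ID'].tolist())   # → [101, 102, 103, 104, 105]
```

The surviving rows keep their original index labels (0, 1, 2, 4, 5); chain `.reset_index(drop=True)` if you want a clean 0-to-n index afterwards.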


🎯 4. Drop Duplicates Based on Specific Columns

df.drop_duplicates(subset=['ID', 'Name'])

✔️ Removes duplicates based only on selected columns (not the full row).


🔄 5. Keep the Last Occurrence

df.drop_duplicates(subset=['ID'], keep='last')

✔️ Keeps the last occurrence of each duplicated ID.
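With the sample data, ID 101 appears at rows 0, 3, and 6, so `keep='last'` retains only row 6:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [101, 102, 103, 101, 104, 105, 101],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve', 'Alice'],
    'Score': [85, 90, 88, 85, 92, 87, 85]
})

last_kept = df.drop_duplicates(subset=['ID'], keep='last')

# For ID 101 (rows 0, 3, 6), only row 6 remains; row order is preserved
print(last_kept.index.tolist())   # → [1, 2, 4, 5, 6]
```

This is handy when later rows represent more recent records, such as the latest update for each customer ID.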


🏷️ 6. Flag Duplicates with Boolean Column

df['is_duplicate'] = df.duplicated(subset=['ID', 'Name'])

✔️ Adds a new column that flags duplicated rows for review instead of deleting them.
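On the sample data, the flag column marks the repeated (101, 'Alice') rows at indices 3 and 6 while leaving everything else in place for manual review:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [101, 102, 103, 101, 104, 105, 101],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve', 'Alice'],
    'Score': [85, 90, 88, 85, 92, 87, 85]
})

# Flag later copies of each (ID, Name) pair without deleting anything
df['is_duplicate'] = df.duplicated(subset=['ID', 'Name'])

print(df['is_duplicate'].tolist())
# → [False, False, False, True, False, False, True]
```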


📤 7. Extract Only the Duplicated Rows

dup_only = df[df.duplicated(subset=['ID', 'Name'], keep=False)]

✔️ Retrieves all duplicated entries (both original and duplicate versions).
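Because `keep=False` marks every copy (including the first), the extracted frame contains rows 0, 3, and 6 from the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [101, 102, 103, 101, 104, 105, 101],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve', 'Alice'],
    'Score': [85, 90, 88, 85, 92, 87, 85]
})

# keep=False marks every copy, so the original row 0 appears too
dup_only = df[df.duplicated(subset=['ID', 'Name'], keep=False)]

print(dup_only.index.tolist())   # → [0, 3, 6]
```

This view is useful for side-by-side inspection before deciding which copy to keep.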


🧾 8. Remove Duplicates In-Place

df.drop_duplicates(inplace=True)

✔️ Removes duplicates directly from the original DataFrame. Note that reassigning the result (df = df.drop_duplicates()) is generally preferred over inplace=True in modern pandas code.


⚙️ 9. Detect Duplicates in a Single Column

df['ID'].duplicated()

✔️ Identifies repeated values in a single column like ID.
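On the sample data this gives the same pattern as the full-row check, since the duplicated rows share the same ID; pairing it with `value_counts()` shows how often each value repeats:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [101, 102, 103, 101, 104, 105, 101],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve', 'Alice'],
    'Score': [85, 90, 88, 85, 92, 87, 85]
})

# Later repeats of an ID come back True
print(df['ID'].duplicated().tolist())
# → [False, False, False, True, False, False, True]

# How many times does ID 101 appear in total?
print(df['ID'].value_counts().loc[101])   # → 3
```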


📌 Summary – Key Takeaways

Handling duplicated data is critical for ensuring integrity in reports, summaries, and ML training sets. Pandas gives you precise control over how duplicates are detected, flagged, or removed.

🔍 Key Takeaways:

  • Use .duplicated() to detect full or column-specific duplicates
  • Use .drop_duplicates() to remove them (with control over which one to keep)
  • Use keep=False to capture all duplicated versions
  • Combine .duplicated() with logic to flag or isolate records

⚙️ Real-world relevance: Used in CRM cleanup, transaction logs, deduplication before joins, and data warehouse management.


❓ FAQs – Handling Duplicated Data in Pandas

❓ How do I keep only the last occurrence of each duplicate?

df.drop_duplicates(keep='last')

❓ Can I detect duplicates on just one column?
✅ Yes:

df['ID'].duplicated()

❓ How do I remove duplicates while keeping the original DataFrame untouched?
Create a new copy:

df_cleaned = df.drop_duplicates()

❓ How do I remove duplicates across only selected columns?
Use the subset parameter:

df.drop_duplicates(subset=['Name'])

❓ Can I highlight duplicates instead of removing them?
✅ Yes. Create a flag:

df['duplicate_flag'] = df.duplicated(subset=['Name'])
