
🧯 Pandas Detecting & Dropping Duplicates – Clean Repetitive Data Easily


🧲 Introduction – Why Detect & Drop Duplicates?

Duplicate records in your dataset can lead to inaccurate analyses, skewed aggregates, and other data quality issues. Pandas makes it simple to detect, analyze, and remove duplicate entries using intuitive methods like .duplicated() and .drop_duplicates()—whether you’re cleaning entire rows or targeting specific columns.

🎯 In this guide, you’ll learn:

  • How to detect duplicates in rows or columns
  • How to drop duplicates while keeping the first or last entry
  • How to use subsets and flags for controlled cleaning
  • How to remove duplicates in place or flag them for review

🧪 1. Create a Sample DataFrame with Duplicates

import pandas as pd

df = pd.DataFrame({
    'ID': [101, 102, 103, 101, 104, 105, 101],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve', 'Alice'],
    'Score': [85, 90, 88, 85, 92, 87, 85]
})

print(df)

👉 Output:

    ID     Name  Score
0  101    Alice     85
1  102      Bob     90
2  103  Charlie     88
3  101    Alice     85
4  104    David     92
5  105      Eve     87
6  101    Alice     85

🔍 2. Detect Duplicate Rows

df.duplicated()

✔️ Returns a Boolean Series with True for rows that are duplicates of a previous row.

👉 Output:

0    False
1    False
2    False
3     True
4    False
5    False
6     True
dtype: bool
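
💡 Tip: since .duplicated() returns a Boolean Series, summing it counts the duplicate rows directly:

print(df.duplicated().sum())   # 2 duplicate rows in this sample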

🧼 3. Drop Duplicate Rows (All Columns)

df_no_duplicates = df.drop_duplicates()

✔️ Drops rows that are fully identical across all columns—keeps the first occurrence by default.
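
👉 Output:

    ID     Name  Score
0  101    Alice     85
1  102      Bob     90
2  103  Charlie     88
4  104    David     92
5  105      Eve     87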


🎯 4. Drop Duplicates Based on Specific Columns

df_unique_id = df.drop_duplicates(subset=['ID'])

✔️ Keeps only the first occurrence of each unique ID.
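
👉 Output:

    ID     Name  Score
0  101    Alice     85
1  102      Bob     90
2  103  Charlie     88
4  104    David     92
5  105      Eve     87

(In this sample the result matches step 3, because every repeated ID also happens to be a full-row duplicate.)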

👉 Use subset=['ID', 'Name'] to combine multiple columns for uniqueness.


🔁 5. Keep the Last Occurrence Instead

df_last = df.drop_duplicates(subset=['ID'], keep='last')

✔️ Keeps the last occurrence of each ID instead of the first.
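
👉 Output:

    ID     Name  Score
1  102      Bob     90
2  103  Charlie     88
4  104    David     92
5  105      Eve     87
6  101    Alice     85

Note that row 6, the last occurrence of ID 101, is retained instead of row 0.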


🧾 6. Detect and Flag Duplicates Without Dropping

df['is_duplicate'] = df.duplicated(subset=['ID', 'Name'])

✔️ Adds a new column that flags duplicates as True—good for manual review or conditional deletion.
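
👉 Output:

    ID     Name  Score  is_duplicate
0  101    Alice     85         False
1  102      Bob     90         False
2  103  Charlie     88         False
3  101    Alice     85          True
4  104    David     92         False
5  105      Eve     87         False
6  101    Alice     85          True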


📤 7. Extract All Duplicate Entries (Including Originals)

duplicates = df[df.duplicated(subset=['ID', 'Name'], keep=False)]

✔️ Returns all rows involved in duplication, including the first occurrence, not just the later copies.
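
👉 Output (run on the original three-column DataFrame, before the is_duplicate flag from step 6 was added):

    ID   Name  Score
0  101  Alice     85
3  101  Alice     85
6  101  Alice     85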


🧱 8. Remove Duplicates In-Place

df.drop_duplicates(inplace=True)

✔️ Modifies the original DataFrame by deleting duplicates directly—no reassignment needed.
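
⚠️ Note: with inplace=True the method returns None, so don’t reassign the result:

df.drop_duplicates(inplace=True)         # correct: df is modified in place
# df = df.drop_duplicates(inplace=True)  # wrong: df would become None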


📌 Summary – Key Takeaways

Detecting and dropping duplicates is essential for maintaining data integrity and accurate insights. Pandas gives you full control with .duplicated() for detection and .drop_duplicates() for removal—both flexible and fast.

🔍 Key Takeaways:

  • Use .duplicated() to flag duplicate rows (full or partial)
  • Use .drop_duplicates() to remove duplicates while keeping the first or last
  • Use subset=[] to check duplication on selected columns
  • Combine flags and subset control for smarter deduplication
  • Use inplace=True if you want to modify the original DataFrame

⚙️ Real-world relevance: Common in CRM databases, Excel imports, transactional logs, and dataset merges.
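
🧩 Putting it together, here’s a minimal cleanup sketch (the names review and df_clean are illustrative, not part of any pandas API):

# Flag duplicate IDs, pull out every copy for review, then keep the first of each
df['is_duplicate'] = df.duplicated(subset=['ID'])
review = df[df.duplicated(subset=['ID'], keep=False)]  # all copies, originals included
df_clean = df.drop_duplicates(subset=['ID']).drop(columns='is_duplicate')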


❓ FAQs – Detecting & Dropping Duplicates in Pandas

❓ How do I remove duplicates based on specific columns only?

df.drop_duplicates(subset=['Name'])

❓ Can I find all rows that are duplicates (including the original)?
✅ Yes:

df[df.duplicated(keep=False)]

❓ What happens if I use keep='last'?
✔️ Keeps the last occurrence and drops earlier ones:

df.drop_duplicates(keep='last')

❓ Does dropping duplicates affect the original DataFrame?
❌ No—unless you set inplace=True.
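
Assign the result to a new variable if you want to keep the original intact:

df_clean = df.drop_duplicates()   # df itself is unchanged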


❓ Can I just flag duplicates for review instead of deleting them?
✅ Yes:

df['is_duplicate'] = df.duplicated(subset=['ID'])
