4️⃣ 🧹 Pandas Data Cleaning & Preprocessing

🧯 Pandas Removing Duplicates – Clean Up Repeated Data for Accurate Results


🧲 Introduction – Why Remove Duplicates?

Duplicate rows can skew results, inflate counts, and corrupt analyses. Whether from data entry errors, file merges, or API pulls, duplicates need to be detected and removed. Pandas offers fast and flexible methods to identify and drop duplicates at the row or column level.

🎯 In this guide, you’ll learn:

  • How to detect duplicate rows and entries
  • How to remove full or partial duplicates
  • How to keep a specific occurrence (first or last)
  • How to apply duplicate logic to selected columns

🔍 1. Detect Duplicate Rows

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 25, 35],
    'Score': [90, 85, 90, 88]
})

print(df.duplicated())

👉 Output:

0    False
1    False
2     True
3    False
dtype: bool

✔️ .duplicated() returns a Boolean Series—True for rows that are exact duplicates of previous rows.
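Beyond listing which rows are flagged, it is often handy to count duplicates or to see every member of a duplicate group. A small sketch using the same sample df as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 25, 35],
    'Score': [90, 85, 90, 88]
})

# Total count of fully duplicated rows (first occurrences are not counted)
dup_count = df.duplicated().sum()
print(dup_count)  # 1

# keep=False marks *every* member of a duplicate group as True,
# so you can see all copies side by side
all_dups = df[df.duplicated(keep=False)]
print(all_dups)  # both 'Alice' rows (index 0 and 2)
```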


🧹 2. Remove Duplicate Rows

df_no_dup = df.drop_duplicates()

✔️ Drops rows that are fully duplicated across all columns.
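Continuing with the sample df from step 1, note that the surviving rows keep their original index labels:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 25, 35],
    'Score': [90, 85, 90, 88]
})

df_no_dup = df.drop_duplicates()
print(df_no_dup)
# Row 2 is gone; indices 0, 1 and 3 remain.
# Chain .reset_index(drop=True) if you want a clean 0..n-1 index.
```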


🎯 3. Remove Duplicates by Specific Columns

df_no_dup = df.drop_duplicates(subset=['Name'])

✔️ Drops duplicate rows based on the Name column only. Keeps the first occurrence by default.

👉 Output:

      Name  Age  Score
0    Alice   25     90
1      Bob   30     85
3  Charlie   35     88

🧾 4. Keep the Last Occurrence Instead

df_last = df.drop_duplicates(subset=['Name'], keep='last')

✔️ Keeps the last entry when duplicates are found.
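With the sample df from step 1, keep='last' retains index 2 ('Alice') instead of index 0. A related option is keep=False, which drops every member of a duplicate group:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 25, 35],
    'Score': [90, 85, 90, 88]
})

# Keep the last occurrence of each Name
df_last = df.drop_duplicates(subset=['Name'], keep='last')
print(df_last)  # indices 1, 2, 3

# Drop *all* rows that have a duplicated Name
df_none = df.drop_duplicates(subset=['Name'], keep=False)
print(df_none)  # only Bob and Charlie remain
```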


🔄 5. Mark Duplicates Without Dropping

df['is_duplicate'] = df.duplicated()

✔️ Adds a column flag so you can analyze duplicates before removing.
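Building on the same sample df, the flag column lets you inspect the duplicates, or compute a quick data-quality metric, before dropping anything:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 25, 35],
    'Score': [90, 85, 90, 88]
})

df['is_duplicate'] = df.duplicated()

# Review the flagged rows before deciding what to remove
print(df[df['is_duplicate']])

# Share of duplicated rows, useful for a quick data-quality check
dup_ratio = df['is_duplicate'].mean()
print(f"{dup_ratio:.0%} of rows are duplicates")  # 25%
```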


🧱 6. Remove Duplicates In-Place

df.drop_duplicates(inplace=True)

✔️ Modifies the original DataFrame without needing reassignment. Note that reassignment (df = df.drop_duplicates()) is generally preferred in modern pandas code over inplace=True.


⚙️ 7. Customize Duplicate Logic (e.g., Case-Insensitive)

df['Name'] = df['Name'].str.lower()
df.drop_duplicates(subset=['Name'], inplace=True)

✔️ Normalizes column values (like Alice vs alice) before checking for duplicates.
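Overwriting Name with its lowercase form changes the stored data. If you would rather keep the original capitalization, one non-destructive sketch is to deduplicate on a normalized helper Series instead:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'alice', 'Bob', 'ALICE'],
    'Score': [90, 91, 85, 92]
})

# Build a lowercase key without modifying the column itself
key = df['Name'].str.lower()
df_dedup = df[~key.duplicated()]
print(df_dedup)  # keeps 'Alice' (index 0) and 'Bob' (index 2), casing intact
```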


📌 Summary – Key Takeaways

Removing duplicates is essential for accurate aggregations, reporting, and machine learning. Pandas gives you the tools to drop, flag, or filter duplicates easily.

🔍 Key Takeaways:

  • Use .duplicated() to find repeated rows
  • .drop_duplicates() removes them—entire rows or by column
  • keep='first' (default) vs keep='last' controls which record is retained
  • Use inplace=True to apply changes directly

⚙️ Real-world relevance: Critical for sales reports, customer lists, transaction logs, and any data that comes from multiple sources.
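Pulling the steps together, here is a compact end-to-end sketch on an invented customer list (the column names and values are illustrative only):

```python
import pandas as pd

# Hypothetical customer list merged from two sources
customers = pd.DataFrame({
    'email': ['a@x.com', 'b@x.com', 'a@x.com', 'c@x.com', 'b@x.com'],
    'name':  ['Ann', 'Ben', 'Ann', 'Cal', 'Ben'],
})

# 1. Check how many duplicate emails exist
n_dups = customers.duplicated(subset=['email']).sum()
print(n_dups)  # 2

# 2. Keep the last record per email (e.g. the most recent import)
clean = customers.drop_duplicates(subset=['email'], keep='last')
print(clean.reset_index(drop=True))
```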


❓ FAQs – Removing Duplicates in Pandas

❓ How do I check for duplicates in specific columns only?

df.duplicated(subset=['Email', 'Phone'])

❓ How do I keep the last occurrence of a duplicate?

df.drop_duplicates(keep='last')

❓ Can I remove duplicates but keep a flag for reference?
✅ Yes:

df['is_dup'] = df.duplicated()

❓ What if I want to remove duplicates case-insensitively?
Lowercase the values first:

df['Name'] = df['Name'].str.lower()
df.drop_duplicates(subset=['Name'])

❓ Does drop_duplicates() affect the original DataFrame?
❌ No. It returns a new DataFrame unless you pass inplace=True.

