🧯 Pandas Detecting & Dropping Duplicates – Clean Repetitive Data Easily
🧲 Introduction – Why Detect & Drop Duplicates?
Duplicate records in your dataset can cause inaccurate analyses, wrong aggregates, or data quality issues. Pandas makes it simple to detect, analyze, and remove duplicate entries using intuitive methods like .duplicated() and .drop_duplicates(), whether you're cleaning entire rows or targeting specific columns.
🎯 In this guide, you’ll learn:
- How to detect duplicates in rows or columns
- How to drop duplicates while keeping the first or last entry
- How to use subsets and flags for controlled cleaning
- How to perform in-place or selective duplicate handling
🧪 1. Create a Sample DataFrame with Duplicates
import pandas as pd
df = pd.DataFrame({
'ID': [101, 102, 103, 101, 104, 105, 101],
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve', 'Alice'],
'Score': [85, 90, 88, 85, 92, 87, 85]
})
print(df)
👉 Output:
    ID     Name  Score
0  101    Alice     85
1  102      Bob     90
2  103  Charlie     88
3  101    Alice     85
4  104    David     92
5  105      Eve     87
6  101    Alice     85
🔍 2. Detect Duplicate Rows
df.duplicated()
✔️ Returns a Boolean Series with True for rows that are duplicates of a previous row.
👉 Output:
0    False
1    False
2    False
3     True
4    False
5    False
6     True
dtype: bool
🧼 3. Drop Duplicate Rows (All Columns)
df_no_duplicates = df.drop_duplicates()
✔️ Drops rows that are fully identical across all columns—keeps the first occurrence by default.
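👉 Output (rows 3 and 6, exact copies of row 0, are removed):
    ID     Name  Score
0  101    Alice     85
1  102      Bob     90
2  103  Charlie     88
4  104    David     92
5  105      Eve     87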
🎯 4. Drop Duplicates Based on Specific Columns
df_unique_id = df.drop_duplicates(subset=['ID'])
✔️ Keeps only the first occurrence of each unique ID.
👉 Use subset=['ID', 'Name'] to combine multiple columns for uniqueness.
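On this sample, subset=['ID'] gives the same result as step 3, because every repeated ID is also a full-row duplicate. Here's a minimal sketch of where the difference shows up, using a hypothetical extra row with a duplicate ID but a new Score:
# Hypothetical extra row: duplicate ID, different Score
extra = pd.DataFrame({'ID': [101], 'Name': ['Alice'], 'Score': [99]})
df2 = pd.concat([df, extra], ignore_index=True)
print(df2.drop_duplicates())               # keeps the new row (not a full-row copy)
print(df2.drop_duplicates(subset=['ID']))  # drops it (ID 101 was already seen)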
🔁 5. Keep the Last Occurrence Instead
df_last = df.drop_duplicates(subset=['ID'], keep='last')
✔️ Keeps the last occurrence of each ID instead of the first.
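👉 Output (row 6 survives instead of rows 0 and 3):
    ID     Name  Score
1  102      Bob     90
2  103  Charlie     88
4  104    David     92
5  105      Eve     87
6  101    Alice     85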
🧾 6. Detect and Flag Duplicates Without Dropping
df['is_duplicate'] = df.duplicated(subset=['ID', 'Name'])
✔️ Adds a new column that flags duplicates as True, which is useful for manual review or conditional deletion.
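👉 Output:
    ID     Name  Score  is_duplicate
0  101    Alice     85         False
1  102      Bob     90         False
2  103  Charlie     88         False
3  101    Alice     85          True
4  104    David     92         False
5  105      Eve     87         False
6  101    Alice     85          True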
📤 7. Extract All Duplicate Entries (Including Originals)
duplicates = df[df.duplicated(subset=['ID', 'Name'], keep=False)]
✔️ Shows all rows involved in duplication, not just the second or third copies.
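👉 Output (assuming the is_duplicate flag from step 6 hasn't been added):
    ID   Name  Score
0  101  Alice     85
3  101  Alice     85
6  101  Alice     85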
🧱 8. Remove Duplicates In-Place
df.drop_duplicates(inplace=True)
✔️ Modifies the original DataFrame by deleting duplicates directly—no reassignment needed.
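👉 Note that drop_duplicates() preserves the original index labels, so gaps remain where rows were removed. A minimal sketch of the common alternative (reassignment rather than inplace=True, which tends to be preferred in modern pandas code), with the index rebuilt afterwards:
df = df.drop_duplicates().reset_index(drop=True)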
📌 Summary – Key Takeaways
Detecting and dropping duplicates is essential for maintaining data integrity and accurate insights. Pandas gives you full control with .duplicated() for detection and .drop_duplicates() for removal; both are flexible and fast.
🔍 Key Takeaways:
- Use .duplicated() to flag duplicate rows (full or partial)
- Use .drop_duplicates() to remove duplicates while keeping the first or last occurrence
- Use subset=[] to check duplication on selected columns only
- Combine flags and subset control for smarter deduplication
- Use inplace=True if you want to modify the original DataFrame
⚙️ Real-world relevance: Common in CRM databases, Excel imports, transactional logs, and dataset merges.
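🧰 To tie the steps together, here's a minimal end-to-end sketch of a typical import-and-clean workflow. The file name customers.csv and the Email column are hypothetical placeholders; adjust them to your data:
import pandas as pd

# Hypothetical input file and key column
df = pd.read_csv('customers.csv')

# 1. Flag every row involved in duplication for review (originals included)
df['is_duplicate'] = df.duplicated(subset=['Email'], keep=False)
print(df[df['is_duplicate']])

# 2. Keep the last entry per Email (assumes rows arrive in load order)
clean = df.drop_duplicates(subset=['Email'], keep='last')

# 3. Drop the helper flag and rebuild a contiguous index
clean = clean.drop(columns='is_duplicate').reset_index(drop=True)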
❓ FAQs – Detecting & Dropping Duplicates in Pandas
❓ How do I remove only specific column duplicates?
df.drop_duplicates(subset=['Name'])
❓ Can I find all rows that are duplicates (including the original)?
✅ Yes:
df[df.duplicated(keep=False)]
❓ What happens if I use keep='last'?
✔️ Keeps the last occurrence and drops earlier ones:
df.drop_duplicates(keep='last')
❓ Does dropping duplicates affect the original DataFrame?
❌ No, unless you set inplace=True.
❓ Can I just flag duplicates for review instead of deleting them?
✅ Yes:
df['is_duplicate'] = df.duplicated(subset=['ID'])