🛠️ Pandas Fixing Wrong Data – Correct Inaccurate and Inconsistent Values for Clean Analysis
🧲 Introduction – Why Fix Wrong Data?
Datasets often contain wrong values—like misspellings, incorrect entries, inconsistent formats, or logically invalid values. If left unresolved, these can lead to flawed analysis, poor model accuracy, and incorrect conclusions. Pandas offers practical tools to identify and fix wrong data efficiently.
🎯 In this guide, you’ll learn:
- How to detect and correct wrong data entries
- How to replace invalid values and typos
- How to validate logical conditions
- How to standardize inconsistent categorical labels
🕵️ 1. Detect Wrong or Unexpected Values
import pandas as pd
df = pd.DataFrame({
    'Age': [25, -5, 300, 40],                # Invalid ages
    'Gender': ['M', 'F', 'femle', 'male'],   # Misspelled or inconsistent categories
    'Score': [88, 92, None, 77]              # Missing score
})
print(df)
👉 Output:
   Age Gender  Score
0   25      M   88.0
1   -5      F   92.0
2  300  femle    NaN
3   40   male   77.0
✔️ This dataset contains:
- Invalid ages (e.g., -5, 300)
- Misspelled or inconsistent genders (femle, plus M and F alongside male)
- A missing score
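The problems listed above can also be surfaced programmatically instead of by eye. The following is a minimal sketch, assuming a plausible age range of 0 to 120 (the same bounds used later in this guide); the variable names `bad_age`, `gender_counts`, and `missing` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, -5, 300, 40],
    'Gender': ['M', 'F', 'femle', 'male'],
    'Score': [88, 92, None, 77],
})

# Rows whose age falls outside the assumed plausible range 0-120 (inclusive)
bad_age = df[~df['Age'].between(0, 120)]

# Frequency of each label; rare or unexpected spellings stand out here
gender_counts = df['Gender'].value_counts()

# Number of missing values per column
missing = df.isna().sum()

print(bad_age)
print(gender_counts)
print(missing)
```

Checks like these only flag candidates; whether a flagged value is truly wrong still depends on what the column is supposed to mean.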
🧼 2. Replace Misspelled or Inconsistent Categories
df['Gender'] = df['Gender'].str.lower().replace({
    'm': 'male',
    'f': 'female',
    'femle': 'female'
})
✔️ Converts all values to lowercase and replaces common spelling errors with standardized terms.
👉 Result:
   Gender
0    male
1  female
2  female
3    male
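A replacement dictionary only fixes the typos it already knows about. As a hedged sketch, the example below also trims stray whitespace and then lists any labels the mapping did not cover, so they can be reviewed by hand. The sample values `' Female '` and `'unknwn'` and the names `mapping`, `allowed`, and `leftover` are made up for illustration.

```python
import pandas as pd

# Sample labels, including whitespace noise and a typo the mapping does not know
gender = pd.Series(['M', 'F', 'femle', 'male', ' Female ', 'unknwn'])

mapping = {'m': 'male', 'f': 'female', 'femle': 'female'}

# Normalize whitespace and case first, then apply the exact-match mapping
cleaned = gender.str.strip().str.lower().replace(mapping)

# Anything outside the allowed set still needs attention
allowed = {'male', 'female'}
leftover = cleaned[~cleaned.isin(allowed)]

print(cleaned.tolist())
print(leftover.tolist())
```

Reviewing `leftover` after each run is one way to grow the mapping over time rather than silently keeping unknown labels.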
📏 3. Fix Invalid Numerical Values
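# Cast to float first: an integer column cannot hold NaN, and recent pandas
# versions warn (or raise) when a missing value forces a dtype change
df['Age'] = df['Age'].astype('float64')
df.loc[(df['Age'] < 0) | (df['Age'] > 120), 'Age'] = float('nan')
✔️ Replaces unrealistic age values (less than 0 or greater than 120) with NaN. The column becomes float64 because NaN is a floating-point value.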
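The same rule can be written without an explicit assignment through `.loc`. This is a minimal sketch using `Series.where()`, which keeps values where the condition holds and inserts NaN elsewhere; the 0 to 120 bounds are the same assumption as above.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, -5, 300, 40]})

# Keep ages inside 0-120 (inclusive); everything else becomes NaN.
# The result is float64 because NaN cannot be stored in an integer column.
df['Age'] = df['Age'].where(df['Age'].between(0, 120))

print(df)
```

Because `where()` returns a new Series, the dtype change happens in one step and no separate cast is needed.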
🔁 4. Fill or Replace Incorrect Values
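# Assign the result back; calling fillna(..., inplace=True) on a selected
# column is deprecated in recent pandas and may not update df
df['Score'] = df['Score'].fillna(df['Score'].mean())
✔️ Replaces missing values in the Score column with the mean score. fillna() only touches NaN, so wrong values must first be set to NaN (as in step 3) before they can be filled this way.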
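Once a value has been filled, it can no longer be told apart from a real measurement. One possible approach, shown here as a sketch, is to record which rows were imputed before filling; the column name `Score_imputed` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Score': [88, 92, None, 77]})

# Remember which rows were missing before they are overwritten
df['Score_imputed'] = df['Score'].isna()

# mean() skips NaN by default, so it averages only 88, 92 and 77
mean_score = df['Score'].mean()
df['Score'] = df['Score'].fillna(mean_score)

print(df)
```

Keeping the flag makes it possible to exclude or down-weight imputed rows later in the analysis.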
🧾 5. Replace Specific Wrong Values
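df['Age'] = df['Age'].replace({300: 30})
✔️ Replaces a specific incorrect value (300) with a corrected one (30). Use this when the true value is known, and run it before the range rule in step 3; otherwise 300 has already been turned into NaN and there is nothing left to replace.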
🧠 6. Use Conditional Replacement with apply()
def fix_age(x):
    # Return NaN for ages outside 0-120; NaN inputs pass through unchanged
    return float('nan') if x < 0 or x > 120 else x

df['Age'] = df['Age'].apply(fix_age)
✔️ Applies a custom rule to each value for fixing complex logic errors.
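`apply()` earns its keep when the rule has several branches that are awkward to express as a single mask. The sketch below uses a purely hypothetical policy: a negative age is treated as a sign typo and flipped, while anything still above 120 is discarded. Whether such a correction is appropriate depends entirely on how the data was collected.

```python
import math
import pandas as pd

def fix_age(x):
    """Hypothetical rule: flip negative ages, discard implausible ones."""
    if pd.isna(x):
        return x              # leave existing missing values alone
    x = abs(x)                # assume a negative sign was a data-entry slip
    return x if x <= 120 else math.nan

ages = pd.Series([25, -5, 300, 40])
fixed = ages.apply(fix_age)

print(fixed)
```

For simple range checks, a vectorized condition is usually faster than `apply()`, which calls the Python function once per value.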
🔍 7. Validate Column Against Allowed Values
valid_genders = ['male', 'female']
df = df[df['Gender'].isin(valid_genders)]
✔️ Filters out rows where Gender is not one of the allowed categories.
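Filtering discards rows permanently, so it can help to look at what is being rejected first. The following is a minimal sketch with made-up sample rows; `is_valid`, `rejected`, and `clean` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['male', 'female', 'unknown', None],
    'Score': [88, 92, 70, 77],
})

valid_genders = ['male', 'female']

# Boolean mask: True where the label is allowed (missing values count as invalid)
is_valid = df['Gender'].isin(valid_genders)

# Keep the rejected rows for inspection instead of dropping them blindly
rejected = df[~is_valid]
clean = df[is_valid].reset_index(drop=True)

print(rejected)
print(clean)
```

If `rejected` is larger than expected, that is often a sign the standardization step missed a spelling variant.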
📌 Summary – Key Takeaways
Fixing wrong data ensures your dataset is trustworthy, consistent, and ready for analysis or modeling. Pandas makes this easy with powerful string, logic, and transformation tools.
🔍 Key Takeaways:
- Replace typos with .replace() or .apply() functions
- Validate numerical values using logical conditions
- Standardize categories with .str.lower() and replacement dictionaries
- Filter out or correct invalid data rows
⚙️ Real-world relevance: Essential in data cleaning pipelines, especially in retail, healthcare, finance, and surveys where user inputs or sensors introduce noise.
❓ FAQs – Fixing Wrong Data in Pandas
❓ How do I replace a single wrong value?
df['Age'] = df['Age'].replace({300: 30})
❓ Can I apply logic-based replacements column-wise?
✅ Yes:
df['Age'] = df['Age'].apply(lambda x: None if x < 0 else x)
❓ How can I fix multiple string typos at once?
Use .replace() with a dictionary:
df['Gender'] = df['Gender'].replace({'femle': 'female', 'm': 'male'})
❓ What if I want to remove invalid rows entirely?
Use:
# Keeps ages from 0 to 120 inclusive; rows with a missing Age are dropped too
df = df[df['Age'].between(0, 120)]