4️⃣ 🧹 Pandas Data Cleaning & Preprocessing


🛠️ Pandas Fixing Wrong Data – Correct Inaccurate and Inconsistent Values for Clean Analysis

🧲 Introduction – Why Fix Wrong Data?

Datasets often contain wrong values, such as misspellings, incorrect entries, inconsistent formats, or logically invalid values. Left unresolved, these lead to flawed analysis, poor model accuracy, and incorrect conclusions. Pandas offers practical tools to identify and fix wrong data efficiently.

🎯 In this guide, you’ll learn:

How to detect and correct wrong data entries
How to replace invalid values and typos
How to validate logical conditions
How to standardize inconsistent categorical labels

🕵️‍♂️ 1. Detect Wrong or Unexpected Values

import pandas as pd

df = pd.DataFrame({
    'Age': [25, -5, 300, 40],       # Invalid ages
    'Gender': ['M', 'F', 'femle', 'male'],  # Misspelled categories
    'Score': [88, 92, None, 77]     # Missing score
})

print(df)

👉 Output:

   Age Gender  Score
0   25      M   88.0
1   -5      F   92.0
2  300  femle    NaN
3   40   male   77.0

✔️ This dataset contains:

Invalid ages (e.g., -5, 300)
Misspelled or inconsistent genders (femle is a typo; M/F and male mix two labeling styles)
Missing score
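The problems listed above can also be surfaced programmatically rather than by eye. The following is a minimal sketch, assuming a plausible age range of 0 to 120 (an assumption chosen for illustration, not a rule from the dataset); the variable names `bad_age`, `gender_counts`, and `missing` are likewise illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, -5, 300, 40],
    'Gender': ['M', 'F', 'femle', 'male'],
    'Score': [88, 92, None, 77],
})

# Rows whose age falls outside the assumed plausible range 0-120
bad_age = df[~df['Age'].between(0, 120)]

# Frequency of each distinct label; rare or odd spellings stand out
gender_counts = df['Gender'].value_counts()

# Number of missing values per column
missing = df.isna().sum()

print(bad_age)
print(gender_counts)
print(missing)
```

Inspecting `value_counts()` before writing any replacement dictionary helps make sure every variant spelling is covered.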

🧼 2. Replace Misspelled or Inconsistent Categories

df['Gender'] = df['Gender'].str.lower().replace({
    'm': 'male',
    'f': 'female',
    'femle': 'female'
})

✔️ Converts all values to lowercase and replaces common spelling errors with standardized terms.

👉 Result:

   Gender
0    male
1  female
2  female
3    male
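One limitation of `.replace()` is that labels missing from the dictionary pass through unchanged and unnoticed. A possible alternative, sketched here on hypothetical sample data, is `.map()`, which turns every unmapped label into NaN so it can be reviewed. The `mapping` dictionary and the `'unknwn'` entry are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical raw labels, including stray whitespace and an unexpected value
gender = pd.Series(['M', ' F ', 'femle', 'male', 'unknwn'])

# Every accepted spelling must appear as a key, including the correct ones
mapping = {
    'm': 'male',
    'male': 'male',
    'f': 'female',
    'female': 'female',
    'femle': 'female',
}

# Strip whitespace, lowercase, then map; unknown labels become NaN
cleaned = gender.str.strip().str.lower().map(mapping)

# Original labels that the mapping did not recognize
unmapped = gender[cleaned.isna()]

print(cleaned)
print(unmapped)
```

The trade-off is that `.map()` requires the dictionary to list the valid labels as well, whereas `.replace()` only needs the wrong ones.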

📏 3. Fix Invalid Numerical Values

df['Age'] = df['Age'].mask((df['Age'] < 0) | (df['Age'] > 120))

✔️ Replaces unrealistic age values (less than 0 or greater than 120) with NaN.
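A side effect worth knowing: NaN is a floating-point value, so an integer column that receives NaN becomes float64. If whole-number ages matter downstream, one option is the nullable `Int64` dtype. The sketch below uses a standalone Series with the same ages as the example and the same assumed 0 to 120 range.

```python
import pandas as pd

ages = pd.Series([25, -5, 300, 40])

# Boolean mask of values outside the assumed plausible range
out_of_range = (ages < 0) | (ages > 120)

# Masked values become NaN, which forces the column to float64
as_float = ages.mask(out_of_range)

# Convert to pandas' nullable integer dtype to keep whole numbers
as_nullable_int = as_float.astype('Int64')

print(as_float.dtype)
print(as_nullable_int)
```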

🔁 4. Fill or Replace Incorrect Values

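df['Score'] = df['Score'].fillna(df['Score'].mean())

✔️ Replaces missing values (NaN) in the Score column with the mean score. Wrong but non-missing values are not touched by fillna(); set them to NaN first, as in step 3. Assigning the result back avoids calling inplace=True on a selected column, which recent pandas versions deprecate.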
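Once a gap is filled, an imputed value looks exactly like a measured one. A common precaution, sketched here with a hypothetical indicator column named `Score_was_missing`, is to record which rows were missing before filling them.

```python
import pandas as pd

scores = pd.DataFrame({'Score': [88, 92, None, 77]})

# Remember which rows were missing before they are filled
scores['Score_was_missing'] = scores['Score'].isna()

# mean() skips NaN, so this is the mean of 88, 92, and 77
scores['Score'] = scores['Score'].fillna(scores['Score'].mean())

print(scores)
```

The indicator column lets later analysis exclude or down-weight imputed rows if needed.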

🧾 5. Replace Specific Wrong Values

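df['Age'] = df['Age'].replace({300: 30})

✔️ Replaces a specific incorrect value (300) with a corrected one (30). Use this when the true value is known, and run it before any range check: once step 3 has turned 300 into NaN, there is nothing left for this replacement to match.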

🧠 6. Use Conditional Replacement with `apply()`

def fix_age(x):
    # Return NaN for ages outside the plausible 0-120 range
    return float('nan') if x < 0 or x > 120 else x

df['Age'] = df['Age'].apply(fix_age)

✔️ Applies a custom rule to each value for fixing complex logic errors.
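A custom function earns its keep when the rule goes beyond a single comparison. As a minimal sketch, assume a hypothetical column where ages were typed by hand, so some entries are strings, some are missing, and some are out of range. This variant of `fix_age` handles all three cases.

```python
import math

import pandas as pd


def fix_age(value):
    """Return a float age in [0, 120], or NaN if the entry cannot be trusted."""
    try:
        age = float(value)  # accepts numbers and numeric strings such as '40 '
    except (TypeError, ValueError):
        return math.nan  # None or non-numeric text
    if math.isnan(age) or age < 0 or age > 120:
        return math.nan  # missing or outside the assumed plausible range
    return age


# Hypothetical messy input, not taken from the dataset above
raw = pd.Series([25, '40 ', 'unknown', -5, None, 300])
cleaned = raw.apply(fix_age)

print(cleaned)
```

For simple range checks on a numeric column, a vectorized boolean mask is usually faster than `apply()`; the function form is mainly useful for mixed or irregular input like this.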

🔍 7. Validate Column Against Allowed Values

valid_genders = ['male', 'female']
df = df[df['Gender'].isin(valid_genders)]

✔️ Filters out rows where Gender is not one of the allowed categories.
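Dropping rows silently can hide data-quality problems. One option is to split the data into accepted and rejected rows so the rejects can be reviewed. The sketch below uses a small hypothetical DataFrame; the names `clean` and `rejected` are illustrative.

```python
import pandas as pd

# Hypothetical data with one label outside the allowed set
df = pd.DataFrame({
    'Age': [25, 40, 31],
    'Gender': ['male', 'female', 'unknown'],
})

valid_genders = ['male', 'female']
is_valid = df['Gender'].isin(valid_genders)

# .copy() makes each part an independent DataFrame
clean = df[is_valid].copy()
rejected = df[~is_valid].copy()

print(f"kept {len(clean)} rows, rejected {len(rejected)}")
print(rejected)
```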

📌 Summary – Key Takeaways

Fixing wrong data makes your dataset trustworthy, consistent, and ready for analysis or modeling. Pandas supports this with string methods, boolean masks, and replacement functions.

🔍 Key Takeaways:

Replace typos with the .replace() or .apply() methods
Validate numerical values using logical conditions
Standardize categories with .str.lower() and replacement dictionaries
Filter out or correct invalid data rows

⚙️ Real-world relevance: Essential in data cleaning pipelines, especially in retail, healthcare, finance, and surveys where user inputs or sensors introduce noise.

❓ FAQs – Fixing Wrong Data in Pandas

❓ How do I replace a single wrong value?

df['Age'] = df['Age'].replace({300: 30})

❓ Can I apply logic-based replacements column-wise?
✅ Yes:

df['Age'] = df['Age'].apply(lambda x: float('nan') if x < 0 else x)

❓ How can I fix multiple string typos at once?
Use .replace() with a dictionary:

df['Gender'] = df['Gender'].replace({'femle': 'female', 'm': 'male'})

❓ What if I want to remove invalid rows entirely?
Use:

# Keeps ages from 0 to 120; rows with a missing Age are dropped as well
df = df[df['Age'].between(0, 120)]
