Pandas Tutorial

4️⃣ 🧹 Pandas Data Cleaning & Preprocessing – Prepare Your Data for Analysis


🧲 Introduction – Why Learn Pandas Data Cleaning?

Data is rarely clean. In real-world scenarios, you’ll face missing values, incorrect formats, duplicate entries, and inconsistent labels. Pandas offers a suite of robust tools to clean, transform, and standardize datasets, preparing them for meaningful analysis.

🎯 In this guide, you’ll learn:

  • How to identify and handle missing data
  • How to fix incorrect data formats and duplicated entries
  • How to use Pandas preprocessing tools to get your data analysis-ready

📘 Topics Covered

| 🧹 Topic | 🔍 Description |
| --- | --- |
| Pandas Data Cleaning Overview | High-level methods for preprocessing data |
| Cleaning Empty Cells | Detect and manage blank or NaN values |
| Handling Wrong Formats | Fix incorrect data types (e.g., dates stored as strings) |
| Fixing Wrong Data | Replace invalid or incorrect entries |
| Removing Duplicates | Detect and delete repeated rows |
| Handling Missing Data | Identify and process missing values in DataFrames |
| Filling Missing Values | Use strategies like mean, median, or ffill |
| Interpolating Missing Values | Estimate values between known points |
| Dropping Missing Data | Remove rows or columns with NaN values |
| Calculations with Missing Data | Ensure accurate aggregation with NaNs |
| Handling Duplicated Data | Understand and resolve duplicate issues |
| Detecting & Dropping Duplicates | Use duplicated() and drop_duplicates() effectively |
| Counting Unique Values | Count distinct values in a Series or column |
| Managing Duplicated Labels | Resolve multiple columns/rows with the same labels |

🧹 Pandas Data Cleaning Overview

Data cleaning in Pandas involves:

  • Identifying and correcting errors
  • Filling or removing missing values
  • Detecting duplicates
  • Ensuring consistency in format and structure

🧼 Cleaning Empty Cells

df.isnull().sum()

Use .isnull() to detect missing values (chaining .sum() counts them per column), then handle them with .dropna() to remove them or .fillna() to replace them.
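
To make the detect-then-handle flow concrete, here is a minimal sketch. The DataFrame, its column names, and the fill defaults are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
df = pd.DataFrame({
    "name": ["Ann", "Bob", None, "Dee"],
    "age": [28, np.nan, 35, np.nan],
})

# Count missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: replace missing values with a per-column default
# (the defaults here are assumptions, pick ones that suit your data).
filled = df.fillna({"name": "unknown", "age": 0})
```

Both options return new DataFrames, so df itself is left unchanged and the two results can be compared side by side.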


🛠️ Handling Wrong Formats

Convert string dates (errors='coerce' turns values that cannot be parsed into NaT instead of raising an error):

df['date'] = pd.to_datetime(df['date'], errors='coerce')

Force numeric types (values that cannot be parsed become NaN):

df['value'] = pd.to_numeric(df['value'], errors='coerce')
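
The sketch below shows what coercion does to bad values. The sample strings are hypothetical, and the explicit format argument is an assumption added here so that parsing does not depend on format inference.

```python
import pandas as pd

# Hypothetical raw data where every column arrived as text.
df = pd.DataFrame({
    "date": ["2024-01-15", "2024-02-30", "not a date"],
    "value": ["10", "3.5", "n/a"],
})

# Invalid dates (February 30th, free text) become NaT instead of raising.
# The explicit format is optional; it is passed here to make parsing predictable.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")

# Non-numeric strings become NaN, and the column becomes float64.
df["value"] = pd.to_numeric(df["value"], errors="coerce")

print(df.dtypes)
print(df)
```

Coercion hides bad input silently, so it is worth counting the resulting NaT/NaN values with df.isna().sum() after converting.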

❌ Fixing Wrong Data

Overwrite invalid entries with a conditional assignment (here, negative scores are set to 0):

df.loc[df['score'] < 0, 'score'] = 0

Use .replace() for mapping:

df['grade'] = df['grade'].replace({'A+': 'A'})
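
A minimal sketch of both fixes on a hypothetical gradebook follows; the columns and the rule that negative scores are invalid are assumptions for illustration.

```python
import pandas as pd

# Hypothetical gradebook with two kinds of bad data.
df = pd.DataFrame({
    "score": [85, -5, 92, -1],
    "grade": ["A+", "B", "A", "A+"],
})

# Rule-based fix: any negative score is treated as invalid and reset to 0.
df.loc[df["score"] < 0, "score"] = 0

# Mapping-based fix: collapse the 'A+' label into 'A'.
df["grade"] = df["grade"].replace({"A+": "A"})

print(df)
```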

🧽 Removing Duplicates

df.duplicated()
df.drop_duplicates(inplace=True)

By default, rows are compared across all columns; pass subset=[...] to compare only selected columns.
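
As a small illustration, assuming a hypothetical table of sign-ups, the following sketch flags and then removes a fully repeated row.

```python
import pandas as pd

# Hypothetical data: the third row repeats the first one exactly.
df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com"],
    "plan": ["free", "pro", "free"],
})

# Boolean mask: True marks a row that repeats an earlier row.
dup_mask = df.duplicated()
print(dup_mask)

# Drop the repeats and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Assigning the result to a new name, rather than using inplace=True, keeps the original rows available for inspection.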


🔍 Handling Missing Data

df.isna().sum()

.isna() flags missing values; .notna() is its inverse and flags the values that are present.
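
A short sketch, using hypothetical weather readings, of how the two masks can be used to profile a DataFrame and to keep only the rows with a value present:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps.
df = pd.DataFrame({
    "city": ["Oslo", None, "Lima"],
    "temp": [4.0, np.nan, 19.5],
})

# Missing values per column, as a count and as a fraction of rows.
missing_counts = df.isna().sum()
missing_share = df.isna().mean()

# Keep only the rows where 'temp' is present.
present_rows = df[df["temp"].notna()]

print(missing_counts)
print(present_rows)
```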


🧯 Filling Missing Values

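df.fillna(0)  # returns a new DataFrame with every NaN replaced by 0
df['age'] = df['age'].fillna(df['age'].mean())  # fill one column with its mean

Assign the result back to the column; calling fillna(..., inplace=True) on a selected column is deprecated in recent Pandas versions and may not update the DataFrame.

Other options: fill with .median() or .mode()[0] (the most frequent value), or propagate neighboring values with .ffill() (forward fill) and .bfill() (backward fill).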
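
The strategies above can be sketched as follows. The DataFrame is hypothetical, and which strategy fits which column is an assumption that depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a numeric, a categorical, and an ordered column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, np.nan],
    "city": ["Oslo", None, "Oslo", "Lima"],
    "reading": [1.0, np.nan, np.nan, 4.0],
})

# Numeric column: fill with the mean of the known values.
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical column: mode() returns a Series, so take its first entry.
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Ordered data: carry the last known value forward, or the next one backward.
forward = df["reading"].ffill()
backward = df["reading"].bfill()

print(df)
print(forward.tolist(), backward.tolist())
```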


📈 Interpolating Missing Values

Estimate missing entries:

df.interpolate(method='linear')

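Useful for time series and numeric sequences. Linear interpolation treats rows as equally spaced; for irregular timestamps, use method='time' on data with a DatetimeIndex.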
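
A minimal sketch comparing linear and time-based interpolation. The values and dates are hypothetical, chosen so the two methods give visibly different results.

```python
import numpy as np
import pandas as pd

# Evenly spaced sequence: linear interpolation fills in equal steps.
s = pd.Series([10.0, np.nan, np.nan, 40.0])
linear = s.interpolate(method="linear")
print(linear.tolist())

# Irregular timestamps: the gap sits 1 day after the first point
# and 2 days before the last one.
ts = pd.Series(
    [0.0, np.nan, 30.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"]),
)

# method='time' weights the estimate by elapsed time (needs a DatetimeIndex).
by_time = ts.interpolate(method="time")

# method='linear' ignores the index and places the estimate halfway.
by_position = ts.interpolate(method="linear")

print(by_time.iloc[1], by_position.iloc[1])
```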


🧹 Dropping Missing Data

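Drop rows or columns:

df.dropna()        # drop rows that contain any NaN
df.dropna(axis=1)  # drop columns that contain any NaN

Threshold control (keep only rows with at least 2 non-NaN values):

df.dropna(thresh=2)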
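
To see how the three calls differ, here is a sketch on a small hypothetical DataFrame in which each row has a different number of gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, rows 1 and 2 have one gap each,
# row 3 has two gaps. Column 'c' has no gaps.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Keep only rows with no missing values.
rows_complete = df.dropna()

# Keep only columns with no missing values.
cols_complete = df.dropna(axis=1)

# Keep rows that have at least 2 non-missing values.
rows_mostly_complete = df.dropna(thresh=2)

print(rows_complete.index.tolist())
print(cols_complete.columns.tolist())
print(rows_mostly_complete.index.tolist())
```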

➕ Calculations with Missing Data

By default, Pandas ignores NaNs in aggregations:

df['sales'].mean()
df['sales'].sum()

Set skipna=False to propagate missing values, so the result is NaN whenever any value is missing:

df['sales'].sum(skipna=False)
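
The sketch below, on a hypothetical sales Series, shows the default behavior, the effect of skipna=False, and one edge case: the sum of an all-NaN Series is 0 unless min_count is set.

```python
import numpy as np
import pandas as pd

# Hypothetical sales figures with one missing entry.
sales = pd.Series([100.0, np.nan, 50.0])

# Default: NaN is skipped, so only the two known values are used.
mean_default = sales.mean()
sum_default = sales.sum()

# skipna=False: a single NaN makes the whole result NaN.
sum_strict = sales.sum(skipna=False)

# Edge case: with nothing but NaN, sum() returns 0.0 by default.
all_missing = pd.Series([np.nan, np.nan])
sum_empty = all_missing.sum()

# min_count=1 requires at least one real value, otherwise the result is NaN.
sum_empty_checked = all_missing.sum(min_count=1)

print(mean_default, sum_default, sum_strict, sum_empty, sum_empty_checked)
```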

🔁 Handling Duplicated Data

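Flag duplicate rows (returns a boolean Series in which the first occurrence of each row is marked False and later repeats are marked True):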

df.duplicated()

Remove them:

df.drop_duplicates(inplace=True)

🔍 Detecting & Dropping Duplicates

Deduplicate on selected columns only:

df.drop_duplicates(subset=['email'])

Choose which occurrence to keep with keep='first' (the default), keep='last', or keep=False to drop every duplicated row:

df.drop_duplicates(keep='last')
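
A minimal sketch of subset and keep working together, assuming a hypothetical sign-up table where the same email appears twice with different dates.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 share an email but differ in signup date.
df = pd.DataFrame({
    "email": ["a@example.com", "a@example.com", "b@example.com"],
    "signup": ["2024-01-01", "2024-03-01", "2024-02-01"],
})

# Compare on 'email' only and keep the first occurrence (the default).
keep_first = df.drop_duplicates(subset=["email"])

# Keep the last occurrence instead.
keep_last = df.drop_duplicates(subset=["email"], keep="last")

# Drop every row whose email appears more than once.
unique_only = df.drop_duplicates(subset=["email"], keep=False)

print(keep_first.index.tolist())
print(keep_last.index.tolist())
print(unique_only.index.tolist())
```

Because keep depends on row order, sorting first (for example by signup date) determines which record survives.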

🧮 Counting Unique Values

df['city'].nunique()
df['city'].value_counts()

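nunique() returns the number of distinct values (NaN is excluded by default), and value_counts() returns how often each one occurs. Together they reveal a column's cardinality and surface inconsistent labels, such as the same category spelled with different casing or stray whitespace.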
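
As an illustration, the sketch below uses a hypothetical city column with inconsistent spellings. The strip-and-title-case normalization is one possible cleanup, not the only one.

```python
import pandas as pd

# Hypothetical labels: three spellings of the same city, plus a missing value.
df = pd.DataFrame({"city": ["Paris", "paris", "Paris ", "Lima", None]})

# Distinct values before cleaning (None is not counted).
raw_unique = df["city"].nunique()
print(df["city"].value_counts())

# Normalize whitespace and casing, then count again.
cleaned = df["city"].str.strip().str.title()
clean_unique = cleaned.nunique()

# dropna=False also reports how many values are missing.
counts = cleaned.value_counts(dropna=False)
print(counts)
```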


🧭 Managing Duplicated Labels

Detect duplicate columns:

df.columns[df.columns.duplicated()]

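Use df.loc[:, ~df.columns.duplicated()] to remove them, keeping the first occurrence of each label. For duplicated row labels, apply the same pattern with df.index.duplicated().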
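
A short sketch of both cases, using hypothetical frames built with a repeated column name and a repeated index label:

```python
import pandas as pd

# Hypothetical frame where the column label 'id' appears twice.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["id", "value", "id"])

# Labels that repeat an earlier column label.
dup_cols = df.columns[df.columns.duplicated()]
print(dup_cols.tolist())

# Keep only the first column for each label.
deduped_cols = df.loc[:, ~df.columns.duplicated()]

# Same idea for rows: the index label 'a' appears twice.
df_rows = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "a"])
deduped_rows = df_rows[~df_rows.index.duplicated(keep="first")]

print(deduped_cols)
print(deduped_rows)
```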


📌 Summary – Recap & Next Steps

Data cleaning is a critical step in every data science pipeline. Pandas provides flexible methods to identify and fix missing values, wrong formats, and duplicates, ensuring your data is clean and reliable before analysis begins.

🔍 Key Takeaways:

  • Use .isnull(), .fillna(), .dropna() for missing data
  • Convert wrong formats using to_datetime() and to_numeric()
  • Remove duplicates using duplicated() and drop_duplicates()
  • Use interpolation for smart filling of time-series gaps

⚙️ Real-World Relevance:
Clean data leads to accurate insights. Mastering Pandas preprocessing tools helps analysts, data scientists, and engineers make reliable decisions with confidence.


❓ FAQ – Pandas Data Cleaning

❓ What’s the best method to handle missing values?

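✅ It depends on the data. Use .fillna() to replace missing values with the mean, median, or another sensible default when you need to keep every row. Use .dropna() to remove them when the affected rows or columns are few or unusable.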


❓ How can I fix date columns stored as text?

✅ Use pd.to_datetime() with errors='coerce' to convert them safely.


❓ Can I undo changes after dropping rows?

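✅ Pandas has no undo. Without inplace=True, methods such as dropna() return a new DataFrame and leave the original untouched, as long as you do not assign the result back to the same variable. If the original has already been overwritten, re-read the data; to be safe, keep a backup with df.copy() before modifying.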
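
A minimal sketch of the backup pattern, assuming a hypothetical one-column DataFrame:

```python
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({"x": [1.0, None, 3.0]})

# Take an independent copy before any destructive step.
backup = df.copy()

# Without inplace=True, dropna() returns a new frame and df is unchanged.
cleaned = df.dropna()
rows_before = len(df)

# With inplace=True, df itself is modified and the dropped row is gone.
df.dropna(inplace=True)
rows_after = len(df)

# The backup still holds the original rows.
print(rows_before, rows_after, len(backup))
```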


❓ What’s the difference between isnull() and isna()?

✅ They are functionally identical and can be used interchangeably.


❓ How do I remove duplicate columns?

✅ Use:

df = df.loc[:, ~df.columns.duplicated()]
