4️⃣ 🧹 Pandas Data Cleaning & Preprocessing

🧼 Pandas Data Cleaning Overview – Prepare Clean and Reliable Data for Analysis


🧲 Introduction – Why Clean Data in Pandas?

Real-world datasets often contain missing values, duplicates, inconsistent formats, and incorrect types. Pandas provides robust tools for cleaning and preparing this data, enabling you to perform accurate and meaningful analysis.


1️⃣ Handling Missing Data (NaN)

🔍 Detect Missing Values

df.isnull()

✔️ Returns a DataFrame of the same shape where each value is True if missing and False if not.

df.isnull().sum()

✔️ Sums True values column-wise to show how many missing entries are in each column.
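As a minimal sketch of the two detection calls above, the following builds a small DataFrame and counts its missing values. The sample data and column names are made up for illustration and are not from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names are illustrative only.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", None, "Dee"],
    "Age": [28, np.nan, 35, np.nan],
})

missing_mask = df.isnull()               # Boolean DataFrame, same shape as df
missing_per_column = missing_mask.sum()  # number of missing values per column
missing_share = missing_mask.mean()      # fraction of missing values per column

print(missing_per_column)
print(missing_share)
```

Taking the mean of the Boolean mask is a handy variation: it reports the share of missing values rather than the raw count, which is easier to compare across columns.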


🧹 Remove Missing Values

df.dropna()

✔️ Drops all rows that contain any missing value.

df.dropna(axis=1)

✔️ Drops entire columns that contain any missing values.
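The sketch below compares row-wise and column-wise dropping on a small, made-up DataFrame. It also shows the `how` and `thresh` parameters of `dropna()`, which are not covered above but are useful when dropping on any single missing value is too aggressive.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, 6.0],
    "C": [7.0, 8.0, 9.0],
})

rows_complete = df.dropna()            # keep rows with no missing values
cols_complete = df.dropna(axis=1)      # keep columns with no missing values
rows_not_empty = df.dropna(how="all")  # drop a row only if every value is missing
rows_min_two = df.dropna(thresh=2)     # keep rows with at least 2 non-missing values

print(rows_complete)
print(list(cols_complete.columns))
```

None of these calls modify `df`; each returns a new DataFrame.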


🩹 Fill Missing Values

df.fillna(0)

✔️ Replaces all NaN entries with the value 0.

df.ffill()

✔️ Uses forward fill to propagate the last valid value forward. Older tutorials write df.fillna(method='ffill'), but the method argument of fillna() is deprecated since pandas 2.1.

df.bfill()

✔️ Uses backward fill to replace each missing value with the next valid value.

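df['col'] = df['col'].fillna(df['col'].mean())

✔️ Fills NaN in a specific column with that column's mean and assigns the result back. Calling fillna(..., inplace=True) on df['col'] is unreliable because the selected column may be a copy, and recent pandas versions warn about it.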
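Here is a small sketch that puts the fill strategies side by side on one column. It uses the `ffill()` and `bfill()` methods, which perform the forward and backward fills described above; the data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample column with two gaps.
df = pd.DataFrame({"col": [1.0, np.nan, 3.0, np.nan]})

filled_zero = df.fillna(0)    # every NaN becomes 0
filled_forward = df.ffill()   # each NaN takes the previous valid value
filled_backward = df.bfill()  # each NaN takes the next valid value;
                              # the trailing NaN stays because nothing follows it

# Fill with the column mean and assign back (mean() skips NaN by default).
df["col"] = df["col"].fillna(df["col"].mean())

print(filled_forward)
print(filled_backward)
print(df)
```

Note that a fill can leave gaps behind: backward fill cannot fill a trailing NaN, and forward fill cannot fill a leading one.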


2️⃣ Removing Duplicates

df.duplicated()

✔️ Returns a Boolean Series that marks each repeated row as True. By default, the first occurrence of a row is marked False.

df.drop_duplicates(inplace=True)

✔️ Removes duplicate rows from the DataFrame in place, keeping the first occurrence of each.

df.drop_duplicates(subset=['Name', 'Email'])

✔️ Treats rows as duplicates only when both Name and Email match, and returns a new DataFrame that keeps the first row of each group.
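To make the duplicate handling concrete, here is a minimal sketch on made-up contact data. The `keep` parameter controls which copy counts as the original; the names and addresses are placeholders.

```python
import pandas as pd

# Hypothetical contact list; rows 0 and 1 are exact copies.
df = pd.DataFrame({
    "Name": ["Ana", "Ana", "Ben", "Ana"],
    "Email": ["ana@example.com", "ana@example.com",
              "ben@example.com", "ana@work.example"],
})

dup_mask = df.duplicated()              # True only for the second and later copies
all_copies = df.duplicated(keep=False)  # True for every row that has a twin

# Keep the first row of each (Name, Email) pair and renumber the index.
deduped = df.drop_duplicates(subset=["Name", "Email"], keep="first")
deduped = deduped.reset_index(drop=True)

print(dup_mask.tolist())
print(deduped)
```

The last row survives because its Email differs, even though the Name repeats.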


3️⃣ Fixing Data Types

df['Age'] = df['Age'].astype(int)

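✔️ Converts the Age column to integer type. This raises an error if the column contains NaN or values that cannot be read as integers, so handle missing values first.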

df['Date'] = pd.to_datetime(df['Date'])

✔️ Converts the Date column to Pandas datetime format.

df['Price'] = pd.to_numeric(df['Price'], errors='coerce')

✔️ Converts the column to a numeric type and replaces any value that cannot be parsed with NaN.
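The following sketch applies the three conversions to a small DataFrame that deliberately contains bad values. Two choices here are assumptions rather than part of the text above: `errors="coerce"` is also passed to `pd.to_datetime`, and the nullable `"Int64"` dtype is used so that Age can stay an integer column even with a missing entry.

```python
import pandas as pd

# Hypothetical raw data as it might arrive from a CSV: everything is text.
df = pd.DataFrame({
    "Age": ["25", "31", None],
    "Date": ["2024-01-15", "2024-02-01", "not a date"],
    "Price": ["19.99", "N/A", "5"],
})

# Unparseable prices become NaN instead of raising an error.
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")

# Unparseable dates become NaT.
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# Plain astype(int) would fail on the missing age; the nullable
# "Int64" dtype keeps integers and stores the gap as <NA>.
df["Age"] = pd.to_numeric(df["Age"], errors="coerce").astype("Int64")

print(df.dtypes)
print(df)
```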


4️⃣ Standardizing Column Names

df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

✔️ Strips whitespace, converts to lowercase, and replaces spaces with underscores in all column names.
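A quick sketch of that one-liner on some messy, made-up headers. Passing `regex=False` is an optional addition that makes it explicit the space is matched literally.

```python
import pandas as pd

# Hypothetical messy headers: stray spaces and mixed case.
df = pd.DataFrame(columns=[" First Name ", "Email Address", "AGE"])

df.columns = (
    df.columns
    .str.strip()                          # remove leading/trailing spaces
    .str.lower()                          # normalize case
    .str.replace(" ", "_", regex=False)   # inner spaces become underscores
)

print(list(df.columns))
```

The order matters: stripping first prevents leading or trailing spaces from turning into stray underscores.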


5️⃣ Cleaning String/Text Columns

df['Name'] = df['Name'].str.strip()

✔️ Removes leading and trailing whitespace.

df['City'] = df['City'].str.title()

✔️ Capitalizes the first letter of each word in the City column.

df['Email'] = df['Email'].str.lower()

✔️ Converts all email addresses to lowercase.
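The three text fixes above can be tried together on a small example. The names, cities, and addresses below are placeholders.

```python
import pandas as pd

# Hypothetical records with inconsistent spacing and casing.
df = pd.DataFrame({
    "Name": ["  Ana ", "Ben  "],
    "City": ["new york", "SAN JOSE"],
    "Email": ["Ana@Example.COM", "BEN@example.com"],
})

df["Name"] = df["Name"].str.strip()    # trim surrounding whitespace
df["City"] = df["City"].str.title()    # "new york" -> "New York"
df["Email"] = df["Email"].str.lower()  # normalize for comparisons

print(df)
```

Normalizing text like this before calling `drop_duplicates()` helps catch rows that differ only by spacing or case.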


6️⃣ Removing Outliers (Basic Example)

df = df[df['Score'] < 100]

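✔️ Keeps only the rows where Score is less than 100. The cutoff is a fixed threshold chosen for this example; pick one that suits your data. Rows where Score is NaN are dropped as well, because the comparison evaluates to False for them.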
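A fixed cutoff works when you already know the valid range. When you do not, one common alternative is the interquartile range (IQR) rule. The sketch below is a swapped-in technique, not the method shown above, and both the 1.5 multiplier and the sample scores are assumptions for illustration.

```python
import pandas as pd

# Hypothetical scores with one obvious extreme value.
df = pd.DataFrame({"Score": [52, 55, 58, 60, 61, 63, 65, 250]})

q1 = df["Score"].quantile(0.25)
q3 = df["Score"].quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond the quartiles (an assumed multiplier).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# between() is inclusive on both ends by default.
filtered = df[df["Score"].between(lower, upper)]

print(lower, upper)
print(filtered)
```

Whichever rule you use, inspect the removed rows before discarding them, since an extreme value is not always an error.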


7️⃣ Renaming Columns

df.rename(columns={'old_name': 'new_name'}, inplace=True)

✔️ Renames one or more columns by passing a dictionary to the columns parameter.
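Below is a short sketch of renaming without `inplace=True`, which returns a new DataFrame and leaves the original untouched. It also shows the `errors="raise"` option, an addition not covered above, which turns a misspelled column name into an error instead of a silent no-op. The column names are placeholders.

```python
import pandas as pd

# Hypothetical DataFrame with a column we want to rename.
df = pd.DataFrame({"old_name": [1, 2], "other": [3, 4]})

# Returns a renamed copy; df itself keeps its original column names.
renamed = df.rename(columns={"old_name": "new_name"})

# By default, keys that match no column are ignored silently.
# errors="raise" makes such a typo fail with a KeyError.
rename_error = None
try:
    df.rename(columns={"missing_col": "x"}, errors="raise")
except KeyError as exc:
    rename_error = str(exc)

print(list(renamed.columns))
print(rename_error)
```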


8️⃣ Replacing Values

df['Gender'] = df['Gender'].replace({'M': 'Male', 'F': 'Female'})

✔️ Replaces shorthand gender values (M, F) with full words (Male, Female) and assigns the result back to the column. Values not listed in the dictionary are left unchanged.
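A minimal sketch of dictionary-based replacement, with `Series.map()` shown alongside for contrast. The comparison with `map()` is an addition for illustration, and the sample values are made up.

```python
import pandas as pd

# Hypothetical column with shorthand codes and one unexpected value.
df = pd.DataFrame({"Gender": ["M", "F", "M", "Unknown"]})
codes = {"M": "Male", "F": "Female"}

# replace() swaps listed values and leaves everything else as it was.
replaced = df["Gender"].replace(codes)

# map() also translates listed values, but unlisted ones become NaN.
mapped = df["Gender"].map(codes)

df["Gender"] = replaced

print(replaced.tolist())
print(mapped.tolist())
```

Use `replace()` when unlisted values should pass through, and `map()` when an unlisted value should surface as missing so you can catch it.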


🔍 Summary – Key Takeaways

  • Detect with isnull(), handle with fillna(), or remove with dropna()
  • Fix column names, data types, and text inconsistencies
  • Drop duplicates and filter outliers
  • Clean data makes your models and reports more trustworthy ✅

❓ FAQs – Explained with Code

❓ How do I clean all string columns in one go?

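df = df.map(lambda x: x.strip() if isinstance(x, str) else x)

✔️ Applies strip() to every string value in the DataFrame and leaves non-string values unchanged. DataFrame.map requires pandas 2.1 or later; on older versions the same method is named applymap(), which is now deprecated.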
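Because the element-wise DataFrame method has been renamed across pandas versions, here is a sketch of an equivalent approach that relies only on `DataFrame.apply` and `Series.map`. The helper name `strip_if_str` and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical mixed-type data: text, numbers, and a missing value.
df = pd.DataFrame({
    "Name": ["  Ana ", "Ben"],
    "Age": [28, 35],
    "City": [" Oslo", None],
})


def strip_if_str(value):
    """Strip whitespace from strings; return anything else unchanged."""
    return value.strip() if isinstance(value, str) else value


# apply() visits each column; Series.map() visits each value in it.
df = df.apply(lambda col: col.map(strip_if_str))

print(df)
```

The `isinstance` check is what keeps numbers and missing values from raising an error, since they have no `strip()` method.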


❓ How do I drop rows where specific columns are missing?

df.dropna(subset=['Email', 'Name'])

✔️ Drops a row only when Email or Name is missing; missing values in other columns are ignored. The result is returned as a new DataFrame.
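The sketch below shows how `subset` narrows the check to the listed columns, and how adding `how="all"` changes "any of these is missing" to "all of these are missing". The contact data is invented for illustration.

```python
import pandas as pd

# Hypothetical contacts; each row is missing a different field.
df = pd.DataFrame({
    "Name": ["Ana", None, "Cy"],
    "Email": ["ana@example.com", "ben@example.com", None],
    "Phone": [None, "555-0100", "555-0101"],
})

# Drop rows missing Email OR Name; the missing Phone in row 0 is ignored.
required = df.dropna(subset=["Email", "Name"])

# Drop rows only when Email AND Name are both missing (none here).
either = df.dropna(subset=["Email", "Name"], how="all")

print(required)
print(len(either))
```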


❓ How do I chain multiple string operations?

df['Name'] = df['Name'].str.strip().str.lower().str.replace('admin', 'administrator')

✔️ Cleans and standardizes text entries by chaining string methods; each .str call returns a new Series that the next call operates on.
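Here is a small sketch of that chain on made-up names. It also demonstrates a pitfall worth knowing: `str.replace()` matches substrings, so running this particular replacement a second time alters values that were already expanded.

```python
import pandas as pd

# Hypothetical names with inconsistent spacing and casing.
df = pd.DataFrame({"Name": ["  Admin ", "ALICE", "admin"]})

df["Name"] = (
    df["Name"]
    .str.strip()
    .str.lower()
    .str.replace("admin", "administrator", regex=False)
)

# Running the same substring replacement again is not safe here,
# because "administrator" itself contains "admin".
rerun = df["Name"].str.replace("admin", "administrator", regex=False)

print(df["Name"].tolist())
print(rerun.tolist())
```

If you need to swap whole values rather than substrings, `Series.replace()` with a dictionary is a safer fit, since it matches entire cell values.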


❓ How do I safely clean a DataFrame?

df_clean = df.copy()

✔️ Always make a copy before permanent operations like dropping data.
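A minimal sketch of the copy-first habit: the cleaning steps run on `df_clean`, and the original `df` remains available for comparison. The sample data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one duplicate row and one incomplete row.
df = pd.DataFrame({
    "Name": ["Ana", "Ana", None],
    "Score": [90.0, 90.0, np.nan],
})

df_clean = df.copy()                   # independent copy of the data

df_clean.drop_duplicates(inplace=True) # affects only the copy
df_clean.dropna(inplace=True)          # affects only the copy

print(len(df), "rows before cleaning")
print(len(df_clean), "rows after cleaning")
```

Keeping the original around makes it easy to check how many rows each step removed and to start over if a step turns out to be too aggressive.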

