🧼 Pandas Data Cleaning Overview – Prepare Clean and Reliable Data for Analysis
🧲 Introduction – Why Clean Data in Pandas?
Real-world datasets often contain missing values, duplicates, inconsistent formats, and incorrect types. Pandas provides robust tools for cleaning and preparing this data, enabling you to perform accurate and meaningful analysis.
1️⃣ Handling Missing Data (NaN)
🔍 Detect Missing Values
df.isnull()
✔️ Returns a DataFrame of the same shape where each value is True if missing and False if not.
df.isnull().sum()
✔️ Sums True values column-wise to show how many missing entries are in each column.
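A minimal sketch of both detection calls, assuming a small made-up DataFrame (the column names and values here are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({
    "Name": ["Ana", "Ben", None, "Dee"],
    "Age": [34, np.nan, 29, np.nan],
})

mask = df.isnull()               # Boolean DataFrame with the same shape as df
missing_per_column = mask.sum()  # number of True values in each column

print(missing_per_column)
```

Here Name has one missing entry and Age has two, so the per-column counts make it easy to see where cleaning effort is needed.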
🧹 Remove Missing Values
df.dropna()
✔️ Drops all rows that contain any missing value.
df.dropna(axis=1)
✔️ Drops entire columns that contain any missing values.
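To see the difference between dropping rows and dropping columns, here is a short sketch on a made-up DataFrame (names and values are assumptions for illustration). Both calls return a new DataFrame and leave the original untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: only Ben's Age is missing
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Age": [34, np.nan, 29],
    "City": ["Lima", "Oslo", "Pune"],
})

rows_kept = df.dropna()        # removes the row for Ben
cols_kept = df.dropna(axis=1)  # removes the whole Age column instead

print(rows_kept)
print(cols_kept)
```

Dropping a column because of a single missing value discards a lot of data, so it is worth checking the missing counts before choosing axis=1.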
🩹 Fill Missing Values
df.fillna(0)
✔️ Replaces all NaN entries with the value 0.
df.ffill()
✔️ Forward fill: propagates the last valid value forward. Older code writes this as df.fillna(method='ffill'), but the method argument is deprecated since pandas 2.1.
df.bfill()
✔️ Backward fill: fills each missing value with the next valid value.
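df['col'] = df['col'].fillna(df['col'].mean())
✔️ Fills NaN in a specific column with its mean value and assigns the result back to the column. Calling fillna(..., inplace=True) on df['col'] is unreliable because it may modify a temporary copy instead of the DataFrame.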
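The fill strategies can be compared side by side. This is a sketch on a made-up temp column (an assumption for illustration), using the ffill() and bfill() methods for forward and backward fill:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a gap of two missing readings
df = pd.DataFrame({"temp": [20.0, np.nan, np.nan, 26.0]})

zero_filled = df.fillna(0)  # NaN -> 0
forward = df.ffill()        # NaN -> last valid value before the gap (20.0)
backward = df.bfill()       # NaN -> next valid value after the gap (26.0)

# Mean fill: mean() skips NaN, so the mean of 20.0 and 26.0 is 23.0
mean_filled = df["temp"].fillna(df["temp"].mean())

print(forward["temp"].tolist())   # [20.0, 20.0, 20.0, 26.0]
print(backward["temp"].tolist())  # [20.0, 26.0, 26.0, 26.0]
print(mean_filled.tolist())       # [20.0, 23.0, 23.0, 26.0]
```

Which strategy fits depends on the data: forward fill suits ordered readings, while a mean fill ignores order entirely.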
2️⃣ Removing Duplicates
df.duplicated()
✔️ Returns a Boolean Series that marks each row repeating an earlier row as True; the first occurrence stays False.
df.drop_duplicates(inplace=True)
✔️ Removes duplicate rows from the DataFrame in place, keeping the first occurrence of each.
df.drop_duplicates(subset=['Name', 'Email'])
✔️ Removes duplicates only where both Name and Email are repeated, ignoring the other columns. Returns a new DataFrame.
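The sketch below, on made-up data (the Plan column is an assumption added for illustration), shows how full-row duplicates differ from duplicates on a subset of columns:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Name": ["Ana", "Ana", "Ana", "Ben"],
    "Email": ["a@x.com", "a@x.com", "a@x.com", "b@x.com"],
    "Plan": ["free", "free", "pro", "pro"],
})

dup_mask = df.duplicated()       # only row 1 repeats an earlier row in full
deduped = df.drop_duplicates()   # drops row 1, keeps rows 0, 2, 3
by_key = df.drop_duplicates(subset=["Name", "Email"])  # keeps rows 0 and 3

print(dup_mask.tolist())  # [False, True, False, False]
print(by_key)
```

Row 2 differs from row 0 only in Plan, so it survives a full-row check but is removed when duplicates are judged on Name and Email alone.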
3️⃣ Fixing Data Types
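df['Age'] = df['Age'].astype(int)
✔️ Converts the Age column to integer type. This raises an error if the column contains NaN, so fill or drop missing values first, or use the nullable 'Int64' dtype.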
df['Date'] = pd.to_datetime(df['Date'])
✔️ Converts the Date column to Pandas datetime format.
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
✔️ Converts to numeric type and replaces values that cannot be parsed with NaN.
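Here is a sketch of the three conversions on made-up string data (all values are assumptions for illustration). Because Age contains a missing value, the example uses the nullable 'Int64' dtype rather than plain int:

```python
import pandas as pd

# Hypothetical sample data: everything arrives as text
df = pd.DataFrame({
    "Age": ["34", "29", None],
    "Date": ["2024-01-15", "2024-02-01", "2024-03-10"],
    "Price": ["19.99", "N/A", "5"],
})

# "N/A" cannot be parsed, so errors="coerce" turns it into NaN
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")

# ISO-formatted strings parse directly to datetime64 values
df["Date"] = pd.to_datetime(df["Date"])

# Parse to numbers first, then use nullable Int64 to keep the missing value
df["Age"] = pd.to_numeric(df["Age"], errors="coerce").astype("Int64")

print(df.dtypes)
```

After conversion, numeric and date operations such as df["Price"].mean() or df["Date"].dt.year work as expected.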
4️⃣ Standardizing Column Names
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
✔️ Strips whitespace, converts to lowercase, and replaces spaces with underscores in all column names.
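A quick sketch with made-up, messy headers (assumed for illustration) shows the effect of the chained call:

```python
import pandas as pd

# Hypothetical messy headers on an otherwise empty DataFrame
df = pd.DataFrame(columns=[" First Name", "Email Address ", "AGE"])

df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

print(list(df.columns))  # ['first_name', 'email_address', 'age']
```

The order matters: stripping first prevents leading or trailing spaces from turning into stray underscores.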
5️⃣ Cleaning String/Text Columns
df['Name'] = df['Name'].str.strip()
✔️ Removes leading and trailing whitespace.
df['City'] = df['City'].str.title()
✔️ Capitalizes the first letter of each word in the City column.
df['Email'] = df['Email'].str.lower()
✔️ Converts all email addresses to lowercase.
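The three text fixes can be tried together on a small made-up DataFrame (the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample data with inconsistent spacing and casing
df = pd.DataFrame({
    "Name": ["  Ana ", "Ben  "],
    "City": ["new york", "SAN jose"],
    "Email": ["Ana@Example.COM", "BEN@example.com"],
})

df["Name"] = df["Name"].str.strip()    # trim surrounding whitespace
df["City"] = df["City"].str.title()    # "SAN jose" -> "San Jose"
df["Email"] = df["Email"].str.lower()  # normalize case for matching

print(df)
```

Normalizing text like this before deduplicating helps, since " Ana " and "Ana" would otherwise count as different values.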
6️⃣ Removing Outliers (Basic Example)
df = df[df['Score'] < 100]
✔️ Keeps only the rows where Score is less than 100, a simple fixed-threshold filter for extreme values.
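When no fixed cutoff is known in advance, one common alternative (not the threshold approach above) is the 1.5 × IQR rule. The sketch below assumes a made-up Score column and treats anything outside the computed bounds as an outlier:

```python
import pandas as pd

# Hypothetical sample data with one extreme value
df = pd.DataFrame({"Score": [52, 55, 58, 60, 61, 63, 65, 400]})

q1 = df["Score"].quantile(0.25)
q3 = df["Score"].quantile(0.75)
iqr = q3 - q1  # interquartile range

# Conventional bounds: 1.5 * IQR below Q1 and above Q3
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

filtered = df[df["Score"].between(lower, upper)]

print(filtered["Score"].tolist())
```

The 1.5 multiplier is a convention, not a rule of pandas; whether a flagged value is a genuine error or a valid extreme still needs a judgment call.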
7️⃣ Renaming Columns
df.rename(columns={'old_name': 'new_name'}, inplace=True)
✔️ Renames one or more columns by passing a dictionary to the columns parameter.
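A small sketch on made-up columns (names assumed for illustration), using the non-inplace form that returns a new DataFrame:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"old_name": [1, 2], "other": [3, 4]})

# Keys that do not match any column ("missing" here) are ignored by default
renamed = df.rename(columns={"old_name": "new_name", "missing": "x"})

print(list(renamed.columns))  # ['new_name', 'other']
```

Because unmatched keys are skipped silently, a typo in a column name will not raise an error; passing errors="raise" to rename makes such mistakes visible.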
8️⃣ Replacing Values
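df['Gender'] = df['Gender'].replace({'M': 'Male', 'F': 'Female'})
✔️ Replaces shorthand gender values (M, F) with full words (Male, Female) and assigns the result back to the column. Calling replace(..., inplace=True) on df['Gender'] is unreliable because it may modify a temporary copy instead of the DataFrame.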
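A sketch of the mapping on made-up data (the "Other" value is an assumption added to show what happens to unmapped entries):

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"Gender": ["M", "F", "F", "Other"]})

# Whole values are matched; anything not in the dictionary is left unchanged
df["Gender"] = df["Gender"].replace({"M": "Male", "F": "Female"})

print(df["Gender"].tolist())  # ['Male', 'Female', 'Female', 'Other']
```

Unlike str.replace, this matches entire cell values rather than substrings, so "Female" would not be affected by the "F" key on a second run.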
🔍 Summary – Key Takeaways
- Detect with isnull(), handle with fillna(), or remove with dropna()
- Fix column names, data types, and text inconsistencies
- Drop duplicates and filter outliers
- Clean data makes your models and reports more trustworthy ✅
❓ FAQs – Explained with Code
❓ How do I clean all string columns in one go?
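df = df.map(lambda x: x.strip() if isinstance(x, str) else x)
✔️ Applies strip() to every string in the DataFrame and leaves non-string values untouched. DataFrame.map requires pandas 2.1 or later; on older versions use applymap, which is deprecated in newer releases.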
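If the same code has to run on several pandas versions, one option is to apply Series.map column by column. The helper name strip_strings and the sample data below are assumptions for illustration:

```python
import pandas as pd


def strip_strings(frame):
    """Return a copy of frame with whitespace stripped from every string cell."""
    def strip_cell(value):
        # Only strings are stripped; numbers and missing values pass through
        return value.strip() if isinstance(value, str) else value

    return frame.apply(lambda column: column.map(strip_cell))


# Hypothetical sample data
df = pd.DataFrame({"Name": [" Ana ", None], "Age": [1, 2]})
cleaned = strip_strings(df)

print(cleaned)
```

The original DataFrame is not modified, so the cleaned result can be compared against the raw input.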
❓ How do I drop rows where specific columns are missing?
df.dropna(subset=['Email', 'Name'])
✔️ Drops a row only if Email or Name is missing; missing values in other columns are ignored.
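The sketch below uses made-up contact data (the Phone column is an assumption for illustration) to show the default behaviour alongside how="all", which drops a row only when every listed column is missing:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Name": ["Ana", None, "Cy"],
    "Email": ["a@x.com", "b@x.com", None],
    "Phone": [None, "555-0100", "555-0101"],
})

# Default how="any": drop the row if Email OR Name is missing
required = df.dropna(subset=["Email", "Name"])

# how="all": drop the row only if Email AND Name are both missing
lenient = df.dropna(subset=["Email", "Name"], how="all")

print(required)
print(lenient)
```

Only the first row has both Name and Email, so it is the one kept by the default call even though its Phone is missing.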
❓ How do I chain multiple string operations?
df['Name'] = df['Name'].str.strip().str.lower().str.replace('admin', 'administrator')
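✔️ Cleans and standardizes text entries by chaining string methods. Note that str.replace matches substrings, so a value that already reads 'administrator' would be changed as well.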
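Because str.replace works on substrings, the chained call can change values it was not meant to. The sketch below, on made-up names, contrasts the plain replacement with a regex that uses word boundaries (one possible fix, offered as an assumption rather than the only approach):

```python
import pandas as pd

# Hypothetical sample data
names = pd.Series(["  Admin ", "administrator", "site admin"])

normalized = names.str.strip().str.lower()

# Plain substring replacement also rewrites the "admin" inside "administrator"
naive = normalized.str.replace("admin", "administrator", regex=False)

# \b marks a word boundary, so only the standalone word "admin" is replaced
safe = normalized.str.replace(r"\badmin\b", "administrator", regex=True)

print(naive.tolist())  # ['administrator', 'administratoristrator', 'site administrator']
print(safe.tolist())   # ['administrator', 'administrator', 'site administrator']
```

Passing regex explicitly avoids surprises, since the default for this argument has differed between pandas versions.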
❓ How do I safely clean a DataFrame?
df_clean = df.copy()
✔️ Always make a copy before permanent operations like dropping data.
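A minimal sketch of this pattern, assuming a made-up raw DataFrame: all cleaning happens on the copy, so the raw data stays available for comparison or a second attempt.

```python
import pandas as pd

# Hypothetical raw data
raw = pd.DataFrame({
    "Name": [" Ana ", " Ana ", None],
    "Score": [90, 90, 75],
})

df_clean = raw.copy()  # independent copy; raw is not affected by later steps

df_clean["Name"] = df_clean["Name"].str.strip()
df_clean = df_clean.dropna(subset=["Name"]).drop_duplicates()

print(len(raw), len(df_clean))  # 3 1
```

Comparing len(raw) with len(df_clean) is a quick sanity check on how many rows the cleaning steps removed.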