⚠️ Pandas Handling Wrong Formats – Detect and Fix Inconsistent or Invalid Data
🧲 Introduction – Why Fix Wrong Data Formats?
Inconsistent or wrong formats in a dataset—like strings in numeric columns, text dates, or mixed types—can lead to errors, broken calculations, and failed model training. With Pandas, you can detect, convert, and correct these formats to ensure your data is clean, consistent, and usable.
🎯 In this guide, you’ll learn:
- How to detect wrong data formats
- Convert strings to numbers or dates
- Handle non-convertible values using coercion
- Clean up formatting issues across entire columns
🔍 1. Detect Data Type Issues
import pandas as pd
df = pd.DataFrame({
'Price': ['100', '200', 'invalid', '300'],
'Date': ['2023-01-01', 'Feb 2, 2023', 'not a date', '2023/03/01']
})
print(df.dtypes)
👉 Output:
Price object
Date object
dtype: object
✔️ Both Price and Date should be numeric and datetime, but are currently strings (object type).
🔄 2. Convert String to Numeric with Error Handling
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
✔️ Converts Price to numbers. Any non-numeric values (like 'invalid') will be replaced with NaN.
👉 Output:
Price
0 100.0
1 200.0
2 NaN
3 300.0
📆 3. Convert String to Datetime Format
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
✔️ Converts mixed-format date strings into proper datetime objects. Invalid entries become NaT (Not a Time).
👉 Output:
Date
0 2023-01-01
1 2023-02-02
2 NaT
3 2023-03-01
🧹 4. Strip Unwanted Characters Before Conversion
df = pd.DataFrame({'Amount': ['$1,000', '$2,500', '$3,000']})
df['Amount'] = df['Amount'].str.replace('[\$,]', '', regex=True)
df['Amount'] = pd.to_numeric(df['Amount'])
✔️ Removes $ and , before converting to numeric.
👉 Output:
Amount
0 1000
1 2500
2 3000
🧱 5. Detect Non-Standard Types or Mixed Types
print(df.applymap(type))
✔️ Shows the Python type of each cell. Helps spot inconsistent types in columns.
🔁 6. Fix Categorical Values or Replace Typos
df = pd.DataFrame({'Gender': ['M', 'F', 'male', 'Female', 'MALE', 'f']})
df['Gender'] = df['Gender'].str.lower().replace({'male': 'm', 'female': 'f'})
✔️ Standardizes inconsistent strings for consistent analysis.
👉 Output:
Gender
0 m
1 f
2 m
3 f
4 m
5 f
🧠 7. Custom Function to Clean a Column
def clean_price(x):
try:
return float(x.replace('$', '').replace(',', ''))
except:
return None
df['Cleaned_Price'] = df['Price'].apply(clean_price)
✔️ Use custom logic to clean values when built-in converters aren’t enough.
📌 Summary – Key Takeaways
- Use
pd.to_numeric()andpd.to_datetime()witherrors='coerce'to handle invalid entries - Clean string formatting before conversion (e.g., symbols, commas)
- Use
.str.lower(),.replace(), and custom functions to fix categorical or messy values - Always check
dtypesand inspect mixed or wrong formats early
⚙️ Real-world relevance: Fixing formats is crucial for financial, time-series, retail, and machine learning pipelines.
❓ FAQs – Handling Wrong Formats in Pandas
❓ How do I handle strings in numeric columns?
Use:
pd.to_numeric(df['col'], errors='coerce')
❓ What happens if conversion fails with errors='raise'?
Pandas will throw an error and stop execution on the first invalid value.
❓ How to convert mixed-format date strings?
Use:
pd.to_datetime(df['date_col'], errors='coerce')
❓ Can I automatically detect and convert column types?
Use:
df.infer_objects()
✔️ Infers better types for object columns.
Share Now :
