
🧩 Pandas Categorical Data Handling – Optimize and Analyze Category-Based Columns


🧲 Introduction – Why Handle Categorical Data?

Categorical data takes a limited, usually fixed set of values (like gender, country, or product type), each representing a category. Pandas offers a dedicated categorical dtype to:

  • Improve memory efficiency
  • Enable fast comparisons and groupings
  • Enforce category ordering for logical sorting or analysis

🎯 In this guide, you’ll learn:

  • How to convert columns to the categorical dtype
  • How to create ordered categories
  • How to use the .cat accessor for manipulation
  • How to optimize storage and speed up analytics

📥 1. Create a Categorical Column

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'IT', 'HR', 'Finance', 'IT']
})

df['Department'] = df['Department'].astype('category')

✔️ Converts the Department column to categorical dtype.


💡 2. Benefits of Categorical Type

df['Department'].memory_usage(deep=True)

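✔️ A categorical column stores each distinct label once plus a small integer code per row, so it usually needs far less memory than object strings when values repeat often. With mostly unique values the saving shrinks and can disappear.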
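
To make the memory difference concrete, here is a minimal sketch that measures the same labels stored as object and as category. The repeated three-label Series is made-up illustration data, and the exact byte counts depend on your pandas version and platform, so none are quoted here.

```python
import pandas as pd

# Hypothetical data: three department labels repeated many times.
labels = pd.Series(['HR', 'IT', 'Finance'] * 10_000)

as_object = labels.astype(object)
as_category = labels.astype('category')

# deep=True also counts the memory held by the Python string objects.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

On heavily repeated data like this, the category figure should come out much smaller than the object figure.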


🧱 3. Define Ordered Categories

sizes = pd.Series(['small', 'medium', 'large', 'medium', 'small'])

sizes = sizes.astype(pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True))

✔️ Useful for ordinal data like sizes, ranks, education levels.
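
Once the order is declared, comparison operators and min/max follow it. A short sketch, reusing the same size labels; the variable names size_dtype and at_least_medium are only illustrative:

```python
import pandas as pd

size_dtype = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
sizes = pd.Series(['small', 'medium', 'large', 'medium', 'small']).astype(size_dtype)

# Comparisons follow the declared order, not alphabetical order.
at_least_medium = sizes[sizes >= 'medium']
print(at_least_medium.tolist())   # ['medium', 'large', 'medium']

# min() and max() are defined because the categories are ordered.
print(sizes.min(), sizes.max())   # small large
```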


📊 4. Sort Ordered Categories

sizes.sort_values()

✔️ Respects the logical order: small < medium < large
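
For contrast, this sketch sorts the same labels twice: once as plain strings (alphabetical) and once as an ordered categorical (logical). The names plain, alphabetical, and logical are illustrative only.

```python
import pandas as pd

plain = pd.Series(['small', 'medium', 'large', 'medium', 'small'])
size_dtype = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)

# Plain strings sort alphabetically, so 'large' comes first.
alphabetical = plain.sort_values().tolist()
print(alphabetical)  # ['large', 'medium', 'medium', 'small', 'small']

# The ordered categorical sorts by the declared category order.
logical = plain.astype(size_dtype).sort_values().tolist()
print(logical)       # ['small', 'small', 'medium', 'medium', 'large']
```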


🎯 5. Use .cat Accessor for Categorical Operations

df['Department'].cat.categories       # List of categories
df['Department'].cat.codes           # Convert to numeric codes
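
As a quick illustration of what these two attributes return, the sketch below uses the same department labels as a standalone Series. When no categories are given, pandas infers them in sorted order, and a missing value gets the code -1.

```python
import pandas as pd

dept = pd.Series(['HR', 'IT', 'HR', 'Finance', 'IT'], dtype='category')

# Inferred categories are sorted; each code is a position in that list.
print(list(dept.cat.categories))   # ['Finance', 'HR', 'IT']
print(dept.cat.codes.tolist())     # [1, 2, 1, 0, 2]

# Missing values are not a category; they are encoded as -1.
with_missing = pd.Series(['HR', None, 'IT'], dtype='category')
print(with_missing.cat.codes.tolist())  # [0, -1, 1]
```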

Add or Remove Categories

df['Department'] = df['Department'].cat.add_categories(['Marketing'])
df['Department'] = df['Department'].cat.remove_unused_categories()

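✔️ These calls change only the set of allowed categories; the values already in the column stay the same. Run back to back as shown, the second call drops 'Marketing' again because no row uses it.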
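
A sketch of why adding a category matters: assigning a label that is not a declared category is rejected, while the same assignment works after add_categories. The 'Legal' label is a hypothetical extra added only to show what remove_unused_categories drops, and the exception type is caught broadly because it has varied between pandas versions.

```python
import pandas as pd

dept = pd.Series(['HR', 'IT', 'HR'], dtype='category')

# Assigning an undeclared label is rejected.
try:
    dept.iloc[0] = 'Marketing'
    rejected = False
except (TypeError, ValueError):
    rejected = True
print(rejected)  # True

# Declare the new labels first, then the assignment is allowed.
dept = dept.cat.add_categories(['Marketing', 'Legal'])
dept.iloc[0] = 'Marketing'

# 'Legal' is declared but unused, so it is dropped here.
dept = dept.cat.remove_unused_categories()
print(list(dept.cat.categories))  # ['HR', 'IT', 'Marketing']
```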


🔍 6. Filter and Compare with Categories

df[df['Department'] == 'HR']

✔️ The comparison runs on the integer codes rather than on each string, which is usually faster than the same filter on an object column.


🧮 7. Group By Categorical Columns

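df.groupby('Department', observed=True).size()

✔️ Grouping runs on the integer codes, which is typically faster than grouping raw strings. observed=True counts only categories that actually appear; pass observed=False to also list unused categories with a count of 0.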
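
One detail worth seeing in action is the observed parameter of groupby, which only matters when a declared category has no rows. In this sketch 'Marketing' is a hypothetical unused category added for illustration:

```python
import pandas as pd

# 'Marketing' is declared as a category but never appears in the data.
dept_dtype = pd.CategoricalDtype(['Finance', 'HR', 'IT', 'Marketing'])
df = pd.DataFrame({
    'Department': pd.Series(['HR', 'IT', 'HR', 'Finance', 'IT']).astype(dept_dtype)
})

# observed=False reports every declared category, including empty ones.
all_counts = df.groupby('Department', observed=False).size()
print(all_counts.to_dict())   # {'Finance': 1, 'HR': 2, 'IT': 2, 'Marketing': 0}

# observed=True reports only categories present in the data.
seen_counts = df.groupby('Department', observed=True).size()
print(seen_counts.to_dict())  # {'Finance': 1, 'HR': 2, 'IT': 2}
```

Passing observed explicitly keeps the result the same across pandas versions, since the default has changed over time.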


🧼 8. Convert Back to Object or String

df['Department'].astype(str)

✔️ Useful if you need to export or concatenate with regular strings. astype returns a new Series, so assign the result back to df['Department'] to keep the conversion.
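
A small sketch of the concatenation case: convert to plain strings first, then append a suffix. The '_dept' suffix and the variable names are made up for illustration.

```python
import pandas as pd

dept = pd.Series(['HR', 'IT', 'Finance'], dtype='category')

# Convert to plain strings, then use ordinary string concatenation.
as_text = dept.astype(str)
labelled = as_text + '_dept'

print(labelled.tolist())  # ['HR_dept', 'IT_dept', 'Finance_dept']
print(isinstance(as_text.dtype, pd.CategoricalDtype))  # False
```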


📌 Summary – Key Takeaways

Pandas provides powerful categorical data support that boosts performance, enforces validity, and enhances logical operations on labeled columns.

🔍 Key Takeaways:

  • Use astype('category') to convert string columns
  • Define ordered categories for sorting and analysis
  • Use .cat accessor for advanced category management
  • Expect faster groupby, comparisons, and filtering on repeated values
  • Save memory on columns with repetitive strings

⚙️ Real-world relevance: Common in survey analysis, e-commerce data, demographic attributes, and machine learning preprocessing.


❓ FAQs – Handling Categorical Data in Pandas

❓ What’s the difference between object and category types?

  • object: holds references to arbitrary Python objects, typically str
  • category: stores each distinct label once plus an integer code per row (usually smaller and faster when values repeat)

❓ When should I use ordered categories?
For columns where order matters (e.g., ‘low’ < ‘medium’ < ‘high’).


❓ How do I convert numerical codes back to category labels?

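df['Department'].cat.categories[df['Department'].cat.codes]

Note that missing values have the code -1, which this lookup would map to the last category, so handle them separately if the column can contain NaN.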
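
If the column can contain missing values, a safer round trip is pd.Categorical.from_codes, which treats -1 as missing. The sketch below compares it with plain positional lookup on a small made-up Series:

```python
import pandas as pd

dept = pd.Series(['HR', None, 'IT'], dtype='category')
codes = dept.cat.codes.to_numpy()      # array([ 0, -1,  1], dtype=int8)
categories = dept.cat.categories       # Index(['HR', 'IT'])

# Positional lookup treats -1 as "last element", so the missing value
# is silently turned into 'IT'.
naive = categories[codes].tolist()
print(naive)  # ['HR', 'IT', 'IT']

# from_codes interprets -1 as missing and keeps it that way.
restored = pd.Series(pd.Categorical.from_codes(codes, categories=categories))
print(restored.isna().tolist())  # [False, True, False]
```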

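❓ Can I apply .str methods on categorical columns?
Yes, as long as the categories themselves are strings. The method is applied to the categories and the result comes back as a regular (non-categorical) Series. Converting explicitly first gives the same values and makes the intent clear:

df['Department'].astype(str).str.upper()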

❓ Do categorical types improve performance?
✅ Yes, mainly for columns with many repeated strings: grouping and comparisons tend to be faster, and memory usage drops.

