

🔗 Pandas Correlation Analysis – Discover Relationships Between Variables

🧲 Introduction – What Is Correlation in Pandas?

Correlation analysis measures how strongly two variables are related. In Pandas, .corr() quantifies the linear relationship between numeric columns (Pearson) or the monotonic relationship between them (Spearman, Kendall). This is essential in EDA (Exploratory Data Analysis), feature selection, and predictive modeling.

🎯 In this guide, you’ll learn how to:

Use .corr() to find correlations between columns
Understand Pearson, Kendall, and Spearman methods
Visualize correlation matrices
Handle missing values and outliers properly

📥 1. Create a Sample DataFrame

import pandas as pd

df = pd.DataFrame({
    'Math': [90, 85, 88, 92, 75],
    'Science': [93, 80, 86, 95, 70],
    'English': [84, 78, 82, 88, 76],
    'Sports': [60, 65, 55, 70, 58]
})

📊 2. Compute Pairwise Correlation with `.corr()`

df.corr()

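👉 Output (rounded to two decimals):

           Math  Science  English   Sports
Math       1.00     0.97     0.89     0.42
Science    0.97     1.00     0.95     0.42
English    0.89     0.95     1.00     0.49
Sports     0.42     0.42     0.49     1.00

✔️ Returns Pearson correlation by default. In this sample the three academic subjects are strongly correlated with one another, while Sports is only weakly related to each of them.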

🔍 3. Correlation Between Two Specific Columns

df['Math'].corr(df['Science'])

✔️ Returns a single correlation coefficient (Pearson by default) for the two Series; for this sample it is about 0.97.
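To see what that single number represents, here is a minimal sketch that recomputes the Pearson coefficient by hand with NumPy and compares it with the Pandas result. The variable names (dx, dy, r_manual, r_pandas) are illustrative only and not part of any API.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Math': [90, 85, 88, 92, 75],
    'Science': [93, 80, 86, 95, 70],
})

x = df['Math'].to_numpy(dtype=float)
y = df['Science'].to_numpy(dtype=float)

# Deviation of each value from its column mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity computed by Pandas
r_pandas = df['Math'].corr(df['Science'])

print(r_manual, r_pandas)
```

Both values should agree up to floating-point rounding.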

🔄 4. Use Other Correlation Methods

Spearman Rank Correlation

df.corr(method='spearman')

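✔️ Measures monotonic relationships: trends that consistently rise or fall, even when they are not linear. It is computed on the ranks of the values rather than on the values themselves.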
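The difference from Pearson is easiest to see on a curved but always-increasing relationship. The sketch below uses made-up data (the `curve` DataFrame is hypothetical) and also checks that Spearman behaves like Pearson applied to ranks.

```python
import pandas as pd

# Hypothetical data: y always grows with x, but along a curve, not a line
curve = pd.DataFrame({'x': range(1, 11)})
curve['y'] = curve['x'] ** 3

# Pearson is below 1 because the relationship is not linear
pearson = curve.corr(method='pearson').loc['x', 'y']

# Spearman is 1 because the ordering of x and y is identical
spearman = curve.corr(method='spearman').loc['x', 'y']

# Ranking the data first and then using Pearson gives the same answer
rank_pearson = curve.rank().corr(method='pearson').loc['x', 'y']

print(pearson, spearman, rank_pearson)
```

If Pearson is noticeably lower than Spearman on your own data, that can hint at a monotonic but non-linear relationship worth plotting.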

Kendall Tau Correlation

df.corr(method='kendall')

✔️ Measures ordinal association by comparing how pairs of observations are ordered, which makes it less sensitive to outliers than Pearson.
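As a rough illustration of the pair-counting idea, the sketch below implements the simplest variant, Kendall's tau-a, in plain Python and compares it with Pearson on made-up data containing one extreme outlier. The helper `kendall_tau_a` is hypothetical and assumes there are no tied values; it is not the Pandas implementation, which also corrects for ties.

```python
from itertools import combinations

import pandas as pd


def kendall_tau_a(x, y):
    """(concordant pairs - discordant pairs) / total pairs. Assumes no ties."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both variables move the same way
        elif direction < 0:
            discordant += 1   # the variables move in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical data: a perfect upward trend, except for one extreme last value
x = list(range(1, 11))
y = list(range(1, 10)) + [-100]

tau = kendall_tau_a(x, y)
pearson = pd.Series(x).corr(pd.Series(y))

print(tau, pearson)
```

Here the single outlier is enough to turn Pearson negative, while tau stays positive because 36 of the 45 pairs are still ordered consistently.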

🧮 5. Correlation with Missing Values

# Cast to float first so the column can hold NaN (integer columns cannot)
df_missing = df.astype(float)
df_missing.loc[2, 'English'] = float('nan')

df_missing.corr()

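✔️ Pandas excludes missing values pairwise: for each pair of columns, only the rows where both values are present are used. As a result, different cells of the matrix can be based on different numbers of rows. Use the min_periods argument to require a minimum number of valid rows per pair.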
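The sketch below contrasts this pairwise behavior with dropping incomplete rows up front, and shows min_periods in action. The names pairwise, listwise, and strict are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Math': [90, 85, 88, 92, 75],
    'Science': [93, 80, 86, 95, 70],
    'English': [84, 78, 82, 88, 76],
    'Sports': [60, 65, 55, 70, 58]
})

# Float copy with one missing English score
df_missing = df.astype(float)
df_missing.loc[2, 'English'] = float('nan')

# Default: each pair of columns uses every row where both values exist,
# so Math vs Science still uses all 5 rows
pairwise = df_missing.corr()

# Alternative: drop incomplete rows first, so every pair uses the same 4 rows
listwise = df_missing.dropna().corr()

# Require 5 valid rows per pair; pairs involving English become NaN
strict = df_missing.corr(min_periods=5)

print(pairwise.loc['Math', 'Science'], listwise.loc['Math', 'Science'])
print(strict)
```

Which approach is appropriate depends on the analysis; dropping rows first keeps every coefficient comparable, at the cost of discarding data.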

📈 6. Visualize Correlation Matrix with Seaborn

import seaborn as sns
import matplotlib.pyplot as plt

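# Pin the color scale to [-1, 1] so colors mean the same thing in every plot
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)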
plt.title("Correlation Heatmap")
plt.show()

✔️ A heatmap visually shows which features are strongly or weakly correlated.

🧠 7. Interpret Correlation Coefficient Values

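Absolute correlation |r|	Strength
0.90–1.00	Very strong
0.70–0.89	Strong
0.50–0.69	Moderate
0.30–0.49	Weak
0.00–0.29	Very weak/None

These ranges are a common rule of thumb rather than a formal standard. They apply to the absolute value of the coefficient; the sign only indicates the direction of the relationship.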
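If you want these labels next to the numbers, one possible approach is a small helper that maps each coefficient to its band. The function `strength_label` below is a hypothetical helper written for this guide, using the cut-offs from the table above.

```python
import pandas as pd


def strength_label(r):
    """Map a correlation coefficient to the rule-of-thumb bands above."""
    if pd.isna(r):
        return 'n/a'
    a = abs(r)  # strength ignores the sign
    if a >= 0.90:
        return 'Very strong'
    if a >= 0.70:
        return 'Strong'
    if a >= 0.50:
        return 'Moderate'
    if a >= 0.30:
        return 'Weak'
    return 'Very weak/None'


df = pd.DataFrame({
    'Math': [90, 85, 88, 92, 75],
    'Science': [93, 80, 86, 95, 70],
    'English': [84, 78, 82, 88, 76],
    'Sports': [60, 65, 55, 70, 58]
})

# Apply the helper to every cell, one column at a time
labels = df.corr().apply(lambda col: col.map(strength_label))
print(labels)
```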

📌 Summary – Key Takeaways

Correlation analysis in Pandas helps you quantify relationships between numerical variables and identify features that move together, whether in the same or in opposite directions.

🔍 Key Takeaways:

Use .corr() for pairwise correlation (Pearson by default)
Use method='spearman' or method='kendall' for monotonic (possibly non-linear) relationships or ordinal data
Combine with Seaborn's sns.heatmap() for visual analysis
Pandas excludes missing values pairwise when computing correlations
Strong correlation does not imply causation

⚙️ Real-world relevance: Used in machine learning, portfolio analysis, KPI validation, and multicollinearity checks.

❓ FAQs – Correlation in Pandas

❓ What is the default method used by .corr()?
✅ Pearson correlation (linear relationship).

❓ Can I correlate non-numeric columns?
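❌ Not directly. .corr() needs numeric data. In pandas 2.0 and later, calling df.corr() on a DataFrame that contains non-numeric columns raises an error unless you pass numeric_only=True or select the numeric columns first.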
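A short sketch of both options, assuming pandas 1.5 or later (where the numeric_only argument is available on .corr()). The `students` DataFrame and its names are made up for illustration.

```python
import pandas as pd

# Hypothetical data with one text column
students = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cy', 'Dee', 'Eli'],
    'Math': [90, 85, 88, 92, 75],
    'Science': [93, 80, 86, 95, 70],
})

# Option 1: ask .corr() to skip non-numeric columns
numeric_corr = students.corr(numeric_only=True)

# Option 2: select the numeric columns explicitly, then correlate
selected_corr = students.select_dtypes(include='number').corr()

print(numeric_corr)
```

Categorical columns can still be analyzed after encoding them as numbers, but whether a correlation on the encoded values is meaningful depends on the encoding.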

❓ What if I want to remove highly correlated features?
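✅ Start by ranking column pairs by absolute correlation, then drop one column from each pair above a threshold you choose. Note that this ranking lists every pair twice and includes each column's correlation of 1.0 with itself: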

df.corr().abs().unstack().sort_values(ascending=False)
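One common heuristic for the actual removal step is to look only at the upper triangle of the matrix, so each pair is considered once, and drop any column that is highly correlated with an earlier one. The sketch below assumes a threshold of 0.9, which is an arbitrary choice for illustration; the names upper, to_drop, and reduced are not part of any API.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Math': [90, 85, 88, 92, 75],
    'Science': [93, 80, 86, 95, 70],
    'English': [84, 78, 82, 88, 76],
    'Sports': [60, 65, 55, 70, 58]
})

threshold = 0.9  # assumed cut-off; tune it for your data

corr = df.corr().abs()

# Keep only the cells above the diagonal so each pair appears once
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(upper_mask)

# Drop a column if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```

This is a greedy rule, so the result depends on column order: a column can be dropped because of its correlation with another column that is itself dropped. For careful feature selection, review the flagged pairs rather than relying on the rule alone.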

❓ What does a negative correlation mean?
✅ As one variable increases, the other tends to decrease.

❓ Does correlation mean causation?
❌ No. High correlation does not imply one variable causes the other.
