🔗 Pandas Correlation Analysis – Discover Relationships Between Variables
🧲 Introduction – What Is Correlation in Pandas?
Correlation analysis helps you measure how strongly two variables are related. In Pandas, correlation functions allow you to quantify the degree of linear relationship between numeric columns. This is essential in EDA (Exploratory Data Analysis), feature selection, and predictive modeling.
🎯 In this guide, you’ll learn how to:
- Use .corr() to find correlations between columns
- Understand the Pearson, Kendall, and Spearman methods
- Visualize correlation matrices
- Handle missing values and outliers properly
📥 1. Create a Sample DataFrame
import pandas as pd
df = pd.DataFrame({
    'Math': [90, 85, 88, 92, 75],
    'Science': [93, 80, 86, 95, 70],
    'English': [84, 78, 82, 88, 76],
    'Sports': [60, 65, 55, 70, 58]
})
📊 2. Compute Pairwise Correlation with .corr()
df.corr()
👉 Output (rounded to two decimals):
         Math  Science  English  Sports
Math     1.00     0.97     0.89    0.42
Science  0.97     1.00     0.95    0.42
English  0.89     0.95     1.00    0.49
Sports   0.42     0.42     0.49    1.00
✔️ Returns Pearson correlation by default.
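To make the default concrete, here is a minimal sketch that computes the Pearson coefficient by hand and compares it with the pandas result. The helper name pearson_by_hand is made up for this illustration and is not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Math': [90, 85, 88, 92, 75],
    'Science': [93, 80, 86, 95, 70],
})

def pearson_by_hand(x, y):
    """Illustrative helper: sum of co-deviations over the product of spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

manual_r = pearson_by_hand(df['Math'], df['Science'])
pandas_r = df['Math'].corr(df['Science'])
print(round(manual_r, 4), round(pandas_r, 4))
```

Both numbers should agree up to floating-point error, which is a quick way to check your understanding of what .corr() reports.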
🔍 3. Correlation Between Two Specific Columns
df['Math'].corr(df['Science'])
✔️ Calculates the correlation coefficient between two Series (Pearson by default; pass method= to choose another).
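If you want one column compared against all the others, for example a target variable against candidate features, DataFrame.corrwith() may be more convenient than calling Series.corr() in a loop. A short sketch, treating Math as the target purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Math': [90, 85, 88, 92, 75],
    'Science': [93, 80, 86, 95, 70],
    'English': [84, 78, 82, 88, 76],
    'Sports': [60, 65, 55, 70, 58],
})

# Correlate every remaining column with Math (Pearson by default)
vs_math = df.drop(columns='Math').corrwith(df['Math'])

# Strongest relationships first
print(vs_math.sort_values(ascending=False))
```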
🔄 4. Use Other Correlation Methods
Spearman Rank Correlation
df.corr(method='spearman')
✔️ Measures monotonic relationships: consistent upward or downward trends, even when they are not linear.
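A small sketch of the difference, using made-up data where y always rises with x but along a curve. It also shows that Spearman can be reproduced by ranking the values first and then applying Pearson.

```python
import pandas as pd

# Hypothetical data: y grows with x, but not along a straight line
curve = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6]})
curve['y'] = curve['x'] ** 3

pearson = curve.corr(method='pearson').loc['x', 'y']
spearman = curve.corr(method='spearman').loc['x', 'y']

# Spearman is Pearson applied to the ranks of the values
via_ranks = curve.rank().corr(method='pearson').loc['x', 'y']

print(pearson, spearman, via_ranks)
```

Pearson stays below 1 because the points do not lie on a line, while Spearman reaches 1 because the ordering of x and y is identical.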
Kendall Tau Correlation
df.corr(method='kendall')
✔️ Measures ordinal association by comparing concordant and discordant pairs, and is less sensitive to outliers than Pearson.
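To see what "concordant and discordant pairs" means, here is a hand-rolled sketch of the simplest variant (tau-a, which assumes no tied values). The function name and the data are invented for this example. As far as I know, pandas relies on SciPy for method='kendall' and reports tau-b, which should match tau-a when there are no ties.

```python
from itertools import combinations

import pandas as pd

def kendall_tau_a(x, y):
    """Illustrative tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    pairs = list(combinations(zip(x, y), 2))
    for (x1, y1), (x2, y2) in pairs:
        direction = (x2 - x1) * (y2 - y1)
        if direction > 0:      # both values move the same way
            concordant += 1
        elif direction < 0:    # the values move in opposite ways
            discordant += 1
    return (concordant - discordant) / len(pairs)

# Hypothetical series: a perfect upward trend broken by one extreme value
x = pd.Series([1, 2, 3, 4, 5, 6])
y = pd.Series([1, 2, 3, 4, 5, -100])

tau = kendall_tau_a(x, y)
pearson = x.corr(y)
print(tau, pearson)
```

The single extreme value drags Pearson below zero, while tau stays positive because most pairs still agree in direction.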
🧮 5. Correlation with Missing Values
df_missing = df.astype(float)  # cast to float so the column can hold NaN
df_missing.loc[2, 'English'] = float('nan')
df_missing.corr()
✔️ Pandas excludes missing values pair by pair: a row is skipped only for the column pairs where one of the two values is missing.
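The sketch below contrasts this pairwise behaviour with dropping incomplete rows up front, and shows the min_periods argument of .corr(), which returns NaN for pairs with too few overlapping observations. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Math': [90, 85, 88, 92, 75],
    'Science': [93, 80, 86, 95, 70],
    'English': [84, 78, 82, 88, 76],
    'Sports': [60, 65, 55, 70, 58],
})

df_missing = df.astype(float)           # float columns can hold NaN
df_missing.loc[2, 'English'] = np.nan

pairwise = df_missing.corr()            # skips row 2 only for pairs involving English
listwise = df_missing.dropna().corr()   # skips row 2 for every pair

# Math vs Science has no gaps, so the pairwise result still uses all five rows
print(pairwise.loc['Math', 'Science'], listwise.loc['Math', 'Science'])

# Require at least five paired observations; pairs with fewer become NaN
strict = df_missing.corr(min_periods=5)
print(strict.loc['Math', 'English'])
```

Pairwise deletion keeps more data, but each cell of the matrix may then be based on a different set of rows, which is worth remembering when comparing coefficients.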
📈 6. Visualize Correlation Matrix with Seaborn
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()
✔️ A heatmap visually shows which features are strongly or weakly correlated.
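Because a correlation matrix is symmetric, half of the heatmap repeats the other half. One common option is to hide the diagonal and upper triangle with the mask argument of sns.heatmap(). The helper plot_lower_triangle below is a hypothetical wrapper written for this sketch; it assumes seaborn and matplotlib are installed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Math': [90, 85, 88, 92, 75],
    'Science': [93, 80, 86, 95, 70],
    'English': [84, 78, 82, 88, 76],
    'Sports': [60, 65, 55, 70, 58],
})
corr = df.corr()

# True marks the cells to hide: the diagonal and everything above it
mask = np.triu(np.ones(corr.shape, dtype=bool))

def plot_lower_triangle(corr, mask):
    """Hypothetical helper: draw only the cells where mask is False."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Fix the color scale to the full range of a correlation coefficient
    sns.heatmap(corr, mask=mask, annot=True, fmt='.2f',
                cmap='coolwarm', vmin=-1, vmax=1)
    plt.title('Correlation Heatmap (lower triangle)')
    plt.show()
```

Calling plot_lower_triangle(corr, mask) draws the chart. Pinning vmin and vmax to -1 and 1 keeps colors comparable across different datasets.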
🧠 7. Interpret Correlation Coefficient Values
| Absolute correlation | Strength (rule of thumb) |
|---|---|
| 0.90–1.00 | Very strong |
| 0.70–0.89 | Strong |
| 0.50–0.69 | Moderate |
| 0.30–0.49 | Weak |
| 0.00–0.29 | Very weak/None |
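If you want to apply the table above programmatically, a small helper can label each coefficient. The function name and the exact cutoffs (values between bands, such as 0.895, fall into the lower band) are assumptions of this sketch rather than a standard.

```python
import pandas as pd

def correlation_strength(r):
    """Illustrative helper: label a coefficient using the table's bands."""
    size = abs(r)  # strength depends on magnitude, not on sign
    if size >= 0.90:
        return 'Very strong'
    if size >= 0.70:
        return 'Strong'
    if size >= 0.50:
        return 'Moderate'
    if size >= 0.30:
        return 'Weak'
    return 'Very weak/None'

df = pd.DataFrame({
    'Math': [90, 85, 88, 92, 75],
    'Science': [93, 80, 86, 95, 70],
    'English': [84, 78, 82, 88, 76],
    'Sports': [60, 65, 55, 70, 58],
})

# Label every cell of the correlation matrix, column by column
labels = df.corr().apply(lambda col: col.map(correlation_strength))
print(labels)
```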
📌 Summary – Key Takeaways
Correlation analysis in Pandas helps you quantify relationships between numerical variables and identify features that move together or in opposite directions.
🔍 Key Takeaways:
- Use .corr() for pairwise correlation (Pearson by default)
- Use method='spearman' or method='kendall' for monotonic, non-linear, or ordinal data
- Combine with sns.heatmap() for visual analysis
- Pandas excludes missing values pair by pair when computing correlations
- Strong correlation does not imply causation
⚙️ Real-world relevance: Used in machine learning, portfolio analysis, KPI validation, and multicollinearity checks.
❓ FAQs – Correlation in Pandas
❓ What is the default method used by .corr()?
✅ Pearson correlation (linear relationship).
❓ Can I correlate non-numeric columns?
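❌ Not directly. .corr() is computed on numeric data. In pandas 2.0 and later, pass numeric_only=True or select the numeric columns first, because a DataFrame that contains text columns otherwise raises an error.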
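One way to handle a mixed DataFrame is to keep only the numeric columns before calling .corr(). The frame below, including the Name column, is invented for this sketch. Passing numeric_only=True to .corr() is an alternative on recent pandas versions.

```python
import pandas as pd

# Hypothetical frame mixing a text column with numeric ones
mixed = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cy', 'Dee', 'Eli'],
    'Math': [90, 85, 88, 92, 75],
    'Science': [93, 80, 86, 95, 70],
})

# Keep numeric columns only, then correlate
numeric_corr = mixed.select_dtypes(include='number').corr()
print(numeric_corr)
```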
❓ What if I want to remove highly correlated features?
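Rank the column pairs by absolute correlation, then drop one column from each pair that exceeds your chosen threshold. The ranking below includes each column paired with itself (always 1.0) and both orderings of every pair: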
df.corr().abs().unstack().sort_values(ascending=False)
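To go from ranking to actually dropping columns, a common heuristic is to look only at the upper triangle of the matrix, so each pair appears once, and drop the later column of any pair above a cutoff. The 0.90 threshold is an assumption for this sketch, not a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Math': [90, 85, 88, 92, 75],
    'Science': [93, 80, 86, 95, 70],
    'English': [84, 78, 82, 88, 76],
    'Sports': [60, 65, 55, 70, 58],
})
threshold = 0.90  # assumed cutoff; tune it for your data

corr = df.corr().abs()

# Keep only the cells above the diagonal so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose correlation exceeds the threshold
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)

print(to_drop, list(reduced.columns))
```

Which column of a pair gets dropped depends on column order here, so in practice you may prefer to keep whichever feature is easier to interpret or more relevant to your model.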
❓ What does a negative correlation mean?
As one value increases, the other tends to decrease.
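A tiny illustration with made-up numbers, where scores fall as hours of gaming rise:

```python
import pandas as pd

# Hypothetical data: more gaming hours, lower test score
habits = pd.DataFrame({
    'HoursGaming': [1, 2, 3, 4, 5],
    'Score': [90, 85, 80, 70, 65],
})

r = habits['HoursGaming'].corr(habits['Score'])
print(r)  # negative, close to -1
```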
❓ Does correlation mean causation?
❌ No. High correlation does not imply one variable causes the other.
