📈 Pandas Statistical Functions – Analyze Data with Built-in Math Tools

🧲 Introduction – Why Use Statistical Functions in Pandas?

Pandas provides a rich set of statistical functions for common operations such as mean, median, standard deviation, skewness, and correlation. These functions help you uncover trends and distributions in your data without calling NumPy or SciPy directly.

🎯 In this guide, you’ll learn:

How to use built-in statistical functions on Series and DataFrames
How to get summary stats, correlations, rankings, and cumulative metrics
How to apply functions row-wise, column-wise, or across groups
How to handle NaN values in calculations

📥 1. Sample DataFrame

import pandas as pd

df = pd.DataFrame({
    'Math': [85, 92, 88, 79, 95],
    'Science': [89, 94, 86, 82, 91],
    'English': [78, 85, 82, 88, 90]
})

🧮 2. Basic Statistical Functions

df.mean()       # Column-wise mean
df.median()     # Median
df.mode()       # Most frequent value(s) per column (returns a DataFrame)
df.std()        # Sample standard deviation (ddof=1 by default)
df.var()        # Sample variance (ddof=1 by default)
df.min()        # Minimum
df.max()        # Maximum
df.sum()        # Total sum
df.count()      # Non-null count

✔️ Use axis=1 to apply row-wise instead of column-wise.
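Several of the functions above can be collected in a single call. The sketch below reuses the sample scores from section 1 and relies on `describe()` and `agg()`, two pandas methods this guide does not cover elsewhere; the variable names `overview` and `summary` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Math': [85, 92, 88, 79, 95],
    'Science': [89, 94, 86, 82, 91],
    'English': [78, 85, 82, 88, 90]
})

# describe() bundles count, mean, std, min, quartiles, and max per column
overview = df.describe()

# agg() lets you pick exactly which statistics to compute
summary = df.agg(['mean', 'median', 'std'])

print(overview)
print(summary)
```

Each row of `summary` is one statistic and each column is one subject, so `summary.loc['mean', 'Math']` reads a single value.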

📊 3. Cumulative Statistics

df.cumsum()     # Cumulative sum
df.cumprod()    # Cumulative product
df.cummax()     # Cumulative maximum
df.cummin()     # Cumulative minimum

✔️ Useful for running totals, financial time series, etc.
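To make the running-total idea concrete, here is a minimal sketch on the Math scores alone. The `expanding().mean()` call is an addition not mentioned above; it is one way to get a running average alongside the cumulative functions.

```python
import pandas as pd

scores = pd.Series([85, 92, 88, 79, 95], name='Math')

running_total = scores.cumsum()   # 85, 177, 265, 344, 439
running_best = scores.cummax()    # 85, 92, 92, 92, 95

# Running average over all values seen so far
running_mean = scores.expanding().mean()

print(pd.DataFrame({
    'score': scores,
    'running_total': running_total,
    'running_best': running_best,
    'running_mean': running_mean,
}))
```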

📉 4. Skewness and Kurtosis

df.skew()       # Measures data asymmetry
df.kurt()       # Measures "tailedness" (excess kurtosis; normal distribution = 0)

✔️ Helpful for distribution shape analysis.
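A small sketch of how to read the sign of the skewness. The two Series below are made-up values chosen for illustration, not part of the sample DataFrame.

```python
import pandas as pd

# Hypothetical data: one symmetric sample, one with a long right tail
symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 1, 1, 2, 10])

sym_skew = symmetric.skew()       # close to 0: no asymmetry
tail_skew = right_tailed.skew()   # positive: tail extends to the right
tail_kurt = right_tailed.kurt()   # excess kurtosis of the same sample

print(sym_skew, tail_skew, tail_kurt)
```

A negative skew would indicate the opposite case, a tail that extends to the left.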

🔗 5. Correlation and Covariance

df.corr()       # Pearson correlation between columns
df.cov()        # Covariance matrix

✔️ Used for relationship analysis between numerical variables.
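The two matrices are closely related: a Pearson correlation is the covariance divided by the product of the two standard deviations. The following sketch checks that on the sample scores; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Math': [85, 92, 88, 79, 95],
    'Science': [89, 94, 86, 82, 91],
    'English': [78, 85, 82, 88, 90]
})

corr_matrix = df.corr()   # symmetric, with 1.0 on the diagonal
cov_matrix = df.cov()

# Correlation for one pair of columns
pair = df['Math'].corr(df['Science'])

# Same value rebuilt from covariance and standard deviations
cov_ms = cov_matrix.loc['Math', 'Science']
manual = cov_ms / (df['Math'].std() * df['Science'].std())

print(pair, manual)
```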

🔢 6. Ranking and Percentiles

df.rank()                 # Rank each value within its column (ties share the average rank)
df['Math'].quantile(0.75)  # 75th percentile

✔️ Percentiles and ranks help in grading, tiering, and percent-based logic.
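As a sketch of the grading use case, the example below ranks the Math scores from best to worst and picks out the students at or above the 75th percentile. The cutoff and variable names are assumptions made for illustration.

```python
import pandas as pd

math_scores = pd.Series([85, 92, 88, 79, 95], name='Math')

# ascending=False makes rank 1 the highest score
rank_desc = math_scores.rank(ascending=False)   # 4, 2, 3, 5, 1

# pct=True expresses the (ascending) rank as a fraction of the group size
pct_rank = math_scores.rank(pct=True)           # 0.4, 0.8, 0.6, 0.2, 1.0

# 75th percentile, linearly interpolated by default
p75 = math_scores.quantile(0.75)                # 92.0

# Keep only the scores at or above that cutoff
top_quarter = math_scores[math_scores >= p75]   # 92 and 95

print(rank_desc.tolist(), p75, top_quarter.tolist())
```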

🧠 7. Apply Stats to Rows

df.mean(axis=1)    # Mean per student
df.max(axis=1)     # Highest score per student
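Row-wise results line up with the original index, so they can be attached as new columns. The sketch below does this on a copy of the sample data; `idxmax(axis=1)` is an addition not covered above and returns the name of the column holding each row's maximum.

```python
import pandas as pd

df = pd.DataFrame({
    'Math': [85, 92, 88, 79, 95],
    'Science': [89, 94, 86, 82, 91],
    'English': [78, 85, 82, 88, 90]
})

report = df.copy()
report['Average'] = df.mean(axis=1)        # mean of the three subjects per student
report['Best'] = df.max(axis=1)            # highest score per student
report['BestSubject'] = df.idxmax(axis=1)  # column name of that highest score

print(report)
```

The statistics are computed from `df` rather than `report` so that the newly added columns never feed into later calculations.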

⚙️ 8. Handle Missing Data in Stats

df_with_nan = df.astype('float64')     # NaN needs a float column
df_with_nan.iloc[2, 1] = float('nan')  # Introduce NaN in Science

df_with_nan.mean(skipna=True)   # Default: ignores NaN
df_with_nan.mean(skipna=False)  # NaN for any column with a missing value

✔️ Use skipna to control how NaNs are treated.
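Here is a self-contained sketch of the difference, with one Science score deliberately set to NaN. It also shows that `count()` reports how many values actually went into each statistic.

```python
import numpy as np
import pandas as pd

df_with_nan = pd.DataFrame({
    'Math': [85, 92, 88, 79, 95],
    'Science': [89, 94, np.nan, 82, 91],   # one missing score
    'English': [78, 85, 82, 88, 90]
})

skipped = df_with_nan.mean()               # skipna=True: Science mean over 4 values
strict = df_with_nan.mean(skipna=False)    # Science becomes NaN, other columns unchanged
counts = df_with_nan.count()               # non-null values per column

print(skipped['Science'], strict['Science'], counts['Science'])
```

With `skipna=False`, a NaN in the result acts as a flag that the column has missing data, which can be useful when a silently shortened sample would be misleading.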

📌 Summary – Key Takeaways

Pandas statistical functions let you analyze and summarize datasets efficiently, from simple averages to advanced distribution metrics.

🔍 Key Takeaways:

Use .mean(), .std(), .sum(), etc. for quick insights
Use .corr() and .cov() for relationships
Use .rank(), .quantile(), .skew() for advanced analytics
Handle NaNs with skipna parameter
Choose axis=0 for column-wise, axis=1 for row-wise stats

⚙️ Real-world relevance: Useful in EDA, machine learning preprocessing, KPI dashboards, and statistical reporting.

❓ FAQs – Pandas Statistical Functions

❓ How do I calculate the mean for each row?
Use:

df.mean(axis=1)

❓ What’s the difference between .mean() and .median()?

.mean() → average (sensitive to outliers)
.median() → middle value (robust to outliers)

❓ Can I calculate correlation between two specific columns?
Yes:

df['Math'].corr(df['Science'])

❓ How do I include/exclude NaN in calculations?
Pass skipna=True (the default) or skipna=False to reduction functions such as .mean(), .sum(), and .std().

❓ Can I apply a statistical function across groups?
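Yes, with groupby. This assumes a DataFrame that has a 'group' column and a 'Score' column, which the sample data in this guide does not: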

df.groupby('group')['Score'].mean()
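A runnable version of that idea might look like the following. The `grades` DataFrame, its 'group' labels, and the 'Score' values are hypothetical and exist only to illustrate the call.

```python
import pandas as pd

# Hypothetical data: five scores split across two groups
grades = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'B'],
    'Score': [85, 92, 88, 79, 95]
})

# One statistic per group
group_means = grades.groupby('group')['Score'].mean()

# Several statistics per group in one call
group_stats = grades.groupby('group')['Score'].agg(['mean', 'std', 'count'])

print(group_means)
print(group_stats)
```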
