📊 R – Statistics Intro and Working with Data Sets (with Code Explanation)
🧲 Introduction – Start Statistical Analysis with R
R is a powerful environment built specifically for statistical computing and data analysis. Whether you’re working on regression, hypothesis testing, or data visualization, it all starts with understanding your dataset and basic descriptive statistics.
🎯 In this guide, you’ll learn:
- How to load and inspect datasets in R
- Use basic statistical functions (mean(), median(), summary())
- Explore built-in datasets and load external data
- Prepare your data for further statistical modeling
🗃️ 1. Loading a Built-In Dataset
R includes many built-in datasets like mtcars, iris, airquality, and more. You can list them using:
data() # Lists all available datasets
✅ Load and View mtcars
data(mtcars)
head(mtcars)
🔍 Explanation:
- data(mtcars) loads the dataset into memory
- head() shows the first six rows
- mtcars contains information on fuel consumption and design of 32 cars
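To confirm what was loaded, the following minimal sketch checks the size and column names of mtcars using base R only. The variable names dims and cols are chosen here for illustration.

```r
# Inspect the size and column names of the built-in mtcars data frame
dims <- dim(mtcars)      # c(rows, columns)
cols <- names(mtcars)    # column names

print(dims)              # 32 rows, 11 columns
print(cols)
tail(mtcars, 3)          # last three rows, the counterpart of head()
```

Checking dimensions right after loading is a cheap way to catch a truncated or wrongly parsed dataset before any statistics are computed.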
📋 2. Structure and Summary of Dataset
✅ Check Structure and Summary
str(mtcars) # Structure (types, variables)
summary(mtcars) # Statistical summary (min, max, mean, quartiles)
🔍 Output (partial):
mpg cyl disp
Min. :10.40 Min. :4.000 Min. : 71.1
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8
Median :19.20 Median :6.000 Median :196.3
Mean :20.09 Mean :6.188 Mean :230.7
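summary() also works on a single column, and sapply() can apply one statistic to every column at once. A small sketch, assuming all columns are numeric (which holds for mtcars); mpg_summary and col_means are illustrative names:

```r
# Summary of one column: min, quartiles, median, mean, max
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# Apply one statistic to every column (all mtcars columns are numeric)
col_means <- sapply(mtcars, mean)
print(round(col_means, 2))
```

For a data frame with text or factor columns, sapply(df, mean) would return NA with warnings for those columns, so select the numeric columns first.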
📈 3. Basic Descriptive Statistics
✅ Calculate Mean, Median, Mode
mean(mtcars$mpg) # Average MPG
median(mtcars$mpg) # Middle value
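🔍 Base R's mode() reports an object's storage mode (such as "numeric"), not its most frequent value, so define a helper:
get_mode <- function(x) {
  ux <- unique(x) # Distinct values
  ux[which.max(tabulate(match(x, ux)))] # Value with the highest count (first one on ties)
}
get_mode(mtcars$cyl) # 8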
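get_mode() returns only one value even when several are tied for the highest count. If ties matter, a variant along these lines returns all of them. get_modes is a hypothetical helper name, not a base R function:

```r
# Hypothetical helper: return every value tied for the highest frequency
get_modes <- function(x) {
  ux <- unique(x)                    # distinct values, in order of appearance
  counts <- tabulate(match(x, ux))   # how often each distinct value occurs
  ux[counts == max(counts)]          # keep all values that reach the maximum
}

get_modes(c(1, 2, 2, 3, 3))   # two modes: 2 and 3
get_modes(mtcars$cyl)         # single mode: 8
```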
✅ Standard Deviation & Variance
sd(mtcars$mpg) # Standard deviation
var(mtcars$mpg) # Variance
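To see how the two functions relate, this sketch recomputes the variance by hand and compares it with var() and sd(). The names x, n, and manual_var are illustrative:

```r
x <- mtcars$mpg
n <- length(x)

# Sample variance by hand: squared deviations divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

all.equal(manual_var, var(x))    # TRUE: var() uses the n - 1 denominator
all.equal(sqrt(var(x)), sd(x))   # TRUE: sd() is the square root of var()
```

all.equal() is used instead of == because floating-point results that are mathematically equal can differ in the last few bits.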
📊 4. Visualizing Basic Stats
hist(mtcars$mpg, col = "lightgreen", main = "Histogram of MPG")
boxplot(mtcars$mpg, main = "Boxplot of MPG", col = "lightblue")
🔍 Why use this?
- Visuals reveal spread, outliers, and distribution skewness
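The same features can be checked numerically. The sketch below uses quantile() and IQR() with the common 1.5 × IQR rule of thumb for flagging outliers; the variable names are illustrative, and the rule is a convention rather than a formal test:

```r
x <- mtcars$mpg

q <- quantile(x, c(0.25, 0.75))   # first and third quartiles
iqr <- IQR(x)                     # Q3 - Q1

# Rule of thumb: flag points more than 1.5 * IQR beyond the quartiles
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr
x[x < lower | x > upper]          # candidate outliers, if any

# A mean above the median hints at right skew
mean(x) > median(x)
```

boxplot() derives its whiskers from hinges rather than from quantile(), so borderline points may be classified differently in the plot than in this calculation.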
📥 5. Load External Data Set
✅ Load from CSV File
data <- read.csv("your_data.csv")
head(data)
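Because your_data.csv is a placeholder, the following sketch creates its own small file first so it can run anywhere. The column names ID, Name, and Score mirror those used in the cleaning step of this guide; the values and the names path, example, and scores are made up for illustration:

```r
# Write a small example file to a temporary location
path <- tempfile(fileext = ".csv")
example <- data.frame(
  ID    = 1:3,
  Name  = c("Ana", "Ben", "Cy"),
  Score = c(72, 45, NA)
)
write.csv(example, path, row.names = FALSE)

# Read it back; stringsAsFactors = FALSE keeps text columns as character
scores <- read.csv(path, stringsAsFactors = FALSE)
str(scores)
```

Calling str() right after read.csv() shows whether each column was parsed with the type you expect, which is where most import problems surface.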
✅ Check for Missing Data
sum(is.na(data)) # Total missing values
colSums(is.na(data)) # Missing values per column
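Once missing values are located, most statistical functions need to be told how to treat them. A short sketch on a made-up data frame (scores and its values are hypothetical):

```r
# Hypothetical data with one missing score
scores <- data.frame(
  ID    = 1:4,
  Score = c(72, 45, NA, 90)
)

mean(scores$Score)                 # NA: missing values propagate by default
mean(scores$Score, na.rm = TRUE)   # 69: mean of the three observed scores

# Keep only rows with no missing values in any column
complete <- scores[complete.cases(scores), ]
nrow(complete)                     # 3
```

Whether to drop incomplete rows or only skip NAs per calculation depends on the analysis; dropping rows is simpler but discards the non-missing values in those rows.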
⚙️ 6. Clean and Prepare Data
✅ Rename Columns
names(data) <- c("ID", "Name", "Score") # Assumes the data has exactly three columns
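Replacing all names at once depends on column order. To rename a single column without touching the others, one option is to match on its current name. The data frame and the names V1, V2, and V3 below are hypothetical:

```r
# Hypothetical data frame with uninformative column names
scores <- data.frame(V1 = 1:2, V2 = c("Ana", "Ben"), V3 = c(72, 45))

# Rename one column by matching its current name
names(scores)[names(scores) == "V3"] <- "Score"
names(scores)   # "V1" "V2" "Score"
```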
✅ Filter or Subset
subset(data, Score > 50) # Get records with Score > 50
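subset() can also pick columns, and it silently drops rows where the condition evaluates to NA. The sketch below shows that alongside the equivalent bracket indexing, on a made-up data frame (all names and values are hypothetical):

```r
scores <- data.frame(
  ID    = 1:4,
  Name  = c("Ana", "Ben", "Cy", "Dee"),
  Score = c(72, 45, NA, 90),
  stringsAsFactors = FALSE
)

# subset() drops rows where the condition is NA; select picks columns
passed <- subset(scores, Score > 50, select = c(Name, Score))

# Bracket indexing needs which() to drop NA rows the same way
passed2 <- scores[which(scores$Score > 50), c("Name", "Score")]

passed$Name    # "Ana" "Dee"
passed2$Name   # "Ana" "Dee"
```

Without which(), scores[scores$Score > 50, ] would keep an all-NA row for the missing score, which is a common source of confusion.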
🧠 Statistical Terms Explained
Term | Meaning |
---|---|
Mean | Average of values |
Median | Middle value when sorted |
Mode | Most frequent value |
Variance | Average squared deviation from the mean (var() divides by n - 1) |
SD | Square root of variance |
IQR | Interquartile range (Q3 – Q1) |
📌 Summary – Recap & Next Steps
Before jumping into advanced modeling, it’s crucial to explore and understand your data using basic statistics. R offers powerful, readable functions to describe data with both numbers and visuals.
🔍 Key Takeaways:
- Use mean(), median(), summary(), sd() for quick insight
- Explore data structure with str() and head()
- Visualize spread with hist() and boxplot()
- Clean and subset your data before analysis
⚙️ Real-World Relevance:
These steps are used across fields, from clinical trials, market research, and finance to machine learning pre-processing.
❓ FAQs – Statistical Intro in R
❓ How do I get summary stats of all columns in R?
✅ Use summary(dataset) for min, mean, median, etc.
❓ How can I get the mode in R?
✅ Define a custom function:
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
❓ What is the best dataset for practicing in R?
✅ Built-in datasets like mtcars and iris, plus diamonds (which ships with the ggplot2 package), are perfect for EDA.
❓ How do I visualize variable distributions?
✅ Use:
hist(x) # Histogram
boxplot(x) # Boxplot
plot(density(x)) # Density curve; density() alone only computes the estimate
❓ How to detect missing values in R?
✅ Use:
sum(is.na(data)) # Total
colSums(is.na(data)) # Per column