📊 R – Statistics Intro and Working with Data Sets (with Code Explanation)
🧲 Introduction – Start Statistical Analysis with R
R is a powerful environment built specifically for statistical computing and data analysis. Whether you’re working on regression, hypothesis testing, or data visualization, it all starts with understanding your dataset and basic descriptive statistics.
🎯 In this guide, you’ll learn:
- How to load and inspect datasets in R
- How to use basic statistical functions (mean(), median(), summary())
- How to explore built-in datasets and load external data
- How to prepare your data for further statistical modeling
🗃️ 1. Loading a Built-In Dataset
R includes many built-in datasets like mtcars, iris, airquality, and more. You can list them using:
data()           # Lists datasets in the currently loaded packages
✅ Load and View mtcars
data(mtcars)
head(mtcars)
🔍 Explanation:
- data(mtcars) loads the dataset into your workspace (optional for mtcars, which is available as soon as R starts)
- head() shows the first six rows
- mtcars contains information on fuel consumption and design of 32 cars
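To go one step beyond head(), the following sketch checks the dataset's size and column names. It uses only base R functions and makes no assumptions beyond mtcars being available.

```r
# mtcars ships with R (datasets package), so no download is needed
dim(mtcars)              # number of rows and columns
names(mtcars)            # column (variable) names
tail(mtcars, 3)          # last three rows
rownames(mtcars)[1:5]    # car models are stored as row names
```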
📋 2. Structure and Summary of Dataset
✅ Check Structure and Summary
str(mtcars)         # Structure (types, variables)
summary(mtcars)     # Statistical summary (min, max, mean, quartiles)
🔍 Output (partial):
 mpg             cyl             disp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8  
 Median :19.20   Median :6.000   Median :196.3  
 Mean   :20.09   Mean   :6.188   Mean   :230.7  
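summary() also works on a single column, and quantile() returns the same quartiles as plain numbers you can reuse in later calculations. A small sketch:

```r
summary(mtcars$mpg)          # summary for one variable

q <- quantile(mtcars$mpg)    # 0%, 25%, 50%, 75%, 100%
q[["50%"]]                   # same value as median(mtcars$mpg)

sapply(mtcars, class)        # compact view of each column's type
```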
📈 3. Basic Descriptive Statistics
✅ Calculate Mean, Median, Mode
mean(mtcars$mpg)      # Average MPG
median(mtcars$mpg)    # Middle value
🔍 Base R's mode() returns an object's storage mode (for example "numeric"), not the statistical mode, so define your own:
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
get_mode(mtcars$cyl)
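get_mode() returns only the first value that reaches the highest count, so ties stay hidden. A minimal sketch of a tie-aware variant follows; the name get_modes is hypothetical and not part of base R.

```r
# Hypothetical helper: return every value that shares the highest frequency
get_modes <- function(x) {
  counts <- table(x)                            # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]   # labels with the top count
  type.convert(top, as.is = TRUE)               # convert labels back from character
}

table(mtcars$cyl)             # frequency of each cylinder count
get_modes(mtcars$cyl)         # a single mode
get_modes(c(1, 1, 2, 2, 3))   # two modes: 1 and 2
```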
✅ Standard Deviation & Variance
sd(mtcars$mpg)         # Standard deviation
var(mtcars$mpg)        # Variance
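Both var() and sd() use the sample formulas, dividing by n - 1 rather than n. The sketch below recomputes them by hand to make that relationship explicit; the variable names are illustrative.

```r
x <- mtcars$mpg
n <- length(x)

manual_var <- sum((x - mean(x))^2) / (n - 1)   # sample variance
manual_sd  <- sqrt(manual_var)                 # standard deviation

all.equal(manual_var, var(x))   # TRUE
all.equal(manual_sd, sd(x))     # TRUE
```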
📊 4. Visualizing Basic Stats
hist(mtcars$mpg, col = "lightgreen", main = "Histogram of MPG")
boxplot(mtcars$mpg, main = "Boxplot of MPG", col = "lightblue")
🔍 Why use this?
- Visuals reveal spread, outliers, and distribution skewness
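As an illustration of reading these plots, the sketch below marks the mean and median on the histogram and uses boxplot.stats() to list any points a boxplot would draw as outliers. The colors and line styles are arbitrary choices.

```r
x <- mtcars$mpg

hist(x, col = "lightgreen", main = "Histogram of MPG", xlab = "Miles per gallon")
abline(v = mean(x), col = "red", lwd = 2)               # mean
abline(v = median(x), col = "blue", lwd = 2, lty = 2)   # median

# Points beyond the whiskers (an empty vector if there are none)
outliers <- boxplot.stats(x)$out
outliers

# Smoothed view of the same distribution
plot(density(x), main = "Density of MPG")
```

A mean that sits to the right of the median, as it does here, hints at a right-skewed distribution.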
📥 5. Load External Data Set
✅ Load from CSV File
data <- read.csv("your_data.csv")
head(data)
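Because your_data.csv is a placeholder, here is a self-contained sketch that writes a small file to a temporary location and reads it back. The column names and values are made up for illustration.

```r
# Create a throwaway CSV so the example runs anywhere
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score",
             "1,Ana,72",
             "2,Ben,NA",
             "3,Cy,48"), path)

scores <- read.csv(path, stringsAsFactors = FALSE)  # keep text columns as character
str(scores)     # confirm column types after import
head(scores)

unlink(path)    # remove the temporary file
```

Running str() right after import is a cheap way to catch numbers that were read as text.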
✅ Check for Missing Data
sum(is.na(data))        # Total missing values
colSums(is.na(data))    # Missing values per column
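Counting missing values is usually followed by deciding how to handle them. A brief sketch using the built-in airquality dataset, which contains NA values:

```r
colSums(is.na(airquality))              # NAs per column

mean(airquality$Ozone)                  # NA, because missing values propagate
mean(airquality$Ozone, na.rm = TRUE)    # ignore NAs in the calculation

complete <- na.omit(airquality)         # drop rows with any NA
nrow(airquality) - nrow(complete)       # number of rows removed
```

Dropping rows is only one option; whether it is appropriate depends on how much data is lost and why the values are missing.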
⚙️ 6. Clean and Prepare Data
✅ Rename Columns
names(data) <- c("ID", "Name", "Score")    # Assumes the data frame has exactly three columns
✅ Filter or Subset
subset(data, Score > 50)    # Get records with Score > 50
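Note that subset() returns a new data frame and leaves the original untouched, so assign the result if you want to keep it. The sketch below uses a small made-up data frame (df and its values are hypothetical).

```r
df <- data.frame(id = 1:4,
                 name = c("Ana", "Ben", "Cy", "Dee"),
                 score = c(72, 35, 48, 90))

names(df)[names(df) == "score"] <- "Score"   # rename one column by name

passed <- subset(df, Score > 50)             # keep rows with Score > 50
passed_alt <- df[df$Score > 50, ]            # equivalent bracket indexing

passed
```

Renaming by name rather than by position keeps the code working if the column order changes.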
🧠 Statistical Terms Explained
| Term | Meaning | 
|---|---|
| Mean | Average of values | 
| Median | Middle value when sorted | 
| Mode | Most frequent value | 
| Variance | Average squared deviation from the mean (R's var() divides by n - 1) | 
| SD | Square root of variance | 
| IQR | Interquartile range (Q3 – Q1) | 
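Most of these terms map directly to a base R function. As a quick illustration, IQR() matches the difference between the third and first quartiles returned by quantile():

```r
x <- mtcars$mpg

q <- quantile(x, c(0.25, 0.75))   # Q1 and Q3
unname(q[2] - q[1])               # Q3 - Q1
IQR(x)                            # same value

range(x)                          # minimum and maximum
```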
📌 Summary – Recap & Next Steps
Before jumping into advanced modeling, it’s crucial to explore and understand your data using basic statistics. R offers powerful, readable functions to describe data with both numbers and visuals.
🔍 Key Takeaways:
- Use mean(), median(), summary(), sd() for quick insight
- Explore data structure with str() and head()
- Visualize spread with hist() and boxplot()
- Clean and subset your data before analysis
⚙️ Real-World Relevance:
These steps apply in nearly every field that works with data, including clinical trials, market research, finance, and machine learning pre-processing.
❓ FAQs – Statistical Intro in R
❓ How do I get summary stats of all columns in R?
✅ Use summary(dataset) for min, mean, median, etc.
❓ How can I get the mode in R?
✅ Define a custom function:
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
❓ What is the best dataset for practicing in R?
✅ Built-in datasets like mtcars, iris, diamonds (ggplot2) are perfect for EDA.
❓ How do I visualize variable distributions?
✅ Use:
hist(x), boxplot(x), or plot(density(x)), since density() only computes the estimate and needs plot() to draw it
❓ How to detect missing values in R?
✅ Use:
sum(is.na(data))   # Total
colSums(is.na(data))  # Per column
