
📊 R – Statistics Intro and Working with Data Sets (with Code Explanation)

🧲 Introduction – Start Statistical Analysis with R

R is a powerful environment built specifically for statistical computing and data analysis. Whether you’re working on regression, hypothesis testing, or data visualization, it all starts with understanding your dataset and basic descriptive statistics.

🎯 In this guide, you’ll learn:

How to load and inspect datasets in R
How to use basic statistical functions (mean(), median(), summary())
How to explore built-in datasets and load external data
How to prepare your data for further statistical modeling

🗃️ 1. Loading a Built-In Dataset

R includes many built-in datasets like mtcars, iris, airquality, and more. You can list them using:

data()           # Lists the datasets available in the currently attached packages

✅ Load and View `mtcars`

data(mtcars)
head(mtcars)

🔍 Explanation:

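data(mtcars) copies the dataset into your workspace (built-in datasets are lazy-loaded, so mtcars is also usable without this call)
head() shows the first six rows by default
mtcars contains fuel consumption and 10 design and performance variables for 32 cars (1973-74 models)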
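
Beyond head(), a few other base R functions give a quick sense of a dataset's size and contents. The following is a minimal sketch using only base R; the variable names dims and col_names are illustrative choices, not part of the original guide.

```r
# mtcars ships with base R (the datasets package), so no download is needed
data(mtcars)

dims <- dim(mtcars)          # number of rows, then number of columns
col_names <- names(mtcars)   # variable (column) names

print(dims)
print(col_names)
tail(mtcars, 3)              # last three rows, the counterpart of head()
```

Checking dimensions and column names first makes it easier to spot a truncated or mislabeled dataset before computing any statistics.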

📋 2. Structure and Summary of Dataset

✅ Check Structure and Summary

str(mtcars)         # Structure (types, variables)
summary(mtcars)     # Statistical summary (min, max, mean, quartiles)

🔍 Output (partial):

      mpg             cyl             disp      
 Min.   :10.40   Min.   :4.000   Min.   : 71.1  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8  
 Median :19.20   Median :6.000   Median :196.3  
 Mean   :20.09   Mean   :6.188   Mean   :230.7
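
summary() also works on a single column, and quantile() returns the quartiles directly. A short sketch, assuming you only care about mpg; the names mpg_summary and mpg_quartiles are illustrative.

```r
data(mtcars)

# Summary of one column: a named vector with
# Min., 1st Qu., Median, Mean, 3rd Qu., Max.
mpg_summary <- summary(mtcars$mpg)

# Quartiles on their own (25%, 50%, 75%)
mpg_quartiles <- quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75))

print(mpg_summary)
print(mpg_quartiles)

# Individual entries can be pulled out by name
mpg_summary[["Median"]]
```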

📈 3. Basic Descriptive Statistics

✅ Calculate Mean, Median, Mode

mean(mtcars$mpg)      # Average MPG
median(mtcars$mpg)    # Middle value

🔍 Base R's `mode()` returns an object's storage type (such as "numeric"), not the statistical mode, so define your own:

get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
get_mode(mtcars$cyl)
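
Note that get_mode() returns only the first most frequent value it encounters, so ties are silently dropped. One possible variant that returns every tied value is sketched below; get_all_modes is a hypothetical helper name introduced here for illustration.

```r
# Hypothetical helper: return all values that share the highest frequency
get_all_modes <- function(x) {
  ux <- unique(x)                    # distinct values, in order of appearance
  counts <- tabulate(match(x, ux))   # how often each distinct value occurs
  ux[counts == max(counts)]          # keep every value with the top count
}

tied <- get_all_modes(c(1, 2, 2, 3, 3))   # 2 and 3 occur twice each
cyl_mode <- get_all_modes(mtcars$cyl)     # most common cylinder count

print(tied)
print(cyl_mode)
```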

✅ Standard Deviation & Variance

sd(mtcars$mpg)         # Standard deviation
var(mtcars$mpg)        # Variance
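
To see what these functions compute, the sample variance can be reproduced by hand. This sketch assumes the usual sample formulas (dividing by n - 1), which is what var() and sd() use in R; the names manual_var and manual_sd are illustrative.

```r
data(mtcars)
x <- mtcars$mpg
n <- length(x)

# Sample variance: sum of squared deviations divided by (n - 1)
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

# Compare with the built-in functions (all.equal allows for rounding error)
all.equal(manual_var, var(x))
all.equal(manual_sd, sd(x))
```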

📊 4. Visualizing Basic Stats

hist(mtcars$mpg, col = "lightgreen", main = "Histogram of MPG")
boxplot(mtcars$mpg, main = "Boxplot of MPG", col = "lightblue")

🔍 Why use this?

Visuals reveal spread, outliers, and distribution skewness
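
The numbers behind a boxplot can also be inspected without drawing anything. A minimal sketch using boxplot.stats() from the grDevices package (attached by default in R); the names box and skew_hint are illustrative, and comparing mean with median is only a rough indication of skew.

```r
data(mtcars)
x <- mtcars$mpg

# Five values the boxplot draws:
# lower whisker, lower hinge, median, upper hinge, upper whisker
box <- boxplot.stats(x)
print(box$stats)

# Points beyond the whiskers, which the plot would mark as outliers
print(box$out)

# Rough skewness hint: a mean above the median suggests a right-skewed distribution
skew_hint <- mean(x) - median(x)
print(skew_hint)
```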

📥 5. Load External Data Set

✅ Load from CSV File

data <- read.csv("your_data.csv")
head(data)

✅ Check for Missing Data

sum(is.na(data))        # Total missing values
colSums(is.na(data))    # Missing values per column
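
Once missing values are found, most statistical functions need to be told how to handle them. The sketch below uses a small made-up data frame (scores, a hypothetical stand-in for your CSV) to show two common base R options: na.rm = TRUE and complete.cases().

```r
# Hypothetical data frame standing in for an imported CSV
scores <- data.frame(
  ID = 1:5,
  Score = c(72, NA, 55, 90, NA)
)

total_missing <- sum(is.na(scores))       # missing values in the whole table
per_column <- colSums(is.na(scores))      # missing values per column

naive_mean <- mean(scores$Score)                  # NA, because missing values propagate
clean_mean <- mean(scores$Score, na.rm = TRUE)    # ignores the missing values

# Keep only rows with no missing values in any column
complete_rows <- scores[complete.cases(scores), ]

print(per_column)
print(clean_mean)
print(complete_rows)
```

Whether to drop incomplete rows or only skip missing values per calculation depends on the analysis; dropping rows discards data from the other columns as well.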

⚙️ 6. Clean and Prepare Data

✅ Rename Columns

names(data) <- c("ID", "Name", "Score")    # Assumes the data has exactly three columns, in this order

✅ Filter or Subset

subset(data, Score > 50)    # Get records with Score > 50
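
The renaming and filtering steps can be tried on a small made-up data frame. In this sketch, raw and its column names V1 to V3 are hypothetical; it renames a single column by name (which avoids overwriting the others) and shows that bracket indexing selects the same rows as subset().

```r
# Hypothetical three-column data frame with uninformative names
raw <- data.frame(
  V1 = 1:4,
  V2 = c("Ana", "Ben", "Cy", "Di"),
  V3 = c(45, 80, 50, 67)
)

# Rename all columns at once (length must match the number of columns)
names(raw) <- c("ID", "Name", "Points")

# Rename just one column by matching its current name
names(raw)[names(raw) == "Points"] <- "Score"

# Two ways to keep rows with Score > 50
passed <- subset(raw, Score > 50)
passed_alt <- raw[raw$Score > 50, ]

print(passed)
print(passed_alt)
```

Note that subset() returns a new data frame and leaves the original unchanged, so assign the result if you want to keep it.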

🧠 Statistical Terms Explained

Term	Meaning
Mean	Average of values
Median	Middle value when sorted
Mode	Most frequent value
Variance	Average squared deviation from the mean (R's var() divides by n - 1)
SD	Square root of variance
IQR	Interquartile range (Q3 – Q1)
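
The interquartile range from the table can be computed either from the quartiles or with the built-in IQR() function. A brief sketch on mtcars$mpg; the names iqr_manual and iqr_builtin are illustrative.

```r
data(mtcars)
x <- mtcars$mpg

# Q1 and Q3 from quantile(), then their difference
q <- quantile(x, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

# Built-in shortcut from the stats package
iqr_builtin <- IQR(x)

print(iqr_manual)
print(iqr_builtin)
```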

📌 Summary – Recap & Next Steps

Before jumping into advanced modeling, it’s crucial to explore and understand your data using basic statistics. R offers powerful, readable functions to describe data with both numbers and visuals.

🔍 Key Takeaways:

Use mean(), median(), summary(), sd() for quick insight
Explore data structure with str() and head()
Visualize spread with hist() and boxplot()
Clean and subset your data before analysis

⚙️ Real-World Relevance:
These steps apply across fields such as clinical trials, market research, finance, and machine learning pre-processing.

❓ FAQs – Statistical Intro in R

❓ How do I get summary stats of all columns in R?
✅ Use summary(dataset) for min, mean, median, etc.

❓ How can I get the mode in R?
✅ Define a custom function:

get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

❓ What is the best dataset for practicing in R?
✅ Built-in datasets like mtcars and iris, plus diamonds from the ggplot2 package, are good choices for EDA.

❓ How do I visualize variable distributions?
✅ Use:

hist(x), boxplot(x), plot(density(x))   # density() only computes the estimate; wrap it in plot() to draw it

❓ How to detect missing values in R?
✅ Use:

sum(is.na(data))   # Total
colSums(is.na(data))  # Per column
