Author: saqibkhan

Complex Visualization with ggplot2

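We can create faceted plots and combine multiple visualizations. The examples assume that ggplot2 is loaded and that data already has the age_group column created in the Hypothesis Testing section.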

Faceted Plot

# Create faceted plots by age group
ggplot(data, aes(x = height, y = weight)) +
  geom_point(aes(color = age_group), alpha = 0.7) +
  facet_wrap(~ age_group) +
  labs(title = "Height vs Weight by Age Group",
       x = "Height (cm)",
       y = "Weight (kg)") +
  theme_minimal()

Combining Plots

You can also use the patchwork package to combine multiple plots:

# Install and load patchwork
install.packages("patchwork")
library(patchwork)

# Scatter plot and boxplot
scatter_plot <- ggplot(data, aes(x = height, y = weight)) + geom_point() + theme_minimal()
box_plot <- ggplot(data, aes(x = age_group, y = weight)) + geom_boxplot() + theme_minimal()

# Combine plots
combined_plot <- scatter_plot + box_plot + plot_layout(ncol = 2)
print(combined_plot)

October 30, 2024

Time Series Analysis

Let’s analyze a simple time series dataset using the forecast package.

Step 1: Install and Load Forecast

If you don’t have forecast installed, you can install it:

install.packages("forecast")

Then, load the library:

library(forecast)

Step 2: Create a Time Series Dataset

We’ll create a sample time series dataset:

# Generate a time series dataset
set.seed(101)
ts_data <- ts(rnorm(120, mean = 10, sd = 2), frequency = 12, start = c(2020, 1))
plot(ts_data, main = "Sample Time Series Data", ylab = "Value", xlab = "Time")
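
A ts object carries its own time index, which can be inspected with base R helpers. The following is a minimal sketch that rebuilds the same series so it runs on its own; the name first_year is introduced here only for illustration.

```r
# Recreate the series from Step 2 so this block runs on its own
set.seed(101)
ts_data <- ts(rnorm(120, mean = 10, sd = 2), frequency = 12, start = c(2020, 1))

start(ts_data)      # first period: year 2020, month 1
end(ts_data)        # last period: year 2029, month 12 (120 monthly values)
frequency(ts_data)  # 12 observations per year

# window() extracts a sub-period by its time index
first_year <- window(ts_data, start = c(2020, 1), end = c(2020, 12))
length(first_year)  # 12 monthly values
```

Checking start, end, and frequency before modelling is a cheap way to confirm the series covers the period you expect.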

Step 3: Decompose the Time Series

# Decompose the time series
decomposed <- decompose(ts_data)
plot(decomposed)
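
The object returned by decompose() is a list with one series per component. As a rough sketch, assuming the same ts_data as above, we can pull the components out and check that they add back up to the original series (decompose() uses an additive model by default):

```r
# Recreate the series so this block runs on its own
set.seed(101)
ts_data <- ts(rnorm(120, mean = 10, sd = 2), frequency = 12, start = c(2020, 1))
decomposed <- decompose(ts_data)  # additive model by default

trend    <- decomposed$trend     # centered moving average; NA at both ends
seasonal <- decomposed$seasonal  # repeating 12-month pattern
random   <- decomposed$random    # remainder after removing trend and seasonal

# For an additive decomposition the components sum back to the original
reconstructed <- trend + seasonal + random
complete <- !is.na(reconstructed)
all.equal(as.numeric(reconstructed[complete]), as.numeric(ts_data[complete]))
```

Because ts_data is pure random noise, the seasonal pattern estimated here is an artifact of the method rather than a real effect; with real data the seasonal panel is usually more informative.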

Step 4: Forecasting

# Fit an ARIMA model and forecast
fit <- auto.arima(ts_data)
forecasted_values <- forecast(fit, h = 12)

# Plot the forecast
plot(forecasted_values, main = "Forecast for Next 12 Months")
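
One way to judge a forecast is to hold out the last year of data, fit on the rest, and compare predictions with the held-out values. The sketch below deliberately swaps in base R's arima() with a fixed order (0, 0, 0) in place of auto.arima(), so that it runs without the forecast package; that fixed order is an assumption that suits this simulated white-noise series, not a general recommendation.

```r
# Recreate the series so this block runs on its own
set.seed(101)
ts_data <- ts(rnorm(120, mean = 10, sd = 2), frequency = 12, start = c(2020, 1))

# Hold out the final 12 months as a test set
train <- window(ts_data, end = c(2028, 12))
test  <- window(ts_data, start = c(2029, 1))

# Assumption: a constant-mean ARIMA(0,0,0) stands in for auto.arima()
fit <- arima(train, order = c(0, 0, 0))
pred <- predict(fit, n.ahead = 12)$pred

# Root mean squared error of the forecasts against the held-out values
rmse <- sqrt(mean((as.numeric(test) - as.numeric(pred))^2))
rmse
```

With the forecast package loaded, the same split can be passed through auto.arima() and forecast() instead of arima() and predict().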

October 30, 2024

Machine Learning with Caret

We’ll demonstrate a basic machine learning workflow using the caret package for building a predictive model.

Step 1: Install and Load Caret

If you don’t have caret installed, you can do so with:

install.packages("caret")

Then, load the library:

library(caret)

Step 2: Create a Sample Dataset

We’ll use the same dataset but add a binary outcome variable to predict:

# Adding a binary outcome variable, stored as a factor
# (confusionMatrix() in Step 5 requires factors)
data$outcome <- factor(ifelse(data$weight > 70, "Heavy", "Light"))

Step 3: Split the Dataset

Split the data into training and testing sets:

# Set seed for reproducibility
set.seed(123)

# Create a training index
train_index <- createDataPartition(data$outcome, p = 0.7, list = FALSE)

# Split data into training and testing sets
train_data <- data[train_index, ]
test_data <- data[-train_index, ]

Step 4: Train a Model

We’ll train a simple logistic regression model:

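# Train a logistic regression model
# Note: outcome is defined directly from weight, so using weight as a predictor
# separates the classes perfectly; glm is likely to warn about fitted probabilities of 0 or 1
model <- train(outcome ~ height + weight, data = train_data, method = "glm", family = "binomial")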

# Print the model summary
summary(model)

Step 5: Make Predictions

Use the model to make predictions on the test set:

# Make predictions on the test set
predictions <- predict(model, newdata = test_data)

# Confusion matrix to evaluate performance
confusionMatrix(predictions, test_data$outcome)
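
To see what a confusion matrix reports, it can help to compute the headline numbers by hand. The following base R sketch uses made-up predictions and labels (hypothetical values, not output from the model above) and treats "Heavy" as the positive class, which is what caret does by default for the first factor level.

```r
# Hypothetical predictions and true labels, made up for illustration
levels_used <- c("Heavy", "Light")
predicted <- factor(c("Heavy", "Light", "Heavy", "Light", "Light", "Heavy"),
                    levels = levels_used)
actual    <- factor(c("Heavy", "Light", "Light", "Light", "Heavy", "Heavy"),
                    levels = levels_used)

# Cross-tabulate predictions (rows) against true labels (columns)
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Accuracy: share of cases on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Treating "Heavy" as the positive class
sensitivity <- cm["Heavy", "Heavy"] / sum(cm[, "Heavy"])  # true positives / actual positives
specificity <- cm["Light", "Light"] / sum(cm[, "Light"])  # true negatives / actual negatives
```

Here four of the six cases fall on the diagonal, so accuracy is 4/6, and both sensitivity and specificity are 2/3.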

October 30, 2024

Hypothesis Testing

We can perform a t-test to compare the means of weight between two age groups.

Step 1: Create Age Groups

# Create a new variable to classify age groups
data$age_group <- ifelse(data$age > 30, "Above 30", "30 or Below")
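
ifelse() is enough for two groups. For three or more, base R's cut() is one option. This is a sketch only: the break points and labels below are assumptions chosen for illustration, and ages and age_band are hypothetical names.

```r
# Simulated ages in the same 18 to 65 range used elsewhere in these examples
set.seed(456)
ages <- sample(18:65, 100, replace = TRUE)

# Intervals are (17,30], (30,45], (45,65], so every age from 18 to 65 is covered
age_band <- cut(ages,
                breaks = c(17, 30, 45, 65),
                labels = c("18-30", "31-45", "46-65"))

# Count how many people fall in each band
table(age_band)
```

The lowest break is set to 17 because cut() excludes the left endpoint of each interval by default.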

Step 2: Conduct a t-test

# Perform a t-test to compare weights between the two age groups
t_test_result <- t.test(weight ~ age_group, data = data)

# Display the results
print(t_test_result)
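
The object returned by t.test() is a list, so individual results can be extracted rather than read off the printed output. A minimal sketch, rebuilding the sample dataset so the block runs on its own:

```r
# Recreate the sample dataset and the age groups
set.seed(456)
data <- data.frame(
  id = 1:100,
  age = sample(18:65, 100, replace = TRUE),
  height = rnorm(100, mean = 170, sd = 10),
  weight = rnorm(100, mean = 70, sd = 15)
)
data$age_group <- ifelse(data$age > 30, "Above 30", "30 or Below")

t_test_result <- t.test(weight ~ age_group, data = data)

t_test_result$method    # Welch test: equal variances are not assumed by default
t_test_result$estimate  # mean weight in each group
t_test_result$conf.int  # 95% confidence interval for the difference in means
t_test_result$p.value   # p-value for the null hypothesis of equal means
```

Since weight is simulated independently of age, a large p-value is the expected outcome here.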

October 30, 2024

What is vector recycling in R?
If we perform an operation on two R vectors of different lengths, R recycles the elements of the shorter vector, in order, until it matches the length of the longer one, and then performs the operation element by element. If the length of the longer vector is not a multiple of the length of the shorter one, R still returns a result but also issues a warning about the mismatch; when it is a multiple, recycling happens silently.

For example, if we try to run the following addition:
```
c(1, 2, 3, 4, 5) + c(1, 2, 3)
```
The second vector, due to the vector recycling, will actually be converted into c(1, 2, 3, 1, 2). Hence, the final result of this operation will be c(2, 4, 6, 5, 7).

While sometimes vector recycling can be beneficial (e.g., when we expect the cyclicity of values in the vectors), more often, it’s inappropriate and misleading. Hence, we should be careful and mind the vectors’ lengths before performing operations on them.
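
The cases can be compared side by side. In this sketch, tryCatch() is used only to capture the warning text so it can be inspected; the variable names are illustrative.

```r
# Lengths 5 and 3: the shorter vector is recycled, and R warns because
# 5 is not a multiple of 3. tryCatch() captures the warning text here.
warning_text <- tryCatch(
  c(1, 2, 3, 4, 5) + c(1, 2, 3),
  warning = function(w) conditionMessage(w)
)
warning_text

# The result is still computed despite the warning
partial <- suppressWarnings(c(1, 2, 3, 4, 5) + c(1, 2, 3))
partial  # 2 4 6 5 7

# Lengths 6 and 3: 6 is a multiple of 3, so recycling is silent
silent <- c(1, 2, 3, 4, 5, 6) + c(1, 2, 3)
silent   # 2 4 6 5 7 9

# A length-1 vector is recycled across every element
scaled <- c(1, 2, 3) * 2
scaled   # 2 4 6
```

The silent case is the one to watch for, since a length mismatch that happens to divide evenly produces no message at all.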
October 30, 2024

Advanced Visualization with ggplot2

Let’s create more complex visualizations, such as a boxplot and a density plot.

Boxplot

# Boxplot to visualize the distribution of weight by age group
ggplot(data, aes(x = factor(age > 30), y = weight)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Boxplot of Weight by Age Group (Above/Below 30)",
       x = "Age > 30",
       y = "Weight (kg)") +
  theme_minimal()

Density Plot

# Density plot for height
# (fill is set as a constant; a density area cannot take a fill that varies along the curve)
ggplot(data, aes(x = height)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  labs(title = "Density Plot of Height",
       x = "Height (cm)",
       y = "Density") +
  theme_minimal()

October 30, 2024

Data Manipulation with dplyr

In this example, we’ll use the dplyr package for data manipulation. We’ll filter, summarize, and arrange data.

Step 1: Install and Load dplyr

If you don’t have dplyr installed yet, you can install it with:

install.packages("dplyr")

Then, load the library:

library(dplyr)

Step 2: Create a Sample Dataset

If you don’t already have the dataset from the other examples in your session, create it as follows:

# Create a sample dataset
set.seed(456)
data <- data.frame(
  id = 1:100,
  age = sample(18:65, 100, replace = TRUE),
  height = rnorm(100, mean = 170, sd = 10),
  weight = rnorm(100, mean = 70, sd = 15)
)

Step 3: Data Manipulation

Filtering Data: Let’s filter individuals who are above 30 years old.

# Filter data for individuals older than 30
filtered_data <- data %>% filter(age > 30)
head(filtered_data)

Summarizing Data: We can calculate the average height and weight for this filtered group.

# Summarize to get mean height and weight for individuals older than 30
summary_stats <- filtered_data %>%
  summarize(
    mean_height = mean(height),
    mean_weight = mean(weight),
    count = n()
  )
print(summary_stats)

Arranging Data: Sort the dataset by height in descending order.

# Arrange data by height in descending order
arranged_data <- data %>% arrange(desc(height))
head(arranged_data)
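
The three verbs above become more useful when chained. As a sketch, the pipeline below labels each row with an age group, summarizes per group with group_by(), and sorts the result; the name group_stats and the group labels are chosen for illustration.

```r
library(dplyr)

# Recreate the sample dataset so this block runs on its own
set.seed(456)
data <- data.frame(
  id = 1:100,
  age = sample(18:65, 100, replace = TRUE),
  height = rnorm(100, mean = 170, sd = 10),
  weight = rnorm(100, mean = 70, sd = 15)
)

group_stats <- data %>%
  mutate(age_group = ifelse(age > 30, "Above 30", "30 or Below")) %>%  # label each row
  group_by(age_group) %>%                                              # one group per label
  summarize(
    mean_height = mean(height),
    mean_weight = mean(weight),
    count = n()
  ) %>%
  arrange(desc(mean_weight))                                           # highest mean weight first
print(group_stats)
```

Grouping before summarizing yields one row per age group instead of a single row for the whole dataset.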

October 30, 2024

What types of data plots can be created in R?
Data visualization is one of the strengths of the R programming language, and we can create many types of data plots in R:
- Common types of data plots:
  - Bar plot—shows the numerical values of categorical data.
  - Line plot—shows a progression of a variable, usually over time.
  - Scatter plot—shows the relationships between two variables.
  - Area plot—based on a line plot, with the area below the line colored or filled with a pattern.
  - Pie chart—shows the proportion of each category of categorical data as a part of the whole.
  - Box plot—shows a set of descriptive statistics of the data.
- Advanced types of data plots:
  - Violin plot—shows both a set of descriptive statistics of the data and the distribution shape for that data.
  - Heatmap—shows the magnitude of each numeric data point within the dataset.
  - Treemap—shows the numerical values of categorical data, often as a part of the whole.
  - Dendrogram—shows an inner hierarchy and clustering of the data.
  - Bubble plot—shows the relationships between three variables.
  - Hexbin plot—shows the relationships of two numerical variables in a relatively large dataset.
  - Word cloud—shows the frequency of words in an input text.
  - Choropleth map—shows aggregate thematic statistics of geodata.
  - Circular packing chart—shows an inner hierarchy of the data and the values of the data points
  - etc.
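
Several of the common plot types can be drawn with base R graphics alone. The sketch below uses the built-in mtcars and AirPassengers datasets, which are chosen here for illustration and are not part of the examples elsewhere in this text.

```r
# Built-in mtcars dataset: count cars by number of cylinders
cyl_counts <- table(mtcars$cyl)

# Bar plot: number of cars per cylinder count
bar_mids <- barplot(cyl_counts, main = "Cars by Cylinder Count",
                    xlab = "Cylinders", ylab = "Count")

# Scatter plot: relationship between car weight and fuel economy
plot(mtcars$wt, mtcars$mpg, main = "MPG vs Weight",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Box plot: distribution of fuel economy within each cylinder group
box_stats <- boxplot(mpg ~ cyl, data = mtcars, main = "MPG by Cylinder Count",
                     xlab = "Cylinders", ylab = "Miles per gallon")

# Line plot: built-in monthly time series of airline passengers
plot(AirPassengers, main = "Monthly Airline Passengers",
     ylab = "Passengers (thousands)")
```

barplot() and boxplot() also return values (bar midpoints and the five-number summaries, respectively), which is handy when adding labels to a plot afterwards.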
October 30, 2024

Statistical Analysis

We can perform a linear regression analysis to understand the relationship between height and weight.
```
# Linear regression model
model <- lm(weight ~ height, data = data)

# Display the model summary
summary(model)
```
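
Beyond the printed summary, the fitted model can be queried directly. A minimal sketch, rebuilding the sample dataset so the block runs on its own; the heights in new_people are hypothetical values chosen for illustration.

```r
# Recreate the sample dataset
set.seed(456)
data <- data.frame(
  id = 1:100,
  age = sample(18:65, 100, replace = TRUE),
  height = rnorm(100, mean = 170, sd = 10),
  weight = rnorm(100, mean = 70, sd = 15)
)

model <- lm(weight ~ height, data = data)

coef(model)               # intercept and slope (kg per cm of height)
summary(model)$r.squared  # share of weight variance explained by height

# Predict weight for new heights (hypothetical values)
new_people <- data.frame(height = c(160, 175, 190))
predicted_weight <- predict(model, newdata = new_people)
predicted_weight
```

In this simulated dataset height and weight are generated independently, so the slope and R-squared should be close to zero; real measurements would normally show a clear positive relationship.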
October 30, 2024

Data Visualization

Using the ggplot2 package, we can create a scatter plot to visualize the relationship between height and weight.

# Load ggplot2 package
library(ggplot2)

# Create a scatter plot of height vs weight
ggplot(data, aes(x = height, y = weight)) +
  geom_point(color = 'blue') +
  labs(title = "Scatter Plot of Height vs Weight",
   x = "Height (cm)",
   y = "Weight (kg)") +
  theme_minimal()
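
A fitted line often makes the relationship easier to read. As a sketch, the same scatter plot can be extended with geom_smooth(method = "lm"), which overlays a linear fit with a confidence band; the colors and the object name p are arbitrary choices made here.

```r
library(ggplot2)

# Recreate the sample dataset so this block runs on its own
set.seed(456)
data <- data.frame(
  id = 1:100,
  age = sample(18:65, 100, replace = TRUE),
  height = rnorm(100, mean = 170, sd = 10),
  weight = rnorm(100, mean = 70, sd = 15)
)

p <- ggplot(data, aes(x = height, y = weight)) +
  geom_point(color = "blue", alpha = 0.6) +
  # Linear fit with its confidence band drawn on top of the points
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "red") +
  labs(title = "Height vs Weight with Linear Fit",
       x = "Height (cm)",
       y = "Weight (kg)") +
  theme_minimal()
print(p)
```

Storing the plot in a variable before printing makes it easy to reuse, for example when combining plots or saving with ggsave().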

October 30, 2024