Author: saqibkhan

  • Complex Visualization with ggplot2

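    We can create faceted plots and combine multiple visualizations. The examples assume the data data frame from the Data Manipulation section, with the age_group column added in the Hypothesis Testing section.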

    Faceted Plot

    # Create faceted plots by age group
    ggplot(data, aes(x = height, y = weight)) +
      geom_point(aes(color = age_group), alpha = 0.7) +
      facet_wrap(~ age_group) +
      labs(title = "Height vs Weight by Age Group",
           x = "Height (cm)",
           y = "Weight (kg)") +
      theme_minimal()

    Combining Plots

    You can also use the patchwork package to combine multiple plots:

    # Install and load patchwork
    install.packages("patchwork")
    library(patchwork)
    
    # Scatter plot and boxplot
    scatter_plot <- ggplot(data, aes(x = height, y = weight)) + geom_point() + theme_minimal()
    box_plot <- ggplot(data, aes(x = age_group, y = weight)) + geom_boxplot() + theme_minimal()
    
    # Combine plots
    combined_plot <- scatter_plot + box_plot + plot_layout(ncol = 2)
    print(combined_plot)
  • Time Series Analysis

    Let’s analyze a simple time series dataset using the forecast package.

    Step 1: Install and Load Forecast

    If you don’t have forecast installed, you can install it:

    install.packages("forecast")

    Then, load the library:

    library(forecast)
    

    Step 2: Create a Time Series Dataset

    We’ll create a sample time series dataset:

    # Generate a time series dataset
    set.seed(101)
    ts_data <- ts(rnorm(120, mean = 10, sd = 2), frequency = 12, start = c(2020, 1))
    plot(ts_data, main = "Sample Time Series Data", ylab = "Value", xlab = "Time")
    

    Step 3: Decompose the Time Series

    # Decompose the time series
    decomposed <- decompose(ts_data)
    plot(decomposed)
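
    Because ts_data is pure random noise, its decomposition shows no meaningful trend or seasonal pattern. As a minimal sketch, the following builds a series with a known upward trend and a 12-month cycle so the components are easier to recognize. The series demo_ts and all of its parameters are made up for illustration.

    ```r
    # Hypothetical series: linear trend + yearly sine cycle + small noise
    set.seed(101)
    month_index <- 1:120
    demo_values <- 10 + 0.05 * month_index +
      2 * sin(2 * pi * month_index / 12) +
      rnorm(120, sd = 0.5)
    demo_ts <- ts(demo_values, frequency = 12, start = c(2020, 1))

    # decompose() comes with base R (stats) and fits an additive model by default
    demo_parts <- decompose(demo_ts)

    # The seasonal component repeats every 12 observations
    demo_parts$seasonal[1:12]

    # The trend is a centered moving average, so it is NA at both ends
    range(demo_parts$trend, na.rm = TRUE)

    plot(demo_parts)
    ```

    Comparing this plot with the one for ts_data makes it clearer what a real trend and seasonal component look like.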
    

    Step 4: Forecasting

    # Fit an ARIMA model and forecast
    fit <- auto.arima(ts_data)
    forecasted_values <- forecast(fit, h = 12)
    
    # Plot the forecast
    plot(forecasted_values, main = "Forecast for Next 12 Months")
  • Machine Learning with Caret

    We’ll demonstrate a basic machine learning workflow using the caret package for building a predictive model.

    Step 1: Install and Load Caret

    If you don’t have caret installed, you can do so with:

    install.packages("caret")

    Then, load the library:

    library(caret)
    

    Step 2: Create a Sample Dataset

    We’ll use the same dataset but add a binary outcome variable to predict:

    # Add a binary outcome variable
    # Stored as a factor because caret's classification tools, such as confusionMatrix(), expect factors
    data$outcome <- factor(ifelse(data$weight > 70, "Heavy", "Light"))
    

    Step 3: Split the Dataset

    Split the data into training and testing sets:

    # Set seed for reproducibility
    set.seed(123)
    
    # Create a training index
    train_index <- createDataPartition(data$outcome, p = 0.7, list = FALSE)
    
    # Split data into training and testing sets
    train_data <- data[train_index, ]
    test_data <- data[-train_index, ]
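
    For a categorical outcome, createDataPartition() samples within each class, so both sets keep roughly the same class proportions. A rough base R sketch of that idea, not caret's actual implementation, is shown here. The data frame demo and the helper stratified_index() are hypothetical.

    ```r
    # Hypothetical data: 100 rows with an imbalanced two-class outcome
    set.seed(123)
    demo <- data.frame(
      weight = rnorm(100, mean = 70, sd = 15),
      outcome = factor(rep(c("Heavy", "Light"), times = c(30, 70)))
    )

    # Hypothetical helper: draw a fraction p of row indices from each class
    stratified_index <- function(y, p) {
      rows_by_class <- split(seq_along(y), y)
      picked <- lapply(rows_by_class, function(rows) {
        rows[sample.int(length(rows), size = round(p * length(rows)))]
      })
      sort(unlist(picked, use.names = FALSE))
    }

    demo_index <- stratified_index(demo$outcome, p = 0.7)
    demo_train <- demo[demo_index, ]
    demo_test <- demo[-demo_index, ]

    # Class proportions match the original 30/70 mix in both sets
    prop.table(table(demo_train$outcome))
    prop.table(table(demo_test$outcome))
    ```

    A plain sample() over all rows would give a similar split on average, but it can leave a rare class under-represented in one of the sets.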
    

    Step 4: Train a Model

    We’ll train a simple logistic regression model:

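    # Train a logistic regression model
    # Note: outcome is derived directly from weight, so the classes are perfectly
    # separable and glm may warn that fitted probabilities are numerically 0 or 1
    model <- train(outcome ~ height + weight, data = train_data, method = "glm", family = "binomial")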
    
    # Print the model summary
    summary(model)
    

    Step 5: Make Predictions

    Use the model to make predictions on the test set:

    # Make predictions on the test set
    predictions <- predict(model, newdata = test_data)
    
    # Confusion matrix to evaluate performance
    confusionMatrix(predictions, test_data$outcome)
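
    The accuracy figure that confusionMatrix() reports can be reproduced by hand from a cross-tabulation. The sketch below does this in base R with made-up vectors predicted and actual. It covers only the cross-table and accuracy, not the full set of statistics caret prints.

    ```r
    # Hypothetical predictions and true labels for eight test cases
    class_levels <- c("Heavy", "Light")
    predicted <- factor(
      c("Heavy", "Heavy", "Light", "Light", "Light", "Heavy", "Light", "Light"),
      levels = class_levels
    )
    actual <- factor(
      c("Heavy", "Light", "Light", "Light", "Heavy", "Heavy", "Light", "Light"),
      levels = class_levels
    )

    # Cross-tabulate predictions (rows) against true labels (columns)
    conf_table <- table(Predicted = predicted, Actual = actual)
    print(conf_table)

    # Accuracy: share of cases on the diagonal (6 of 8 here)
    accuracy <- sum(diag(conf_table)) / sum(conf_table)
    accuracy
    ```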
  • Hypothesis Testing

    We can perform a t-test to compare the means of weight between two age groups.

    Step 1: Create Age Groups

    # Create a new variable to classify age groups
    data$age_group <- ifelse(data$age > 30, "Above 30", "30 or Below")
    

    Step 2: Conduct a t-test

    # Perform a t-test to compare weights between the two age groups
    t_test_result <- t.test(weight ~ age_group, data = data)
    
    # Display the results
    print(t_test_result)
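
    The object returned by t.test() is a list, so individual results can be pulled out for reporting. A minimal self-contained sketch follows, using a made-up data frame demo in place of data. By default t.test() runs Welch's test, which does not assume equal variances in the two groups.

    ```r
    # Hypothetical data with two age groups of 50 people each
    set.seed(42)
    demo <- data.frame(
      age_group = rep(c("30 or Below", "Above 30"), each = 50),
      weight = c(rnorm(50, mean = 68, sd = 10), rnorm(50, mean = 74, sd = 10))
    )

    demo_result <- t.test(weight ~ age_group, data = demo)

    # Pull out individual pieces of the result
    demo_result$statistic   # t statistic
    demo_result$p.value     # two-sided p-value
    demo_result$conf.int    # 95% confidence interval for the difference in means
    demo_result$estimate    # mean weight in each group

    # Compare the p-value with the conventional 0.05 significance level
    demo_result$p.value < 0.05
    ```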
  • What is vector recycling in R?

    If we perform an operation on two R vectors of different lengths, R recycles the elements of the shorter vector, in order, until it matches the length of the longer one, and then performs the operation element by element. If the longer length is not a whole multiple of the shorter length, R still returns a result but issues a warning about the mismatch. If it is a whole multiple, recycling happens silently.

    For example, if we try to run the following addition:

    c(1, 2, 3, 4, 5) + c(1, 2, 3)

    Because of recycling, the second vector is treated as c(1, 2, 3, 1, 2), so the result is c(2, 4, 6, 5, 7). R also issues a warning here, because 5 is not a multiple of 3.

    Vector recycling is useful when it is intended, for example when applying a single value or a repeating pattern to a longer vector. When the length mismatch is accidental, however, it can produce misleading results, sometimes without any warning. Hence, we should check the vectors’ lengths before performing operations on them.
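
    The sketch below shows recycling with and without a warning. R warns only when the longer length is not a whole multiple of the shorter one. The variable names are chosen for illustration.

    ```r
    # Length 6 is a multiple of length 2: values recycle silently
    c(1, 2, 3, 4, 5, 6) + c(10, 20)
    # 11 22 13 24 15 26

    # Length 5 is not a multiple of length 3: R still recycles, but warns
    uneven_sum <- suppressWarnings(c(1, 2, 3, 4, 5) + c(1, 2, 3))
    uneven_sum
    # 2 4 6 5 7

    # Capture the warning text instead of the result
    recycle_warning <- tryCatch(
      c(1, 2, 3, 4, 5) + c(1, 2, 3),
      warning = function(w) conditionMessage(w)
    )
    recycle_warning

    # A length-one vector is recycled across the whole longer vector
    c(1, 2, 3) * 2
    # 2 4 6
    ```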

  • Advanced Visualization with ggplot2

    Let’s create more complex visualizations, such as a boxplot and a density plot.

    Boxplot

    # Boxplot to visualize the distribution of weight by age group
    ggplot(data, aes(x = factor(age > 30), y = weight)) +
      geom_boxplot(fill = "lightblue") +
      labs(title = "Boxplot of Weight by Age Group (Above/Below 30)",
           x = "Age > 30",
           y = "Weight (kg)") +
      theme_minimal()

    Density Plot

    # Density plot for height
    # A constant fill is used because a filled density area cannot vary its fill along the curve
    ggplot(data, aes(x = height)) +
      geom_density(fill = "lightblue", alpha = 0.5) +
      labs(title = "Density Plot of Height",
           x = "Height (cm)",
           y = "Density") +
      theme_minimal()
  • Data Manipulation with dplyr

    In this example, we’ll use the dplyr package for data manipulation. We’ll filter, summarize, and arrange data.

    Step 1: Install and Load dplyr

    If you don’t have dplyr installed yet, you can install it with:

    install.packages("dplyr")

    Then, load the library:

    library(dplyr)
    

    Step 2: Create a Sample Dataset

    We’ll continue using the previous dataset or create a new one:

    # Create a sample dataset
    set.seed(456)
    data <- data.frame(
      id = 1:100,
      age = sample(18:65, 100, replace = TRUE),
      height = rnorm(100, mean = 170, sd = 10),
      weight = rnorm(100, mean = 70, sd = 15)
    )
    

    Step 3: Data Manipulation

    1. Filtering Data: Let’s filter individuals who are above 30 years old.

    # Filter data for individuals older than 30
    filtered_data <- data %>% filter(age > 30)
    head(filtered_data)

    2. Summarizing Data: We can calculate the average height and weight for this filtered group.

    # Summarize to get mean height and weight for individuals older than 30
    summary_stats <- filtered_data %>%
      summarize(
        mean_height = mean(height),
        mean_weight = mean(weight),
        count = n()
      )
    print(summary_stats)

    3. Arranging Data: Sort the dataset by height in descending order.

    # Arrange data by height in descending order
    arranged_data <- data %>% arrange(desc(height))
    head(arranged_data)
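
    These verbs are usually chained into a single pipeline. As a sketch, the following filters, groups, summarizes, and sorts in one pass. It rebuilds the sample dataset under the name demo so that it runs on its own. The age_band column and its cut point at 45 are assumptions made for illustration.

    ```r
    library(dplyr)

    # Same sample dataset as in Step 2, stored under a separate name
    set.seed(456)
    demo <- data.frame(
      id = 1:100,
      age = sample(18:65, 100, replace = TRUE),
      height = rnorm(100, mean = 170, sd = 10),
      weight = rnorm(100, mean = 70, sd = 15)
    )

    band_summary <- demo %>%
      # Keep people older than 30
      filter(age > 30) %>%
      # Hypothetical age bands for grouping
      mutate(age_band = ifelse(age > 45, "46-65", "31-45")) %>%
      group_by(age_band) %>%
      # One row of statistics per band
      summarize(
        mean_height = mean(height),
        mean_weight = mean(weight),
        count = n()
      ) %>%
      # Band with the highest mean weight first
      arrange(desc(mean_weight))

    print(band_summary)
    ```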
  • What types of data plots can be created in R?

    Data visualization is one of the strengths of the R programming language, and we can create a wide range of data plots in R:

    • Common types of data plots:
      • Bar plot—shows the numerical values of categorical data.
      • Line plot—shows a progression of a variable, usually over time.
      • Scatter plot—shows the relationships between two variables.
      • Area plot—based on a line plot, with the area below the line colored or filled with a pattern.
      • Pie chart—shows the proportion of each category of categorical data as a part of the whole.
      • Box plot—shows a set of descriptive statistics of the data.
    • Advanced types of data plots:
      • Violin plot—shows both a set of descriptive statistics of the data and the distribution shape for that data.
      • Heatmap—shows the magnitude of each numeric data point within the dataset.
      • Treemap—shows the numerical values of categorical data, often as a part of the whole.
      • Dendrogram—shows an inner hierarchy and clustering of the data.
      • Bubble plot—shows the relationships between three variables.
      • Hexbin plot—shows the relationships of two numerical variables in a relatively large dataset.
      • Word cloud—shows the frequency of words in an input text.
      • Choropleth map—shows aggregate thematic statistics of geodata.
      • Circular packing chart—shows an inner hierarchy of the data and the values of the data points
      • etc.


  • Statistical Analysis

    We can perform a linear regression analysis to understand the relationship between height and weight.

    # Linear regression model
    model <- lm(weight ~ height, data = data)
    
    # Display the model summary
    summary(model)
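
    In the sample dataset, height and weight are generated independently, so this regression should find little or no relationship. As a minimal sketch of how to query a fitted model, the following uses a made-up data frame demo whose weights are generated from a known linear relationship, so the estimates can be sanity-checked against the true slope of 0.9.

    ```r
    # Hypothetical data: weight depends on height with slope 0.9 plus noise
    set.seed(1)
    demo <- data.frame(height = rnorm(100, mean = 170, sd = 10))
    demo$weight <- -83 + 0.9 * demo$height + rnorm(100, sd = 5)

    demo_model <- lm(weight ~ height, data = demo)

    # Estimated intercept and slope
    coef(demo_model)

    # 95% confidence intervals for the coefficients
    confint(demo_model)

    # Share of variance in weight explained by height
    summary(demo_model)$r.squared

    # Predicted weight for two new heights
    demo_pred <- predict(demo_model, newdata = data.frame(height = c(160, 180)))
    demo_pred
    ```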
  • Data Visualization

    Using the ggplot2 package, we can create a scatter plot to visualize the relationship between height and weight.

    # Load ggplot2 package
    library(ggplot2)

    # Create a scatter plot of height vs weight
    ggplot(data, aes(x = height, y = weight)) +
      geom_point(color = 'blue') +
      labs(title = "Scatter Plot of Height vs Weight",
           x = "Height (cm)",
           y = "Weight (kg)") +
      theme_minimal()