Category: Examples


  • Simulation Studies

    Simulation studies are crucial for understanding the behavior of statistical methods. Here’s an example of simulating the Central Limit Theorem.

    Step 1: Set Parameters

    # Set parameters
    n <- 30  # Sample size
    num_simulations <- 1000  # Number of simulations
    
    # Set seed for reproducibility
    set.seed(123)
    

    Step 2: Simulate Sample Means

    # Simulate sample means from an exponential distribution
    sample_means <- replicate(num_simulations, mean(rexp(n, rate = 1)))
    
    # View the first few sample means
    head(sample_means)
    

    Step 3: Plot the Distribution of Sample Means

    # Plot the distribution of sample means
    hist(sample_means, breaks = 30, main = "Distribution of Sample Means",
         xlab = "Sample Mean", col = "lightblue", probability = TRUE)
    lines(density(sample_means), col = "red")
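    The CLT predicts that these sample means are approximately normal with mean 1 and standard deviation 1/sqrt(n), since an Exp(rate = 1) distribution has mean 1 and standard deviation 1. A quick base-R sanity check of that prediction (re-running the simulation from Steps 1–2 so the sketch is self-contained):

```r
# Re-run the simulation: for Exp(rate = 1), mean = 1 and sd = 1,
# so sample means should be close to N(1, 1/sqrt(n))
set.seed(123)
n <- 30
num_simulations <- 1000
sample_means <- replicate(num_simulations, mean(rexp(n, rate = 1)))

# Compare the empirical mean and sd to the CLT prediction
empirical_mean <- mean(sample_means)
empirical_sd   <- sd(sample_means)
theoretical_sd <- 1 / sqrt(n)

cat("Empirical mean:", round(empirical_mean, 3), "(theory: 1)\n")
cat("Empirical sd:  ", round(empirical_sd, 3),
    "(theory:", round(theoretical_sd, 3), ")\n")
```

    Both empirical values should land within a few hundredths of the theoretical ones; increasing num_simulations tightens the agreement.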
  • Functional Programming with R

    Functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions. In R, you can use base functions such as lapply and sapply, or map from the purrr package.

    Step 1: Install and Load purrr

    install.packages("purrr")
    library(purrr)
    

    Step 2: Create a Sample List

    # Create a list of numeric vectors
    num_list <- list(a = 1:5, b = 6:10, c = 11:15)
    

    Step 3: Use lapply and sapply

    # Apply a function to each element using lapply
    squared_list <- lapply(num_list, function(x) x^2)
    print(squared_list)
    
    # Use sapply to simplify the output to a matrix
    squared_matrix <- sapply(num_list, function(x) x^2)
    print(squared_matrix)
    

    Step 4: Use map from purrr

    # Use map to apply a function and return a list
    squared_map <- map(num_list, ~ .x^2)
    print(squared_map)
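    purrr also offers typed variants such as map_dbl that guarantee the result's type; base R's vapply gives similar type safety without any packages. A minimal base-R sketch:

```r
num_list <- list(a = 1:5, b = 6:10, c = 11:15)

# vapply is like sapply, but you declare the expected output shape with
# FUN.VALUE; it raises an error instead of silently returning a surprising type
sums <- vapply(num_list, sum, FUN.VALUE = numeric(1))
print(sums)  # named numeric vector: a = 15, b = 40, c = 65
```

    This makes vapply a safer choice than sapply inside functions, where an unexpected list or matrix result can cause hard-to-trace bugs.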
  • Integration with Databases using DBI and RMySQL

    R can connect to databases to perform data analysis on large datasets. Here’s how to connect to a MySQL database.

    Step 1: Install and Load Required Packages

    install.packages("DBI")
    install.packages("RMySQL")
    library(DBI)
    library(RMySQL)
    

    Step 2: Connect to the Database

    # Connect to the MySQL database
    con <- dbConnect(RMySQL::MySQL(),
                     dbname = "your_database_name",
                     host = "your_host",
                     user = "your_username",
                     password = "your_password")

    Step 3: Query Data

    # Query data from a table
    data_db <- dbGetQuery(con, "SELECT * FROM your_table_name LIMIT 100")
    
    # View the queried data
    head(data_db)
    

    Step 4: Disconnect from the Database

    # Disconnect from the database
    dbDisconnect(con)
  • Geographic Data Analysis with sf and ggplot2

    Geospatial data analysis is crucial for visualizing and analyzing spatial relationships. We’ll use the sf package for handling spatial data.

    Step 1: Install and Load sf

    install.packages("sf")
    install.packages("ggplot2")  # Make sure ggplot2 is installed
    library(sf)
    library(ggplot2)
    

    Step 2: Load Geographic Data

    For this example, you can use built-in datasets or download shapefiles. Here, we’ll use a simple example with the nc dataset from the sf package.

    # Load the North Carolina shapefile (included in the sf package)
    nc <- st_read(system.file("shape/nc.shp", package = "sf"))
    
    # Plot the geographic data
    ggplot(data = nc) +
      geom_sf() +
      labs(title = "North Carolina Counties",
           x = "Longitude", y = "Latitude") +
      theme_minimal()

    Step 3: Analyze and Visualize Attributes

    # Calculate the area of each county and add it as a new column
    # (st_area returns a units object; convert to plain numeric for plotting)
    nc$area <- as.numeric(st_area(nc))
    
    # Plot with area as fill
    ggplot(data = nc) +
      geom_sf(aes(fill = area)) +
      labs(title = "Area of North Carolina Counties",
           fill = "Area (sq meters)") +
      theme_minimal()
  • Advanced Statistical Modeling with Mixed-Effects Models

    Mixed-effects models are useful when dealing with data that have both fixed and random effects. We’ll use the lme4 package for this.

    Step 1: Install and Load lme4

    install.packages("lme4")
    library(lme4)
    

    Step 2: Create a Sample Dataset

    # Create a sample dataset with random effects
    set.seed(222)
    data_mixed <- data.frame(
      id = rep(1:10, each = 10),
      x = rnorm(100),
      y = rnorm(100)
    )
    
    # Introduce a random effect
    data_mixed$y <- data_mixed$y + rep(rnorm(10, mean = 5, sd = 1), each = 10)
    

    Step 3: Fit a Mixed-Effects Model

    # Fit a mixed-effects model
    model_mixed <- lmer(y ~ x + (1 | id), data = data_mixed)
    
    # Display the model summary
    summary(model_mixed)
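    The random intercept (1 | id) in this model captures the per-id shifts introduced above, which were drawn from N(5, 1). As a base-R intuition check (no lme4 required), the per-group means should scatter around 5 with a spread of roughly 1:

```r
# Recreate the same dataset (assumes the same seed as in Step 2)
set.seed(222)
data_mixed <- data.frame(
  id = rep(1:10, each = 10),
  x = rnorm(100),
  y = rnorm(100)
)
data_mixed$y <- data_mixed$y + rep(rnorm(10, mean = 5, sd = 1), each = 10)

# Per-group means are crude estimates of the group-level intercepts;
# lmer's random-effect estimates shrink these toward the overall mean
group_means <- tapply(data_mixed$y, data_mixed$id, mean)
cat("Mean of group means:", round(mean(group_means), 2), "\n")  # near 5
cat("SD of group means:  ", round(sd(group_means), 2), "\n")
```

    The SD of the group means reflects both the between-group variance and a little within-group noise, which is exactly what the random-intercept variance component in the lmer summary separates out.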
  • Network Analysis with igraph

    Network analysis is essential for understanding relationships in data. We’ll use the igraph package.

    Step 1: Install and Load igraph

    install.packages("igraph")
    library(igraph)
    

    Step 2: Create a Sample Graph

    # Create a sample graph
    edges <- data.frame(
      from = c("A", "A", "B", "C", "C", "D", "E"),
      to = c("B", "C", "D", "D", "E", "E", "A")
    )
    
    # Create a graph object
    graph <- graph_from_data_frame(edges, directed = TRUE)
    
    # Plot the graph
    plot(graph, vertex.color = "lightblue", vertex.size = 30, edge.arrow.size = 0.5,
         main = "Sample Directed Graph")

    Step 3: Analyze the Graph

    # Calculate degree centrality
    degree_centrality <- degree(graph)
    print(degree_centrality)
    
    # Find the connected component each vertex belongs to
    component_membership <- components(graph)$membership
    print(component_membership)
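    Degree centrality here is simply the number of edges touching each vertex, so you can cross-check igraph's answer straight from the edge list with base R alone:

```r
edges <- data.frame(
  from = c("A", "A", "B", "C", "C", "D", "E"),
  to = c("B", "C", "D", "D", "E", "E", "A")
)

# Total degree = appearances as a source plus appearances as a target
degree_by_hand <- table(c(edges$from, edges$to))
print(degree_by_hand)  # A = 3, B = 2, C = 3, D = 3, E = 3
```

    These totals match what degree(graph) reports for the directed graph (in- plus out-degree).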
  • Text Analysis with tm and wordcloud

    Text analysis is vital for extracting insights from unstructured data. Here, we’ll analyze a simple text corpus.

    Step 1: Install and Load Required Packages

    install.packages("tm")
    install.packages("wordcloud")
    library(tm)
    library(wordcloud)
    

    Step 2: Create a Sample Text Corpus

    # Create a sample text corpus
    texts <- c("R is great for data analysis.",
               "Data science is an exciting field.",
               "R and Python are popular programming languages.",
               "Data visualization is key to understanding data.")
    
    # Create a corpus
    corpus <- Corpus(VectorSource(texts))
    
    # Preprocess the text (lower case, remove punctuation and stopwords)
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("en"))

    Step 3: Create a Term-Document Matrix

    # Create a term-document matrix
    tdm <- TermDocumentMatrix(corpus)
    tdm_matrix <- as.matrix(tdm)
    word_freqs <- sort(rowSums(tdm_matrix), decreasing = TRUE)
    word_freqs_df <- data.frame(word = names(word_freqs), freq = word_freqs)
    

    Step 4: Generate a Word Cloud

    # Create a word cloud
    set.seed(1234)
    wordcloud(words = word_freqs_df$word, freq = word_freqs_df$freq, min.freq = 1,
              max.words = 100, random.order = FALSE, rot.per = 0.35,
              colors = brewer.pal(8, "Dark2"))
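    The term frequencies driving the cloud can be sanity-checked without tm, using only base string functions (a rough tokenizer: it lower-cases and strips punctuation, but does not remove stopwords):

```r
texts <- c("R is great for data analysis.",
           "Data science is an exciting field.",
           "R and Python are popular programming languages.",
           "Data visualization is key to understanding data.")

# Lower-case, strip punctuation, split on whitespace, then count
words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", texts)), "\\s+"))
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts)  # "data" tops the list with 4 occurrences
```

    Because this version keeps stopwords, words like "is" also rank highly; the tm pipeline above filters those out before building the term-document matrix.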
  • Clustering with k-means

    Clustering is a powerful technique for grouping similar data points. We’ll use the k-means algorithm.

    Step 1: Create a Sample Dataset

    # Generate a sample dataset
    set.seed(111)
    cluster_data <- data.frame(
      x = rnorm(100),
      y = rnorm(100)
    )
    
    # Visualize the data
    plot(cluster_data$x, cluster_data$y, main = "Sample Data for Clustering", xlab = "X", ylab = "Y")
    

    Step 2: Apply k-means Clustering

    # Apply k-means clustering
    kmeans_result <- kmeans(cluster_data, centers = 3)
    
    # Add the cluster assignments to the dataset
    cluster_data$cluster <- as.factor(kmeans_result$cluster)
    
    # Plot the clusters
    library(ggplot2)
    ggplot(cluster_data, aes(x = x, y = y, color = cluster)) +
      geom_point() +
      labs(title = "K-means Clustering Result") +
      theme_minimal()
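    Choosing centers = 3 above was arbitrary. A common heuristic is the elbow method: plot the total within-cluster sum of squares against k and look for the point where the curve flattens (base R only):

```r
set.seed(111)
cluster_data <- data.frame(x = rnorm(100), y = rnorm(100))

# Total within-cluster sum of squares for k = 1..8
# (nstart = 10 reruns each k from several random starts for stability)
wss <- sapply(1:8, function(k) {
  kmeans(cluster_data, centers = k, nstart = 10)$tot.withinss
})

# Look for the "elbow" where adding clusters stops paying off
plot(1:8, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster SS", main = "Elbow Method")
```

    For pure noise like this sample, the curve declines smoothly with no clear elbow, which is itself a hint that the data may not contain real clusters.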
  • Complex Visualization with ggplot2

    We can create faceted plots and combine multiple visualizations.

    Faceted Plot

    # Create faceted plots by age group
    ggplot(data, aes(x = height, y = weight)) +
      geom_point(aes(color = age_group), alpha = 0.7) +
      facet_wrap(~ age_group) +
      labs(title = "Height vs Weight by Age Group",
           x = "Height (cm)",
           y = "Weight (kg)") +
      theme_minimal()

    Combining Plots

    You can also use the patchwork package to combine multiple plots:

    # Install and load patchwork
    install.packages("patchwork")
    library(patchwork)
    
    # Scatter plot and boxplot
    scatter_plot <- ggplot(data, aes(x = height, y = weight)) + geom_point() + theme_minimal()
    box_plot <- ggplot(data, aes(x = age_group, y = weight)) + geom_boxplot() + theme_minimal()
    
    # Combine plots
    combined_plot <- scatter_plot + box_plot + plot_layout(ncol = 2)
    print(combined_plot)