Author: saqibkhan

  • How to parse a date from its string representation in R?

    To parse a date from its string representation in R, we can use the lubridate package from the tidyverse collection. This package offers functions that parse a string and return a standard date based on the order of the date components in that string. These functions include ymd(), ymd_hm(), ymd_hms(), dmy(), dmy_hm(), dmy_hms(), mdy(), mdy_hm(), mdy_hms(), etc., where y, m, d, h, m, and s stand for year, month, day, hours, minutes, and seconds, respectively.

    For example, if we run the dmy() function passing to it any of the strings “05-11-2023”, “05/11/2023” or “05.11.2023”, representing the same date, we’ll receive the same result: 2023-11-05. This is because in all three cases, despite having different dividing symbols, we actually have the same pattern: the day followed by the month followed by the year.
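
    For comparison, the same three strings can be parsed in base R. This is only a sketch and not part of lubridate: as.Date() takes one explicit format, so the separators are first normalized with gsub() (a step assumed here for illustration), whereas dmy() accepts all three separators directly.

    ```r
    # The three representations of the same date from the text above
    date_strings <- c("05-11-2023", "05/11/2023", "05.11.2023")

    # Normalize "/" and "." to "-" so that one format string fits all three
    normalized <- gsub("[/.]", "-", date_strings)

    # Day-month-year, the same component order that dmy() assumes
    parsed <- as.Date(normalized, format = "%d-%m-%Y")
    print(parsed)
    ```

    All three elements of parsed hold the same date, 2023-11-05.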

  • Advanced Statistical Modeling with Mixed-Effects Models

    Mixed-effects models are useful when dealing with data that have both fixed and random effects. We’ll use the lme4 package for this.

    Step 1: Install and Load lme4

    install.packages("lme4")
    library(lme4)
    

    Step 2: Create a Sample Dataset

    # Create a sample dataset with random effects
    set.seed(222)
    data_mixed <- data.frame(
      id = rep(1:10, each = 10),
      x = rnorm(100),
      y = rnorm(100)
    )
    
    # Introduce a random effect
    data_mixed$y <- data_mixed$y + rep(rnorm(10, mean = 5, sd = 1), each = 10)
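
    Before fitting a model, it can help to check that the groups really differ. The following is a minimal base R sketch that rebuilds the dataset from this step and compares the mean of y within each id; group_means is a name introduced here for illustration.

    ```r
    # Rebuild the sample dataset from this step
    set.seed(222)
    data_mixed <- data.frame(
      id = rep(1:10, each = 10),
      x = rnorm(100),
      y = rnorm(100)
    )
    data_mixed$y <- data_mixed$y + rep(rnorm(10, mean = 5, sd = 1), each = 10)

    # Mean of y within each id; the spread across ids reflects the per-group shift
    group_means <- aggregate(y ~ id, data = data_mixed, FUN = mean)
    print(group_means)

    # Variation between the group means
    print(sd(group_means$y))
    ```

    If the group means vary noticeably, a random intercept for id, written as (1 | id) in the model formula, is a reasonable starting point.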
    

    Step 3: Fit a Mixed-Effects Model

    # Fit a mixed-effects model
    model_mixed <- lmer(y ~ x + (1 | id), data = data_mixed)
    
    # Display the model summary
    summary(model_mixed)
  • How to create a new column in a data frame in R based on other columns?

    1. Using base R's transform() and ifelse() functions:

    df <- data.frame(col_1 = c(1, 3, 5, 7),  col_2 = c(8, 6, 4, 2))
    print(df)

    # Adding the column col_3 to the data frame df
    df <- transform(df, col_3 = ifelse(col_1 < col_2, col_1 + col_2, col_1 * col_2))
    print(df)

    Output:

      col_1 col_2
    1     1     8
    2     3     6
    3     5     4
    4     7     2
      col_1 col_2 col_3
    1     1     8     9
    2     3     6     9
    3     5     4    20
    4     7     2    14

    2. Using base R's with() and ifelse() functions:

    df <- data.frame(col_1 = c(1, 3, 5, 7),  col_2 = c(8, 6, 4, 2))
    print(df)

    # Adding the column col_3 to the data frame df
    df["col_3"] <- with(df, ifelse(col_1 < col_2, col_1 + col_2, col_1 * col_2))
    print(df)

    Output:

      col_1 col_2
    1     1     8
    2     3     6
    3     5     4
    4     7     2
      col_1 col_2 col_3
    1     1     8     9
    2     3     6     9
    3     5     4    20
    4     7     2    14

    3. Using base R's apply() function:

    df <- data.frame(col_1 = c(1, 3, 5, 7),  col_2 = c(8, 6, 4, 2))
    print(df)

    # Adding the column col_3 to the data frame df (the function is applied to each row)
    df["col_3"] <- apply(df, 1, FUN = function(x) if(x[1] < x[2]) x[1] + x[2] else x[1] * x[2])
    print(df)

    Output:

      col_1 col_2
    1     1     8
    2     3     6
    3     5     4
    4     7     2
      col_1 col_2 col_3
    1     1     8     9
    2     3     6     9
    3     5     4    20
    4     7     2    14

    4. Using the mutate() function of the dplyr package and base R's ifelse() function:

    library(dplyr)

    df <- data.frame(col_1 = c(1, 3, 5, 7),  col_2 = c(8, 6, 4, 2))
    print(df)

    # Adding the column col_3 to the data frame df
    df <- mutate(df, col_3 = ifelse(col_1 < col_2, col_1 + col_2, col_1 * col_2))
    print(df)

    Output:

      col_1 col_2
    1     1     8
    2     3     6
    3     5     4
    4     7     2
      col_1 col_2 col_3
    1     1     8     9
    2     3     6     9
    3     5     4    20
    4     7     2    14
  • Network Analysis with igraph

    Network analysis is essential for understanding relationships in data. We’ll use the igraph package.

    Step 1: Install and Load igraph

    install.packages("igraph")
    library(igraph)
    

    Step 2: Create a Sample Graph

    # Create a sample graph
    edges <- data.frame(
      from = c("A", "A", "B", "C", "C", "D", "E"),
      to = c("B", "C", "D", "D", "E", "E", "A")
    )
    
    # Create a graph object
    graph <- graph_from_data_frame(edges, directed = TRUE)
    
    # Plot the graph
    plot(graph, vertex.color = "lightblue", vertex.size = 30, edge.arrow.size = 0.5,
         main = "Sample Directed Graph")

    Step 3: Analyze the Graph

    # Calculate degree centrality (incoming plus outgoing edges by default)
    degree_centrality <- degree(graph)
    print(degree_centrality)

    # Identify the largest (weakly) connected component
    comp <- components(graph)
    largest_component <- names(comp$membership)[comp$membership == which.max(comp$csize)]
    print(largest_component)
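
    The degree values can be cross-checked without igraph. The sketch below counts, in base R, how often each vertex appears as a source and as a target in the same edge list; vertices, out_degree, in_degree, and total_degree are names introduced here for illustration. The totals should correspond to what degree() reports in its default mode, which counts edges in both directions.

    ```r
    # Same edge list as in Step 2
    edges <- data.frame(
      from = c("A", "A", "B", "C", "C", "D", "E"),
      to = c("B", "C", "D", "D", "E", "E", "A"),
      stringsAsFactors = FALSE
    )

    # Use a common set of levels so every vertex appears in both tables
    vertices <- sort(unique(c(edges$from, edges$to)))
    out_degree <- table(factor(edges$from, levels = vertices))
    in_degree <- table(factor(edges$to, levels = vertices))

    # Total degree = outgoing + incoming edges per vertex
    total_degree <- out_degree + in_degree
    print(total_degree)
    ```

    Here B has a total degree of 2 and every other vertex has 3, which sums to twice the number of edges.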
  • Text Analysis with tm and wordcloud

    Text analysis is vital for extracting insights from unstructured data. Here, we’ll analyze a simple text corpus.

    Step 1: Install and Load Required Packages

    install.packages("tm")
    install.packages("wordcloud")
    library(tm)
    library(wordcloud)
    

    Step 2: Create a Sample Text Corpus

    # Create a sample text corpus
    texts <- c("R is great for data analysis.",
               "Data science is an exciting field.",
               "R and Python are popular programming languages.",
               "Data visualization is key to understanding data.")

    # Create a Corpus
    corpus <- Corpus(VectorSource(texts))

    # Preprocess the text (convert to lower case, remove punctuation and stop words)
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("en"))

    Step 3: Create a Term-Document Matrix

    # Create a term-document matrix
    tdm <- TermDocumentMatrix(corpus)
    tdm_matrix <- as.matrix(tdm)
    word_freqs <- sort(rowSums(tdm_matrix), decreasing = TRUE)
    word_freqs_df <- data.frame(word = names(word_freqs), freq = word_freqs)
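
    To see what the term-document matrix is counting, similar word frequencies can be approximated in base R. This is a rough sketch rather than a replacement for tm: it skips stop-word removal, and it keeps short words such as "r" and "is" that tm's default minimum word length of three characters would drop. The names cleaned, tokens, and base_freqs are introduced here for illustration.

    ```r
    texts <- c("R is great for data analysis.",
               "Data science is an exciting field.",
               "R and Python are popular programming languages.",
               "Data visualization is key to understanding data.")

    # Lower-case the text and strip punctuation
    cleaned <- gsub("[[:punct:]]", "", tolower(texts))

    # Split on whitespace and flatten into a single vector of tokens
    tokens <- unlist(strsplit(cleaned, "\\s+"))

    # Count each word and sort from most to least frequent
    base_freqs <- sort(table(tokens), decreasing = TRUE)
    print(head(base_freqs, 5))
    ```

    The word "data" comes out on top with four occurrences across the four sentences.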
    

    Step 4: Generate a Word Cloud

    # Create a word cloud
    set.seed(1234)
    wordcloud(words = word_freqs_df$word, freq = word_freqs_df$freq, min.freq = 1,
              max.words = 100, random.order = FALSE, rot.per = 0.35,
              colors = brewer.pal(8, "Dark2"))
  • What is the difference between the subset() and sample() functions in R?

    The subset() function in R is used for extracting rows and columns from a data frame or a matrix, or elements from a vector, based on certain conditions, e.g., subset(my_vector, my_vector > 10).

    In contrast, the sample() function operates on vectors. It draws a random sample of a given size from the elements of a vector, with or without replacement, e.g., sample(my_vector, size = 5, replace = TRUE).
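
    The contrast can be sketched on a small vector. The values in my_vector and the data frame df are made up for illustration.

    ```r
    my_vector <- c(4, 8, 15, 16, 23, 42)

    # subset(): deterministic, keeps the elements that satisfy the condition
    filtered <- subset(my_vector, my_vector > 10)
    print(filtered)

    # sample(): random, so a seed is set to make the draw reproducible
    set.seed(1)
    drawn <- sample(my_vector, size = 5, replace = TRUE)
    print(drawn)

    # subset() on a data frame can filter rows and select columns in one call
    df <- data.frame(id = 1:6, value = my_vector)
    print(subset(df, value > 10, select = value))
    ```

    subset() always returns the same four elements here, while the result of sample() depends on the seed.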

  • Clustering with k-means

    Clustering is a powerful technique for grouping similar data points. We’ll use the k-means algorithm.

    Step 1: Create a Sample Dataset

    # Generate a sample dataset
    set.seed(111)
    cluster_data <- data.frame(
      x = rnorm(100),
      y = rnorm(100)
    )
    
    # Visualize the data
    plot(cluster_data$x, cluster_data$y, main = "Sample Data for Clustering", xlab = "X", ylab = "Y")
    

    Step 2: Apply k-means Clustering

    # Apply k-means clustering
    kmeans_result <- kmeans(cluster_data, centers = 3)
    
    # Add the cluster assignments to the dataset
    cluster_data$cluster <- as.factor(kmeans_result$cluster)
    
    # Plot the clusters
    library(ggplot2)
    ggplot(cluster_data, aes(x = x, y = y, color = cluster)) +
      geom_point() +
      labs(title = "K-means Clustering Result") +
      theme_minimal()
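
    The choice of centers = 3 is arbitrary here, since the sample data is drawn without any built-in cluster structure. One common heuristic, sketched below as an addition rather than as part of the walkthrough, is to compare the total within-cluster sum of squares across several values of k. The names k_values and wss are introduced for illustration, and nstart = 10 is an added setting that reruns k-means from several random starts.

    ```r
    # Rebuild the sample dataset from Step 1
    set.seed(111)
    cluster_data <- data.frame(
      x = rnorm(100),
      y = rnorm(100)
    )

    # Total within-cluster sum of squares for k = 1..6
    k_values <- 1:6
    wss <- sapply(k_values, function(k) {
      kmeans(cluster_data, centers = k, nstart = 10)$tot.withinss
    })

    # Look for the "elbow" where adding clusters stops helping much
    plot(k_values, wss, type = "b",
         xlab = "Number of clusters (k)", ylab = "Total within-cluster SS",
         main = "Elbow Plot")
    ```

    With purely random data like this, the curve may not show a clear elbow, which is itself a hint that no natural grouping exists.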
  • What is the difference between the str() and summary() functions in R?

    The str() function returns the structure of an R object and the overall information about it, the exact contents of which depend on the data structure of that object. For example, for a vector, it returns the data type of its items, the range of item indices, and the item values (or several first values, if the vector is too long). For a data frame, it returns its class (data.frame), the number of observations and variables, the column names, the data type of each column, and several first values of each column.

    The summary() function returns the summary statistics for an R object. It’s mostly applied to data frames and matrices, for which it returns the minimum, maximum, mean, and median values, and the 1st and 3rd quartiles for each numeric column, while for the factor columns, it returns the count of each level.
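
    A short sketch on a made-up data frame shows the difference; the column names age and group are chosen here only for illustration.

    ```r
    df <- data.frame(
      age = c(23, 35, 41, 29),
      group = factor(c("a", "b", "a", "b"))
    )

    # Structure: class, dimensions, column types, and the first few values
    str(df)

    # Summary statistics: quartiles and mean for numeric columns, counts for factors
    summary(df)

    # summary() on a single numeric vector returns min, quartiles, median, mean, max
    age_summary <- summary(df$age)
    print(age_summary)
    ```

    str() is the quicker way to check types and dimensions, while summary() is more useful for a first look at the distribution of the values.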

  • What is the use of the next and break statements in R?

    The next statement is used to skip a particular iteration and jump to the next one if a certain condition is met. The break statement is used to stop and exit the loop at a particular iteration if a certain condition is met. When used in one of the inner loops of a nested loop, this statement exits only that inner loop.

    Both next and break can be used in any type of loop in R: for loops, while loops, and repeat loops. They can also be used together in the same loop, e.g.:

    for (i in 1:10) {
      if (i < 5)
        next
      if (i == 8)
        break
      print(i)
    }

    Output:

    [1] 5
    [1] 6
    [1] 7
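
    The point about nested loops can be sketched as follows: break leaves only the inner loop, so the outer loop carries on with its next iteration. The second part uses a repeat loop, which relies on break to terminate. The names visited and counter are introduced here for illustration.

    ```r
    # break inside the inner loop does not stop the outer loop
    visited <- character(0)
    for (i in 1:3) {
      for (j in 1:3) {
        if (j == 2)
          break          # exits the inner loop only
        visited <- c(visited, paste(i, j))
      }
    }
    print(visited)

    # repeat has no condition of its own, so break is what ends it
    counter <- 0
    repeat {
      counter <- counter + 1
      if (counter %% 2 == 1)
        next             # skip the rest of the body for odd values
      if (counter >= 6)
        break
    }
    print(counter)
    ```

    The first loop records "1 1", "2 1", and "3 1", one entry per outer iteration, and the repeat loop stops once counter reaches 6.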