- R is optimized for vectorized operations. Instead of writing explicit loops, use vector arithmetic and vectorized functions, or the apply family (apply, sapply) where no vectorized form exists, for better performance and cleaner code.
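As a minimal sketch of the difference, the snippet below squares a vector three ways: with an explicit loop, with vector arithmetic, and with sapply. The data and variable names are illustrative assumptions, not part of the original tip.

```r
# Illustrative data (hypothetical values)
x <- 1:10

# Loop version: preallocate the result, then fill it element by element
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: the operator works on the whole vector at once
squares_vec <- x^2

# sapply version: calls the function on each element and simplifies to a vector
squares_sapply <- sapply(x, function(v) v^2)

# Check that the three approaches agree
all.equal(squares_loop, squares_vec)
all.equal(squares_vec, squares_sapply)
```

For simple arithmetic like this, the vectorized form is usually the shortest and fastest; sapply is mainly useful when the per-element work has no vectorized equivalent.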
Author: saqibkhan
-
Use Vectorized Operations
-
Comment Your Code
- Use comments (#) to explain your code. This is especially important for complex analyses or when you revisit your code after some time. Clear comments help you and others understand the logic behind your code.
-
Use the Tidyverse
- The tidyverse is a collection of R packages designed for data science. It includes dplyr, ggplot2, tidyr, readr, and more, which provide intuitive functions for data manipulation and visualization. Familiarizing yourself with these packages can streamline your data analysis workflow.
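To give a feel for the style, here is a small sketch of a dplyr pipeline on the built-in mtcars dataset. It assumes dplyr is already installed; the column names mean_mpg and n_cars and the horsepower threshold are illustrative choices only.

```r
library(dplyr)

# Mean fuel economy per cylinder count, for cars with more than 100 horsepower
mpg_by_cyl <- mtcars %>%
  filter(hp > 100) %>%                              # keep the more powerful cars
  group_by(cyl) %>%                                 # one group per cylinder count
  summarise(mean_mpg = mean(mpg), n_cars = n()) %>% # summarise each group
  arrange(desc(mean_mpg))                           # highest mean mpg first

print(mpg_by_cyl)
```

Each step takes a data frame and returns a data frame, which is what lets the verbs chain together with the pipe.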
-
List and define the control statements in R.
There are three groups of control statements in R: conditional statements, loop statements, and jump statements.
Conditional statements:
- if: tests whether a given condition is true and provides operations to perform if it is.
- if-else: tests whether a given condition is true, provides operations to perform if it is, and another set of operations to perform otherwise.
- if... else if... else: tests a series of conditions one by one, provides operations to perform for each condition that is true, and a fallback set of operations to perform if none of the conditions is true.
- switch: evaluates an expression against the items of a list and returns a value from the list based on the result of this evaluation.
Loop statements:
- for: iterates over a sequence, running the loop body once per element.
- while: repeats a set of operations as long as a predefined logical condition (or several logical conditions) holds, checking it before each iteration.
- repeat: continues performing the same set of operations until a predefined break condition (or several break conditions) is met.
Jump statements:
- next: skips a particular iteration of a loop and jumps to the next one if a certain condition is met.
- break: stops and exits the loop at a particular iteration if a certain condition is met.
- return: exits a function and returns the result.
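The statements listed above can be sketched in a few lines of R. This is only an illustration: the helper names classify_number and fruit_for, and all the values used, are hypothetical.

```r
# if / else if / else, with return exiting the function
classify_number <- function(x) {
  if (x < 0) {
    return("negative")
  } else if (x == 0) {
    return("zero")
  } else {
    return("positive")
  }
}

# switch with named cases and an unnamed default as the last argument
fruit_for <- function(colour) {
  switch(colour,
         red = "tomato",
         yellow = "lemon",
         "unknown")
}

# for loop with next (skip even numbers) and break (stop once i exceeds 7)
odd_values <- c()
for (i in 1:10) {
  if (i %% 2 == 0) next
  if (i > 7) break
  odd_values <- c(odd_values, i)
}

# while loop: the condition is checked before every iteration
total <- 0
k <- 1
while (k <= 5) {
  total <- total + k
  k <- k + 1
}

# repeat loop: runs until an explicit break
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break
}
```

Note that repeat has no condition of its own, so forgetting the break produces an infinite loop.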
-
Simulation Studies
Simulation studies are crucial for understanding the behavior of statistical methods. Here’s an example of simulating the Central Limit Theorem.
Step 1: Set Parameters

    # Set parameters
    n <- 30                  # Sample size
    num_simulations <- 1000  # Number of simulations

    # Set seed for reproducibility
    set.seed(123)

Step 2: Simulate Sample Means

    # Simulate sample means from an exponential distribution
    sample_means <- replicate(num_simulations, mean(rexp(n, rate = 1)))

    # View the first few sample means
    head(sample_means)

Step 3: Plot the Distribution of Sample Means

    # Plot the distribution of sample means
    hist(sample_means, breaks = 30, main = "Distribution of Sample Means",
         xlab = "Sample Mean", col = "lightblue", probability = TRUE)

    # Overlay a kernel density estimate
    lines(density(sample_means), col = "red")

-
What is the difference between the functions apply(), lapply(), sapply(), and tapply()?
While all these functions allow iterating over a data structure without using loops and perform the same operation on each element of it, they are different in terms of the type of input and output and the function they perform.
- apply(): takes in a data frame, a matrix, or an array and returns a vector, a list, a matrix, or an array. The function can be applied over rows, over columns, or over both.
- lapply(): takes in a vector, a list, or a data frame and always returns a list. For a data frame input, the function is applied column-wise.
- sapply(): takes the same inputs as lapply() but simplifies the result where possible: a vector when every call returns a single value, a matrix when every call returns a vector of the same length (greater than one), and a list otherwise.
- tapply(): applies a function (typically a summary statistic) to subsets of a vector defined by the levels of one or more factors (i.e., categorical data).
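A short sketch may make the differences in input and output concrete. The inputs below (m, scores, values, groups) are hypothetical values chosen only for illustration.

```r
# Illustrative inputs
m <- matrix(1:6, nrow = 2)                    # 2 x 3 matrix
scores <- list(a = 1:3, b = 4:6)              # named list of vectors
values <- c(10, 20, 30, 40)
groups <- factor(c("x", "y", "x", "y"))

# apply: MARGIN = 1 works over rows, MARGIN = 2 over columns
row_sums <- apply(m, 1, sum)
col_sums <- apply(m, 2, sum)

# lapply: always returns a list, one element per input element
means_list <- lapply(scores, mean)

# sapply: same computation, simplified to a named numeric vector
means_vec <- sapply(scores, mean)

# tapply: one result per factor level
group_means <- tapply(values, groups, mean)
```

When the shape of the result matters, for example inside a function, vapply() is a stricter base R alternative to sapply() because it requires the expected return type to be declared.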
-
Functional Programming with R
Functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions. In R, you can use base functions like lapply and sapply, and map from the purrr package.

Step 1: Install and Load purrr

    install.packages("purrr")
    library(purrr)

Step 2: Create a Sample List

    # Create a list of numeric vectors
    num_list <- list(a = 1:5, b = 6:10, c = 11:15)

Step 3: Use lapply and sapply

    # Apply a function to each element using lapply
    squared_list <- lapply(num_list, function(x) x^2)
    print(squared_list)

    # Use sapply to simplify the output to a matrix
    squared_matrix <- sapply(num_list, function(x) x^2)
    print(squared_matrix)

Step 4: Use map from purrr

    # Use map to apply a function and return a list
    squared_map <- map(num_list, ~ .x^2)
    print(squared_map)

-
What is the use of the switch() function in R?
The switch() function in R is a multiway branch control statement that evaluates an expression against items of a list. It has the following syntax:

    switch(expression, case_1, case_2, case_3....)

The expression passed to the switch() function can evaluate to either a number or a character string, and the function behaves differently in each case.

1. If the expression evaluates to a number, the switch() function returns the item from the list based on positional matching (i.e., its index is equal to the number the expression evaluates to). If the number is greater than the number of items in the list, the switch() function returns NULL. For example:

    switch(2, "circle", "triangle", "square")

Output:

    "triangle"

2. If the expression evaluates to a character string, the switch() function returns the value based on its name:

    switch("red", "green"="apple", "orange"="carrot", "red"="tomato", "yellow"="lemon")

Output:

    "tomato"

If there are multiple matches, the first matched value is returned. It’s also possible to add an unnamed item as the last argument of the switch() function that will be a default fallback option in the case of no matches. If this default option isn’t set, and if there are no matches, the function returns NULL.

The switch() function is an efficient alternative to long if-else statements since it makes the code less repetitive and more readable. Typically, it’s used for evaluating a single expression. We can still write more complex nested switch constructs for evaluating multiple expressions. However, in this form, the switch() function quickly becomes hard to read and hence loses its main advantage over if-else constructs.
-
Integration with Databases using DBI and RMySQL
R can connect to databases to perform data analysis on large datasets. Here’s how to connect to a MySQL database.
Step 1: Install and Load Required Packages

    install.packages("DBI")
    install.packages("RMySQL")
    library(DBI)
    library(RMySQL)

Step 2: Connect to the Database

    # Connect to the MySQL database
    con <- dbConnect(RMySQL::MySQL(),
                     dbname = "your_database_name",
                     host = "your_host",
                     user = "your_username",
                     password = "your_password")

Step 3: Query Data

    # Query data from a table
    data_db <- dbGetQuery(con, "SELECT * FROM your_table_name LIMIT 100")

    # View the queried data
    head(data_db)

Step 4: Disconnect from the Database

    # Disconnect from the database
    dbDisconnect(con)

-
Geographic Data Analysis with sf and ggplot2
Geospatial data analysis is crucial for visualizing and analyzing spatial relationships. We’ll use the sf package for handling spatial data.

Step 1: Install and Load sf

    install.packages("sf")
    install.packages("ggplot2")  # Make sure ggplot2 is installed
    library(sf)
    library(ggplot2)

Step 2: Load Geographic Data

For this example, you can use built-in datasets or download shapefiles. Here, we’ll use a simple example with the nc dataset from the sf package.

    # Load the North Carolina shapefile (included in the sf package)
    nc <- st_read(system.file("shape/nc.shp", package = "sf"))

    # Plot the geographic data
    ggplot(data = nc) +
      geom_sf() +
      labs(title = "North Carolina Counties",
           x = "Longitude", y = "Latitude") +
      theme_minimal()

Step 3: Analyze and Visualize Attributes

    # Calculate the area of each county and add it as a new column.
    # st_area() returns values with units attached; as.numeric() drops them
    # so that ggplot2 treats the fill as a plain continuous variable.
    nc$area <- as.numeric(st_area(nc))

    # Plot with area as fill
    ggplot(data = nc) +
      geom_sf(aes(fill = area)) +
      labs(title = "Area of North Carolina Counties",
           fill = "Area (sq meters)") +
      theme_minimal()