Category: Interview Questions

  • How do you remove columns from a data frame in R?

    1. By using the select() function of the dplyr package of the tidyverse collection. The name of each column to delete is passed in with a minus sign before it:

    df <- select(df, -col_1, -col_3)

    If there are many columns to delete, it is simpler to list the columns to keep instead. In this case, the syntax is similar, but the names of the columns to keep aren’t preceded by a minus sign:

    df <- select(df, col_2, col_4)

    2. By using the built-in subset() function of base R. To delete a single column, we assign its name, preceded by a minus sign, to the select parameter of the function. To delete more than one column, we assign to this parameter a vector of the column names, preceded by a minus sign:

    df <- subset(df, select=-col_1)
    df <- subset(df, select=-c(col_1, col_3))

    If there are many columns to delete, it is simpler to list the columns to keep instead. In this case, the syntax is similar, but no minus sign is added:

    df <- subset(df, select=col_2)
    df <- subset(df, select=c(col_2, col_4))
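
    The subset() calls above can be tried on a small data frame. This is a minimal sketch using only base R; the data frame and its values are invented for illustration, and the column names simply follow the placeholders used above.

```r
# Illustrative data frame (values are invented for this example)
df <- data.frame(
  col_1 = 1:3,
  col_2 = c("a", "b", "c"),
  col_3 = c(TRUE, FALSE, TRUE),
  col_4 = c(1.5, 2.5, 3.5)
)

# Drop col_1 and col_3 by negating a vector of column names
dropped <- subset(df, select = -c(col_1, col_3))

# Keep col_2 and col_4 by listing them without a minus sign
kept <- subset(df, select = c(col_2, col_4))

# Both approaches leave the same two columns
print(names(dropped))
print(names(kept))
```

    Which form to use is mostly a matter of which list of column names is shorter.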
  • How do you add a new column to a data frame in R?

    1. Using the $ symbol:
    df <- data.frame(col_1=10:13, col_2=c("a", "b", "c", "d"))
    print(df)

    df$col_3 <- c(5, 1, 18, 16)
    print(df)

    Output:

      col_1 col_2
    1    10     a
    2    11     b
    3    12     c
    4    13     d
      col_1 col_2 col_3
    1    10     a     5
    2    11     b     1
    3    12     c    18
    4    13     d    16

    2. Using square brackets:
    df <- data.frame(col_1=10:13, col_2=c("a", "b", "c", "d"))
    print(df)

    df["col_3"] <- c(5, 1, 18, 16)
    print(df)

    Output:

      col_1 col_2
    1    10     a
    2    11     b
    3    12     c
    4    13     d
      col_1 col_2 col_3
    1    10     a     5
    2    11     b     1
    3    12     c    18
    4    13     d    16

    3. Using the cbind() function:
    df <- data.frame(col_1=10:13, col_2=c("a", "b", "c", "d"))
    print(df)

    df <- cbind(df, col_3=c(5, 1, 18, 16))
    print(df)

    Output:

      col_1 col_2
    1    10     a
    2    11     b
    3    12     c
    4    13     d
      col_1 col_2 col_3
    1    10     a     5
    2    11     b     1
    3    12     c    18
    4    13     d    16

    In each of the three cases, we can assign a single value or a vector or calculate the new column based on the existing columns of that data frame or other data frames.
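
    As a rough illustration of that last point, the sketch below adds one column from a single value and two columns computed from existing ones. The names col_3 to col_5 and the calculations are arbitrary choices made for this example.

```r
df <- data.frame(col_1 = 10:13, col_2 = c("a", "b", "c", "d"),
                 stringsAsFactors = FALSE)

# A single value is recycled to fill every row
df$col_3 <- 0

# A new column calculated from an existing numeric column
df["col_4"] <- df$col_1 * 2

# cbind() with a column derived from an existing character column
df <- cbind(df, col_5 = toupper(df$col_2))

print(df)
```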

  • How do you create a data frame in R?

    1. From one or more vectors of the same length—by using the data.frame() function:

    df <- data.frame(vector_1, vector_2)

    2. From a matrix—by using the data.frame() function:

    df <- data.frame(my_matrix)

    3. From a list of vectors of the same length—by using the data.frame() function:

    df <- data.frame(list_of_vectors)

    4. From other data frames:

    • To combine the data frames horizontally (only if the data frames have the same number of rows and the rows describe the same records in the same order)—by using the cbind() function:
    df <- cbind(df1, df2)
    • To combine the data frames vertically (only if they have the same column names with compatible data types; rbind() matches the columns of data frames by name)—by using the rbind() function:
    df <- rbind(df1, df2)
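
    The four approaches can be put together in one small sketch. The input objects reuse the placeholder names from above with invented values; the result names (df_vectors, df_matrix, and so on) are chosen only for this example.

```r
# 1. From vectors of the same length
vector_1 <- 1:3
vector_2 <- c("a", "b", "c")
df_vectors <- data.frame(vector_1, vector_2)

# 2. From a matrix (unnamed columns get default names X1, X2)
my_matrix <- matrix(1:6, nrow = 3)
df_matrix <- data.frame(my_matrix)

# 3. From a named list of vectors of the same length
list_of_vectors <- list(x = 1:3, y = c(2.5, 3.5, 4.5))
df_list <- data.frame(list_of_vectors)

# 4. From other data frames
df_wide <- cbind(df_vectors, df_list)  # same number of rows
df_long <- rbind(df_list, df_list)     # same column names

print(dim(df_wide))
print(dim(df_long))
```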
  • What is a package in R, and how do you install and load packages?

    An R package is a collection of functions, code, data, and documentation, representing an extension of the R programming language and designed for solving specific kinds of tasks. R comes with a bunch of preinstalled packages, and other packages can be installed by users from repositories. The most popular centralized repository storing thousands of various R packages is called Comprehensive R Archive Network (CRAN).

    To install an R package directly from CRAN, we need to pass the package name enclosed in quotation marks to the install.packages() function, as follows: install.packages("package_name"). To install more than one package from CRAN in one go, we need to use a character vector containing the package names enclosed in quotation marks, as follows: install.packages(c("package_name_1", "package_name_2")). To install an R package manually, we first need to download the package archive to our computer (for type="source", this is the source archive, typically a .tar.gz file) and then pass its path to the install.packages() function with repos=NULL:

    install.packages("path_to_the_locally_stored_package_archive", repos=NULL, type="source")

    To load an installed R package in the working R environment, we can use either the library() or the require() function. Each of them takes in the package name, with or without quotation marks, and loads the package, e.g., library(caret). However, the behavior of these functions is different when they can’t find the necessary package: library() throws an error and stops the program execution, while require() issues a warning, returns FALSE, and lets the program continue.
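
    The difference between the two loading functions can be seen in a short sketch. It assumes that the stats package (which ships with R) stands in for any installed package, and that notARealPackage12345 is a hypothetical name for a package that is not installed.

```r
# "stats" ships with R, so it stands in for any installed package here
pkg <- "stats"

# Install only when the package is missing, then load it
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}
library(pkg, character.only = TRUE)

# Hypothetical package name, assumed not to be installed
missing_pkg <- "notARealPackage12345"

# require() returns FALSE (with a warning) instead of stopping
require_result <- suppressWarnings(
  require(missing_pkg, character.only = TRUE, quietly = TRUE)
)

# library() signals an error, which is caught here for demonstration
library_result <- tryCatch(
  {
    library(missing_pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)

print(require_result)
print(library_result)
```

    Because require() only returns FALSE, a script that relies on it may fail later at a less obvious place, which is why library() is often the safer choice at the top of a script.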

  • How do you import data into R?

    Base R provides essential functions for importing data:

    • read.table()—the most general base R function for importing data; it reads tabular data with any kind of field separator, including less common ones, such as |.
    • read.csv()—comma-separated values (CSV) files with . as the decimal separator.
    • read.csv2()—semicolon-separated values files with , as the decimal separator.
    • read.delim()—tab-separated values (TSV) files with . as the decimal separator.
    • read.delim2()—tab-separated values (TSV) files with , as the decimal separator.

    In practice, any of these functions can be used to import tabular data with any kind of field and decimal separators: which one we use for a given file format is only a matter of convention and default settings. For example, here is the abbreviated syntax of the first function: read.table(file, header = FALSE, sep = "", dec = "."). The other functions have the same parameters with different default settings that can always be explicitly overridden.
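
    To see that the functions differ mainly in their defaults, the sketch below writes a small semicolon-separated file with decimal commas to a temporary location and reads it three ways. The file contents and the variable names are invented for this example.

```r
# Write a tiny semicolon-separated file with "," as the decimal separator
path <- tempfile(fileext = ".csv")
writeLines(c("id;price", "1;2,5", "2;3,75"), path)

# read.csv2() has matching defaults for this format
via_csv2 <- read.csv2(path)

# read.table() needs the header and separators spelled out
via_table <- read.table(path, header = TRUE, sep = ";", dec = ",")

# read.csv() works too once its defaults are overridden
via_csv <- read.csv(path, sep = ";", dec = ",")

print(via_csv2)
print(all.equal(via_csv2, via_table))
```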

    The tidyverse packages readr and readxl provide some other functions for importing specific file formats. Each of those functions can be further fine-tuned by setting various optional parameters.

    readr

    • read_tsv()—tab-separated values (TSV) files.
    • read_fwf()—fixed-width files.
    • read_log()—web log files.
    • read_table(), read_csv(), read_csv2(), and read_delim()—counterparts of the corresponding base R functions.

    readxl

    • read_excel()—Excel files.
    • read_xls() and read_xlsx()—Excel files whose format (.xls or .xlsx) is already known.

    To dive deeper into data loading in R, you can go through the tutorial on How to Import Data Into R.

  • List and define some basic data structures in R.

    1. Vector—a one-dimensional data structure used for storing values of the same data type.
    2. List—a one-dimensional data structure used for storing values of any data type and/or other data structures, including other lists.
    3. Matrix—a two-dimensional data structure used for storing values of the same data type.
    4. Data frame—a two-dimensional data structure used for storing values of any data type, but each column must store values of the same data type.
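
    Each of these structures can be created in one line of base R. The sketch below uses invented values and variable names purely for illustration.

```r
# Vector: all values share one type
v <- c(1.5, 2, 3)

# Mixing types in a vector coerces everything to a common type
mixed <- c(1, "a")

# List: elements may have different types and lengths
l <- list(42, "text", TRUE, c(1, 2, 3))

# Matrix: two dimensions, one type
m <- matrix(1:6, nrow = 2)

# Data frame: two dimensions, one type per column
d <- data.frame(id = 1:2, name = c("x", "y"), stringsAsFactors = FALSE)
column_classes <- sapply(d, class)

print(class(mixed))
print(dim(m))
print(column_classes)
```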
  • List and define some basic data types in R.

    There are a few data types in R, including: 

    1. Numeric—decimal numbers.
    2. Integer—whole numbers.
    3. Character—a letter, number, or symbol, or any combination of them, enclosed in double or single quotation marks.
    4. Factor—categories from a predefined set of possible values, often with an intrinsic order.
    5. Logical—the Boolean values TRUE and FALSE, represented under the hood as 1 and 0, respectively.
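
    A quick way to check these types is the class() function. The sketch below uses invented values; the sizes factor is a made-up example of categories with an intrinsic order.

```r
# class() reports the data type of each value
types <- c(
  numeric   = class(3.14),
  integer   = class(3L),
  character = class('text'),
  factor    = class(factor(c("low", "high"))),
  logical   = class(TRUE)
)
print(types)

# An ordered factor supports comparisons between its levels
sizes <- factor(c("small", "large", "small"),
                levels = c("small", "large"), ordered = TRUE)
small_before_large <- sizes[1] < sizes[2]

# Logical values behave as 1 and 0 in arithmetic
true_count <- sum(c(TRUE, FALSE, TRUE))
```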
  • What are some disadvantages of using R?

    • Non-intuitive syntax and hence a steep learning curve, especially for beginners in programming
    • Relatively slow
    • Inefficient memory usage
    • Inconsistent and often hard-to-read documentation of packages
    • Some packages are of low quality or poorly maintained
    • Potential security concerns due to its open-source nature
  • What is R, and what are its main characteristics?

    R is a programming language and environment widely used for solving data science problems and particularly designed for statistical computing and data visualization. Its main characteristics include:

    • Open source
    • Interpreted (code runs without a separate compilation step), with support for both functional and object-oriented programming
    • Highly extensible due to its large collection of data science packages
    • Functional and flexible (users can define their own functions, as well as tune various parameters of existing functions)
    • Compatible with many operating systems
    • Can be easily integrated with other programming languages and frameworks
    • Allows powerful statistical computing
    • Offers a variety of data visualization tools for creating publication-quality charts
    • Equipped with a command-line interface
    • Supported by a strong online community
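
    The functional and object-oriented sides of R mentioned above can be hinted at with a few lines of base R. This is only a sketch: the function names square and area and the class name circle are invented for the example, and S3 is just one of several object systems available in R.

```r
# Functional style: a user-defined function applied to each element
square <- function(x) x^2
squares <- sapply(1:4, square)

# Object-oriented style (S3): a generic function with a class-specific method
area <- function(shape) UseMethod("area")
area.circle <- function(shape) pi * shape$r^2

# An object is a list tagged with a class attribute
circle <- structure(list(r = 1), class = "circle")
circle_area <- area(circle)

print(squares)
print(circle_area)
```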