Category: Interview Questions

  • What is the difference between the with() and within() functions?

    The with() function evaluates an R expression on one or more variables of a data frame and returns the result without touching the data frame. The within() function evaluates the same kind of expression but returns a modified copy of the data frame; the original data frame itself is left unchanged. Below we can see how these functions work using a sample data frame as an example:

    df <- data.frame(a = c(1, 2, 3),  b = c(10, 20, 30))
    print(df)
    
    with(df, a * b)
    
    print(within(df, c <- a * b))
    

    Output:

      a  b
    1 1 10
    2 2 20
    3 3 30
    
    [1] 10 40 90
      a  b  c
    1 1 10 10
    2 2 20 40
    3 3 30 90
  • What is Shiny in R?

    Shiny is an open-source R package for building fully interactive web applications and web pages for data science quickly and easily, using only R and without any knowledge of HTML, CSS, or JavaScript. Shiny offers numerous basic and advanced features, widgets, layouts, and example web apps with their underlying code to build upon and customize, as well as user showcases from various fields (technology, sports, banking, education, etc.) gathered and categorized by the Shiny developer community.

  • List and define the various approaches to estimating model accuracy in R.

    Below are several approaches and how to implement them in the caret package of R.

    • Data splitting—the entire dataset is split into a training set and a test set: the first is used to fit the model, and the second to test its performance on unseen data. This approach works particularly well on large datasets. To implement data splitting in R, we use the createDataPartition() function and set the p parameter to the proportion of data that goes to training.
    • Bootstrap resampling—random samples are repeatedly drawn from the dataset with replacement, and the model is estimated on each of them. To implement bootstrap resampling in R, we set the method parameter of the trainControl() function to "boot" when defining the training control of the model.
    • Cross-validation methods
      • k-fold cross-validation—the dataset is split into k subsets (folds). The model is trained on k-1 folds and tested on the remaining one. The process is repeated so that each fold serves as the test set once, and the final model accuracy is estimated as the average over the k runs.
      • Repeated k-fold cross-validation—the same as k-fold cross-validation, except that the whole procedure is repeated several times with different random splits into k folds. The model accuracy is estimated for each repetition, and the final accuracy is calculated as the average over all repetitions.
      • Leave-one-out cross-validation (LOOCV)—one observation is set aside and the model is trained on all the remaining observations. The process is repeated for every observation in the dataset.

    To implement these cross-validation methods in R, we need to set the method parameter of the trainControl() function to "cv", "repeatedcv", or "LOOCV", respectively, when defining the training control of the model.
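
    As a sketch, the four approaches above can be set up as follows with caret and the built-in iris data (assuming the caret package is installed; the dataset, model method, and parameter values are illustrative):

```r
library(caret)

set.seed(42)

# Data splitting: 80% of rows go to training, stratified by the outcome
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

# Bootstrap resampling: 25 bootstrap iterations
boot_control <- trainControl(method = "boot", number = 25)

# 10-fold cross-validation
cv_control <- trainControl(method = "cv", number = 10)

# Repeated 10-fold cross-validation with 3 repeats
rcv_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Leave-one-out cross-validation
loocv_control <- trainControl(method = "LOOCV")

# Any of these controls can then be passed to train(), e.g.:
model <- train(Species ~ ., data = train_set, method = "rpart",
               trControl = cv_control)
```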

  • What are correlation and covariance, and how do you calculate them in R?

    Correlation is a measure of the strength and direction of the linear relationship between two variables. It takes values from -1 (a perfect negative correlation) to 1 (a perfect positive correlation). Covariance measures how two variables vary together: a positive covariance means they tend to increase or decrease together, while a negative one means that one tends to increase as the other decreases. Unlike correlation, covariance has no fixed range; its magnitude depends on the units of the variables.

    In R, the correlation is calculated with the cor() function and the covariance with the cov() function. The syntax of both functions is identical: we pass in the two variables (vectors) for which we want to calculate the measure (e.g., cor(vector_1, vector_2) or cov(vector_1, vector_2)), or a whole data frame if we want the correlation or covariance between all of its variables (e.g., cor(df) or cov(df)). For two vectors, the result is a single value; for a data frame, it is a correlation (or covariance) matrix.
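
    For illustration, both functions can be tried on the built-in mtcars data (the column choices here are arbitrary):

```r
x <- mtcars$mpg  # fuel efficiency
y <- mtcars$wt   # car weight

cor(x, y)  # a single value in [-1, 1]; negative here, since heavier cars have lower mpg
cov(x, y)  # a single value with no fixed range, expressed in the variables' units

cor_matrix <- cor(mtcars)  # 11 x 11 correlation matrix of all columns
cov_matrix <- cov(mtcars)  # 11 x 11 covariance matrix
cor_matrix["mpg", "wt"]    # same value as cor(x, y)
```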

  • How to select features for machine learning in R?

    Let’s consider three different approaches and how to implement them in the caret package.

    1. By detecting and removing highly correlated features from the dataset.

    We need to create a correlation matrix of all the features and then identify the highly correlated ones, usually those with an absolute correlation coefficient greater than 0.75:

    corr_matrix <- cor(features)
    highly_correlated <- findCorrelation(corr_matrix, cutoff=0.75)
    print(highly_correlated)
    2. By ranking the data frame features by their importance.

    We need to create a training scheme to control the parameters for train, use it to build a selected model, and then estimate the variable importance for that model:

    control <- trainControl(method="repeatedcv", number=10, repeats=5)
    model <- train(response_variable~., data=df, method="lvq", preProcess="scale", trControl=control)
    importance <- varImp(model)
    print(importance)
    3. By automatically selecting the optimal features.

    One of the most popular methods provided by caret for automatically selecting the optimal features is a backward selection algorithm called Recursive Feature Elimination (RFE).

    We need to define the control using a selected resampling method and a predefined list of functions, run the RFE algorithm, passing it the features, the target variable, the feature subset sizes to evaluate, and the control, and then extract the selected predictors:

    control <- rfeControl(functions=caretFuncs, method="cv", number=10)
    results <- rfe(features, target_variable, sizes=c(1:8), rfeControl=control)
    print(predictors(results))
  • What packages are used for machine learning in R?

    • caret—for various classification and regression algorithms.
    • e1071—for support vector machines (SVM), naive Bayes classifier, bagged clustering, fuzzy clustering, and k-nearest neighbors (KNN).
    • kernlab—provides kernel-based methods for classification, regression, and clustering algorithms.
    • randomForest—for random forest classification and regression algorithms.
    • xgboost—for gradient boosting, linear regression, and decision tree algorithms.
    • rpart—for recursive partitioning in classification, regression, and survival trees.
    • glmnet—for lasso and elastic-net regularization methods applied to linear regression, logistic regression, and multinomial regression algorithms.
    • nnet—for neural networks and multinomial log-linear algorithms.
    • tensorflow—the R interface to TensorFlow, for deep neural networks and numerical computation using data flow graphs.
    • keras—the R interface to Keras, for deep neural networks.
  • What are regular expressions, and how do you work with them in R?

    A regular expression, or regex, in R as in other programming languages, is a character or a sequence of characters that describes a certain text pattern and is used for matching, searching, and manipulating text data. In R, there are two main ways of working with regular expressions:

    1. Using base R and its functions (such as grep(), regexpr(), gsub(), regmatches(), etc.) to locate, match, extract, and replace pattern matches.
    2. Using the stringr package from the tidyverse collection. This is often a more convenient way to work with regex in R, since the stringr functions have more intuitive names and a more consistent syntax and offer more extensive functionality.
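
    A small side-by-side sketch of the two approaches (the sample strings are made up; the stringr calls assume the package is installed):

```r
text <- c("apple pie", "banana split", "cherry tart")

# Base R
grep("an", text)                            # indices of matching elements: 2
grepl("^a", text)                           # logical vector: TRUE FALSE FALSE
sub("pie", "crumble", text[1])              # "apple crumble"
regmatches(text, regexpr("^[a-z]+", text))  # first word of each element

# stringr equivalents
library(stringr)
str_which(text, "an")                   # 2
str_detect(text, "^a")                  # TRUE FALSE FALSE
str_replace(text[1], "pie", "crumble")  # "apple crumble"
str_extract(text, "^[a-z]+")            # "apple" "banana" "cherry"
```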
  • List and define the control statements in R.

    There are three groups of control statements in R: conditional statements, loop statements, and jump statements.

    Conditional statements:

    • if—tests whether a given condition is true and, if so, performs the specified operations.
    • if-else—tests whether a given condition is true, performing one set of operations if it is and another set otherwise.
    • if... else if... else—tests a series of conditions one by one, performing the operations associated with the first condition that is true, or a fallback set of operations if none of the conditions is true.
    • switch—evaluates an expression against the items of a list and returns a value from the list based on the results of this evaluation.
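
    A minimal sketch of these conditional statements (the values are chosen purely for illustration):

```r
x <- 0

# if / else if / else: the first true condition wins
if (x > 0) {
  sign_label <- "positive"
} else if (x < 0) {
  sign_label <- "negative"
} else {
  sign_label <- "zero"
}
print(sign_label)  # "zero"

# switch: returns the value whose name matches the expression
shape <- "square"
n_sides <- switch(shape, triangle = 3, square = 4, hexagon = 6)
print(n_sides)  # 4
```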

    Loop statements:

    • for—iterates over the elements of a sequence, executing the loop body once for each element.
    • while—repeats the loop body as long as a predefined logical condition (or several conditions) holds; the condition is checked before each iteration.
    • repeat—repeats the loop body unconditionally until a predefined break condition (or several break conditions) is met inside the loop.

    Jump statements:

    • next—skips a particular iteration of a loop and jumps to the next one if a certain condition is met.
    • break—stops and exits the loop at a particular iteration if a certain condition is met.
    • return—exits a function and returns the result.
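
    The loop and jump statements above can be sketched together in one example (the numbers are arbitrary):

```r
# for loop with next and break
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next  # skip even numbers
  if (i > 7) break       # exit the loop once i passes 7
  total <- total + i     # accumulates 1 + 3 + 5 + 7
}
print(total)  # 16

# while loop: the condition is checked before each iteration
n <- 1
while (n < 100) {
  n <- n * 2
}
print(n)  # 128, the first power of 2 not less than 100

# repeat loop: runs until an explicit break
countdown <- 3
repeat {
  countdown <- countdown - 1
  if (countdown == 0) break
}

# return: exits a function with a result
square <- function(x) {
  return(x^2)
}
print(square(4))  # 16
```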
  • What is the difference between the functions apply(), lapply(), sapply(), and tapply()?

    While all these functions allow iterating over a data structure without explicit loops, applying the same operation to each of its elements, they differ in the types of input they accept, the output they return, and the way the operation is applied.

    • apply()—takes in a data frame, a matrix, or an array and returns a vector, a list, a matrix, or an array. This function can be applied row-wise, column-wise, or both.
    • lapply()—takes in a vector, a list, or a data frame and always returns a list. In the case of a data frame as an input, this function is applied only column-wise.
    • sapply()—takes in a vector, a list, or a data frame and returns the most simplified data structure possible: typically a vector, or a matrix when the applied function returns vectors of equal length; if no simplification is possible, it falls back to a list.
    • tapply()—applies a function to subsets of a vector grouped by the levels of a factor (i.e., categorical data) and returns the summary statistic for each group.
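
    A quick sketch of each function on small built-in or made-up data:

```r
m <- matrix(1:6, nrow = 2)  # filled column-wise: rows are (1, 3, 5) and (2, 4, 6)

apply(m, 1, sum)  # row-wise: 9 12
apply(m, 2, sum)  # column-wise: 3 7 11

lst <- list(a = 1:3, b = 4:6)
lapply(lst, mean)  # always a list: list(a = 2, b = 5)
sapply(lst, mean)  # simplified to a named vector: a = 2, b = 5

# tapply: a summary statistic per factor level
tapply(iris$Sepal.Length, iris$Species, mean)
```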
  • What is the use of the switch() function in R?

    The switch() function in R is a multiway branch control statement that evaluates an expression against items of a list. It has the following syntax:

    switch(expression, case_1, case_2, case_3, ...)

    The expression passed to the switch() function can evaluate to either a number or a character string, and depending on this, the function behavior is different.

    1. If the expression evaluates to a number, the switch() function returns the item from the list based on positional matching (i.e., its index is equal to the number the expression evaluates to). If the number is greater than the number of items in the list, the switch() function returns NULL. For example:

    switch(2, "circle", "triangle", "square")

    Output:

    "triangle"

    2. If the expression evaluates to a character string, the switch() function returns the value based on its name:

    switch("red", "green"="apple", "orange"="carrot", "red"="tomato", "yellow"="lemon")

    Output:

    "tomato"

    If there are multiple matches, the first matched value is returned. It’s also possible to add an unnamed item as the last argument of the switch() function that will be a default fallback option in the case of no matches. If this default option isn’t set, and if there are no matches, the function returns NULL.
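
    The default-fallback behavior can be sketched as follows (the helper function name and values are made up):

```r
# Hypothetical helper to illustrate the unnamed default argument
fruit_of <- function(color) {
  switch(color,
         green  = "apple",
         red    = "tomato",
         yellow = "lemon",
         "unknown")  # unnamed last argument acts as the default
}

fruit_of("red")     # "tomato"
fruit_of("purple")  # "unknown" (no match, so the default is returned)

# Without a default, a non-matching string returns NULL (invisibly)
res <- switch("purple", green = "apple", red = "tomato")
is.null(res)  # TRUE
```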

    The switch() function is an efficient alternative to long if-else statements since it makes the code less repetitive and more readable. Typically, it’s used for evaluating a single expression. We can still write more complex nested switch constructs for evaluating multiple expressions. However, in this form, the switch() function quickly becomes hard to read and hence loses its main advantage over if-else constructs.