Wouldn’t it be great if you knew exactly what the hiring manager will ask you at your next R and data science interview?
Frankly, we can’t do quite that, but we can give you the next best thing: a list of the 17 most commonly asked interview questions and the answers you should give.
We’ve gathered these questions from interviewers and people who have been on an R or data science interview.
Please note that these answers are just recommendations, since other answers could also be correct.
Let's get started:
1. How do you import data into R?
You can import data with the R Commander GUI. It offers three ways to do so:
• You can enter data directly using Data -> New Data Set.
• You can import data from a plain text (ASCII) file or from other file formats (SPSS, Minitab, etc.) using a package.
• You can read a data set that ships with a package, either by typing the name of the data set or by selecting it in the dialog box.
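Outside the GUI, the same import can be done from the R console. The following is a minimal sketch using base R's read.csv(); the file contents and the names csv_path and scores are made up for illustration.

```r
# Write a small CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Ann,90", "Bob,85"), csv_path)

# read.csv() parses the file into a data frame, one column per field.
scores <- read.csv(csv_path, stringsAsFactors = FALSE)
print(scores)
```

For other formats, a package-specific reader would take the place of read.csv().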
2. What are the different data types in R?
R has a wide variety of data types including scalars, vectors (numerical, character, and logical), matrices, data frames, and lists.
3. What is the difference between a matrix and a data frame?
A data frame can contain heterogeneous inputs and a matrix cannot. A data frame can hold columns of characters, integers, and even other data frames, but a matrix cannot: all elements of a matrix must be of the same type.
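A short sketch of the difference, using made-up values: mixing types in a matrix silently coerces everything to one type, while a data frame keeps a separate type per column.

```r
# Mixed inputs in a matrix are coerced to a single common type (character here).
m <- matrix(c(1, "a", TRUE, 2), nrow = 2)
typeof(m)

# A data frame keeps each column's own type.
df <- data.frame(
  id = c(1L, 2L),
  label = c("a", "b"),
  flag = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)
sapply(df, class)
```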
4. Why do you need the apply() family of functions? What is the difference between sapply() and lapply()?
These functions let you traverse data in a number of ways and avoid the explicit use of loop constructs. Use lapply() when you want the output to be a list, and sapply() when you want the output simplified to a vector or a matrix where possible.
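As a rough illustration (the list values below is invented for the example), the same call returns different container types depending on which function is used:

```r
values <- list(a = 1:3, b = 4:6)

# lapply() always returns a list, one element per input element.
as_list <- lapply(values, sum)

# sapply() simplifies the result: a named vector when each result has length 1.
as_vec <- sapply(values, sum)

# When each result has the same length greater than 1, sapply() returns a matrix.
as_mat <- sapply(values, range)
```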
5. How are missing values represented in R? How are missing values replaced in R?
Missing values are represented by the symbol NA (not available). One common way to replace them is single imputation, which comes in two simple forms: you can replace missing values with the mean of the other values in the variable, or you can randomly sample from those values.
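Both forms of single imputation can be sketched in a few lines of base R. The vector x and the seed are arbitrary choices for the example.

```r
x <- c(2, NA, 4, 6)

# Mean imputation: fill each NA with the mean of the observed values.
x_mean <- x
x_mean[is.na(x_mean)] <- mean(x, na.rm = TRUE)

# Random-sample imputation: fill each NA with a draw from the observed values.
set.seed(1)  # only to make the draw repeatable
observed <- x[!is.na(x)]
x_samp <- x
x_samp[is.na(x_samp)] <- sample(observed, sum(is.na(x)), replace = TRUE)
```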
6. What are the options to merge two data frames in R?
You can merge two data frames (datasets) horizontally, joining them on one or more common key columns, using the merge() function. To merge two data frames (datasets) vertically, stacking rows that share the same columns, use the rbind() function.
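A small sketch of both directions, with invented data frames people, scores, and more_people:

```r
people <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
scores <- data.frame(id = c(1, 2, 4), score = c(90, 85, 70))

# Horizontal merge on the key column: only ids present in both are kept.
inner <- merge(people, scores, by = "id")

# all.x = TRUE keeps every row of the first data frame (a left join).
left <- merge(people, scores, by = "id", all.x = TRUE)

# Vertical merge: both data frames must have the same columns.
more_people <- data.frame(id = c(5, 6), name = c("Di", "Ed"))
stacked <- rbind(people, more_people)
```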
7. What are the most commonly used functions in R?
substr(x, start=n1, stop=n2)
Statistical Probability Functions
Other Statistical Functions
scale(x, center=TRUE, scale=TRUE)
8. What is the difference between the sort() and order() functions in R?
The sort() function will sort a single variable (a single vector or factor), while the order() function works with observation numbers and thus can be used to sort a data frame or do more complex sorting.
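To make the distinction concrete, here is a brief sketch with made-up ages: sort() returns the sorted values, while order() returns the positions that would sort them, which is what you need to reorder the rows of a data frame.

```r
ages <- c(30, 25, 35)

sorted_values <- sort(ages)   # the values themselves, in increasing order
positions <- order(ages)      # the indices that put ages in increasing order

# Use the positions to sort a whole data frame by one column.
df <- data.frame(name = c("Ann", "Bob", "Cy"), age = ages)
df_sorted <- df[order(df$age), ]
```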
9. What are factor variables in R?
Factor variables are categorical variables that can be either numeric or string variables. The factor stores nominal values as a vector of integers in the range [1...k], where k is the number of unique values in the nominal variable, and an internal vector of character strings (the original values) mapped to these integers.
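The integer codes and the character labels can be inspected directly. A minimal sketch with an invented sizes variable:

```r
sizes <- factor(c("small", "large", "medium", "small"))

# The internal vector of character strings (sorted alphabetically by default).
lv <- levels(sizes)

# The integer codes, each pointing at a position in levels(sizes).
codes <- as.integer(sizes)
```

Here k is 3, so every code falls between 1 and 3.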
10. What is the curse of dimensionality and how should one deal with it when building machine-learning models?
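The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space. Common ways to deal with it include feature selection, dimensionality reduction (for example, principal component analysis), and regularization.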
11. Explain the difference between a compiled computer language and an interpreted computer language.
Interpretation is a technique whereby another program, called the interpreter, performs operations on behalf of the program being interpreted in order to run it. If you can imagine reading a program and doing what it says step by step, say on a piece of scratch paper, that’s just what an interpreter does. A common reason to interpret a program is that interpreters are relatively easy to write. Another reason is that an interpreter can monitor what a program tries to do as it runs, for example to enforce a security policy.
Compilation is a technique whereby a program written in one language (the “source language”) is translated into a program in another language (the “object language”), which hopefully means the same as the original program. While doing the translation, it is common for the compiler to also try to transform the program in ways that will make the object program faster (without changing its meaning). A common reason to compile a program is that there are some good ways to run programs in the object language quickly and without the overhead of interpreting the source language along the way.
12. Explain the benefits of test-driven software development or explain the benefits of unit testing.
Benefits of Test-Driven Software Development
1. Maintainable, Flexible, Easily Extensible
2. Unparalleled Test Coverage and Streamlined Codebase
3. Clean Interface
4. Refactoring Encourages Improvements
5. Executable Documentation
Benefits of Unit Testing
1. Find Problems Easily
2. Facilitates Change
3. Simplifies Integration
13. What is probabilistic merging (AKA fuzzy merging)? Is it easier to handle with SQL or other languages?
Probabilistic record linkage, sometimes called fuzzy matching (also probabilistic merging or fuzzy merging in the context of merging databases), takes a different approach to record linkage problems: it takes into account a wider range of potential identifiers, computes a weight for each identifier based on its estimated ability to correctly identify a match or a non-match, and uses these weights to calculate the probability that two given records refer to the same entity. Record pairs with probabilities above a certain threshold are considered to be matches, while pairs with probabilities below another threshold are considered to be non-matches. Pairs that fall between the two thresholds are considered to be "possible matches" and can be dealt with accordingly (e.g. human reviewed, linked, or not linked, depending on the requirements). It’s easier to handle with SQL.
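For comparison with a SQL approach, here is a rough sketch of fuzzy matching in base R. Note that it uses plain edit distance via adist() as a simpler stand-in for the weighted probabilistic method described above; the names, the threshold of 2, and the variables a, b, d, best, and is_match are all assumptions made for the example.

```r
a <- c("Jon Smith", "Mary Jones")
b <- c("John Smith", "Marie Jones", "Peter Pan")

# adist() returns a matrix of edit distances: rows follow a, columns follow b.
d <- adist(a, b)

# For each name in a, pick the closest candidate in b.
best <- as.integer(apply(d, 1, which.min))

# Accept a pair as a match only if its distance is within the threshold.
threshold <- 2
is_match <- d[cbind(seq_along(a), best)] <= threshold

data.frame(a = a, closest = b[best], match = is_match)
```

A full probabilistic linkage would replace the single distance with per-field weights and two thresholds.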
14. How do you convert a factor variable to a numeric variable without losing information?
First possible answer:
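You can convert factors to either text or numbers. To do this, you use the functions as.character() or as.numeric(). Be aware that calling as.numeric() directly on a factor returns its internal integer codes, not the original values. To keep the original numbers, convert the factor to character first and then to numeric: as.numeric(as.character(f)).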
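The pitfall is easiest to see side by side. A minimal sketch, using an invented factor f whose labels are numbers:

```r
f <- factor(c("10", "20", "10", "30"))

# Direct conversion returns the internal codes, which loses the original values.
codes <- as.numeric(f)

# Going through character recovers the original numbers.
values <- as.numeric(as.character(f))

# An equivalent idiom: convert the levels once, then index by the factor.
values_alt <- as.numeric(levels(f))[f]
```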
Second possible answer:
split and split<- are generic functions with the default and data.frame methods. The data frame method can also be used to split a matrix into a list of matrices and the replacement form likewise, provided they are invoked explicitly.
unsplit works with lists of vectors or data frames (assumed to have a compatible structure, as if created by a split). It puts elements or rows back into the positions given by f. In the data frame case, row names are obtained by unsplitting the row name vectors from the elements of value.
15. What’s more important, the predictive power or interpretability of a model?
Modeling is the process of building a useful representation of a system or phenomenon. Models have a purpose, for example to precisely simulate or predict the behavior of a system. Among other qualities, interpretability (aka comprehensibility or understandability) is often cited as important, especially in contexts such as data mining and knowledge discovery from data.
Predictive power differs from explanatory and descriptive power (where phenomena that are already known are retrospectively explained or described by a given theory) in that it allows a prospective test of theoretical understanding. Which of the two matters more depends on the purpose of the model.
16. How do you convert a Date character string to a date variable in R?
You can use the as.Date() function to convert character data to dates. The format is as.Date(x, "format"), where x is the character data and "format" describes how the date is written, using codes such as %d (day), %m (month), and %Y (four-digit year).
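A short sketch with an arbitrary example date: ISO-style strings need no format, while other layouts need a matching format string.

```r
# ISO 8601 strings (year-month-day) are parsed without a format argument.
d1 <- as.Date("2024-03-15")

# Other layouts need a format string describing the input.
d2 <- as.Date("03/15/2024", format = "%m/%d/%Y")

# Once converted, dates support arithmetic such as differences in days.
days_since_new_year <- as.numeric(d2 - as.Date("2024-01-01"))
```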
17. Explain the “bias-variance trade-off” and why it is fundamental to machine learning.
In statistics and machine learning, the bias–variance tradeoff (or dilemma) is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set:
• The bias is an error that comes from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
• The variance is an error that comes from sensitivity to small fluctuations in the training set. High variance can cause overfitting, such as modeling random noise in the training data rather than the intended outputs.
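The two error sources above can be written as a decomposition of expected squared error. The notation here is an assumption for illustration: the data follow y = f(x) + ε with noise variance σ², and f̂ is the model fitted on a random training set.

```latex
\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right]
  = \underbrace{\left(\mathbb{E}\left[\hat{f}(x)\right] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}\left[\hat{f}(x)\right]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Making a model more flexible typically lowers the bias term while raising the variance term, which is why the two cannot be driven to zero independently.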