34 R Questions you must prepare before Data Science Interview

  1. Which data structures in R help with statistical analysis and graphical representation?
    The following data structures are widely used in R:
    a.) Array
    d.) Data frame
    e.) List
    f.) Tables
  2. What is class() function in R?
    class() returns a character vector giving the names of the classes from which an object inherits.

    > x <- 1:10
    > class(x)
    [1] "integer"

  3. What is a vector?
    A vector is a sequence of data elements of the same basic type. The members of a vector are called components.

    > vector_example <- c(2,3,4,5)
    > print(vector_example)
    [1] 2 3 4 5

    > print(length(vector_example))
    [1] 4

  4. How can you combine 2 vectors?
    Two vectors can be combined into one by using the c() function
    > first <- c(1,2,3,4)
    > second <- c("a", "b", "c")
    > third <- c(first, second)
    > print(third)
    [1] "1" "2" "3" "4" "a" "b" "c"
    The numbers are also shown in double quotes because c() coerces every element to a single primitive data type, here character, for the new vector being created.
  5. How to perform arithmetic operations on vectors? Show with an example.
    R has many arithmetic operators, and it applies them component by component. Let's look at some common ones.

    > x <- c(1,2,3,4)
    > y <- c(4,5,6,7)
    > x + y
    [1]  5  7  9 11
    > x - y
    [1] -3 -3 -3 -3
    > z <- c(4,4,4,4,4,4,4)
    > x + z
    [1] 5 6 7 8 5 6 7
    When two vectors of unequal length are used in one operation, the shorter vector is recycled, i.e. reused from its start, until it matches the length of the longer one. In the last example R also warns that the longer length (7) is not a multiple of the shorter length (4).
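The recycling rule can be checked with a small sketch. The vector names `short_vec` and `long_vec` and their values are illustrative assumptions, not part of the original question.

```r
# Recycling: the shorter vector is reused from its start
# until it matches the length of the longer one.
short_vec <- c(1, 2)
long_vec  <- c(10, 20, 30, 40)

# Length 4 is a multiple of length 2, so R recycles silently.
even_sum <- long_vec + short_vec
print(even_sum)     # 11 22 31 42

# Length 3 is not a multiple of length 2: R still recycles but warns,
# so the warning is suppressed here to keep the output clean.
uneven_sum <- suppressWarnings(c(10, 20, 30) + short_vec)
print(uneven_sum)   # 11 22 31
```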

  6. What is an index in a vector?
    An index gives the element at that position of the vector. Some programming languages start the index at 0 and others at 1. R counts the index from 1. There are several ways to use an index:
    a. positive and in range index
    > x <- c(1,3,4,5)
    > x[2]
    [1] 3

    b. out of range
    > x <- c(2,3,4,5)
    > x[110]
    [1] NA

    c. negative index – It removes the element at that position and returns all the remaining elements
    > x <- c(3,4,5,6,7)
    > x[-3]
    [1] 3 4 6 7

    d. range of values
    > x <- c(3,4,5,6,7,8)
    > x[2:5]
    [1] 4 5 6 7

    e. duplicate index
    > x <- c(3,4,5,6,7)
    > x[c(2,1,2,3)]
    [1] 4 3 4 5

    f. logical index – If you want to select a particular group of positions, use a logical vector of TRUE and FALSE values
    > x <- c(2,3,4,5,6)
    > x[c(TRUE,FALSE,FALSE,TRUE,TRUE)]
    [1] 2 5 6

  7. What is a list?
    A list, as the name suggests, is a collection of objects gathered together. Suppose you have a numeric vector, a character vector, a Boolean vector and some numbers. You want to combine them into one object, which obviously won't have a single data type. For that you need to create a list.

    > n = c(2, 3, 5)
    > s = c("a", "b", "c", "d", "e")
    > b = c(TRUE, FALSE, TRUE)    # example values for the Boolean vector
    > x = list(n, s, b, 3)
    > print(x)
    [[1]]
    [1] 2 3 5

    [[2]]
    [1] "a" "b" "c" "d" "e"

    [[3]]
    [1]  TRUE FALSE  TRUE

    [[4]]
    [1] 3

  8. What is a matrix?
    A matrix is a two-dimensional rectangular data set. It can be created by passing a vector to the matrix() function.
    # Matrix creation
    > M = matrix(c(1,2,3,4,5,6), nrow=2, ncol=3, byrow=TRUE)
    > print(M)
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    4    5    6

    nrow = number of rows in the matrix
    ncol = number of columns in the matrix
    byrow = TRUE fills the matrix row by row; FALSE (the default) fills it column by column

  9. What is an Array?
    An array is a superset of a matrix. A matrix is limited to 2 dimensions, whereas an array can have any number of dimensions.
    > a <- array(c("car","bike"), dim = c(3,3,2))
    > print(a)

    , , 1
    [,1]     [,2]     [,3]
    [1,] "car"  "bike" "car"
    [2,] "bike" "car"  "bike"
    [3,] "car"  "bike" "car"
    , , 2
    [,1]     [,2]     [,3]
    [1,] "bike" "car"  "bike"
    [2,] "car"  "bike" "car"
    [3,] "bike" "car"  "bike"

    > my_array <- array(1:24, dim=c(3,4,2))
    > print(my_array)
    , , 1
    [,1] [,2] [,3] [,4]
    [1,]  1     4     7    10
    [2,]  2     5     8    11
    [3,]  3     6     9    12
    , , 2
    [,1] [,2]  [,3] [,4]
    [1,]  13   16   19   22
    [2,]  14   17   20   23
    [3,]  15   18   21   24

  10. What is a factor?
    Factors are R objects created from a vector. A factor is stored as a vector of integer values with a corresponding set of character values (the levels) that are used when the factor is displayed. The only required argument to factor() is a vector of values; the function stores it along with its distinct values as labels.
    Factors are created using the factor() function. The nlevels() function gives the count of levels.

    #First let's create a vector
    > vector_example <- c('a','b','c','a','a')
    #Now create a factor object
    > factor_example <- factor(vector_example)
    > print(factor_example)
    [1] a b c a a
    Levels: a b c
    > nlevels(factor_example)
    [1] 3

    nlevels gives you the number of distinct values in the vector.

  11. What is the difference between Matrix and an array ?
    A matrix can have only 2 dimensions, whereas an array can have as many dimensions as you want. A matrix is defined by its data, the number of rows, the number of columns and whether the elements are filled row-wise or column-wise.
    For an array you give the full vector of dimensions. For example, a 3x3x2 array holds 2 matrices, each of dimension 3x3.
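A short sketch of the relationship between the two, using made-up objects `m` and `a`: a matrix is an array with exactly two dimensions, and fixing one index of a 3-dimensional array gives back an ordinary matrix.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
a <- array(1:18, dim = c(3, 3, 2))

print(dim(m))         # 2 3
print(dim(a))         # 3 3 2
print(is.array(m))    # TRUE: every matrix is also an array
print(is.matrix(a))   # FALSE: three dimensions, so not a matrix

# Fixing the third index extracts one 3x3 matrix out of the array.
second_slice <- a[, , 2]
print(is.matrix(second_slice))   # TRUE
```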
  12. What is a data frame?
    A data frame is a list of vectors of equal length. Each vector can have its own data type, so a data frame can hold, for example, one logical column and one numeric column. The only condition is that all the vectors have the same length.

    #This is how the data frame is created
    > student_profile <- data.frame(
        name = c("Amit", "Sumit", "Ajay"),
        age = c(22,23,24),
        class = c(6,7,8)
      )

    The above code will create 3 columns named name, age and class.

  13. What is the difference between a matrix and a dataframe?
    A data frame can contain vectors of different types and a matrix cannot. (You can have a data frame of characters, integers, and even other data frames, but you can't do that with a matrix. A matrix must be all the same type.)
    So, a data frame can have different vectors of characters, numbers, logicals, etc. and it is still cool. But a matrix needs a single data type. Phewww !!
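The difference shows up as soon as types are mixed. In this sketch the object and column names (`m`, `df`, `num`, `chr`, `flag`) are made up for illustration.

```r
# In a matrix, mixed inputs are coerced to one common type (character here).
m <- cbind(c(1, 2), c("a", "b"))
print(class(m[1, 1]))   # "character": the number 1 became the string "1"

# A data frame keeps the type of each column.
df <- data.frame(num  = c(1, 2),
                 chr  = c("a", "b"),
                 flag = c(TRUE, FALSE),
                 stringsAsFactors = FALSE)
col_types <- sapply(df, class)
print(col_types)        # numeric, character, logical
```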
  14. Define repeat loop.
    A repeat loop executes a sequence of statements again and again. No condition is written next to the keyword repeat; the exit condition is checked inside the body, and break ends the loop.

    > name <- c("Pappu", "John")
    > temp <- 5
    > repeat {
        print(name)
        temp <- temp+2
        if(temp > 11) {
          break
        }
      }

    So, this will print the name vector 4 times. Each pass first prints name and then increases temp by 2 (to 7, 9, 11 and 13), and the loop stops once temp exceeds 11.

  15. Define while loop.
    In the while loop the condition is tested first, and control goes into the body only when the condition is true.

    > name <- c("Pappu", "John")
    > temp <- 5
    > while (temp < 11) {
        print(name)
        temp <- temp+2
      }

    The name will be printed 3 times (for temp = 5, 7 and 9); when temp reaches 11 the condition is false and the loop ends.

  16. Define the for loop.
    The for loop is not limited to integers. You can loop over character vectors, logical vectors, lists or expressions.
    > x <- LETTERS[1:2]
    > for (i in x) {
        print(i)
      }
    [1] "A"
    [1] "B"
  17. What is the use of sort() function? How to use the function to sort in descending order?
    Elements in a vector can be sorted using the function sort()

    > temp <- c(3,5,2,6,7,1)
    > sort_temp <- sort(temp)
    > print(sort_temp)
    [1] 1 2 3 5 6 7
    > rev_sort <- sort(temp, decreasing = TRUE)
    > print(rev_sort)
    [1] 7 6 5 3 2 1

    This function also works with character vectors, which are sorted alphabetically.

  18. Create a list which holds a vector, a matrix and a list.

    example_list <- list(c("Kamal","Nitin"), matrix(c(1,2,3,4,5,6), nrow = 2), list("red",1))

  19. Determine the output of the following function f(2).

    b <- 4
    f <- function(a) {
      b <- 3
      b^3 + g(a)
    }
    g <- function(a) {
      a * b
    }
    The global variable b has the value 4. The function f is called with the argument 2, and its body has a local variable b with the value 3. So f(2) returns 3^3 + g(2). Because g is defined in the global environment, it sees the global b rather than the local one in f, so g(2) gives 2*4 = 8.
    Thus, the answer is 27 + 8 = 35

  20. What is the output of runif(10)?
    runif() generates random values from a uniform distribution, and its first argument gives the number of values required. So runif(10) will generate 10 random values between 0 and 1.
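A quick way to confirm this behaviour is sketched below. The seed value is arbitrary and only makes the run reproducible; `min` and `max` are the standard arguments of runif() for changing the range.

```r
set.seed(42)                        # arbitrary seed, for reproducibility only
u <- runif(10)                      # 10 draws from the uniform distribution on (0, 1)
print(length(u))                    # 10
print(all(u >= 0 & u <= 1))         # TRUE

v <- runif(5, min = 10, max = 20)   # 5 draws between 10 and 20
print(all(v >= 10 & v <= 20))       # TRUE
```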
  21. Get all the data of the person having maximum salary.

    max_salary_person<- subset(data, salary == max(salary))
  22. Get all the people who work with TCS and have a salary of more than 300000

    TCS_data_salary <- subset(data, company == "TCS" & salary > 300000)
  23. How is data reshaping done in R?
    Data reshaping covers various techniques that are used according to need. It is not one fixed procedure, but a set of independent methods to remould the data set. The following methods are used:-
    a. cbind()
    b. rbind()
    c. new_column_name.data_frame_name
    d. merge()
    e. melt()
    f. cast()
  24. How to get outer join, left join, right join, inner join and cross join?

    outer join - merge(x=df1, y=df2, by = "id", all = TRUE)
    left join - merge(x=df1, y=df2, by = "id", all.x = TRUE)
    right join - merge(x=df1, y=df2, by = "id", all.y = TRUE)
    inner join - merge(x=df1, y=df2, by = "id")
    cross join - merge(x=df1, y=df2, by = NULL)
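The effect of each join type is easiest to see on the row counts. This is a minimal sketch with two invented data frames; only the key column `id` comes from the calls above, the other columns and values are assumptions.

```r
# Two small, made-up tables sharing the key column "id".
df1 <- data.frame(id = c(1, 2, 3), name = c("A", "B", "C"))
df2 <- data.frame(id = c(2, 3, 4), salary = c(100, 200, 300))

outer_join <- merge(x = df1, y = df2, by = "id", all = TRUE)    # ids 1, 2, 3, 4
left_join  <- merge(x = df1, y = df2, by = "id", all.x = TRUE)  # ids 1, 2, 3
right_join <- merge(x = df1, y = df2, by = "id", all.y = TRUE)  # ids 2, 3, 4
inner_join <- merge(x = df1, y = df2, by = "id")                # ids 2, 3
cross_join <- merge(x = df1, y = df2, by = NULL)                # every pair of rows

print(c(outer = nrow(outer_join), left = nrow(left_join),
        right = nrow(right_join), inner = nrow(inner_join),
        cross = nrow(cross_join)))
```

Rows without a match on the other side get NA in the columns that come from the other table.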

  25. When you are reshaping the data, you sometimes need to melt the data. Explain the melt() function
    Suppose you have a data set with the columns company_name, age, salary and children, and you want one row per measurement for each combination of company_name and age. Converting the data from this wide layout to a long layout is called melting the data, and it is performed with the melt() function (for example from the reshape2 package). The id columns are kept as they are, and the remaining columns (salary and children) are stacked into a variable column and a value column.

    new_data_set.previous_data_set <- melt(previous_data_set, id=c("company_name","age"))

  26. What is lapply() function in R?
    lapply() is used when you want to apply a function to each element of a list in turn and get a list back.
    > x <- list(a=1, b=1:3, c=10:100)
    > lapply(x, FUN = length)
    $a
    [1] 1

    $b
    [1] 3

    $c
    [1] 91

    You can use other functions like max, min, sum, etc.

  27. What is sapply() function in R?
    sapply() is used when you want to apply a function to each element of a list in turn, but you want a vector back rather than a list.
    This is useful sometimes because it gets you a set of values on which you can easily perform further operations.
    > x <- list(a = 1, b = 1:3, c = 10:100)
    # Compare with above; a named vector, not a list
    > sapply(x, FUN = length)
     a  b  c
     1  3 91
    > sapply(x, FUN = sum)
       a    b    c
       1    6 5005
  28. What is the difference between lapply and sapply?
    If the programmer wants the output simplified to a vector (or a matrix), sapply() is used, whereas if the programmer wants the output to be a list, lapply() is used.
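The sketch below compares the return types directly. It reuses the list from the earlier questions; the helper names `as_list`, `as_vec`, `ranges` and `uneven` are illustrative.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

as_list <- lapply(x, length)   # always a list
as_vec  <- sapply(x, length)   # simplified to a named vector
print(is.list(as_list))        # TRUE
print(is.list(as_vec))         # FALSE

# When every result has the same length (> 1), sapply returns a matrix.
ranges <- sapply(x, range)
print(dim(ranges))             # 2 3

# When the results have different lengths, sapply cannot simplify
# and falls back to returning a list, just like lapply.
uneven <- sapply(x, rev)
print(is.list(uneven))         # TRUE
```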
  29. How to apply mean function in R?
    Mean is calculated by taking the sum of numbers and dividing it with the total number of elements. The function mean() is used to apply this in R.
    mean(x, trim=0,na.rm=FALSE)

    The mean() function has 3 arguments
    a.) x contains the vector on which the mean is to be calculated
    b.) trim is the fraction (0 to 0.5) of observations to drop from each end of the sorted vector before the mean is computed; the default 0 drops nothing
    c.) na.rm is used to remove the missing values from the input vector

    If there are missing values in the vector, mean() returns NA. To drop the missing values and still get a mean, set na.rm=TRUE, which means remove the missing values.
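A small worked example of the three arguments, using an invented vector `scores` with one outlier and one missing value:

```r
scores <- c(2, 4, 6, 8, 100, NA)   # made-up values

plain   <- mean(scores)                             # NA, because of the missing value
cleaned <- mean(scores, na.rm = TRUE)               # (2 + 4 + 6 + 8 + 100) / 5 = 24
# trim = 0.2 drops 20% of the 5 remaining values from each end
# (one value per end: 2 and 100), leaving 4, 6 and 8.
trimmed <- mean(scores, trim = 0.2, na.rm = TRUE)   # (4 + 6 + 8) / 3 = 6

print(plain)
print(cleaned)
print(trimmed)
```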

  30. How to make scatterplot in R?
    Scatterplot is a graph which shows many points plotted in the Cartesian plane. Each point holds 2 values which are present on the x and y axis. The simple scatterplot is plotted using plot() function.
    The syntax for scatterplot is:-

    plot(x, y ,main, xlab, ylab, xlim, ylim, axes)

    x is the data set whose values are the horizontal coordinates
    y is the data set whose values are the vertical coordinates
    main is the title of the graph
    xlab and ylab are the labels of the horizontal and vertical axes
    xlim and ylim are the limits of the values of x and y used in the plotting
    axes indicates whether both axes should be drawn on the plot

    plot(x = input$wt, y = input$mpg, xlab = "Weight", ylab = "Mileage", xlim = c(2.5,5), ylim = c(15,30), main = "Weight vs Mileage")

  31. Bonus Question
    How to write a countdown function in R?

    countdown <- function(time) {
      print(time)
      while (time != 0) {
        Sys.sleep(1)
        time <- time - 1
        print(time)
      }
    }
    countdown(5)
    [1] 5
    [1] 4
    [1] 3
    [1] 2
    [1] 1
    [1] 0
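  32. Vector v is c(1,2,3,4) and list x is list(5:8), what is the output of v*x[1]?

    Error in v * x[1] : non-numeric argument to binary operator
    x[1] is still a list, and a list cannot be multiplied by a numeric vector.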

  33. Vector v is c(1,2,3,4) and list x is list(5:8), what is the output of v*x[[1]]?

    [1] 5 12 21 32
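The two questions above differ only in the brackets. This sketch shows why: single brackets return a sub-list, double brackets extract the element itself. The tryCatch() wrapper is added here only so the failing call does not stop the script.

```r
v <- c(1, 2, 3, 4)
x <- list(5:8)

print(class(x[1]))     # "list": single brackets keep the list wrapper
print(class(x[[1]]))   # "integer": double brackets return the vector 5:8

product <- v * x[[1]]
print(product)         # 5 12 21 32

# Multiplying by the sub-list fails; capture the error message instead of stopping.
failure <- tryCatch(v * x[1], error = function(e) conditionMessage(e))
print(failure)
```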
  34. What are some of the functions that R has?
    R provides functions for:-
    a. Mean
    b. Median
    c. Distribution
    d. Covariance
    e. Regression
    f. GAM
    g. GLM
    h. Non-linear
    i. Mixed Effects etc


Factspan Interview Question

Factspan is a Bangalore based company started in the year 2012. It is a pure play analytics company with expertise in converting data into actionable insights.

Location - Bangalore
Job Title - Business Analyst
Experience required - 1-3 years
Number of Rounds - 4

Round 1 - Online Test(Elimination round)
Topics - SQL and MS Excel
Sample Questions
1. What is VLOOKUP and HLOOKUP?
VLOOKUP and HLOOKUP are Excel functions that retrieve a value from a table by matching on a common lookup value. VLOOKUP searches down the first column of the table and returns a value from the matching row.
HLOOKUP does the same for data laid out in rows instead of columns: it searches across the first row and returns a value from the matching column.

2. What is Indexing?
The Excel INDEX function returns the value at a given position in a range or array. You can use index to retrieve individual values or entire rows and columns. INDEX is often used with the MATCH function, where MATCH locates and feeds a position to INDEX.
The return type is a value at a given location.
3. How to get the second highest salary from a table?
Refer to SQL Interview Questions
4. Inner Join example
5. Find the error(Mostly related to WHERE vs HAVING and Conditional statement)
Refer to SQL Interview Questions

Round 2 - Face to Face Interview
Topic - 
Past Project Related Questions
My past project was related to customer churn on digital channel
Topic - What all factors are affecting the high customer churn rate on an e-commerce website
Questions asked :-
a. What is the difference between unique visitor and visits?
Unique visitors is the number of distinct visitors who have accessed the website or a particular page, whereas visits is the number of times the website/page has been hit.
b. What could be the probable reason for high customer churn rate?
Here you have to give a good number of possible reasons for churn. The examples which I gave were
  i.   Failed payment 
  ii.  Difficult User Interface
  iii. Lack of Engagement
  iv. Lack of Proactive support
  v.  Poor Market fit
c. How did you analyze the problem?
For the basic analysis, we took the total number of visitors for a particular page flow and figured out the pages with the maximum fall-outs. We had 5 pages, starting with the page for putting an item in the cart and flowing to the payment page. We looked into the search queries from each page and the type of questions asked to the support staff while a visitor was on a particular page to zero in on the probable issues.
Once we had the list of issues, we created a list of top 5 issues related to the digital inefficiency of the website. This was reported to the client.
d. Did you use any statistical technique/method to approach the problem?
We tried Natural Language Processing to process the chat and search data.
e. What tool did you use?
We used R to clean the chat data and then applied an NLP algorithm to get the top issues. Then there were questions on term frequency, inverse document frequency, sentiment analysis, etc. Basically you need to be very thorough with at least one project which you completed recently or were an active part of.

Round 3 - Face to Face
Topic - Case Study and Mathematical Round(Logical, Aptitude, Probability and Mathematical/Statistics Questions). Emphasis is on your problem solving skills
You can practice the case studies and mathematical puzzles here.

Round 4 - VP Face to Face Interview
Topic - HR questions
It was mostly about why you want to leave your current company, salary and work expectations, etc.

Salary offered - Best in the industry(****)

SQL interview questions for Data Science and Business Analyst role

Before you start with the questions, make sure you have your basics clear. Look for explanation if you don’t have enough context about some syntax or function. We have provided the table schema wherever required, in case the schema is not provided, please assume the table and column names.

1. How to find the third highest salary in an employee table with employee number and employee salary?

SELECT * FROM Employee_Table t1
WHERE 3 = (SELECT Count(distinct Salary) from Employee_Table t2 WHERE t1.Salary <= t2.Salary);

2. Write a query to find maximum salary of each department in an organisation.

SELECT Department_Name, Max(Salary)
FROM Department_Table
GROUP BY Department_Name

3.What is wrong with the following query?
SELECT Id, Year(PaymentDate) as PaymentYear
FROM Bill_Table
WHERE PaymentYear > 2018;

Although the alias PaymentYear is defined in the first line of the query, the WHERE clause is evaluated before the SELECT clause, so the alias does not exist yet when the filter runs. The correct query is
SELECT Id, Year(PaymentDate) as PaymentYear
FROM Bill_Table
WHERE Year(PaymentDate) > 2018;

4. What is the order of execution in a query?

The order of query goes like this:-
FROM – Choose and join tables to get the raw data
WHERE – First filtering condition
GROUP BY – Aggregates the base data
HAVING – Apply condition on the aggregated data
SELECT – Return the final data
ORDER BY – Sort the final data
LIMIT – Apply limit to the returned data

5.What is ROW_NUMBER() function?

It assigns a unique sequential number to each row returned from the query, even if the values being ordered on are the same. Sample query:-

SELECT emp.*,
row_number() over (order by salary DESC) Row_Number
from Employee emp;

Employee Name | Salary | Row_Number

Even when the salary is the same for Bhargav and Chirag, they have a different Row_Number. This means that the function row_number simply gives a number to every row.

6. What is RANK() function?

RANK() function is used to give a rank and not a row number to the data set. The basic difference between RANK() and ROW_NUMBER is that Rank will give equal number/rank to the data points with same value. In the above case, RANK() will give a value of 2 to both Bhargav and Chirag and thus will rank Dinesh as 4. Similarly, it will give rank 5 to both Esha and Farhan.

SELECT emp.*,
RANK() over (order by salary DESC) Ranking
from Employee emp;

7. What is NTILE() function?

NTILE() function distributes the rows in an ordered partition into a specific number of groups. These groups are numbered. For example, NTILE(5) will divide a result set of 10 records into 5 groups with 2 record per group. If the number of records is not divided equally in the given group, the function will set more record to the starting groups and less to the following groups.

SELECT emp.*,
NTILE(3) over (order by salary DESC) as GeneratedRank
from Employee emp

This will divide the complete data set into 3 groups from the top. So the GeneratedRank will be 1 for Amit and Bhargav, 2 for Chirag and Dinesh, and 3 for Esha and Farhan

8. What is DENSE_RANK() ?

This gives the rank of each row within a result set partition, with no gaps in the ranking values. If the top 2 employees have the same salary then they both get the same rank, i.e. 1, much like with the RANK() function. But the third person gets a rank of 2 with DENSE_RANK(), as there is no gap in the ranking, whereas the third person gets a rank of 3 with RANK(). Syntax below:-

SELECT emp.*,
DENSE_RANK() OVER (order by salary DESC) DenseRank
from Employee emp;
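To compare the three ranking functions side by side outside a database, the following R sketch mimics them with base R. The salary figures are hypothetical, chosen only so that Bhargav and Chirag tie and Esha and Farhan tie, as in the description above; the R calls are stand-ins for the SQL functions, not the SQL itself.

```r
emp <- data.frame(
  name   = c("Amit", "Bhargav", "Chirag", "Dinesh", "Esha", "Farhan"),
  salary = c(90, 80, 80, 70, 60, 60)   # hypothetical salaries with two ties
)

# Negating the salary ranks the highest salary first (ORDER BY salary DESC).
emp$row_number <- rank(-emp$salary, ties.method = "first")   # like ROW_NUMBER()
emp$rank       <- rank(-emp$salary, ties.method = "min")     # like RANK()
# Dense rank: position of each salary among the distinct salaries, highest first.
emp$dense_rank <- match(emp$salary, sort(unique(emp$salary), decreasing = TRUE))

print(emp)
```

With these values the rank column reads 1, 2, 2, 4, 5, 5, while the dense_rank column reads 1, 2, 2, 3, 4, 4.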

9. Write a query to get employees name starting with vowels.

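SELECT EmpName
FROM Employee
WHERE EmpName LIKE '[aeiou]%';
The [aeiou] character class inside LIKE is SQL Server syntax; other databases need a different pattern.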

10. Write a query to get employee name starting and ending with vowels.

SELECT EmpName
FROM Employee
WHERE EmpName LIKE '[aeiou]%[aeiou]';

11. What are the different types of statements supported in SQL?

There are three types of statements in SQL:-
a. DDL – Data Definition Language
b. DML – Data Manipulation Language
c. DCL – Data Control Language

12. What is DDL?

It is used to define the database structure such as tables. It includes 3 commands:-
a. Create – Create is for creating tables
CREATE TABLE table_name (
    column1 datatype,
    column2 datatype,
    column3 datatype
);
b. Alter – Alter table is used to modify an existing table object in the database.
ALTER TABLE table_name
ADD column_name datatype;
c. Drop – If you drop a table, all the rows in the table are deleted and the table structure is removed from the database. Once the table is dropped, you can't get it back

13. What is DML?

Data Manipulation Language is used to manipulate the data in records. Commonly used DML commands are Update, Insert and Delete. Sometimes the SELECT command is also referred to as a DML command.

14. What is DCL?

Data Control Language is used to control the privileges by granting and revoking database access permission

15. What is the difference between DELETE and TRUNCATE?

Linear Regression in R

Linear Regression is a basic algorithm used to predict the value of an outcome variable Y based on one or more input predictor variables.
The mathematical formula of a multi-variable Linear Regression is:-
Y = m1*x1 + m2*x2 + m3*x3 + ... + mn*xn
Here Y is the outcome variable, x1, x2, etc. are the input predictors and m1, m2, etc. are their coefficients.

The code in R is simple, first of all install the following libraries in your work space:-

Take any time series data to implement and observe how a Linear Regression works. We will be taking a monthly data of number of Pizzas sold in a restaurant, Olivers’, and we will be creating some customized variable for the same.

The table above is a sample data set, you can use any raw time series data set.
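As a minimal sketch of what such a model looks like in R, the example below fits lm() on an invented monthly data set. The column names (`month_index`, `is_holiday`, `sold`) and all numbers are assumptions made for illustration, not the Olivers' data. Note that lm() also fits an intercept term by default, in addition to the coefficients in the formula above.

```r
# Hypothetical monthly data; every number here is invented.
pizza <- data.frame(
  month_index = 1:12,
  is_holiday  = c(0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1)
)
# Build the outcome from known coefficients so the fit can be checked.
pizza$sold <- 200 + 5 * pizza$month_index + 30 * pizza$is_holiday

fit   <- lm(sold ~ month_index + is_holiday, data = pizza)
coefs <- coef(fit)
print(round(coefs, 2))    # intercept 200, month_index 5, is_holiday 30

# Predict the next month, assuming it has no holiday.
next_month <- predict(fit, newdata = data.frame(month_index = 13, is_holiday = 0))
print(round(unname(next_month), 2))   # 200 + 5 * 13 = 265
```

Because the outcome was generated without noise, the fitted coefficients recover the generating values exactly; real data will not fit this cleanly.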

Myntra Interview Questions

  1. Company – Myntra
    Year – 2018
    Profile – Business Analyst(2+ Years of Exp.)
    Rounds – 3

    Round 1  
    -Written SQL round – ~50 MCQ questions (Intermediate level)
    – Non-elimination round(unless you do as pathetic as 15/50)
    – Go through www.sqlzoo.net or follow www.thedatamonk.com

    Round 2 
    Technical and Case Study Interview Questions
    1. What is your role in the current company? (Answer accordingly)
    2. What all tools, technologies and languages have you worked on?
    -Power BI and Tableau
    3. What is a normal distribution?
    – It is a type of probability distribution
    – A Normal curve is also called a Bell curve. The x-axis shows the values that the variable x can take and the y-axis shows the probability density of those values
    – A bell curve is symmetric around the mean
    – About 68% of the total population lies within one standard deviation of the mean
    4. Give a real life example of Normal Distribution.
    – You can take up any example like height of employees in your organization or weight or salary of employees
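The 68% figure can be checked with R's pnorm(), the cumulative distribution function of the normal distribution (standard normal by default). The variable names are illustrative.

```r
# Share of a normal population within 1 and 2 standard deviations of the mean.
within_1sd <- pnorm(1) - pnorm(-1)
within_2sd <- pnorm(2) - pnorm(-2)

print(round(within_1sd, 4))   # 0.6827
print(round(within_2sd, 4))   # 0.9545
```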
    5. How to get the second largest salary from a salary table with employee id and salary?(Standard Question)
    SELECT MAX(Salary)
    FROM Table_Name
    WHERE Salary NOT IN (SELECT MAX(Salary) FROM Table_Name);

7) Differentiate between univariate, bivariate and multivariate analysis.

Day 1 – Starting with Data Science

Before you start exploring Data Science, there are a few things which you should know beforehand. Do keep the following in mind:-

1. A typical data science role requires a good knowledge of the following domains:-
a. Analytics – This is the core of data science, where you play around with the data to build analytical models. For this you need to learn either Python or R
b. Query – You need to know how to retrieve a particular data from millions of rows and hundreds of columns. The query optimization plays a crucial role here(since you are working with millions of rows of data). Proficiency in any query language(SQL) is required in a data scientist role
c. Reporting – After you are done with pulling and analyzing the data and building a model or something, now you need to share the result with everyone. For that you need to know at least one industry standard reporting tool. You can learn any one out of Tableau, Microsoft PowerBI, Qlik Sense, etc.

So all in all, you need to know at least one analytical language, SQL and a reporting tool. In this website, we will learn about various analytics algorithms and some cool SQL hacks
