What is Data?

Welcome to the first day of “100 Days Data Science Challenge”. In this series, we will try to make you understand with one topic at a time which should not take more than 2 hours. So, in total, you need to devote 150-200 hours in the next 100 days to understand various data science topics, ranging from beginner to intermediate level.

We will start with the basics.

What is data?
In Latin, Data is the plural of datum. In English Data is both singular and plural. Data is any sort of information, it could be in the form of facts or figures,  numbers, text, images, sound, etc.

Example- “Sachin”,”51″,”greatest”

What is

Regular Expression

We will start with Regular Expressions. RegEx is a way of creating a rule to filter what you want from your data. We will to keep Python as a consistent coding language for this complete exercise.

Many of you must have come across SQL questions where you need to get the data of customers whose name starts with A and in the WHERE condition you write something like,

WHERE Customer_Name LIKE ‘A%’

Well !! This is the basic Regular Expression where you request the query to get you a specific result. The way we write Regular Expression in Python is a bit different. Check out the table below:-

To use Regular Expression, first, you need to “import re” package
And the following 4 functions quite useful for using your regex 1. findall – It returns a complete list of all the matches
2. search – It returns a match object
3. split – Splits the string wherever there is a match
4. sub – It replaces one or many matches of the regex

Following are some important metacharacter and special sequence

RegEx Description
w+ Get all the words
d Digits
s Spaces
S Anything but white spaces
+ One or more occurrences
^ Starts with
$ Ends with
* Zero or more occurences
+ One or more occurrences
| Either Or
[] A set of Character
Special sequence

Let’s get down on some questions to understand the basics of how to write a regex

1. re.split(‘s+’,’My name is Data Monk’)
‘My’ ‘name’ ‘is’ ‘Data’ ‘Monk’ – The above function took the regex s+ to get all the words from the given string and split it 2. end_Sentence = r'[.?!]’
print(re.split(end_Sentence, String)
The above line of codes will split the document wherever a sentence is ending with a full stop, question mark, or an exclamation mark
3. [a-z A-Z 0-9 -.]
This will match all the upper case, lower case, digits, – and . 4. r”[.*]”
Since it contains an asterisk, so it will match anything and everything You can find many more RegEx exercise questions on different websites. Do practice a few 🙂
Let’s continue with our NLP

 

Natural Language Processing Using Python

What is NLP?

NLP stands for Natural Language Processing. To put it in a simple way, NLP is a way of developing applications which can understand human language. Millions of Gigabytes of data is being generated every day and there is a high demand to understand this data in a simpler way. We have different analytical methods to fetch insights from numerical data, and NLP comes into the picture when we are dealing with textual data.

What are the applications of NLP?

NLP is being used heavily in the following domain:-
1. Search Engines – Be it Google or Bing, we write a query and rely on the NLP part of the algorithm to fetch us the best results.
2. Speech Engines – Almost all mobile phones come with an inbuilt speech engine which translates your voice feed into a query and gets you the search results. Example – Siri
3. Social Media – Have you ever wondered how you are receiving relevant ads in your social media feed? Well, that is NLP, decoding your interest and pushing relevant ads on your feed
4. Survey Analysis – You must have filled some surveys, and there are billions of survey copies in the digital world, you need to go through every survey answer to get a gist of customer sentiments, but going through millions of copies is quite cumbersome. So, what do you do? You build an NLP model, feed the content and boom..It’s done.
5. Artificial Intelligence Chat Bot – This is one important application of NLP, where you chat with a bot and it gets what you ask for
6. Spam filters – For every relevant email, there are at least 2 spam emails, but you never worry about it because you never got to see one. All those spam filters are powered by NLP

What are we going to learn in this post?

We will use Python to understand the basics of NLP, regular expression and few algorithms. So, be with us for another 30-40 minutes to understand Natural Language Processing using Python.

Before starting NLP, Please go through Regular Expression.

 

 

 

 

 

Sapient Business Analyst Interview Questions

Publicis Sapient, formerly Sapient, is a digital transformation partner helping established organizations get to their future, digitally-enabled state by fusing strategy, consulting and customer experience with agile engineering. It was founded in 1990.
Location – Bangalore
Job Title – Business Analyst
Experience required – 1-3 years
Number of Rounds – 4
Round 1 – Telephonic Round
The telephonic interview lasted for ~45 minutes where the questions were mostly on the tools and technologies I have worked on in my previous organization. Slowly, the questions shifted to SQL and statistics. Following are the questions which were asked:-
  1. What is the output of SELECT NULL+0?
    NULL
  2. What are the ranking functions in SQL?
    There are mainly 3 types of ranking functions:-
    Rank()
    Row_Number()
    Dense_Rank()
  3. What is a partition by clause and how is it used?
    Partition by clause is used to create to divide the whole data in different parts depending on the column on which it is partitioned. Suppose the data contains 50 rows and have data for 6 States, then if you do a partition by on state, the whole data set will be treated differently on all the 6 partitions. The syntax for partition by in ROW_NUMBER() is given below SELECT *, ROW_NUMBER() OVER (PARTITION BY State ORDER BY population DESC) AS row_num
    FROM Table_Name So, a new column will be added in the result as row_num and it will give a row number to all the state row starting from 1. Once the rows of a particular state is over, then it will again take up another state and will start the counting from there
  4. What is A/B Testing?
    A/B testing is a form of statistical hypothesis testing with two variants leading to the technical term, two-sample hypothesis testing, used in the field of statistics. In simple words, A/B Testing in web analytics is used to compare the performance of 2 web design to get a better design. Suppose you have 2 designs to display an advertisement on your website, one being a picture and other a text or link. So, you can compare the performance of the two design by A/B Testing.
  5. What is regression?
    Regression is a form of predictive modeling technique to determine the strength of the relationship between a dependent and independent variable. One of these variables is called a predictor variable whose value is gathered through experiments. The other variable is called the response variable whose value is derived from the predictor variable.Y=aX+b – Linear regression (X is predictor variable and Y is response variable)
  6. Give some example of regression?
    Regression is used for forecasting, time series modeling and finding the casual effect relationship between the variables. For example, the relationship between rash driving and the number of road accidents by a driver is best studied through regression.
  7. What is a multiple regression?
    Multiple regression is an extension of linear regression into the relationship between more than two variables. In simple linear relation we have one predictor and one response variable, but in multiple regression, we have more than one predictor variable and one response variable.Y=a1x1+a2x2+..+b
  8. What is DENSE_RANK() function?
    DENSE_RANK() again is a ranking function which is very similar to RANK() function. The only difference is that it does not miss any rank even if there are duplicates in the table.
  9. Syntax of DENSE_RANK() function
    SELECT *, DENSE_RANK() OVER (PARTITION BY Column1 ORDER BY Column2 DESC)
    FROM Table_Name
  10. There was a question on self-join where you have to get the employee name and manager name from a table having 3 columns, EmployeeID, EmployeeName, ManagerID
    SELECT e1.Name AS EmployeeName, e2.Name AS ManagerName
    FROM Employee AS e1
    INNER JOIN Employee AS e2
    ON e1.ManagerID = e2.EmplyeeID

There were a few questions on the project you are working on right now

Round 2 – Case Study

The Case Study topic was to recommend two food items to a customer who is new to the restaurant. You can find the complete analysis of this case study and other case studies here

Round 3 – Face to Face Technical Round
This round was mostly about past projects. I had a Natural Language Processing project, so the interview revolved around the same topic. Following questions were asked in this round:- 1. What was the project for?
A. The project was to do sentiment analysis on the survey data filled by online customers. 2. What algorithms/methods did you try?
A. We tried multiple algorithms, starting from TF-IDF, Part-Of-Speech tagging, n-gram, Lemmatization, Stemming, Tokenization, Latent Semantic Indexing, Sentiment Analysis. 3. What all methods do you need to perform in order to convert a keyword into its base form(Normalization)?
A. Lemmatization and Stemming 4. What is N-gram?
A. N-grams are simply all combinations of adjacent words or letters of length n that you can find in your text file.
For example
This is a sentence
N-grams = This is, is a, a sentence 5. What is the use of TF-IDF?
A. TF-IDF stands for Term Frequency and Inverse Document Frequency. TF-IDF is numerical statistics that help to understand the importance of a particular word in a document. Term frequency gets you the number of times a particular word has occurred in a document and Inverse Document Frequency gets you the importance of the words. It helps out in filtering out the most common words like a, an, the, was, etc.. So, you get only the important terms. 6. What is Lemmatization?
Lemmatization takes into account the morphological analysis of the word. It converts a word into its pure root form by looking into the morphological information studies – Third person, singular number, present tense of verb study
Lemma – study
studying – Gerund of the verb study
Lemma – study

As you can see, both the words studies and studying has been narrowed down to the lemma study. 7.

7. Explain the complete flow of your NLP project
A. The brief of the process is given below with some coding examples:-
Step 1 – Get the text dataset
Step 2 – Tokenize the text using get_text() in Python
Step 3 – Split the text using
tokens = [t for t in text.split()]
Step 4 – Get the count of the word frequency using the NLTK package in Python
freq = nltk.FreqDist(tokens)
Step 5 – Remove stop words. Code below
for token in tokens:
          if token in stopwords.words(‘english’):
             clean_tokens.remove(token) Step 6 – Tokenize non-English words
Step 7 – Get synonyms and antonyms using WordNet package from NLTK in Python
Step 8 – Stemming of words. I used PorterStemmer algorithm
stem_Word = PorterStemmer
Step 9 – Once we are done with stemming, go for Lemmatization. WordNet package
lemma = WordNetLemmatizer()
Step 10 – Build a classifier. We can you Logistic Regression to create a baseline model. Later we used Naive Bayes Classification.

There were questions only on the logical part of the process and not on the code implementation. But, it’s always better to infuse coding example wherever you can. The interview lasted for around 1 hour.

Round 4 – Human Resource
Basic questions, like:-
1. Why are you quitting your present job?
2. What are your expectations with the company? and the company’s expectation
3. Salary negotiation
4. Have you ever lead a team?

Salary offered – Best in the industry (5/5) 

The Data Monk services

We are well known for our interview books and have 70+ e-book across Amazon and The Data Monk e-shop page . Following are best-seller combo packs and services that we are providing as of now

  1. YouTube channel covering all the interview-related important topics in SQL, Python, MS Excel, Machine Learning Algorithm, Statistics, and Direct Interview Questions
    Link – The Data Monk Youtube Channel
  2. Website – ~2000 completed solved Interview questions in SQL, Python, ML, and Case Study
    Link – The Data Monk website
  3. E-book shop – We have 70+ e-books available on our website and 3 bundles covering 2000+ solved interview questions. Do check it out
    Link – The Data E-shop Page
  4. Instagram Page – It covers only Most asked Questions and concepts (100+ posts). We have 100+ most asked interview topics explained in simple terms
    Link – The Data Monk Instagram page
  5. Mock Interviews/Career Guidance/Mentorship/Resume Making
    Book a slot on Top Mate

The Data Monk e-books

We know that each domain requires a different type of preparation, so we have divided our books in the same way:

1. 2200 Interview Questions to become Full Stack Analytics Professional – 2200 Most Asked Interview Questions
2.Data Scientist and Machine Learning Engineer -> 23 e-books covering all the ML Algorithms Interview Questions
3. 30 Days Analytics Course – Most Asked Interview Questions from 30 crucial topics

You can check out all the other e-books on our e-shop page – Do not miss it


For any information related to courses or e-books, please send an email to nitinkamal132@gmail.com

34 R Questions you must prepare before Data Science Interview

  1. What are the data structure in R which helps in statistical analysis and graphical representation?
    Ans.)
    The following are the data structure in R which are widely used:-
    a.) Array
    b.)Matrix
    c.)Vector
    d.) Data frame
    e.) List
    f.) Tables
  2. What is class() function in R?
    Ans.)
    This is a very important function in R which is a character vector giving the names of the classes from which the object inherits.
    Example. > x<- 1:10
    > class(x)
    [1] “integer”
  3. What is a vector?
    Ans.) A vector is a sequence of data elements of the same basic type. Members in a vector are called components.
    Example. >vector_example<- c(2,3,4,5)
    > print(vector_example)
    [1] 2,3,4,5 > print(length(vector_exmple)
    [1] 4
  4. How can you combine 2 vectors?
    Ans.)
    Vectors can be combined from 2 to 1 by using the c() function
    Example.
    > first <- c(1,2,3,4)
    > second <- (“a”, “b”, “c”)
    > third <- c(first, second)
    > print(third)
    [1] “1” “2” “3” “4” “a” “b” “c”
    The numbers are also shown in the double quote, this is done to maintain the same primitive data type for the new vector being created J
  5. How to perform arithmetic operations on Vectors? Show with some example
    Ans.)
    There are many arithmetic operators which are being used in R. Remember, R uses the operators component by component. Let’s look at it with some common operators. >x <- c(1,2,3,4)
    >y<- c(4,5,6,7)
    >x+y
    [1] 5 7 9 11
    >x-y
    [1] -3 -3 -3 -3
    >z <- (4,4,4,4,4,4,4)
    >x+z
    [1] 5 6 7 8 5 6 7
    When you have 2 vectors with unequal length and you need to perform an operation on both, then the shorter vector will be used again and again to match the length of both the vectors
  6. Define Index in Vector?
    Ans.) Vector in index is used to give the element at that position of the vector. Few programming language starts the index with 0 and other starts with 1. R counts the index from 1. There are many possibilities while putting an index number i.e.
    a. positive and in range index
    > x<- (1,3,4,5)
    >x[2]
    [1] “3” b. out of range
    > x <- (2,3,4,5)
    > x[110]
    [1] NA c. negative index – It removes this element and replies back with all the left numbers
    >x<- (3,4,5,6,7)
    >x[-3]
    [1] “3” “4” “6” “7” d. range of values
    >x <- (3,4,5,6,7,8)
    >x[2:5]
    [1] “4” “5” “6” “7” e. duplicate index
    > x<- (3,4,5,6,7)
    > s[c(2,1,2,3)]
    [1] “4” “3” “4” “5” f. logical index – If you want to select a particular group of index number, then you should use logical operators i.e. TRUE and FALSE
    >x<- (2,3,4,5,6)
    > s[c(TRUE,FALSE,FALSE,TRUE,TRUE)
    [1] “2” “5” “6”
  7. A list, as the name suggests is a number of vectors collected together. Suppose, you have a number vector, a character vector, a Boolean vector and some numbers. You want to combine it into one which obviously won’t have the same data type. So you need to create a list > n = c(2, 3, 5)
    > s = c(“a”, “b”, “c”, “d”, “e”)
    > b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
    > x = list(n, s, b, 3)
    > print(x)
    [[1]]
    [1] 2 3 5 [[2]]
    [1] “a” “b” “c” “d” “e” [[3]]
    [1] TRUE FALSE TRUE FALSE FALSE [[4]]
    [1] 3
  8. What is a Matrices ?
    Ans. )
    A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix function.
    Example.
    # Matrix creation
    > M=matrix(c(1,2,3,4,5,6), nrow=2, ncol=3, byrow=TRUE)
    print(M)
    [1]                    [2]              [3]
    [1]  1                      2                    3
    [2]  4                      5                    6 where
    nrow = number of rows in the matrix
    ncol = number of columns in the matrix
    byrow = TRUE/FALSE will get you value first by row or column
  9. What is an Array?
    Ans.)
    Array is a super set of Matrices. On one hand the matrices can be of 2 dimension but array can be of any number of dimensions.
    Example.
    > a <- array(c(“car”,”bike”), dim(3,3,2))
    > print(a) , , 1
    [,1]     [,2]     [,3]
    [1,] “car”  “bike” “car”
    [2,] “bike” “car”  “bike”
    [3,] “car”  “bike” “car”
    , , 2
    [,1]     [,2]     [,3]
    [1,] “bike” “car”  “bike”
    [2,] “car”  “bike” “car”
    [3,] “bike” “car”  “bike” >my_array<- array(1:24, dim=c(3,4,2))
    >my_array
    , , 1
    [,1] [,2] [,3] [,4]
    [1,]  1     4     7    10
    [2,]  2     5     8    11
    [3,]  3     6     9    12
    , , 2
    [,1] [,2]  [,3] [,4]
    [1,]  13   16   19   22
    [2,]  14   17   20   23
    [3,]  15   18   21   24
  10. What is a factor?
    Ans.)
    Factors are the r-objects which are created using a vector. Factors in R are stored as a vector of integer values with a corresponding set of character values to use when the factor is displayed. The factor function is used to create a factor. The only required argument to factor is a vector of values which will be returned as a vector of factor values. It stores the vector along with the distinct values of the elements in the vector as labels.
    Factors are created using the factor() function. The nlevelsfunctions gives the count of levels. Example
    #First let’s create a vector
    >vector_example<- c(‘a’,’b’,’c’,’a’,’a’)
    #Now create a factor object
    >factor_example<- factor(vector_example)
    >print(factor_example)
    [1] a b c a a
    >print(nlevels(factor_example))
    [1] 3 nlevels gives you the number of distinct values in the vector.
  11. What is the difference between Matrix and an array ?
    Ans.)
    Matrix can have only 2 dimensions where as an array can have as many dimensions as you want. Matrix is defined with the help of data, number of rows, number of columns and whether the elements are to be put in row wise or column wise.
    In array you need to give the dimension of the array. An array can be of any number of dimensions and each dimension is a matrix. For example a 3x3x2 array represents 2 matrices each of dimension 3×3.
  12. What is a data frame?
    Ans.)
      Data frame is a list of vectors  of equal length. It can consist of any vector with a particular data type and can combine it into one. So, a data frame can have a vector of logical and another of numeric. The only condition being that all the vectors should have the same length.
    Example. #This is how the data frame is created
    >student_profile<- data.frame(
    name<-c(“Amit”, “Sumit”, “Ajay”)
    age <- c(22,23,24)
    class <- c(6,7,8)
    )
    print(student_profile) The above code will create 3 columns with the column name as name, age and class.
  13. What is the difference between a matrix and a dataframe?
    Ans.)
    A dataframe can contain vectors with different inputs and a matrix cannot. (You can have a dataframe of characters, integers, and even other dataframes, but you can’t do that with a matrix.A matrix must be all the same type.)
    So, the data frame can have different vector of character, numbers, logical, etc. and it is still cool. But, for matrix you need only one type of data type. Phewww !!
  14. Define repeat loop.
    Ans. )
    Repeat loop executes a sequence of statement multiple times. It don’t put the condition at the same place where we put the keyword repeat.
    Example > name <- c(“Pappu”, “John”)
    > temp <- 5
    > repeat {
    print(name)
    temp <- temp+2 if(temp > 11) {
    break
    }
    } So, this will return the name vector 4 times. First it prints the name and then increase the temp to 7 and so on.
  15. Define while loop.
    Ans.)
    In the while loop the condition is tested and the control goes into the body only when the condition is true

    Example

    > name <- c(“Pappu”, “John”)
    > temp <- 5
    > repeat (temp<11) {
    print(name)
    temp <- temp+2
    } The name will be printed 4 times
  16. Define the for loop.
    Ans.)
    The for loop are not limited to integers. You can pass character vectors, logical vectors, lists or expressions.
    Example.
    > x<- LETTERS[1:2]
    for ( i in x) {
    print(i)
    }
    [1] “A”
    [2] “B”
  17. What is the use of sort() function? How to use the function to sort in descending order?
    Ans.)
    Elements in a vector can be sorted using the function sort()
    Example. > temp <- c(3,5,2,6,7,1)
    >sort_temp<- sort(temp)
    > print(sort_temp)
    [1] 1 2 3 5 6 7
    >rev_sort<- sort(temp, decreasing = TRUE)
    [1] 7,6,5,3,2,1 This function also works with the words.
  18. Create a list which holds a vector, a matrix and a list.
    Ans.)
    example_list<- list(c(“Kamal”,”Nitin”), matrix(c(1,2,3,4,5,6), nrow = 2), list(“red”,1))
  19. Determine the output of the following function f(2). b <- 4
    f <- function(a)
    {
    b <- 3
    b^3 + g(a)
    }
    g <- function(a)
    {
    a*b
    }
    Ans.)
    The global variable b has a value 4. The function f has an argument 2 and the function’s body has the local variable b with the value 3. So function f(2) will return 3^3 + g(2) and g(2) will give the value 2*4 = 8 where 4 is the value of b.
    Thus, the answer is 35
  20. What is the output of runif(10)?
    Ans.)
    runif() function is used to generate random values and the argument gives the number of values required. So the above function will generate 10 random values between 0 and 1.
  21. Get all the data of the person having maximum salary.
    Ans.)

    max_salary_person<- subset(data, salary == max(salary))
    print(max_salary_person)
  22. Get all the people who works with TCS and have salary more than 300000
    Ans.)

    TCS_data_salary<- subset(data, company == “TCS”  & salary > 300000)
  23. How is data reshaping done in R?
    Ans.)
    Data reshaping involves various techniques which is used according to the need. It’s not a procedure you need to follow, but independent methods to remould the data set. Following are the methods used:-
    a. cbind()
    b. rbind()
    c. new_column_name.data_frame_name
    d. merge()
    e. melt()
    f. cast()
  24. How to get outer join, left join, right join, inner join and cross join?
    Ans.)
    outer join – merge(x=df1, y=df2, by = “id”, all = TRUE)
    left join – merge(x=df1, y=df2, by=”id”, all.x = TRUE)
    right join – merge(x=df1, y=df2, by = “id”, all.y = TRUE)
    inner join – merge(x=df1, y=df2, by = “id”)
    cross join – merge(x=df1, y=df2, by = NULL)
  25. When you are reshaping the data, you sometimes need to melt the data. Explain melt() function
    Ans.)
    Suppose you have a data set which havecompany_name, age, salary, children. So when you want to have the data where you need the data grouped by company_name and then under company_name grouped by age. This whole process is called melting the data and it is performed with melt() function
    Example. new_data_set.previous_data_set<- melt(previous_data_set, id=c(“company_name”,”age”))
  26. What is lapply() function in R?
    Ans.)
    lapply() function is used when you want to apply a function to each element of a list in turn and get a list back.
    Example.
    x<- list(a=1, b=1:3, c=10:100)
    lapply(x,FUN=length)
    $a
    [1] 1
    $b[1] 3

    $c

    [1] 91

    You can use other functions like max, min, sum, etc.
  27. What is sapply() function in R?
    Ans.)
    sapply() function is used when you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.
    Vector
     is useful sometimes because it will get you a set of values and you can easily perform an operation on it.
    Example.
    x <-list(a =1, b =1:3, c =10:100)#Compare with above; a named vector, not a list sapply(x, FUN = length)a  b  c   1391 sapply(x, FUN = sum)a    b    c    165005
  28. What is the difference between lapply and sapply?
    Ans.)
    If the programmers want the output to be a data frame or a vector, then sapply function is used whereas if a programmer wants the output to be a list then lapply is used
  29. How to apply mean function in R?
    Ans.)
    Mean is calculated by taking the sum of numbers and dividing it with the total number of elements. The function mean() is used to apply this in R.
    Syntax
    mean(x, trim=0,na.rm=FALSE) The mean() function have 3 arguments
    a.) x contains the vector on which mean is to be applied
    b.) trim = 0, It is used to drop some observations from each end of the sorted array.
    c.) na.rm is used to remove the missing values from the input vector If there are missing values in the vector then mean will return NA as a result, so in order to drop the missing values to get a mean, you should put na.rm=TRUE which means remove the missing values.
  30. How to make scatterplot in R?
    Ans.)
    Scatterplot is a graph which shows many points plotted in the Cartesian plane. Each point holds 2 values which are present on the x and y axis. The simple scatterplot is plotted using plot() function.
    The syntax for scatterplot is:-

    plot(x, y ,main, xlab, ylab, xlim, ylim, axes)

    Where
    x is the data set whose values are the horizontal coordinates
    y is the data set whose values are the vertical coordinates
    main is the tile in the graph
    xlab and ylab is the label in the horizontal and vertical axis
    xlim and ylim are the limits of values of x and y used in the plotting
    axes indicates whether both axis should be there on the plot plot(x =input$wt,y=input$mpg,xlab=”Weight”,ylab=”Milage”,xlim= c(2.5,5),ylim= c(15,30),main=”Weight vsMilage”)
  31. Bonus Question
    How to write a countdown function in R?
    Ans.)

    timer<- function(time){print(time)while(time!=0)  {Sys.sleep(1)time<- time – 1print(time)  }}countdown(5)
    [1] 5
    [2] 4
    [3] 3
    [4] 2
    [5] 1
  32. Vector v is c(1,2,3,4) and list x is list(5:8), what is the output of v*x[1]?
    Ans.)

    Error
  33. Vector v is c(1,2,3,4) and list x is list(5:8), what is the output of v*x[[1]]?
    Ans.)

    [1] 5 12 21 32
  34. What are some of the functions that R have?
    Ans.)
    The functions present in R are:-
    a. Mean
    b. Median
    c. Distribution
    d. Covariance
    e. Regression
    f. GAM
    g. GLM
    h. Non-linear
    i. Mixed Effects etc
TheDataMonk has compiled the top 100 R questions which you must prepare before any R interview focusing on a Data Science role. You can buy it on Amazon. Links below
i. Amazon
ii. Amazon India

Or you can mail me at nitinkamal132@gmail.com and get your copy free 🙂

SQL interview questions for Data Science and Business Analyst role

Before you start with the questions, make sure you have your basics clear. Look for explanation if you don’t have enough context about some syntax or function. We have provided the table schema wherever required, in case the schema is not provided, please assume the table and column names.

1. How to find the third highest salary in an employee table with employee number and employee salary?

SELECT * FROM Employee_Table t1
WHERE 3 =
(SELECT Count(distinct Salary) from Employee_Table t2 WHERE t1.Salary <= t2.Salary);

2. Write a query to find maximum salary of each department in an organisation.

SELECT Department_Name, Max(Salary)
FROM Department_Table
GROUP BY Department_Name

3.What is wrong with the following query?
SELECT Id, Year(PaymentDate) as PaymentYear
FROM Bill_Table
WHERE PaymentYear > 2018;

Though the variable PaymentYear has already been defined in the first line of the query, but this is not the correct logical process order. The correct query will be
SELECT Id, Year(PaymentDate) as PaymentYear
FROM Bill_Table
WHERE Year(PaymentDate) > 2018;

4. What is the order of execution in a query?

The order of query goes like this:-
FROM – Choose and join tables to get the raw data
WHERE – First filtering condition
GROUP BY – Aggregates the base data
HAVING – Apply condition on the base data
SELECT – Return the final data
ORDER BY – Sort the final data
LIMIT – Apply limit to the returned data

5.What is ROW_NUMBER() function?

It assigns a unique id to each row returned from the query ,even if the ids are the same. Sample query:-

SELECT emp.*,
row_number() over (order by salary DESC) Row_Number
from Employee emp;

Employee NameSalaryRow_Number
Amit70001
Bhargav60002
Chirag60003
Dinesh50004
Esha30005
Farhan30006

Even when the salary is the same for Bhargav and Chirag, they have a different Row_Number, this means that the function row_number just gives a number to every row

6. What is RANK() function?

RANK() function is used to give a rank and not a row number to the data set. The basic difference between RANK() and ROW_NUMBER is that Rank will give equal number/rank to the data points with same value. In the above case, RANK() will give a value of 2 to both Bhargav and Chirag and thus will rank Dinesh as 4. Similarly, it will give rank 5 to both Esha and Farhan.

SELECT emp.*,
RANK() over (order by salary DESC) Ranking
from Employee emp;

7. What is NTILE() function?

NTILE() function distributes the rows in an ordered partition into a specific number of groups. These groups are numbered. For example, NTILE(5) will divide a result set of 10 records into 5 groups with 2 record per group. If the number of records is not divided equally in the given group, the function will set more record to the starting groups and less to the following groups.

SELECT emp.*,
NTILE(3) over (order by salary DESC) as GeneratedRank
from Employee emp

This will divide the complete data set in 3 groups from top. So the GeneratedRank will be 1 for Amit and Bhargav, 2 for Chirag and Dinesh: 3 for Esha and Farhan

8. What is DENSE_RANK() ?

This gives the rank of each row within a result set partition, with no gaps in the ranking values. Basically there is no gap, so if the top 2 employees have the same salary then they will get the same rank i.e. 1 , much like the RANK() function. But, the third person will get a rank of 2 in DENSE_RANK as there is no gap in ranking where as the third person will get a rank of 3 when we use RANK() function. Syntax below:-

SELECT emp.*,
DENSE_RANK() OVER (order by salary DESC) DenseRank
from Employee emp;

9. Write a query to get employees name starting with vowels.

SELECT EmpID,EmpName
FROM Employee
where EmpName like ‘[aeiou]%’

10. Write a query to get employee name starting and ending with vowels.

SELECT EmpID,EmpName
FROM Employee
where EmpName like ‘[aeiou]%[aeiou]’

11. What are the different types of statements supported in SQL?

There are three types of statements in SQL:-
a. DDL – Data Definition Language
b. DML – Data Manipulation Language
c. DCL – Data Control Language

12. What is DDL?

It is used to define the database structure such as tables. It includes 3 commands:-
a. Create – Create is for creating tables
CREATE TABLE table_name (
    column1 datatype,
    column2 datatype,
    column3 datatype,
   ….
);
b. Alter – Alter table is used to modifying the existing table object in the database.
ALTER TABLE table_name
ADD column_name datatype
c. Drop – If you drop a table, all the rows in the table is deleted and the table structure is removed from the database. Once the table is dropped, you can’t get it back

13. What is DML?

Data Manipulation Language is used to manipulate the data in records. Commonly used DML commands are Update, Insert and Delete. Sometimes SELECT command is also referred as a Data Manipulation Language.

14. What is DCL?

Data Control Language is used to control the privileges by granting and revoking database access permission

15. What is the difference between DELETE and TRUNCATE?

16. What are the important SQL aggregate functions?
a. AVG()
b. COUNT()
c. MAX()
d. MIN()
e. SUM()

17. What is Normalization? How many Normalization forms are there?

Normalization is used to organize the data in such manner that data redundancy will never occur in the database and avoid insert, update and delete anomalies.
There are 5 forms of Normalization
First Normal Form (1NF): It removes all duplicate columns from the table. Creates a table for related data and identifies unique column values
Second Normal Form (2NF): Follows 1NF and creates and places data subsets in an individual table and defines the relationship between tables using a primary key
Third Normal Form (3NF): Follows 2NF and removes those columns which are not related through primary key
Fourth Normal Form (4NF): Follows 3NF and do not define multi-valued dependencies. 4NF also known as BCNF

18. How to fetch only common records between two tables?

SELECT * FROM Employee
INTERSECT
SELECT * FROM Employee1

19. Full Outer Join is a combination of which of the following:-
a. Left Outer and Left Inner Join
b. Left Outer and Right Inner Join
c. Left Outer and Right Outer Join
d. Left Outer and Right Outer Join

A. Full Outer Join is a combination of Left Outer and Right Outer Join in SQL

20. Right Outer Join is similar to:-
a. Right Inner Join
b. Left Inner Join
c. Left Outer Join
d. Right Outer Join

A. Right Outer Join is similar to Left Outer Join in SQL

If you want to practice SQL questions before an interview or to hone up your SQL skills, you can either practice at www.sqlzoo.net or you can buy 112 Questions to Crack Business Analyst Interview from Amazon.