Data Science Archives - The Data Monk

Daily Data Science Quiz

Python questions for Data Science Quiz
Today we will talk only about Python. In fact, it is less about the language and more about the logic behind these questions. Even if you don’t know a concept, try to google it or derive the logic yourself.

Don’t, please don’t think that these questions are not asked in Data Science interviews.

If you have not created your account, you might be missing out on competing with 2400+ Analytics Aspirants

We had a brief discussion with our mentors and realized that Python has recently picked up in interviews. It’s not that Python questions were not asked earlier, but they were not at this level.
You can very well expect an online coding pad where you have to write and run ready-to-go Python and SQL solutions. Also, basic programming questions are back in action, e.g. palindrome, sort without a built-in function, anagram, factorial, asterisk patterns, etc.
Since we will soon be referring you to some companies on the basis of your points and rank, we want you to prepare specifically for the Python round.

Python questions for Data Science

Day 9 Questions

Get the output of this Python code
Sum of Digits without using default functions
Create an even and odd list from one single list
Unfold a list within a list to one single list
Calculate Factorial of a number
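
To give a flavour of the logic these Day 9 questions are after, here is a minimal Python sketch of four of them. The function names are our own choice for illustration, and interviewers may expect a different approach (for example recursion for the factorial).

```python
def digit_sum(n: int) -> int:
    """Sum the decimal digits of n without using str() or sum()."""
    n = abs(n)
    total = 0
    while n > 0:
        total += n % 10   # take the last digit
        n //= 10          # drop the last digit
    return total


def split_even_odd(values):
    """Return (evens, odds) built from one list, preserving order."""
    evens, odds = [], []
    for v in values:
        (evens if v % 2 == 0 else odds).append(v)
    return evens, odds


def flatten(nested):
    """Unfold a list that may contain lists into one flat list."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))   # recurse into inner lists
        else:
            flat.append(item)
    return flat


def factorial(n: int) -> int:
    """Iterative factorial; undefined for negative numbers."""
    if n < 0:
        raise ValueError("factorial is undefined for negative numbers")
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result


print(digit_sum(1234), split_even_odd([1, 2, 3, 4, 5, 6]))
print(flatten([1, [2, [3, 4]], 5]), factorial(5))
```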

Extra Important Links

How to become a Data Scientist?
Strategies to get a call from Top 50 Analytics Companies

SQL 300 Solved Interview Questions
R 20 Solved Interview Questions
Python 300 Solved Interview Questions
Machine Learning 200 Solved Interview Questions
Statistics 200 Solved Interview Questions
Completely solved 100 Case Studies and Guesstimates

Daily Quiz Repository

Daily Quiz Day 1 Questions
Daily Quiz Day 2 Questions
Daily Quiz Day 3 Questions
Daily Quiz Day 4 Questions
Daily Quiz Day 5 Questions
Daily Quiz Day 6 Questions
Daily Quiz Day 7 Questions
Daily Quiz Day 8 Questions

The Data Monk services

We are well known for our interview books and have 70+ e-books across Amazon and The Data Monk e-shop page. Following are the best-seller combo packs and services that we currently provide:

YouTube channel covering all the interview-related important topics in SQL, Python, MS Excel, Machine Learning Algorithm, Statistics, and Direct Interview Questions
Link – The Data Monk Youtube Channel
Website – ~2000 completely solved interview questions in SQL, Python, ML, and Case Study
Link – The Data Monk website
E-book shop – We have 70+ e-books available on our website and 3 bundles covering 2000+ solved interview questions. Do check it out
Link – The Data E-shop Page
Instagram Page – It covers only the most-asked questions and concepts: 100+ interview topics explained in simple terms
Link – The Data Monk Instagram page
Mock Interviews/Career Guidance/Mentorship/Resume Making
Book a slot on Top Mate

The Data Monk e-books

We know that each domain requires a different type of preparation, so we have divided our books in the same way:

1. 2200 Interview Questions to become Full Stack Analytics Professional – 2200 Most Asked Interview Questions
2. Data Scientist and Machine Learning Engineer – 23 e-books covering all the ML Algorithms Interview Questions
3. 30 Days Analytics Course – Most Asked Interview Questions from 30 crucial topics

You can check out all the other e-books on our e-shop page – Do not miss it

For any information related to courses or e-books, please send an email to nitinkamal132@gmail.com

Daily Quiz to crack Data Science Interview | Day 10

Case Study for Analytics interviews

Day 10 Quiz

SQL – Concepts of RDBMS, Dataware House and Database
Python – Map and Hash Map
Statistics – How do you create a sample data of 1000 rows from a population of 1 Million rows and 100 columns?
Case Study – What are the factors to consider if you work in the sales department of Samsung and you want to start a store in one of the most crowded malls of Bangalore?
Machine Learning – Regularization in Python


Feature Engineering in Data Science

Have you ever wondered why two different people get different accuracy while using the same algorithm?
We all know that XGBoost can help us get very good results in hackathons, yet only a few people achieve a decent rank with it. Why?

Well! The answer is feature engineering, i.e. creating more features from the fixed data set you are given.

Feature engineering is the art of extracting more information from existing data. You are not adding any new data here, but you are making the data you already have more useful.

Let’s take an example.
Consider the Titanic dataset (one of the most used data sets in the Data Science domain).
Problem Statement – Given the name, age, class, sex, cabin type, and number of family members of each passenger traveling on the Titanic, can you predict which passengers survived and which did not?
It is clearly a supervised learning problem: you already have a training data set with the output.
All you need to do is predict the output for a test data set.

We are not going too deep into the solution. You can find the solution here.

What we want to discuss are the opportunities to create new columns.
I have seen people add the following types of columns to the data set. Before reading further, remember that it’s not about how good the new feature is; it’s about whether you can think out of the box.

Columns created by different solution submitters:
1. Title of the passenger (Dr., Mr., Miss, etc.)
2. Blocks of ages rather than the actual age
3. With or without wife – a binary variable indicating whether the person was traveling with his wife
4. Number of children traveling
5. Number of letters in the name – yes, people did use the length of the name to test whether it was useful. Not good enough, but brave enough 🙂
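
As a rough illustration of how a few of these columns could be derived, here is a small plain-Python sketch. The field names (`name`, `age`), the age cut-offs, and the helper `engineer_features` are our own assumptions for illustration; the name is assumed to follow the "Surname, Title. First names" pattern.

```python
import re


def engineer_features(passenger: dict) -> dict:
    """Add title, age band and name length to one passenger record.

    Assumes hypothetical keys "name" and "age" (age may be None).
    """
    name = passenger["name"]

    # The title sits between the comma and the first period,
    # e.g. "Braund, Mr. Owen Harris" -> "Mr"
    match = re.search(r",\s*([^.]+)\.", name)
    title = match.group(1).strip() if match else "Unknown"

    # Age blocks instead of the raw age (the cut-offs are arbitrary choices)
    age = passenger["age"]
    if age is None:
        age_band = "unknown"
    elif age < 13:
        age_band = "child"
    elif age < 20:
        age_band = "teen"
    elif age < 60:
        age_band = "adult"
    else:
        age_band = "senior"

    # Number of letters in the name (spaces and punctuation ignored)
    name_letters = sum(ch.isalpha() for ch in name)

    return {**passenger, "title": title, "age_band": age_band,
            "name_letters": name_letters}


features = engineer_features({"name": "Braund, Mr. Owen Harris", "age": 22})
print(features)
```

Whether any of these new columns actually helps is something only model evaluation on held-out data can tell you.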

Why create more variables when we already have a handful?
The performance of a predictive model depends heavily on the quality of the features in the dataset used to train it. If you can create new features that give the model more information about the target variable, its performance will go up.
Spend a considerable amount of time on pre-processing and feature engineering, since this can make a huge difference in the scores.

Better features means flexibility.

You can choose “the wrong models” (less than optimal) and still get good results. Most models can pick up on good structure in data. The flexibility of good features will allow you to use less complex models that are faster to run, easier to understand and easier to maintain. This is very desirable.

Better features means simpler models.

With well engineered features, you can choose “the wrong parameters” (less than optimal) and still get good results, for much the same reasons. You do not need to work as hard to pick the right models and the most optimized parameters.

With good features, you are closer to the underlying problem and to a representation of all the data you have available and could use to best characterize that problem.

How to do feature engineering?

The feature engineering process might look as follows:

Brainstorm features: Really get into the problem, look at a lot of data, study feature engineering on other problems and see what you can steal.
Devise features: Depends on your problem, but you may use automatic feature extraction, manual feature construction and mixtures of the two.
Select features: Use different feature importance scorings and feature selection methods to prepare one or more “views” for your models to operate upon.
Evaluate models: Estimate model accuracy on unseen data using the chosen features.

You can start with any hackathon at Analytics Vidhya and try to create more and more columns, feed your algorithm with these variables and evaluate the model.

Keep Learning 🙂

The Data Monk

Correlation and Collinearity explained in layman terms

Correlation tells you how two numerical variables relate to each other. It tells you whether data points with a higher-than-average value for one variable also tend to have a higher-than-average value for the other (positive correlation), a lower-than-average value (negative correlation), or whether there is no such relationship (correlation close to zero).

Some examples:

Height of a person and weight of a person have a high correlation: tall people tend to be heavier than shorter people. The correlation will be positive.
Height of a person and the number formed by the last 4 digits of their phone number are uncorrelated (the correlation will be close to 0) because they are independent of each other.
The traffic density (number of cars driving at a given time) is negatively correlated with the average speed: if there is more traffic, there will be longer queues at traffic lights and more people taking turns or moving in and out of traffic, so everyone moves more slowly.
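
To make the sign of the correlation concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The numbers are made up purely for illustration and `pearson_r` is our own helper name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / sqrt(var_x * var_y)


# Made-up illustrative data
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 88]
cars_per_minute = [5, 10, 20, 30, 40]
avg_speed_kmh = [60, 52, 40, 31, 20]

print(pearson_r(heights_cm, weights_kg))          # close to +1
print(pearson_r(cars_per_minute, avg_speed_kmh))  # close to -1
```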

What is the difference between collinearity and correlation?

Correlation means two variables vary together: when one changes, the other tends to change as well. The correlation coefficient does not tell you how large that change is (the slope of the relationship), only how noisy the relationship is and its direction.

Correlation is an operator, meaning that we can talk about the correlation between height and weight. The correlation can be positive, negative, or 0.

Collinearity is a phenomenon related to regression, in which some of the predictor variables are highly correlated among themselves. This makes one or more of these variables redundant in our analysis. For example: if you wish to regress “Household expenditure” on “Household income” and “Tax paid in the last year”, the income and tax paid will be highly correlated (or there will be collinearity in this setup). It would be best to regress “Expenditure” on either “income” or “tax paid”.

If, in multiple regression analysis, one of the predictors is linearly dependent on another predictor, the issue is known as collinearity.

For example, let’s consider the linear model

Y = α + β1*x1 + β2*x2 … (1)

If predictor x1 can be expressed as a linear function of x2, say x1 = 3*x2,

then there is collinearity among the predictors. Note that the correlation between the predictors will then be perfect (or very high), which goes against the linear regression assumption that the predictors are not linearly dependent on each other.

Essentially, it means that one of the independent variables is not really necessary to the model because its effect is already captured by the other variables. This variable contributes nothing extra to the predictions and can be removed. With true collinearity (perfect correlation, as in the example above), some software such as R automatically drops one of the predictors, while other software shows an error or warning.
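
Why does perfect collinearity break the regression? Ordinary least squares has to invert the matrix XᵀX built from the predictor columns, and that matrix has determinant zero when one column is a multiple of another. The sketch below checks this for x1 = 3*x2; it ignores the intercept to keep the matrix 2x2, and the data and the helper `gram_determinant` are made up for illustration.

```python
def gram_determinant(a, b):
    """Determinant of the 2x2 matrix X^T X for predictor columns a and b
    (no intercept column, to keep the example small)."""
    aa = sum(p * p for p in a)
    bb = sum(q * q for q in b)
    ab = sum(p * q for p, q in zip(a, b))
    return aa * bb - ab * ab


x2 = [1.0, 2.0, 3.0, 4.0, 5.0]
x1 = [3 * v for v in x2]               # perfectly collinear: x1 = 3*x2
x1_noisy = [3.1, 5.8, 9.2, 11.9, 15.3]  # roughly 3*x2, plus some noise

det_collinear = gram_determinant(x1, x2)
det_noisy = gram_determinant(x1_noisy, x2)

print(det_collinear)  # zero: X^T X cannot be inverted, no unique coefficients
print(det_noisy)      # small but positive: solvable, but estimates are unstable
```

A determinant that is tiny relative to the entries of XᵀX is the near-collinear case, where coefficients can still be computed but their variances blow up.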

The effects of collinearity are seen in the variances of the parameter estimates, not in the parameter estimates themselves.

How could I test whether a calculated correlation coefficient between two variables is meaningful or not?

The correlation coefficient R lies between -1 and +1.

As a rule of thumb, if |R| >= 0.75 we say that the variables are highly correlated. Similarly, |R| <= 0.25 indicates poor correlation and 0.25 < |R| < 0.75 moderate correlation.

The coefficient of determination R squared measures the ratio of explained variation (in one variable due to the change on other) to total variation.

For example, if R = 0.8 (high correlation), then R squared = 0.64.

Hence only 64% of the variation in one variable is explained by the other variable. The rest of the variation (36%) is caused by other factors.

So it is advisable to interpret your result after calculating R squared.

Also, great care should be taken (using ROL, expert opinion, or judgement based on common sense) when making a decision on the basis of these two measures, because we may sometimes get a high value of R and R squared between two variables just by chance. For example, the amount of rain in a particular city over the last year may correlate with the number of deaths due to cancer in that city.

For more details, look up nonsense correlation / spurious correlation.

Keep Learning 🙂

The Data Monk

Data Science vs Big Data vs Data Analytics vs Business Analyst

We often come across a few terms that sound alike but are poles apart. The same goes for Data Science, Big Data, Data Analytics, and Business Analyst. So if you are confused about the role an employer is offering you, this article is for you.


Data Science deals with a lot of mathematics. This domain makes sure that you are sound in statistics and model implementation.

Requirement – Good at mathematics, thoroughly hands-on with Python/R, and expertise in at least a couple of areas (predictive modelling, NLP, clustering, neural networks, etc.). A degree in Mathematics/Statistics definitely helps. Rohan Rao, one of the best Data Scientists in India, did his master’s in Statistics at IIT Bombay.

Things to do to become a Data Scientist – Concentrate on algorithms and hackathons. Make your own winning combination and don’t forget to use XGBoost 😛

Big Data Specialist – Big data is a humongous amount of data stored in one place. A Big Data Specialist knows which technologies would collapse in the future and builds scalable infrastructure to cater to high volumes of data.

Requirement – A Big Data Specialist should have a good amount of experience in handling multiple terabytes of data per day. This comes with experience; you can’t learn it in a classroom course. A Big Data Specialist should have sound knowledge of building data pipelines and deploying the algorithms/solutions created by Data Scientists, and should make the life of a Data Analyst easier 😛

Data Analyst – A Data Analyst works on providing valuable insights to the business. SQL is the bread and butter of a Data Analyst, who is responsible for writing optimised and efficient code to cater to business requests.

A Data Analyst should have a decent knowledge of Data Science algorithms, which helps in understanding the data and providing meaningful insights. A little mathematics never hurts.

Requirement – SQL, Python/R, PowerBI/Tableau, Statistics

Business Analyst – The job of a Business Analyst is to consume the solutions provided by the DA, DS, and BDS. A Business Analyst should have a decent knowledge of SQL and MS Excel in order to crunch the numbers and, above all, should be able to consume the insights and take decisions based on the data.

Requirement – Knack for solving complex business problems, SQL, MS Excel, and good communication skills

Salary wise (Person with 3 years of experience)

Data Science > Big Data Specialist > Data Analyst ~ Business Analyst
(This is too subjective and highly debatable)

Do look into the Job Description and Profile offered before going for an interview 🙂

Keep Learning 🙂
The Data Monk

Affine Analytics Interview Questions | Day 17

Company – Affine Analytics
Location – Bangalore
Position – Senior Business Analyst
Experience – 3+ years

Compensation – Best in the industry

Affine Analytics Interview Questions

Number of Rounds – 4

I received a call from the Technical HR who scheduled the telephonic round for the next day

Round 1 – Telephonic Round (Mostly SQL and Project)
I was asked to introduce myself and then the discussion went towards my recent project at Mu Sigma. We had a good discussion on Regression Techniques, a bit on statistics.

The project description was followed by a few questions on SQL (the answers to these questions are present in various articles on the website; links are at the end of the interview).

1. What is the order of SQL query execution?
2. You have two tables with one column each. Table A has 5 rows and all the values are 1, i.e. 1,1,1,1,1, and Table B has 3 rows and all the values are 1, i.e. 1,1,1.

How many rows will there be if we do each of the following joins?
1. Left Join
2. Right Join
3. Inner Join
Answer:
https://thedatamonk.com/day-4-sql-intermediate-questions/

3. A quick guesstimate of the number of iPhones sold in India per year

Hint in the below link – https://thedatamonk.com/guesstimate-3-what-are-the-number-of-smartphones-sold-in-india-per-year/

4. What is a RANK() function? How is it different from ROW_NUMBER()?
https://thedatamonk.com/question/affine-analytics-interview-questions-what-is-a-rank-function-how-is-it-different-from-row_number/

5. How to fetch only even rows from a table?

Link to Question 4 and 5 – https://thedatamonk.com/day-11-sql-tricky-questions/
https://thedatamonk.com/question/affine-analytics-interview-questions-how-to-fetch-only-even-rows-from-a-table/

6. What are the measures of Central Tendency?
https://thedatamonk.com/day-14-basic-statistics/
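
For Question 2 above (Table A with five 1s, Table B with three 1s), the row counts are easy to verify yourself. Here is a small sketch using Python's built-in `sqlite3` module; the table names `a` and `b` and the column name `val` are our own. Every row of A matches every row of B, so each join returns 5 x 3 = 15 rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (val INTEGER)")
conn.execute("CREATE TABLE b (val INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,)] * 5)
conn.executemany("INSERT INTO b VALUES (?)", [(1,)] * 3)


def count_rows(sql):
    """Count the rows returned by a SELECT statement."""
    return conn.execute(f"SELECT COUNT(*) FROM ({sql})").fetchone()[0]


inner_rows = count_rows("SELECT a.val FROM a INNER JOIN b ON a.val = b.val")
left_rows = count_rows("SELECT a.val FROM a LEFT JOIN b ON a.val = b.val")
# RIGHT JOIN is only available in newer SQLite versions, so we express
# "a RIGHT JOIN b" as the equivalent "b LEFT JOIN a".
right_rows = count_rows("SELECT a.val FROM b LEFT JOIN a ON a.val = b.val")

print(inner_rows, left_rows, right_rows)
```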

The telephonic round went for around 1 hour:-
Introduction – 10 minutes
Project – 30 minutes
Questions – 20 minutes

I was shortlisted for the further rounds.
Altogether, the face-to-face interviews were divided into 3 rounds:
Round 2 – SQL and R/Python
Round 3 – Statistics
Round 4 – Case Study and HR questions

Round 2 – SQL and R/Python
There were ~20 questions on SQL and some questions on the programming language.
Below are the questions which I remember:-
1. Optimising a SQL code
2. Doing a sum on a column with Null values
Hint – Use Coalesce
3. How to find the count of duplicate rows?
https://thedatamonk.com/question/affine-analytics-interview-questions-how-to-get-3-min-salaries/
4. Use of Lag function
Link – https://thedatamonk.com/day-5-sql-advance-concepts/
5. Life cycle of a project
6. How to find the second minimum salary?
7. How to get 3 Min salaries?
8. DDL, DML, and DCL commands
https://thedatamonk.com/day-13-sql-theoretical-questions/
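
A few of the SQL questions above can be tried out quickly with Python's built-in `sqlite3` module. This is only a sketch: the `emp` table and its columns are invented for illustration, and `LIMIT` is SQLite/MySQL/PostgreSQL syntax (SQL Server uses `TOP`). Note that `SUM()` already skips NULLs; COALESCE matters when a NULL would otherwise wipe out a row-level calculation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary INTEGER, bonus INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", [
    ("Amit", 50000, 5000),
    ("Aman", 40000, None),
    ("Rahul", 30000, None),
    ("Rahul", 30000, None),   # deliberate duplicate row
    ("Nihar", 60000, 2000),
])


def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


# salary + NULL is NULL, so rows without a bonus silently drop out
naive_total = scalar("SELECT SUM(salary + bonus) FROM emp")
# COALESCE replaces NULL with 0 before the addition
safe_total = scalar("SELECT SUM(salary + COALESCE(bonus, 0)) FROM emp")

# Count of duplicate rows: group on the columns and keep groups with > 1 row
duplicates = conn.execute(
    "SELECT name, salary, COUNT(*) FROM emp "
    "GROUP BY name, salary HAVING COUNT(*) > 1"
).fetchall()

# Second minimum salary: smallest salary above the overall minimum
second_min = scalar(
    "SELECT MIN(salary) FROM emp WHERE salary > (SELECT MIN(salary) FROM emp)"
)

# Three lowest distinct salaries
three_min = [row[0] for row in conn.execute(
    "SELECT DISTINCT salary FROM emp ORDER BY salary LIMIT 3"
)]

print(naive_total, safe_total, duplicates, second_min, three_min)
```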

There were a few more questions on joins and optimising inner queries.
Overall difficulty level- 8/10

There were 5 questions on Python/R –

Loading a csv/text file
Writing code of Linear Regression (As it was mentioned on my resume)
Doing a right join in either of the languages
Removing null values from a column
https://thedatamonk.com/question/affine-analytics-interview-questions-removing-null-value-from-a-column/

Round 3 – Statistics

How to calculate IQR?
What is positive skewness and negative skewness?
https://thedatamonk.com/question/affine-analytics-interview-questions-what-is-positive-skewness-and-negative-skewness/
What are the two types of regression?
What is multiple linear regression?
What is Logistic Regression?
What is p-value and give an example?

These questions were discussed in detail and I backed the explanations with real-life examples.

https://thedatamonk.com/day-18-statistics-interview-questions/

Bonus tips – Do look for good examples

Round 4 – Case Study and HR Questions

How many laptops are sold in Bangalore in a day?
https://thedatamonk.com/guesstimate-2-how-many-laptops-are-sold-in-bangalore-in-a-day/

Business Case Study – There is a mobile company which is very popular in other Asian countries. The company is planning to open its branch in the most popular mall of Bangalore.
What should be the strategy of the company?
How can you use freely available data to plan the marketing of the campaigns?
How can you use Digital marketing to create campaigns for the company?

https://thedatamonk.com/question/affine-analytics-interview-questions-business-case-study/

These questions were followed by:-
Why do you want to change the company?
How is the work in your current organisation?

I got the confirmation in 2 working days.

This was it

Amazon Interview Questions
Sapient Interview Questions

The full set of interview questions from these rounds is present in our book What do they ask in Top Data Science Interview Part 2: Amazon, Accenture, Sapient, Deloitte, and BookMyShow

You can get your hands on our e-books; we also have a 10 e-book bundle offer at Rs.549 where you get a total of 1400 questions.
Comment below or mail us at contact@thedatamonk.com for more information

1. The Monk who knew Linear Regression (Python): Understand, Learn and Crack Data Science Interview
2. 100 Python Questions to crack Data Science/Analyst Interview
3. Complete Linear Regression and ARIMA Forecasting project using R
4. 100 Hadoop Questions to crack data science interview: Hadoop Cheat Sheet
5. 100 Questions to Crack Data Science Interview
6. 100 Puzzles and Case Studies To Crack Data Science Interview
7. 100 Questions To Crack Big Data Interview
8. 100 Questions to Learn R in 6 Hours
9. Complete Analytical Project before Data Science interview
10. 112 Questions To Crack Business Analyst Interview Using SQL
11. 100 Questions To Crack Business Analyst Interview
12. A to Z of Machine Learning in 6 hours
13. In 2 Hours Create your first Azure ML in 23 Steps
14. How to Start A Career in Business Analysis
15. Web Analytics – The Way we do it
16. Write better SQL queries + SQL Interview Questions
17. How To Start a Career in Data Science
18. Top Interview Questions And All About Adobe Analytics
19. Business Analyst and MBA Aspirant’s Complete Guide to Case Study – Case Study Cheatsheet
20. 125 Must have Python questions before Data Science interview
21. 100 Questions To Understand Natural Language Processing in Python
22. 100 Questions to master forecasting in R: Learn Linear Regression, ARIMA, and ARIMAX
23. What do they ask in Top Data Science Interviews
24. What do they ask in Top Data Science Interviews: Part 1

Keep Learning

10 Questions, 10 Minutes – 1/100

This is something which has been on my mind for a long time. We will be picking 10 questions per day and simplifying them.
We will make sure that the reader can cover the complete article in 10 minutes. There will be 100 posts in the coming 3 months.

The articles/questions will revolve around SQL, Statistics, Python/R, MS Excel, Statistical Modelling, and case studies.

The questions will be a mix of these topics to help you prepare for interviews

You can also contribute by framing 10 questions and sending them to contact@thedatamonk.com or messaging me on LinkedIn.

The questions will be updated late at night (~1-2 a.m.) and will be posted on LinkedIn as well.

Let’s see how many we can solve in the next 100 posts

1. Write the syntax to create a new column using Row Number over the Salary column

SELECT *, ROW_NUMBER() OVER (Order By Salary) as Row_Num
FROM Employee

Output

Emp. ID	Name	Salary	Row_Num
232	Rakshit	30000	1
543	Rahul	30000	2
124	Aman	40000	3
123	Amit	50000	4
453	Sumit	50000	5

2. What is PARTITION BY clause?
The PARTITION BY clause divides the rows into groups (partitions), and the window function is computed separately within each group. If you partition by Salary in the above table, the row number restarts at 1 for each unique salary. Example below:-

SELECT *, ROW_NUMBER() OVER (PARTITION BY Salary ORDER BY Salary) as Row_Num
FROM Employee

Emp. ID	Name	Salary	Row_Num
232	Rakshit	30000	1
543	Rahul	30000	2
124	Aman	40000	1
123	Amit	50000	1
453	Sumit	50000	2

3. What is a RANK() function? How is it different from ROW_NUMBER()?
– The RANK() function assigns a rank to each row based on the value you order by. If there are equal values, the rank is repeated, and the next distinct value gets the rank it would have had without ties, so some ranks are skipped. Confused? Try out the example below:-

SELECT *, RANK() OVER (ORDER BY Salary) as Row_Num
FROM Employee

Output

Emp. ID	Name	Salary	Row_Num
232	Rakshit	30000	1
543	Rahul	30000	1
124	Aman	40000	3
123	Amit	50000	4
453	Sumit	50000	4

As you can see, the rank 2 has been skipped because there were two employees with the same Salary and the result is ordered in ascending order by default.

4. What is Dense Ranking?
– DENSE_RANK() is similar to the RANK() function but it does not skip any rank, so if two rows tie for rank 1, the next distinct value gets rank 2 and not 3.

Syntax:-
SELECT *, DENSE_RANK() OVER (ORDER BY Salary) as Row_Num
FROM Employee

Output:-

Emp. ID	Name	Salary	Row_Num
232	Rakshit	30000	1
543	Rahul	30000	1
124	Aman	40000	2
123	Amit	50000	3
453	Sumit	50000	3
432	Nihar	60000	4

5. What is NTILE() function?
– NTILE(n) is similar to a percentile: it splits the ordered rows into n groups of (as nearly as possible) equal size, so NTILE(3) will divide the data into 3 parts.

SELECT *, NTILE(3) OVER (ORDER BY Salary) as Ntile
FROM Employee

With 6 rows and 3 groups, each group gets 6/3 = 2 rows.

Emp. ID	Name	Salary	Ntile
232	Rakshit	30000	1
543	Rahul	30000	1
124	Aman	40000	2
123	Amit	50000	2
453	Sumit	50000	3
432	Nihar	60000	3
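
If you want to see ROW_NUMBER(), RANK(), DENSE_RANK() and NTILE() side by side, the sketch below runs them on the same six employees using Python's built-in `sqlite3` module. It assumes your SQLite build is version 3.25 or newer (the first with window functions). The column name `EmpID` and the `EmpID` tie-breaker in the ORDER BY are our additions so that tied salaries come out in a repeatable order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmpID INTEGER, Name TEXT, Salary INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)", [
    (232, "Rakshit", 30000), (543, "Rahul", 30000), (124, "Aman", 40000),
    (123, "Amit", 50000), (453, "Sumit", 50000), (432, "Nihar", 60000),
])

query = """
SELECT Name, Salary,
       ROW_NUMBER() OVER (ORDER BY Salary, EmpID) AS row_num,
       RANK()       OVER (ORDER BY Salary)        AS rnk,
       DENSE_RANK() OVER (ORDER BY Salary)        AS dense_rnk,
       NTILE(3)     OVER (ORDER BY Salary, EmpID) AS tile
FROM Employee
ORDER BY Salary, EmpID
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Reading across one row shows the difference at a glance: RANK() jumps from 1 to 3 after the tie, while DENSE_RANK() moves from 1 to 2.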

6. How to get the second highest salary from a table?
Select MAX(Salary)
from Employee
Where Salary NOT IN (SELECT MAX(Salary) from Employee)

7. Find the 3rd Maximum salary in the employee table
– SELECT DISTINCT sal
FROM emp e1
WHERE 3 = (SELECT COUNT(DISTINCT sal) FROM emp e2 WHERE e1.sal <= e2.sal);
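
Both salary queries can be checked with a few lines of Python and the built-in `sqlite3` module. In this sketch the sample data and the helper `nth_highest` are our own; the correlated subquery counts how many distinct salaries are greater than or equal to the current one, so the Nth highest salary is the one where that count equals N.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Name TEXT, Salary INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)", [
    ("Rakshit", 30000), ("Rahul", 30000), ("Aman", 40000),
    ("Amit", 50000), ("Sumit", 50000), ("Nihar", 60000),
])

# Second highest: the maximum among salaries that are not the maximum
second_highest = conn.execute("""
    SELECT MAX(Salary) FROM Employee
    WHERE Salary NOT IN (SELECT MAX(Salary) FROM Employee)
""").fetchone()[0]


def nth_highest(n):
    """Nth highest distinct salary, or None if there are fewer than n."""
    row = conn.execute("""
        SELECT DISTINCT e1.Salary
        FROM Employee e1
        WHERE ? = (SELECT COUNT(DISTINCT e2.Salary)
                   FROM Employee e2
                   WHERE e1.Salary <= e2.Salary)
    """, (n,)).fetchone()
    return row[0] if row else None


print(second_highest, nth_highest(3))
```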

8. Get all employee details from the EmployeeDetail table whose “FirstName” does not start with any single character between ‘a-p’
– In SQL Server, where LIKE supports bracketed character ranges:
SELECT *
FROM EmployeeDetail
WHERE FirstName LIKE '[^a-p]%'

9. How to fetch only even rows from a table?
-The best way to do it is by adding a row number using ROW_NUMBER() and then pulling the alternate row number using row_num%2 = 0

Suppose, there are 3 columns in a table i.e. student_ID, student_Name, student_Grade. Pull the even rows

SELECT *
FROM ( SELECT *, ROW_NUMBER() OVER (ORDER BY student_ID) as row_num FROM student) x
WHERE x.row_num%2=0

10. How to fetch only odd rows from the same table?
-Simply apply the x.row_num%2 <> 0 to get the odd rows

SELECT *
FROM ( SELECT *, ROW_NUMBER() OVER (ORDER BY student_ID) as row_num FROM student) x
WHERE x.row_num%2 <> 0

Let us know if you think I need to change any answer here.

Keep Learning 🙂

The Data Monk

Statistics Interview Questions

Q1. What is a Sample?

A. A data sample is a set of data collected and/or selected from a statistical population by a defined procedure. The elements of a sample are known as sample points, sampling units or observations.

Q2. Define Population.

A. In statistics, population refers to the total set of observations that can be made. For example, if we are studying the weight of adult women, the population is the set of weights of all the adult women in the world.

Q3. What is a Data Point?

A. In statistics, a data point (or observation) is a set of one or more measurements on a single member of a statistical population.

Q4. Explain Data Sets.

A. Data sets usually come from actual observations obtained by sampling a statistical population, and each row corresponds to the observations on one element of that population. Data sets may further be generated by algorithms for the purpose of testing certain kinds of software.

Q5. What is meant by the term Inferential Statistics?

A. Inferential statistics use a random sample of data taken from a population to describe and make inferences about the population. Inferential statistics are valuable when examination of each member of an entire population is not convenient or possible.

Q6. Give an example of Inferential Statistics

A. You asked five of your classmates about their height. On the basis of this information, you stated that the average height of all students in your university or college is 67 inches.

Q7. What is Descriptive Statistics?

A. Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can represent either the entire population or a sample of it. Descriptive statistics are broken down into measures of central tendency and measures of variability (spread).

Q8. What is the range of data?

A. It tells us how widely the data in a set is spread. In other words, it is defined as the difference between the highest and the lowest value present in the set.

X = [2 3 4 4 3 7 9]

Range(X) returns (9 - 2) = 7

Q9. Define Measurement.

A. Data can be classified as being on one of four scales:

nominal
ordinal
interval
ratio

Each level of measurement has some important properties that are useful to know. For example, only the ratio scale has meaningful zeros.

Q10. What is a Nominal Scale?

A. Nominal variables (also called categorical variables) can be placed into categories. They don’t have a numeric value and so cannot be added, subtracted, divided or multiplied. They also have no order; if they appear to have an order then you probably have ordinal variables instead

Q11. What is an Ordinal Scale?

A. The ordinal scale contains things that you can place in order. For example, hottest to coldest, lightest to heaviest, richest to poorest. Basically, if you can rank data by 1st, 2nd, 3rd place (and so on), then you have data that’s on an ordinal scale.

Q12. What is an Interval Scale?

A. An interval scale has ordered numbers with meaningful divisions. Temperature is on the interval scale: a difference of 10 degrees between 90 and 100 means the same as 10 degrees between 150 and 160. Compare that to high school ranking (which is ordinal), where the difference between 1st and 2nd might be .01 and between 10th and 11th .5. If you have meaningful divisions, you have something on the interval scale.

Q13. Explain Ratio Scale.

A. The ratio scale is exactly the same as the interval scale with one major difference: zero is meaningful. For example, a height of zero is meaningful (it means you don’t exist). Compare that to a temperature of zero degrees Celsius or Fahrenheit: it exists, but it does not mean “no temperature”; it is just another point on the scale.

Q14. What do you mean by Bayesian?

A. Bayesians condition on the data observed and consider the probability distribution on the hypothesis. Bayesian statistics provides us with mathematical tools to rationally update our subjective beliefs in light of new data or evidence.

Q15. What is Frequentist?

A. Frequentists condition on a hypothesis of choice and consider the probability distribution on the data, whether observed or not. Frequentist statistics uses rigid frameworks, the type of frameworks that you learn in basic statistics.

Q16. What is P-Value?

A. In statistical significance testing, it is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.
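
A small worked example may help here. Suppose we flip a coin 20 times and see 16 heads, and the null hypothesis is that the coin is fair. The sketch below computes an exact two-sided p-value, where "at least as extreme" is taken to mean every outcome that is no more probable under the null than the one observed. The scenario and the helper `binomial_two_sided_p` are our own illustration.

```python
from math import comb


def binomial_two_sided_p(heads, flips, p_null=0.5):
    """Exact two-sided binomial p-value under H0: P(heads) = p_null."""
    def pmf(k):
        # Probability of exactly k heads in `flips` tosses under H0
        return comb(flips, k) * p_null ** k * (1 - p_null) ** (flips - k)

    observed = pmf(heads)
    # Add up all outcomes that are at most as likely as the observed one
    # (the tiny tolerance guards against floating-point ties).
    return sum(pmf(k) for k in range(flips + 1)
               if pmf(k) <= observed * (1 + 1e-9))


p_value = binomial_two_sided_p(heads=16, flips=20)
print(round(p_value, 4))  # about 0.0118
```

Since roughly 0.012 is below the usual 0.05 threshold, we would reject the hypothesis that the coin is fair at the 5% level.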

Q17. What is a Confidence Interval?

A. A confidence interval, in statistics, is a range of values computed from sample data and constructed so that it would contain the true population parameter in a stated proportion of repeated samples (for example, 95%). Confidence intervals measure the degree of uncertainty or certainty in a sampling method.

Q18. Explain Hypothesis Testing.

A. Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population parameter. The methodology employed by the analyst depends on the nature of the data used and the reason for the analysis. Hypothesis testing is used to infer the result of a hypothesis performed on sample data from a larger population.

Q19. What is likelihood?

A. The probability of some observed outcomes given a set of parameter values is regarded as the likelihood of the set of parameter values given the observed outcomes.

Q20. What is sampling?

A. Sampling is that part of statistical practice concerned with the selection of an unbiased or random subset of individual observations within a population of individuals intended to yield some knowledge about the population of concern.

Q21. What are Sampling Methods?

A. There are 4 sampling methods:

Simple Random
Systematic
Cluster
Stratified
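
The four methods can be sketched with the standard random module. The function names, the population of 20 IDs and the groupings below are hypothetical.

```python
import random

def simple_random_sample(population, n, rng):
    """Every member has the same chance of being picked."""
    return rng.sample(population, n)

def systematic_sample(population, step, start=0):
    """Pick every step-th member, beginning at index start."""
    return population[start::step]

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep all of their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

def stratified_sample(strata, n_per_stratum, rng):
    """Pick a random sample from inside every stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

rng = random.Random(42)  # fixed seed so the run is repeatable
population = list(range(1, 21))
groups = {"north": population[:10], "south": population[10:]}

print(simple_random_sample(population, 5, rng))
print(systematic_sample(population, 4))
print(cluster_sample(groups, 1, rng))
print(stratified_sample(groups, 2, rng))
```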

Q22. What is Mode?

A. The mode of a data sample is the element that occurs most often in the data collection.

X=[1 2 4 4 4 4 5 5]

Mode(x) % returns 4
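
In Python, the same idea can be sketched with a Counter. The helper name mode is ours; the standard library also ships statistics.mode.

```python
from collections import Counter

def mode(values):
    """Return the most frequent value (the first one seen if there is a tie)."""
    return Counter(values).most_common(1)[0][0]

x = [1, 2, 4, 4, 4, 4, 5, 5]
print(mode(x))  # 4
```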

Q23. What is Median?

A. It is described as the numeric value that separates the lower half of a sample or a probability distribution from the upper half. It can be easily calculated by arranging all the samples from lowest to highest (or vice versa) and picking the middle one; with an even number of samples, take the average of the two middle values.

X=[2 4 1 3 4 4 3]

X=[1 2 3 3 4 4 4]

Median(x)% return 3
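
A hand-written version in Python, as a small sketch (interviewers often ask for it without built-in helpers):

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if even."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 4, 1, 3, 4, 4, 3]))  # 3
print(median([1, 2, 3, 4]))           # 2.5
```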

Q24. What is meant by Quartile?

A. It is a type of quantile that divides the data points into four more or less equal parts (quarters). Each quarter contains about 25% of the total observations. Generally, the data is arranged from smallest to largest.
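
The three cut points can be computed with statistics.quantiles from the standard library. The data below is made up, and the "inclusive" method is just one of several conventions for quartiles.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
print(q1, q2, q3)  # 3.0 5.0 7.0
```

The second quartile is the same as the median.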

Q25. What is Moment?

A. It is a quantitative measure of the shape of a set of points. It comprises a set of statistical parameters that describe a distribution. Four moments are commonly used:

Mean
Variance
Skewness
Kurtosis
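
All four can be computed from scratch, as in this sketch. It assumes population (not sample) formulas, with skewness and kurtosis taken as standardized third and fourth central moments; the helper names are ours.

```python
def central_moment(values, order):
    """Average of (value - mean) raised to the given power."""
    m = sum(values) / len(values)
    return sum((v - m) ** order for v in values) / len(values)

def four_moments(values):
    mean = sum(values) / len(values)
    variance = central_moment(values, 2)
    skewness = central_moment(values, 3) / variance ** 1.5
    kurtosis = central_moment(values, 4) / variance ** 2
    return mean, variance, skewness, kurtosis

mean, variance, skewness, kurtosis = four_moments([1, 2, 3, 4, 5])
print(mean, variance, skewness, round(kurtosis, 2))  # 3.0 2.0 0.0 1.7
```

The data is symmetric around 3, so the skewness comes out as zero.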

Q26. What is the Mean of data?

A. The statistical mean refers to the mean or average that is used to derive the central tendency of the data in question. It is determined by adding all the data points in a population and then dividing the total by the number of points.

X=[1 2 3 3 6]

Sum=1+2+3+3+6=15

Mean(x)%return (sum/5)=3

Q27. Define Skewness.

A. Skewness is a measure of the asymmetry of the data around the sample mean. If it is negative, the data are spread out more to the left side of the mean than to the right. The vice-versa also stands true.

Q28. What is Variance?

A. It describes how far the values lie from the mean. A small variance indicates that the data points tend to be very close to the mean, and to each other. A high variance indicates that the data points are very spread out from the mean, and from one another. Variance is the average of the squared distances from each point to the mean.

Q29. Define Standard Deviation.

A. In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.
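
Variance and standard deviation go together: the standard deviation is the square root of the variance. A quick check with the standard library, assuming the made-up values below form the whole population:

```python
from statistics import pstdev, pvariance

values = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5

variance = pvariance(values)  # average squared distance from the mean
std_dev = pstdev(values)      # square root of the variance

print(variance, std_dev)  # 4 2.0
```

For a sample rather than a full population, statistics.variance and statistics.stdev divide by n - 1 instead of n.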

Q30. What is Kurtosis?

A. Kurtosis is a measure of how outlier-prone a distribution is. In other words, kurtosis identifies whether the tails of a given distribution contain extreme values.

Q31. What is meant by Covariance?

A. Covariance measures the directional relationship between the returns on two assets. A positive covariance means that asset returns move together while a negative covariance means they move inversely. Covariance is calculated by analyzing at-return surprises (standard deviations from the expected return) or by multiplying the correlation between the two variables by the standard deviation of each variable. More generally, it measures how much two variables change together.
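
A minimal sketch of both calculations, assuming population formulas and made-up return series. It also shows that covariance equals correlation times the two standard deviations.

```python
def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

returns_a = [1, 2, 3, 4, 5]   # hypothetical returns of asset A
returns_b = [2, 4, 6, 8, 10]  # asset B always moves twice as much

cov = covariance(returns_a, returns_b)
corr = correlation(returns_a, returns_b)
std_a = covariance(returns_a, returns_a) ** 0.5
std_b = covariance(returns_b, returns_b) ** 0.5

print(cov)                              # 4.0
print(round(corr * std_a * std_b, 6))   # 4.0
```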

Q32. What is Alternative Hypothesis?

A. The Alternative hypothesis (denoted by H1 ) is the statement that must be true if the null hypothesis is false.

Q33. Explain Significance Level.

A. The probability of rejecting the null hypothesis when it is true is called the significance level α, and very common choices are α = 0.05 and α = 0.01.

Q34. Do you know what Binary search is?

A. For binary search, the array should be arranged in ascending or descending order. In each step, the algorithm compares the search key value with the key value of the middle element of the array. If the keys match, then a matching element has been found and its index, or position, is returned. Otherwise, if the search key is less than the middle element’s key, then the algorithm repeats its action on the sub-array to the left of the middle element or, if the search key is greater, on the sub-array to the right.
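
The steps above can be sketched as follows, assuming the list is sorted in ascending order and that returning -1 for a missing key is acceptable.

```python
def binary_search(sorted_values, target):
    """Return the index of target in an ascending list, or -1 if absent."""
    low, high = 0, len(sorted_values) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_values[mid] == target:
            return mid
        if target < sorted_values[mid]:
            high = mid - 1  # keep searching the left sub-array
        else:
            low = mid + 1   # keep searching the right sub-array
    return -1

values = [1, 3, 5, 7, 9, 11]
print(binary_search(values, 7))  # 3
print(binary_search(values, 4))  # -1
```

Each comparison halves the remaining search range, which is why the search takes O(log n) steps.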

Q35. Explain Hash Table.

A. A hash table is a data structure used to implement an associative array, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
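
Python's dict is already a hash table, but a toy version makes the bucket idea visible. The class below is only an illustration; it handles collisions by chaining and never resizes.

```python
class HashTable:
    """Toy hash table: an array of buckets, each a list of (key, value) pairs."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash function turns the key into an index into the bucket array.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default

table = HashTable()
table.put("apple", 3)
table.put("banana", 5)
table.put("apple", 4)
print(table.get("apple"), table.get("banana"), table.get("cherry"))  # 4 5 None
```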

Q36. What is Null Hypothesis?

A. The null hypothesis (denoted by H0) is a statement about the value of a population parameter (such as the mean), and it must contain the condition of equality and must be written with the symbol =, ≤, or ≥.

Q37. When You Are Creating A Statistical Model How Do You Prevent Over-fitting?

A. Over-fitting can be detected and limited with cross-validation.

Q38. What do you mean by Cross-validation?

A. Cross-validation is a model validation technique for assessing how the results of a statistical analysis (model) will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.
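
One common flavour is k-fold cross-validation: split the data into k parts, train on k - 1 of them and test on the one left out, then rotate. Here is a sketch of just the splitting step; the function name is ours, and real projects would normally shuffle the data first or use a library helper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every row is used for testing exactly once, so the averaged test score is a fairer estimate than a score measured on the training data.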

Q39. What is Linear regression?

A. A linear regression is a good tool for quick predictive analysis: for example, the price of a house depends on a myriad of factors, such as its size or its location. In order to see the relationship between these variables, we need to build a linear regression, which predicts the line of best fit between them and can help conclude whether or not these two factors have a positive or negative relationship.
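
For one predictor, the line of best fit can be computed by ordinary least squares in a few lines. The house sizes and prices below are invented so that the answer is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

sizes = [50, 80, 100, 120, 150]     # hypothetical house sizes
prices = [110, 170, 210, 250, 310]  # hypothetical prices (2 * size + 10)

slope, intercept = fit_line(sizes, prices)
print(slope, intercept)  # 2.0 10.0
```

A positive slope means price rises with size; a negative slope would mean the opposite.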

Q40. What are the assumptions required for linear regression?

A. There are four major assumptions:

There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data
The errors or residuals of the data are normally distributed and independent from each other
There is minimal multicollinearity between explanatory variables
Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variable.

Q41. What is Multiple Regression?

A. Multiple regression generally explains the relationship between multiple independent or predictor variables and one dependent or criterion variable. A dependent variable is modeled as a function of several independent variables with corresponding coefficients, along with the constant term. Multiple regression requires two or more predictor variables, and this is why it is called multiple regression.

Q42. What is a Statistical Interaction?

A. Basically, an interaction is when the effect of one factor (input variable) on the dependent variable (output variable) differs among levels of another factor.

Q43. What is an example of a data set with a non-Gaussian distribution?

A. The Gaussian distribution is part of the exponential family of distributions, but there are many more of them, often with the same sort of ease of use. If the person doing the machine learning has a solid grounding in statistics, these distributions can be used where appropriate.

Q44. Define Correlation.

A. Correlation is a statistical technique that can show whether and how strongly pairs of variables are related.

For example: height and weight are related; taller people tend to be heavier than shorter people. The relationship isn’t perfect. People of the same height vary in weight, and you can easily think of two people you know where the shorter one is heavier than the taller one. Nonetheless, the average weight of people 5’5” is less than the average weight of people 5’6”, and their average weight is less than that of people 5’7”, etc.

Correlation can tell you just how much of the variation in people’s weights is related to their heights.

Q45. What is the primary goal of A/B Testing?

A. A/B testing is a statistical hypothesis test with two variants, A and B. The primary goal of A/B testing is to identify which changes to a web page maximize or increase the outcome of interest. A/B testing is a fantastic method for finding the most suitable online promotional and marketing strategies for the business.

Q46. What is meaning of Statistical Power of Sensitivity?

A. The statistical power of sensitivity refers to validating the accuracy of a classifier, which can be Logistic Regression, SVM, Random Forest, etc. Sensitivity is basically Correctly Predicted True Events / Total Actual True Events.
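
As a small sketch, sensitivity can be computed from actual and predicted labels like this. The labels are made up, with 1 standing for a true event.

```python
def sensitivity(actual, predicted):
    """True positives divided by all actual positives (also called recall)."""
    true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    actual_positives = sum(1 for a in actual if a == 1)
    return true_positives / actual_positives

actual = [1, 1, 1, 1, 0, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 1, 0, 1]

print(sensitivity(actual, predicted))  # 0.8
```

The classifier caught 4 of the 5 real events, so its sensitivity is 0.8.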

Q47. Explain Over-fitting.

A. In the case of over-fitting, the model is highly complex, for example having too many parameters relative to the number of observations. The overfit model has poor predictive performance, and it overreacts to minor fluctuations in the training data.

Q48. Explain Under-fitting

A. In the case of under-fitting, the underlying trend of the data cannot be captured by the statistical model or the machine learning algorithm. Such a model also has poor predictive performance.

Q49. What is Long Format Data?

A. In the long format, each row is one time point per subject, so every subject has data in multiple rows. Data in the wide format, by contrast, can be recognized by the fact that the columns generally represent the groups.

Q50. What is Wide Format Data?

A. In the wide format, the repeated responses of the subject will fall in a single row, and each response will go in a separate column.
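
The two layouts hold the same information, which a small sketch can show. Plain tuples and dictionaries are used here; the subjects, time points and values are invented, and in practice a library such as pandas is normally used for this reshaping.

```python
def long_to_wide(rows):
    """(subject, time_point, value) rows -> {subject: {time_point: value}}."""
    wide = {}
    for subject, time_point, value in rows:
        wide.setdefault(subject, {})[time_point] = value
    return wide

def wide_to_long(wide):
    """Inverse of long_to_wide: one output row per subject and time point."""
    return [(subject, time_point, value)
            for subject, responses in wide.items()
            for time_point, value in responses.items()]

long_rows = [("A", "t1", 10), ("A", "t2", 12), ("B", "t1", 9), ("B", "t2", 11)]

wide = long_to_wide(long_rows)
print(wide)  # {'A': {'t1': 10, 't2': 12}, 'B': {'t1': 9, 't2': 11}}
print(wide_to_long(wide) == long_rows)  # True
```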

Complete path to master SQL before interview

We have interviewed a lot of candidates and found that SQL is still rarely explored in depth by people who want to get deep into this domain.

Remember – Data Science is not all about SQL, but it is the bread and butter of most jobs irrespective of your profile.

This is a small post on ways to master your SQL skills.

I am assuming that you are a complete noob in SQL; skip ahead according to your expertise.

1. Start with either W3Schools or Tutorialspoint.
It should not take more than 8-10 hours for you to complete the tutorial (irrespective of your Engineering branch/current domain).

2. Go for SQLZoo. Solve all the questions.
If you get stuck, then try this link, which has all the solved questions. It should take you around 15 hours.

3. Once this is done, create an account on HackerRank and try their SQL course.
Try all the easy questions first and then slowly move to the medium level questions. It should not take you more than 20 hours. Earn at least 4 stars before moving ahead, and do follow the discussion panel.

If you are good with the above 3, then do try our four pages (this is not self-promotion; we have hand-picked some of the important questions which you should definitely solve before your interview).

https://thedatamonk.com/day-3-basic-queries-to-get-you-started/
https://thedatamonk.com/day-4-sql-intermediate-questions/
https://thedatamonk.com/day-5-sql-advance-concepts/
https://thedatamonk.com/day-6-less-asked-sql-questions/

You are now interview ready. Send me a mail at nitinkamal132@gmail.com or contact@thedatamonk.com to get a free copy of our ebook, or purchase it on Amazon.

https://www.amazon.in/Questions-Crack-Business-Analyst-Interview-ebook/dp/B01K16FLC4
https://www.amazon.in/Write-better-queries-interview-Questions-ebook/dp/B076BXFGW1

You are not done yet; complete the HackerRank Hard questions as well.

This should give you the knowledge you need to crack any Data Science SQL interview round.

For Python, R, and statistics we will have separate posts.

Keep Learning 🙂
The Data Monk

Supply Chain Analytics

We all have a fair idea about the supply chain. In layman’s terms, supply chain analytics helps improve operational efficiency and effectiveness by enabling “data-driven” decisions at the operational and strategic levels.

There are four types of analytics which one can perform to boost the performance of a supply-chain enabled business. Consider this story first:

Kartik had a mild migraine a few days ago; he ignored it and continued with his daily routine. After a few days, he found that the pain was getting worse with time. He consulted Dr. Nikhil, who first asked for his medical history/reports, i.e. weight, sugar level, blood pressure, etc.
Then he looked into the reports and tried to diagnose the reason behind this abrupt pain.
The ache was there all the time, which made the doctor believe that it was bound to happen again in the future, so after looking at all the major points, Nikhil prescribed some medicine to Kartik.

What is what?

Reports ~ KPIs of the industry and business i.e. Descriptive Analytics
Diagnosis ~ Looking for reasons for the numbers present in the report i.e. Diagnostic analytics
Prediction of future pain to Kartik ~ Predictive analytics
Prescribing medicine ~ Looking at all the past behavior and KPIs, we do Prescriptive analytics

1. Descriptive analytics – You have some historic data and you need to find the performance of KPIs; this type of analysis is called descriptive analysis. Descriptive analysis helps us answer questions like how many products were sold, how products performed in different stores, how stores performed in terms of revenue, etc.

Basically, it gives you a gist of the current state of a company.

2. Diagnostic analytics – Descriptive analysis tells you about the KPIs of the company, whereas diagnostic analytics tells you about the underlying issues. If the descriptive analysis tells you that Product A is not performing well in the Whitefield store of Walmart, then the diagnostic analysis will aim at finding the underlying reasons for it.

3. Predictive analytics –

“Do you understand the difference between Forecasting and prediction?”
Forecasting is the use of historic data which holds some pattern to give a number for the future, i.e. you are basically extrapolating the past pattern to get the numbers for the future. Prediction, on the other hand, is a vaguer term that also takes possible future changes into account.

When I go through the last 40 months of data to estimate the number of joints rolled by Aman in the next month, then this is a case of forecasting. But if I read the palm of Ramesh and tell him his future by considering the present and future behavior of the stars, then it’s a prediction.

Predictive analytics uses statistical techniques to estimate the likelihood of future events such as stock-outs or movements in your product’s demand curve. It provides the foresight for focused decision making that avoids likely problems in the future.

4. Prescriptive Analytics – Now it is time to take an overview of all the analytics components and provide solutions which can improve the performance of the business. This is done by prescriptive analytics.

Descriptive talks about the KPIs, diagnostic tries to find out the reason behind these numbers, predictive estimates the future performance of the business by looking at historic and expected future actions, and prescriptive provides the final prescriptions!

Components of Supply Chain Analytics:-

Overall, supply chain analytics can be divided into 5 parts:-

1. Inventory Management – This part looks after the “store-house” of a company. The major parts of analytics here are

a. Inventory Cost Optimization
b. Optimal Stocking
c. Stock-out Prediction

2. Sourcing – How to fulfill the demand

a. Optimized Order Allocation
b. Arrival time optimization
c. Sourcing cost analysis

3. Vendor Management – How to optimize vendors for your company

a. Fault Rate Analysis
b. Profitability Analysis
c. Vendor Scorecard

4. Returns Management – What to do if a product is returned?

a. Returns tracking
b. Salvaging optimization
c. Cost Recovery Analysis

5. Network Planning – How to optimize the transport network to maximize profit?

a. Trailer Utilization
b. Freight Cost Optimization
c. Vehicle Routing

What are the five stages of Supply Chain?

You can divide the whole supply chain process into 5 stages:
a. Plan – Every company needs a strategy for managing its resources in order to meet customer demand for its products and services.
b. Source – To create their products, companies need to choose carefully the suppliers that deliver the goods and services they need.
c. Make – In manufacturing, the supply chain manager schedules the activities needed for production, packaging, testing and preparation for delivery.
d. Deliver – This part is mainly referred to as logistics in supply chain management. Here companies coordinate the receipt of orders, pick carriers to get products to customers and develop a network of warehouses.
e. Return – In many companies this is where the problems in the supply chain usually are. The planners should create a flexible and responsive network for receiving defective and excess products sent back by customers.

Common Terminologies in Supply Chain

1. Back Ordering – This happens when you don’t have a product in your inventory and the product has already been ordered by a customer. In this case you pass the order on to a supplier. This is called Back-Ordering.

2. Blanket Order – It is a large purchase order registered by the end user which the supplier has to fulfil over a period of time, with no fixed delivery dates. It’s just like saying “I need 5000 Light candles before October 31st”. This ensures a large order, aiming for a good discount before a festive or high-demand season.

3. Consignment – This term has more than one meaning. Most often it means the act of placing your goods in the care of a third-party warehouse owner (known as the consignee) who maintains them for a fee. In addition to storing the goods, the consignee may sell or ship them to customers on your behalf for a fee. As a separate meaning, consignments can also refer to individual shipments made to the consignee.

4. Drop Shipment – You create a website and list a few things which are available in a nearby store. As soon as an order is placed on your website, you pass the order to the nearby mart, which delivers it to the customer’s place. Your profit is the difference between the price paid by the customer and the mart’s delivery plus product cost. Here you do not need an inventory; in fact, you do not need any storehouse or capital investment to start an e-commerce business.

5. Groupage – This is a method of grouping multiple shipments from different sellers (each with its own bill of lading) inside a single container. This is done when individual shipments are less than the container load or in other words are not big enough by themselves to fill up an entire container. This way, the freight cost is split between these sellers.

6. JIT – Just-in-time is an inventory optimization method where every batch of items arrives ”just in time” to fulfil the needs of the next stage, which could be either a shipment or a production cycle.

7. Landed Cost – The total cost of ownership of an item. This includes the cost price, shipping charges, custom duties, taxes and any other charges that were borne by the buyer.

8. Waybill – A document prepared by the seller, on behalf of the carrier, that specifies the shipment’s point of origin, the details of the transacting parties (the buyer and seller), the route, and the destination address.

You can look for more definitions and KPIs related to Supply chain. But, this is a decent way to start the exploration.

We will deal with implementing a simple Supply Chain problem using PuLP in Python in our next article.

Keep Learning 🙂

The Data Monk
