Statistics Interview Questions

Today we will cover statistics interview questions. A Data Science interview will definitely have a good round of statistics. The interviewer will rarely ask you a very mathematical question, but the statistics round will have lots of small yet tricky questions.

Before we get started, do create your profile on the website. The ‘Login’ area is at your extreme left 🙂

You can get the previous days' questions here:

Day 1 – https://thedatamonk.com/data-science-interview-question-day1/
Day 2 – https://thedatamonk.com/sql-interview-questions/
Day 3 – https://thedatamonk.com/joins-in-sql/

Day 4 questions


Performance of Linear Regression Model – https://thedatamonk.com/question/what-are-the-metrics-to-measure-the-performance-of-your-linear-regression-model-most-asked-question/
R Squared error – https://thedatamonk.com/question/define-r-squared-error-in-simple-terms/
Multi-Co-linearity – https://thedatamonk.com/question/what-is-the-need-to-remove-multicolinearity/
Loss Function – https://thedatamonk.com/question/define-loss-function-in-the-simplest-way-possible/
Best Fit Line for Linear Regression – https://thedatamonk.com/question/what-is-the-best-fit-line-in-linear-regression/
Ridge Regression vs Linear Regression – https://thedatamonk.com/question/what-makes-ridge-regression-different-from-linear-regression/
ROC curve – https://thedatamonk.com/question/define-roc-in-layman-terms/
Sample and Population Variance – https://thedatamonk.com/question/difference-in-formula-between-sample-and-population-variance/
Different Statistical tests – https://thedatamonk.com/question/what-us-the-difference-between-chi-square-z-test-and-t-test/
Test used for less than 30 sampling unit – https://thedatamonk.com/question/which-test-to-use-when-you-have-less-than-30-sampling-units-amazon-online-interview-question/

These are definitely some of the most asked questions in a statistics interview.

The most important topics to work on in Statistics are:
– What is the meaning of p-value
– What is the difference between R squared and Adjusted R-squared error
– Different types of tests in statistics
– Knowing the approach to calculate the tests in MS Excel
– ROC-AUC curve
– Assumptions of Linear Regression
– Line of Fit
– How to measure the effectiveness of your model

We will cover all of the above plus many more interview questions to make sure your interviewer is also using our website to ask questions

The Data Monk Interview Books – Don't Miss

Our books are also available on our website, where you can directly download the PDF of the topic you are interested in. On Amazon each book costs ~299; on our website we have put them at a 60-80% discount. There are ~4000 solved interview questions prepared for you.

10 e-book bundle with 1400 interview questions spread across SQL, Python, Statistics, Case Studies, and Machine Learning Algorithms – Ideal for 0-3 years experienced candidates

23 e-book bundle with ~2000 interview questions spread across AWS, SQL, Python, 10+ ML algorithms, MS Excel, and Case Studies – Complete package for someone between 0 to 8 years of experience (the above 10 e-book bundle has a completely different set of e-books)

12 E-books for 12 Machine Learning algorithms with 1000+ interview questions – For those candidates who want to include any Machine Learning Algorithm in their resume and to learn/revise the important concepts. These 12 e-books are a part of the 23 e-book package

Individual 50+ e-books on separate topics

Important Resources to crack interviews (Mostly Free)

There are a few things which might be very useful for your preparation

The Data Monk Youtube channel – Here you will get only those videos that are asked in interviews for Data Analysts, Data Scientists, Machine Learning Engineers, Business Intelligence Engineers, Analytics managers, etc.
Go through the watchlist which makes you uncomfortable:-

All the list of 200 videos
Complete Python Playlist for Data Science
Company-wise Data Science Interview Questions – Must Watch
All important Machine Learning Algorithm with code in Python
Complete Python Numpy Playlist
Complete Python Pandas Playlist
SQL Complete Playlist
Case Study and Guesstimates Complete Playlist
Complete Playlist of Statistics

Statistics Interview Questions for Data Science

Here we have a set of 17 statistics interview questions that you should understand before your data science interviews. These are very basic Statistics questions which will check your elementary knowledge


15+ correct = Very strong fundamentals
10-15 correct = At par with the concepts, you should try to complete one basic book on statistics
<10 correct = Go through at least one book and cover 80-100 MCQs

Let’s get to the statistics interview questions

1. The mean of a distribution is 20 and the standard deviation is 5. What is the value of the coefficient of variation?

A. Coefficient of Variation = (Standard Deviation/Mean)*100 = (5/20)*100 = 25%

2. When the mean is less than mode and median, then what type of distribution is it?

A. Negatively Skewed

3. Which of the following describe the middle part of a group of numbers?
a. Measure of Variability
b. Measure of Central Tendency
c. Measure of Association
d. Measure of Shape

A.
Measure of Central Tendency

4. According to the empirical rule, approximately what percent of the data should lie within μ±2σ?

A. 95% of the data should lie between μ±2σ

5. The sum of the deviations about the mean is always:
a. Range
b. Zero
c. Total Deviation
d. Positive

A.
Zero

6. The middle value of an ordered array of numbers is the

a. Mode
b. Mean
c. Median
d. Standard Deviation

A. Median (this one you can surely do yourself 🙂)

7. Height of employees is a :-
a. Continuous value
b. Qualitative value
c. Discrete value
d. None of these

A. Continuous value


8. Which of these is a measure of dispersion:-
a. Mean
b. Median
c. Quartile
d. Standard Deviation


A. Standard deviation is a measure of dispersion

9. The variance of a dataset is 144, what is the Standard Deviation?

A. Standard deviation is square root of Variance, so the Standard deviation will be 12


10. Which of these is a qualitative data:-
a. Weight of family members
b. Salary
c. Feedback of 100 customers about your website
d. Number of burgers sold in India


A. Feedback of 100 customers about your website; the rest are all quantitative (numeric) values

11. Which of these is/are measure of central tendency?
a. Median
b. Mean
c. Mode
d. Mid range
e. Mid hinge


A. All of these are measures of central tendency

12. What divides a data set into 10 equal parts?
a. Deciles
b. Percentile
c. Quartile
d. Standard Deviation

A. Deciles divide the complete dataset into 10 equal parts

13. What is Mid-range?

A.
The arithmetic mean of the maximum and minimum values of a dataset is called mid-range

14. What is Mid hinge?
A.
The arithmetic mean of the first and third quartiles is called the mid-hinge

15. What is Inter Quartile Range?
a. 0-50th percentile
b. 25-50th percentile
c. 25-75th percentile
d. 50-100th percentile

A.
The range from the 25th to the 75th percentile is called the IQR, i.e. the Inter Quartile Range

16. What is a cap in a box-plot?
A.
The upper cap contains the values which fall between the 75th percentile and the 75th percentile + 1.5*IQR. Similarly, the lower cap contains the values which fall between the 25th percentile and the 25th percentile - 1.5*IQR

17. What values are termed as an outlier in a box plot?
A.
Any value which is above the upper cap or below the lower cap falls under the definition of an outlier
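For reference, here is a minimal R sketch (on a made-up vector) of how the caps and outliers of a box plot are computed:

x <- c(2, 4, 5, 7, 8, 9, 11, 12, 40)    # made-up data

q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- q3 - q1                          # same as IQR(x)

upper_cap <- q3 + 1.5 * iqr
lower_cap <- q1 - 1.5 * iqr

x[x > upper_cap | x < lower_cap]        # 40 is flagged as an outlier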

You can also check out DataCamp for Statistics courses

If you are wondering how to study and be interview ready for SQL, Python/R, Statistics, Machine Learning, Case Study and Guesstimates, then you can have a look at our 7 min-read article – How to make a career in Data Science


What is Stationarity in Time Series?

Stationarity in Time Series

The first step in any time series analysis is to make the data set stationary. Stationarity means a (near) constant mean and variance across time.

The red line above shows an increasing trend and the blue line is the result of de-trending the series. De-trending means fitting a regression line and then subtracting it from the original data.
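A tiny sketch of de-trending in R (the series here is made up, and the trend is removed with a linear fit on a time index):

t <- 1:100
y <- 0.5 * t + rnorm(100)        # series with an increasing trend
fit <- lm(y ~ t)                 # fit a regression line on time
detrended <- y - fitted(fit)     # subtract the fitted trend from the original data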

Stationarity does not mean that the series does not change over time; it just means that the way it changes does not itself change over time.

The reason why we need stationary data is simple – it's easier to analyze and predict a data set that is stationary. If a series is consistently increasing over time (like the one above), then the sample mean and variance will grow with the size of the sample, and your model or the proposed time series solution will keep underestimating the mean and variance in future periods.

How do you check the stationarity of a series?
In general, we use the Augmented Dickey-Fuller (ADF) test or the KPSS test to check the stationarity of a series. Here we will discuss only the ADF test; KPSS some other time.

ADF is a statistical significance test (a test which involves a null and an alternate hypothesis) and it falls under the category of 'unit root tests'. Now, what is a unit root test?

Yt is the value of the time series at time ‘t’ and Xe is an exogenous variable (a separate explanatory variable, which is also a time series).
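One common way to write a simple unit root test regression (an illustrative form, reusing the Yt and Xe names from above; the exact equation the original post refers to may differ) is:

$$Y_t = \rho\, Y_{t-1} + \beta\, X_{e,t} + \varepsilon_t$$

The test then checks whether the root $\rho$ equals 1 (a unit root, i.e. non-stationary) or is less than 1.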

The presence of a unit root means the time series is non-stationary. Besides, the number of unit roots contained in the series corresponds to the number of differencing operations required to make the series stationary.

A time series is a process that can be written in terms of its components, which contain 'roots'. For example:

v(t) = c + a1*v(t−1) + ε(t)

The coefficient a1 is a root. You can interpret this process/formula as ‘the value of today depends on the value of yesterday and some randomness we can’t predict’. We expect this process to always converge back to the value of c.

Try this out with an example:
suppose c = 0 and a1 = 0.5.

If yesterday's value v(t−1) was 100, then we expect that today's value will be around 50. Tomorrow, we expect the value to be 25, and so on.

You see that this series will 'come home', in this case meaning it will converge back to the value of c.

When one of the roots is a unit, i.e. equal to 1 (in this example when a1 = 1), then the series will not recover to its origin. You can see this by using the example given above.
That is why the concepts of unit roots and unit root tests are useful: they give us insight into whether the time series will recover to its expected value. If this is not the case, then the process will be very susceptible to shocks and hard to predict and control.

What is the significance of the p-value in the ADF test?
The null hypothesis of the ADF test is that the series has a unit root (i.e. is non-stationary). A high p-value, say 0.87, means we cannot reject that null hypothesis, so we treat the series as non-stationary.
We difference the dataset (once or more) to make it stationary

adf.test(diff(time_series))

In the above snippet, we are differencing the time series once and then testing the stationarity using the adf.test function (from the tseries package) in R.
You can also try a double difference, or a difference after taking the log, to check the stationarity (if the noise is high)

adf.test(diff(log(time_series)))

A rule of thumb – don't over-difference, i.e. don't apply 6-7 rounds of differencing just to push the p-value down and call the dataset stationary.

In the case of a first difference, we are literally getting the difference between a value and the one for the time period immediately before it. If you need a high number of differencing operations, it usually means your data has too much noise to follow a time series pattern.

Differencing can help stabilize the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality.
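To put the pieces together, here is a minimal runnable sketch (it assumes the tseries package, and the series is a simulated random walk rather than real data):

library(tseries)                       # provides adf.test()

set.seed(42)
random_walk <- ts(cumsum(rnorm(200)))  # non-stationary by construction

adf.test(random_walk)                  # high p-value -> cannot reject the unit root
adf.test(diff(random_walk))            # first difference -> low p-value -> stationary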

Bottom line :-
- The value of today depends on the value of yesterday and some randomness we can't predict
- Stationarity is useful to identify the pattern in order to predict values
- You apply a difference of order one, two, three, etc. to make the dataset stationary
- Do an ADF or a KPSS test to check if the series is stationary
- After that, chill 🙂

The Data Monk e-books

Tired of online courses costing 2 to 8 lakh and taking more than a year to complete?
Tired of going through 500+ hours of videos at a super slow pace?
We at The Data Monk believe that you have to start and complete things as quickly as possible. We believe in a target-based study where we break a topic into 100 questions and make sure that if you cover these questions you will surely be able to crack the interview. The rest of the theory and practice can ONLY be learned while working in an organization.


Pick any of our books from our e-shop page and complete it in 6-8 hours, learn the 100 questions and write it in your resume. We guarantee you that you will nail 8 out of 10 interviews

We also have 3 bundles at a price that is affordable to everyone. We are a group of people placed in the best of the product-based companies and we take 100+ interviews per week, so we know what is being asked and what is not. Just grab any of the following book bundles and give not more than 30 days to LEARN all the questions. We guarantee you that you will become a very strong candidate in any analytics interview.

Set A – [3rd/4th year/ and 0 to 3 years of experience]

Crack any analytics or data science interview with our 1400+ interview questions which focus on multiple domains i.e. SQL, R, Python, Machine Learning, Statistics, and Visualization. – https://thedatamonk.com/product/books-to-crack-analytics-interview/

Set B – [0-5 Years of Experience]

1200+ Interview Questions on all the important Machine Learning algorithms (including complete Python code) Ada Boost, CNN, ANN, Forecasting (ARIMA, SARIMA, ARIMAX), Clustering, LSTM, SVM, Linear Regression, Logistic Regression, Sentiment Analysis, NLP, K-Mean – https://thedatamonk.com/product/machine-learning-interview-questions/

Set C – [0-7 Years of Experience]

2000+ interview questions that include 100 questions each on the 12 most asked Machine Learning Algorithms, Python, Numpy and Pandas – 300 Interview Questions, PCA, AWS, Data Preprocessing, Case Studies, and many more
https://thedatamonk.com/product/the-data-monk-e-book-bundle/

Note – Set C contains all the questions of Set B


Youtube Channel – The Data Monk

Unlike any other youtube channel, we do not teach basic stuff; we teach only topics that are asked in interviews. If the interviewer asks about p-value, we will have a video on that topic.
If the interviewer is interested in the sequence of execution of SQL commands, then we will give you an overview of all the commands but stress so much on that question that you can answer it comfortably in the interview. We definitely recommend you to follow our youtube channel for any topic that you are interested in or weak at.

If you wish to get all the study material and topics to cover for an interview in one place, then you can subscribe to our channel; we have covered the complete syllabus there.
Get all the youtube video playlists on our youtube channel – The Data Monk

Code in Python for Data Science – Understand one algorithm at a time in 30 minutes (theory and python code)
Company-wise Data Science Interview Questions – 15 videos on how to crack analytics interviews
Complete Numpy Tutorial – 14 videos on all the functions and questions on Numpy
Complete Python Pandas Tutorial – 15 videos to completely cover Pandas
SQL Complete Playlist – 20 highly recommended videos to cover all the interview questions
Case Study and Guesstimates Complete Playlist – Real-life interview case studies asked in 2021
Statistics – 10 videos to completely cover Statistics for interviews


Lastly,
If you are in dire need of any help, be it book-wise or guidance-wise, then you can definitely connect with me on Linkedin. We will try to help as much as possible

Missing Value Treatment – Mean, Median, Mode, KNN Imputation, and Prediction

Missing Value treatment is no doubt one of the most important parts of the whole process of building a model. Why?
Because we can’t afford to eliminate rows wherever there is a missing value in any of the columns. We need to tackle it in the best possible way. There are multiple ways to deal with missing values, and these are my top four methods:-

1. Mean – When do you take an average of a column? There is a saying that goes, "When a billionaire walks into a small bar, everyone becomes a millionaire."
So, avoid using the mean as a missing value treatment technique when the range is too high. Suppose there are 10,000 employees with a salary of Rs. 40,000 each and 100 employees with a salary of Rs. 1,00,000 each. In this case you can consider using the mean for missing value treatment.

But if there are 10 employees, with 8 of them earning Rs. 40,000 and one earning Rs. 10,00,000, then you should avoid using the mean for missing value treatment. You can use the mode instead !!

2. Median – The median is the middle term when you write the terms in ascending or descending order. Can you think of an example where you could use this? The answer is at the bottom of the article.

3. Mode – The mode is the most frequently occurring value. As we discussed in point one, we can use the mode where there is a high chance of repetition. A short sketch of mean, median, and mode imputation follows below.
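Here is a minimal R sketch of the three techniques (the data frame df and its columns salary and city are made up for illustration):

# Hypothetical data with missing values
df <- data.frame(salary = c(40000, 42000, NA, 41000, 1000000),
                 city   = c("Pune", "Pune", "Delhi", NA, "Pune"),
                 stringsAsFactors = FALSE)

# Mean imputation (sensitive to the 10-lakh outlier)
df$salary_mean <- ifelse(is.na(df$salary), mean(df$salary, na.rm = TRUE), df$salary)

# Median imputation (more robust when the range is high)
df$salary_median <- ifelse(is.na(df$salary), median(df$salary, na.rm = TRUE), df$salary)

# Mode imputation for the categorical column
mode_city <- names(which.max(table(df$city)))
df$city[is.na(df$city)] <- mode_city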

4. KNN Imputation – This is often the best way to fill a missing value; here the k most similar neighbours are searched for. The similarity of two records is determined using a distance function.

In one of the Hackathons, I had to impute (treat) the missing values of age, so I tried the following way out (in R):

library(DMwR)   # knnImputation() comes from the DMwR package
new_dataset <- knnImputation(data = df, k = 8)

k-nearest neighbours can impute both qualitative & quantitative attributes, but it consumes a lot of time and processing power.

install.packages("imputeTS")
library(imputeTS)
x <- ts(c(12, 23, 41, 52, NA, 71, 83, 97, 108))

na.interpolation(x)                      # linear interpolation of the missing value

na.interpolation(x, option = "spline")   # spline interpolation

na.interpolation(x, option = "stine")    # Stineman interpolation



5. Bonus type – Prediction
This is another way of fixing missing values. You can try linear regression, time series analysis, or any other method to fill in the missing values using prediction.

Median – You can use median where there is low variance in age


Came across KNN Imputation, so thought of sharing the same !!

Keep Learning 🙂
The Data Monk

Story of Bias, Variance, Bias-Variance Trade-Off

Why do we predict?
We predict in order to identify the trend of the future by using our sample data set. Whenever we create a model, we try to create a formula out of our sample data set. And the aim of this formula is to satisfy all the possible conditions of the universe.
Mathematicians and Statisticians all across the globe try to create a perfect model that can answer future questions.

Thus we create a model, and this model is bound to have some error. Why? Because we can't cover all the possible combinations in one formula. The error, or the difference between the actual and predicted value, is called the prediction error.

Bias – It is the difference between the average prediction of the model and the actual values. A model with HIGH bias will be a very simple model and it will be far away from the actual values in both the train and test data sets.

Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbors and SVM.

Examples of high-bias machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression

Variance – Variance refers to the spread of the model's predictions. A model with high variance will be so specific to its training dataset that it tries to cover every point while training, which results in high training accuracy but low test accuracy.

Examples of low-variance machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression.

Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines.

A simple model with high bias vs a complicated model with high variance

As you can see, the high-variance line tries to cover all the points, so it creates a complicated model which is very accurate on the training data set.

Let's see what an under-fitting model, an over-fitting model, and a good model look like.

As you can see, high variance occurs in a model that tries to create a complicated formula on the training data set.
A high bias model is very generic. In other words, it just produces some rough average.

If you want to understand the mathematics behind these errors, then below is the formula
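In its standard form, the decomposition of the expected prediction error (the one the next paragraph describes term by term) is:

$$\mathbb{E}\big[(y-\hat f(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat f(x)]-f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat f(x)-\mathbb{E}[\hat f(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}$$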

The above formula has 3 terms: the first term is the bias squared, the second is the variance, and the third is the irreducible error.
No matter what, you can't remove the irreducible error. It is the measure of noise in the data, and you can't have a noiseless data set.

When you have a very limited dataset, there is a high chance of getting an under-fitting model (high bias and low variance).
When you have very noisy data, the model tries to fit a complicated formula, which might result in over-fitting on the training dataset (high variance and low bias).

What is the bias-variance trade-off?
The trade-off between bias and variance is done to minimize the overall error (formula above)

Error = Reducible Error + Irreducible Error
Reducible Error = (Bias)^2 + Variance

Estimated Mean Square Error (pic. from Quora)

Let's try to simplify the formulas for bias and variance:
Bias = (average estimate of the target) - (true target)
Variance of estimates = average of (estimate - average estimate)^2
The variance error measures how much our estimate of the target function would differ if a different training data set was used.

To keep all the errors positive, the formula uses the bias squared, the variance (which is itself a squared quantity), and the squared irreducible error term.

The bias–variance tradeoff is the property of a set of predictive models whereby models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa.

How do we actually make the bias-variance trade-off?
There are multiple methods for the B-V trade-off:
-Separate training and testing dataset
-Cross-Validation
-Good Performance metrics
-Fitting model parameters

Keep Learning 🙂
The Data Monk

Multicollinearity in Simple Terms

We all know the definition of multicollinearity, i.e. when 2 or more explanatory variables in a multiple regression model are highly linearly related, it's called multicollinearity.

Example –
Age and Selling price of a Car
Education and Annual Income
Height and Weight

Why should we remove multicollinearity from our model?
For example, you are watching WWE and Batista is thrashing the Undertaker. Now you know that Batista is better.
But suppose it's a Royal Rumble where 5 wrestlers are beating the Undertaker simultaneously. Now you can't say which one is attacking with what intensity, and thus you can't say which wrestler among the five is better.

Thus, when you have multiple variables which are correlated, the model is unable to give proper weightage to the impact of each variable. So, we need to remove the redundant variables.

What all methods are used to remove multi-collinearity?
There are two methods to do the same:-
1. VIF – It stands for Variance Inflation Factor. During regression analysis, VIF assesses whether factors are correlated to each other (multicollinearity), which can distort p-values and make the model less reliable.

Factors with a high VIF should be removed. A VIF of 1 suggests no correlation. A short sketch of both methods in R follows after the next point.

2. PCA – Principal component analysis (PCA) is a technique used to emphasise variation and bring out strong patterns in a dataset. It’s often used to make data easy to explore and visualise.
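A minimal R sketch of both checks (the variables are made up; vif() comes from the car package, prcomp() is base R):

library(car)                             # provides vif()

set.seed(1)
age    <- rnorm(100, 40, 10)
years  <- age - 22 + rnorm(100, 0, 1)    # almost perfectly correlated with age
bonus  <- rnorm(100, 5, 2)
salary <- 1000 * years + 500 * bonus + rnorm(100, 0, 500)

model <- lm(salary ~ age + years + bonus)
vif(model)                               # age and years show a very high VIF; bonus stays near 1

# PCA on the numeric predictors: combines the correlated columns into components
pca <- prcomp(cbind(age, years, bonus), scale. = TRUE)
summary(pca)                             # proportion of variance explained by each component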

How do we deal with multicollinearity in our model?
1. You can use feature engineering to convert the two variables into one and then use this variable

2. Use VIF/PCA to eliminate one of the variables
You should eliminate the one which is less strongly correlated with the target variable.

That's pretty much it about multicollinearity. Let me know if you have any questions.

Keep Learning 🙂
The Data Monk

Cross Validation and varImp in R

I was on to our next book – Linear, Ridge, LASSO, and Elastic Net algorithms explained in layman terms with code in R – when we thought of covering some simple concepts which are quite helpful while creating models.

Cross validation is one simple concept which definitely improves the performance of your model. A lot of you must already be using it to create a k-fold cross validation.

Let’s quickly go through this relatively simple concept and there is no better way than starting with code

cv <- trainControl(method = "repeatedcv",
                   number = 10,
                   repeats = 5,
                   verboseIter = T)

Here we are creating a variable which holds the resampling setup, i.e. whenever this variable 'cv' is passed to a model, it will divide the dataset into 10 equal parts and train the model on 9 parts while testing on the remaining one, i.e. train on 9 folds and test on the held-out fold.

repeats = 5 means the above process will repeat 5 times, i.e. this 9-1 train-and-test split is done 5 times.

What would you do with all this repeated training?
We compute the Root Mean Square Error, R squared, and Mean Absolute Error for each candidate, and then decide on the best model. A short sketch of comparing models this way is below.
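For example, once you have trained more than one caret model with the same cv object (like the linear and ridge models built elsewhere in this post), you can compare them like this (a sketch, not output from real data):

results <- resamples(list(Linear = linear, Ridge = ridge))
summary(results)   # compares RMSE, R squared, and MAE across the resamples
bwplot(results)    # visual comparison of the same metrics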

And this is how we use it in a Ridge model

ridge <- train(medv~.,
              BD,
              method = 'glmnet',
              tuneGrid=expand.grid(alpha=0,lambda=seq(0.0001,1,length=10)),
              trControl=cv
              )

So, here we are creating a Ridge regression model, predicting the value of medv on the dataset BD. The package/function is glmnet, the tuning grid tells the model that it's a ridge model (alpha = 0), and it tries a total of 10 lambda values equally spaced between 0.0001 and 1.

After all this, we tell the model to use the cross validation setup with the trControl parameter.

The next function which I love while creating models is varImp. This is a simple function which finds the most important variables in a set of variables. It's part of the caret package.

varImp(Lasso, scale = F)

Here we have at least 3 and at most 4 important variables to consider in the model. You can also plot the same using the function below.

plot(varImp(Lasso, scale = F))

Just a short article covering a couple of concepts.

Keep Learning 🙂

The Data Monk

Ridge vs LASSO vs Elastic Net Regression

Ridge and LASSO are two important regression models which come in handy when Linear Regression fails to work.

This topic needed a separate mention because it's important to understand the COST function and the way it's calculated for Ridge, LASSO, and any other model.

Let’s first understand the cost function

Cost function is the amount of damage you are going to incur if your prediction goes wrong.

In layman's terms, suppose you run a pizza shop and you are predicting the number of pizzas sold in the coming 12 months. There would definitely be a delta between the actual and predicted values in your 'testing data set', right?
This is denoted by

Sum of Squared Errors (SSE) = Σ (predicted - actual)^2

i.e. there is 0 loss when you hit the correct prediction, but there is always a loss whenever there is a deviation.

This is your basic definition of cost function.

Linear, LASSO, Ridge, xyz, every algorithm tries to reduce the penalty i.e. Cost function score

When we talk about Ridge regression, it adds one more term to the above-mentioned cost function.


Ridge regression C.F. = SSE + lambda * Σ(Beta)^2
(the lambda * Σ(Beta)^2 term is the L2 regularization)
LASSO regression C.F. = SSE + lambda * Σ|Beta|
(the lambda * Σ|Beta| term is the L1 regularization)
Elastic Net regression C.F. = SSE + lambda * [ (1 - alpha) * Σ(Beta)^2 + alpha * Σ|Beta| ]
where SSE = Σ (predicted - actual)^2

When alpha = 0, the Elastic Net model reduces to Ridge, and when alpha = 1, it becomes LASSO; for values in between, the model behaves in a hybrid manner.
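A minimal glmnet sketch showing how alpha switches between the three (the matrix x and response y are random placeholders, not a real dataset):

library(glmnet)

x <- matrix(rnorm(100 * 5), ncol = 5)    # placeholder predictors
y <- rnorm(100)                          # placeholder response

ridge_fit <- glmnet(x, y, alpha = 0)     # alpha = 0       -> Ridge (L2 penalty)
lasso_fit <- glmnet(x, y, alpha = 1)     # alpha = 1       -> LASSO (L1 penalty)
enet_fit  <- glmnet(x, y, alpha = 0.5)   # 0 < alpha < 1   -> Elastic Net

cv_fit <- cv.glmnet(x, y, alpha = 1)     # cross-validation picks the best lambda
cv_fit$lambda.min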
V.V.I. Lines of wisdom below

The added term is called the penalty, lambda determines how severe the penalty is, and Beta is nothing but the slope (coefficient) of the regression line.
So you can see that we are increasing the SSE by adding a penalty term; this way we are deliberately making the present model a little worse on the training data 😛

The only difference between L1 and L2 regularisation, or Ridge and LASSO regression, is the penalty in the cost function. And the difference itself is quite evident, i.e. Σ(Beta)^2 vs Σ|Beta|.

You already know what alpha is, right? It is the mixing parameter that decides how much weight the L2 and L1 penalties get in Elastic Net.

And lambda is the tuning parameter that decides how strong the overall penalty is.

LASSO – Least Absolute Shrinkage and Selection Operator

Why do we need any other regression model?

Say you have two points in a co-ordinate plane (assume these two points are your training dataset, i.e. only two data points in your training dataset); you can easily draw a line passing through these two points.
A linear regression does the same, but now you have to test this LR with 7 data points in your test dataset. Take a look at the diagram below.

In the above pic, the two circles represent the two data points in the training dataset for your LR model. This model has perfect accuracy on the training dataset, but on the testing dataset you have 7 different data points where your model will suffer a large prediction error.
Prediction error is nothing but the difference between the predicted and the actual value.

In this case, the other regressions come to the rescue by changing the cost function.

Remember, till now the cost function was just the sum of squares of the difference between predicted and actual, correct?

Now we modify the regression line in such a way that it is less accurate on the training dataset but gives a better result on the test dataset. Basically, we compromise on the accuracy in the training dataset.

Now the line looks something like the one below

We compromise on the training but nail the testing part 😛

Now we know that we need to reduce the training model's accuracy, but how do we lose the model's accuracy?
By reducing the coefficient values of the features learnt while creating the model. Reiterating the same point as mentioned above:
the added term is called the penalty, lambda determines how severe the penalty is, and Beta is nothing but the slope of the regression line.
So you can see that we are increasing the SSE by adding the penalty term.

The key difference between these techniques is that LASSO shrinks the less important features' coefficients to zero, thus removing some features altogether. So, this works well for feature selection when we have a huge number of features.

The Lasso method on its own does not find which features to shrink. Instead, it is a combination of Lasso and Cross Validation (CV) which allows us to determine the best Lasso parameter.

These regressions help reduce variance by shrinking the parameters and making our predictions less sensitive to them.

Remember, when you have few data points, your Linear Regression might show good accuracy on the training dataset but not a good prediction on the testing dataset. In that case, do try Ridge, LASSO, and Elastic Net regression.

We will soon publish an article containing complete code covering all these algorithms, via a Hackathon solution or an open-source dataset.

Post your questions, if you have any

Keep Learning 🙂
The Data Monk

Linear, LASSO, Elastic Net, and Ridge Regression in Layman terms (R) with complete code – Part 1

Linear, LASSO, Elastic Net, and Ridge regression are four regression techniques which are helpful to predict or extrapolate using historic data.

Linear regression has no penalty term, so neither alpha nor lambda applies to it.

LASSO takes alpha as 1 and Ridge takes it as 0; Elastic Net is the middle way, and the value of alpha varies between 0 and 1.

In this article we will try to help you understand how to build the different models from scratch with ready-to-use code. You don't even have to download any dataset, as the data is already available in R.

The data is called the Boston Housing Data and the aim is to predict the price of a house in Boston using the following parameters:

CRIM: Per capita crime rate by town
ZN: Proportion of residential land zoned for lots over 25,000 sq. ft
INDUS: Proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: Nitric oxide concentration (parts per 10 million)
RM: Average number of rooms per dwelling
AGE: Proportion of owner-occupied units built prior to 1940
DIS: Weighted distances to five Boston employment centers
RAD: Index of accessibility to radial highways
TAX: Full-value property tax rate per $10,000
PTRATIO: Pupil-teacher ratio by town
B: 1000(Bk — 0.63)², where Bk is the proportion of [people of African American descent] by town
LSTAT: Percentage of lower status of the population
MEDV: Median value of owner-occupied homes in $1000s

Access the data, i.e. store the data locally, and then explore the basics of the dataset. I always try a handful of commands to get a gist of the dataset:
?DataSet – to know the column definitions (only for open-source datasets)
head(dataset) – to see the first few rows of all the columns
str(dataset) – to get the data type and the first few values of each column
summary(dataset) – to get the mean, median, percentiles, max, and min of each column; basically you understand the range of the numerical data

Before Loading Boston Housing Data, I personally import a few libraries which might or might not help in the analysis..I am Lazy as fuck !!

install.packages("mlbench")
install.packages("psych")
library(caret)
library(dplyr)
library(xgboost)
library(Matrix)
library(glmnet)
library(psych)
library(mlbench)

Understand the basics of the dataset, but first import the data set

data("BostonHousing")
BD <- BostonHousing

Now BD holds the complete data set. You can explore the dataset's column definitions with the following code:

?BostonHousing

Let's look at the head of the data set
head(BD)

While exploring multiple things, I came across one of the packages in R (psych) which has an awesome correlation function, pairs.panels(dataset[]).
Correlation requires only numeric variables.

pairs.panels(BD[c(-4,-14)])
The above code will get you all the correlations and scatter plots, which will help you understand the distributions as well as the correlation between variables. The matrix looks something like the one below.

Do try this visualisation, this might look a bit cluttered, but it’s actually gold

If you are not comfortable with the above plot and are more into conventional form of looking at correlation then try the cor() function

cor(BD[c(-4,-14)])

Eliminate collinearity, but why?
Okay, say you want to predict the salary of employees and there is a high correlation between age and the number of working years in the dataset. In this case, having both variables in the model does not make sense, as both symbolise the same thing.

High correlation leads to multicollinearity and thus overfitting

Now, let’s start with Linear Regression Model. The complete code is provided at the end of the tutorial

Use sample() to create a 70:30 division of the data into train and test sets (a minimal sketch is below).
Always create a cross validation parameter; here I am creating one with 10 parts and 5 repeats.
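A minimal sketch of the 70:30 split (it assumes the BD data frame loaded above; the seed is arbitrary):

set.seed(222)
ind   <- sample(2, nrow(BD), replace = TRUE, prob = c(0.7, 0.3))
train <- BD[ind == 1, ]
test  <- BD[ind == 2, ]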

#We have 387 observations in train and 119 observations in Test
#Create Cross Validation parameter, in CV training data is split into n #number of parts and each one is trained, after this model is created using #n-1 number of parts and then error is estimated from 1 part, this is #repeated x times. You can use verboseIter to monitor the progress while #the code is running. verboseIter is optional

cv <- trainControl(method="repeatedcv",
                    number=10,
                    repeats = 5,
                    verboseIter = T
                    )

In short, you are creating a parameter to divide a dataset into 10 parts and keep 9 to train and 1 to test it and you are doing it 5 times to eliminate the chances of random bias.
verboseIter = T gives a good experience when you see your code doing some fancy stuff. Take a slow-mo and put it on Instagram 😛

set.seed(34)
linear <- train(medv ~ .,
                BD,
                method = 'lm',
                trControl = cv)

linear$results
linear
summary(linear)

We will do all the EDA in some other tutorial. In this article we are only focusing on the explanation and code of each regression type.

This was the basic Linear Regression, we will evaluate all the models at the end of the series. First let’s create all the models

Next is Ridge Regression

set.seed(123)
ridge <- train(medv~.,
BD,
method = 'glmnet',
tuneGrid = expand.grid(alpha=0,
lambda = seq(0.0001,1,length=10)),
trControl=cv)


We will cover only Linear and Ridge Regression here.
In the next article we will cover LASSO and Elastic Net.
The third article will have the complete evaluation, picking up the best model, and predicting the test cases

One Hot Encoding – Feature Engineering

So, I just started solving the latest Hackathon on Analytics Vidhya, Women in the loop. Be it a real-life Data Science problem or a Hackathon, one-hot encoding is one of the most important parts of your data preparation.

If you don’t know about it yet, then you are definitely missing out on something which can boost your rank.

One hot encoding is a representation of categorical variables as binary vectors. What this means is that we want to transform a categorical variable or variables to a format that works better with classification and regression algorithms.

This is how One Hot Encoding works

How not to encode a categorical variable?
Basically, if you have a column with course details like Data Science, Software Development, Testing, etc., and you want to use this categorical variable in your model, a tempting shortcut is to replace the categories with numbers in a single column, so Data Science, Software Development, and Testing become 0, 1, 2, etc.

Now the problem is that 2 > 1 > 0 and the model might treat the labels as ordered numbers. So, to get things sorted, you need to tell the model: 'bro, these are categories and you dare not treat them as numbers'.

What to do?
Create new binary columns, one per category. So Data Science, Software Development, Testing, etc. each become a column with values 0 and 1. This whole process is called One Hot Encoding.

Example below

There was some JSON error while directly posting the code, so pasting the screenshot

Sales is the name of the column which we need to predict; we split the sample 8:2 into train and test.
These are the initial column names; here Course_Domain and Course_Type are the two columns which need the One Hot Encoding treatment.
ohe <- c("Course_Domain","Course_Type")
train_data = as.data.frame(train_data)

Put the names of the variables which need the OHE treatment in one place and convert train_data into a data frame.
dummies_train = dummyVars(~ Course_Domain + Course_Type, data = train_data)

df_ohe = as.data.frame(predict(dummies_train, newdata = train_data))

Here we are creating the dummy variables and converting them into a data frame. Let's see how the columns are named in the data frame df_ohe:
colnames(df_ohe)
[1] "Course_Domain.Business" "Course_Domain.Development"
[3] "Course_Domain.Finance & Accounting" "Course_Domain.Software Marketing"
[5] "Course_Type.Course" "Course_Type.Degree"
[7] "Course_Type.Program"

So, all the levels of the two columns were given a new column name and each has a value of 0 or 1. Awesome !!

df_train_ohe = cbind(train_data[,-c(which(colnames(train_data) %in% ohe))],df_ohe)
colnames(df_train_ohe)
The new list of columns in your training data set are below
colnames(df_train_ohe)
[1] "ID" "Day_No"
[3] "Course_ID" "Short_Promotion"
[5] "Public_Holiday" "Long_Promotion"
[7] "User_Traffic" "Competition_Metric"
[9] "Sales" "Course_Domain.Business"
[11] "Course_Domain.Development" "Course_Domain.Finance & Accounting"
[13] "Course_Domain.Software Marketing" "Course_Type.Course"
[15] "Course_Type.Degree" "Course_Type.Program"

You started with 11 variables, and now you have 16 columns; feed it into your XGB or Linear Regression. By the way, you still have 7 more days for the Hackathon. Try it 🙂

Keep Learning 🙂

The Data Monk