Linear Regression Interview Questions

Today is Day 19 and we will clear all the Interview Questions related to Linear Regression.
Linear Regression Interview Questions is the first part of a three-day series. All of today's questions revolve around the assumptions of Linear Regression.
Go through the below Linear Regression Interview Questions

What is Linear Regression – https://thedatamonk.com/question/linear-regression-what-is-a-linear-regression/
What are the assumptions of Linear Regression – https://thedatamonk.com/question/linear-regression-what-are-the-assumptions-of-linear-regression/
What is Auto-correlation – https://thedatamonk.com/question/linear-regression-what-is-auto-correlation-in-linear-regression-assumption/
What is multi-variate normality – https://thedatamonk.com/question/linear-regression-what-is-multivariate-normality-in-assumptions-of-linear-regression/
What is Homoscedasticity – https://thedatamonk.com/question/linear-regression-what-is-homoscedasticity-in-assumptions-of-linear-regression/
What is Multicollinearity – https://thedatamonk.com/question/linear-regression-what-is-multicollinearity/
What is VIF – https://thedatamonk.com/question/linear-regression-what-is-variance-inflation-factor-how-is-it-used/
What is Correlation Matrix – https://thedatamonk.com/question/linear-regression-what-is-correlation-matrix/
Is it necessary to satisfy all the assumptions of Linear Regression – https://thedatamonk.com/question/linear-regression-is-it-necessary-to-satisfy-all-the-assumptions-of-linear-regression/
What is Error Term – https://thedatamonk.com/question/linear-regression-what-is-error-term/


The Data Monk Interview Books – Don’t Miss

Now we are also available on our website where you can directly download the PDF of the topic you are interested in. At Amazon, each book costs ~299, on our website we have put it at a 60-80% discount. There are ~4000 solved interview questions prepared for you.

10 e-book bundle with 1400 interview questions spread across SQL, Python, Statistics, Case Studies, and Machine Learning Algorithms – Ideal for 0-3 years experienced candidates

23 E-book with ~2000 interview questions spread across AWS, SQL, Python, 10+ ML algorithms, MS Excel, and Case Studies – Complete Package for someone between 0 to 8 years of experience (The above 10 e-book bundle has a completely different set of e-books)

12 E-books for 12 Machine Learning algorithms with 1000+ interview questions – For those candidates who want to include any Machine Learning Algorithm in their resume and to learn/revise the important concepts. These 12 e-books are a part of the 23 e-book package

Individual 50+ e-books on separate topics

Important Resources to crack interviews (Mostly Free)

There are a few things which might be very useful for your preparation

The Data Monk Youtube channel – Here you will get only those videos that are asked in interviews for Data Analysts, Data Scientists, Machine Learning Engineers, Business Intelligence Engineers, Analytics managers, etc.
Go through the playlist for whichever topic makes you uncomfortable:-

All the list of 200 videos
Complete Python Playlist for Data Science
Company-wise Data Science Interview Questions – Must Watch
All important Machine Learning Algorithm with code in Python
Complete Python Numpy Playlist
Complete Python Pandas Playlist
SQL Complete Playlist
Case Study and Guesstimates Complete Playlist
Complete Playlist of Statistics

Machine Learning Interview Questions – Updated 2022

Today is Day 9 and we will try to solve the top 10 Machine Learning Interview questions. These questions will revolve around statistics and algorithms with a couple of questions

Before we get started, do create your profile on the website. The ‘Login’ area is at your extreme left.

You can get previous days questions here

Day 1- Overview – https://thedatamonk.com/data-science-interview-question-day1/
Day 2- SQL – https://thedatamonk.com/sql-interview-questions/
Day 3- Joins in SQL – https://thedatamonk.com/joins-in-sql/
Day 4 – Statistics – https://thedatamonk.com/statistics-interview-question/
Day 5 – Machine Learning – https://thedatamonk.com/machine-learning-interview-question/
Day 6 – Forecasting – https://thedatamonk.com/forecasting-interview-questions/
Day 7 – ARIMA – https://thedatamonk.com/arima-interview-questions/
Day 8 – Python – https://thedatamonk.com/python-interview-questions/

These questions will not take more than 20 minutes to answer, but if you are able to find solutions to these questions and jot them down in the comment box, then you will surely be able to answer them in interviews.

So, Start answering 🙂

Trade-off between Bias and Variance – https://thedatamonk.com/question/what-is-the-trade-off-between-bias-and-variance/
Supervised vs Unsupervised Algorithms – https://thedatamonk.com/question/supervised-and-unsupervised-difference/
Explain any algorithm in less than 1 minute – https://thedatamonk.com/question/explain-your-favourite-ml-algorithm-in-less-than-a-minute/
Missing Value Treatment – https://thedatamonk.com/question/what-to-do-if-one-of-my-column-with-integer-value-is-having-more-than-30-missing-values/
F1 score – https://thedatamonk.com/question/whats-the-f1-score/
Overfitting in a model – https://thedatamonk.com/question/how-do-you-ensure-youre-not-overfitting-with-a-model/
Evaluation of Logistic Regression model – https://thedatamonk.com/question/how-would-you-evaluate-a-logistic-regression-model/
Variable identification- https://thedatamonk.com/question/i-have-10-independent-variable-how-to-identify-the-important-variable-for-my-linear-regression-model/
SQL most asked output question – https://thedatamonk.com/question/sql-output-interview-question/
Low Bias High Variance Question – https://thedatamonk.com/question/low-bias-high-variance/

The Data Monk e-books

Tired of online courses costing 2 to 8 lakh and taking more than a year to complete?
Tired of going through 500+ hours of videos at a super slow pace?
We at The Data Monk believe that you have to start and complete things as quickly as possible. We believe in a target-based study where we break a topic into 100 questions and make sure that if you cover these questions you will surely be able to crack the interview questions. Rest all theory and practical can ONLY be learned while working in an organization.


Pick any of our books from our e-shop page and complete it in 6-8 hours, learn the 100 questions and write it in your resume. We guarantee you that you will nail 8 out of 10 interviews

We also have 3 bundles at a price that is affordable to everyone. We are a group of people placed in best of the product-based companies and we take 100+ interviews per week. Do we know what is being asked and what is not? So, just grab any of the following book bundles and give not more than 30 days to LEARN all the questions. We guarantee you that you will become a very strong candidate in any analytics interview

Set A – [3rd/4th year/ and 0 to 3 years of experience]

Crack any analytics or data science interview with our 1400+ interview questions which focus on multiple domains i.e. SQL, R, Python, Machine Learning, Statistics, and Visualization. – https://thedatamonk.com/product/books-to-crack-analytics-interview/

Set B – [0-5 Years of Experience]

1200+ Interview Questions on all the important Machine Learning algorithms (including complete Python code) Ada Boost, CNN, ANN, Forecasting (ARIMA, SARIMA, ARIMAX), Clustering, LSTM, SVM, Linear Regression, Logistic Regression, Sentiment Analysis, NLP, K-Mean – https://thedatamonk.com/product/machine-learning-interview-questions/

Set C – [0-7 Years of Experience]

2000+ interview questions that include 100 questions each on 12 most asked Machine Learning Algorithms, Python, Numpy and Pandas – 300 Interview Questions, Pandas,PCA,AWS,Data Preprocessing,Case Studies, and many more
https://thedatamonk.com/product/the-data-monk-e-book-bundle/

Note – Set C contains all the questions of Set B


Youtube Channel – The Data Monk

Unlike other youtube channels, we do not teach basic stuff; we teach only topics that are asked in interviews. If the interviewer asks about p-value, we will have a video on that topic.
If the interviewer is interested in asking about the sequence of execution of SQL commands, then we will give you an overview of all the commands but put so much stress on that question that you can answer it comfortably in the interview. We definitely recommend you follow our youtube channel for any topic that you are interested in or weak at.

If you wish to get all the study material and topics to cover for an interview at one place, then you can subscribe to our channel. We have covered the complete syllabus of
Get all the youtube videos playlist on our youtube Channel – The Data Monk

Code in Python for Data Science – Understand one algorithm at a time in 30 minutes (theory and python code)
Company-wise Data Science Interview Questions – 15 videos on how to crack analytics interview
Complete Numpy Tutorial – 14 videos on all the functions and questions on Numpy
Complete Python Pandas Tutorial – 15 videos to completely cover Pandas
SQL Complete Playlist – 20 highly recommended videos to cover all the interview questions
Case Study and Guesstimates Complete Playlist –  Real-life interview case study asked in 2021
Statistics– 10 videos to completely cover Statistics for interviews


Lastly,
If you are in dire need of any help, be it book-wise or guidance-wise, then you can definitely connect with me on Linkedin. We will try to help as much as possible

ARIMA interview questions | Day 7

Today is Day 7 and we will have 10 ARIMA interview questions for you. These ARIMA interview questions are of intermediate level and are asked very often in interviews

You can get previous days questions here

Day 1- Overview – https://thedatamonk.com/data-science-interview-question-day1/
Day 2- SQL – https://thedatamonk.com/sql-interview-questions/
Day 3- Joins in SQL – https://thedatamonk.com/joins-in-sql/
Day 4 – Statistics – https://thedatamonk.com/statistics-interview-question/
Day 5 – Machine Learning – https://thedatamonk.com/machine-learning-interview-question/
Day 6 – Forecasting – https://thedatamonk.com/forecasting-interview-questions/

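ARIMA (AutoRegressive Integrated Moving Average) is a classic time series algorithm which is built from
Autoregression (AR)
Integration, i.e. differencing to remove trend (I)
Moving Average (MA)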

Apart from ARIMA there are other similar and efficient algorithms like
– ARIMAX
– SARIMA
– Holt Winters
– Long Short Term Memory

These ARIMA interview questions will help you understand Time Series in a better way. If you think you know nothing about Time Series, then try to google it, or you can find good blogs at Analytics Vidhya as well.

This is one of those algorithms which will help you in a lot of Hackathons.

So, solve the following

R Squared error – https://thedatamonk.com/question/how-does-the-value-of-r-squared-and-adjusted-r-squared-error-change-when-you-add-new-variable-in-your-model/
Low R Squared error – https://thedatamonk.com/question/what-is-better-a-low-r-squared-or-a-high-r-squared/
p,d,q in ARIMA – https://thedatamonk.com/question/what-is-the-acceptable-value-range-for-pd-and-q-in-arima/
LSTM – https://thedatamonk.com/question/explain-long-short-term-memory-algorithm-in-brief/
Holt Winters – https://thedatamonk.com/question/explain-holt-winters-in-brief/
SARIMA – https://thedatamonk.com/question/what-is-sarima-and-how-is-it-different-from-arima/
ARIMAX vs ARIMA – https://thedatamonk.com/question/how-is-arimax-different-from-arima/
How to get value of p,d,q – https://thedatamonk.com/question/how-to-get-the-value-of-d-in-pdq-in-arima/
PACF graph – https://thedatamonk.com/question/how-to-read-pacf-graph/
ACF graph – https://thedatamonk.com/question/how-to-read-acf-graph-what-is-lag-in-acf-graph/

Forecasting Interview Questions | Day 6

Forecasting is one of the popular domains in Data Science. If you are exploring Data Science then you should definitely look into this domain.
Today is Day 6 and we will have 10 Forecasting interview questions for you. These forecasting interview questions are of intermediate level and are asked very often in the interviews

You can get previous days questions here

Day 1- https://thedatamonk.com/data-science-interview-question-day1/
Day 2-https://thedatamonk.com/sql-interview-questions/
Day 3-https://thedatamonk.com/joins-in-sql/
Day 4 – https://thedatamonk.com/statistics-interview-question/
Day 5 – https://thedatamonk.com/machine-learning-interview-question/

There are multiple ways to do forecasting: you can use Linear Regression, XGBoost, or any other regression algorithm.
Today we will concentrate on the ARIMA time series model. If you are completely unfamiliar with Time Series or the ARIMA model, then comment your email id below and I will send you a book ‘Master Forecasting in R’

Following are the 10 Forecasting interview Questions

Forecast Number of Pizzas(Case Study Hike Messenger) – https://thedatamonk.com/question/how-to-forecast-number-of-pizza-which-will-be-sold-in-pizza-hut-in-next-week/
Degree of Freedom – https://thedatamonk.com/question/what-is-degree-of-freedom-in-arima/
BIC in ARIMA- https://thedatamonk.com/question/what-is-bic-in-arima-model/
ARIMA in simplest terms – https://thedatamonk.com/question/define-arima-in-simplest-terms/
AIC in ARIMA – https://thedatamonk.com/question/what-is-aic-in-arima/
ADF Test for Time Series – https://thedatamonk.com/question/what-is-adf-test-in-time-series-analysis/
Cyclicity and Seasonality- https://thedatamonk.com/question/difference-between-cyclicity-and-seasonality/
Seasonality – https://thedatamonk.com/question/explain-seasonality/
Trend – https://thedatamonk.com/question/explain-the-trend-in-forecasting/
ARIMA sample code – https://thedatamonk.com/question/write-sample-code-of-arima-in-python-or-r/

Machine Learning Interview Questions | Day 5

For Day 5 we will concentrate on Machine Learning Interview Questions. The following 10 questions are mostly asked by an interviewer to know the level of clarity in basics of ML.
We have covered 10 questions on SQL, Statistics, and case studies in the first 4 days. The below mentioned Machine Learning Interview Questions will help you in clearing your base.

You can get previous days questions here

Day 1- https://thedatamonk.com/data-science-interview-question-day1/
Day 2-https://thedatamonk.com/sql-interview-questions/
Day 3-https://thedatamonk.com/joins-in-sql/
Day 4 – https://thedatamonk.com/statistics-interview-question/

Type 1 Error – https://thedatamonk.com/question/explain-type-1-error-in-simplest-terms/
Type 2 error- https://thedatamonk.com/question/explain-type-2-error-in-simple-terms/
What is worse, type 1 error or type 2 error – https://thedatamonk.com/question/what-is-worse-type-1-error-or-type-2-error/
Confusion Matrix – https://thedatamonk.com/question/you-have-created-a-model-for-fire-alarm-explain-confusion-matrix-with-this-example/
Recall – https://thedatamonk.com/question/explain-recall-in-simple-terms/
Precision – https://thedatamonk.com/question/explain-precision-in-the-simplest-terms/
Regression – https://thedatamonk.com/question/when-to-use-linear-and-when-yo-use-logistic-regression/
Decision trees – https://thedatamonk.com/question/explain-decision-tree-in-simple-terms/
Choose K in k-means – https://thedatamonk.com/question/what-are-the-methods-to-choose-the-value-of-k-in-k-means-algorithm/
Number of trees in Random Forest – https://thedatamonk.com/question/how-to-select-number-of-trees-in-random-forest/

Some of the most asked ML algorithms are
1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. Random Forest
5. K-Means
6. KNN
7. XGBoost
8. Natural Language Processing
9. ARIMA

Everything mentioned above is covered in our 2300+ interview questions book with solution. Do check-out.

What is Stationarity in Time Series?

A common first step in time series analysis is to make the data set stationary. Stationarity means that the mean and variance stay close to constant across time.
[Figure: a series with an increasing trend (red line) and the same series after de-trending (blue line)]

The red line above shows an increasing trend and the blue line is the result of de-trending the series. De-trending means fitting a regression line and then subtracting it from the original data
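De-trending as described here can be sketched in a few lines of Python. This is only an illustration: the `detrend` helper and the `series` values are made up for the example, and the regression line is a plain least-squares fit against the time index.

```python
def detrend(values):
    """Fit a least-squares line to (time index, value) pairs and subtract it."""
    n = len(values)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(values) / n
    # Ordinary least-squares slope and intercept of value against time index
    slope = (
        sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
        / sum((x - x_mean) ** 2 for x in xs)
    )
    intercept = y_mean - slope * x_mean
    # What is left after removing the fitted line is the de-trended series
    return [y - (intercept + slope * x) for x, y in zip(xs, values)]


series = [10, 13, 15, 18, 20, 24, 25, 29]  # hypothetical upward-trending data
residuals = detrend(series)
print([round(r, 2) for r in residuals])
```

The residuals hover around zero instead of climbing, which is the behaviour the blue line in the figure illustrates.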

Stationarity does not mean that the series does not change over time, just the way it changes does not itself change over time.

The reason why we need stationary data is simple – it’s easier to analyze and predict a data set with stationarity. If a series is consistently increasing over time (like the one above), then the sample mean and variance will also grow with the size of the sample, and your model or the proposed time series solution will always underestimate the mean and variance in future periods.

How do you check the stationarity of a series?
In general, we use the Augmented Dickey-Fuller (ADF) test or the KPSS test to check the stationarity of a series. Here we will discuss only the ADF test; KPSS some other time

ADF is a statistical significance test (a test which involves a null and an alternate hypothesis) and it falls under the category of ‘unit root tests’. Now, what is a unit root test? Consider a series of the form

Y(t) = α·Y(t−1) + β·Xe + ϵ

where Y(t) is the value of the time series at time ‘t’ and Xe is an exogenous variable (a separate explanatory variable, which is also a time series). A unit root exists when α = 1.

The presence of a unit root means the time series is non-stationary. Besides, the number of unit roots contained in the series corresponds to the number of differencing operations required to make the series stationary.

A time series is a process that can be written in terms of its components, which contain ‘roots’. For example:

v(t) = c + a1·v(t−1) + ϵ(t−1)

The coefficient a1 is a root. You can interpret this process/formula as ‘the value of today depends on the value of yesterday and some randomness we can’t predict’. As long as |a1| < 1, we expect this process to always drift back to its long-run level c / (1 − a1), which is simply c when c = 0.

Try this out with an example:
suppose c=0 and a1=0.5.

If yesterday's value v(t−1) was 100, then we expect today's value to be around 50. Tomorrow, we expect the value to be 25, and so on.

You see that this series will ‘come home’, in this case meaning it will converge back to the value of c, which is 0 here.

When one of the roots is a unit, i.e. equal to 1 (in this example when a1=1), then this series will not recover to its origin. You can see this by using the example given above.
That is why the concepts of unit roots and unit root tests are useful: they give us insight into whether the time series will recover to its expected value. If this is not the case, then the process will be very susceptible to shocks and hard to predict and control.
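The example can be replayed in Python. The sketch below is a simplification: the `simulate_ar1` helper is a made-up name, and it leaves out the random shock so that only the effect of the root a1 is visible.

```python
def simulate_ar1(c, a1, start, steps):
    """Iterate v(t) = c + a1 * v(t-1), leaving out the random shock."""
    values = [start]
    for _ in range(steps):
        values.append(c + a1 * values[-1])
    return values


# a1 = 0.5: the series halves at every step and 'comes home' towards c = 0
stationary = simulate_ar1(c=0, a1=0.5, start=100, steps=5)
# a1 = 1: a unit root, the series never moves back towards c
unit_root = simulate_ar1(c=0, a1=1.0, start=100, steps=5)

print(stationary)  # [100, 50.0, 25.0, 12.5, 6.25, 3.125]
print(unit_root)   # [100, 100.0, 100.0, 100.0, 100.0, 100.0]
```

With a1 = 1 a value of 100 simply stays at 100, so any shock that hits the series is carried forward forever instead of dying out.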

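What is the significance of p-value in ADF test?
The null hypothesis of the ADF test is that the series has a unit root, i.e. it is non-stationary. A high p-value, say 0.87, means we cannot reject that null hypothesis, so we treat the series as non-stationary. It does not mean there is an 87% probability that the series is non-stationary. A p-value below the chosen significance level (commonly 0.05) lets us reject the null and treat the series as stationary.
If the series is non-stationary, we difference it, more than once if needed, to make it stationary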

library(tseries)  # adf.test() is assumed to come from the tseries package
adf.test(diff(time_series))

In the above snippet, we difference the time series once and then test stationarity using the ADF test in R.
You can also try differencing twice, or differencing after taking the log, to check stationarity (if the noise is high)

adf.test(diff(log(time_series)))

A rule of thumb – Don’t over-difference, i.e. don’t apply 6-7 rounds of differencing just to push the p-value down and call the dataset stationary.

In the case of a first difference, we are literally taking the difference between a value and the one for the time period immediately before it. If you need a high order of differencing, it usually means that your data is too noisy to follow a clear time series pattern

Differencing can help stabilize the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality.
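To see how differencing removes a trend, here is a small Python sketch. The `difference` helper and both input lists are made up for illustration; it mirrors what `diff()` does in the R snippets, applied `order` times.

```python
def difference(values, order=1):
    """Apply first differencing `order` times: each value minus the previous one."""
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


linear = [3, 5, 7, 9, 11, 13]      # hypothetical series with a linear trend
curved = [1, 4, 9, 16, 25, 36]     # hypothetical series with a curved trend

print(difference(linear))          # [2, 2, 2, 2, 2]
print(difference(curved))          # [3, 5, 7, 9, 11]
print(difference(curved, order=2)) # [2, 2, 2, 2]
```

One round of differencing flattens the linear trend, while the curved series needs two rounds before its level stops changing. Each round also shortens the series by one observation.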

Bottom line :-
-The value of today depends on the value of yesterday and some randomness we can’t predict
-Stationarity is useful to identify the pattern in order to predict values
-You difference the dataset once, twice, three times, etc. until it becomes stationary
-Do an ADF or a KPSS test to check if the series is stationary
-After that, relax 🙂

Missing Value Treatment – Mean, Median, Mode, KNN Imputation, and Prediction

Missing Value treatment is no doubt one of the most important parts of the whole process of building a model. Why?
Because we can’t afford to eliminate rows wherever there is a missing value in any of the columns. We need to tackle it in the best possible way. There are multiple ways to deal with missing values, and these are my top four methods:-

1. Mean – When do you take an average of a column? There is a saying which goes like this, “When a billionaire walks into a small bar, everyone becomes a millionaire”
So, avoid using the mean as a missing value treatment technique when the range is too high. Suppose there are 10,000 employees with a salary of Rs. 40,000 each and 100 employees with a salary of Rs. 1,00,000 each. In this case you can consider using the mean for missing value treatment.

But if there are 10 employees, with 8 of them earning Rs. 40,000 and one of them earning Rs. 10,00,00, you should avoid using the mean for missing value treatment, because that one large salary pulls the average far above what most employees earn. You can use the mode !!

2. Median – Median is the middle term when you write the terms in ascending or descending order. Can you think of one example where you can use this? The answer is at the bottom of the article

3. Mode – Mode is the most frequently occurring value. As we discussed in point one, we can use the mode where there is a high chance of repetition.

4. KNN Imputation – This is often one of the most accurate ways to fill a missing value. Here the k most similar neighbors of the incomplete row are searched for, and the missing value is estimated from them. The similarity of two rows is determined using a distance function.

In one of the hackathons, I had to impute the missing values of age, so I tried the following (in R)

library(DMwR)  # knnImputation() is provided by the DMwR package
new_dataset <- knnImputation(data = df, k = 8)  # fill each NA from its 8 nearest rows

k-nearest neighbours can predict both qualitative and quantitative attributes, but it consumes a lot of time and processing power
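To make the idea behind KNN imputation concrete, here is a minimal Python sketch. It is not what `knnImputation` does internally: the `knn_impute` helper, the column layout, and the data are all made up for illustration, and it uses a plain unweighted mean of the neighbours with unscaled Euclidean distance.

```python
import math


def knn_impute(rows, target, k):
    """Fill None in column `target` with the mean of that column over the
    k complete rows that are closest in the remaining columns."""
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(list(row))
            continue

        def dist(other):
            # Euclidean distance over every column except the one being imputed
            return math.sqrt(sum(
                (a - b) ** 2
                for i, (a, b) in enumerate(zip(row, other)) if i != target
            ))

        neighbours = sorted(complete, key=dist)[:k]
        new_row = list(row)
        new_row[target] = sum(n[target] for n in neighbours) / len(neighbours)
        filled.append(new_row)
    return filled


# Hypothetical rows: [years_of_experience, salary_in_thousands, age]
people = [
    [1, 30, 23],
    [2, 35, 25],
    [3, 40, 26],
    [10, 90, 38],
    [12, 110, 41],
    [2, 36, None],   # age is missing here
]
result = knn_impute(people, target=2, k=2)
print(result[-1])  # [2, 36, 25.5]
```

The two nearest rows are the ones aged 25 and 26, so the missing age becomes 25.5. In practice the columns would usually be scaled first, so that one column with large numbers does not dominate the distance.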

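For a time series, you can also fill the gaps by interpolation, for example with the imputeTS package:

install.packages("imputeTS")
library(imputeTS)
x <- ts(c(12,23,41,52,NA,71,83,97,108))

# na_interpolation() is the current name; older imputeTS releases call it na.interpolation()
na_interpolation(x)                      # linear interpolation (default)

na_interpolation(x, option = "spline")   # spline interpolation

na_interpolation(x, option = "stine")    # Stineman interpolation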



5. Bonus type – Prediction
This is another way of fixing the missing values. You can try linear regression, time series analysis, or any other method to fill in the missing values using prediction.
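As a rough illustration of the prediction idea, here is a minimal base-R sketch that fills missing salaries with a linear regression on experience. The data frame, its column names, and all the numbers are invented for this example.

```r
# Toy data: salary grows roughly linearly with experience (values invented).
df <- data.frame(
  experience = c(1, 2, 3, 4, 5, 6, 7, 8),
  salary     = c(30, 35, NA, 46, 50, NA, 61, 65)
)

# Fit on the rows where salary is known; lm() drops rows with NA by default.
fit <- lm(salary ~ experience, data = df)

# Predict only for the rows where salary is missing and fill them in.
missing <- is.na(df$salary)
df$salary[missing] <- predict(fit, newdata = df[missing, , drop = FALSE])

print(df)
```

The same pattern works with any model that has a predict() method. The important part is that the model is trained only on rows where the target is known.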

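Median – You can use the median where there is low variance in a column such as age. It is also a safer choice than the mean when the column has outliers, because extreme values do not pull it.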
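To see the difference between the three simple methods, here is a small base-R sketch in the spirit of the salary example. The salaries are invented, and stat_mode() is a hypothetical helper written for this sketch, since base R has no built-in function for the statistical mode.

```r
# Invented salaries: eight at 40,000, one extreme outlier, one missing.
salary <- c(rep(40000, 8), 1000000, NA)

# Hypothetical helper: most frequent non-missing value.
stat_mode <- function(x) {
  x  <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

mean_value   <- mean(salary, na.rm = TRUE)    # pulled up by the outlier
median_value <- median(salary, na.rm = TRUE)  # not affected by the outlier
mode_value   <- stat_mode(salary)

# Fill the gap with the median (the mode gives the same value here).
imputed <- ifelse(is.na(salary), median_value, salary)

print(c(mean = mean_value, median = median_value, mode = mode_value))
```

With these numbers the mean lands far above what nine of the ten people could plausibly earn, while the median and the mode both stay at 40,000.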


I came across KNN Imputation, so I thought of sharing it!!
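If you want to see what KNN imputation does under the hood, here is a simplified base-R sketch. It is not the code of any package: knn_impute_column() is a hypothetical helper, the data is invented, and it uses a plain average of the k nearest rows with Euclidean distance on scaled columns.

```r
# Hypothetical helper: fill NAs in one numeric column using the k nearest
# rows with a known value, with distance measured on the other (scaled) columns.
knn_impute_column <- function(df, target, k = 3) {
  features <- scale(df[, setdiff(names(df), target), drop = FALSE])
  known    <- which(!is.na(df[[target]]))
  for (i in which(is.na(df[[target]]))) {
    # Euclidean distance from row i to every row with a known target value.
    diffs   <- sweep(features[known, , drop = FALSE], 2, features[i, ])
    dists   <- sqrt(rowSums(diffs^2))
    nearest <- known[order(dists)[seq_len(min(k, length(known)))]]
    df[[target]][i] <- mean(df[[target]][nearest])
  }
  df
}

# Invented data: age is missing for the last two people.
people <- data.frame(
  height = c(150, 152, 155, 170, 172, 175, 151, 173),
  weight = c(45, 47, 50, 70, 72, 76, 46, 73),
  age    = c(12, 13, 14, 30, 32, 35, NA, NA)
)

imputed <- knn_impute_column(people, target = "age", k = 3)
print(imputed)
```

The first missing row sits among the small and light people, so it gets the average age of that group, and the second one gets the average age of the tall and heavy group. Scaling the columns first keeps one column with large numbers from dominating the distance.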

Keep Learning 🙂
The Data Monk

Story of Bias, Variance, Bias-Variance Trade-Off

Why do we predict?
We predict in order to identify future trends using our sample data set. Whenever we create a model, we try to create a formula out of our sample data set, and the aim of this formula is to satisfy all the possible conditions of the universe.
Mathematicians and statisticians all across the globe try to create a perfect model that can answer future questions.

Thus we create a model, and this model is bound to have some error. Why? Because we can’t cover all the possible combinations in one formula. The difference between the actual and the predicted value is called the prediction error.

Bias – It is the difference between the average prediction of the model and the actual values. A model with HIGH bias is a very simple model, and its predictions will be far away from the actual values in both the train and test data sets.

Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbors and SVM.

Examples of high-bias machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression

Variance – Variance refers to how much the model’s predictions change when it is trained on a different sample of data. A model with high variance is so specific to its training data set that it tries to cover every point while training, which results in high training accuracy but low test accuracy.

Examples of low-variance machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression.

Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines.

A simple model with high bias (left) and a complicated model with high variance (right)

As you can see, the line on the right tries to cover all the points, so it creates a complicated model which is very accurate on the training data set.

Let’s see what an under-fitting, an over-fitting, and a good model look like

As you can see, high variance occurs in a model that tries to create a complicated formula on the training data set.
A high-bias model is very generic. In other words, it just settles for some rough average without much thought.

If you want to understand the mathematics behind these errors, then below is the formula
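In its usual textbook form, the expected squared error at a point x is written as follows. The notation is an assumption made here: f is the true function, \hat{f} is the fitted model, y = f(x) + \varepsilon is the observed value, and \sigma_\varepsilon^2 is the variance of the noise.

```latex
\mathrm{Err}(x) = \mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2
+ \mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]
+ \sigma_\varepsilon^2
```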

The above formula has 3 terms: the first term is the squared bias, the second is the variance, and the third is the irreducible error.
No matter what, you can’t remove the irreducible error. It is the measure of noise in the data, and you can’t have a noiseless data set.

When you have a very limited dataset, there is a high chance of ending up with an under-fitting model (High Bias and Low Variance).
When you have very noisy data, the model tries to fit a complicated formula, which might result in over-fitting on the training dataset (High Variance and Low Bias).

What is the bias-variance trade-off?
The trade-off between bias and variance is done to minimize the overall error(formula above)

Error = Reducible Error+Irreducible Error
Reducible error = (Bias)^2 + Variance

Estimated Mean Square Error(Pic. from Quora)

Let’s try to simplify the formulas for Bias and Variance
Bias = Average estimate of target - Target
Variance of estimates = Average of (Estimated target - Average estimate of target)^2
The variance error measures how much our estimate of the target function would differ if a new training data set was used.

To keep all the errors positive, we have the squared bias, the variance (which is itself a squared value), and the irreducible error, which is the variance of the noise

The bias-variance trade-off is the property of a set of predictive models whereby models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa.
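The trade-off is easier to believe once you simulate it. The base-R sketch below refits a straight line and a degree-9 polynomial on many noisy samples and measures the squared bias and variance of their predictions at one point. Everything in it is an assumption made for illustration: the "true" function sin(2*pi*x), the noise level, the point x = 0.25, and the helper names.

```r
set.seed(42)

true_f <- function(x) sin(2 * pi * x)   # assumed "true" relationship
x      <- seq(0, 1, length.out = 20)    # fixed design points
x0     <- data.frame(x = 0.25)          # point at which we study the error
n_sims <- 500

# Refit a polynomial of the given degree on fresh noisy samples and
# collect its prediction at x0 each time.
predict_at_x0 <- function(degree) {
  replicate(n_sims, {
    d   <- data.frame(x = x, y = true_f(x) + rnorm(length(x), sd = 0.3))
    fit <- lm(y ~ poly(x, degree), data = d)
    unname(predict(fit, newdata = x0))
  })
}

# Squared bias and variance of a set of predictions at x0.
summarise_model <- function(preds) {
  c(bias_sq  = (mean(preds) - true_f(x0$x))^2,
    variance = mean((preds - mean(preds))^2))
}

simple_model  <- summarise_model(predict_at_x0(1))  # straight line
complex_model <- summarise_model(predict_at_x0(9))  # degree-9 polynomial

print(rbind(simple = simple_model, complex = complex_model))
```

The straight line should show the larger squared bias and the polynomial the larger variance, which is the trade-off described above in numbers.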

How do we actually manage the bias-variance trade-off?
There are multiple methods for managing the B-V trade-off:
-Separate training and testing dataset
-Cross-Validation
-Good Performance metrics
-Fitting model parameters

Keep Learning 🙂
The Data Monk

Multicollinearity in Simple Terms

We all know the definition of multicollinearity: when 2 or more explanatory variables in a multiple regression model are highly linearly related, it’s called multicollinearity.

Example –
Age and Selling price of a Car
Education and Annual Income
Height and Weight

Why should we remove multicollinearity from our model?
For example, you are watching WWE and Batista is thrashing Undertaker. Now you know that Batista is better.
But suppose it’s a Royal Rumble where 5 wrestlers are beating Undertaker simultaneously. Now you can’t say which one is attacking with what intensity, and thus you can’t say which wrestler among the five is better.

Thus, when you have multiple variables which are correlated, the model is unable to assign a proper weight to the impact of each variable. So, we need to remove redundant variables.

What methods are used to remove multicollinearity?
There are two methods to do the same:
1. VIF – It stands for Variance Inflation Factor. During regression analysis, VIF assesses whether factors are correlated with each other (multicollinearity), which could affect the p-values and make the model less reliable.

A factor with a high VIF should be removed. A VIF of 1 suggests no correlation, and a common rule of thumb treats values above 5 or 10 as a sign of serious multicollinearity.

2. PCA – Principal component analysis (PCA) is a technique used to emphasise variation and bring out strong patterns in a dataset. It’s often used to make data easy to explore and visualise. The components it produces are uncorrelated with each other, which is why it helps with multicollinearity.
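For the VIF in point 1, the usual definition is VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on all the other predictors. Here is a minimal base-R sketch of that definition on the mtcars dataset that ships with R. vif_manual() is a hypothetical helper written for this example and the choice of columns is arbitrary. Packages such as car offer a ready-made vif() function.

```r
# Hypothetical helper: VIF for every column of a numeric predictor data frame.
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    # Regress this predictor on all the other predictors.
    fit <- lm(reformulate(setdiff(names(predictors), v), response = v),
              data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# disp, hp and wt in mtcars tend to move together.
predictors <- mtcars[, c("disp", "hp", "wt", "qsec")]
vifs <- vif_manual(predictors)
print(round(vifs, 2))
```

A predictor that the others can explain almost perfectly has an R^2 close to 1, which makes its VIF very large.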

How do we deal with multicollinearity in our model?
1. You can use feature engineering to combine the two variables into one and then use this new variable

2. Use VIF/PCA to eliminate one of the variables
You should eliminate the one which is less strongly correlated with the target variable
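Here is a small sketch of the PCA route using base R's prcomp() on three correlated mtcars columns. The choice of columns and of mpg as the target is an assumption for illustration only.

```r
predictors <- mtcars[, c("disp", "hp", "wt")]
print(round(cor(predictors), 2))   # the raw predictors are strongly correlated

# Centre and scale the columns, then rotate them into principal components.
pca    <- prcomp(predictors, center = TRUE, scale. = TRUE)
scores <- pca$x
print(round(cor(scores), 2))       # the components are uncorrelated

# Share of the total variance captured by each component.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_share, 3))

# Use the first component as one combined predictor instead of three.
model <- lm(mpg ~ PC1, data = data.frame(mpg = mtcars$mpg, PC1 = scores[, 1]))
print(coef(model))
```

This is one way of turning several correlated variables into one. The price you pay is that PC1 is a blend of the original columns, so it is harder to explain than a single raw variable.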

I think this is enough about multicollinearity. Let me know if you have any questions.

Keep Learning 🙂
The Data Monk

Cross Validation and varImp in R

I was working on our next book (Linear, Ridge, LASSO, and Elastic Net Algorithm explained in layman terms with code in R) when I thought of covering the simple concepts which are quite helpful while creating models.

Cross Validation is one simple concept which helps you build a better model, because it gives a more reliable estimate of how the model performs on unseen data. A lot of you must already be using k-fold cross validation.

Let’s quickly go through this relatively simple concept and there is no better way than starting with code

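library(caret)  # provides trainControl(), train() and varImp()

cv <- trainControl(method = "repeatedcv",  # repeated k-fold cross validation
                   number = 10,             # number of folds (k)
                   repeats = 5,             # how many times to repeat the k-fold split
                   verboseIter = TRUE       # print progress while training
)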

Here we are creating a variable which holds a property, i.e. whenever this variable ‘cv’ is passed to a model definition, it asks the model to divide the dataset into 10 roughly equal parts, train the model on 9 parts, and test on the remaining one. Each of the 10 parts takes its turn as the test part, i.e. the model is always trained on k-1 folds.

repeats = 5 means the whole 10-fold process is repeated 5 times with a fresh random split each time, giving 10 x 5 = 50 train-and-test rounds.
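To make the splitting concrete, here is a hand-written base-R sketch of repeated 10-fold cross validation. It mirrors the idea behind method = "repeatedcv" and is not caret's internal code. The mtcars dataset and the formula mpg ~ wt + hp are stand-ins chosen for illustration.

```r
set.seed(1)
k       <- 10
repeats <- 5
n       <- nrow(mtcars)

rmse <- c()
for (r in seq_len(repeats)) {
  # New random assignment of rows to folds on every repeat.
  fold_id <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    train <- mtcars[fold_id != f, ]   # k-1 folds for training
    test  <- mtcars[fold_id == f, ]   # the held-out fold for testing
    fit   <- lm(mpg ~ wt + hp, data = train)
    pred  <- predict(fit, newdata = test)
    rmse  <- c(rmse, sqrt(mean((test$mpg - pred)^2)))
  }
}

print(length(rmse))  # one RMSE per fold per repeat
print(mean(rmse))    # the cross-validated RMSE
```

Averaging over all the rounds gives a steadier error estimate than a single train/test split, which is the reason for repeating the process.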

What would you do with this repeated training?
We will compute the Root Mean Square Error, R Square, and Mean Absolute Error across these runs, and will then decide on the best model.

And this is how we use it in a Ridge model

ridge <- train(medv ~ .,
               data = BD,
               method = 'glmnet',
               tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 1, length.out = 10)),
               trControl = cv
)

So, here we are creating a Ridge Regression model that predicts the value of medv on the dataset BD, using the glmnet method. The tuning grid tells the model that it’s a ridge model (alpha = 0) and supplies a total of 10 equally spaced lambda values ranging from 0.0001 to 1.

After all this, we tell the model to use cross validation through the trControl parameter
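If you are wondering what the lambda values actually do in a ridge model, here is a base-R sketch of the closed-form ridge solution, (X'X + lambda * I)^-1 X'y, on standardised mtcars columns. The columns and lambda values are chosen only for illustration, and ridge_coef() is a hypothetical helper. glmnet scales its penalty differently, so these lambda values are not directly comparable with the ones in the tuning grid.

```r
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))  # standardised predictors
y <- mtcars$mpg - mean(mtcars$mpg)                       # centred response

# Hypothetical helper: closed-form ridge coefficients for one lambda.
ridge_coef <- function(lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

lambdas <- c(0, 1, 10, 100)
coefs   <- sapply(lambdas, ridge_coef)
colnames(coefs) <- paste0("lambda=", lambdas)
print(round(coefs, 3))

# Overall size of the coefficient vector for each lambda.
coef_norm <- sqrt(colSums(coefs^2))
print(round(coef_norm, 3))
```

With lambda = 0 this is ordinary least squares, and as lambda grows the coefficients are pulled towards zero. Cross validation is what picks the lambda that balances the two.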

The next function which I love while creating models is varImp. This is a simple function which ranks the variables of a model by their importance. It is part of the caret package.

varImp(Lasso, scale = FALSE)

Here we have at least 3 and at most 4 important variables to consider in the model. You can also plot the same using the below function

plot(varImp(Lasso, scale = FALSE))

Just a short article covering a couple of concepts.

Keep Learning 🙂

The Data Monk