Here we will talk about the concepts that will help you understand Linear Regression and tackle questions in your interview. I personally know a handful of candidates who have boasted of being masters of Linear Regression but have failed to answer basic questions (read: me :P)
That’s why we have tried to consolidate all the learnings from my peer group.
To understand a Data Science algorithm you need to cover at least these three things:-
1. Assumptions of the algorithm
2. Code and Mathematical Formulas
3. Evaluating the Performance of your model
Regression is a method of modeling a dependent variable based on independent variables. It is used extensively to solve basic forecasting problems.
Do you want to know the next question asked by the panel (Lowe’s Technical Interview Round)?
What is the difference between forecasting and prediction?
These two do look the same, but there is a striking difference between them. Forecasting is based on historical data; it is mostly about projecting a line of future values by extending the historical trend.
Prediction is a judgment: it takes into account changes that are taking place in the future.
When you say it will rain tomorrow by looking at the historical data, that’s forecasting; reading a palm and telling someone’s future is an example of prediction. So, be ready for all types of questions.
Let’s get back to Linear Regression. In layman’s terms, Regression is a method to predict the outcome of a variable in the best possible way, given the past data and its outcomes.
Example – Can you forecast the salary of the 40-45 age group using the given data?
You can guess that the salary would be somewhere in the range of $4,000 – $4,500 by looking at the outcomes already given. This is how a basic Linear Regression works.
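The toy forecast above can be sketched with scikit-learn. The age/salary pairs below are hypothetical stand-ins for the data referred to in the example:

```python
# A minimal sketch of the salary forecast, assuming hypothetical
# age/salary pairs in place of the data referred to above.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: age vs. annual salary (in $)
ages = np.array([[25], [30], [35], [40], [45], [50]])
salaries = np.array([2500, 3000, 3500, 4000, 4500, 5000])

model = LinearRegression().fit(ages, salaries)

# Forecast the salary for a 42-year-old
predicted = model.predict(np.array([[42]]))[0]
print(round(predicted))  # lands in the $4,000 - $4,500 range
```

The model simply fits the best straight line through the historical points and extends it to the new age.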
The core of Linear Regression is to understand how the outcome variable is dependent on the independent variables.
What are the assumptions of Linear Regression?
This is one of the most asked questions in an interview that revolves around Linear Regression. There are five major assumptions of a Linear Regression:-
1. Linear Relationship
2. Multivariate Normality
3. Low or No Multicollinearity
4. No Autocorrelation
5. Homoscedasticity
You don’t have to memorize these points, but you do need to understand each of them before diving into the implementation part.
1. Linear Relationship – Linear regression assumes that there is a straight-line relationship between X and Y in the equation given below
Y = Bo + B1X + Epsilon
Y = Bo + B1X is nothing but the equation of a straight line
To do any statistical inference, we need to make some assumptions about the error term, which is represented as Epsilon. This is where the first assumption comes into the picture: we assume three things about this random error term:-
a. The mean of the error is 0
b. The error is normally distributed
c. The error has a constant variance
Read the following line at least thrice to understand it
“Error Term ~ N(0, Variance) – This shows that every error term is normally distributed, has a mean of 0, and has a constant variance”
Remember the equation Y = Bo+B1X+Epsilon ??
Now we can re-write the equation as Y ~ N(Bo+B1X,Variance)
This means that Y is normally distributed with a mean of Bo+B1X and a constant variance.
So the first assumption goes like this: “The dependent variable is a linear combination of the independent variable and the error term”
You can check the relationship between the dependent and independent variable with a scatter plot
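As a quick numeric companion to the scatter plot, the Pearson correlation coefficient indicates how strong the straight-line relationship is. The data below is simulated purely for the sketch:

```python
# Sketch: checking for a straight-line relationship between X and Y.
# The data here is simulated: roughly linear with some noise.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=50)

# A Pearson correlation close to +1 or -1 suggests a linear relationship;
# in practice you would also eyeball plt.scatter(x, y).
r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))
```

A value near 0 would warn you that a straight line is a poor description of the data, even before you fit anything.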
2. Multivariate Normality
The second assumption states that every linear combination of Y with the independent variables needs to have a univariate normal distribution. Multivariate normality is not a fancy term; it is just a generalization of the one-dimensional normal distribution to higher dimensions.
3. Low or No Multicollinearity
This one is easy to understand. Suppose you want to find the selling price of a car, and your independent variables are the age of the car, kilometers driven, health of the engine, etc.
Now, we know that the number of kilometers and the age of the car will (generally) have a high correlation: the number of kilometers traveled by the car increases with the age of the car.
Using two highly correlated variables in a linear regression complicates the model without adding value, so you need to chuck one of them. We will talk about two ways in which you can decide on removing one of the variables from age and kilometers.
1. Variance Inflation Factor (VIF)
A VIF greater than 10 is commonly taken as a direct indication of high multicollinearity (some practitioners use 5 as the cutoff). In layman’s terms, remove the variable with the high VIF value.
2. Correlation Matrix – Plot a correlation matrix to understand the strength of the correlation between two variables. Take out one of the two variables at a time and check the performance of the model.
4. No Autocorrelation
In the last point we talked about multicollinearity between the independent variables. Now we want to check whether there is a correlation within the data itself.
In simple language, if the value of f(x+1) is dependent on f(x), then the data is auto-correlated. A classic example is the share price, where the price depends on the previous value. You can check for autocorrelation in the data with either a scatter plot or a Durbin-Watson test. The null hypothesis of the Durbin-Watson test is that “the residuals are not linearly correlated”
If 1.5 < d < 2.5, then the values are not auto-correlated.
See, if the data itself is correlated, then it is hard to know the impact of the other variables on Y. So it is assumed that there is little or no correlation within the data.
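The Durbin-Watson check can be sketched with statsmodels. On simulated, independent residuals the statistic should come out near 2:

```python
# Sketch: Durbin-Watson test on simulated residuals.
# Independent (non-autocorrelated) residuals give d close to 2.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(42)
residuals = rng.normal(0, 1, 500)   # independent, so no autocorrelation

d = durbin_watson(residuals)
print(round(d, 2))  # expect a value between 1.5 and 2.5
```

A value well below 1.5 would suggest positive autocorrelation (e.g. share-price-like data); well above 2.5 suggests negative autocorrelation.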
5. Homoscedasticity
If the residuals have an equal (constant) variance across the regression line, then the data is homoscedastic and this assumption is satisfied.
We will look into the implementation part in the next article which will be followed by the evaluation of performance.
Linear Regression Part 2 -Implementation of LR
Linear Regression Part 3 – Evaluation of the model
Keep Learning 🙂
The Data Monk