Why do we need to remove multicollinearity?
Question
What would happen if we don’t remove it?
Statistics
Answers ( 11 )
Multicollinearity means that there is an association between two or more independent variables, or in other words, that one variable is largely explained by another. Using two collinear variables in the model adds redundancy, and possibly overfitting, because we use two or three variables to explain variance in the output that any one of them could explain alone. If we do not remove collinear features, we may still get a high R-squared value on the training set, but very few variables will be statistically significant. In linear regression, a variable that should have a positive (negative) coefficient may even end up with a negative (positive) one.
You can refer to the “Multicollinearity” chapter of “Basic Econometrics” by D. Gujarati, which explains this in very simple language: https://www.amazon.in/Basic-Econometrics-Damodar-Gujarati/dp/0071333452
Suppose you have two variables X1 and X2, where X1 = 2*X2, and you are predicting some variable Y. X1 and X2 are linearly related, so instead of using both variables you can use just one of them, because the other can be written in terms of it.
This kind of problem is called multicollinearity: two independent features are highly correlated, or one is explained by another variable.
In this situation, if we don’t remove one of them, the coefficients of these variables will be unstable, and weight can shift between them interchangeably.
Take linear regression: if the data has perfect multicollinearity, the rank of the design matrix X is less than the number of features, which means X^T X is singular and we can’t compute its inverse (you will still get a solution from statistical software that uses the pseudo-inverse).
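To make the rank argument concrete, here is a minimal sketch using NumPy. The data values, the target Y, and the explicit tolerances are assumptions chosen for illustration. They are not from the answer above.

```python
import numpy as np

# Hypothetical data: X1 is exactly twice X2, as in the example above.
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x1 = 2.0 * x2
y = 1.0 + 3.0 * x2  # assumed target, for illustration only

# Design matrix with an intercept column: [1, X1, X2].
X = np.column_stack([np.ones_like(x2), x1, x2])

# Explicit tolerance so that round-off in the zero singular value is ignored.
rank = np.linalg.matrix_rank(X, tol=1e-10)  # 2, although X has 3 columns

# X^T X is singular, but the pseudo-inverse still returns one
# (minimum-norm) least-squares solution.
w_pinv = np.linalg.pinv(X, rcond=1e-10) @ y

# A different coefficient vector that fits the data equally well:
# all the weight sits on X1 (1.5 * X1 == 3 * X2).
w_alt = np.array([1.0, 1.5, 0.0])

pred_pinv = X @ w_pinv
pred_alt = X @ w_alt
print("rank:", rank)
print("pinv coefficients:", w_pinv)
print("alternative coefficients:", w_alt)
```

Both coefficient vectors reproduce Y exactly, which is why the individual coefficients cannot be pinned down when one feature is an exact multiple of another.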
Multicollinearity is not an issue if prediction is the only end goal. However, regression is also used to analyze the impact of the individual variables on the final output, and this is where multicollinearity comes into the picture. If two independent variables are highly correlated with each other, it can distort the coefficients they are given.
The absence of (perfect) multicollinearity is also one of the basic assumptions of linear regression. Apart from that, if multicollinearity is present, our model is unstable and may not give accurate results.
It’s very important to reduce multicollinearity, as it can significantly reduce model performance without us noticing. Removing it also reduces the number of features, which results in a less complex model and less overhead to store those features.
Understanding multicollinearity conceptually:
Imagine you went to watch a rock band concert. There are two singers, a drummer, a keyboard player, and two guitarists. You can easily differentiate between the voices of the singers, as one is male and the other female, but you may have trouble determining who is playing the guitar better.
Both guitarists are playing in the same tone, at the same pitch and at the same speed. If you removed one of them, it wouldn’t be a problem, since both are almost the same.
The benefit of removing one guitarist is cost cutting and fewer members in the team. In machine learning, it means fewer features for training, which leads to a less complex model.
Here both guitarists are collinear. If one plays the guitar slowly, the other also plays slowly. If one plays faster, the other also plays faster.
Multicollinearity exists whenever an independent variable is highly correlated with one or more of the other independent variables in a multiple regression equation.
Multicollinearity between variables means that the independent variables are correlated with each other.
As the number of variables increases, R-squared also increases, even when the added variables are collinear and contribute little new information. As a result, the gap between R-squared and adjusted R-squared widens (adjusted R-squared is used to remove the artificial gain in R-squared that comes from adding variables), and the performance of the model suffers.
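The gap between the two measures can be illustrated with the standard adjusted R-squared formula. This is a small sketch. The function name and the R-squared values passed in are hypothetical, chosen only to show the effect.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical numbers: R^2 creeps up slightly as redundant features are added,
# but the penalty for the extra features outweighs the gain.
few = adjusted_r2(0.800, n_samples=30, n_features=2)
many = adjusted_r2(0.805, n_samples=30, n_features=10)
print(f"2 features:  R^2 = 0.800, adjusted R^2 = {few:.3f}")
print(f"10 features: R^2 = 0.805, adjusted R^2 = {many:.3f}")
```

In this made-up case the model with ten features has the higher R-squared but the lower adjusted R-squared.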
1) Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model. This means that an independent variable can be predicted from another independent variable in a regression model. For example, mileage and price of a car, study time and leisure time, etc.
2) This can occur because of poor data creation methods, or because new variables are derived from existing ones (for example, BMI from height and weight), etc.
3) Multicollinearity can be a problem in a regression model because we would not be able to distinguish between the individual effects of the independent variables on the dependent variable, and it may reduce model performance.
Example: Y = W0+W1*X1+W2*X2
Coefficient W1 is the increase in Y for a unit increase in X1 while keeping X2 constant. But since X1 and X2 are highly correlated, changes in X1 would also cause changes in X2 and we would not be able to see their individual effect on Y.
This makes the effects of X1 on Y difficult to distinguish from the effects of X2 on Y.
Multicollinearity occurs when independent variables in the regression model are related to each other. This correlation is a problem because independent variables should be independent. If the degree of correlation is high enough, it can cause problems when we fit the model and interpret the results.
Multicollinearity causes the following two basic types of problems:
1) The coefficient estimates can swing wildly based on which other independent variables are in the model. The coefficients become very sensitive to small changes in the model.
2) Multicollinearity reduces the precision of the estimated coefficients, which weakens the statistical power of your regression model. You might not be able to trust the p-values to identify independent variables that are statistically significant.
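Both problems can be seen in a small simulation. The sketch below refits the same model on many simulated samples and measures how much the estimate of one coefficient moves around. The true coefficients, sample size, number of repetitions, and correlation levels are all assumptions made for illustration.

```python
import numpy as np


def coef_spread(rho: float, n: int = 100, n_sims: int = 500, seed: int = 0) -> float:
    """Std. dev. of the fitted W1 across repeated samples when corr(X1, X2) is about rho."""
    rng = np.random.default_rng(seed)
    w1_estimates = []
    for _ in range(n_sims):
        x1 = rng.normal(size=n)
        # Build X2 so that its population correlation with X1 equals rho.
        x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
        # Assumed true model: Y = 1 + 2*X1 + 2*X2 + noise.
        y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        w1_estimates.append(w[1])
    return float(np.std(w1_estimates))


spread_low = coef_spread(rho=0.0)    # uncorrelated features
spread_high = coef_spread(rho=0.99)  # highly correlated features
print(f"spread of W1 estimate, rho=0.00: {spread_low:.3f}")
print(f"spread of W1 estimate, rho=0.99: {spread_high:.3f}")
```

The data-generating process is identical apart from the correlation between X1 and X2, yet the estimate of W1 varies far more from sample to sample when the features are highly correlated. That extra variance is what weakens the p-values.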
Multicollinearity can be a problem in a regression model because we would not be able to distinguish between the individual effects of the independent variables on the dependent variable. For example, let’s assume that in the following linear equation:
Y = W0+W1*X1+W2*X2
Coefficient W1 is the increase in Y for a unit increase in X1 while keeping X2 constant. But since X1 and X2 are highly correlated, changes in X1 would also cause changes in X2 and we would not be able to see their individual effect on Y.
This makes the effects of X1 on Y difficult to distinguish from the effects of X2 on Y.
Multicollinearity may not affect the accuracy of the model much. But we might lose reliability in determining the effects of individual features in the model, and that can be a problem when it comes to interpretability.
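A rough way to check the accuracy claim is to compare held-out R-squared for a model that uses both correlated features against a model that uses only one of them. This is a sketch on simulated data. The coefficients, noise levels, train/test split, and the helper names `r2_score` and `fit_predict` are assumptions for this example.

```python
import numpy as np


def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination on the given data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def fit_predict(X_train: np.ndarray, y_train: np.ndarray, X_test: np.ndarray) -> np.ndarray:
    """Ordinary least squares with an intercept, then predict on X_test."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ w


rng = np.random.default_rng(42)
n = 400
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)  # X2 is almost a copy of X1
y = 3.0 * x1 + 2.0 * x2 + rng.normal(size=n)

X_both = np.column_stack([x1, x2])
train, test = slice(0, 300), slice(300, n)

r2_both = r2_score(y[test], fit_predict(X_both[train], y[train], X_both[test]))
r2_one = r2_score(y[test], fit_predict(X_both[train, :1], y[train], X_both[test, :1]))
print(f"held-out R^2 with X1 and X2: {r2_both:.4f}")
print(f"held-out R^2 with X1 only:   {r2_one:.4f}")
```

In this setup the two scores come out nearly identical: the predictions barely change, while the individual coefficients of the two-feature model are the part that becomes hard to interpret.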
Moderate multicollinearity may not be problematic. However, severe multicollinearity is a problem because it can increase the variance of the coefficient estimates and make the estimates very sensitive to minor changes in the model. The result is that the coefficient estimates are unstable and difficult to interpret. Multicollinearity saps the statistical power of the analysis, can cause the coefficients to switch signs, and makes it more difficult to specify the correct model.
Multicollinearity means that two or more independent input variables/features are highly correlated, so that one can be explained with the help of the others.
Using these variables as inputs to the model would mean that we are over-fitting the model when explaining the variation in the predicted output variable.
Multicollinearity can be a problem in a regression model because we would not be able to distinguish between the individual effects of the independent variables on the dependent variable and may reduce model performance.
Thus, to predict the output variable, we should remove these highly correlated features and keep only one of them, so that the R-squared value is not inflated by redundant inputs.
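One simple way to do this is a greedy filter on the pairwise correlation matrix. The sketch below is only one possible approach. The helper name `drop_correlated`, the 0.9 threshold, and the simulated features are assumptions for illustration.

```python
import numpy as np


def drop_correlated(X: np.ndarray, names: list, threshold: float = 0.9) -> list:
    """Greedily keep a feature only if |corr| with every kept feature is <= threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]


rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + 0.01 * rng.normal(size=200)  # nearly a multiple of x1
x3 = rng.normal(size=200)                    # unrelated feature

X = np.column_stack([x1, x2, x3])
selected = drop_correlated(X, ["x1", "x2", "x3"])
print(selected)
```

The filter keeps the first feature of each highly correlated group, so the result depends on column order. Which member of a group to keep is a judgment call that this sketch does not address.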