How many regression techniques do you know?
Linear and Logistic Regression are the building blocks of your Data Science career. There are many regression models, but we will try to cover the top 5 that can come in handy.
First of all, what on earth is regression?
In layman's terms, regression is a way to model the relationship between a dependent variable and one or more independent variables, so that you can extrapolate from your dataset and predict values.
If you know the attendance, class test marks, and last year's records of a student, then you can predict their performance in the current semester.
If you know the weight, BMI, sugar level, etc. of a patient, then you can predict whether they will suffer from diabetes in the future.
In the same way, you can predict a lot of things. Using regression, you can observe the relationship between the independent and dependent variables of your model, and you can also check the correlation between the variables.
This simple method will help you identify the important variables and then use them in your model directly.
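Checking the correlation between two variables can be sketched in a few lines. This is only an illustration in plain Python: the attendance and marks figures are made up, and `pearson_correlation` is a hypothetical helper name, not something from a library.

```python
def pearson_correlation(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and the spread of each variable on its own
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical data: attendance (%) and marks of four students
attendance = [60, 70, 80, 90]
marks = [55, 65, 70, 85]

r = pearson_correlation(attendance, marks)
print(round(r, 3))  # close to +1 means the two variables move together
```

A value near +1 or -1 suggests the variable is worth trying in the model; a value near 0 suggests there is little linear relationship.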
1. Linear Regression
Linear Regression is the king of all the regression models when it comes to predicting continuous values.
Linear Regression predicts continuous values, while Logistic Regression predicts probabilities. So Linear is used to predict a quantity (like the cost of an apartment, the number of upcoming tickets, etc.) and Logistic is used to classify things (like whether a person will suffer from some disease, or whether a client will buy insurance).
Linear Regression is a simple Y = mX + C line, where we predict the value of Y from one or more X variables, each with its own slope m, plus a constant C.
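To make the Y = mX + C line concrete, here is a minimal sketch that fits m and C by ordinary least squares for a single X. The apartment sizes and prices are invented for illustration, and `fit_line` is a hypothetical helper, not a library function.

```python
def fit_line(xs, ys):
    """Return the slope m and constant c that minimize the squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

# Hypothetical data: apartment size and price (made up so the points lie on a line)
area = [500, 750, 1000, 1250, 1500]
price = [25, 35, 45, 55, 65]

m, c = fit_line(area, price)
predicted = m * 2000 + c  # extrapolate the line to an unseen apartment size
print(m, c, predicted)
```

With more than one X the idea is the same, but the slopes are found together by solving a small system of equations rather than with this one-variable formula.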
Assumptions of Linear Regression (Interview Question):
1. Linear relationship between the independent and dependent variables
2. Little or no multicollinearity
3. No autocorrelation – Autocorrelation occurs when the residuals are not independent of each other. In other words, when the value of y(x+1) is not independent of the value of y(x).
4. Multivariate normality – The linear regression analysis expects the variables to be multivariate normal; in practice this is usually checked on the residuals, which should be approximately normally distributed.
In plain words 😛
Look, in Linear Regression you have to predict a value using the equation Y = mX + C. Here X covers all the independent variables, e.g. m1X1, m2X2, etc., and C is just a constant. This equation gives you a line, and by extrapolating that line you get future values. X1 and X2 should not be correlated with each other, meaning it should not happen that both variables have the same impact on Y. R-squared, Adjusted R-squared, and similar measures tell you how well the model has been built. There is a whole article on this on the website, so go find it and read it.
2. Logistic Regression
Let’s start with a simple definition: Logistic Regression is mostly used for classification, and mainly for binary classification. Suppose you want to classify whether an advertisement will receive a hit or not. You will have a lot of independent variables; you take a handful of relevant ones and build your Logistic Regression model, which gives you a probability between 0 and 1 for each advertisement. In the usual case, if the probability is below 0.5 you label it 0, otherwise 1.
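The "probability, then 0 or 1" step can be sketched as follows. The weights and bias here are hand-picked for illustration, not fitted to any data, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability of class 1 for one record."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def classify(probability, threshold=0.5):
    """Turn a probability into a 0/1 label using the usual 0.5 cut-off."""
    return 1 if probability >= threshold else 0

# Hypothetical coefficients and two hypothetical advertisements
weights = [0.8, -1.2]
bias = -0.5
ads = [[2.0, 0.5], [0.5, 1.5]]

probabilities = [predict_probability(ad, weights, bias) for ad in ads]
labels = [classify(p) for p in probabilities]
print(probabilities, labels)
```

The 0.5 threshold is only the default choice; when one kind of mistake is costlier than the other, it is common to move it.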
– Logistic regression doesn’t require a linear relationship between the dependent and independent variables, because it applies a non-linear log transformation to the odds. What it does assume is that the log-odds are linear in the independent variables.
– It requires little or no multicollinearity among the independent variables.
– It is called multinomial logistic regression if the dependent variable is multiclass rather than binary, i.e. you are predicting more than two classes.
– It requires large sample sizes, because maximum likelihood estimates are less powerful at low sample sizes than ordinary least squares.
How to check the performance of a Logistic Regression model?
1. Confusion matrix – A model that is 99% accurate can still be of no use to the system. Why? Think about it, dumbo 😛 (Hint: what if 99% of the records belong to one class?)
2. Likelihood ratio test – A logistic regression is said to provide a better fit to the data if it demonstrates an improvement over a model with fewer predictors. This is checked using the likelihood ratio test, which compares the likelihood of the data under the full model against the likelihood of the data under a model with fewer predictors.
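Both checks can be sketched in plain Python. The patient counts and predicted probabilities below are invented purely for illustration, and the helper names are hypothetical. The first part shows why 99% accuracy can be useless on imbalanced data; the second part computes a likelihood ratio statistic for a reduced and a full model.

```python
import math

def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a == 1 and p == 1),
        "tn": sum(1 for a, p in pairs if a == 0 and p == 0),
        "fp": sum(1 for a, p in pairs if a == 0 and p == 1),
        "fn": sum(1 for a, p in pairs if a == 1 and p == 0),
    }

# Hypothetical data: 990 healthy patients (0) and 10 sick ones (1)
actual = [0] * 990 + [1] * 10
always_healthy = [0] * 1000  # a "model" that never predicts the disease

cm = confusion_matrix(actual, always_healthy)
accuracy = (cm["tp"] + cm["tn"]) / len(actual)   # 99% accurate...
recall = cm["tp"] / (cm["tp"] + cm["fn"])        # ...yet it misses every sick patient

def log_likelihood(actual, probabilities):
    """Log-likelihood of 0/1 labels under predicted probabilities of class 1."""
    return sum(math.log(p) if a == 1 else math.log(1.0 - p)
               for a, p in zip(actual, probabilities))

# Hypothetical predicted probabilities from a reduced and a full model
reduced = [0.01] * 1000               # same probability for everyone
full = [0.005] * 990 + [0.5] * 10     # extra predictors separate the sick patients better

lr_statistic = 2 * (log_likelihood(actual, full) - log_likelihood(actual, reduced))
print(accuracy, recall, lr_statistic)
```

The likelihood ratio statistic is compared against a chi-square distribution whose degrees of freedom equal the number of extra predictors in the full model; a large value means the extra predictors genuinely improve the fit.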
Now you are done with Linear and Logistic Regression, time to check some other regression techniques.
3. Ridge Regression
Ridge regression is used to analyze multiple regression data that suffer from multicollinearity.
Let’s suppose that our regression equation is
Y = XB + e
where Y is the dependent variable, X represents the independent variables, B holds the regression coefficients to be estimated, and e is nothing but the errors/residuals.
In ridge regression, we first standardize the variables: subtract the mean from each variable and then divide by its standard deviation.
Ridge regression is a method that seeks to reduce the MSE by adding some bias and, in exchange, reducing the variance.
From an equation standpoint, you can think of ordinary least squares as a method that seeks the coefficients that minimize the sum of the squares of the residuals. Ridge regression adds an extra term to be minimized: when you perform ridge regression, you minimize the sum of the squares of the residuals plus a penalty on the sum of the squares of the regression coefficients. This second term, the sum of the squares of the regression coefficients, is how the bias is introduced into the model.
In plain words
The Linear Regression equation is Y = mX + c, but there is one more component in it, the error, so the equation becomes Y = mX + c + error. This error is not simple either: it is called the prediction error, and it comes from bias and variance. Bias is nothing but the simplifying assumptions the model makes when it is fitted, and variance is how much the fitted line changes when the training data changes. Ridge regression works on exactly these two parts. It is not that hard; read it twice and you will get it, and if you still don't, take it lightly and move on, you don't have to understand everything 😛
To balance bias and variance, ridge regression minimizes the least-squares sum plus λ times the sum of the squared betas.
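The ridge idea described above can be sketched for the simplest case of a single standardized predictor, where the penalized objective has a one-line solution. This is an illustration only: the data is made up, there is no intercept because both variables are standardized, and `ridge_coefficient` is a hypothetical helper.

```python
def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

def ridge_coefficient(xs, ys, lam):
    """Coefficient b that minimizes sum((y - b*x)^2) + lam * b^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Setting the derivative of the objective to zero gives b = sxy / (sxx + lam)
    return sxy / (sxx + lam)

# Hypothetical data
x = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
y = standardize([1.2, 1.9, 3.2, 3.8, 5.1])

# lam = 0 is plain least squares; larger lam shrinks the coefficient toward zero
coefficients = {lam: ridge_coefficient(x, y, lam) for lam in (0.0, 1.0, 10.0)}
print(coefficients)
```

The shrinking coefficient is the bias being added; in return, the estimate becomes less sensitive to noise in the data, which is the variance being reduced.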
4. Stepwise Regression
Stepwise Regression is fairly simple to understand. Basically, there are two types of stepwise regression:
1. Forward Stepwise Regression – Here the model starts with zero predictor variables and gradually keeps adding variables while checking the performance of the model. It keeps adding variables until another variable no longer improves the model, or until all of the variables are covered.
2. Backward Stepwise Regression – Here the model starts with all the variables and keeps removing them one at a time while checking the performance, until removing another variable no longer helps or no variables are left.
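Forward selection can be sketched as a greedy loop. This version assumes the common stopping rule of quitting as soon as no remaining variable improves the score. The `score` function is a stand-in for whatever you would actually use (adjusted R-squared, AIC, etc.); the toy scores and variable names below are invented.

```python
def forward_stepwise(candidates, score):
    """Greedily add the variable that improves score(selected) the most."""
    selected = []
    remaining = list(candidates)
    best_score = score(selected)
    while remaining:
        # Try each remaining variable on top of the current selection
        trials = [(score(selected + [var]), var) for var in remaining]
        trial_score, best_var = max(trials)
        if trial_score <= best_score:
            break  # no remaining variable improves the model
        selected.append(best_var)
        remaining.remove(best_var)
        best_score = trial_score
    return selected, best_score

# Toy score: hypothetical usefulness of each variable, minus a small
# penalty per variable (a stand-in for a real model-quality measure)
gains = {"attendance": 0.40, "test_marks": 0.30, "shoe_size": 0.00}

def toy_score(variables):
    return sum(gains[v] for v in variables) - 0.05 * len(variables)

selected, final_score = forward_stepwise(gains.keys(), toy_score)
print(selected, final_score)
```

Backward stepwise is the mirror image: start with every variable selected and, at each step, drop the one whose removal improves the score the most.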
5. Lasso Regression
Lasso is useful because it helps prevent overfitting on the training dataset. It does this by penalizing the L1 norm of the weights found by the regression. The L1 norm of a vector is the sum of the absolute values of its components. It is a natural way of measuring the distance between two vectors, namely the sum of the absolute differences of their components. In this norm, all the components of the vector are weighted equally.
One approach is to add $\lambda \sum_{i=1}^{M} |w_i|$ to the regression error.
The second approach is to minimize the regression error subject to the constraint $\sum_{i=1}^{M} |w_i| \le \eta$.
Both approaches can be shown to be equivalent using Lagrange multipliers.
Lasso is great for reducing the feature space, because when λ is sufficiently large, many of the weights (w_i) are driven exactly to zero.
One disadvantage is that Lasso, in general, has no closed-form analytic solution, so it has to be solved numerically, for example with quadratic programming or coordinate descent.
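To see weights being driven to zero, here is a small sketch that solves the penalized form with coordinate descent, a commonly used numerical method for Lasso (used here in place of quadratic programming). The assumptions are mine: the objective is scaled as 0.5 * squared error + λ * sum of |w_i|, there is no intercept, and the data is made up so that the second feature matters very little.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; small values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(rows, ys, lam, iterations=100):
    """Minimize 0.5 * sum((y - w.x)^2) + lam * sum(|w_i|), without an intercept."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    for _ in range(iterations):
        for j in range(n_features):
            rho = 0.0   # how strongly feature j explains what the others leave over
            norm = 0.0  # sum of squares of feature j
            for row, y in zip(rows, ys):
                partial = sum(weights[k] * row[k]
                              for k in range(n_features) if k != j)
                rho += row[j] * (y - partial)
                norm += row[j] ** 2
            weights[j] = soft_threshold(rho, lam) / norm
    return weights

# Hypothetical data generated as y = 2*x1 + 0.1*x2
rows = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
ys = [2.1, 3.9, 6.1, 7.9]

w_no_penalty = lasso_coordinate_descent(rows, ys, lam=0.0)   # plain least squares
w_lasso = lasso_coordinate_descent(rows, ys, lam=1.0)        # weak feature is dropped
w_heavy = lasso_coordinate_descent(rows, ys, lam=100.0)      # everything is dropped
print(w_no_penalty, w_lasso, w_heavy)
```

The soft-threshold step is what produces exact zeros: any feature whose contribution is smaller than λ gets a weight of 0 instead of a tiny non-zero value.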
Supporting Link – https://www.quora.com/What-is-the-LASSO-technique-for-regression
In the upcoming articles, we will have detailed coding examples in R.
Try applying Logistic, Linear, and Stepwise Regression on your own 🙂
Keep Learning 🙂
The Data Monk