Linear, Ridge, LASSO, and Elastic Net regression are four techniques for predicting a numeric outcome from historical data.
Plain linear regression has no penalty term, so there is no lambda to tune.
Ridge, LASSO, and Elastic Net all add a penalty whose strength is set by lambda. In glmnet a second parameter, alpha, selects the type of penalty: alpha = 0 gives Ridge, alpha = 1 gives LASSO, and Elastic Net is the middle way, with alpha anywhere between 0 and 1.
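For reference, the objective that glmnet minimises for a numeric response can be written as below. The notation is introduced here and is not from the article: N observations, response y_i, predictor vector x_i, intercept \beta_0, coefficient vector \beta, penalty strength \lambda, and mixing parameter \alpha.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\;\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
\;+\;\alpha\,\lVert\beta\rVert_1\right]
```

With \lambda = 0 the penalty vanishes and ordinary least squares is left. With \alpha = 0 only the squared (Ridge) term survives, and with \alpha = 1 only the absolute-value (LASSO) term.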
In this article we will help you understand how to build the different models from scratch, with ready-to-use code. You don’t even have to download a dataset, as the data is already available in R through the mlbench package.
The data is called Boston Housing Data, and the aim is to predict the median house price in Boston-area towns using the following parameters (MEDV, the last entry, is the value being predicted):
CRIM: Per capita crime rate by town
ZN: Proportion of residential land zoned for lots over 25,000 sq. ft
INDUS: Proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: Nitric oxide concentration (parts per 10 million)
RM: Average number of rooms per dwelling
AGE: Proportion of owner-occupied units built prior to 1940
DIS: Weighted distances to five Boston employment centers
RAD: Index of accessibility to radial highways
TAX: Full-value property tax rate per $10,000
PTRATIO: Pupil-teacher ratio by town
B: 1000(Bk - 0.63)², where Bk is the proportion of [people of African American descent] by town
LSTAT: Percentage of lower status of the population
MEDV: Median value of owner-occupied homes in $1000s
Access the data, i.e. store it in a local variable, and then explore the basics of the dataset. I always try a handful of commands to get the gist of a dataset:
?dataset – To know the column definitions (only for datasets that ship with documentation)
head(dataset) – To see the first 6 rows of all the columns
str(dataset) – To get the data type and first few values of each column
summary(dataset) – To get the minimum, quartiles, median, mean, and maximum of each column; basically, you understand the range of the numerical data
Before loading the Boston Housing data, I personally import a few libraries which might or might not help in the analysis. I am lazy as fuck!!
install.packages("mlbench")
install.packages("psych")
library(caret)
library(dplyr)
library(xgboost)
library(Matrix)
library(glmnet)
library(psych)
library(mlbench)
Understand the basics of the dataset, but first import the data set
data("BostonHousing")
BD <- BostonHousing
Now BD has the complete data set; you can explore the dataset’s column definitions with the following code
?BostonHousing
Let's look at the head of the data set
head(BD)
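The other exploration commands mentioned earlier can be run on BD in the same way. A minimal sketch, assuming the mlbench package is installed:

```r
library(mlbench)

data("BostonHousing")
BD <- BostonHousing

dim(BD)            # number of rows and columns
str(BD)            # data type and first few values of each column
summary(BD$medv)   # spread of the target variable
sapply(BD, class)  # shows which columns are numeric and which are factors
```

The class listing is worth a glance because it shows which columns cannot go straight into a correlation matrix.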
While exploring, I came across the psych package in R, which has an awesome correlation function: pairs.panels(dataset[])
Correlation requires only numeric variables, so we drop column 4 (chas, a factor) and also column 14 (medv, the target)
pairs.panels(BD[c(-4,-14)])
The above code gives you all the pairwise correlations and scatter plots, which helps you understand the distributions as well as the correlations between variables. The matrix looks something like the one below
If you are not comfortable with the above plot and prefer a more conventional way of looking at correlations, try the cor() function
cor(BD[c(-4,-14)])
Eliminate collinearity, but why?
Okay, say you want to predict the salary of employees and there is a high correlation between age and number of working years in the dataset. In this case having both variables in the model does not make much sense, as both represent largely the same thing.
High correlation between predictors leads to multicollinearity, which makes the coefficient estimates unstable and hard to interpret, and can contribute to overfitting
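One way to spot candidate pairs is to scan the correlation matrix for large absolute values. This is a sketch in base R, not something from the original code; the 0.8 cutoff and the names cm, high and high_pairs are arbitrary choices made for illustration.

```r
library(mlbench)

data("BostonHousing")
BD <- BostonHousing

# Correlation matrix of the numeric predictors (chas and medv dropped)
cm <- cor(BD[c(-4, -14)])

# Keep only the upper triangle so each pair is reported once
cm[lower.tri(cm, diag = TRUE)] <- NA

# Positions where |r| exceeds the (assumed) cutoff of 0.8
high <- which(abs(cm) > 0.8, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
high_pairs
```

Whether to drop one variable of a flagged pair is a judgment call; the penalised models later in the series are partly a way to cope with such pairs without dropping anything by hand.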
Now, let’s start with Linear Regression Model. The complete code is provided at the end of the tutorial
sam = sample
The train and test commands create a 70:30 split of the data for training and testing
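The splitting code itself is cut short above, so the following is only a sketch of one common way to get a 70:30 split in base R, not the author's original code. The names idx, train_set and test_set and the seed are assumptions. Because this version takes exactly 70% of the rows, its counts will differ from the 387/119 reported below.

```r
library(mlbench)

data("BostonHousing")
BD <- BostonHousing

set.seed(123)  # assumed seed, only to make the split reproducible

# Draw 70% of the row indices at random, without replacement
idx <- sample(nrow(BD), size = floor(0.7 * nrow(BD)))

train_set <- BD[idx, ]   # rows used to fit the models
test_set  <- BD[-idx, ]  # held-out rows for the final evaluation

nrow(train_set)
nrow(test_set)
```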
Always create a cross-validation parameter. Here I am creating one with 10 parts and 5 repeats.
# We have 387 observations in train and 119 observations in test
# Create the cross-validation parameter. In CV the training data is split into
# n parts; a model is trained on n-1 parts and its error is estimated on the
# remaining part, so that each part is held out once. The whole procedure is
# repeated x times. You can use verboseIter to monitor the progress while
# the code is running. verboseIter is optional
cv <- trainControl(method = "repeatedcv", number = 10, repeats = 5, verboseIter = T)
In short, you are creating a parameter that divides the dataset into 10 parts, keeps 9 to train and 1 to test, rotates through all 10 parts, and does the whole thing 5 times to reduce the effect of any single random split.
verboseIter = T gives a good experience when you see your code doing some fancy stuff. Take a slow-mo and put it on Instagram 😛
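To make the idea concrete, here is roughly what a single round of 10-fold cross-validation looks like when written by hand with lm(). It is an illustrative sketch, not how caret implements it internally; the names k, folds and rmse are made up for the example.

```r
library(mlbench)

data("BostonHousing")
BD <- BostonHousing

set.seed(34)
k <- 10

# Assign every row to one of k folds at random
folds <- sample(rep(1:k, length.out = nrow(BD)))

rmse <- numeric(k)
for (i in 1:k) {
  # Train on the k-1 folds that are not fold i
  fit <- lm(medv ~ ., data = BD[folds != i, ])
  # Estimate the error on the held-out fold i
  pred <- predict(fit, newdata = BD[folds == i, ])
  rmse[i] <- sqrt(mean((BD$medv[folds == i] - pred)^2))
}

mean(rmse)  # cross-validated RMSE for this one repeat
```

Repeated CV simply reruns this with a fresh random fold assignment each time and averages the results.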
set.seed(34)
linear <- train(medv ~.,
BD,
method='lm',
trControl = cv)
linear$results
linear
summary(linear)
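If you only want the headline numbers rather than the full printout, the fitted object can be queried directly. A self-contained sketch, assuming caret and mlbench are installed; cv_quiet is a name made up here for a control object without the progress output.

```r
library(caret)
library(mlbench)

data("BostonHousing")
BD <- BostonHousing

cv_quiet <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

set.seed(34)
linear <- train(medv ~ ., data = BD, method = "lm", trControl = cv_quiet)

# Cross-validated error and fit, averaged over all resamples
linear$results[, c("RMSE", "Rsquared")]

# Coefficients of the final model, which caret refits on all the data
coef(linear$finalModel)
```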
We will do all the EDA in some other tutorial. In this article we are focusing only on the explanation and code of each regression type
This was the basic Linear Regression; we will evaluate all the models at the end of the series. First let’s create all the models
Next is Ridge Regression
set.seed(123)
ridge <- train(medv~.,
BD,
method = 'glmnet',
tuneGrid = expand.grid(alpha=0,
lambda = seq(0.0001,1,length=10)),
trControl=cv)
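Once the ridge model is trained, it helps to check which lambda the cross-validation picked and what the shrunken coefficients look like. A self-contained sketch under the same settings as above, with the progress output turned off:

```r
library(caret)
library(glmnet)
library(mlbench)

data("BostonHousing")
BD <- BostonHousing

cv <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

set.seed(123)
ridge <- train(medv ~ ., data = BD,
               method = "glmnet",
               tuneGrid = expand.grid(alpha = 0,
                                      lambda = seq(0.0001, 1, length = 10)),
               trControl = cv)

ridge$bestTune   # the alpha/lambda pair with the lowest cross-validated RMSE
plot(ridge)      # cross-validated RMSE against lambda

# Coefficients of the final model at the selected lambda
coef(ridge$finalModel, s = ridge$bestTune$lambda)
```

Because alpha is fixed at 0 in the grid, only lambda is actually tuned here; the grid of 10 values between 0.0001 and 1 is the article's choice and may need widening for other datasets.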
We will cover only Linear and Ridge Regression here.
In the next article we will cover LASSO and Elastic Net.
The third article will have the complete evaluation, picking up the best model, and predicting the test cases