Gradient Boosting in Python

I hope you are already comfortable with the basic concepts of regression and have covered our previous post on Adaptive Boosting.

If you haven't read it yet, go read it; it's easy.

Remember, if you want to participate in a hackathon or use regression in your day-to-day work, then learning Adaptive, Gradient, and Extreme Gradient Boosting will definitely help you.

Adaptive -> Gradient -> Extreme Gradient

You must have already installed Anaconda and should have practiced some basic code in Jupyter Notebook.

The whole idea of boosting is to make the weak learners contribute more to the prediction, because these learners are the ones holding back the accuracy of your model. If your model is, say, 90% accurate, you can make it better by identifying the weak learners and boosting their contribution. This was the concept in Adaptive Boosting, and the same idea is followed by Gradient Boosting.

Gradient boosting involves three elements:

  1. A loss function to be optimized.
  2. A weak learner to make predictions.
  3. An additive model to add weak learners to minimize the loss function.
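If these three elements feel abstract, here is a minimal sketch (for regression with squared-error loss, not the exact classifier we build below) of how an additive model keeps adding weak learners to reduce the loss. The names here (simple_gradient_boost, n_rounds) are illustrative, not part of any library.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_gradient_boost(X, y, n_rounds=20, learning_rate=0.1):
    # Start from a constant prediction (the mean of y)
    prediction = np.full(len(y), y.mean())
    trees = []
    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient is simply the residual
        residuals = y - prediction
        # Fit a weak learner (a shallow tree) on the residuals
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, residuals)
        # Add its shrunken contribution to the running prediction
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees, prediction

Each new tree tries to correct whatever the previous trees got wrong, and the learning rate decides how much say each tree gets.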

We will be using the Titanic dataset to understand Gradient Boosting. You can download it from the Kaggle Titanic competition page.

Let’s try to understand the algorithm while building the model.
We will start by getting all the required packages into our environment.

 import pandas as pd
 from xgboost import XGBClassifier
 from sklearn.preprocessing import MinMaxScaler
 from sklearn.model_selection import train_test_split
 from sklearn.metrics import classification_report, confusion_matrix
 from sklearn.ensemble import GradientBoostingClassifier

P.S. – If you get an error importing XGBClassifier in Jupyter Notebook, then try installing xgboost using the pip command given below. You will most likely face this problem with the xgboost package in Jupyter šŸ˜›

!pip install xgboost

After executing this command, run ā€˜from xgboost import XGBClassifier’ again.

I assume you have already downloaded the train and test datasets of Titanic from Kaggle; move them into the same folder where your notebook is saved.
Importing train and test dataset

 titanic_train = pd.read_csv("train.csv")
titanic_test = pd.read_csv("test.csv")

Let’s see what the dataset looks like

titanic_train.head()

So, the Titanic dataset is used to predict whether an individual survived this tragic accident, using various attributes like PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.
So, our y variable, i.e. the dependent variable, is Survived, and the attributes given above are the independent variables. You may or may not use all of these variables, and you can also create new variables from the existing dataset.
E.g. extracting the title of the passenger (Mr./Mrs./Miss) or bucketing the age of the passenger.
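For example, a quick hedged sketch of both ideas (we will not use these features in the simple model below, and the age buckets are arbitrary):

# Extract the title (Mr, Mrs, Miss, ...) from the Name column
titles = titanic_train["Name"].str.extract(r",\s*([^.]+)\.")[0]
# Bucket the Age column into broad groups
age_bucket = pd.cut(titanic_train["Age"],
                    bins=[0, 12, 18, 40, 60, 100],
                    labels=["child", "teen", "adult", "middle_aged", "senior"])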
So we will put the dependent variable ā€˜Survived’ in a variable y_train and delete the Survived field from the training dataset.

y_train = titanic_train["Survived"]
titanic_train.drop(labels="Survived", axis=1, inplace=True)
# Stack the train and test rows together (in newer pandas, use pd.concat([titanic_train, titanic_test]))
complete_data = titanic_train.append(titanic_test)

We will be building a basic model of Gradient boosting, so we are not going deep into feature engineering. If you want to understand feature engineering in brief using Titanic Dataset then take a look at the below mentioned page
http://thedatamonk.com/kaggle-titanic-solution/

To build a simple model, we are dropping the following columns:

drop = ["Name", "Age", "SibSp", "Ticket", "Cabin", "Parch", "Embarked"]
complete_data.drop(labels=drop, axis=1, inplace=True)

What is the get_dummies() command?
get_dummies() is one of the most commonly used commands in feature engineering: it converts one column into multiple columns, provided the original column contains a categorical variable. By default it creates one column per category (n columns for n categories); with drop_first=True it creates n-1 columns.

Is one hot encoding and get_dummies() same?
One-hot encoding converts a column into n variables, while dummy encoding converts it into n-1 variables. If we have k categorical variables, each of which has n values, one-hot encoding ends up with kn variables, while dummy encoding ends up with kn-k variables.
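A quick illustration of the difference on a made-up toy column:

toy = pd.DataFrame({"Sex": ["male", "female", "male"]})
pd.get_dummies(toy, columns=["Sex"])                   # Sex_female, Sex_male  -> n columns
pd.get_dummies(toy, columns=["Sex"], drop_first=True)  # Sex_male only         -> n-1 columns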
Creating dummy encoding for the sex column

 complete_data = pd.get_dummies(complete_data, columns=["Sex"])
 complete_data.fillna(value=0.0, inplace=True)
 complete_data.head()

Creating the train and test dataset

# The first 891 rows of complete_data come from train.csv; the rest come from test.csv
X_train = complete_data.values[0:891]
X_test = complete_data.values[891:]

What is a MinMaxScaler?
For each value in a feature, MinMaxScaler subtracts the minimum value of the feature and then divides by the range. The range is the difference between the original maximum and the original minimum. MinMaxScaler preserves the shape of the original distribution.
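A toy illustration of the formula (x - min) / (max - min), with made-up numbers:

import numpy as np
toy = np.array([[2.0], [4.0], [10.0]])   # min = 2, max = 10, range = 8
MinMaxScaler().fit_transform(toy)        # [[0.0], [0.25], [1.0]]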

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

You create a MinMaxScaler object named scaler, fit it on the training data, and then transform both the train and test datasets.

train_test_split splits arrays or matrices into random train and test subsets.
That means that every time you run it without specifying random_state, you will get a different result; this is expected behaviour.

state = 12
test_size = 0.20
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
                                                  test_size=test_size,
                                                  random_state=state)

So, the value passed to random_state is roughly equivalent to a seed.
test_size is the split between the test and train sizes; we are taking an 80:20 split.
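If you want to see the effect of random_state for yourself, here is a tiny illustration on made-up data; with the same random_state you get the identical split every time:

import numpy as np
data = np.arange(10).reshape(5, 2)
a_train, a_test = train_test_split(data, test_size=0.4, random_state=12)
b_train, b_test = train_test_split(data, test_size=0.4, random_state=12)
print((a_train == b_train).all())   # True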

Now we will create a list of learning rates. We will test the model on 10 learning rates and then take the one with the maximum accuracy.

learning_rate = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]

Creating a loop to build 10 Gradient boosting Models

for lr in learning_rate:
    GBModel = GradientBoostingClassifier(n_estimators=20, learning_rate=lr,
                                         max_features=2, max_depth=2, random_state=123)
    GBModel.fit(X_train, y_train)

The above loop takes each value in learning_rate and fits GBModel with it every time.

Printing the accuracy on both the training and validation datasets in the loop itself

for lr in learning_rate:
    GBModel = GradientBoostingClassifier(n_estimators=20, learning_rate=lr,
                                         max_features=2, max_depth=2, random_state=123)
    GBModel.fit(X_train, y_train)
    print("Learning : ", lr)
    print("Accuracy in training dataset : {0:.4f}".format(GBModel.score(X_train, y_train)))
    print("Accuracy in validation dataset : {0:.4f}".format(GBModel.score(X_val, y_val)))

You will get a result something like the one below

Learning :  0.1
Accuracy in training dataset : 0.8244
Accuracy in validation dataset : 0.7430
Learning :  0.2
Accuracy in training dataset : 0.8287
Accuracy in validation dataset : 0.7598
Learning :  0.3
Accuracy in training dataset : 0.8539
Accuracy in validation dataset : 0.7542
Learning :  0.4
Accuracy in training dataset : 0.8385
Accuracy in validation dataset : 0.7430
Learning :  0.5
Accuracy in training dataset : 0.8539
Accuracy in validation dataset : 0.7654
Learning :  0.6
Accuracy in training dataset : 0.8525
Accuracy in validation dataset : 0.7542
Learning :  0.7
Accuracy in training dataset : 0.8610
Accuracy in validation dataset : 0.7374
Learning :  0.8
Accuracy in training dataset : 0.8610
Accuracy in validation dataset : 0.7263
Learning :  0.9
Accuracy in training dataset : 0.8722
Accuracy in validation dataset : 0.7151
Learning :  1.0
Accuracy in training dataset : 0.8820
Accuracy in validation dataset : 0.7374

Which learning rate do you think you should take?
The one with the best Accuracy in training or validation dataset?
You should always select the one with decent accuracy on the training dataset and the maximum accuracy on the validation dataset, which is 0.5 in this case.
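If you prefer to pick it programmatically instead of eyeballing the printed numbers, here is a small sketch reusing the same loop:

best_lr, best_acc = None, 0.0
for lr in learning_rate:
    GBModel = GradientBoostingClassifier(n_estimators=20, learning_rate=lr,
                                         max_features=2, max_depth=2, random_state=123)
    GBModel.fit(X_train, y_train)
    acc = GBModel.score(X_val, y_val)
    if acc > best_acc:
        best_lr, best_acc = lr, acc
print("Best learning rate :", best_lr, "Validation accuracy :", round(best_acc, 4))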

Want to create a confusion matrix for the above learning rate?

GBModel = GradientBoostingClassifier(n_estimators=20, learning_rate=0.5, max_features=2,
                                     max_depth=2, random_state=6132)
GBModel.fit(X_train, y_train)
predictions = GBModel.predict(X_val)

print("Confusion Matrix:")
print(confusion_matrix(y_val, predictions))
print("Confusion Matrix Analysis")
print(classification_report(y_val, predictions))

The above code will get you the following confusion matrix

Confusion Matrix:
[[82 18]
 [28 51]]
Confusion Matrix Analysis
              precision    recall  f1-score   support

           0       0.75      0.82      0.78       100
           1       0.74      0.65      0.69        79

    accuracy                           0.74       179
   macro avg       0.74      0.73      0.74       179
weighted avg       0.74      0.74      0.74       179


If you are good with Ada Boosting and Gradient Boosting, then you can hop directly onto our next article on Extreme Gradient Boosting, which is an extension of the techniques you have already learnt.

Keep Learning šŸ™‚

The Data Monk

Ada Boost Algorithm in Python

Gist of the Adaptive Boost algorithm in layman’s terms – if you want to improve the performance of a class, you should concentrate on improving the average marks of the class. To increase the average marks you need to focus on the weaker section of the class, because the toppers will perform anyway. Toppers might dip from 95% to 90%, but it won’t matter much if you can improve the percentage of the bottom 10 students from 35% to 80%, which is relatively easier than training the toppers to improve from 90% to 95%.

The complete Ada Boosting algorithm is mostly captured by the example given above.

When I started my career as a Decision Scientist, I had very limited knowledge of anything which was even remotely close to the domain. With time I was exposed to Regression models which allured me to try new algorithms on my data set.

Linear and Logistic Regression are still the best algorithms to start exploring Data Science with. But sooner or later you will start feeling that regression has a lot more to it.

Before you start exploring the queen algorithm of all Kaggle solutions, i.e. XGBoost, you should learn about Gradient Boosting, and before exploring GBM, you should understand Ada Boosting.

Ada Boosting -> Gradient Boosting -> XGBoosting

Boosting, in general, is a method to empower weak learners, i.e. it is a method of converting weak learners into a strong learner.

Let’s take a simple example: you have a dataset in which you are predicting the sale of cakes in a particular region. The strong learners are features like festivals, birthdays, etc., i.e. whenever there is a festival or a birthday, the sale of cakes increases.

A weak learner could be something like temperature or rainfall, which might contribute very little on its own; the challenge is to convert it into a strong learner to sharpen the prediction.

Talking about Ada Boost, it starts by training a decision tree in which each observation is assigned an equal weight. Then comes the interesting part: you already know which learners are strong, so you lower their weight and give more weight to the weak learners.

Thus the second tree is grown on these updated weights.

Tree 1 = All the observation with equal weight
Tree 2 = More weight to the weak learner and less weight to the strong learner

The error is then calculated on the second tree. The prediction of the final ensemble model is therefore the weighted sum of the predictions made by the individual tree models.
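To make the ā€œweighted sumā€ idea concrete, here is a rough sketch of a single Ada Boost round, following the textbook AdaBoost.M1 update for binary labels coded as -1/+1 (this is not the exact scikit-learn implementation):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_round(X, y, sample_weights):
    # Fit a weak learner (a decision stump) on the weighted observations
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=sample_weights)
    pred = stump.predict(X)
    # Weighted error of this weak learner
    err = np.sum(sample_weights * (pred != y)) / np.sum(sample_weights)
    # "Amount of say" of this learner in the final weighted vote
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
    # Increase the weight of misclassified observations, decrease the rest
    new_weights = sample_weights * np.exp(-alpha * y * pred)
    new_weights /= new_weights.sum()
    return stump, alpha, new_weights

Each round returns the weak learner, its weight (alpha) in the final vote, and the updated observation weights for the next round.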

The major difference between Ada and Gradient Boosting is the way the weak learners are treated. While Ada Boost handles them by increasing their weights, Gradient Boosting uses a loss function to evaluate and improve on the weak learners.

A loss function measures how well a machine learning or statistics model fits empirical data of a certain phenomenon (e.g. speech or image recognition, predicting the price of real estate, describing user behaviour on a web site).

The way a loss function is chosen depends on the problem we want to solve. If we are predicting the number of tickets created in a particular month, then the loss function will be based on the difference between the actual and predicted values.
If you want to predict whether a person is suffering from a particular disease, then a classification loss, evaluated with the help of a confusion matrix, could be your choice.
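A toy illustration of both cases, with made-up numbers. Note that in practice gradient boosting optimizes a differentiable loss such as log loss for classification, and the confusion matrix is then used to evaluate the result:

import numpy as np
from sklearn.metrics import mean_squared_error, log_loss

# Regression-style loss: gap between actual and predicted ticket counts
actual_tickets = np.array([120, 95, 130])
predicted_tickets = np.array([110, 100, 125])
print(mean_squared_error(actual_tickets, predicted_tickets))

# Classification-style loss: predicted probabilities of disease vs the truth
actual_label = np.array([1, 0, 1, 1])
predicted_probability = np.array([0.9, 0.2, 0.6, 0.4])
print(log_loss(actual_label, predicted_probability))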

One of the biggest motivations for using gradient boosting is that it allows one to optimize a user-specified cost function, instead of a loss function that usually offers less control and does not necessarily correspond with real-world applications.

Let’s build our first Ada Boost model

from sklearn.ensemble import AdaBoostClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import metrics

First, import all the important packages in Python. The above code is ready to use, but I would recommend typing it out yourself.
Now load the famous iris dataset.

iris_dataset = datasets.load_iris()
X = iris_dataset.data
y = iris_dataset.target

Taking the test size as 0.25

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25) 

Creating our classifier

n_estimators = The maximum number of estimators at which boosting is terminated. In case of a perfect fit, the learning procedure is stopped early. The default value of n_estimators is 50.

learning_rate = Learning rate shrinks the contribution of each classifier by learning_rate. There is a trade-off between learning_rate and n_estimators

base_estimator = The base estimator from which the boosted ensemble is built.

classifier = AdaBoostClassifier(n_estimators=40, learning_rate=1)

Creating a model on this classifier

model_1 = classifier.fit(X_train, y_train)

Predicting the response for test dataset

y_pred = model_1.predict(X_test)

Checking the accuracy of the model

print("Model_1 accuracy",metrics.accuracy_score(y_test, y_pred))

Complete Code

from sklearn.ensemble import AdaBoostClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import metrics

iris_dataset = datasets.load_iris()
X = iris_dataset.data
y = iris_dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35)
classifier = AdaBoostClassifier(n_estimators=50, learning_rate=1)
model_1 = classifier.fit(X_train, y_train)
y_pred = model_1.predict(X_test)
print("Model_1 accuracy", metrics.accuracy_score(y_test, y_pred))

Model_1 accuracy 0.9622641509433962

By the way, Ada Boosting stands for Adaptive Boosting.

If you have reached this point, I guess you are already good with the basic concepts and, above all, you can build a simple Adaptive Boosting model.
The way in which Ada Boost works:
It takes the training subset and repeatedly trains models, checking how accurately the training data is being predicted; where a model predicts the dependent variable correctly, it learns from that model.
The main part is that it assigns higher weights to the observations the weak learners get wrong. This process iterates until the complete training data fits without error, or until the specified maximum number of estimators is reached.

Keep Learning šŸ™‚

The Data Monk

Complete path to master SQL before interview

We have interviewed a lot of candidates and found that SQL is still something which is explored very little by people who want to get deep into this domain.

Remember – Data Science is not all about SQL, but it’s the bread and butter for most of the jobs, irrespective of your profile.

This is a small post covering the ways to master your SQL skills.

I am assuming that you are a complete noob in SQL, skip according to your expertise

1. Start with either w3school or tutorials point.
It should not take more than 8-10 hours for you to complete the tutorial (irrespective of your engineering branch or current domain).

2. Go for SQLZoo. Solve all the questions.
If you get stuck, then try this link, which has all the questions solved. It should take you around 15 hours.

3. Once this is done, create an account on HackerRank and try their SQL course.
Try all the easy questions first and then slowly move to the medium-level questions. It should not take you more than 20 hours; earn at least a 4 star before moving ahead, and do follow the discussion panel.

If you are good with the above 3, then do try our four pages (this is not self-promotion, but we have handpicked some important questions which you should definitely solve before your interview):

http://thedatamonk.com/day-3-basic-queries-to-get-you-started/
http://thedatamonk.com/day-4-sql-intermediate-questions/
http://thedatamonk.com/day-5-sql-advance-concepts/
http://thedatamonk.com/day-6-less-asked-sql-questions/

You are already interview-ready. Send me a mail at nitinkamal132@gmail.com or contact@thedatamonk.com to get a free copy of our ebook, or purchase it on Amazon:

https://www.amazon.in/Questions-Crack-Business-Analyst-Interview-ebook/dp/B01K16FLC4
https://www.amazon.in/Write-better-queries-interview-Questions-ebook/dp/B076BXFGW1

You are not done yet; complete the HackerRank hard questions as well.

This will give you all the knowledge you need to crack any Data Science SQL interview round.

For Python, R, and statistics we will have separate posts.

Keep Learning šŸ™‚
The Data Monk