Gradient Boosting in Python

I hope you are already comfortable with the basic concepts of regression and have covered our previous post on Adaptive Boosting.

Haven't read it yet? Then go read it, friend.. it's easy

Remember, if you want to participate in a hackathon or use regression in your day-to-day work, then learning Adaptive, Gradient, and Extreme Gradient Boosting will definitely help you.

Adaptive -> Gradient -> Extreme Gradient

You should already have Anaconda installed and have practiced some basic code in Jupyter Notebook.

The complete idea of boosting is to make the weak learners contribute more to the prediction, because these learners are the ones holding back the accuracy of your model. If your model is, say, 90% accurate, you can make it better by identifying the weak learners and boosting their contribution. This was the concept behind Adaptive Boosting, and the same idea is followed by Gradient Boosting.

Gradient boosting involves three elements (a minimal sketch follows this list):

  1. A loss function to be optimized.
  2. A weak learner to make predictions.
  3. An additive model to add weak learners to minimize the loss function.
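
Before we touch the library, here is a minimal hand-rolled sketch of what "an additive model that minimizes a loss" means, assuming squared loss, where each new weak learner is simply fit on the residuals of the current ensemble. The function name and parameters below are illustrative only, not part of the model we build later.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_gradient_boost(X, y, n_rounds=20, learning_rate=0.1):
    # start from a constant prediction (the mean of y)
    prediction = np.full(len(y), y.mean(), dtype=float)
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction                 # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=2)  # the weak learner
        tree.fit(X, residuals)
        prediction += learning_rate * tree.predict(X)  # additive update
        trees.append(tree)
    return trees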

We will be using the Titanic Dataset to understand Gradient Boosting. To download the dataset visit this link

Let’s try to understand the algorithm while building the model.
We will start by getting all the required packages into our environment

 import pandas as pd
 from xgboost import XGBClassifier
 from sklearn.preprocessing import MinMaxScaler
 from sklearn.model_selection import train_test_split
 from sklearn.metrics import classification_report, confusion_matrix
 from sklearn.ensemble import GradientBoostingClassifier

P.S. – If you get an error importing XGBClassifier in your Jupyter notebook, then try installing xgboost using the pip command given below. Chances are you will face this problem while setting up the xgboost package in Jupyter 😛

!pip install xgboost

After executing this command, run the ‘from xgboost import XGBClassifier’ import again

I assume you have already downloaded the train and test datasets of Titanic from Kaggle; if you are using a Mac, move the datasets into the same folder where your code is saved.
Importing the train and test datasets

 titanic_train = pd.read_csv("train.csv")
 titanic_test = pd.read_csv("test.csv")

Let’s see what the dataset looks like

titanic_train.head()
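
Along with head(), a quick look at the column types and missing values helps you decide which columns to keep. This is just an optional sanity check; we do not use its output later.

titanic_train.info()            # column types and non-null counts
titanic_train.isnull().sum()    # missing values per column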

So, the Titanic dataset is used to predict whether an individual survived this tragic accident, using attributes like PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.
So, our y variable, i.e. the dependent variable, is Survived, and the attributes above are the independent variables. You may or may not use all of these variables, and you can also create new variables from the existing dataset.
Ex. Extracting the title of the passenger like Mr./Mrs./Miss, or bucketing the age of the passenger (see the short illustration below)
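
For illustration only (we will not use these features in this post), the title can be pulled out of the Name column with a small regular expression, and Age can be bucketed with pd.cut. The column names Title and AgeBand are made up for this example.

# pull "Mr", "Mrs", "Miss", ... out of names like "Braund, Mr. Owen Harris"
titanic_train["Title"] = titanic_train["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
# bucket Age into coarse bands (missing ages stay missing)
titanic_train["AgeBand"] = pd.cut(titanic_train["Age"],
                                  bins=[0, 12, 18, 40, 60, 100],
                                  labels=["Child", "Teen", "Adult", "Middle", "Senior"])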
So we will put the dependent variable ‘Survived’ in a variable y_train and will delete the Survived field from the training dataset

 y_train = titanic_train["Survived"]
 titanic_train.drop(labels="Survived", axis=1, inplace=True)
 # stack train and test so that encoding and scaling are applied to both consistently
 complete_data = pd.concat([titanic_train, titanic_test], ignore_index=True)
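
A quick shape check confirms the stacking worked: train.csv has 891 rows and test.csv has 418, so the combined frame should have 1309 rows.

print(titanic_train.shape, titanic_test.shape, complete_data.shape)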

We will be building a basic model of Gradient boosting, so we are not going deep into feature engineering. If you want to understand feature engineering in brief using Titanic Dataset then take a look at the below mentioned page
https://thedatamonk.com/kaggle-titanic-solution/

To build a simple model, we are dropping the text and categorical columns (Name, Ticket, Cabin, Embarked) along with a few numeric ones (Age, SibSp, Parch), keeping only PassengerId, Pclass, Fare, and Sex.

drop = ["Name", "Age", "SibSp", "Ticket", "Cabin", "Parch", "Embarked"]
complete_data.drop(labels=drop, axis=1, inplace=True)

What is the get_dummies() command?
get_dummies() is one of the most used commands in feature engineering: it converts one column into multiple columns, given that the initial column contains a categorical variable. By default it creates one new column per category (n columns); if you pass drop_first=True it creates n-1 columns instead.

Are one-hot encoding and get_dummies() the same?
get_dummies() can produce either form. One-hot encoding converts a column with n categories into n variables, while dummy encoding converts it into n-1 variables. So if we have k categorical variables, each with n values, one-hot encoding ends up with kn columns, while dummy encoding ends up with kn-k columns.
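
A tiny example on a throwaway data frame (not our Titanic data) makes the n vs n-1 distinction concrete:

demo = pd.DataFrame({"Sex": ["male", "female", "male"]})

# one-hot encoding: one column per category -> Sex_female, Sex_male
print(pd.get_dummies(demo, columns=["Sex"]))

# dummy encoding: drop the first level -> Sex_male only
print(pd.get_dummies(demo, columns=["Sex"], drop_first=True))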
Encoding the Sex column with get_dummies()

 complete_data = pd.get_dummies(complete_data, columns=["Sex"])
 complete_data.fillna(value=0.0, inplace=True)  # replace any remaining missing values with 0
 complete_data.head()

Creating the train and test dataset

X_train = complete_data.values[0:891]
X_test = complete_data.values[891:]
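
The hard-coded 891 is simply the number of rows in train.csv. A slightly safer variant derives the split point from the data, so nothing breaks if the row count ever changes:

n_train = len(y_train)                      # 891 for the Kaggle training file
X_train = complete_data.values[:n_train]
X_test = complete_data.values[n_train:]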

What is a MinMaxScaler?
For each value in a feature, MinMaxScaler subtracts the minimum value in the feature and then divides by the range. The range is the difference between the original maximum and original minimum. MinMaxScaler preserves the shape of the original distribution.
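
As a quick hand-rolled example of the formula scaled = (x - min) / (max - min): for fares of 0, 50 and 500 the scaled values come out as 0.0, 0.1 and 1.0.

fares = [0.0, 50.0, 500.0]
lo, hi = min(fares), max(fares)
print([(f - lo) / (hi - lo) for f in fares])   # [0.0, 0.1, 1.0]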

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

You create a MinMaxScaler object named scaler, fit it on the training dataset, and then use it to transform both the train and test datasets.

train_test_split splits arrays or matrices into random train and test subsets.
That means that every time you run it without specifying random_state, you will get a different split; this is expected behavior.

state = 12
test_size = 0.20
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
                                                  test_size=test_size,
                                                  random_state=state)

So, the value passed in the random_state variable is almost equivalent to a seed.
test_size is the split between the train and validation size; we are taking an 80:20 split.

Now we will create a list of learning rates. We will test the model on 10 learning rates and then take the one with the maximum accuracy.

learning_rate = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]

Creating a loop to build 10 Gradient Boosting models

for lr in learning_rate:
    GBModel = GradientBoostingClassifier(n_estimators=20, learning_rate=lr,
                                         max_features=2, max_depth=2, random_state=123)
    GBModel.fit(X_train, y_train)

The above loop takes each value in learning_rate and fits a GBModel for it every time

Printing the accuracy on both the training and validation datasets in the loop itself

for lr in learning_rate:
    GBModel = GradientBoostingClassifier(n_estimators=20, learning_rate=lr,
                                         max_features=2, max_depth=2, random_state=123)
    GBModel.fit(X_train, y_train)
    print("Learning : ", lr)
    print("Accuracy in training dataset : {0:.4f}".format(GBModel.score(X_train, y_train)))
    print("Accuracy in validation dataset : {0:.4f}".format(GBModel.score(X_val, y_val)))

You will get a result something like the one below

Learning :  0.1
Accuracy in training dataset : 0.8244
Accuracy in validation dataset : 0.7430
Learning :  0.2
Accuracy in training dataset : 0.8287
Accuracy in validation dataset : 0.7598
Learning :  0.3
Accuracy in training dataset : 0.8539
Accuracy in validation dataset : 0.7542
Learning :  0.4
Accuracy in training dataset : 0.8385
Accuracy in validation dataset : 0.7430
Learning :  0.5
Accuracy in training dataset : 0.8539
Accuracy in validation dataset : 0.7654
Learning :  0.6
Accuracy in training dataset : 0.8525
Accuracy in validation dataset : 0.7542
Learning :  0.7
Accuracy in training dataset : 0.8610
Accuracy in validation dataset : 0.7374
Learning :  0.8
Accuracy in training dataset : 0.8610
Accuracy in validation dataset : 0.7263
Learning :  0.9
Accuracy in training dataset : 0.8722
Accuracy in validation dataset : 0.7151
Learning :  1.0
Accuracy in training dataset : 0.8820
Accuracy in validation dataset : 0.7374

Which learning rate do you think you should take?
The one with the best accuracy on the training dataset, or on the validation dataset?
You should select the one with decent accuracy on the training dataset and the maximum accuracy on the validation dataset, which is 0.5 in this case (see the small helper below if you want the code to pick it for you)
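
If you prefer to let the code pick it, a small variation on the loop above tracks the validation score and keeps the best learning rate. This is an extra convenience, not something the rest of the post depends on.

best_lr, best_val = None, 0.0
for lr in learning_rate:
    GBModel = GradientBoostingClassifier(n_estimators=20, learning_rate=lr,
                                         max_features=2, max_depth=2, random_state=123)
    GBModel.fit(X_train, y_train)
    val_acc = GBModel.score(X_val, y_val)      # accuracy on the validation set
    if val_acc > best_val:
        best_lr, best_val = lr, val_acc
print("Best learning rate:", best_lr, "with validation accuracy:", round(best_val, 4))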

Want to create a confusion matrix for the above learning rate?

GBModel = GradientBoostingClassifier(n_estimators=20, learning_rate=0.5,
                                     max_features=2, max_depth=2, random_state=6132)
GBModel.fit(X_train, y_train)
predictions = GBModel.predict(X_val)

print("Confusion Matrix:")
print(confusion_matrix(y_val, predictions))
print("Confusion Matrix Analysis")
print(classification_report(y_val, predictions))

The above code will get you the following confusion matrix

Confusion Matrix:
[[82 18]
 [28 51]]
Confusion Matrix Analysis
              precision    recall  f1-score   support

           0       0.75      0.82      0.78       100
           1       0.74      0.65      0.69        79

    accuracy                           0.74       179
   macro avg       0.74      0.73      0.74       179
weighted avg       0.74      0.74      0.74       179
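
In the matrix, rows are the actual classes and columns the predicted ones, so 82 non-survivors and 51 survivors were classified correctly. The headline numbers in the report can be recovered by hand:

tn, fp, fn, tp = 82, 18, 28, 51
accuracy  = (tn + tp) / (tn + fp + fn + tp)   # (82 + 51) / 179 ≈ 0.74
precision = tp / (tp + fp)                    # 51 / 69  ≈ 0.74 for class 1
recall    = tp / (tp + fn)                    # 51 / 79  ≈ 0.65 for class 1
print(accuracy, precision, recall)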


If you are comfortable with Adaptive Boosting and Gradient Boosting, then you can hop straight onto our next article on Extreme Gradient Boosting, which is nothing but an extension of the techniques you have already learnt.
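
By the way, we imported XGBClassifier at the top but never used it. As a tiny preview of the next article, it exposes the same fit/score interface; the parameters below are just illustrative, not tuned.

xgb_model = XGBClassifier(n_estimators=20, learning_rate=0.5, max_depth=2, random_state=123)
xgb_model.fit(X_train, y_train)
print("Validation accuracy : {0:.4f}".format(xgb_model.score(X_val, y_val)))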

Keep Learning 🙂

The Data Monk