Gradient Boosting in Python
I hope you are already comfortable with the basic concepts of regression and have covered our previous post on Adaptive Boosting.
Haven't read it yet? Go read it, it's easy.
Remember, if you want to participate in a hackathon or use regression in your day-to-day work, then learning Adaptive, Gradient, and Extreme Gradient Boosting will definitely help you.
Adaptive -> Gradient -> Extreme Gradient
You must have already installed Anaconda and should have practiced basic code in a Jupyter Notebook.
The whole idea of boosting is to combine many weak learners so that each new learner contributes more where the earlier ones went wrong, because those hard cases are what hold back the accuracy of your model. If your model is, say, 90% accurate, you can make it better by identifying where it still fails and boosting the contribution of learners that correct those mistakes. This was the concept in Adaptive Boosting, and the same idea is followed by Gradient Boosting.
Gradient boosting involves three elements:
- A loss function to be optimized.
- A weak learner to make predictions.
- An additive model to add weak learners to minimize the loss function.
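To make these three elements concrete, here is a toy sketch of the additive idea, assuming squared-error loss and sklearn's DecisionTreeRegressor as the weak learner. This is only an illustration of how boosting stacks learners on residuals, not the actual implementation of the GradientBoostingClassifier we use later in this post.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boost(X, y, n_estimators=20, learning_rate=0.1):
    # Start from a constant prediction (the mean minimises squared-error loss)
    prediction = np.full(len(y), float(np.mean(y)))
    trees = []
    for _ in range(n_estimators):
        residuals = y - prediction                      # negative gradient of squared-error loss
        tree = DecisionTreeRegressor(max_depth=2)       # the weak learner
        tree.fit(X, residuals)                          # fit the weak learner to the residuals
        prediction += learning_rate * tree.predict(X)   # additive update, shrunk by the learning rate
        trees.append(tree)
    return trees, prediction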
We will be using the Titanic dataset to understand Gradient Boosting. To download the dataset, visit this link
Let’s try to understand the algorithm while building the model.
We will start by getting all the required packages into our environment
import pandas as pd
from xgboost import XGBClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import GradientBoostingClassifier
P.S. – If you get an error while importing XGBClassifier in Jupyter Notebook, try installing the xgboost package with the pip command given below. You will quite likely face this problem, because xgboost usually does not come pre-installed 😛
!pip install xgboost
After executing this command, run 'from xgboost import XGBClassifier' again.
I assume you have already downloaded the train and test datasets of the Titanic competition from Kaggle. If you are using a Mac, move the datasets into the same folder where your notebook is saved.
Importing train and test dataset
titanic_train = pd.read_csv("train.csv")
titanic_test = pd.read_csv("test.csv")
Let’s see what the dataset looks like
titanic_train.head()
So, the Titanic dataset is used to predict whether an individual survived this tragic accident or not, using various attributes like PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked
So, our y variable, i.e. the dependent variable, is Survived, and the attributes given above are the independent variables. You may or may not use all of them, and you can also create new variables from the existing dataset.
For example, extracting the title of the passenger (Mr./Mrs./Miss) or bucketing the age of the passenger.
So we will put the dependent variable 'Survived' into a variable y_train and delete the Survived field from the training dataset.
y_train = titanic_train["Survived"]
titanic_train.drop(labels="Survived", axis=1, inplace=True)
complete_data = titanic_train.append(titanic_test)   # stack train and test so the preprocessing is applied to both
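Note: DataFrame.append was deprecated and later removed in recent pandas versions, so the last line above may throw an AttributeError on a newer setup. If it does, pd.concat gives the same stacked dataframe:

complete_data = pd.concat([titanic_train, titanic_test])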
We will be building a basic Gradient Boosting model, so we are not going deep into feature engineering. If you want to understand feature engineering using the Titanic dataset, take a look at the page mentioned below
https://thedatamonk.com/kaggle-titanic-solution/
To build a simple model, we are dropping the columns listed below
drop = ["Name", "Age", "SibSp", "Ticket", "Cabin", "Parch", "Embarked"]
complete_data.drop(labels=drop, axis=1, inplace=True)
What is the get_dummies() command?
get_dummies() is one of the most used functions in feature engineering: it converts one column into multiple columns, provided the original column contains a categorical variable. By default it converts a column with n categories into n indicator columns; if you pass drop_first=True it creates n-1 columns instead (dummy encoding).
Is one hot encoding and get_dummies() same?
One-hot encoding converts it into n variables, while dummy encoding converts it into n-1 variables. If we have k categorical columns, each with n values, one-hot encoding ends up with kn variables while dummy encoding ends up with kn-k variables.
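Here is a tiny illustration of the difference on a made-up single-column dataframe (the values just mimic the Embarked port codes):

demo = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
print(pd.get_dummies(demo, columns=["Embarked"]))                   # one-hot: 3 indicator columns
print(pd.get_dummies(demo, columns=["Embarked"], drop_first=True))  # dummy encoding: 2 columns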
Creating dummy encoding for the sex column
complete_data = pd.get_dummies(complete_data, columns=["Sex"])   # one-hot encode the Sex column
complete_data.fillna(value=0.0, inplace=True)                    # replace the remaining missing values with 0
complete_data.head()
Creating the train and test dataset
X_train = complete_data.values[0:891]   # the first 891 rows belong to the original training set
X_test = complete_data.values[891:]
What is a MinMaxScaler?
For each value in a feature, MinMaxScaler subtracts the minimum value in the feature and then divides by the range. The range is the difference between the original maximum and original minimum. MinMaxScaler preserves the shape of the original distribution.
scaler = MinMaxScaler() X_train = scaler.fit_transform(X_train) X_test = scaler.transform(X_test)
You create a MinMaxScaler object named scaler and then transform the train and test datasets. Note that the scaler is fitted only on the training data and then applied to the test data.
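As a quick sanity check, the same formula can be written by hand. This snippet only illustrates what MinMaxScaler does for a single column; the example Fare values are arbitrary:

import numpy as np
fare = np.array([7.25, 71.28, 8.05])
scaled = (fare - fare.min()) / (fare.max() - fare.min())   # subtract the minimum, divide by the range
print(scaled)   # approximately [0.  1.  0.0125]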
train_test_split splits arrays or matrices into random train and test subsets.
That means that every time you run it without specifying random_state, you will get a different result; this is expected behavior.
state = 12
test_size = 0.20
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
test_size=test_size, random_state=state)
So, the value passed to the random_state parameter is almost equivalent to a seed.
test_size is the split between test and train size; we are taking an 80:20 split.
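A quick demonstration of the reproducibility point, using a small made-up list just for this demo:

a = list(range(10))
split_1 = train_test_split(a, test_size=0.2, random_state=12)
split_2 = train_test_split(a, test_size=0.2, random_state=12)
print(split_1 == split_2)   # True, because the same random_state always gives the same split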
Now we will create a list of learning rates. I will test the model on 10 learning rates and then take the one with maximum accuracy.
learning_rate = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
Creating a loop to build 10 Gradient boosting Models
for lr in learning_rate:
GBModel = GradientBoostingClassifier(n_estimators=20, learning_rate=lr, max_features=2, max_depth=2, random_state=123)
GBModel.fit(X_train, y_train)
The above loop will take all the values of learning_rate one by one and fit the GBModel each time
Printing the accuracy in both training and validation dataset in the loop itself
for lr in learning_rate:
    GBModel = GradientBoostingClassifier(n_estimators=20, learning_rate=lr, max_features=2, max_depth=2, random_state=123)
    GBModel.fit(X_train, y_train)
    print("Learning : ", lr)
    print("Accuracy in training dataset : {0:.4f}".format(GBModel.score(X_train, y_train)))
    print("Accuracy in validation dataset : {0:.4f}".format(GBModel.score(X_val, y_val)))
You will get a result something like the one below
Learning : 0.1
Accuracy in training dataset : 0.8244
Accuracy in validation dataset : 0.7430
Learning : 0.2
Accuracy in training dataset : 0.8287
Accuracy in validation dataset : 0.7598
Learning : 0.3
Accuracy in training dataset : 0.8539
Accuracy in validation dataset : 0.7542
Learning : 0.4
Accuracy in training dataset : 0.8385
Accuracy in validation dataset : 0.7430
Learning : 0.5
Accuracy in training dataset : 0.8539
Accuracy in validation dataset : 0.7654
Learning : 0.6
Accuracy in training dataset : 0.8525
Accuracy in validation dataset : 0.7542
Learning : 0.7
Accuracy in training dataset : 0.8610
Accuracy in validation dataset : 0.7374
Learning : 0.8
Accuracy in training dataset : 0.8610
Accuracy in validation dataset : 0.7263
Learning : 0.9
Accuracy in training dataset : 0.8722
Accuracy in validation dataset : 0.7151
Learning : 1.0
Accuracy in training dataset : 0.8820
Accuracy in validation dataset : 0.7374
Which learning rate do you think you should take?
The one with the best Accuracy in training or validation dataset?
You should always select the one with decent accuracy on the training dataset and maximum accuracy on the validation dataset, which is 0.5 in this case (validation accuracy 0.7654)
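If you would rather let the code pick it, here is a small variant of the same loop that just keeps track of the best validation score (best_lr and best_val are names introduced only for this snippet):

best_lr, best_val = None, 0.0
for lr in learning_rate:
    model = GradientBoostingClassifier(n_estimators=20, learning_rate=lr, max_features=2, max_depth=2, random_state=123)
    model.fit(X_train, y_train)
    val_score = model.score(X_val, y_val)       # accuracy on the validation set
    if val_score > best_val:
        best_lr, best_val = lr, val_score
print(best_lr, best_val)   # with the settings above this should come out as 0.5 and ~0.7654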
Want to create a confusion matrix for the above learning rate?
GBModel = GradientBoostingClassifier(n_estimators=20, learning_rate=0.5, max_features=2, max_depth=2, random_state=6132)
GBModel.fit(X_train, y_train)
predictions = GBModel.predict(X_val)
print("Confusion Matrix:")
print(confusion_matrix(y_val, predictions))
print("Confusion Matrix Analysis")
print(classification_report(y_val, predictions))
The above code will get you the following confusion matrix
Confusion Matrix:
[[82 18]
 [28 51]]
Confusion Matrix Analysis
              precision    recall  f1-score   support

           0       0.75      0.82      0.78       100
           1       0.74      0.65      0.69        79

    accuracy                           0.74       179
   macro avg       0.74      0.73      0.74       179
weighted avg       0.74      0.74      0.74       179
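Reading the matrix: rows are the actual classes and columns are the predicted ones, so 82 non-survivors and 51 survivors were classified correctly out of 179 validation passengers. The report follows directly from these counts, e.g. for class 1: precision = 51 / (51 + 18) ≈ 0.74 and recall = 51 / (51 + 28) ≈ 0.65.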
If you are comfortable with Ada Boosting and Gradient Boosting, then you can directly hop on to our next article on Extreme Gradient Boosting, which is nothing but an extension of the regression techniques you have already learnt.
Keep Learning 🙂
The Data Monk