Ada Boost Algorithm in Python

Gist of the Adaptive Boosting algorithm in layman’s terms – if you want to improve the performance of a class, concentrate on improving its average marks. To raise the average you need to focus on the weaker section of the class, because the toppers will perform anyway. The toppers might dip from 95% to 90%, but that won’t matter much if you can lift the bottom 10 students from 35% to 80%, which is relatively easier than training the toppers to improve from 90% to 95%.

The complete Ada Boosting algorithm is mostly captured by the example given above.

When I started my career as a Decision Scientist, I had very limited knowledge of anything even remotely close to the domain. With time I was exposed to regression models, which lured me into trying new algorithms on my datasets.

Linear and Logistic Regression are still the best algorithms to start exploring Data Science with. But soon you will start feeling that there is a lot more beyond regression.

Before you start exploring the queen of all Kaggle solutions, i.e. XGBoost, you should learn about Gradient Boosting, and before exploring GBM, you should understand Ada Boosting.

Ada Boosting -> Gradient Boosting -> XGBoost

Boosting, in general, is a method to empower weak learners, i.e. it is a method of converting a weak learner into a strong learner.

Let’s take a simple example: you have a dataset in which you are predicting the sale of cakes in a particular region. The strong learners are features like festivals, birthdays, etc., i.e. whenever there is a festival or a birthday, the sale of cakes increases.

A weak learner could be something like temperature or rainfall, which on its own predicts very little, and the challenge is to convert it into a strong learner to chisel the prediction.

Talking about Ada Boost, it starts by training a decision tree in which each observation is assigned an equal weight. Then comes the interesting part: after evaluating this first tree you know which observations it handles well (the strong learners) and which it gets wrong (the weak learners), so you lower the weight of the strong learners and give more weight to the weak learners.

Thus the second tree is grown on these updated weights.

Tree 1 = All the observations with equal weight
Tree 2 = More weight to the weak learners and less weight to the strong learners

The error is then calculated on the second tree and the process repeats. The prediction of the final ensemble model is therefore the weighted sum of the predictions made by the individual tree models.
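To make this concrete, here is a minimal hand-rolled sketch of that weighting idea for a two-class problem (labels 0/1). It is an illustration only, not scikit-learn's implementation; the function names and the n_rounds parameter are my own.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_sketch(X, y, n_rounds=5):
    """Illustrative AdaBoost loop for two classes (labels 0/1)."""
    n = len(y)
    weights = np.ones(n) / n                          # Tree 1: every observation starts equal
    trees, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)        # grow a tree on the current weights
        miss = (stump.predict(X) != y)
        err = np.dot(weights, miss) / weights.sum()   # weighted error of this tree
        alpha = np.log((1 - err) / (err + 1e-10))     # the "say" of this tree in the ensemble
        weights = weights * np.exp(alpha * miss)      # boost the observations it got wrong
        weights = weights / weights.sum()
        trees.append(stump)
        alphas.append(alpha)
    return trees, alphas

def ensemble_predict(trees, alphas, X):
    """Final prediction = alpha-weighted vote of all the trees."""
    votes = np.zeros((len(X), 2))
    for tree, alpha in zip(trees, alphas):
        pred = tree.predict(X)
        votes[np.arange(len(X)), pred] += alpha
    return votes.argmax(axis=1)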

The major difference between Ada and Gradient Boosting is the way the weak learners are treated. Ada Boost deals with them by increasing their weights, whereas Gradient Boosting uses a loss function to evaluate them and fits the next learner to the errors.

A loss function measures how well a machine learning or statistical model fits the empirical data of a certain phenomenon (e.g. speech or image recognition, predicting the price of real estate, describing user behaviour on a website).

The choice of loss function depends on the problem we want to solve. If we are predicting the number of tickets created in a particular month, then the loss function could be the difference between the actual and predicted values.
If you want to predict whether a person is suffering from a particular disease, then a classification loss such as log loss, summarised with a confusion matrix, could be your choice.
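To make these two cases concrete, here is a small, hedged illustration with made-up numbers (the arrays below are toy data, not from any real project):

import numpy as np
from sklearn.metrics import mean_absolute_error, log_loss, confusion_matrix

# Regression-style loss: the gap between actual and predicted ticket counts
actual_tickets = np.array([120, 95, 130])
predicted_tickets = np.array([110, 100, 125])
print(mean_absolute_error(actual_tickets, predicted_tickets))

# Classification: log loss on predicted probabilities, summarised with a confusion matrix
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.8, 0.4])
print(log_loss(y_true, y_prob))
print(confusion_matrix(y_true, (y_prob >= 0.5).astype(int)))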

One of the biggest motivations for using gradient boosting is that it allows one to optimise a user-specified cost function, instead of a fixed loss function that usually offers less control and does not necessarily correspond with real-world applications.
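As a rough sketch of that point, scikit-learn's GradientBoostingRegressor exposes the loss it optimises as a parameter; the data below is synthetic and is only meant to show that the knob exists.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 200)

# 'huber' is robust to outliers; 'quantile' predicts a chosen quantile instead of the mean
gbm_huber = GradientBoostingRegressor(loss="huber").fit(X, y)
gbm_q90 = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(X, y)
print(gbm_huber.predict(X[:3]), gbm_q90.predict(X[:3]))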

Let’s build our first Ada Boost model

from sklearn.ensemble import AdaBoostClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import metrics

First, import all the important packages in Python. The above code is ready to use, but I would encourage you to type it out yourself.
Now get the most famous of datasets, the iris dataset.

iris_dataset = datasets.load_iris()
X = iris_dataset.data
y = iris_dataset.target

Taking the test size as 0.25

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25) 
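One small note: without a fixed random_state the split (and hence the final accuracy) changes on every run. If you want reproducible numbers, you could pass it explicitly; random_state=1 below is an arbitrary choice.

# Same split, but reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)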

Creating our classifier

n_estimators = The maximum number of estimators at which boosting is terminated. In case of a perfect fit, the learning procedure is stopped early. The default value of n_estimators is 50.

learning_rate = Learning rate shrinks the contribution of each classifier by learning_rate. There is a trade-off between learning_rate and n_estimators

base_estimator = The base estimator from which the boosted ensemble is built. The default is a decision stump, i.e. DecisionTreeClassifier(max_depth=1).

classifier = AdaBoostClassifier(n_estimators=40, learning_rate=1)
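For completeness, here is a hedged variant that passes the base estimator explicitly instead of relying on the default stump. Note that in scikit-learn 1.2+ the keyword is estimator, while older versions use base_estimator.

from sklearn.tree import DecisionTreeClassifier

# Explicit decision stump as the base learner; raise max_depth to use deeper trees
stump = DecisionTreeClassifier(max_depth=1)
classifier_stump = AdaBoostClassifier(estimator=stump, n_estimators=40, learning_rate=1)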

Creating a model on this classifier

model_1 = classifier.fit(X_train, y_train)

Predicting the response for test dataset

y_pred = model_1.predict(X_test)

Checking the accuracy of the model

print("Model_1 accuracy",metrics.accuracy_score(y_test, y_pred))

Complete Code

from sklearn.ensemble import AdaBoostClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import metrics

iris_dataset = datasets.load_iris()
X = iris_dataset.data
y = iris_dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35)
classifier = AdaBoostClassifier(n_estimators=50, learning_rate=1)
model_1 = classifier.fit(X_train, y_train)
y_pred = model_1.predict(X_test)
print("Model_1 accuracy", metrics.accuracy_score(y_test, y_pred))
Output: Model_1 accuracy 0.9622641509433962

By the way, Ada Boosting stands for Adaptive Boosting.

If you have reached this far, I guess you are already good with the basic concepts and, above all, you can build a simple Adaptive Boosting model.
The way in which Ada Boost works:
It takes the training data, trains a model on it, and checks which observations the model predicts correctly and which it gets wrong.
The main part is that it assigns a higher weight to the observations it got wrong, so that the next learner focuses on them. This process iterates until the training data is fit without error or the specified maximum number of estimators is reached.
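If you want to watch this iterative process on the model we just trained, scikit-learn exposes staged_predict, which yields the ensemble's prediction after each boosting round; a small sketch using the variables from the complete code above:

# Accuracy of the ensemble as trees are added, printed every 10 estimators
for i, y_stage in enumerate(model_1.staged_predict(X_test), start=1):
    if i % 10 == 0:
        print(i, "estimators:", metrics.accuracy_score(y_test, y_stage))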

Keep Learning 🙂

The Data Monk
