Have you ever wondered why two different people get different accuracy while using the same algorithm?
We all know that XGBoost can help us get a very good result in our hackathons, but even so, only a few people achieve a decent rank using the same algorithm. Why?
Well!! The answer is feature engineering, i.e. creating more features/data points from the fixed data set you are given.
Feature engineering is the art of extracting more information from existing data. You are not adding any new data here; you are making the data you already have more useful.
Let’s take an example:
We have the Titanic dataset (one of the most used data sets in the Data Science domain).
Problem Statement – Given the name, age, class, sex, cabin type, and number of family members traveling on the Titanic, can you predict which passengers survived and which did not?
It’s obviously a supervised learning question, and you already have a data set with the output.
All you need to do is predict for a test data set.
We are not going too deep into the solution. You can find the solution here.
What we want to discuss is the opportunities to create new columns.
I have seen people using the following types of columns in the data set. Before reading forward, remember it’s not about how good the new data point is; it’s about whether you can think out of the box.
Columns created by different solution submitters:
1. Title of the passenger (Dr., Mr., Miss, etc.)
2. Blocks of ages rather than the actual age
3. With or without wife – a binary variable which indicates whether the person was traveling with his wife
4. Number of children traveling
5. Number of characters in the name – yes, people did use the length of the name to test whether it was useful. Not good enough, but brave enough 🙂
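The columns above can be derived with a few lines of pandas. Here is a minimal sketch using a tiny hand-made sample in the shape of the Kaggle Titanic data (the column names `Name`, `Age`, `SibSp`, `Parch` follow the standard Titanic schema; the exact bin edges and labels are illustrative choices, not the only option):

```python
import pandas as pd

# Tiny hand-made sample shaped like the standard Titanic data set
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley",
             "Heikkinen, Miss. Laina"],
    "Age": [22, 38, 26],
    "SibSp": [1, 1, 0],   # siblings/spouses aboard
    "Parch": [0, 0, 0],   # parents/children aboard
})

# 1. Title of the passenger, extracted from the text between "," and "."
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.")

# 2. Blocks of ages rather than the actual age (illustrative bin edges)
df["AgeBlock"] = pd.cut(df["Age"], bins=[0, 12, 18, 40, 60, 100],
                        labels=["Child", "Teen", "Adult", "Middle", "Senior"])

# 3./4. Family-size style features built from SibSp and Parch
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# 5. Number of characters in the name
df["NameLength"] = df["Name"].str.len()

print(df[["Title", "AgeBlock", "FamilySize", "NameLength"]])
```

Notice that no new data was collected: every new column is a re-expression of information already sitting in the original ones.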
Why create more variables when we already have a handful?
The performance of a predictive model is heavily dependent on the quality of the features in the dataset used to train that model. If you are able to create new features that provide more information to the model about the target variable, its performance will go up.
Spend a considerable amount of time in pre-processing and feature engineering. You need to concentrate a lot on this since this can make a huge difference in the scores.
Better features mean flexibility.
You can choose “the wrong models” (less than optimal) and still get good results. Most models can pick up on good structure in data. The flexibility of good features will allow you to use less complex models that are faster to run, easier to understand and easier to maintain. This is very desirable.
Better features mean simpler models.
With well engineered features, you can choose “the wrong parameters” (less than optimal) and still get good results, for much the same reasons. You do not need to work as hard to pick the right models and the most optimized parameters.
With good features, you are closer to the underlying problem and to a representation of all the data you have available that best characterizes that underlying problem.
How to do feature engineering?
The feature engineering process might look as follows:
- Brainstorm features: Really get into the problem, look at a lot of data, study feature engineering on other problems and see what you can steal.
- Devise features: Depends on your problem, but you may use automatic feature extraction, manual feature construction and mixtures of the two.
- Select features: Use different feature importance scorings and feature selection methods to prepare one or more “views” for your models to operate upon.
- Evaluate models: Estimate model accuracy on unseen data using the chosen features.
You can start with any hackathon at Analytics Vidhya and try to create more and more columns, feed your algorithm with these variables and evaluate the model.
Keep Learning 🙂
The Data Monk