Feature Engineering in Data Science
Have you ever wondered why two different people get different accuracy while using the same algorithm?
We all know that XGBoost can give very good results in hackathons, yet only a few people achieve a decent rank with that same algorithm. Why?
The answer is feature engineering, i.e. creating more features (columns) from the fixed data set you are given.
Feature engineering is the art of extracting more information from existing data. You are not adding any new data here, but you are making the data you already have more useful.
Let’s take an example:
Consider the Titanic data set, one of the most used data sets in the data science domain.
Problem statement: given the name, age, class, sex, cabin type, and number of family members of each passenger traveling on the Titanic, can you predict which passengers survived and which did not?
It’s clearly a supervised learning problem, and you already have a training data set that includes the outcome.
All you need to do is predict the outcome for a test data set.
We are not going too deep into the solution. You can find the solution here.
What we want to discuss are the opportunities to create new columns.
I have seen people add the following types of columns to the data set. Before reading further, remember that it’s not about how good each new feature turns out to be. It’s about whether you can think outside the box.
Columns created by different solution submitters:
1- Title of the passenger (Dr., Mr., Miss, etc.)
2- Blocks of ages rather than the actual age
3- With or without wife: a binary variable that indicates whether the person was traveling with his wife
4- Number of children traveling
5- Number of letters in the name: yes, people did use the length of the name to test whether it was useful. Not good enough, but brave enough 🙂
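A few of the columns listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the passenger rows are invented, the names are assumed to follow a "Surname, Title. Given names" layout, the age cut-offs are arbitrary, and the helper names (extract_title, age_block, count_letters, add_features) are hypothetical.

```python
import re

# Invented passenger rows for illustration (not taken from the real data set).
# Assumption: names follow a "Surname, Title. Given names" layout.
passengers = [
    {"name": "Smith, Mr. John Henry", "age": 34},
    {"name": "Jones, Miss. Anna", "age": 8},
    {"name": "Brown, Dr. Alice Mary", "age": 61},
]

# Capture the word that sits between the comma and the first period.
TITLE_PATTERN = re.compile(r",\s*([A-Za-z]+)\.")


def extract_title(name):
    """Return the title found in the name, or 'Unknown' if none matches."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else "Unknown"


def age_block(age):
    """Map an age to a coarse block; the cut-offs are arbitrary assumptions."""
    if age is None:
        return "unknown"
    if age < 13:
        return "child"
    if age < 20:
        return "teen"
    if age < 60:
        return "adult"
    return "senior"


def count_letters(name):
    """Count alphabetic characters only, ignoring spaces and punctuation."""
    return sum(ch.isalpha() for ch in name)


def add_features(row):
    """Return a copy of the row with three engineered columns added."""
    new_row = dict(row)
    new_row["title"] = extract_title(row["name"])
    new_row["age_block"] = age_block(row["age"])
    new_row["name_letters"] = count_letters(row["name"])
    return new_row


engineered = [add_features(p) for p in passengers]
for row in engineered:
    print(row)
```

No new data is collected here: every new column is derived from the name and age that were already in the row.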
Why create more variables when we already have a handful?
The performance of a predictive model depends heavily on the quality of the features in the data set used to train it. If you can create new features that give the model more information about the target variable, its performance will go up.
Spend a considerable amount of time in pre-processing and feature engineering. You need to concentrate a lot on this since this can make a huge difference in the scores.
Better features mean flexibility.
You can choose “the wrong models” (less than optimal) and still get good results. Most models can pick up on good structure in data. The flexibility of good features will allow you to use less complex models that are faster to run, easier to understand and easier to maintain. This is very desirable.
Better features mean simpler models.
With well-engineered features, you can choose “the wrong parameters” (less than optimal) and still get good results, for much the same reasons. You do not need to work as hard to pick the right models and the most optimized parameters.
With good features, you are closer to the underlying problem and to a representation of all the available data that best characterizes that problem.
How to do feature engineering?
The feature engineering process might look as follows:
- Brainstorm features: Really get into the problem, look at a lot of data, study feature engineering on other problems and see what you can steal.
- Devise features: Depends on your problem, but you may use automatic feature extraction, manual feature construction and mixtures of the two.
- Select features: Use different feature importance scorings and feature selection methods to prepare one or more “views” for your models to operate upon.
- Evaluate models: Estimate model accuracy on unseen data using the chosen features.
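As a rough illustration of the "Select features" step, the sketch below ranks candidate columns by information gain with respect to the target. Information gain is only one of many possible scorings and is chosen here as an assumption, the rows are invented toy data, and the column names (title, age_block, deck, survived) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, feature, target):
    """Entropy of the target minus its weighted entropy after splitting on feature."""
    labels = [row[target] for row in rows]
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Invented toy rows, only to show the mechanics of the scoring.
rows = [
    {"title": "Mr", "age_block": "adult", "deck": "C", "survived": 0},
    {"title": "Mr", "age_block": "child", "deck": "C", "survived": 0},
    {"title": "Miss", "age_block": "adult", "deck": "C", "survived": 1},
    {"title": "Mrs", "age_block": "adult", "deck": "C", "survived": 1},
]

candidates = ["title", "age_block", "deck"]
scores = {f: information_gain(rows, f, "survived") for f in candidates}

# Highest-scoring features first; a constant column such as "deck" scores zero.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for feature, score in ranked:
    print(f"{feature}: {score:.3f}")
```

A ranking like this only helps you choose a "view" of the data. You still need to estimate the model's accuracy on unseen data, since a feature that scores well on the training rows can still fail to generalize.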
You can start with any hackathon at Analytics Vidhya and try to create more and more columns, feed your algorithm with these variables and evaluate the model.
Keep Learning 🙂
The Data Monk