Before you start Modeling – Feature Engineering
Feature Engineering is one place where you have to put in a lot of effort. At the beginning of any project you will have very few columns, so you need to dig in and torture the data set to get more. Let's take a few examples to see how feature engineering is done.
Let's take the famous Titanic data set, which has the following columns and data types:
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Name 1309 non-null object
Sex 1309 non-null object
Age 1046 non-null float64
SibSp 1309 non-null int64
Parch 1309 non-null int64
Ticket 1309 non-null object
Fare 1308 non-null float64
Cabin 295 non-null object
Embarked 1307 non-null object
This data set is usable as it is, and if you build a supervised learning model on it to predict who survived the accident, you will get decent accuracy. But as a Data Scientist, your job is to make the model as good as possible. Let's see which columns we can create.
First of all, you must know the concept of one hot encoding, where you turn a categorical column with, say, 3 categories into 3 different columns with binary (0/1) values. The categorical columns created below would be encoded this way before modeling.
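As a rough illustration of one hot encoding, here is a minimal sketch using pandas. The toy frame and the choice of the 'Embarked' column are assumptions made for the example; they are not taken from the original project code.

```python
import pandas as pd

# Toy frame (assumed for illustration): one categorical column with 3 categories.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# get_dummies creates one 0/1 column per category:
# Embarked_C, Embarked_Q, Embarked_S.
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(encoded)
```

Each row has exactly one 1 across the three new columns, marking the category that row belonged to.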
1. The names of the passengers are given in the form Nitin, Mr. Kamal. So from here we can definitely get the title of each passenger and create a new categorical column with only titles. There are 10+ titles like Mr, Miss, Dr, Mrs, etc. So we first take the frequency of each title and then merge the low frequency categories into one 'Other' category. Congrats, you created your first column.
2. Many passengers have no cabin recorded, while others have a cabin number in the 'Cabin' column. You can create a new column with value 1 if the passenger has a cabin number and 0 if not. So you have another categorical variable. Yeahhh
3. You can also bin Age into categories (for example child, adult, senior)
4. You can bin Fare into categories in the same way
5. You can add Parch and SibSp to get the family size of the passenger. Maybe a person travelling alone was more likely to survive than a passenger with a larger family
#Creating new family_size column
df['Family_Size']=df['SibSp']+df['Parch']
6. We have a 'Cabin' column that is not doing much. Mostly 1st class passengers have a cabin recorded, and the rest are treated as 'Unknown'. A cabin number looks like 'C123'. The letter refers to the deck, so we are going to extract it just like the titles.
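#Turning cabin number into Deck
#substrings_in_string is a helper that is not shown here; it is assumed to return the entry of cabin_list found in x
cabin_list = ['A', 'B', 'C', 'D', 'E', 'F', 'T', 'G', 'Unknown']
df['Deck']=df['Cabin'].map(lambda x: substrings_in_string(x, cabin_list))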
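The columns described in points 1 to 4 and point 6 could be sketched roughly as below. This is only one possible way to do it: the toy rows, the frequency threshold of 2, the age bin edges, the four fare bands and the new column names are all assumptions for illustration, and the deck is taken with plain pandas string indexing as a stand-in for the substrings_in_string helper.

```python
import pandas as pd

# Toy rows in the Titanic layout; names and values are made up for illustration.
df = pd.DataFrame({
    "Name": ["Nitin, Mr. Kamal", "Rao, Miss. Asha", "Sen, Mrs. Rita",
             "Das, Rev. Arun", "Roy, Mr. Amit"],
    "Cabin": [None, None, "C85", None, "E46"],
    "Age": [22.0, 4.0, 38.0, 61.0, 30.0],
    "Fare": [7.25, 7.925, 71.28, 13.0, 26.0],
})

# 1. Title: the text between the comma and the first full stop.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Merge low frequency titles into 'Other' (threshold of 2 is an assumption).
counts = df["Title"].value_counts()
rare_titles = counts[counts < 2].index
df["Title"] = df["Title"].where(~df["Title"].isin(rare_titles), "Other")

# 2. Cabin flag: 1 if a cabin number is recorded, 0 otherwise.
df["Has_Cabin"] = df["Cabin"].notna().astype(int)

# 3. Age categories with assumed bin edges (right edge inclusive).
df["Age_Group"] = pd.cut(df["Age"], bins=[0, 12, 18, 60, 120],
                         labels=["Child", "Teen", "Adult", "Senior"])

# 4. Fare categories as quartiles of the observed fares.
df["Fare_Band"] = pd.qcut(df["Fare"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# 6. Deck: first letter of the cabin number, 'Unknown' when it is missing.
df["Deck"] = df["Cabin"].str[0].fillna("Unknown")

print(df[["Title", "Has_Cabin", "Age_Group", "Fare_Band", "Deck"]])
```

On the real data set the threshold and bin edges should be chosen after looking at the actual frequencies and distributions, not copied from this toy example.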
There was one more project where we were supposed to predict the number of burgers sold by McDonald's, and we only had monthly level data. We tried an ARIMA model, but there was a need to test Linear Regression and ARIMAX as well, so we needed a few more columns. These were the columns we came up with:
1. We created 4 binary columns for seasons, i.e. Summer, Winter, Spring, and Monsoon.
2. We created a separate column for the month number
3. A column for the year number, which we converted into a factor so that the model does not interpret 2018 as higher than 2005, because here these are categories and not numbers
4. We created a column with the number of weekends in the month. The hypothesis was that a month with more weekends will have more burgers sold
5. We also created a number of days column in the data set. The hypothesis was that a month with 31 days will sell more burgers than one with 28 days
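The calendar columns listed above could be built along these lines. This is a sketch under stated assumptions, not the project's actual code: the date range, the column names, the month-to-season mapping and the choice to count Saturdays and Sundays as "weekends" are all hypothetical.

```python
import calendar
import pandas as pd

# Hypothetical monthly index; the real sales data is not shown in the article.
sales = pd.DataFrame({"Month_Start": pd.date_range("2018-01-01", periods=12, freq="MS")})

# Month number, year as a category (not a number), and days in the month.
sales["Month"] = sales["Month_Start"].dt.month
sales["Year"] = sales["Month_Start"].dt.year.astype("category")
sales["Days_In_Month"] = sales["Month_Start"].dt.days_in_month

# Assumed month-to-season mapping; adjust it to the market being modeled.
season_of_month = {11: "Winter", 12: "Winter", 1: "Winter",
                   2: "Spring", 3: "Spring",
                   4: "Summer", 5: "Summer", 6: "Summer",
                   7: "Monsoon", 8: "Monsoon", 9: "Monsoon", 10: "Monsoon"}
season_flags = pd.get_dummies(sales["Month"].map(season_of_month), dtype=int)
sales = pd.concat([sales, season_flags], axis=1)


def weekend_days(year, month):
    """Count Saturdays and Sundays in a month (one reading of 'number of weekends')."""
    # monthcalendar returns Monday-first weeks padded with 0 for days outside the month.
    return sum(1 for week in calendar.monthcalendar(year, month)
               for day in week[5:] if day != 0)


sales["Weekend_Days"] = [weekend_days(d.year, d.month) for d in sales["Month_Start"]]
print(sales.head())
```

Keeping the year as a category means a later one hot encoding step treats each year as its own level instead of as an increasing number.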
We ran an ensemble model using Linear Regression, ARIMA, and ARIMAX to get good accuracy.
To practice more, pick up any data set and understand the problem statement. Then you can create more columns to boost the performance of the model.
If you want to go through a complete modeling exercise on a real data set, you can read the book given below
Complete Linear Regression and ARIMA forecasting using R
You can also learn the complete project in a conversational way below
The Monk who knew Linear Regression
Keep coding 🙂
XtraMous