Before you start Modeling – Feature Engineering

Feature engineering is one place where you have to put in a lot of effort. At the beginning of any project you will have very few columns to work with, so you need to dig in and torture the data set to get more. Let’s take a few examples to see how feature engineering is done.

Let’s take the famous Titanic data set, which has the following columns and data types:

PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object

This data set is usable as it stands: if you build a supervised learning model on it to predict who survived the accident, you will get decent accuracy. But as a data scientist, your job is to make the model as good as possible. Let’s see which new columns we can create.

First of all, you must know the concept of one-hot encoding, where you turn a categorical column with, say, 3 categories into 3 separate columns with binary (0/1) values. It applies to the categorical columns we create below.
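As a minimal sketch of one-hot encoding, assuming pandas is available, here is a toy frame with a three-category Embarked column. The four rows are made up for illustration and are not taken from the real data set.

```python
import pandas as pd

# Toy data: a categorical column with 3 categories (C, Q, S)
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One new 0/1 column per category
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked", dtype=int)
df = pd.concat([df, dummies], axis=1)
print(df)
```

Each row ends up with exactly one 1 across the three new columns.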

1. The names of the passengers are given in the form Nitin, Mr. Kamal. So we can extract the title of each passenger and create a new categorical column with only titles. There are 10+ titles like Mr, Miss, Dr, Mrs, etc. We first take the frequency of each title and then merge the low-frequency titles into one ‘Other’ category. Congrats, you created your first column.
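One way to sketch the title extraction, assuming names follow the "Surname, Title. Given name" pattern. The names below are invented, and the min_count threshold is a hypothetical choice you would tune on the real frequencies.

```python
import pandas as pd

# Made-up names in the same "Surname, Title. Given name" format
df = pd.DataFrame({"Name": [
    "Nitin, Mr. Kamal",
    "Sharma, Miss. Asha",
    "Rao, Mr. Anil",
    "Verma, Miss. Rita",
    "Gupta, Dr. Ravi",
    "Singh, Rev. Jai",
]})

# Grab the text between the comma and the first full stop
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Merge titles seen fewer than min_count times into 'Other'
min_count = 2  # hypothetical threshold
counts = df["Title"].value_counts()
rare_titles = counts[counts < min_count].index
df["Title"] = df["Title"].where(~df["Title"].isin(rare_titles), "Other")
print(df)
```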

2. A lot of passengers have no cabin recorded, while others have a cabin number in the ‘Cabin’ column. You can create a new column with value 1 if the passenger has a cabin number and 0 if not. So you have another categorical variable. Yeahhh
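A small sketch of that flag, assuming missing cabins are stored as NaN. Has_Cabin is a column name picked for this example, and the rows are toy values.

```python
import numpy as np
import pandas as pd

# Toy data: two passengers with a cabin number, two without
df = pd.DataFrame({"Cabin": ["C123", np.nan, "E46", np.nan]})

# 1 if a cabin number is present, 0 otherwise
df["Has_Cabin"] = df["Cabin"].notna().astype(int)
print(df)
```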

3. You can also bin Age into categories, for example age groups

4. You can bin Fare into categories in the same way, for example fare bands
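Both binning ideas can be sketched with pandas. The age cut points, the labels, and the choice of three equal-sized fare bands are assumptions made for illustration, and the values are toy data.

```python
import numpy as np
import pandas as pd

# Toy values, including one missing age
df = pd.DataFrame({
    "Age": [4.0, 17.0, 29.0, 45.0, 71.0, np.nan],
    "Fare": [7.25, 8.05, 13.0, 26.55, 71.28, 512.33],
})

# Fixed-width style bins with hand-picked edges (assumed, not from the post)
age_bins = [0, 12, 18, 40, 60, np.inf]
age_labels = ["Child", "Teen", "Adult", "Middle_Aged", "Senior"]
df["Age_Group"] = pd.cut(df["Age"], bins=age_bins, labels=age_labels)

# Quantile bins: each band holds roughly the same number of passengers
df["Fare_Band"] = pd.qcut(df["Fare"], q=3, labels=["Low", "Mid", "High"])
print(df)
```

A missing Age stays missing after binning, so you would still need to impute it or give it its own category before modeling.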

5. You can add Parch and SibSp to get the family size of the passenger. Maybe a person travelling alone was more likely to survive than a passenger with a larger family

#Creating new family_size column
df['Family_Size']=df['SibSp']+df['Parch']
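If you want to test the travelling-alone hypothesis directly, one option is a flag on top of the family size. This is only a sketch: Is_Alone is a hypothetical column name and the rows are toy values.

```python
import pandas as pd

# Toy data: the second passenger has no relatives aboard
df = pd.DataFrame({"SibSp": [1, 0, 3, 0], "Parch": [0, 0, 2, 1]})
df["Family_Size"] = df["SibSp"] + df["Parch"]

# 1 if the passenger has no siblings, spouse, parents or children aboard
df["Is_Alone"] = (df["Family_Size"] == 0).astype(int)
print(df)
```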

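6. We have a ‘Cabin’ column not doing much: most of the recorded cabins belong to 1st class passengers, and the rest are missing, which we treat as ‘Unknown’. A cabin number looks like ‘C123’. The letter refers to the deck, so we’re going to extract it just like the titles.

#Turning cabin number into Deck (missing cabins are filled with 'Unknown' first)
cabin_list = ['A', 'B', 'C', 'D', 'E', 'F', 'T', 'G', 'Unknown']
def substrings_in_string(big_string, substrings):
    #Return the first entry of substrings found in big_string
    return next((s for s in substrings if s in big_string), 'Unknown')
df['Deck']=df['Cabin'].fillna('Unknown').map(lambda x: substrings_in_string(x, cabin_list))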

There was one more project where we were supposed to predict the number of burgers sold by McDonald's, and we only had monthly-level data. We tried an ARIMA model, but there was a need to test Linear Regression and ARIMAX as well, and both need explanatory columns. So these were the columns we came up with:

1. We created 4 binary columns for seasons i.e. Summer, Winter, Spring, and Monsoon.

2. We created a separate column for Month number

3. A column for the year number, which we converted into a factor so that the model does not interpret 2018 as being higher than 2005, because here these are categories, not numbers

4. We created a column with the number of weekends in the month. The hypothesis was that a month with more weekends will have more burgers sold

5. We also created a number-of-days column in the data set, since a month with 31 days is expected to sell more burgers than one with 28 days
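The calendar columns above can be sketched in pandas as follows. The date range, the column names, and especially the month-to-season mapping are assumptions made for illustration. The original project data and mapping are not shown here, and the year column uses the pandas category dtype as a stand-in for a factor.

```python
import pandas as pd

# Hypothetical monthly index; real data would also carry the sales column
df = pd.DataFrame({"Month_Start": pd.date_range("2018-01-01", periods=12, freq="MS")})

df["Month"] = df["Month_Start"].dt.month
df["Year"] = df["Month_Start"].dt.year.astype("category")  # category, not number
df["Days_In_Month"] = df["Month_Start"].dt.days_in_month

def count_weekend_days(month_start):
    """Count Saturdays and Sundays in the month starting at month_start."""
    days = pd.date_range(month_start, periods=month_start.days_in_month, freq="D")
    return int((days.dayofweek >= 5).sum())

df["Weekend_Days"] = df["Month_Start"].apply(count_weekend_days)

# Assumed month-to-season mapping; adjust it to the region being modeled
season_of_month = {
    11: "Winter", 12: "Winter", 1: "Winter",
    2: "Spring", 3: "Spring", 4: "Spring",
    5: "Summer", 6: "Summer",
    7: "Monsoon", 8: "Monsoon", 9: "Monsoon", 10: "Monsoon",
}
season_dummies = pd.get_dummies(df["Month"].map(season_of_month), dtype=int)
df = pd.concat([df, season_dummies], axis=1)
print(df)
```

This sketch counts weekend days rather than whole weekends, which keeps the definition simple when a month starts on a Sunday or ends on a Saturday.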

We ran an ensemble of Linear Regression, ARIMA, and ARIMAX models to get good accuracy.

To practice more, pick up any data set and understand the problem statement. Then you can create more columns to boost the performance of the model.

If you want to learn the complete modeling experience on a real data set, you can go through the book given below

Complete Linear Regression and ARIMA forecasting using R

You can also learn the complete project in a conversational way below

The Monk who knew Linear Regression

Keep coding 🙂

XtraMous
