Feature engineering is one place where you have to put in a lot of effort. At the start of any project you will usually have very few columns, so you need to dig in and torture the data set to get more. Let’s take a few examples to see how feature engineering is done.
Let’s take the famous Titanic data set, which has the following columns and data types:
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Name 1309 non-null object
Sex 1309 non-null object
Age 1046 non-null float64
SibSp 1309 non-null int64
Parch 1309 non-null int64
Ticket 1309 non-null object
Fare 1308 non-null float64
Cabin 295 non-null object
Embarked 1307 non-null object
This data set is usable as it is: if you build a supervised learning model on it to predict who survived the accident, you will get decent accuracy. But as a data scientist, your job is to make the model as good as possible. Let’s see which new columns we can create.
First of all, you must know the concept of one-hot encoding, where you turn a categorical column with, say, 3 categories into 3 separate columns with binary (0/1) values. We will use it on the categorical columns created below.
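To make one-hot encoding concrete, here is a minimal sketch using pandas. The tiny data frame is made up for illustration, and the choice of `pd.get_dummies` is an assumption; any other encoder would do the same job.

```python
import pandas as pd

# Made-up toy data: 'Embarked' is a categorical column with 3 categories.
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# get_dummies creates one 0/1 column per category and drops the original column.
encoded = pd.get_dummies(df, columns=["Embarked"], dtype=int)
print(encoded)
```

Each row ends up with exactly one 1 across the three new columns, which is where the name "one-hot" comes from.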
1. The names of the passengers are given in the form ‘Nitin, Mr. Kamal’. So we can extract the title of each passenger and create a new categorical column with only titles. There are 10+ titles like Mr, Miss, Dr, Mrs, etc. So we first take the frequency of each title and then merge the low-frequency categories into one ‘Other’ category. Congrats, you created your first column.
2. A lot of passengers have no cabin number recorded, while others have one in the ‘Cabin’ column. You can create a new column with value 1 if the passenger has a cabin number and 0 if not. So you have another categorical variable. Yeahhh
3. You can also bin Age into categories (for example child, adult, senior)
4. You can bin Fare into categories in the same way
5. You can add Parch and SibSp to get the family size of the passenger. Maybe a passenger travelling alone was more likely to survive than one with a large family
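# Creating new family_size column (assumes the data is loaded in a pandas DataFrame named df)
df['family_size'] = df['SibSp'] + df['Parch']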
6. We have a ‘Cabin’ column that is not doing much on its own: mostly 1st class passengers have a cabin recorded, and the rest can be filled in as ‘Unknown’. A cabin number looks like ‘C123’. The letter refers to the deck, so we are going to extract it just like the titles.
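The six ideas above can be sketched in pandas roughly as follows. Treat this as one possible version, not the only way to do it: the function name `add_titanic_features`, the `min_title_count` threshold, the new column names, and the bin edges for Age and Fare are all assumptions picked for illustration, and the five rows of data are made up.

```python
import pandas as pd

def add_titanic_features(df, min_title_count=10):
    """Return a copy of df with the engineered columns (hypothetical helper)."""
    out = df.copy()

    # 1. Title: the text between the comma and the first full stop in 'Name'.
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    counts = out["Title"].value_counts()
    rare_titles = counts[counts < min_title_count].index
    out.loc[out["Title"].isin(rare_titles), "Title"] = "Other"

    # 2. HasCabin: 1 if a cabin number is recorded, 0 otherwise.
    out["HasCabin"] = out["Cabin"].notna().astype(int)

    # 3. Age categories. The bin edges are arbitrary; missing ages stay missing.
    out["AgeBand"] = pd.cut(
        out["Age"], bins=[0, 12, 18, 40, 60, 120],
        labels=["Child", "Teen", "Adult", "MiddleAged", "Senior"],
        include_lowest=True,
    )

    # 4. Fare categories. Again, the bin edges are arbitrary.
    out["FareBand"] = pd.cut(
        out["Fare"], bins=[0, 10, 30, 100, float("inf")],
        labels=["Low", "Medium", "High", "VeryHigh"],
        include_lowest=True,
    )

    # 5. Family size: siblings/spouses plus parents/children.
    #    (Some people add 1 to count the passenger as well.)
    out["family_size"] = out["SibSp"] + out["Parch"]

    # 6. Deck: first letter of the cabin number, 'Unknown' when there is none.
    out["Deck"] = out["Cabin"].str[0].fillna("Unknown")
    return out

# Made-up rows in the same shape as the Titanic columns used above.
toy = pd.DataFrame({
    "Name": ["Nitin, Mr. Kamal", "Doe, Mrs. Jane", "Roe, Miss. Anna",
             "Poe, Mr. John", "Low, Rev. Sam"],
    "Age": [30.0, 45.0, 8.0, None, 62.0],
    "SibSp": [1, 1, 0, 0, 0],
    "Parch": [0, 2, 1, 0, 0],
    "Fare": [7.25, 80.0, 20.0, 8.05, 150.0],
    "Cabin": [None, "C123", None, None, "B20"],
})

# A low threshold is used here only because the toy data has five rows.
features = add_titanic_features(toy, min_title_count=2)
print(features[["Title", "HasCabin", "AgeBand", "FareBand", "family_size", "Deck"]])
```

The new categorical columns (Title, AgeBand, FareBand, Deck) can then be one-hot encoded before they go into the model.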
There was one more project where we had to predict the number of burgers sold by McDonald's, and we only had monthly data. We tried an ARIMA model, but we also needed to test Linear Regression and ARIMAX, and both of these need explanatory columns. So we needed a few more columns. These are the columns we came up with:
1. We created 4 binary columns for seasons, i.e. Summer, Winter, Spring, and Monsoon.
2. We created a separate column for the month number
3. A column for the year, which we converted into a factor so that the model does not interpret 2018 as higher than 2005; here the years are categories, not numbers
4. We created a column for the number of weekends in the month. The hypothesis was that a month with more weekends will have more burgers sold
5. We also created a number-of-days column in the data set. The hypothesis was that a month with 31 days will sell more burgers than one with 28 days
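Here is a rough sketch of how these five calendar columns could be built for a given year and month, using only the Python standard library. The helper `month_features` and the column names are hypothetical, and the month-to-season mapping is purely an assumption for illustration, so swap in whatever fits your region. The sketch counts weekend days (Saturdays plus Sundays), which is one way to read "number of weekends".

```python
import calendar
from datetime import date

# ASSUMPTION: illustrative month-to-season mapping, not taken from the project.
SEASON_BY_MONTH = {
    11: "Winter", 12: "Winter", 1: "Winter",
    2: "Spring", 3: "Spring",
    4: "Summer", 5: "Summer", 6: "Summer",
    7: "Monsoon", 8: "Monsoon", 9: "Monsoon", 10: "Monsoon",
}
SEASONS = ("Summer", "Winter", "Spring", "Monsoon")

def month_features(year, month):
    """Build the calendar columns for one monthly row (hypothetical helper)."""
    days_in_month = calendar.monthrange(year, month)[1]

    # date.weekday() returns 5 for Saturday and 6 for Sunday.
    weekend_days = sum(
        1
        for day in range(1, days_in_month + 1)
        if date(year, month, day).weekday() >= 5
    )

    row = {
        "year": str(year),  # kept as a string so it is treated as a category
        "month": month,
        "days_in_month": days_in_month,
        "weekend_days": weekend_days,
    }

    # Four binary season columns, exactly one of which is 1.
    season = SEASON_BY_MONTH[month]
    for name in SEASONS:
        row["is_" + name.lower()] = int(name == season)
    return row

rows = [month_features(2018, m) for m in range(1, 13)]
for row in rows:
    print(row)
```

The list of dictionaries can be passed straight to `pandas.DataFrame(rows)` and joined to the monthly sales data on year and month.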
We ran an ensemble of Linear Regression, ARIMA, and ARIMAX to get good accuracy.
To practice more, pick up any data set and understand the problem statement. Then you can create more columns to boost the performance of the model.
If you want a complete modeling experience on a real data set, you can go through the book given below
Complete Linear Regression and ARIMA forecasting using R
You can also learn the complete project in a conversational way below
The Monk who knew Linear Regression
Keep coding 🙂