Before you start Modeling – Feature Engineering

Feature engineering is one place where you have to put in a lot of effort. At the beginning of any project you will have very few columns to work with, so you need to dig in and torture the data set to get more. Let’s take a few examples to see how feature engineering is done.

Let’s take the famous Titanic data set, which has the following columns and data types:

PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object

This data set is usable as it is (although Age, Fare, Cabin, and Embarked have some missing values), and if you build a supervised learning model to predict who survived the accident, you will get decent accuracy. But as a data scientist, your job is to make the model as good as possible. Let’s see which new columns we can create.

First of all, you must know the concept of one-hot encoding, where you turn a categorical column with, say, 3 categories into 3 separate columns holding binary (0/1) values. It applies to the categorical columns we create below.
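As a minimal sketch of one-hot encoding, assuming pandas is available: the toy values below are made up for illustration, and only the `Embarked` column name comes from the Titanic data set.

```python
import pandas as pd

# Toy frame with one categorical column that has three categories
# (the values are made up for illustration).
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# get_dummies creates one 0/1 column per category
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked", dtype=int)

# Attach the new binary columns to the original frame
df = pd.concat([df, dummies], axis=1)
print(df)
```

Each row has exactly one 1 across the three new columns. For linear models it is common to drop one of the columns (`drop_first=True`) to avoid redundancy.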

1. Passenger names are given in the form ‘Nitin, Mr. Kamal’. So from the name we can extract the title of each passenger and create a new categorical column with only titles. There are 10+ titles like Mr, Miss, Dr, Mrs, etc. So we first take the frequency of each title and then merge the low-frequency titles into one ‘Other’ category. Congrats, you created your first column.

2. Many passengers have no cabin recorded, while others have a cabin number in the ‘Cabin’ column. You can create a new column with value 1 if the passenger has a cabin and 0 if not. So you have another categorical variable. Yeahhh

3. You can also group Age into categories (age bands).

4. You can group Fare into categories (fare bands) in the same way.

5. You can add Parch and SibSp to get the family size of the passenger. Maybe passengers travelling alone were more likely to survive than passengers with a larger family.

#Creating new family_size column
df['Family_Size']=df['SibSp']+df['Parch']

6. The ‘Cabin’ column is not doing much as it stands: mostly 1st class passengers have a cabin recorded, and the rest are ‘Unknown’. A cabin number looks like ‘C123’. The letter refers to the deck, so we are going to extract it just like the titles.

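#Turning cabin number into Deck
cabin_list = ['A', 'B', 'C', 'D', 'E', 'F', 'T', 'G', 'Unknown']
#Minimal stand-in for the helper: returns the first entry of substrings found in text
def substrings_in_string(text, substrings):
    return next((s for s in substrings if s in str(text)), 'Unknown')
df['Deck']=df['Cabin'].fillna('Unknown').map(lambda x: substrings_in_string(x, cabin_list))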
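Columns 1 to 4 above can be sketched in the same way. This is only an illustration, assuming pandas: the rows are made up, missing cabins are assumed to be stored as nulls, and the column names `Title`, `Has_Cabin`, `Age_Band`, `Fare_Band`, the `min_count` threshold, and the bin edges are all hypothetical choices rather than anything fixed by the data set.

```python
import pandas as pd

# Toy rows in the Titanic layout; the values are made up for illustration.
df = pd.DataFrame({
    "Name": ["Nitin, Mr. Kamal", "Doe, Miss. Jane", "Roe, Mrs. Ann",
             "Poe, Mr. Sam", "Low, Dr. Max", "Kay, Rev. Tom"],
    "Cabin": ["C123", None, "E46", None, None, "B20"],
    "Age": [22.0, 4.0, 38.0, 61.0, 45.0, 30.0],
    "Fare": [7.25, 21.0, 71.3, 8.05, 26.55, 13.0],
})

# 1. Title: the text between the comma and the first full stop
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Merge titles seen fewer than min_count times into 'Other'
min_count = 2  # hypothetical threshold, tune it on the real frequencies
counts = df["Title"].value_counts()
rare_titles = counts[counts < min_count].index
df["Title"] = df["Title"].where(~df["Title"].isin(rare_titles), "Other")

# 2. Has_Cabin: 1 if a cabin is recorded, 0 otherwise
#    (assumes missing cabins are nulls, not the string 'Unknown')
df["Has_Cabin"] = df["Cabin"].notna().astype(int)

# 3. Age bands with hand-picked edges (hypothetical cut points)
df["Age_Band"] = pd.cut(df["Age"], bins=[0, 12, 18, 60, 120],
                        labels=["Child", "Teen", "Adult", "Senior"])

# 4. Fare bands: quantile bins, so each band holds roughly the same number of rows
df["Fare_Band"] = pd.qcut(df["Fare"], q=3, labels=["Low", "Mid", "High"])

print(df[["Title", "Has_Cabin", "Age_Band", "Fare_Band"]])
```

Fixed edges suit Age because the bands have a natural meaning, while quantile bins suit a skewed column like Fare. Both are judgment calls to validate against the model.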

There was one more project where we had to predict the number of burgers sold by McDonald’s, and we only had monthly data. We tried an ARIMA model, but we also needed to test Linear Regression and ARIMAX, so we needed a few more columns. These are the columns we came up with:

1. We created 4 binary columns for seasons, i.e. Summer, Winter, Spring, and Monsoon.

2. We created a separate column for the month number.

3. We created a column for the year, which we converted into a factor so that the model does not interpret 2018 as higher than 2005. These are categories, not numbers.

4. We created a column with the number of weekends in the month. The hypothesis was that a month with more weekends will have more burgers sold.

5. We also created a number-of-days column. The hypothesis was that a month with 31 days will sell more burgers than one with 28 days.
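The calendar columns above could be built roughly as follows. This is a sketch under assumptions, not the project's actual code: the date range, the column names, and the month-to-season mapping are hypothetical, and "number of weekends" is interpreted here as the count of Saturdays and Sundays in the month.

```python
import calendar
import pandas as pd

# One row per month; the date range is made up for illustration
df = pd.DataFrame({"Month_Start": pd.date_range("2018-01-01", periods=12, freq="MS")})

df["Month"] = df["Month_Start"].dt.month
# Year stored as a category, so a model does not treat 2018 as "more" than 2005
df["Year"] = df["Month_Start"].dt.year.astype(str).astype("category")
df["Days_In_Month"] = df["Month_Start"].dt.days_in_month


def weekend_days(year, month):
    """Count the Saturdays and Sundays in the given month."""
    n_days = calendar.monthrange(year, month)[1]
    # calendar.weekday returns 0 for Monday ... 6 for Sunday
    return sum(calendar.weekday(year, month, day) >= 5
               for day in range(1, n_days + 1))


df["Weekend_Days"] = [weekend_days(ts.year, ts.month) for ts in df["Month_Start"]]

# Hypothetical month-to-season mapping; adjust it to the region being modelled
season_of_month = {11: "Winter", 12: "Winter", 1: "Winter",
                   2: "Spring", 3: "Spring",
                   4: "Summer", 5: "Summer", 6: "Summer",
                   7: "Monsoon", 8: "Monsoon", 9: "Monsoon", 10: "Monsoon"}
df["Season"] = df["Month"].map(season_of_month)

# Four binary season columns, one per season
season_dummies = pd.get_dummies(df["Season"], dtype=int)
df = pd.concat([df, season_dummies], axis=1)

print(df.head())
```

Columns like these can be passed as regressors to Linear Regression, or as exogenous variables to ARIMAX.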

We ran an ensemble model using Linear Regression, ARIMA, and ARIMAX to get good accuracy.

To practice more, pick up any data set and understand the problem statement. Then you can create more columns to boost the performance of the model.

If you want a complete modeling walkthrough on a real data set, you can go through the book given below

Complete Linear Regression and ARIMA forecasting using R

You can also learn the complete project in a conversational way below

The Monk who knew Linear Regression

Keep coding 🙂

XtraMous


About TheDataMonk (Grand Master)

I am the Co-Founder of The Data Monk. I have a total of 6+ years of analytics experience: 3+ years at Mu Sigma, 2 years at OYO, and 1 year and counting at The Data Monk. I am an active trader and a logically sarcastic idiot :)
