Kaggle Titanic Solution

Kaggle is a Data Science community that hosts hackathons, both for practice and for recruitment. You should attempt at least 5-10 hackathons before applying for a proper Data Science post.

Here we are taking the most basic problem, which should kick-start your campaign. This hackathon will make sure that you understand the problem and the approach.

To download the dataset and submit your solution, click here

P.S. –
1. We have used an intermediate level of feature engineering; you might have to create more features to boost your rank, but it’s a good way to start the journey
2. You need to have Python installed on your system and a very basic knowledge of Python
3. We have deliberately put screenshots rather than the actual code because we want you to write the code yourself

Problem Description – The ship Titanic met with an accident and a large number of passengers died. The dataset describes passenger information such as Age, Sex, Ticket Fare, etc.

Aim – We have to build a model to predict whether a person survived this accident. So, your dependent variable is the column named ‘Survived’

Let’s start with importing the data
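The original post shows this step as a screenshot; here is a minimal sketch of what the import could look like, assuming the Kaggle files train.csv and test.csv are in your working directory

import pandas as pd

# Load the Kaggle Titanic files (file paths are an assumption)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')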

-Check the dataset by the following commands

train.head()
test.head()


-Check the number of rows and columns in each of the datasets by the following command

train.shape
test.shape


-The first thing which you need to do before starting any hackathon or project is to import the following important libraries

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


Following is a brief description of the columns in the dataset (shown as a screenshot in the original post):

PassengerId – unique ID of the passenger
Survived – 0 = No, 1 = Yes (the target)
Pclass – ticket class (1, 2, or 3)
Name – passenger name
Sex – male or female
Age – age in years
SibSp – number of siblings/spouses aboard
Parch – number of parents/children aboard
Ticket – ticket number
Fare – ticket fare
Cabin – cabin number
Embarked – port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

-You need to know the columns with missing values. The most basic thing is to check the description of the dataset with the following commands

train.info()
test.info()

You can see the training set has 891 rows, with missing values in Age, Cabin, and Embarked (in the test set, Age, Cabin, and Fare have missing values).

-It’s time to identify the important variables

Pclass is the class of the passenger; let’s see how many passengers were in each class
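One way to get these counts (a sketch, since the original shows a screenshot):

train['Pclass'].value_counts()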

There were a lot of passengers in Class 3, followed by Class 1 and Class 2.

-We will be creating a variable to store the survived and the not-survived passengers, to check how many passengers died from each class
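A minimal sketch of that split; the variable names survived and not_survived are assumptions:

# Subsets of the training data by outcome
survived = train[train['Survived'] == 1]
not_survived = train[train['Survived'] == 0]

# How many passengers died from each class
not_survived['Pclass'].value_counts()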

  • Now let’s check how many males and females died in this accident
Roughly 75% of the females but only about 19% of the males survived the accident

-Let’s check if the class of the passenger was also given priority. Class 1 is the rich class, followed by 2 and 3

About 63% of the Class 1 passengers and only about 24% of the Class 3 passengers survived the accident
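These survival rates can be computed with a groupby; a sketch, assuming the screenshots follow the same idea:

# Share of survivors within each sex and within each class
train[['Sex', 'Survived']].groupby('Sex').mean()
train[['Pclass', 'Survived']].groupby('Pclass').mean()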
  • Let’s check the Embarked column, i.e. the point of boarding. This column has 2 missing values
Most of the passengers boarded from point S, so we can directly fill the 2 missing values with S

More than 66% of the passengers who boarded from point S died in the incident.
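A sketch for this step: count the boarding points, fill the 2 missing values, then check survival per point. The fillna call is an assumption about how the screenshot does the imputation.

# Boarding-point counts and the simple imputation
train['Embarked'].value_counts()
train['Embarked'] = train['Embarked'].fillna('S')

# Survival rate per boarding point
train[['Embarked', 'Survived']].groupby('Embarked').mean()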

-Parch is the number of parents or children traveling along with a passenger

More than 65% of the passengers travelling alone died in the accident
  • SibSp is the number of siblings or spouses travelling along with a passenger
More than 65% of the passengers travelling without a sibling or spouse died in the accident.
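The same groupby pattern works for both family columns; a minimal sketch:

# Survival rate by number of parents/children and by siblings/spouses
train[['Parch', 'Survived']].groupby('Parch').mean()
train[['SibSp', 'Survived']].groupby('SibSp').mean()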

-Understanding the correlation between two variables tells you whether the features are directly or inversely related to each other.
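Since seaborn and matplotlib are already imported, one quick way to see this is a heatmap (a sketch; the numeric_only flag simply skips the text columns):

# Correlation heatmap over the numeric columns
sns.heatmap(train.corr(numeric_only=True), annot=True)
plt.show()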

-We will be putting the train and test datasets into a single list so that changes can be applied to both of them at once

final_data = [train, test]

Changing Data Types

1. Change male and female to binary values

2. Age has some missing values; for now we are replacing them with the mean. But you can very well replace them with random values in the range from mean - standard deviation to mean + standard deviation

3. Since there are only 2 missing values in Embarked, we are replacing them with the most common value, i.e. S

Let’s now fix Embarked and convert the categorical variables into numeric variables

4. We will fill the missing values present in the Fare column with the median value

5. Let’s create one more variable i.e. Family Size which will have the following formula:-

Family Size = Parch + SibSp + 1

This captures the family size of a passenger travelling on the ship (a combined sketch of all five steps follows below)
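Here is a minimal sketch of these five steps in one loop over final_data, so both train and test get modified together. The numeric codes chosen for Sex and Embarked are assumptions; the screenshots may encode them differently.

for dataset in final_data:
    # 1. Sex to a binary value
    dataset['Sex'] = dataset['Sex'].map({'male': 0, 'female': 1})
    # 2. Missing Age -> mean age (random values in mean +/- std also work)
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].mean())
    # 3. Missing Embarked -> most common port, then categorical to numeric
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
    # 4. Missing Fare -> median fare
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())
    # 5. Family Size = Parch + SibSp + 1
    dataset['FamilySize'] = dataset['Parch'] + dataset['SibSp'] + 1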

Do keep checking the head of train and test to make sure that the datasets are getting modified

We will be removing Ticket and Cabin: the Ticket number is a UID, so it won’t have any relation with whether the person survived, and Cabin has heavy missing values.
Though you are free to apply your mind to getting something out of the Ticket Number

We are also not using the Name column, though a lot of Kaggle solutions extract the title from each name. You should try it once you complete the basic submission
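A sketch of how train1 and test1 could be created; the exact drop list mirrors the reasoning above:

# Drop the columns we decided not to use
train1 = train.drop(['Name', 'Ticket', 'Cabin'], axis=1)
test1 = test.drop(['Name', 'Ticket', 'Cabin'], axis=1)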

-Check the head of train1

train1.head()

Drop PassengerId from both train1 and test1

-Put the Survived column in the variable y_train1
-Keep every column other than Survived in X_train1
-Keep all the test columns in a new variable X_test1
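A minimal sketch of these steps, including the PassengerId drop mentioned above:

# Remove the identifier column from both working copies
train1 = train1.drop('PassengerId', axis=1)
test1 = test1.drop('PassengerId', axis=1)

# Target, training features, and test features
y_train1 = train1['Survived']
X_train1 = train1.drop('Survived', axis=1)
X_test1 = test1.copy()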

Why are we creating these new variables?
The idea is to keep the dependent variable, i.e. the one which you want to predict, in y_train1.
Put all the independent variables in X_train1, which will be used to create the model

Once the model is ready, you have to predict survival for each PassengerId given in the test dataset, so we have kept the test features in a separate variable, i.e. X_test1


Just to reiterate, before we move forward with the models
X_train1 – All the independent columns which you need in the model. Drop the unnecessary columns
y_train1 – The dependent variable
X_test1 – The dataset on which you want to make the prediction

Creating models
This will include a set of steps

Step 1 – Import the package
Step 2 – Put the algorithm in a variable
Step 3 – Fit the model on the independent variables (X_train1) and the dependent variable (y_train1)
Step 4 – Make the prediction using the predict function on X_test1
Step 5 – Get the accuracy of the model on the training data by using the score function
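As a sketch, here are the five steps with scikit-learn's LogisticRegression; the variable names logreg, pred_LR, and acc_LR are assumptions, and every algorithm below follows the same pattern.

from sklearn.linear_model import LogisticRegression   # Step 1

logreg = LogisticRegression(max_iter=1000)            # Step 2
logreg.fit(X_train1, y_train1)                        # Step 3
pred_LR = logreg.predict(X_test1)                     # Step 4
acc_LR = logreg.score(X_train1, y_train1)             # Step 5: training accuracy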


1. Logistic Regression

2. Support Vector Machine

3. K-Nearest Neighbor – We will try n_neighbors values of 2, 3, and 4

K-Nearest Neighbor with n_neighbors = 2

K-Nearest Neighbor with n_neighbors = 4

K-Nearest Neighbor with n_neighbors = 3
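A sketch that tries all three values in one loop (the screenshots fit them one at a time):

from sklearn.neighbors import KNeighborsClassifier

for k in (2, 3, 4):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train1, y_train1)
    print(k, knn.score(X_train1, y_train1))   # training accuracy for each k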

4. Decision Tree – Decision Tree (and, to a lesser extent, Random Forest) will overfit, because an unconstrained tree can keep splitting until it has memorised the training data. That’s why the training accuracy of the DT comes out as 100%

5. Random Forest – n_estimators is the number of trees you want in the Forest
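A sketch for this step; pred_RF is the prediction array used for the submission below, and 100 trees is an assumed value for n_estimators:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train1, y_train1)
pred_RF = rf.predict(X_test1)
acc_RF = rf.score(X_train1, y_train1)   # training accuracy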

6. Perceptron

We tried these algorithms
1. Logistic Regression
2. SVM
3. KNN
4. Decision Tree
5. Random Forest
6. Perceptron

Make your first submission using Random Forest

You need to get the pred_RF predictions from the model and combine them with PassengerId from the test dataset
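A minimal sketch of building the submission file; the file name submission.csv is an assumption, and Kaggle expects exactly the columns PassengerId and Survived:

# PassengerId comes from the original test dataset (it was dropped only from test1)
submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': pred_RF
})
submission.to_csv('submission.csv', index=False)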

Submit it on Kaggle.

You can also try submitting results from other algorithms. Following is an example with Logistic Regression

Note:-
1. This article is just to make sure that you understand how to start exploring Data Science Hackathons
2. Feature Engineering is the key
3. Try more algorithms to climb the Leader Board


Keep Learning 🙂

The Data Monk


