Kaggle is a data science community that hosts hackathons, both for practice and for recruitment. You should try at least 5-10 hackathons before applying for a proper data science post.
Here we take the most basic problem, which should kick-start your campaign. This hackathon will make sure that you understand the problem and the approach.
To download the dataset and submit your solution, click here
P.S. –
1. We have used an intermediate level of feature engineering. You might have to create more features to boost your rank, but it’s a good way to start the journey
2. You need to have Python installed on your system and a very basic knowledge of Python
3. We have deliberately used screenshots instead of the actual code because we want you to write the code yourself
Problem Description – The ship Titanic met with an accident and a lot of passengers died in it. The dataset describes a few details of each passenger, such as Age, Sex, Ticket Fare, etc.
Aim – We have to build a model to predict whether a person survived this accident. So, your dependent variable is the column named ‘Survived’
Let’s start with importing the data
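A minimal sketch of the import step. It assumes the Kaggle files are saved as train.csv and test.csv in your working directory; so that the snippet runs on its own, it reads a few made-up rows from an in-memory CSV instead of the real file.

```python
import io
import pandas as pd

# Toy stand-in for train.csv: made-up rows with a subset of the real column names.
# With the real files you would write:
#   train = pd.read_csv("train.csv")
#   test = pd.read_csv("test.csv")
toy_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Fare,Embarked\n"
    "1,0,3,male,22,7.25,S\n"
    "2,1,1,female,38,71.28,C\n"
    "3,1,3,female,,7.92,S\n"
)
train = pd.read_csv(toy_csv)

print(train.head())   # first rows of the dataset
print(train.shape)    # (number of rows, number of columns)
```

The empty Age field in the last row is read as a missing value (NaN), which is how the gaps in the real dataset show up as well.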
-Check the dataset with the following commands
train.head()
test.head()
-Check the number of rows and columns in each of the datasets with the following commands
train.shape
test.shape
-The first thing you need to do before starting any hackathon or project is to import the following important libraries
import pandas as pd  # needed for train.head(), train.shape and train.info()
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
Following is a brief description of the columns in the dataset
-You need to know which columns have missing values. The most basic check is to look at the summary of each dataset with the following commands
train.info()
test.info()
You can see that train has 891 rows and that there are missing values in Age, Cabin, and Embarked.
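If the info() output is hard to scan, counting the missing values per column gives the same information more directly. This is a small sketch on a made-up frame; the column names match the dataset but the values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values) with gaps in Age, Cabin and Embarked
train = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() marks every missing cell, sum() counts them per column
missing = train.isnull().sum()
print(missing[missing > 0])  # show only the columns that have gaps
```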
– It’s time to identify the important variables
Pclass is the class of the passenger, let’s see how many passengers were there in each class
There were a lot of passengers in Class 3, followed by Class 1 and Class 2.
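One way to get these counts is value_counts() on the Pclass column. The sketch below uses a handful of invented rows, so the numbers it prints are not those of the real dataset.

```python
import pandas as pd

# Toy Pclass column (invented rows)
train = pd.DataFrame({"Pclass": [3, 1, 3, 2, 3, 1]})

# Number of passengers in each class, largest group first
class_counts = train["Pclass"].value_counts()
print(class_counts)
```

If you prefer a chart, seaborn's countplot, for example sns.countplot(x="Pclass", data=train), draws the same counts as bars.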
-We will create two variables to store the survived and not survived passengers, to check how many passengers died from each Class
- Now let’s check how many males and females died in this accident
-Let’s check whether the class of the passenger was also given priority. Class 1 is the richest class, followed by 2 and 3
- Let’s check the Embarked column i.e. the point of boarding. This column has 2 missing values
More than 66% of the passengers who boarded from the point S died in the incident.
-Parch is the number of parents or children traveling with a passenger
- SibSp is the number of siblings or spouses traveling with a passenger
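The checks above can be sketched with a filter on Survived plus a groupby. The variable names survived and not_survived are our own choice, and the rows are made up, so treat the printed rates as illustration only.

```python
import pandas as pd

# Toy rows (invented) with the same column names as train.csv
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
    "Pclass":   [3, 1, 3, 3, 2, 1, 3, 2],
    "Sex":      ["male", "female", "female", "male",
                 "male", "female", "male", "female"],
    "Embarked": ["S", "C", "S", "S", "S", "C", "Q", "S"],
    "Parch":    [0, 0, 1, 0, 0, 2, 0, 1],
    "SibSp":    [1, 1, 0, 0, 0, 1, 0, 0],
})

# Split the passengers by outcome
survived = train[train["Survived"] == 1]
not_survived = train[train["Survived"] == 0]

# How many passengers died in each class
died_per_class = not_survived["Pclass"].value_counts()
print(died_per_class)

# Survival rate (mean of the 0/1 column) for each value of a feature
for col in ["Sex", "Pclass", "Embarked", "Parch", "SibSp"]:
    print(train.groupby(col)["Survived"].mean())

survival_by_sex = train.groupby("Sex")["Survived"].mean()
```

Because Survived is coded as 0 and 1, its mean within a group is the share of that group that survived, and 1 minus that value is the share that died.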
-The correlation between two variables tells you whether the features are directly (positively) or inversely (negatively) related to each other.
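A small sketch of a correlation check with pandas. Only numeric columns are used, which is why select_dtypes comes first; the rows are invented so that a higher class number goes with a lower fare.

```python
import pandas as pd

# Toy rows (invented): class 1 tickets are the expensive ones here
train = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0],
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Fare":     [80.0, 70.0, 20.0, 8.0, 7.0, 9.0],
    "Sex":      ["female", "female", "male", "male", "female", "male"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = train.select_dtypes("number").corr()
print(corr)
```

A value close to +1 means the two columns rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is little linear relation. The matrix can also be passed to sns.heatmap(corr, annot=True) for a visual view.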
-We will put train and test together in a list so that the same changes can be applied to both datasets in one loop
final_data = [train,test]
Changing Data Types
1. Change male and female to binary values
2. Age has some missing values. For now we replace them with the mean, but you could instead fill them with random values in the range between mean - standard deviation and mean + standard deviation
3. Since there are only 2 missing values in Embarked, we replace them with the most common boarding point, i.e. S
Let’s now fix Embarked and convert the categorical variables into numeric variables
4. We will fill the missing values present in the Fare column with the median value
5. Let’s create one more variable, Family Size, which has the following formula:
Family Size = Parch + SibSp + 1
This gives the size of the family a passenger was traveling with on the ship, counting the passenger too
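The five steps above can be sketched in one loop over final_data. Several details here are our assumptions and not taken from the article: the numeric codes chosen for Sex and Embarked, the column name FamilySize, and the decision to compute the mean, median, and most common value on train and reuse them for test. The rows are made up.

```python
import numpy as np
import pandas as pd

# Toy rows (invented) with the same column names as the Kaggle files
train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 36.0],
    "Embarked": ["S", np.nan, "C", "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Parch": [0, 0, 1, 2],
    "SibSp": [1, 1, 0, 0],
})
test = pd.DataFrame({
    "Sex": ["female", "male"],
    "Age": [30.0, np.nan],
    "Embarked": ["Q", "S"],
    "Fare": [np.nan, 12.0],
    "Parch": [0, 1],
    "SibSp": [0, 1],
})
final_data = [train, test]

# Fill values computed once on train (assumption: reuse them for test)
age_mean = train["Age"].mean()
fare_median = train["Fare"].median()
embarked_mode = train["Embarked"].mode()[0]  # most common port

for df in final_data:
    # 1. male/female -> 0/1 (the codes are an arbitrary choice)
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    # 2. missing Age -> mean
    df["Age"] = df["Age"].fillna(age_mean)
    # 3. missing Embarked -> most common port, then ports -> numbers
    df["Embarked"] = df["Embarked"].fillna(embarked_mode)
    df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    # 4. missing Fare -> median
    df["Fare"] = df["Fare"].fillna(fare_median)
    # 5. Family Size = Parch + SibSp + 1
    df["FamilySize"] = df["Parch"] + df["SibSp"] + 1

print(train.head())
print(test.head())
```

Assigning columns inside the loop changes the original train and test frames, because the list holds references to them and not copies.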
Do keep checking the head of train and test to make sure that the datasets are getting modified
–We will be removing Ticket and Cabin: Ticket because the ticket number works like an ID, so it is unlikely to have any relation to survival, and Cabin because of its many missing values
You are free, though, to try getting something out of the Ticket number
– We are also not using the Name column, though a lot of Kaggle solutions extract the title from each name. You should try it once you complete the basic submission
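A sketch of the drop step. We assume here that train1 and test1 are simply train and test without Ticket, Cabin, and Name; the article does not show how they were created. The optional title extraction uses a regular expression of our own, and all rows and names are invented.

```python
import pandas as pd

# Toy rows (invented names and values)
train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann"],
    "Ticket": ["A/5 0001", "PC 0002", "0003"],
    "Cabin": [None, "C85", None],
    "Fare": [7.25, 71.28, 7.92],
})
test = pd.DataFrame({
    "PassengerId": [892, 893],
    "Name": ["Coe, Mr. Sam", "Loe, Mrs. Kim"],
    "Ticket": ["0004", "0005"],
    "Cabin": [None, None],
    "Fare": [8.05, 12.0],
})

# Optional: pull the title (the text between the comma and the first dot)
titles = train["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())

# Drop the columns we are not using; drop() returns new frames
drop_cols = ["Ticket", "Cabin", "Name"]
train1 = train.drop(columns=drop_cols)
test1 = test.drop(columns=drop_cols)
print(train1.head())
```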
-Check the head of train1
train1.head()
–Drop PassengerId from both train1 and test1
-Put the Survived column in the variable y_train1
-Keep every column other than Survived in X_train1
-Keep all the test columns in a new variable X_test1
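The last four steps could look like the sketch below. The frames are invented, and test_ids is a helper name of our own, used to keep the test PassengerId values around for the submission file.

```python
import pandas as pd

# Toy versions of train1 and test1 (invented values)
train1 = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": [0, 1, 1],
})
test1 = pd.DataFrame({
    "PassengerId": [892, 893],
    "Pclass": [3, 2],
    "Sex": [0, 1],
})

# Keep the test ids before dropping them; the submission needs them later
test_ids = test1["PassengerId"]

# Drop PassengerId from both datasets
train1 = train1.drop(columns=["PassengerId"])
test1 = test1.drop(columns=["PassengerId"])

y_train1 = train1["Survived"]                 # dependent variable
X_train1 = train1.drop(columns=["Survived"])  # independent variables
X_test1 = test1.copy()                        # data to predict on

print(X_train1.shape, y_train1.shape, X_test1.shape)
```

X_train1 and X_test1 must end up with the same columns in the same order, otherwise the model cannot predict on the test data.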
Why are we creating these new variables?
The idea is to keep the dependent variable, i.e. the one you want to predict, in y_train1.
Put all the independent variables in X_train1, which will be used to create a model
Once the model is ready, you have to predict the value for each PassengerId given in the test dataset, so we keep the test data in a separate variable, i.e. X_test1
Just to reiterate, before we move forward with the models
X_train1 – All the independent columns which you need in the model. Drop the unnecessary columns
y_train1 – The dependent variable
X_test1 – The dataset on which you want to make the prediction
Creating models
This will include a set of steps
Step 1 – Import the package
Step 2 – Put the algorithm in a variable
Step 3 – Fit the model on the independent variables (X_train1) and the dependent variable (y_train1)
Step 4 – Make the prediction using the predict function on X_test1
Step 5 – Get the accuracy of the model using the score function. The test labels are not available, so this score is computed on the training data
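The five steps can be sketched with scikit-learn's LogisticRegression. The data is a made-up stand-in for X_train1, y_train1, and X_test1, and max_iter=1000 as well as the names log_reg, pred_LR, and acc_LR are our own choices.

```python
import numpy as np
import pandas as pd

# Step 1 - import the package
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the real variables (invented values)
X_train1 = pd.DataFrame({
    "Pclass": np.tile([1, 2, 3, 3], 10),
    "Sex": np.tile([0, 1], 20),
})
y_train1 = pd.Series(np.tile([0, 1], 20), name="Survived")
X_test1 = pd.DataFrame({"Pclass": [3, 1, 2], "Sex": [0, 1, 1]})

# Step 2 - put the algorithm in a variable
log_reg = LogisticRegression(max_iter=1000)

# Step 3 - fit on the independent and dependent variables
log_reg.fit(X_train1, y_train1)

# Step 4 - predict on the test data
pred_LR = log_reg.predict(X_test1)

# Step 5 - accuracy on the training data
acc_LR = log_reg.score(X_train1, y_train1)
print(pred_LR, acc_LR)
```

Every model below follows the same pattern; only the import and the line in Step 2 change.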
1. Logistic Regression
2. Support Vector Machine
3. K-Nearest Neighbor – We will try K (the number of neighbors) as 2, 3, and 4
K-Nearest Neighbor with neighbor = 2
K-Nearest Neighbor with neighbor = 4
K-Nearest Neighbor with neighbor = 3
4. Decision Tree – A Decision Tree with no depth limit keeps splitting until it fits the training data almost perfectly, so it tends to overfit. Random Forest reduces this, but it can still score very high on the training data. That’s why the accuracy of DT on the training data is 100%
5. Random Forest – n_estimators is the number of trees you want in the forest
6. Perceptron
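Since all six algorithms follow the same fit and score pattern, they can be tried in one loop. This is a sketch on synthetic data from make_classification, not on the Titanic features, and the hyperparameters (max_iter, n_estimators=100, random_state) are our assumptions, not the settings used in the article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train1 / y_train1
X_train1, y_train1 = make_classification(
    n_samples=200, n_features=5, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN (k=2)": KNeighborsClassifier(n_neighbors=2),
    "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
    "KNN (k=4)": KNeighborsClassifier(n_neighbors=4),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Perceptron": Perceptron(random_state=0),
}

# Fit each model and record its accuracy on the training data
scores = {}
for name, model in models.items():
    model.fit(X_train1, y_train1)
    scores[name] = model.score(X_train1, y_train1)

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```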
We tried these algorithms
1. Logistic Regression
2. SVM
3. KNN
4. Decision Tree
5. Random Forest
6. Perceptron
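Because the scores are computed on the training data, a model that memorizes that data looks better than it really is. One way to see this, which goes a step beyond what the article does, is to compare the training accuracy of a Decision Tree with its cross-validated accuracy. The sketch uses synthetic data and a 5-fold split chosen by us.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train1 / y_train1
X_train1, y_train1 = make_classification(
    n_samples=200, n_features=5, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)

# Accuracy on the same rows the tree was trained on
tree.fit(X_train1, y_train1)
train_acc = tree.score(X_train1, y_train1)

# Mean accuracy over 5 folds, each scored on rows the tree did not see
cv_acc = cross_val_score(tree, X_train1, y_train1, cv=5).mean()

print(f"training accuracy: {train_acc:.3f}")
print(f"cross-validated accuracy: {cv_acc:.3f}")
```

A large gap between the two numbers is a sign of overfitting, and the cross-validated number is usually the better guide to how a submission will score.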
Make your first submission using Random Forest
You need to get the pred_RF column from the model and combine it with PassengerId from the test dataset
Submit it on Kaggle.
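A sketch of building the submission file. Kaggle expects two columns for this competition, PassengerId and Survived. The ids and predictions below are invented, test_ids is a helper name of our own, and the file is written to the system temp folder only so the snippet runs anywhere; in practice you would save submission.csv wherever is convenient.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Invented stand-ins: ids from the test dataset and Random Forest predictions
test_ids = pd.Series([892, 893, 894], name="PassengerId")
pred_RF = np.array([0, 1, 0])

# One row per test passenger: the id and the predicted outcome
submission = pd.DataFrame({"PassengerId": test_ids, "Survived": pred_RF})

# index=False keeps the row index out of the file
out_path = os.path.join(tempfile.gettempdir(), "submission.csv")
submission.to_csv(out_path, index=False)
print(submission.head())
```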
You can also try submitting results from other algorithms. The following is an example using Logistic Regression
Note:-
1. This article is just to make sure that you understand how to start exploring Data Science Hackathons
2. Feature Engineering is the key
3. Try more algorithms to climb the leaderboard
Keep Learning 🙂
The Data Monk