EDA in Python

The complete machine learning journey can be summarized in four steps:

1. Exploratory Data Analysis (EDA) – This is the first thing you do when you get a dataset. Before jumping into building models, you first need to understand the nature of the data. EDA also makes it easier for your audience to get familiar with the data.

EDA includes visualizing the raw data, looking for correlations in the dataset, and finding missing values. In short, you have to plot a lot of graphs to understand the dataset.

2. Cleaning the data – You can easily spend more than half of your time cleaning the data and treating missing values. Cleaning is important because the accuracy of your model depends on how many valid data points you have.

3. Building models – Once you have clean data, you build machine learning models, visualize and check the results, and improve the model's success metric.

4. Result presentation – You have the results of the model. These results are of no use until they are consumed by an audience. You will again need the power of visualization to communicate the findings of your analysis.
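Steps 1 and 2 above can be sketched on a tiny table. This is only an illustration: the `toy` DataFrame and its values are made up for the example (they are not Titanic data), and filling gaps with the median is just one common choice of missing value treatment, not a recommendation for every column.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch (not from the Titanic file).
toy = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "survived": [0, 1, 1, 1, 0],
})

# Step 1 (EDA): count missing values per column and look at correlations.
missing_per_column = toy.isnull().sum()
correlations = toy.corr()  # pairwise Pearson correlation of numeric columns

# Step 2 (cleaning): fill the missing ages with the median age.
median_age = toy["age"].median()
toy["age_filled"] = toy["age"].fillna(median_age)

print(missing_per_column)
print(correlations.round(2))
print(toy)
```

Checking the missing-value counts before and after the fill is a cheap way to confirm the treatment did what you expected.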

We will take a dataset and build some graphs from scratch.
I will be using the Titanic dataset for the following reasons:
1. It is freely and easily available at this link – https://www.kaggle.com/c/titanic/data
2. It's relatively clean and easy to understand

Gist of the dataset – The Titanic training dataset contains information (such as age, sex and family size) about each passenger, along with whether that passenger survived.

  1. PassengerId: Unique ID of each passenger.
  2. Survived: Takes the values 0 and 1: 0 for did not survive, 1 for survived.
  3. Pclass: Ticket class of the passenger. There are 3 classes: 1, 2 and 3.
  4. Name: Name of the passenger.
  5. Sex: Gender of the passenger.
  6. Age: Age of the passenger.
  7. SibSp: Number of siblings and spouses the passenger had aboard.
  8. Parch: Number of parents and children the passenger had aboard.
  9. Ticket: Ticket number of the passenger.
  10. Fare: Fare paid by the passenger.
  11. Cabin: Cabin of the passenger.
  12. Embarked: Port where the passenger embarked (S, C or Q).
  13. Initial: Presumably the title in the passenger's name (e.g. Mr, Mrs). Note that this is not a column in the raw Kaggle file; it has to be derived from Name.
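Since the raw Kaggle file has no Initial column, here is a rough sketch of how such a feature could be derived. It assumes that "Initial" means the title that appears before the first period in Name; the three names below are made up and only follow the "Surname, Title. First name" layout used in the dataset.

```python
import pandas as pd

# Hypothetical names in the same "Surname, Title. First name" layout.
names = pd.Series(
    ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann"],
    name="Name",
)

# Capture the first run of letters that is directly followed by a period.
initial = names.str.extract(r"([A-Za-z]+)\.", expand=False)

print(initial.tolist())  # ['Mr', 'Mrs', 'Miss']
```

On the real data the same pattern would be applied to the Name column of the training DataFrame and stored as a new column.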

Import the following libraries:

import matplotlib.pyplot as plt
import seaborn as sb
import pandas as pd

We will start directly by plotting some insightful graphs to get a gist of the problem statement. The code is mostly self-explanatory, but we will try to provide as much explanation as possible.

We will start by loading the train and test files.

Titanic_Train = pd.read_csv('/Users/nitinkamal/Downloads/titanic/train.csv')
Titanic_Test = pd.read_csv('/Users/nitinkamal/Downloads/titanic/test.csv')
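Before plotting, it usually helps to take a quick look at what was loaded. The sketch below is not part of the original walkthrough: so that it runs without the Kaggle download, it reads three made-up rows from an in-memory CSV that only mimics the column layout of train.csv. On your machine you would run the same calls on Titanic_Train.

```python
import io

import pandas as pd

# Three invented rows with the same header as train.csv (not real passengers).
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,30,1,0,T1,7.5,,S
2,1,1,"Roe, Mrs. Jane",female,,1,0,T2,70.0,C1,C
3,1,2,"Poe, Miss. Ann",female,25,0,0,T3,13.0,,Q
"""
sample = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = sample.shape      # number of rows and columns
missing = sample.isnull().sum()    # missing values per column

print(sample.head())               # first few rows
print(sample.dtypes)               # column types
print(missing)
```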

Now we will start plotting these data points.

1. Let's see how many people in the training set actually survived

Figure: Count of passengers who survived the Titanic
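The plotting call for this step is not shown in the text. One way to get such a chart, assuming seaborn's countplot is what was intended, is sketched here. The five-row `sample` frame is made up so the snippet runs on its own, and the Agg backend line is only there for running outside a notebook; with the real data you would pass data=Titanic_Train instead.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sb

# Hypothetical stand-in for Titanic_Train with only the column we need.
sample = pd.DataFrame({"Survived": [0, 0, 0, 1, 1]})

# One bar per value of Survived, bar height = number of rows.
ax = sb.countplot(x="Survived", data=sample)
ax.set_title("Passengers by survival (0 = did not survive, 1 = survived)")

# The same numbers as a table, handy for checking the plot.
counts = sample["Survived"].value_counts().sort_index()
print(counts)

plt.close("all")
```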

2. Using SibSp to see how many people survived for each number of siblings/spouses aboard
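The code for this step is missing from the text, so what follows is only my guess at a reasonable equivalent. It tabulates survivors per SibSp value on a small made-up frame (`sample` is hypothetical, not Titanic data).

```python
import pandas as pd

# Hypothetical stand-in for Titanic_Train.
sample = pd.DataFrame({
    "SibSp":    [0, 0, 0, 1, 1, 2],
    "Survived": [0, 1, 0, 1, 1, 0],
})

# Rows: SibSp value, columns: Survived (0/1), cells: passenger counts.
by_sibsp = pd.crosstab(sample["SibSp"], sample["Survived"])

# Survived is 0/1, so summing it gives the number of survivors per group.
survivors_by_sibsp = sample.groupby("SibSp")["Survived"].sum()

print(by_sibsp)
print(survivors_by_sibsp)
```

On the real data, the same view could be drawn with sb.countplot(x='SibSp', hue='Survived', data=Titanic_Train).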


3. Check the number of survivors and non-survivors on the basis of gender

# One panel per value of Survived, with a bar for each gender
sb.catplot(x='Sex', col='Survived', kind='count', data=Titanic_Train)
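Counts can be hard to compare when the groups differ in size, so a survival rate per gender is a useful companion to the plot. This is a small sketch on made-up rows (`sample` is hypothetical); it relies on Survived being coded 0/1, so the mean of the column is the share of survivors.

```python
import pandas as pd

# Hypothetical stand-in for Titanic_Train.
sample = pd.DataFrame({
    "Sex":      ["male", "male", "male", "female", "female"],
    "Survived": [0, 0, 1, 1, 1],
})

# Mean of a 0/1 column = fraction of rows equal to 1.
rate_by_sex = sample.groupby("Sex")["Survived"].mean()

print(rate_by_sex.round(2))
```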

4. Let's check survival by port of embarkation, i.e. S (Southampton), C (Cherbourg) and Q (Queenstown)

sb.catplot(x='Survived', col='Embarked', kind='count', data=Titanic_Train)

5. A cross tab can show three dimensions of information in one table,
e.g. survival on the basis of gender and class

pd.crosstab([Titanic_Train.Sex, Titanic_Train.Survived], Titanic_Train.Pclass, margins=True).style.background_gradient(cmap='coolwarm')
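To see how the three dimensions are laid out, here is the same call without the styling on six made-up rows (`sample` is hypothetical). Passing a list of two columns as the first argument gives a two-level row index (Sex, Survived), Pclass becomes the columns, and margins=True adds the "All" totals.

```python
import pandas as pd

# Hypothetical stand-in for Titanic_Train.
sample = pd.DataFrame({
    "Sex":      ["male", "male", "female", "female", "female", "male"],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass":   [3, 1, 1, 2, 3, 3],
})

table = pd.crosstab(
    [sample["Sex"], sample["Survived"]],  # two row dimensions
    sample["Pclass"],                     # one column dimension
    margins=True,                         # add "All" row and column
)

print(table)
```

The .style.background_gradient(...) part only affects how the table is rendered in a notebook; the numbers are the same.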

6. Getting the survival counts by passenger class

pd.crosstab(Titanic_Train.Pclass, Titanic_Train.Survived, margins=True).style.background_gradient(cmap='autumn_r')
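If you want shares instead of raw counts, crosstab can normalize each row for you. A short sketch on made-up rows (`sample` is hypothetical): normalize="index" divides every row by its total, so each passenger class sums to 1.

```python
import pandas as pd

# Hypothetical stand-in for Titanic_Train.
sample = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Each row shows the fraction of that class that did / did not survive.
share = pd.crosstab(sample["Pclass"], sample["Survived"], normalize="index")

print(share)
```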
