EDA in Python

A typical Machine Learning project can be broken down into 4 steps:

1. Exploratory Data Analysis – This is the first thing you do when you get a dataset. Before you start building models, you need to understand the nature of the data. EDA also makes the data easier to explain to your audience.

EDA includes visualizing the raw data, looking for correlations between columns, and finding missing values. In short, you have to plot a lot of graphs to understand the dataset.

2. Cleaning the data – Expect to spend a large share of your time, often more than half, cleaning the data and treating missing values. Cleaning is important because the accuracy of your model depends on how many reliable data points it can learn from.

3. Building models – Once you have clean data, you build Machine Learning models, visualize and check the results, and improve the success metric of the model.

4. Result Presentation – You have the results of the model, but they are of no use until the audience consumes them. You will again need the power of visualization to present the results of your analysis.
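
As a rough illustration of steps 1 and 2, the sketch below counts missing values, computes pairwise correlations, and fills the gaps with the column median. The small `sample` table and its values are made up for illustration, and median filling is only one common option, not necessarily the right treatment for every column.

```python
import pandas as pd

# Hypothetical mini table; the values are invented for illustration only.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 3, 1, 2],
    "Age": [22.0, 38.0, None, 35.0, None],
    "Fare": [7.25, 71.28, 7.92, 53.10, 13.00],
})

# Step 1 (EDA): count missing values per column.
missing_counts = sample.isnull().sum()
print(missing_counts)

# Step 1 (EDA): pairwise Pearson correlation between the numeric columns.
correlations = sample.corr()
print(correlations.round(2))

# Step 2 (cleaning): fill missing ages with the median of the known ages.
median_age = sample["Age"].median()
cleaned = sample.assign(Age=sample["Age"].fillna(median_age))
print(cleaned["Age"].tolist())
```

`assign` returns a new table, so the original `sample` keeps its missing values and the two versions can be compared.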

We will take a dataset and build some graphs from scratch.
I will be using the Titanic dataset for the following reasons:
1. It is freely and easily available at this link – https://www.kaggle.com/c/titanic/data
2. It is fairly clean and easy to understand

Gist of the dataset – The Titanic train dataset contains information about each passenger (age, sex, family members aboard, etc.) along with whether the passenger survived.

PassengerId: Unique ID of each passenger.
Survived: Target column with values 0 and 1. 0 means the passenger did not survive and 1 means the passenger survived.
Pclass: Ticket class of the passenger: 1, 2 or 3.
Name: Name of the passenger.
Sex: Gender of the passenger.
Age: Age of the passenger in years.
SibSp: Number of siblings and spouses of the passenger aboard.
Parch: Number of parents and children of the passenger aboard.
Ticket: Ticket number of the passenger.
Fare: Fare paid by the passenger.
Cabin: Cabin number of the passenger.
Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
Initial: Title of the passenger (Mr, Mrs, Miss, etc.). This is not a column in the raw file; it is usually derived from Name.
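
`Initial` and a family-size figure do not come with the raw Kaggle file, so they are typically derived. A minimal sketch, assuming the title is the word followed by a period in `Name` and that family size means `SibSp + Parch + 1` (both are common conventions, not something the dataset itself defines). The three rows are a small stand-in for the real table, and `FamilySize` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical stand-in rows in the layout of the Name, SibSp and Parch columns.
passengers = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
})

# Assumed rule: the title is the first run of letters followed by a period.
passengers["Initial"] = passengers["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)

# Assumed definition: siblings/spouses + parents/children + the passenger.
passengers["FamilySize"] = passengers["SibSp"] + passengers["Parch"] + 1

print(passengers[["Initial", "FamilySize"]])
```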

Import the following libraries:

import matplotlib.pyplot as plt
import seaborn as sb
import pandas as pd

We will start directly with plotting some insightful graphs to get a gist of the problem statement. The code is mostly self-explanatory, but we will try to provide as much explanation as possible.

We will start by loading the train and test files. Change the paths below to wherever you saved the downloaded files.

Titanic_Train = pd.read_csv('/Users/nitinkamal/Downloads/titanic/train.csv')
Titanic_Test = pd.read_csv('/Users/nitinkamal/Downloads/titanic/test.csv')
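
Before plotting, it usually helps to check what was loaded. The sketch below reads a three-row stand-in for `train.csv` from a string (so it runs without the download) and prints the shape, column types, and missing-value counts. The rows only mimic the layout of the file, and `preview` is a hypothetical name; the same calls should work on `Titanic_Train`.

```python
import io
import pandas as pd

# Stand-in for train.csv: same column layout, only three illustrative rows.
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
"""

preview = pd.read_csv(io.StringIO(csv_text))

# Number of rows and columns.
print(preview.shape)

# Data type pandas inferred for each column.
print(preview.dtypes)

# Missing values per column (empty CSV fields are read as NaN).
print(preview.isnull().sum())
```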

Now we will start plotting these data points.

1. Let’s see how many people actually survived in the training set

# Pass the column with x=; recent seaborn versions reject it as a positional argument
sb.countplot(x='Survived', data=Titanic_Train)
plt.show()

Figure: count of passengers who did not survive (0) and who survived (1)
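
The bar heights in the count plot are plain value counts, so the same information is available as numbers. A minimal sketch on a made-up `Survived` column (on the real data the same calls would be applied to `Titanic_Train['Survived']`):

```python
import pandas as pd

# Hypothetical 0/1 outcomes, invented for illustration.
survived = pd.Series([0, 0, 1, 0, 1, 1, 0, 0], name="Survived")

# Number of passengers in each class of the target (what countplot draws).
counts = survived.value_counts().sort_index()
print(counts)

# Share of survivors: the mean of a 0/1 column.
survival_rate = survived.mean()
print(f"Survival rate: {survival_rate:.1%}")
```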

2. Using SibSp to compare survival across the number of siblings and spouses aboard

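# Survival rate for each SibSp value (the mean of a 0/1 column is the share of 1s)
Titanic_Train[['SibSp','Survived']].groupby(['SibSp']).mean().plot.bar()
# Survivor and non-survivor counts for each SibSp value
sb.countplot(x='SibSp', hue='Survived', data=Titanic_Train)
plt.show()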
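
The two plots show a rate and a count for each `SibSp` value. A sketch of the same aggregation as a single table, on invented rows, which makes it easier to spot groups whose rate rests on very few passengers:

```python
import pandas as pd

# Hypothetical rows; on the real data use Titanic_Train[['SibSp', 'Survived']].
df = pd.DataFrame({
    "SibSp":    [0, 0, 0, 0, 1, 1, 2],
    "Survived": [0, 1, 0, 0, 1, 1, 0],
})

# For each SibSp value: group size and survival rate (mean of the 0/1 column).
summary = df.groupby("SibSp")["Survived"].agg(passengers="count", survival_rate="mean")
print(summary)
```

In this toy table the rate for `SibSp = 2` comes from a single passenger, which is the kind of case where a bar chart of rates alone can be misleading.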

3. Check the number of survivors and non-survivors on the basis of gender

# One panel per value of Survived, with passenger counts split by Sex
sb.catplot(x='Sex', col='Survived', kind='count', data=Titanic_Train)
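
Counts alone can mislead when the groups differ in size, so it can help to look at the share of survivors within each gender. A sketch using `pd.crosstab` with `normalize='index'` on invented rows:

```python
import pandas as pd

# Hypothetical rows; on the real data use Titanic_Train['Sex'] and Titanic_Train['Survived'].
df = pd.DataFrame({
    "Sex":      ["male", "male", "male", "male", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 0],
})

# Each row sums to 1: the fraction of that gender that did / did not survive.
shares = pd.crosstab(df["Sex"], df["Survived"], normalize="index")
print(shares)
```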

4. Let’s check survival by port of embarkation, i.e. S, C and Q

sb.catplot(x='Survived', col='Embarked', kind='count', data=Titanic_Train);
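
The codes in `Embarked` stand for the ports Cherbourg (C), Queenstown (Q) and Southampton (S). A sketch, on invented rows, that maps the codes to readable names and computes the survival rate per port. The `Port` column and `port_names` mapping are hypothetical names introduced here; rows with a missing port are left out of the grouping, which is the default `groupby` behaviour.

```python
import pandas as pd

# Hypothetical rows, including one passenger with no recorded port.
df = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S", "C", None],
    "Survived": [0, 1, 0, 1, 1, 1],
})

# Map the one-letter codes to port names; unmapped or missing codes become NaN.
port_names = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
df["Port"] = df["Embarked"].map(port_names)

# Survival rate per port (groupby skips rows where Port is NaN).
rate_by_port = df.groupby("Port")["Survived"].mean()
print(rate_by_port)
```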

5. A cross tab can summarize three dimensions in one table,
e.g. survival on the basis of gender and class

pd.crosstab([Titanic_Train.Sex, Titanic_Train.Survived], Titanic_Train.Pclass, margins=True).style.background_gradient(cmap='coolwarm')
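
The styled cross tab shows counts. A complementary sketch, assuming you want rates instead, uses `pivot_table` to get the survival rate for every gender and class combination (the rows are invented for illustration):

```python
import pandas as pd

# Hypothetical rows; on the real data pass Titanic_Train instead of df.
df = pd.DataFrame({
    "Sex":      ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Pclass":   [1, 1, 3, 3, 1, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Rows: gender, columns: passenger class, cells: mean of Survived (the survival rate).
rates = df.pivot_table(index="Sex", columns="Pclass", values="Survived", aggfunc="mean")
print(rates.round(2))
```

Reading counts and rates side by side helps separate "many survivors because the group is large" from "many survivors because the group fared well".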

6. Getting the survival counts by passenger class

pd.crosstab(Titanic_Train.Pclass, Titanic_Train.Survived, margins=True).style.background_gradient(cmap='autumn_r')
