EDA in Python
A complete machine learning project can be broken down into four steps:
1. Exploratory Data Analysis – This is the first thing you do when you get a dataset. Before you start building models, you need to understand the nature of the data. EDA also makes the data easier for your audience to follow.
EDA includes visualizing the raw data, looking for correlations in the dataset and finding missing values. In short, you have to plot a lot of graphs to understand the dataset.
2. Cleaning the data – Expect to spend a large share of your time, often more than half, on cleaning the data and treating missing values. Cleaning is important because the accuracy of your model depends on how many valid data points you have.
3. Building models – Once you have clean data, you build machine learning models, visualize and check the results, and improve the model's success metric.
4. Result presentation – You have the results of the model, but they are of no use until the audience can consume them. You will again need visualizations to support the conclusions of your analysis.
We will take a dataset and build some graphs from scratch.
I will be using the Titanic dataset for the following reasons:
1. It is freely and easily available at this link – https://www.kaggle.com/c/titanic/data
2. It is clean and easy to understand
Gist of the dataset – The Titanic train dataset contains information about each passenger (age, sex, family size, etc.) and whether or not that passenger survived.
- PassengerId: ID of each passenger.
- Survived: This feature has the values 0 and 1: 0 for not survived and 1 for survived.
- Pclass: Ticket class of the passenger. There are 3 classes: 1, 2 and 3.
- Name: Name of the passenger.
- Sex: Gender of the passenger.
- Age: Age of the passenger.
- SibSp: Number of siblings and spouses the passenger had aboard.
- Parch: Number of parents and children the passenger had aboard.
- Ticket: Ticket number of the passenger.
- Fare: Fare paid by the passenger.
- Cabin: Cabin of the passenger.
- Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
- Initial: Title in the passenger's name (e.g. Mr, Mrs). This is not a column in the raw Kaggle file; it has to be derived from Name.
Import the following libraries:
import matplotlib.pyplot as plt
import seaborn as sb
import pandas as pd
We will start directly by plotting some insightful graphs to get a gist of the problem statement. The code is mostly self-explanatory, but we will try to provide as much explanation as possible.
We will start by loading the train and test files.
Titanic_Train = pd.read_csv('/Users/nitinkamal/Downloads/titanic/train.csv')
Titanic_Test = pd.read_csv('/Users/nitinkamal/Downloads/titanic/test.csv')
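Before plotting, it usually helps to check how large the data is and where values are missing. The following is a minimal sketch of that check. The tiny `sample` frame is made up for illustration (it is not real Titanic data) and `missing_summary` is a hypothetical helper name; on the real data you would pass `Titanic_Train` instead.

```python
import pandas as pd

# Tiny made-up sample that mimics a few Titanic columns (not real rows)
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, None, 26.0, 35.0, None],
    "Cabin": [None, "C85", None, None, "E46"],
    "Embarked": ["S", "C", "S", None, "Q"],
})

def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Show the columns with the most gaps first
    return summary.sort_values("missing", ascending=False)

summary = missing_summary(sample)
print(sample.shape)   # (rows, columns)
print(summary)
```

Columns with a high share of missing values are the ones that will need the most attention in the cleaning step.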
Now we will start plotting these data points.
1. Let’s see how many people actually survived in the training set
# seaborn is already imported as sb above; recent seaborn versions require the column to be passed as x=
sb.countplot(x='Survived', data=Titanic_Train)
plt.show()
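A count plot shows the split visually, but it is often handy to have the exact numbers as well. Here is a small sketch using `value_counts`. The `survived` series is made-up illustration data, so on the real data you would use `Titanic_Train['Survived']` in its place.

```python
import pandas as pd

# Made-up 0/1 values standing in for the Survived column
survived = pd.Series([0, 1, 1, 0, 0, 0, 1, 0], name="Survived")

counts = survived.value_counts()                 # absolute number per class
shares = survived.value_counts(normalize=True)   # fraction per class

print(counts)
print(shares)
```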
2. Survival by SibSp: the bar chart shows the survival rate for each SibSp value, and the count plot shows the number of survivors and non-survivors
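# Survival rate (mean of Survived) for each SibSp value
Titanic_Train[['SibSp','Survived']].groupby(['SibSp']).mean().plot.bar()
# Number of survivors and non-survivors for each SibSp value
sb.countplot(x='SibSp', hue='Survived', data=Titanic_Train)
plt.show()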
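A survival rate can be misleading when a group contains only a handful of passengers, so it can be worth looking at the group size next to the rate. The sketch below does this with a named aggregation. The `sample` frame is made up for illustration; swap in `Titanic_Train` for the real numbers.

```python
import pandas as pd

# Made-up rows standing in for the SibSp and Survived columns
sample = pd.DataFrame({
    "SibSp":    [0, 0, 0, 1, 1, 1, 1, 2],
    "Survived": [0, 1, 0, 1, 1, 0, 1, 0],
})

# For each SibSp value: how many passengers, and what fraction survived
rates = sample.groupby("SibSp")["Survived"].agg(
    passengers="count",
    survival_rate="mean",
)
print(rates)
```

Since `Survived` is coded as 0 and 1, its mean within a group is the fraction of that group that survived.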
3. Check the number of survivors and non-survivors by gender
sb.catplot(x='Sex', col='Survived', kind='count', data=Titanic_Train)
4. Let’s check survival by port of embarkation, i.e. S, C and Q
sb.catplot(x='Survived', col='Embarked', kind='count', data=Titanic_Train)
5. A cross tab can summarize three variables at once
e.g. survival by gender and passenger class
pd.crosstab([Titanic_Train.Sex, Titanic_Train.Survived], Titanic_Train.Pclass, margins=True).style.background_gradient(cmap='coolwarm')
6. Survival counts by passenger class
pd.crosstab(Titanic_Train.Pclass, Titanic_Train.Survived, margins=True).style.background_gradient(cmap='autumn_r')
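Raw counts are hard to compare when the classes have different sizes. One option is to ask `pd.crosstab` for row-wise proportions with `normalize='index'`, so each row sums to 1. This is a minimal sketch on a made-up `sample` frame (not real Titanic rows); with the real data you would pass `Titanic_Train.Pclass` and `Titanic_Train.Survived`.

```python
import pandas as pd

# Made-up rows standing in for the Pclass and Survived columns
sample = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Each row shows the share of non-survivors (0) and survivors (1) within that class
rates = pd.crosstab(sample["Pclass"], sample["Survived"], normalize="index")
print(rates)
```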