Splitting data set in Python | Python for Data Science | Day 11

Hey, Data Monk fellas!
We are back with yet another topic in Python. Today we will discuss splitting a data set in Python. Follow this tutorial to get a complete overview of this important topic in ML. Before we dive into the procedure, it helps to know why we need to split the data at all.

Basically, for most problems in ML, we train a model to learn from the given data and then test its performance on separate data that it has not seen during training. These two parts of the data are known as the training set and the test set, respectively.


Alright, let’s pick an example to see what train and test sets really mean. 

Say, you have some data which looks like this:

TOTAL AREA    FLOORS    CARPET AREA    PRICE
112           3         90             500000
230           4         100            1000000
290           3         150            2000000

*The table above is just a preview. Actual data may have 10000 rows.

We need to predict the price of certain houses in Manhattan using their features like total area, floors and carpet area.

So, we begin collecting the data. After collecting, we have the total area, floors and carpet area of different houses and the prices corresponding to these houses. This is our data set.

After this, we build a model. For this, we need to split (divide) the dataset in a certain ratio so that our model can learn from one part of the data and we can analyse its performance on the rest. So, out of the data of 10000 houses, I split the data set in such a way that 8000 rows are used for training and 2000 are used for testing. To do so, we can write a few lines of code on our own or simply use an available Python function. Let’s work through an example to understand:

Firstly, download the dataset here:

Linear_x_train.csv

https://www.dropbox.com/s/z82u1qpibiju3pw/Linear_X_Train.csv?dl=0

Linear_y_train.csv

https://www.dropbox.com/s/20m1w0yummyskrc/Linear_Y_Train.csv?dl=0

When your dataset is downloaded, import pandas and load the first file, Linear_x_train.csv, as a Pandas dataframe.
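Here is a minimal sketch of this step. It assumes the file was saved in the current working directory under the name used in the download link, Linear_X_Train.csv. The helper name load_features is our own and is only for illustration.

```python
from pathlib import Path

import pandas as pd


def load_features(path="Linear_X_Train.csv"):
    """Read a CSV file into a pandas DataFrame (hypothetical helper)."""
    return pd.read_csv(path)


# Assumption: the downloaded file sits next to this script or notebook.
# The check keeps the snippet from failing if the file is somewhere else.
if Path("Linear_X_Train.csv").exists():
    x = load_features()
    print(x.head())  # preview the first five rows
```

If the file lives elsewhere, pass its full path to load_features instead.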

For the purpose of demonstrating the splitting process, we have taken a sample dataset. It consists of 3750 rows and 1 column, so its shape is (3750, 1).
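The shape can be checked with the dataframe's shape attribute. The sketch below uses a synthetic stand-in for the downloaded file so that it runs on its own. The column name x and the random values are assumptions; only the shape matches the data described here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for Linear_x_train.csv: 3750 rows, 1 column.
rng = np.random.default_rng(0)
x = pd.DataFrame({"x": rng.normal(size=3750)})

# shape is a (rows, columns) tuple
print(x.shape)  # (3750, 1)
```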

Now, repeat the steps above with the second file i.e. Linear_y_train.csv

In y, we have the labels corresponding to the x values.

Let’s come to the main point now. First off, we will show you how to split this dataset into training and testing data using two techniques:

  • Custom
  • Using sklearn

Method 1

Suppose I wish to use 70% of the data set for training my model and 30% of the data for testing it. I define the train set size as 70% of the dataset size and then slice the data such that:

  • 70% rows come into train set
  • Rest, 30% come into test set

So, train set size = 2625 and test set size = 1125.

Then, we repeat the same steps with the set of labels.
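The steps of Method 1 can be sketched as follows. This is not the original code from the tutorial, just one way to write the custom split. The data is a synthetic stand-in with the same number of rows (the column names and values are assumptions).

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the two downloaded files (assumption).
rng = np.random.default_rng(0)
x = pd.DataFrame({"x": rng.normal(size=3750)})
y = pd.DataFrame({"y": 3.0 * x["x"] + rng.normal(size=3750)})

# 70% of the rows go to the train set, the rest to the test set.
train_size = int(0.7 * len(x))

# Slice by position: first train_size rows for training, remainder for testing.
x_train, x_test = x.iloc[:train_size], x.iloc[train_size:]

# Same slicing for the labels, so each label stays with its row of x.
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]

print(x_train.shape, x_test.shape)  # (2625, 1) (1125, 1)
print(y_train.shape, y_test.shape)  # (2625, 1) (1125, 1)
```

Note that plain slicing keeps the rows in their original order. If the file is sorted in some way, it is usually a good idea to shuffle the rows before slicing.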

Method 2

Now, we will use the train_test_split function from the sklearn library, which reduces all the code above to just a single line!

Note that we split the dataset into train and test sets by passing the dataset x and y as the first two arguments, followed by a test size of 30%, which implies that the train set size is 70%. We also specify the random state, a parameter of train_test_split that fixes the seed used for shuffling the data, so the same split is produced on every run.
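A minimal sketch of Method 2 is given here. The data is again a synthetic stand-in with 3750 rows, and the value 42 for random_state is an arbitrary choice of ours, not something taken from the original code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the two downloaded files (assumption).
rng = np.random.default_rng(0)
x = pd.DataFrame({"x": rng.normal(size=3750)})
y = pd.DataFrame({"y": 3.0 * x["x"] + rng.normal(size=3750)})

# One call splits x and y together: 30% test, 70% train.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42
)

print(x_train.shape, x_test.shape)  # (2625, 1) (1125, 1)
print(y_train.shape, y_test.shape)  # (2625, 1) (1125, 1)
```

Unlike the slicing approach, train_test_split shuffles the rows by default before splitting, so the rows that land in each set will generally differ from those of a plain slice even though the sizes match.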

As can be observed, we get train set size = 2625 and test set size = 1125, which are exactly the same sizes as with the custom code we created above. But see how our task is reduced to a single line that does everything, and we no longer need to split the x data and the labels separately. This is the power of libraries!
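To see what the random state does, here is a small sketch on a toy array of 10 rows (the toy data and the seed value 0 are our own choices). It checks two things: calling train_test_split twice with the same random_state gives the same split, and every label stays paired with its own row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: each label is simply 10 times its feature value.
x = np.arange(10).reshape(-1, 1)
y = np.arange(10) * 10

# Two calls with the same seed.
a_train, a_test, ya_train, ya_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)
b_train, b_test, yb_train, yb_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

# Same seed, same rows in the test set.
same_split = np.array_equal(a_test, b_test)

# Labels still match their rows after shuffling.
aligned = np.array_equal(a_train.ravel() * 10, ya_train)

print(same_split, aligned)  # True True
```

Leaving random_state unset generally produces a different shuffle on each run, which makes results harder to compare between runs.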

Alright data monks! We highly recommend that you practice train test splitting on a new dataset, using both the methods we discussed today. Test your knowledge here:

https://thedatamonk.com/question/try-splitting/

See you with a fresh topic next time. Goodbye!


Article By,
Yashika Khurana
Intern | The Data Monk
