Data set and type of data sets for modeling – TheDataMonk



Q.) What is a data set?
A.) A data set is the complete body of data you use for your project. It often combines data from multiple databases and tables. For modeling, the data set is usually divided into two parts: a train dataset and a test dataset.

Q.) What is a train dataset?
A.) When you build a model, you use one part of the data set to train it. The train dataset provides the examples the model learns from, so that it behaves consistently on similar data.

For example, if you have restaurant data for the last 13 months, that is your complete data set. You can divide it in a roughly 80:20 ratio and take about the first 11 months of data to train your model. You build the model on this training portion only.
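The split described above can be sketched in a few lines of Python. This is only an illustration: the sales figures are made up, `chronological_split` is a hypothetical helper, and rounding the split point up is an assumption made so that 13 months land on 11 for training and 2 for testing.

```python
import math

# Hypothetical monthly sales for 13 months, oldest first (made-up numbers).
monthly_sales = [120, 135, 128, 140, 152, 149, 160, 158, 171, 165, 180, 176, 190]


def chronological_split(records, train_fraction=0.8):
    """Split time-ordered records: earliest part for training, latest for testing."""
    # Rounding up is an assumption; 13 * 0.8 = 10.4, which becomes 11 months.
    split_index = math.ceil(len(records) * train_fraction)
    return records[:split_index], records[split_index:]


train, test = chronological_split(monthly_sales)
print(len(train), "months for training,", len(test), "months for testing")
```

Because the data is ordered by time, the split keeps the most recent months for testing instead of picking rows at random.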

Q.) What is a test dataset?
A.) The remaining data, roughly 20%, is used to test your model before it goes into real-time use. Continuing the example above, the last 2 of the 13 months are not used in training. Whatever the model predicts for those 2 months is compared against the already known values to check how effective the algorithm is.
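To make the testing step concrete, here is a minimal sketch. It assumes a deliberately naive "model" that always predicts the average of the training months, and it uses mean absolute error as the measure of effectiveness. The numbers, the helper names, and the choice of metric are all assumptions for illustration, not a recommended forecasting method.

```python
# Hypothetical data: first 11 months for training, last 2 months held out.
train = [120, 135, 128, 140, 152, 149, 160, 158, 171, 165, 180]
test = [176, 190]


def fit_mean_model(train_values):
    """A naive baseline 'model': always predict the training average."""
    return sum(train_values) / len(train_values)


def mean_absolute_error(actual, predicted):
    """Average absolute gap between known values and predictions."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


predicted_value = fit_mean_model(train)
predictions = [predicted_value] * len(test)

# The test months were never seen during training, so this error is an
# honest estimate of how the model might do on new data.
mae = mean_absolute_error(test, predictions)
print("Mean absolute error on the held-out months:", round(mae, 2))
```

A lower error on the held-out months suggests a more effective model. Comparing several candidate models on the same test months is a common way to choose between them.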

Q.) What comes under implementation of models?
A.) Implementation mainly means understanding the stakeholders' requirements and molding the model to meet the business requirement. For example, you can build a forecasting algorithm in R, but you might then have to implement it in Power BI (a business intelligence tool from Microsoft) to make it more consumable, or you might have to develop an app to meet the requirement.

Q.) Why does data cleaning play an important role?
A.) We are back to cleaning data. I once participated in a Kaggle competition that required applying different text analytics algorithms to determine the sentiment of text. I had done a similar project in the past on clean data and had the code ready. Even so, it took me almost a couple of days to clean the data and only a couple of hours to run the model.

Cleaning is important because you won’t get a good result on a dirty dataset. You might even reject a particular algorithm just because it does not show the expected result, when in fact the algorithm was fine and your unclean data was ruining the case.
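As a rough idea of what cleaning text for sentiment analysis can involve, here is a small sketch. The steps (dropping HTML tags and URLs, lowercasing, removing punctuation) are common choices and are assumed here. They are not the actual steps from the competition mentioned above, and `clean_text` is a hypothetical helper.

```python
import re


def clean_text(raw):
    """Normalize one piece of raw text before feeding it to a model."""
    text = re.sub(r"<[^>]+>", " ", raw)         # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = text.lower()                         # make case uniform
    text = re.sub(r"[^a-z0-9\s']", " ", text)   # keep letters, digits, apostrophes
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace


sample = "<p>LOVED the food!!! Visit http://example.com NOW</p>"
print(clean_text(sample))
```

Real data usually needs more than this, such as handling emojis, misspellings, or duplicate rows, which is why the cleaning step can take days.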


 

