BCG Interview Question | Data Allocation
Question
How much data will you allocate for your training, validation and test sets? How will this variation be beneficial?
Machine Learning
Answers ( 2 )
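A common rule of thumb is an 80-20 split.
Out of the total data, 80 percent is used for training and the remaining 20 percent for testing.
That 80 percent is then split again: 80 percent of it is used for training and 20 percent for validation, which works out to 64-16-20 of the total for train, validation and test.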
Or, if we have a large dataset, we can use an 80-10-10 rule, where 80 percent is used for training, 10 percent for validation and 10 percent for testing.
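To make the arithmetic of the nested 80-20 split concrete, here is a minimal sketch using only the Python standard library. The function name three_way_split and its parameters are made up for illustration, and it assumes the data fits in memory and can be shuffled freely (no time ordering or grouping).

```python
import random

def three_way_split(items, test_frac=0.2, val_frac_of_rest=0.2, seed=0):
    """Shuffle, hold out a test set, then carve validation out of the remainder."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible

    # First split: hold out the test set from the full data.
    n_test = round(len(shuffled) * test_frac)
    test, rest = shuffled[:n_test], shuffled[n_test:]

    # Second split: take the validation set out of what is left.
    n_val = round(len(rest) * val_frac_of_rest)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))  # 640 160 200
```

With 1000 examples this gives 640 for training, 160 for validation and 200 for testing. Changing the two fractions to 0.1 and about 0.111 would give roughly an 80-10-10 split instead.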
A commonly used split is 60-20-20 for train, validation and test.
Training Dataset: The sample of data used to fit the model.
Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters.
Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
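The three roles above can be sketched in a few lines: fit on the training set, pick a hyperparameter on the validation set, and report the error once on the test set. This is only an illustration. The synthetic data, the tiny k-nearest-neighbour regressor and the names knn_predict, mse and candidate_ks are all assumptions, not part of the original answers.

```python
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error on `data` of a k-NN model fit on `train`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

# Synthetic data (assumed for illustration): y = x^2 plus Gaussian noise.
rng = random.Random(0)
points = []
for _ in range(300):
    x = rng.uniform(-1, 1)
    points.append((x, x * x + rng.gauss(0, 0.1)))

# 60-20-20 split; the points were generated randomly, so slicing is enough here.
train, val, test = points[:180], points[180:240], points[240:]

# The validation set is used only to choose the hyperparameter k.
candidate_ks = [1, 3, 5, 9, 15]
best_k = min(candidate_ks, key=lambda k: mse(train, val, k))

# The test set is touched once, after k is fixed, to estimate final performance.
test_error = mse(train, test, best_k)
print(best_k, round(test_error, 4))
```

Because k is chosen to minimise the validation error, that validation error is an optimistic estimate for the chosen model, which is why the final figure is taken from the untouched test set.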
However, the split ratio should depend on the size of your data and the model you are trying to train.
If there is a large number of hyperparameters to tune, you may want to increase the size of the validation set.
The training set should not be too small, or the model will not learn enough from the data.
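One rough way to see why the ratio depends on dataset size is to look at how noisy an accuracy estimate is for a given number of held-out examples. The sketch below uses the standard binomial standard error, sqrt(p(1-p)/n), and assumes the held-out examples are independent. The function name accuracy_std_error and the 90 percent accuracy figure are made up for illustration.

```python
import math

def accuracy_std_error(accuracy, n_examples):
    """Binomial standard error of an accuracy measured on n_examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n_examples)

# Hypothetical model with 90 percent accuracy, evaluated on held-out sets of different sizes.
for n in (100, 1_000, 10_000):
    print(n, round(accuracy_std_error(0.9, n), 4))
```

The standard error falls from about 0.03 with 100 examples to about 0.003 with 10,000. So on a very large dataset even a small percentage gives a stable validation or test estimate, while on a small dataset a larger share is needed to compare hyperparameter settings reliably.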