How do you ensure you’re not overfitting with a model?
Question
Explain methods to avoid overfitting
Answers (4)
This is a restatement of a fundamental problem in machine learning: a model can overfit the training data, learning its noise and therefore generalizing poorly to unseen test data.
There are three main methods to avoid overfitting:
1. Keep the model simpler: reduce variance by using fewer variables and parameters, so the model captures less of the noise in the training data.
2. Use cross-validation techniques such as k-fold cross-validation.
3. Use regularization techniques such as LASSO, which penalize model parameters that are likely to cause overfitting.
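The k-fold idea in point 2 can be sketched in plain NumPy (the helper names `kfold_scores`, `ols_fit`, and `neg_mse` are hypothetical, chosen for this illustration):

```python
import numpy as np

def kfold_scores(X, y, fit, score, k=5, seed=0):
    """Shuffle the rows into k folds; train on k-1 folds, score on the held-out one."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        scores.append(score(model, X[val], y[val]))
    return np.array(scores)

# Example model: ordinary least squares, scored by negative MSE.
ols_fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
neg_mse = lambda w, X, y: -np.mean((X @ w - y) ** 2)
```

Because every row is held out exactly once, a model that merely memorizes the training folds gets a poor average score.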
Initially, I split the data into three parts: training, validation, and test. Then I choose a valid cross-validation strategy, such as KFold, StratifiedKFold, GroupKFold, or TimeSeriesSplit, depending on the objective, train my model on some folds, and validate on the held-out fold. I calculate the evaluation score for each fold, then the CV score (the mean of the evaluation scores over all folds) and its standard deviation. If the standard deviation is too large, or if testing the model on the test data gives a score outside my confidence interval (cv_score ± std), then I am overfitting the model.
The cause might be a large number of features, or irrelevant or redundant features.
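The cv_score ± std check described above can be sketched as a small helper (the name `looks_overfit` and the one-std band are illustrative choices, not a standard API):

```python
import numpy as np

def looks_overfit(fold_scores, test_score):
    """Flag overfitting when the test score falls outside cv_score ± one std
    of the per-fold scores (a simple, assumed confidence band)."""
    cv_score = np.mean(fold_scores)
    std = np.std(fold_scores)
    return not (cv_score - std <= test_score <= cv_score + std)
```

A test score well below the band suggests the model fit noise in the training folds; a score far above it suggests leakage or an unrepresentative split.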
Ways to avoid overfitting:
– Use a simpler model, like linear regression.
– Use ensembling of multiple models.
– Use ridge and lasso regression (regularization) techniques.
– Perform k-fold or stratified k-fold cross-validation.
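The ridge technique from the list has a closed form that is easy to sketch in NumPy (a minimal illustration, not a production implementation; `ridge_fit` is a hypothetical name):

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: the L2 penalty alpha * ||w||^2 shrinks the
    weights toward zero, trading a little bias for lower variance."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)
```

Raising `alpha` shrinks the weight vector, which is exactly the variance-reduction effect the bullet points describe (lasso does the same with an L1 penalty, which also drives some weights exactly to zero).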
1) You can introduce a little bit of bias into the model, for example by choosing a simple model like linear regression, so that you reduce the variance.
2) You can use k-fold cross-validation instead of just splitting your data into one single train and test set.
3) You can use regularization techniques like ridge and lasso, which make the model less sensitive to changes in the independent variables.
4) You can use ensemble methods like random forests, which average out the results of many different trees.
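The averaging idea in point 4 can be sketched with bootstrap aggregation (bagging). Here simple least-squares fits stand in for the trees a random forest would use, and `bagged_predict` is a hypothetical helper name:

```python
import numpy as np

def bagged_predict(X_train, y_train, X_test, n_models=25, seed=0):
    """Fit one model per bootstrap resample of the training data and average
    their predictions, smoothing out the variance of any single fit."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(n_models):
        rows = rng.integers(0, n, size=n)  # resample rows with replacement
        w = np.linalg.lstsq(X_train[rows], y_train[rows], rcond=None)[0]
        preds.append(X_test @ w)
    return np.mean(preds, axis=0)
```

Each resampled model overfits its own bootstrap sample differently, so the errors partially cancel when averaged.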
– By using cross-validation techniques
– By reducing variance and adding some bias to the model
– By using regularization techniques like L1 and L2
– By checking the accuracy score on an unseen test dataset
– By splitting the data into train, validation, and test sets
– Etc.
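The train/validation/test split mentioned in this and earlier answers can be sketched as follows (the function name and the 70/15/15 fractions are illustrative assumptions):

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then carve off disjoint test and validation sets; the test
    set is touched only at the very end, as an unseen check for overfitting."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test, val, train = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```

Tuning decisions are made against the validation set; the test set gives the final, honest estimate of generalization.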