PhonePe Interview Question | Overfitting
Question
Your model is suffering from the problem of overfitting. Suggest some ways by which you can avoid it.
Machine Learning
Answer ( 1 )
Some ways by which overfitting can be avoided –
Cross-validation
Cross-validation is a powerful preventative measure against overfitting. The idea is to use your initial training data to generate multiple small train-test splits, then use these splits to tune your model.
In standard k-fold cross-validation, we partition the data into k subsets, called folds. Then, we iteratively train the algorithm on k-1 folds while using the remaining fold as the test set, so that each fold is held out exactly once.
Cross-validation allows you to tune hyperparameters with only your original training set. This allows you to keep your test set as a truly unseen dataset for selecting your final model.
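As a rough illustration of the k-fold procedure described above, here is a minimal sketch in plain Python. The helper names (`k_fold_indices`, `cross_val_mse`) and the toy "predict the mean" model are assumptions made up for this example, not part of any library. The folds are contiguous and unshuffled to keep the code short; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def cross_val_mse(xs, ys, k, fit, predict):
    """Average held-out mean squared error over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        # Train on k-1 folds, evaluate on the fold that was left out.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in val_idx]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / len(scores)


# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


xs = list(range(10))
ys = [2.0 * x for x in xs]
cv_score = cross_val_mse(xs, ys, k=5, fit=fit_mean, predict=predict_mean)
print("cross-validated MSE:", cv_score)
```

To tune a hyperparameter, you would compute this score once per candidate value and keep the value with the lowest average error, leaving the real test set untouched.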
Train with more data
It won’t work every time, but training with more data can help algorithms detect the signal better.
If we just add more noisy data, this technique won’t help. That’s why you should always ensure your data is clean and relevant.
Remove features
Some algorithms have built-in feature selection.
For those that don’t, you can manually improve their generalizability by removing irrelevant input features.
An interesting way to do so is to tell a story about how each feature fits into the model.
If a feature doesn’t make sense in that story, or if it’s hard to justify, it is a good candidate for removal.
In addition, there are several feature selection heuristics you can use for a good starting point.
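One simple heuristic of this kind is to drop features that barely vary, since a near-constant column cannot help separate examples. The sketch below is only one possible starting point; the function names, the threshold, and the toy column names are assumptions made for illustration.

```python
def variance(values):
    """Population variance of a sequence of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def drop_low_variance(rows, names, threshold=0.0):
    """Keep only the columns whose variance is strictly above the threshold."""
    columns = list(zip(*rows))  # transpose rows into columns
    keep = [j for j, col in enumerate(columns) if variance(col) > threshold]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_rows, kept_names


# Hypothetical toy data: the first column is constant and carries no signal.
rows = [
    [1.0, 5.0, 0.2],
    [1.0, 3.0, 0.4],
    [1.0, 8.0, 0.1],
]
names = ["constant_flag", "age", "ratio"]
kept_rows, kept_names = drop_low_variance(rows, names)
print(kept_names)
```

A variance filter ignores the target entirely, so it only catches obviously useless columns. Judging relevance to the target still needs the kind of reasoning described above.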
Early stopping
When you’re training a learning algorithm iteratively, you can measure how well each iteration of the model performs.
Up until a certain number of iterations, new iterations improve the model. After that point, however, the model’s ability to generalize can weaken as it begins to overfit the training data.
Early stopping refers to stopping the training process before the learner passes that point.
Today, this technique is mostly used in deep learning while other techniques (e.g. regularization) are preferred for classical machine learning.
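A common way to implement this is with a "patience" counter: stop once the validation loss has failed to improve for a set number of consecutive iterations. The sketch below assumes the per-epoch validation losses are already available as a list (in a real training loop they would be computed one epoch at a time). The function name and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the loss has not improved for `patience`
    consecutive epochs; the model from `best_epoch` is the one to keep.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Never triggered: training ran through every available epoch.
    return best_epoch, len(val_losses) - 1


# Hypothetical losses: they improve up to epoch 3, then start rising.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(best_epoch, stop_epoch)  # prints "3 5"
```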
Regularization
Regularization refers to a broad range of techniques for artificially forcing your model to be simpler.
The method will depend on the type of learner you’re using. For example, you could prune a decision tree, use dropout on a neural network, or add a penalty term to the cost function in regression.
Oftentimes, the regularization method is a hyperparameter as well, which means it can be tuned through cross-validation.
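To make the regression case concrete, here is a minimal sketch of an L2 (ridge) penalty for a one-feature linear model without an intercept. Minimising sum((y - w*x)^2) + penalty * w^2 has the closed-form solution w = sum(x*y) / (sum(x*x) + penalty), so a larger penalty shrinks the slope toward zero. The function name and data are assumptions made for this example.

```python
def ridge_slope(xs, ys, penalty):
    """Slope w minimising sum((y - w*x)**2) + penalty * w**2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


# Hypothetical data lying roughly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# penalty = 0.0 is ordinary least squares; larger values give a simpler model.
for penalty in (0.0, 1.0, 10.0, 100.0):
    print(penalty, round(ridge_slope(xs, ys, penalty), 3))
```

The penalty strength here plays the role of the hyperparameter mentioned above: it is typically chosen by comparing cross-validated error across several candidate values.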
Ensembling
Ensembles are machine learning methods for combining predictions from multiple separate models. There are a few different methods for ensembling, but the two most common are:
Bagging attempts to reduce the chance of overfitting complex models.
It trains a large number of “strong” learners in parallel.
A strong learner is a model that’s relatively unconstrained.
Bagging then combines all the strong learners together in order to “smooth out” their predictions.
Boosting attempts to improve the predictive flexibility of simple models.
It trains a large number of “weak” learners in sequence.
A weak learner is a constrained model (e.g. you could limit the max depth of each decision tree).
Each one in the sequence focuses on learning from the mistakes of the one before it.
Boosting then combines all the weak learners into a single strong learner.
While bagging and boosting are both ensemble methods, they approach the problem from opposite directions.
Bagging uses complex base models and tries to “smooth out” their predictions, while boosting uses simple base models and tries to “boost” their aggregate complexity.
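The bagging half of this picture can be sketched in a few lines of plain Python. The example uses a 1-nearest-neighbour regressor as the unconstrained "strong" learner, fits many copies on bootstrap resamples (sampling with replacement), and averages their predictions. The choice of base learner, the function names, and the data are assumptions made for illustration; boosting is not shown here.

```python
import random


def fit_1nn(xs, ys):
    """A deliberately unconstrained learner: memorise the training points."""
    return list(zip(xs, ys))


def predict_1nn(model, x):
    # Return the target of the closest memorised point.
    return min(model, key=lambda point: abs(point[0] - x))[1]


def bagged_predict(xs, ys, x_new, n_models=50, seed=0):
    """Average the predictions of 1-NN models fit on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        model = fit_1nn([xs[i] for i in idx], [ys[i] for i in idx])
        predictions.append(predict_1nn(model, x_new))
    return sum(predictions) / len(predictions)


# Hypothetical data: roughly y = x plus noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.3, 2.8, 4.2, 4.9]

single = predict_1nn(fit_1nn(xs, ys), 2.0)  # reproduces the noisy point exactly
bagged = bagged_predict(xs, ys, 2.0)        # averaged over resampled models
print(single, bagged)
```

A single 1-NN model returns the memorised noisy target at x = 2.0, while the bagged prediction also draws on neighbouring points whenever a resample happens to leave that point out, which is the "smoothing" effect described above.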