Data Science Model with high accuracy in training dataset but low in testing dataset

What does it mean when I say “The model has high accuracy in the training dataset but low accuracy in the testing dataset”?
Data Science model interview question

Answer by Swapnil

It means the model has fitted the noise in the training data and is trying to match that data exactly rather than generalizing well to other datasets. In technical terms, this is called overfitting. The model suffers from high variance: its predictions depend heavily on the particular training sample, so its error on the test set is high. The solution is to introduce a little bias into the model (for example, through regularization or a simpler model) so that the variance, and with it the test error, goes down.

Answer by Shubham Bhatt

“The model has high accuracy in Training dataset but low in testing dataset” means overfitting.

When a model is trained too closely on its data, it starts learning from the noise and inaccurate entries in the dataset. The model then fails to classify new data correctly, because it has picked up too many details and too much noise. Overfitting is most likely with non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model from the dataset and can therefore build unrealistic models. One way to avoid overfitting is to use a linear algorithm if the data is linear, or to constrain parameters such as the maximum depth if we are using decision trees.
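To make the point about decision-tree depth concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the 1-D toy data with 20% flipped labels, the tiny hand-written tree, and the helper names (`make_data`, `build_tree`, `accuracy`). In practice you would more likely reach for a library tree and its depth parameter. An unrestricted tree keeps splitting until it has memorized every noisy label, while a depth-1 tree can only learn a single threshold.

```python
import random

random.seed(42)

def make_data(n, noise=0.2):
    """1-D toy data: label is 1 when x > 0.5, then a fraction of labels is flipped."""
    data = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < noise:
            y = 1 - y  # label noise that a good model should not memorize
        data.append((x, y))
    return data

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def build_tree(points, max_depth=None, depth=0):
    """Return a leaf label (int) or a (threshold, left, right) tuple."""
    labels = [y for _, y in points]
    majority = int(2 * sum(labels) >= len(labels))
    xs = sorted({x for x, _ in points})
    if depth == max_depth or len(set(labels)) == 1 or len(xs) < 2:
        return majority

    def weighted_gini(t):
        left = [y for x, y in points if x <= t]
        right = [y for x, y in points if x > t]
        return len(left) * gini(left) + len(right) * gini(right)

    # Candidate thresholds exclude the largest x so both sides are non-empty.
    t = min(xs[:-1], key=weighted_gini)
    left = build_tree([p for p in points if p[0] <= t], max_depth, depth + 1)
    right = build_tree([p for p in points if p[0] > t], max_depth, depth + 1)
    return (t, left, right)

def predict(tree, x):
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x <= t else right
    return tree

def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)

train, test = make_data(200), make_data(1000)
deep = build_tree(train)                 # no depth limit: splits until leaves are pure
shallow = build_tree(train, max_depth=1) # a single split

deep_train_acc, deep_test_acc = accuracy(deep, train), accuracy(deep, test)
shallow_train_acc, shallow_test_acc = accuracy(shallow, train), accuracy(shallow, test)
print("deep tree    train/test:", deep_train_acc, deep_test_acc)
print("shallow tree train/test:", shallow_train_acc, shallow_test_acc)
```

The unrestricted tree reaches perfect training accuracy by construction, which is exactly the "high in training" half of the question. Comparing the two test accuracies in the printed output shows whether that memorization carried over to unseen data.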

It suggests “High variance and low bias”.

Techniques to reduce overfitting:
1. Increase the training data.
2. Reduce model complexity.
3. Early stopping during the training phase (keep an eye on the validation loss during training and stop as soon as it begins to increase).
4. Ridge (L2) and Lasso (L1) regularization.
5. Use dropout in neural networks.
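As an illustration of technique 3, the following is a rough sketch of early stopping for a linear model trained by gradient descent, written in plain Python. The toy data (20 features of which only the first carries signal), the learning rate, and the `patience` of 10 epochs are all assumptions chosen for the example, not recommended settings. The loop tracks the loss on a held-out validation set, remembers the best weights seen so far, and stops once the validation loss has failed to improve for `patience` consecutive epochs.

```python
import random

random.seed(0)
N_FEATURES = 20  # assumption: only feature 0 is informative, the rest are noise

def make_data(n):
    rows = []
    for _ in range(n):
        x = [random.gauss(0, 1) for _ in range(N_FEATURES)]
        y = 2.0 * x[0] + random.gauss(0, 1)
        rows.append((x, y))
    return rows

def mse(w, rows):
    """Mean squared error of the linear model w on rows."""
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
               for x, y in rows) / len(rows)

def gradient_step(w, rows, lr):
    """One full-batch gradient descent step on the MSE."""
    grad = [0.0] * len(w)
    for x, y in rows:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += 2 * err * xj / len(rows)
    return [wi - lr * g for wi, g in zip(w, grad)]

train, val = make_data(25), make_data(200)

w = [0.0] * N_FEATURES
best_w, best_val, best_epoch = list(w), mse(w, val), 0
patience, max_epochs = 10, 500
val_history = []

for epoch in range(1, max_epochs + 1):
    w = gradient_step(w, train, lr=0.02)
    val_loss = mse(w, val)
    val_history.append(val_loss)
    if val_loss < best_val:
        # Validation loss improved: remember these weights.
        best_w, best_val, best_epoch = list(w), val_loss, epoch
    elif epoch - best_epoch >= patience:
        # No improvement for `patience` epochs: stop training.
        break

print("stopped after epoch", len(val_history), "best epoch was", best_epoch)
print("validation MSE at best epoch:", best_val, "at last epoch:", val_history[-1])
```

The model that gets used afterwards is `best_w`, not the weights from the final epoch. Keeping a copy of the best weights is what makes the patience window safe.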

Answer by SMK – The Data Monk user

1) This is a case of overfitting. It happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
2) Overfitting is more likely with nonparametric and nonlinear models, which have more flexibility when learning a target function. For example, decision trees are a nonparametric machine learning algorithm that is very flexible and prone to overfitting the training data. We can prune a tree after it has been trained in order to remove some of the detail it has picked up.
3) Techniques to limit overfitting:
a) Use a resampling technique to estimate model accuracy
– k-fold cross-validation: We partition the data into k subsets, called folds. Then we iteratively train the algorithm on k-1 folds while using the remaining fold as the test set (called the “holdout fold”), and average the k scores.

b) Hold back a validation dataset – A validation dataset is simply a subset of your training data that you hold back from your algorithms until the very end of your project. After you have tuned your algorithms on your training data, you can evaluate the learned models on the validation dataset to get a final, objective idea of how the models might perform on unseen data.

c) Remove irrelevant input features (feature selection)

d) Early stopping: Up until a certain number of iterations, new iterations improve the model. After that point, however, the model’s ability to generalize can weaken as it begins to overfit the training data. Early stopping refers to stopping the training process before the learner passes that point. This technique is widely used in deep learning.
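The k-fold procedure from (a) can be sketched in a few lines of plain Python. The data, the threshold classifier, and the function names (`k_fold_indices`, `cross_validate`, `fit_threshold`, `predict_threshold`) are hypothetical stand-ins for whatever model is being evaluated. Only the splitting and scoring loop is the point here.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, folds, fit, predict):
    """Train on k-1 folds, score on the holdout fold, once per fold."""
    scores = []
    for holdout in folds:
        held = set(holdout)
        train_idx = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct = sum(predict(model, xs[i]) == ys[i] for i in holdout)
        scores.append(correct / len(holdout))
    return scores

# Hypothetical model: classify by a threshold halfway between the class means.
def fit_threshold(xs, ys):
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2

def predict_threshold(threshold, x):
    return int(x > threshold)

# Toy data: two overlapping 1-D classes (an assumption for this example).
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(50)] + [rng.gauss(2.5, 1.0) for _ in range(50)]
ys = [0] * 50 + [1] * 50

folds = k_fold_indices(len(xs), k=5)
scores = cross_validate(xs, ys, folds, fit_threshold, predict_threshold)
mean_score = sum(scores) / len(scores)
print("per-fold accuracy:", scores)
print("mean accuracy:", mean_score)
```

Because every observation lands in exactly one holdout fold, each data point is used for testing once and for training k-1 times. A large spread between the per-fold scores is itself a hint that the model is sensitive to the particular training sample.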

Answer by Harshit Goyal

High accuracy on the training dataset combined with low accuracy on the testing dataset is a sign of overfitting.

Overfitting is a modeling error that occurs when a function is fit too closely to a limited set of data points.

In reality, the data being studied often has some degree of error or random noise in it. Attempting to make the model conform too closely to slightly inaccurate data can therefore introduce substantial errors into the model and reduce its predictive power.
As a result, the model fails to fit additional data or predict future observations reliably.
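A small numerical sketch of "a function fit too closely to a limited set of data points", in plain Python. The underlying relationship `true_f`, the noise level, and the eight training points are assumptions made up for the example. A degree-7 polynomial is forced through all eight noisy points, and a straight line is fitted by least squares for comparison.

```python
import random

random.seed(1)

def true_f(x):
    return 2.0 * x + 1.0  # assumed underlying relationship for this toy example

def noisy_sample(xs):
    return [true_f(x) + random.gauss(0, 0.3) for x in xs]

train_x = [i / 7 for i in range(8)]
train_y = noisy_sample(train_x)
test_x = [(i + 0.5) / 7 for i in range(7)]  # points between the training points
test_y = noisy_sample(test_x)

def interpolate(x):
    """Degree-7 Lagrange polynomial passing through every training point."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(train_x, train_y)):
        basis = 1.0
        for j, xj in enumerate(train_x):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = fit_line(train_x, train_y)

def line(x):
    return slope * x + intercept

def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

poly_train, poly_test = mse(interpolate, train_x, train_y), mse(interpolate, test_x, test_y)
line_train, line_test = mse(line, train_x, train_y), mse(line, test_x, test_y)
print("polynomial train/test MSE:", poly_train, poly_test)
print("line       train/test MSE:", line_train, line_test)
```

The interpolating polynomial has zero training error because it reproduces the noise in each training point exactly, while the line accepts some training error. The printed test errors show how each fit behaves on points it has not seen.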

Nitin Kamal
Co-Founder | The Data Monk
