BCG Interview Question | Data Distribution
Question
What issues can arise if the distribution of the test data differs significantly from the distribution of the training data?
Machine Learning
Answers ( 2 )
Some of the issues can be:
1. Covariate shift: the training and test inputs follow different distributions, but the functional relationship between inputs and outputs remains unchanged.
2. Sample selection bias: the training examples were obtained through a biased method, such as non-uniform selection, so they do not represent the population the model is tested on.
3. Non-stationary environments: the training environment differs from the test environment because of a temporal or spatial change. Typical scenarios are adversarial classification problems, such as spam filtering and network intrusion detection, where the data changes after the model is trained.
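One rough way to check for the covariate shift described in item 1 is to compare the empirical distribution of a single feature in the training set and in the test set. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) using only the standard library. The Gaussian data, the sample sizes, and the function name `ks_statistic` are assumptions made for illustration, not part of the original answer.

```python
import random
from bisect import bisect_right


def ks_statistic(train_values, test_values):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a = sorted(train_values)
    b = sorted(test_values)
    max_gap = 0.0
    # The largest gap is always reached at one of the observed data points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of training values <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of test values <= x
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical feature values: one test set matches training, one is shifted.
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(2000)]
test_same = [rng.gauss(0.0, 1.0) for _ in range(2000)]
test_shifted = [rng.gauss(1.5, 1.0) for _ in range(2000)]

ks_same = ks_statistic(train, test_same)
ks_shifted = ks_statistic(train, test_shifted)
print(f"same distribution:    KS = {ks_same:.3f}")
print(f"shifted distribution: KS = {ks_shifted:.3f}")
```

A statistic near 0 suggests the feature is distributed similarly in both sets, while a value approaching 1 suggests a strong shift. A library routine such as `scipy.stats.ks_2samp` would also report a p-value. Note that this check looks at one feature at a time and would miss shifts that only show up in the joint distribution.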
It would be difficult to gauge the performance of the model: a score measured on data that resembles the training set says little about how the model will behave on test samples that are very different from it.
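A small simulation can illustrate why performance is hard to gauge in this situation. In the sketch below, a straight line is fitted to a fixed target function using inputs from one range, then scored on held-out inputs from the same range and on inputs from a shifted range. The target function `x * x`, the input ranges, and all names are hypothetical choices made for the example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def target(x):
    # The input-output relationship is the same for training and test data.
    return x * x


rng = random.Random(1)
train_x = [rng.uniform(0.0, 1.0) for _ in range(500)]
val_x = [rng.uniform(0.0, 1.0) for _ in range(500)]      # same range as training
shifted_x = [rng.uniform(2.0, 3.0) for _ in range(500)]  # shifted inputs

model = fit_line(train_x, [target(x) for x in train_x])
val_mse = mse(model, val_x, [target(x) for x in val_x])
shifted_mse = mse(model, shifted_x, [target(x) for x in shifted_x])
print(f"MSE on held-out data from the training range: {val_mse:.4f}")
print(f"MSE on data from the shifted range:           {shifted_mse:.4f}")
```

The held-out error from the training range looks small, yet the error on the shifted range is far larger, even though the underlying function never changed. A validation score computed on training-like data would therefore overstate how well the model does on the shifted test set.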