Feature Selection

Question

Two people, A and B, train an algorithm on the same dataset. A first selects the 10 most important features from the whole dataset using an algorithm C, then randomly splits the dataset into training and test sets. B first randomly splits the dataset, then selects the 10 most important features from the training set only, using the same algorithm C, and then trains and tests the model. Both report a relevant metric D on their respective test sets. Whose value of D is more reliable?

Asked by demogorgon, 4 years ago

Answer ( 1 )

  1. Assuming both train with the same hyperparameters and the same training time, and that the dataset is not too small, B's result is more reliable. In machine learning we try to learn a general function from a sampled dataset and evaluate it on completely new data, so the test set should not influence training in any way. A's feature selection is run on the whole dataset, including the rows that later become the test set, so information about the test set leaks into the choice of features. The metric A reports on the test set will therefore tend to be optimistically biased: it overestimates how well the model generalizes and wrongly rewards the algorithm.
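The effect is easiest to see on data with no real signal at all. The sketch below is only an illustration under stated assumptions, not the setup from the question: the features are pure Gaussian noise, a simple class-mean gap stands in for Algorithm C, a nearest-centroid classifier stands in for the trained algorithm, and accuracy stands in for metric D. All function names are hypothetical. Since the labels are unrelated to the features, an honest estimate should hover around chance (0.5).

```python
import random

def top_k_features(X, y, k):
    """Hypothetical stand-in for Algorithm C: rank features by the
    absolute gap between the two class means and keep the top k."""
    ones = [row for row, label in zip(X, y) if label == 1]
    zeros = [row for row, label in zip(X, y) if label == 0]

    def gap(j):
        mean1 = sum(r[j] for r in ones) / len(ones)
        mean0 = sum(r[j] for r in zeros) / len(zeros)
        return abs(mean1 - mean0)

    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def centroid_accuracy(X_tr, y_tr, X_te, y_te, feats):
    """Nearest-centroid classifier on the selected features; returns test accuracy."""
    def centroid(label):
        rows = [r for r, l in zip(X_tr, y_tr) if l == label]
        return [sum(r[j] for r in rows) / len(rows) for j in feats]

    c0, c1 = centroid(0), centroid(1)
    correct = 0
    for row, label in zip(X_te, y_te):
        d0 = sum((row[j] - c) ** 2 for j, c in zip(feats, c0))
        d1 = sum((row[j] - c) ** 2 for j, c in zip(feats, c1))
        predicted = 1 if d1 < d0 else 0
        correct += int(predicted == label)
    return correct / len(y_te)

def split_indices(n, test_fraction, rng):
    """Random train/test split of row indices."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(n * test_fraction)
    return idx[n_test:], idx[:n_test]

def run_a(X, y, rng, k=10):
    """A: select features on ALL rows, then split (test rows leak into selection)."""
    feats = top_k_features(X, y, k)
    tr, te = split_indices(len(y), 0.3, rng)
    return centroid_accuracy([X[i] for i in tr], [y[i] for i in tr],
                             [X[i] for i in te], [y[i] for i in te], feats)

def run_b(X, y, rng, k=10):
    """B: split first, then select features on the training rows only."""
    tr, te = split_indices(len(y), 0.3, rng)
    X_tr, y_tr = [X[i] for i in tr], [y[i] for i in tr]
    feats = top_k_features(X_tr, y_tr, k)
    return centroid_accuracy(X_tr, y_tr,
                             [X[i] for i in te], [y[i] for i in te], feats)

rng = random.Random(0)
n_samples, n_features = 60, 1000
# Pure noise: the labels carry no information about the features.
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]

n_repeats = 20
mean_a = sum(run_a(X, y, rng) for _ in range(n_repeats)) / n_repeats
mean_b = sum(run_b(X, y, rng) for _ in range(n_repeats)) / n_repeats
print(f"A (select, then split): mean accuracy {mean_a:.2f}")
print(f"B (split, then select): mean accuracy {mean_b:.2f}")
```

Under these assumptions, A's average accuracy should come out clearly above chance even though there is nothing to learn, while B's stays close to 0.5. The gap between the two is the bias introduced by selecting features before splitting.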
