While working on a data set, how do you select important variables?
Question
Answers (2)
There are various methods of feature selection. Variance thresholding, correlation, and importance taken from the model itself are a few of them to name. You can refer to the following article: https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e
To explain:
1. Variance: You can eliminate features that have very little variance, as they give hardly any insight into the prediction and can contribute to overfitting the model (a variance-filtering sketch follows this list).
2. Correlated features: There may be features that are highly correlated with one another. We can keep a few and eliminate the others, since the kept set is enough to explain the variability captured by the eliminated ones, and keeping all of the correlated features would increase the cost of dimensionality (see the correlation sketch after this list).
3. From the model itself: This method of feature selection is more time-consuming, but it gives us the important features as judged by the models themselves. Here we can fit two or three models (any larger number, depending on you) on the training set, ask each model to produce, say, the n most important features, and eliminate the others. The feature_selection module of sklearn provides two utilities, RFE and RFECV, which we can use for this (see the RFE sketch after this list).
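A minimal sketch of variance-based filtering using scikit-learn's VarianceThreshold class; the toy array and the 0.01 threshold are only illustrative choices.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the first column is constant, so it carries no information.
X = np.array([[0.0, 1.0, 2.1],
              [0.0, 1.5, 0.3],
              [0.0, 0.5, 1.7]])

# Drop every column whose variance is at or below the threshold.
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print(selector.get_support())  # [False  True  True] -> first column removed
print(X_reduced.shape)         # (3, 2)
```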
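A minimal sketch of correlation-based elimination, assuming the features sit in a pandas DataFrame; the synthetic data and the 0.9 cutoff are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = 2 * df["a"] + rng.normal(scale=0.01, size=100)  # nearly a duplicate of "a"
df["c"] = rng.normal(size=100)                            # independent feature

# Absolute pairwise correlations, upper triangle only so each pair is seen once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair whose correlation exceeds the cutoff.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)

print(to_drop)                    # ['b']
print(list(df_reduced.columns))   # ['a', 'c']
```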
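A minimal sketch of RFE and RFECV from sklearn.feature_selection, as mentioned in point 3; the synthetic dataset and the LogisticRegression estimator are placeholder assumptions, and any estimator that exposes coefficients or importances can be used.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# RFE: recursively drop the weakest feature until n_features_to_select remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = kept, larger values were eliminated earlier

# RFECV: same idea, but the number of features is chosen by cross-validation.
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), cv=5)
rfecv.fit(X, y)
print(rfecv.n_features_)  # number of features selected by cross-validation
```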
1) In R, you can use the step() function and pass your model as the parameter; step() internally builds various models and determines the predictors that go into building the best model.
2) In sklearn, feature importance is a built-in attribute of tree-based classifiers. You can plot the importance of all the features relative to the most important feature (a sketch follows this list).
3) You can also use the correlation matrix, which tells you the correlation of every variable with the target variable. You can eliminate the features that do not have a strong correlation with the target (see the correlation-with-target sketch after this list).
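A minimal sketch of plotting feature importances from a tree-based model relative to the most important feature, as described in 2); the RandomForestClassifier, the synthetic data, and the feature names are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ is exposed by tree-based estimators after fitting.
importances = pd.Series(model.feature_importances_,
                        index=[f"f{i}" for i in range(X.shape[1])])
relative = importances / importances.max()  # 1.0 for the most important feature

relative.sort_values().plot(kind="barh")
plt.xlabel("Importance relative to the top feature")
plt.tight_layout()
plt.show()
```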
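A minimal sketch of ranking features by their absolute correlation with the target, as in 3); the synthetic DataFrame and the 0.2 cutoff are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x1", "x2", "x3", "x4"])
df["target"] = 3 * df["x1"] - 2 * df["x3"] + rng.normal(scale=0.5, size=200)

# Absolute correlation of every feature with the target, strongest first.
corr_with_target = df.corr()["target"].drop("target").abs().sort_values(ascending=False)
print(corr_with_target)  # x1 and x3 should dominate

# Keep only features whose correlation with the target clears the cutoff.
selected = corr_with_target[corr_with_target > 0.2].index.tolist()
print(selected)          # e.g. ['x1', 'x3']
```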