Feature selection methods in Model Building

When we deal with Big Data, there is a high chance of having a huge number of rows and columns in our dataset.
Selecting the right group of columns is the crux of your model’s performance.
There are multiple methods for selecting columns from your table to build your model.
Feature selection methods are a group of processes that aim to reduce the number of input variables while developing a predictive model.

Why do we need to reduce the number of input variables?

When you create a Machine Learning model, it takes ‘n’ variables to read and understand the behaviour of the data.
These ‘n’ variables are your independent variables, and the output variable is your dependent variable. While creating a model, you should keep only those variables in your independent bucket which actually impact the output/dependent variable.
Example – If you are trying to predict the salary of a person, then you probably do not need their breakfast menu.
A dataset can have hundreds of columns, and your target is to pick only the relevant parameters for your model. This is why we need methods to reduce the number of input variables in the model.

Filter Methods

This involves: 

  • Linear discriminant analysis – LDA is a dimensionality-reduction technique which is used as a preprocessing step in model building, prediction, and classification problems. It works in three steps:

    First, we calculate the separability between the different classes (i.e., the distance between the means of the different classes). This is also called the between-class variance.

    Then we look at the distance between the mean of each class and the samples of that class, which is called the within-class variance.

    The third and last step is to construct the lower-dimensional space which maximizes the between-class variance and minimizes the within-class variance.

  • ANOVA – A technique which compares the means of two or more groups to check whether a numerical feature varies significantly across the classes of the target variable
  • Chi-Square – A statistical test which checks whether a categorical feature is independent of the target variable; features that show no dependence on the target can be dropped

The best analogy for selecting features is “bad data in, bad answer out.” When we’re limiting or selecting the features, it’s all about cleaning up the data coming in. 
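
Below is a minimal sketch of these three filter methods using scikit-learn (my choice of library and dataset; the post itself does not prescribe an implementation):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, chi2
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 4 numeric features, 3 classes

# ANOVA F-test: scores features by how well their means separate the classes
anova = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("ANOVA F-scores:", anova.scores_)

# Chi-square test: requires non-negative features; checks feature/target dependence
chi = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("Chi-square scores:", chi.scores_)

# LDA: projects the data onto at most (n_classes - 1) dimensions that
# maximize between-class variance and minimize within-class variance
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print("LDA-reduced shape:", X_lda.shape)  # (150, 2)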

Wrapper Methods

This involves: 

  • Forward Selection: We start with no features and test one feature at a time, keeping each addition that improves the fit, until adding more no longer helps
  • Backward Selection: We start with all the features and remove them one at a time to see which subset works better
  • Recursive Feature Elimination: Recursively fits the model, ranks the features, and eliminates the weakest ones until the desired number of features remains

Wrapper methods are very compute-intensive, and high-end machines are needed if a lot of data is analysed with a wrapper method.
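
A minimal sketch of these wrapper methods with scikit-learn (my choice of library; SequentialFeatureSelector requires scikit-learn 0.24+):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Forward selection: add one feature at a time while the CV score improves
forward = SequentialFeatureSelector(model, n_features_to_select=5,
                                    direction="forward").fit(X, y)

# Backward selection: start with all features and drop them one at a time
backward = SequentialFeatureSelector(model, n_features_to_select=5,
                                     direction="backward").fit(X, y)

# Recursive Feature Elimination: fit, rank features, drop the weakest, repeat
rfe = RFE(model, n_features_to_select=5).fit(X, y)
print("RFE kept:", rfe.support_)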

Explore these methods, and if you know of any other method, do comment below.


Data Science Model with high accuracy in training dataset but low in testing dataset

What does it mean when I say “The model has high accuracy in the training dataset but low accuracy in the testing dataset”?

Answer by Swapnil

It means the model is being trained on the noise in the data and is trying to fit the training data exactly rather than generalizing well over many different datasets. So the model suffers from high variance on the test set, and the solution is to introduce a little bias into the model so that it reduces the variance on the test set. In technical terms, this is called overfitting.

Answer by Shubham Bhatt

“The model has high accuracy in the training dataset but low in the testing dataset” means overfitting.

When a model is trained on too much detail in the data, it starts learning from the noise and the inaccurate entries in the dataset. The model then fails to categorize the data correctly because of all the noise and detail. Non-parametric and non-linear methods are common causes of overfitting, because these types of machine learning algorithms have more freedom in building the model from the dataset and can therefore build unrealistic models. Solutions to avoid overfitting include using a linear algorithm if we have linear data, or using parameters such as the maximal depth if we are using decision trees.

It suggests “High variance and low bias”.

Techniques to reduce overfitting:
1. Increase the training data.
2. Reduce model complexity.
3. Early stopping during the training phase (keep an eye on the loss during training; as soon as the loss begins to increase, stop training).
4. Ridge regularization and Lasso regularization.
5. Use dropout in neural networks to tackle overfitting.
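
As a minimal illustration of point 4 (my own sketch, not part of the original answer), Ridge (L2) and Lasso (L1) regularization shrink coefficients, trading a little bias for a large reduction in variance on unseen data:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))                # few samples, many features
y = X[:, 0] * 3.0 + rng.normal(size=60)      # only the first feature matters
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    model.fit(X_tr, y_tr)
    print(name, "train R2:", round(model.score(X_tr, y_tr), 3),
                "test R2:", round(model.score(X_te, y_te), 3))
# With 30 noisy features and only 45 training rows, plain OLS typically shows
# a much larger train/test gap than the two regularized models.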

Answer by SMK – The Data Monk user

1) This is a case of overfitting a model. It happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
2) Overfitting is more likely with non-parametric and non-linear models that have more flexibility when learning a target function. For example, decision trees are a non-parametric machine learning algorithm that is very flexible and is subject to overfitting the training data. We can prune a tree after it has learned in order to remove some of the detail it has picked up.
3) Techniques to limit overfitting:
a) Use a resampling technique to estimate model accuracy
– k-fold cross-validation: we partition the data into k subsets, called folds. Then we iteratively train the algorithm on k-1 folds while using the remaining fold as the test set (called the “holdout fold”); a short code sketch follows this list.

b) Hold back a validation dataset – A validation dataset is simply a subset of your training data that you hold back from your algorithms until the very end of your project. After you have tuned your algorithms on your training data, you can evaluate the learned models on the validation dataset to get a final objective idea of how the models might perform on unseen data

c) Remove irrelevant input features (Feature selection)

d) Early Stopping: Up until a certain number of iterations, new iterations improve the model. After that point, however, the model’s ability to generalize can weaken as it begins to overfit the training data. Early stopping refers to stopping the training process before the learner passes that point. Deep Learning uses this technique.
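
As a minimal illustration of point (a), here is a k-fold cross-validation sketch using scikit-learn (my own example; the answer names no specific library):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 5 folds: each fold serves once as the holdout set and 4 times as training data
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("Per-fold accuracy:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))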

Answer by Harshit Goyal

A model’s high accuracy on the training dataset but low accuracy on the testing dataset is due to overfitting.

Overfitting is a modeling error that occurs when a function is too closely fit to a limited set of data points.

In reality, the data often studied has some degree of error or random noise within it. Thus, attempting to make the model conform too closely to slightly inaccurate data can infect the model with substantial errors and reduce its predictive power.
Therefore, the model fails to fit additional data or predict future observations reliably.


Missing Value Treatment by mean, mode, median, and KNN Imputation | Day 5

One of the most important techniques in any Data Science model is replacing missing values with sensible numbers/values.
We can’t afford to simply remove the rows with missing values: there will be many columns, and every column might have some missing values, so removing every row with a missing value would drastically reduce the data volume. That is why we use missing value treatment.

Explore all the answers from our users – https://thedatamonk.com/question/explain-missing-value-treatment-by-meanmode-median-and-knn-imputation/


Many real-world datasets contain missing values for various reasons. These are often encoded as NaNs, blanks, or other placeholders. Training a model on a dataset with many missing values can drastically impact the machine learning model’s quality. Some libraries, such as scikit-learn’s estimators, assume that all values are numerical and hold meaningful values.

One way to handle this problem is to get rid of the observations that have missing data. However, you will risk losing data points with valuable information. A better strategy would be to impute the missing values. In other words, we need to infer those missing values from the existing part of the data.

There are three main types of missing data:
  • Missing completely at random (MCAR)
  • Missing at random (MAR)
  • Not missing at random (NMAR)

However, in this article, I will focus on 6 popular ways of imputing data in cross-sectional datasets (time-series datasets are a different story).

1- Do Nothing:

That’s an easy one. You just let the algorithm handle the missing data. Some algorithms can factor in the missing values and learn the best imputation values for the missing data based on the reduction in training loss (e.g., XGBoost). Others have the option to just ignore them (e.g., LightGBM with use_missing=false). However, some algorithms will panic and throw an error complaining about the missing values (e.g., scikit-learn’s LinearRegression). In that case, you will need to handle the missing data and clean the dataset before feeding it to the algorithm.
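
For instance, a minimal sketch of the “do nothing” route (my own illustration, assuming the xgboost package is installed): XGBoost treats NaN as missing and learns a default split direction for it, so no up-front imputation is needed.

import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.2] = np.nan   # knock out ~20% of the entries

model = XGBRegressor(n_estimators=50).fit(X, y)   # trains despite the NaNs
print(model.predict(X[:3]))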

2- Imputation Using (Mean/Median) Values:
This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values in that column with it; each column is treated separately and independently of the others. It can only be used with numeric data.
Pros:
Easy and fast.
Works well with small numerical datasets.
Cons:
Doesn’t factor in the correlations between features; it only works at the column level.
Will give poor results on encoded categorical features (do NOT use it on categorical features).
Not very accurate.
Doesn’t account for the uncertainty in the imputations.

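The original post showed this step as a code screenshot captioned “Mean/Median Imputation”; a minimal equivalent using scikit-learn’s SimpleImputer (my substitution) would be:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[7.0, 2.0], [4.0, np.nan], [np.nan, 6.0]])

# strategy can be "mean" or "median"; each column is imputed independently
imputer = SimpleImputer(strategy="median")
print(imputer.fit_transform(X))
# NaN in column 0 becomes 5.5 (median of 7 and 4); NaN in column 1 becomes 4.0
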
3- Imputation Using (Most Frequent) or (Zero/Constant) Values:
Most Frequent is another statistical strategy to impute missing values and YES!! It works with categorical features (strings or numerical representations) by replacing missing data with the most frequent values within each column.
Pros:
Works well with categorical features.
Cons:
It also doesn’t factor in the correlations between features.
It can introduce bias in the data.

Zero or constant imputation, as the name suggests, replaces the missing values with either zero or any constant value you specify.
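
The original post illustrated this with a screenshot captioned “Most Frequent Imputation”; a minimal sketch of both strategies with scikit-learn’s SimpleImputer (my substitution) follows:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"color": ["red", "red", np.nan],
                   "size":  ["S", np.nan, "M"]})

# most_frequent works for categorical (string) columns as well:
# each NaN is replaced by that column's most frequent value
print(SimpleImputer(strategy="most_frequent").fit_transform(df))

# constant imputation fills every hole with a value you choose
print(SimpleImputer(strategy="constant", fill_value="missing").fit_transform(df))
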
4- Imputation Using k-NN:
The k-nearest-neighbours algorithm is used for simple classification. The algorithm uses ‘feature similarity’ to predict the values of any new data points: a new point is assigned a value based on how closely it resembles the points in the training set. This can be very useful for predicting missing values: we find the k closest neighbours of the observation with missing data and then impute the missing entries based on the non-missing values in the neighbourhood. The original post demonstrated this with the Impyute library, which provides a simple and easy way to use KNN for imputation:

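The original snippet was a screenshot captioned “KNN Imputation for California Housing Dataset” and used Impyute; since that exact code isn’t reproduced here, below is an equivalent sketch with scikit-learn’s KNNImputer on the same dataset (my substitution, not the author’s code):

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer

X, _ = fetch_california_housing(return_X_y=True)
X = X[:2000]                             # subsample to keep the sketch fast
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan    # knock out ~10% of the entries

# each missing entry becomes the (optionally distance-weighted) average of
# the feature values of its k nearest neighbours
imputer = KNNImputer(n_neighbors=5, weights="distance")
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).sum())   # 0: every hole has been filled
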
How does it work?

Impyute’s KNN imputation creates a basic mean impute, then uses the resulting complete dataset to construct a KDTree. It then uses the KDTree to compute the nearest neighbours (NN). After it finds the k nearest neighbours, it takes the weighted average of them.
Pros:
Can be much more accurate than the mean, median or most frequent imputation methods (It depends on the dataset).
Cons:
Computationally expensive. KNN works by storing the whole training dataset in memory.

K-NN is quite sensitive to outliers in the data (unlike SVM)
5- Imputation Using Multivariate Imputation by Chained Equations (MICE):
This type of imputation works by filling in the missing data multiple times. Multiple imputations (MI) are much better than a single imputation because they measure the uncertainty of the missing values in a better way. The chained-equations approach is also very flexible: it can handle variables of different data types (i.e., continuous or binary) as well as complexities such as bounds or survey skip patterns.
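
A MICE-style sketch (my own illustration) using scikit-learn’s IterativeImputer, which is inspired by MICE but, by default, returns a single imputation rather than multiple ones:

import numpy as np
# IterativeImputer is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, np.nan], [4.0, 8.2]])

# each feature with missing values is modelled as a function of the others,
# cycling through the features for several rounds (chained equations)
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))   # the NaN is filled near 6 (y is roughly 2x here)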

6- Imputation Using Deep Learning (Datawig):
Datawig is a library that trains machine learning models using deep neural networks to impute missing values in a DataFrame. It works very well with categorical and other non-numerical features, and it supports both CPU and GPU for training.
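
A rough sketch of the Datawig workflow, based on my reading of the project’s README; treat the exact class and argument names as assumptions and check the Datawig documentation before relying on them:

import datawig
import pandas as pd

df = pd.DataFrame({"title": ["red shirt", "blue jeans", "green shirt"],
                   "color": ["red", "blue", None]})

# train a neural imputer that predicts "color" from "title"
# (SimpleImputer here is datawig's class, not scikit-learn's)
imputer = datawig.SimpleImputer(input_columns=["title"],
                                output_column="color",
                                output_path="imputer_model")
imputer.fit(train_df=df.dropna())
print(imputer.predict(df))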

The Data Monk Interview Books – Don’t Miss

Now we are also available on our website, where you can directly download the PDF of the topic you are interested in. On Amazon each book costs ~299; on our website we have put them at a 60-80% discount. There are ~4000 solved interview questions prepared for you.

10 e-book bundle with 1400 interview questions spread across SQL, Python, Statistics, Case Studies, and Machine Learning Algorithms – Ideal for candidates with 0-3 years of experience

23 e-books with ~2000 interview questions spread across AWS, SQL, Python, 10+ ML algorithms, MS Excel, and Case Studies – Complete package for someone between 0 and 8 years of experience (the above 10 e-book bundle has a completely different set of e-books)

12 E-books for 12 Machine Learning algorithms with 1000+ interview questions – For those candidates who want to include any Machine Learning Algorithm in their resume and to learn/revise the important concepts. These 12 e-books are a part of the 23 e-book package

Individual 50+ e-books on separate topics

Important Resources to crack interviews (Mostly Free)

There are a few things which might be very useful for your preparation

The Data Monk YouTube channel – Here you will find only the questions that are asked in interviews for Data Analysts, Data Scientists, Machine Learning Engineers, Business Intelligence Engineers, Analytics Managers, etc.
Go through the watchlist that makes you uncomfortable:

The full list of 200 videos
Complete Python Playlist for Data Science
Company-wise Data Science Interview Questions – Must Watch
All important Machine Learning Algorithm with code in Python
Complete Python Numpy Playlist
Complete Python Pandas Playlist
SQL Complete Playlist
Case Study and Guesstimates Complete Playlist
Complete Playlist of Statistics