Explain Missing Value Treatment by Mean, Mode, Median, and KNN Imputation
Question
What is missing value treatment?
Answers ( 16 )
Missing value treatment is no doubt one of the most important parts of the whole process of building a model. Why?
Because we can’t afford to eliminate a row whenever any of its columns has a missing value. We need to tackle it in the best possible way. There are multiple ways to deal with missing values, and these are my top four methods:
1. Mean – When should you take the average of a column? There is a saying which goes like this: “When a billionaire walks into a small bar, everyone becomes a millionaire.”
So, avoid using the mean as a missing value treatment technique when the range of values is too wide. Suppose there are 10,000 employees with a salary of Rs. 40,000 each and 100 employees with a salary of Rs. 1,00,000 each. In this case you can consider using the mean for missing value treatment.
But if there are 10 employees, with nine earning Rs. 40,000 and one earning Rs. 10,00,000, you should avoid the mean, because the single outlier drags it upward. You can use the mode instead!
2. Median – The median is the middle value when the data are arranged in ascending or descending order. Can you think of an example where you would use it? Hint: the skewed salary case above, where a single outlier distorts the mean.
3. Mode – The mode is the most frequently occurring value. As we discussed in point one, we can use the mode where values have a high chance of repeating.
4. KNN Imputation – Often the most effective way to fill a missing value: the k most similar neighbors are searched for, and the similarity of two observations is determined using a distance function.
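The salary example from point 1 can be checked numerically with Python’s standard library (the figures below are the hypothetical ones from the example):

```python
import statistics

# Hypothetical salaries from the example: nine employees at Rs. 40,000
# and one outlier at Rs. 10,00,000.
salaries = [40_000] * 9 + [1_000_000]

mean_salary = statistics.mean(salaries)      # dragged up by the single outlier
median_salary = statistics.median(salaries)  # unaffected by the outlier
mode_salary = statistics.mode(salaries)      # the most common value
```

The mean comes out at Rs. 1,36,000, far above what almost everyone earns, while the median and mode stay at Rs. 40,000.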
Many real-world datasets may contain missing values for various reasons. They are often encoded as NaNs, blanks, or other placeholders. Training a model with a dataset that has a lot of missing values can drastically impact the machine learning model’s quality. Some implementations, such as scikit-learn estimators, assume that all values are numerical and hold meaningful values.
Imputation Using (Mean/Median) Values:
This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values within each column separately and independently from the others. It can only be used with numeric data.
Pros:
Easy and fast.
Works well with small numerical datasets.
Cons:
Doesn’t factor in the correlations between features. It only works at the column level.
Will give poor results on encoded categorical features (do NOT use it on categorical features).
Not very accurate.
Doesn’t account for the uncertainty in the imputations.
Imputation Using k-NN:
The k-nearest neighbours algorithm is a simple classification method. It uses ‘feature similarity’ to predict the values of new data points: a new point is assigned a value based on how closely it resembles the points in the training set. This can be very useful in making predictions about the missing values by finding the k closest neighbours to the observation with missing data and then imputing them based on the non-missing values in the neighbourhood.
Mean method – replace missing values with the mean of that particular column
Median method – replace missing values with the median of that particular column
Mode method – replace missing values with the mode of that particular column
KNN method – use the KNN algorithm to predict the missing value with the help of its K nearest neighbors
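A minimal sketch of the first three methods with pandas (toy data; the column names are illustrative, not from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "salary": [40_000, 42_000, None, 38_000],   # numeric column
    "city": ["Pune", None, "Pune", "Delhi"],    # categorical column
})

# Mean imputation (use .median() instead for skewed data).
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Mode imputation: .mode() can return several values, so take the first.
df["city"] = df["city"].fillna(df["city"].mode()[0])
```

For the KNN method on numeric columns, scikit-learn ships a ready-made `KNNImputer`.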
Do we have any other ways/methods to treat missing values besides mean, median, mode, and the KNN imputer?
Missing value imputation
1. Mean – When your numerical data has missing values and the values are roughly normally distributed, replace the missing values with the mean of that numerical attribute.
2. Median – When your numerical data is skewed (left or right), imputing with the mean can distort the distribution further; impute with the median value of the attribute instead.
3. Mode – Mainly used for categorical and ordinal variables; when missing values are encountered, replace them with the modal value, i.e. the most frequently occurring value (highest frequency).
4. KNN Imputation – KNN is an algorithm that is useful for matching a point with its closest k neighbors in a multi-dimensional space. It can be used for data that are continuous, discrete, ordinal, and categorical, which makes it particularly useful for dealing with all kinds of missing data.
Many real-world datasets may contain missing values for various reasons. They are often encoded as NaNs, blanks, or other placeholders. Training a model with a dataset that has a lot of missing values can drastically impact the machine learning model’s quality. Some implementations, such as scikit-learn estimators, assume that all values are numerical and hold meaningful values.
One way to handle this problem is to get rid of the observations that have missing data. However, you will risk losing data points with valuable information. A better strategy would be to impute the missing values. In other words, we need to infer those missing values from the existing part of the data. There are three main types of missing data:
Missing completely at random (MCAR)
Missing at random (MAR)
Not missing at random (NMAR)
However, in this article, I will focus on 6 popular ways of data imputation for cross-sectional datasets (time-series datasets are a different story).
1- Do Nothing:
That’s an easy one: you just let the algorithm handle the missing data. Some algorithms can factor in the missing values and learn the best imputation values for them based on the training loss reduction (e.g., XGBoost). Others have the option to simply ignore them (e.g., LightGBM with use_missing=false). However, some algorithms will panic and throw an error complaining about the missing values (e.g., scikit-learn’s LinearRegression). In that case, you will need to handle the missing data and clean the dataset before feeding it to the algorithm.
2- Imputation Using (Mean/Median) Values:
This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values within each column separately and independently from the others. It can only be used with numeric data.
Pros:
Easy and fast.
Works well with small numerical datasets.
Cons:
Doesn’t factor in the correlations between features. It only works at the column level.
Will give poor results on encoded categorical features (do NOT use it on categorical features).
Not very accurate.
Doesn’t account for the uncertainty in the imputations.
Mean/Median Imputation
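The caption above refers to a code figure that did not survive the copy; a minimal equivalent sketch using scikit-learn’s SimpleImputer (toy data, not from the original article):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])

# Each column is imputed independently; strategy can be "mean" or "median".
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
# The holes are filled with the column means of the non-missing values
# (4.0 for the first column, 3.0 for the second).
```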
3- Imputation Using (Most Frequent) or (Zero/Constant) Values:
Most Frequent is another statistical strategy to impute missing values and YES!! It works with categorical features (strings or numerical representations) by replacing missing data with the most frequent values within each column.
Pros:
Works well with categorical features.
Cons:
It also doesn’t factor in the correlations between features.
It can introduce bias in the data.
Most Frequent Imputation
Zero or constant imputation, as the name suggests, replaces the missing values with either zero or any constant value you specify.
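Both strategies can be sketched with scikit-learn’s SimpleImputer (toy categorical data, assumed for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([["red", "S"],
              [np.nan, "S"],
              ["red", "M"]], dtype=object)

# Most-frequent: works on categorical (string) columns too.
most_freq = SimpleImputer(strategy="most_frequent").fit_transform(X)

# Constant: fills every hole with a value you specify.
const = SimpleImputer(strategy="constant", fill_value="missing").fit_transform(X)
```

Here the missing colour becomes "red" (the most frequent value in its column) under the first strategy and the literal string "missing" under the second.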
4- Imputation Using k-NN:
The k-nearest neighbours algorithm is a simple classification method. It uses ‘feature similarity’ to predict the values of new data points: a new point is assigned a value based on how closely it resembles the points in the training set. This can be very useful in making predictions about the missing values by finding the k closest neighbours to the observation with missing data and then imputing them based on the non-missing values in the neighbourhood. Let’s see some example code using the Impyute library, which provides a simple and easy way to use KNN for imputation:
KNN Imputation for California Housing Dataset
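The Impyute code for the California Housing dataset did not survive the copy (only the caption above remains); a comparable sketch with scikit-learn’s KNNImputer on a toy array instead:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, 4.0],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing entry is replaced by the average of that feature over the
# k nearest rows, with distances computed on the features both rows share.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
# The NaN is filled with the mean of the first feature of its two
# nearest rows (rows 1 and 3): (3.0 + 8.0) / 2 = 5.5.
```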
How does it work?
It first performs a basic mean imputation, then uses the completed data to construct a KDTree. It then uses the KDTree to compute the nearest neighbours (NNs). After it finds the k-NNs, it takes their weighted average.
Pros:
Can be much more accurate than the mean, median or most frequent imputation methods (It depends on the dataset).
Cons:
Computationally expensive. KNN works by storing the whole training dataset in memory.
K-NN is quite sensitive to outliers in the data (unlike SVM)
5- Imputation Using Multivariate Imputation by Chained Equation (MICE)
This type of imputation works by filling in the missing data multiple times. Multiple imputations (MI) are much better than a single imputation, as they capture the uncertainty of the missing values in a better way. The chained-equations approach is also very flexible and can handle variables of different data types (e.g., continuous or binary) as well as complexities such as bounds or survey skip patterns.
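scikit-learn’s IterativeImputer implements a chained-equations approach inspired by MICE (by default it returns a single imputation rather than multiple ones); a minimal sketch:

```python
import numpy as np
# IterativeImputer is experimental; this enabling import must come first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data where the second column is exactly twice the first
# wherever both are observed.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],
              [4.0, 8.0]])

# Models each feature with missing values as a function of the other
# features, iterating the round-robin regressions until convergence.
imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(X)  # the NaN lands close to 6.0
```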
6- Imputation Using Deep Learning (Datawig):
This method works very well with categorical and non-numerical features. Datawig is a library that trains machine-learning models using deep neural networks to impute missing values in a dataframe, and it supports both CPU and GPU for training.
Missing values affect our performance and predictive capacity. They have the potential to change all our statistical parameters. The way they interact with outliers once again affects our statistics. Conclusions can thus be misleading.
1. Mean: When your numerical data has missing values and the values are roughly normally distributed, replace the missing values with the mean of that numerical attribute.
2. Median: The median is the middle score when the data points are arranged in order, and unlike the mean, it is not influenced by outliers in the data set. When your continuous data is skewed, imputing with the mean can distort the distribution further; impute with the median value of the attribute instead.
Disadvantages of mean/median method:
a. Doesn’t factor in the correlations between features. It only works at the column level
b. Will give poor results on encoded categorical features (do NOT use it on categorical features)
c. Doesn’t account for the uncertainty in the imputations
3. Mode: The mode is the most frequent value in our data set. With continuous data, though, the mode can create ambiguities: there might be more than one mode, or (rarely) none at all if no value is repeated. The mode is thus used to impute missing values in columns that are categorical in nature
Disadvantages:
a. It also doesn’t factor in the correlations between features
b. It can introduce bias into the data
4. kNN Imputation – Predict the missing values:
The algorithm uses ‘feature similarity’ to predict the values of any new data points. The new point is assigned a value based on how closely it resembles the points in the training set. This can be very useful in making predictions about the missing values by finding the k closest neighbours to the observation with missing data and then imputing them based on the non-missing values in the neighbourhood.
Advantages:
a) Can be much more accurate than the mean, median, or most frequent imputation methods (it depends on the dataset)
b) Can be used for data that are continuous, discrete, ordinal, and categorical, which makes it particularly useful for dealing with all kinds of missing data
Disadvantages: computationally expensive and sensitive to outliers
Mean – This measure of central tendency should only be used to replace missing values when the data is symmetric and the column is numeric. It is not a good idea to use mean imputation when the data is skewed.
Mode – Mode imputation replaces the missing values with the mode, i.e. the most frequent value, of the feature column. It is preferred mostly when the feature column is categorical.
Median – Another technique is median imputation, in which the missing values are replaced with the median value of the feature column. When the data is skewed, it is good to consider using the median for replacing the missing values.
KNN Imputation – KNN is an algorithm that is useful for matching a point with its closest k neighbors in a multi-dimensional space. It can be used for data that are continuous, discrete, ordinal, and categorical, which makes it particularly useful for dealing with all kinds of missing data. The assumption behind using KNN for missing values is that a point’s value can be approximated by the values of the points closest to it, based on the other variables.
Missing Value Treatment by:
1. Mean: if the numerical data with missing values is normally distributed, replace the missing values with the mean of that numerical attribute.
2. Median: if the numerical data with missing values is skewed, positively or negatively, replace the missing values with the median of that numerical attribute.
3. Mode: basically used for categorical or ordinal data; any missing value is replaced with the most frequently occurring value (maximum frequency).
4. KNN: K-nearest neighbours is a machine-learning algorithm for both regression and classification problems. It works by finding the distance between the query point and all the example points, and it is also popular for missing value imputation: it computes the distance between the observation with the missing value and the remaining data, then fills the gap from the nearest non-missing values. It can handle every type of missing data, whether numerical, categorical, or ordinal.
Various options for dealing with missing values, and how to implement them:
• Mean method: the average of the non-missing values is filled in place of the missing values.
• Mode method: the value with the highest frequency of occurrence is filled in place of the missing values.
• Median method: the median of the non-null values is found and filled in place of the missing values.
• KNN imputation: for every observation to be imputed, it identifies the ‘k’ closest observations based on Euclidean distance and computes their weighted average (weighted by distance).
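The distance-weighted average in the last bullet can be sketched with scikit-learn’s KNNImputer by setting weights="distance" (toy data, assumed for illustration):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [4.0, np.nan]])

# weights="distance": closer neighbours contribute more to the imputed
# value, exactly the weighted average described above.
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_imputed = imputer.fit_transform(X)
```

Here the second row is twice as close to the incomplete row as the first, so its value counts more in the fill.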
Methods to treat missing values:
- Deletion: used when the values are missing completely at random.
Cons: reduces the sample size.
- Mean/median/mode imputation: the most frequently used methods. They replace the missing values with the mean, median, or mode of the column, and can be of two types:
- Generalized imputation: replaces the missing value with the mean, median, or mode of all the non-missing values in the column.
- Similar case imputation: suppose there is a gender column with male and female values, and an income column with some missing values. For a missing income belonging to a male, instead of taking the mean, median, or mode of all the non-missing incomes (male and female included), take it only over the non-missing incomes for males; the same goes for females.
- KNN imputation: missing values are imputed using a given number of observations that are most similar to the one whose values are missing; the similarity of two observations is calculated using a distance function.
Pros: can be used for both quantitative and qualitative data, and correlation in the data is taken into consideration.
Cons: very time-consuming, and the choice of the k value is also critical.
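The similar case imputation described above can be sketched with a pandas group-wise fill (illustrative column names, matching the gender/income example):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female"],
    "income": [50_000, None, 70_000, 90_000, None],
})

# Fill each missing income with the mean income of the SAME gender only,
# instead of the overall column mean.
df["income"] = df["income"].fillna(
    df.groupby("gender")["income"].transform("mean")
)
```

The missing male income becomes 60,000 (the male mean) and the missing female income becomes 90,000, rather than both getting the pooled mean of 70,000.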
Missing Value Treatment
Mean Imputation : When data is numeric in nature, then missing values can be imputed by first calculating the mean of all the values of the column in question and then substituting the missing values by the obtained mean value.
Median Imputation : When data is skewed, imputing them with the mean values can imbalance the data further. Imputing them with the median value of the attribute would then be a much smarter choice.
Mode Imputation : This method works for Categorical and Ordinal features by replacing missing data with the most frequently occurring value (mode) within each column.
KNN Imputation : The KNN algorithm is generally used in regression and classification problems. A new data point is assigned a value based on its Euclidean distance from the chosen “K” nearest neighbours; in classification, it is assigned to the class containing the greatest number of those neighbours. Hence, KNN can be deployed to predict the missing values by finding the k closest neighbours to the observation with missing data and then imputing them based on the non-missing values in the neighbourhood.