What should I do if one of my integer-valued columns has more than 30% missing values?
Question
Should I remove all the rows?
Answers ( 5 )
Removing the rows is usually not a good idea, because dropping them throws away the information in their other columns.
Instead, you can try two methods:
One: create a binary variable that is 1 where the column is missing and 0 otherwise, then impute a value based on the distribution of the variable, such as the mean, median, or mode.
Another method is to impute an extreme value such as -999 or 9999 that lies far outside the distribution of the variable, so that the model can handle those rows separately.
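Both methods can be sketched in a few lines of pandas. This is only an illustration: the `age` column and its values are made up, and the new column names are assumptions, not part of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "age" is an integer-valued column with 3 of 10 values missing.
df = pd.DataFrame({"age": [25, np.nan, 31, np.nan, 40, 22, np.nan, 35, 29, 50]})

# Method 1: flag the missing rows, then fill the gaps with the median.
df["age_missing"] = df["age"].isna().astype(int)
median_age = df["age"].median()  # NaN values are skipped by default
df["age_imputed"] = df["age"].fillna(median_age)

# Method 2: fill the gaps with a sentinel far outside the observed range.
df["age_sentinel"] = df["age"].fillna(-999)
```

Sentinel values tend to suit tree-based models, which can split those rows off. For linear or distance-based models they can distort the fit, so the indicator-plus-imputation approach is usually the safer of the two there.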
30% is not a humongous amount of missing data. In this case we can do some proper EDA, figure out what led to the missing values, and impute them using techniques like the mean, median, or mode. In more advanced cases k-NN imputation can also be used. Either way, null values need to be treated before feeding the data into the algorithm.
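For the k-NN option, a minimal sketch using scikit-learn's `KNNImputer` might look like this. The height/weight data is hypothetical, and `n_neighbors=2` is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: columns are [height_cm, weight_kg]; the last weight is missing.
X = np.array([
    [150.0, 50.0],
    [152.0, 52.0],
    [180.0, 80.0],
    [182.0, 82.0],
    [151.0, np.nan],
])

# Fill each gap with the average of that feature over the 2 nearest rows,
# where distance is computed on the features both rows have observed.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

Here the two rows closest in height to the incomplete row are the first two, so the missing weight is filled with their average. With real data it usually helps to scale the features first, since the distance is sensitive to units.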
30% missing values can be handled by some form of imputation instead of
deleting the entire column. There are various ways to handle missing
values, such as filling them with the mean, median, or mode, or using k-NN imputation.
Which method to choose is a separate discussion, because we first need to
inspect why the values are missing in the first place.
Missing values fall into categories such as Missing Completely at Random (MCAR),
Missing at Random (MAR), and Missing Not at Random (MNAR), and each category needs to
be dealt with in its own way.
As for the question about 30% missing values, they should be handled
and replaced using one of the above methods.
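One rough way to start that inspection is to compare the missing rate across the groups of another column. The sketch below assumes a hypothetical `segment` column alongside a hypothetical `income` column with gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "income" is missing more often in segment "b".
df = pd.DataFrame({
    "segment": ["a", "a", "a", "a", "b", "b", "b", "b", "b", "b"],
    "income": [40, 42, np.nan, 45, np.nan, np.nan, 30, np.nan, 28, 31],
})

# Overall share of missing values in the column.
missing_share = df["income"].isna().mean()

# Share of missing values within each segment.
missing_by_segment = df["income"].isna().groupby(df["segment"]).mean()
```

Similar rates across groups are consistent with MCAR, while clearly different rates suggest the missingness depends on observed data, as under MAR. This kind of check cannot rule out MNAR, because that depends on the unobserved values themselves.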
Missing values can be handled using the mean, median, or mode, or with k-NN imputation, once you know whether they are MCAR, MAR, or MNAR. For a numerical column, a robust default is to impute the median of the column. The mean is pulled toward outliers, so the median is usually the safer choice.
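A small made-up example shows how far a single outlier can move the mean compared with the median. The values and variable names below are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one extreme value and two gaps.
s = pd.Series([10, 12, 11, 13, 1000, np.nan, np.nan])

mean_value = s.mean()      # 209.2, pulled upward by the outlier
median_value = s.median()  # 12.0, stays near the bulk of the data

filled_with_mean = s.fillna(mean_value)
filled_with_median = s.fillna(median_value)

# Missing values force a float dtype; cast back once the gaps are filled
# if the column should stay integer-valued.
filled_int = filled_with_median.round().astype("int64")
```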
30% missing values is a large share, so we can't simply delete all of those rows; it would be a huge loss of data. We first have to find out the cause of the missing values: whether they are missing at random or follow a particular pattern based on the other data in those rows. There are different imputation techniques for each case, such as the mean, median, mode, or k-NN imputation.
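When the gaps do follow a pattern tied to another column, one option is to impute within groups rather than across the whole column. This is a sketch under assumptions: the `segment` and `income` columns are hypothetical, and the per-group median is just one reasonable choice of fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income levels differ by segment, and so do the gaps.
df = pd.DataFrame({
    "segment": ["a", "a", "a", "a", "b", "b", "b", "b", "b", "b"],
    "income": [40, 42, np.nan, 45, np.nan, np.nan, 30, np.nan, 28, 31],
})

# Median income of each row's own segment, aligned to the original index.
group_median = df.groupby("segment")["income"].transform("median")

# Fill each gap with the median of its segment instead of the global median.
df["income_filled"] = df["income"].fillna(group_median)
```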