Label Encoder and One Hot Encoding
Our datasets can contain any sort of data: numbers, categories, text, or almost anything else. If you have ever built a model, you already know that most algorithms can't be trained directly on textual data.
Label Encoding and One Hot Encoding are two of the most important ways to convert a textual categorical variable into a usable format. It's very important to understand how each one treats your categorical fields.
Suppose we have two columns, Name and Country, and the task is to use the country name as a category in our predictive model. Since we can't use textual categorical variables in the model directly, we need to convert them into numerical values (using the sample Python code given below):
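from sklearn.preprocessing import LabelEncoder
# x is assumed to be a NumPy array whose column 0 holds the country names
labelencoder = LabelEncoder()
# Replace each country name with an integer code
x[:, 0] = labelencoder.fit_transform(x[:, 0])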
Now we have a third column in the data set named ‘Label Encoder’. This column is numerical, so you can now use it 🙂
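To make the step above concrete, here is a minimal sketch on a small made-up list of countries (the sample values and the variable name `countries` are assumptions for illustration, not the original data set). Note that `LabelEncoder` assigns codes starting from 0 in sorted order of the category names, so the exact numbers you get may differ from the ones in your own table.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values standing in for the Country column
countries = ["India", "Nepal", "Sri Lanka", "India"]

labelencoder = LabelEncoder()
codes = labelencoder.fit_transform(countries)

# classes_ lists the distinct categories in sorted order;
# each category's position in that list is its integer code
print(list(labelencoder.classes_))
print(codes.tolist())

# inverse_transform maps the codes back to the original names
print(list(labelencoder.inverse_transform(codes)))
```

Every row with the same country gets the same code, which is exactly what lets the model treat it as one category.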
But wait!
If we feed these numerical values to the model as they are, it will assign weightage in the order India > Nepal > Sri Lanka, because the numbers are 3 > 2 > 1 in the new column. But this column needs to be used as a category, not as a numerical variable.
Put simply, all Indians need to be treated alike, but the model should not learn that Nepal, with value 2, is inferior to India, with value 3.
Here, One Hot Encoding comes into the picture.
One Hot Encoding (OHE) takes all the distinct categories and creates a separate binary column for each, holding 0 or 1. This removes the numerical weightage from the equation.
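Sample code in Python is below:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# The categorical_features argument was removed in scikit-learn 0.22; ColumnTransformer now selects column 0
ct = ColumnTransformer([("country", OneHotEncoder(), [0])], remainder="passthrough", sparse_threshold=0)
x = ct.fit_transform(x)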
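Here is a small self-contained sketch of what the encoder produces, again on made-up country values (the array `countries` is an assumption for illustration). `OneHotEncoder` expects a 2-D input, so each country sits in its own row of a single-column array.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample values: one column, one country per row
countries = np.array([["India"], ["Nepal"], ["Sri Lanka"], ["India"]])

onehotencoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
onehot = onehotencoder.fit_transform(countries).toarray()

# categories_[0] gives the category behind each output column, in order
print(onehotencoder.categories_[0].tolist())
print(onehot)
```

Each row has exactly one 1, in the column that belongs to its country, and 0 everywhere else.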
In simple words, Label Encoder converts textual categories into numerical values, and One Hot Encoding converts each of these distinct values into a new column.
These methods are widely used to prepare data for classification, regression, and tree-based algorithms.
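One rough way to see the difference is to compare how "far apart" two countries look under each encoding. The sketch below uses hand-written codes and vectors (the dictionaries `label_codes` and `one_hot`, and both helper functions, are hypothetical names made up for this illustration).

```python
import numpy as np

# Hypothetical label codes and the matching one-hot vectors
label_codes = {"India": 0, "Nepal": 1, "Sri Lanka": 2}
one_hot = {
    "India": np.array([1, 0, 0]),
    "Nepal": np.array([0, 1, 0]),
    "Sri Lanka": np.array([0, 0, 1]),
}

def label_gap(a, b):
    """Absolute difference between two label codes."""
    return abs(label_codes[a] - label_codes[b])

def one_hot_gap(a, b):
    """Euclidean distance between two one-hot vectors."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))

# With label codes, Sri Lanka looks twice as far from India as Nepal does
print(label_gap("India", "Nepal"), label_gap("India", "Sri Lanka"))

# With one-hot vectors, every pair of countries is equally far apart
print(one_hot_gap("India", "Nepal"), one_hot_gap("India", "Sri Lanka"))
```

The label codes suggest an ordering that does not exist in the data, while the one-hot vectors keep every category at the same distance from every other.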
Keep Learning 🙂
The Data Monk