Label Encoder and One Hot Encoding

In our datasets we can have any sort of data: numbers, categories, text, or literally anything. If you have ever created a model, you already know that you can't use textual data to train it.

Label Encoder and One Hot Encoding are two of the most important ways to convert a textual categorical variable into a usable format. It's very important to understand the ways in which you can use your categorical fields.

(Figure: columns in green are from the original dataset; columns in yellow are generated after Label Encoder and One Hot Encoder.)

Now, we have two columns, i.e. Name and Country, and the task is to use the country name as a category in our predictive model. We already know that we can't use textual categorical variables in our model, so we need to convert them into numerical values (using the sample Python code given below).

from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
# encode the first column (Country) of the feature matrix x as integers
x[:, 0] = labelencoder.fit_transform(x[:, 0])

Now we have a third column in the dataset with the name 'Label Encoder'. The column 'Label Encoder' is in numerical form, so now you can use it 🙂

But, wait !!
If we use the above category with these numerical values, then the model will assign weightage in the order India > Nepal > Sri Lanka, because the numbers in the new column are 3 > 2 > 1. But this column needs to be used as a category and not as a numerical variable.

It's simple: all the Indians need to be treated alike, but the model should not learn that Nepal with value 2 is inferior to India with numerical value 3.
Here, One Hot Encoding comes into the picture.

OHE takes all the distinct categories and creates a separate binary column for each, filled with 0s and 1s. This removes the numerical weightage from the equation.

Sample code in Python is below

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# categorical_features=[0] was removed from newer scikit-learn; ColumnTransformer targets column 0 instead
onehotencoder = ColumnTransformer([("ohe", OneHotEncoder(), [0])], remainder="passthrough", sparse_threshold=0)
x = onehotencoder.fit_transform(x)

In simple words, Label Encoder converts textual categories into numerical values, and One Hot Encoding converts each of these distinct values into a new binary column. A small end-to-end sketch is given below.
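Here is a minimal end-to-end sketch tying both steps together on a hypothetical country column (the sparse_output argument assumes scikit-learn 1.2+; older releases call it sparse):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

countries = np.array(["India", "Nepal", "Sri Lanka", "India"])

# Label Encoder: text -> integers (imposes an artificial order)
print(LabelEncoder().fit_transform(countries))  # [0 1 2 0]

# One Hot Encoder: one binary 0/1 column per distinct category
ohe = OneHotEncoder(sparse_output=False)
print(ohe.fit_transform(countries.reshape(-1, 1)))  # 4 rows x 3 columns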

These methods are extensively used to prepare data for Classification, Regression, and Tree-Based algorithms.

Keep Learning 🙂
The Data Monk

100 Questions in R to crack any Data Science Interview

R is one of the two most popular Data Science programming languages. If you are new to this domain, then we would always recommend you start with R because of its easier installation steps, minimal version control, and its libraries.

I used to write down all the questions which were asked to me in my interviews. Since I had R on my resume, the questions used to revolve around functions, loops, regular expressions, etc. I jotted down all the questions with their explanations and outputs and combined them in this book.

The book contains questions on :-
-Apply
-Plot
-Library functions
-User defined functions
-Regular Expression
-Data Type and Data Structure of R
And many more

We highly, highly recommend you cover this book end to end in order to completely destroy interviews in R.

The book is available for free on 18th and 19th April’20 from the link below

https://www.amazon.in/dp/B0876F4JP7

Do check our other Data Science posts !!

Keep Learning 🙂

The Data Monk



300+ Data Science Interview Questions

You probably have a lot of information about what to study and what not to study for a Data Science job. But when you start applying for a DS job, you will realize that the whole process contains a lot of diverse rounds, which include
1. Problem Solving
2. Statistics
3. Aptitude
4. Guesstimate
5. Python and R
6. SQL and Excel
7. Project Description

This book contains more than 300 questions covering all the essential topics. You will get a complete idea of the recruitment drive.

The following is the book in which we have covered 300+ interview questions for Data Science interviews

https://www.amazon.in/dp/B07TSJ85FP

Once you go through the above book, you can try your hands on specific algorithms. Do comment and let us know if you like the book 🙂

Keep Learning 🙂

The Data Monk

Interview Questions – Flipkart, Myntra, Oyo Rooms, Tredence, and Meredith India

Data Science is one of those domains which are less explored in colleges but is in high demand in the I.T. sector.
The combination of Maths and Technology makes it even more interesting.

I have been working as a Senior Data Scientist at OYO for the last one year, prior to which I was at Mu Sigma for 3+ years. I have appeared for a lot of interviews, and I, along with my peers, have converted a handful of opportunities at various product- and service-based companies.

The Data Monk is an initiative to make everyone aware of the current job scenario in the Data Science domain and how to crack these interviews.

The reality of these interviews is that they are much less about the theory which you learn on online portals and courses, and much more about live-project experience.

We have interviewed more than 1000 candidates for our respective teams and companies, and we hold a good grip on what actually is asked in these interviews.

The link to the book is given below; it has 5 complete interview papers from:-
1. OYO – Bangalore
2. Flipkart
3. Meredith
4. Tredence
5. Myntra

https://www.amazon.in/dp/B07QFYJ75Z

If you are not from India, then search for the book by its name.

Please go through this book if you are appearing for any interview in the upcoming month or if you are just starting with your interview preparation.

Get in touch with me

Linkedin Account – https://www.linkedin.com/in/nitin-kamal-a2841a80/

The Data Monk Page on Linkedin – https://www.linkedin.com/company/14573769/

Facebook Page – https://www.facebook.com/thedatamonk/

Janta Hack – Analytics Vidhya R code

install.packages("stringr")
library(stringr)

data = read.csv("C:/Users/User/Desktop/Hackathon/JantaHack/train.csv")
head(data)
str(data)

data$product <- str_count(data$ProductList,";")+1
head(data)
data$hours <- with(data, difftime(endTime,startTime,units="hours") )
data$min <- with(data, difftime(endTime,startTime,units="mins") )
data$x <- as.double(data$endTime - data$startTime, units = "mins")

table(data$product)
hist(data$product)

table(data$gender)
count <- table(data$gender,data$product)
barplot(count)
str(data)
head(data$ET)
head(data$endTime)

# assuming '16/12/14' is day/month/year; the original format string did not match the input
date1 = as.POSIXlt('16/12/14 14:41', format="%d/%m/%y %H:%M")
date2 = as.POSIXlt('2015-10-05T22:43:00.000', format="%Y-%m-%dT%H:%M:%S")
install.packages("lubridate")
library(lubridate)
year(date1)
month(date1)
day(date1)
hour(date1)

data$date <- substr(data$startTime,1,2)
head(data)

#Merge Train and Test Data Set
test <- read.csv("C:/Users/User/Desktop/Hackathon/JantaHack/test.csv")
df_test <- as.data.frame(append(test,list(gender=0),after = 4))
head(df_test)

# completed to mirror the male/female encoding used below; the original ifelse() call was empty
data$gender_num <- ifelse(data$gender=='male',1,0)

data_x <- read.csv("C:/Users/User/Desktop/Hackathon/JantaHack/train.csv")
data_x$G <- ifelse(data_x$gender=='male',1,0)
head(data_x)
data_x = subset(data_x,select=-c(gender))

data_test <- read.csv("C:/Users/User/Desktop/Hackathon/JantaHack/test.csv")
df_data_test <- as.data.frame(append(data_test,list(G=0),after=4))
head(df_data_test)

df_Janta <- rbind(data_x,df_data_test)
df_Janta$Product <- str_count(df_Janta$ProductList,";")+1
head(df_Janta)

#In the training dataset we have 8192 females and 2308 males
table(data_x$G)

#In total there have been 7934 single purchases
table(df_Janta$Product)

# substr(x, start, stop) takes positions, so stop must be >= start; the original
# substr(..., 21, 6) always returned "". Assuming the first 6-character code was intended:
df_Janta$ProductList <- as.character(df_Janta$ProductList)
df_Janta$first <- substr(df_Janta$ProductList,1,6)
str(df_Janta)
df_Janta$x <- substr(df_Janta$ProductList,1,6)
df_Janta$x
str(df_Janta)

first = sapply(df_Janta$ProductList, function(x) {
  if (substr(x,1,6) != '') {
    return(substr(x,1,6))
  } else {
    return("Null")
  }
})

table(first)
table(first,df_Janta$G)

second = sapply(df_Janta$ProductList, function(x) {
  # assuming the second 6-character code spans positions 7-12; the original
  # substr(x,7,6) always returned "" because stop < start
  return(substr(x,7,12))
})
table(second)

train_f <- read.csv("C:/Users/User/Desktop/Hackathon/JantaHack/train.csv")
head(train_f)
test_f <- read.csv("C:/Users/User/Desktop/Hackathon/JantaHack/test.csv")
head(test_f)

str(train_f)
test_f <- as.data.frame(append(test_f,list(gender=0),after=4))
str(test_f)

both <- rbind(train_f,test_f)

#Adding number of products
both$no_prod <- str_count(both$ProductList,";")+1
str(both)
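# gb_p, gb_p2, gb_1, sum, and sum_gb below are engineered features assumed to have been
# added to `both` outside this script; they are not created anywhere in the code above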

both$gender <- as.factor(both$gender)
both$gb_p <- as.factor(both$gb_p)
both$gb_p2 <- as.factor(both$gb_p2)
both$gb_1 <- as.factor(both$gb_1)
both$sum <- as.factor(both$sum)
both$sum_gb <- as.factor(both$sum_gb)
str(both)

traindata <- both[1:10500,]
testdata <- both[10501:15000,]

model_log <- glm(gender ~ gb_p+gb_p2+gb_1+sum+sum_gb+no_prod,data = traindata,family = binomial)
summary(model_log)

# type="response" converts the binomial glm output from log-odds to probabilities
x <- predict(model_log,testdata,type="response")
sub <- cbind(testdata$session_id,x)

write.csv(sub,"C:/Users/User/Desktop/Hackathon/JantaHack/submit_lm.csv")

install.packages("caret")
install.packages("e1071")
library(caret)
library(e1071)
set.seed(101)

tuned = tune.svm(gender~ gb_p+gb_p2+gb_1+sum+sum_gb+no_prod , data = traindata, gamma = seq(.1,0.5,0.1), cost = seq(1,60,10))
tuned$best.parameters

model_svm <- svm(gender~ gb_p+gb_p2+gb_1+sum+sum_gb+no_prod , data = traindata, gamma = 0.1, cost = 1, type = "C-classification")

summary(model_svm)

svm_pred <- predict(model_svm,testdata,type="response")

fin_svm <- cbind(testdata$session_id,svm_pred)
write.csv(fin_svm,"C:/Users/User/Desktop/Hackathon/JantaHack/submit_svm.csv")

# lm() needs a numeric response; gender is a factor here, so recode it to 0/1 on the fly
model_lin <- lm(ifelse(gender=="male",1,0) ~ no_prod , data = traindata)
summary(model_lin)
lm_pred <- predict(model_lin,testdata)
head(testdata)
pred_lm <- cbind(testdata$session_id,lm_pred)
head(pred_lm)
table(lm_pred)
write.csv(pred_lm,"C:/Users/User/Desktop/Hackathon/JantaHack/submit_lin.csv")

install.packages("randomForest")
library(randomForest)

model_rf <- randomForest(gender~ gb_p+gb_p2+gb_1+sum+sum_gb+no_prod , data = traindata)
model_rf

pred_rf <- predict(model_rf,testdata)
sub_rf <- cbind(testdata$session_id,pred_rf)

write.csv(sub_rf,"C:/Users/User/Desktop/Hackathon/JantaHack/submit_rf.csv")

head(train_f)

# as.String() belongs to the NLP package, which is never loaded; base R's as.character() does the job
train_f$Str <- as.character(train_f$ProductList)

library(xgboost)

Missing Value Treatment – Mean, Median, Mode, KNN Imputation, and Prediction

Missing value treatment is no doubt one of the most important parts of the whole process of building a model. Why?
Because we can't afford to eliminate rows wherever there is a missing value in any of the columns. We need to tackle it in the best possible way. There are multiple ways to deal with missing values, and these are my top four methods (plus a bonus):-

1. Mean – When do you take an average of a column? There is a saying which goes like this: "When a billionaire walks into a small bar, everyone becomes a millionaire."
So, avoid using the mean as a missing value treatment technique when the range is too high. Suppose there are 10,000 employees with a salary of Rs. 40,000 each and there are 100 employees with a salary of Rs. 1,00,000 each. In this case you can consider using the mean for missing value treatment.

But if there are 10 employees, with nine of them earning Rs. 40,000 and one earning Rs. 10,00,000, then you should avoid using the mean. You can use the mode instead !!

2. Median – The median is the middle term when you write the terms in ascending or descending order. Can you think of one example where you can use this? The answer is at the bottom of the article.

3. Mode – The mode is the most frequently occurring value. As we discussed in point one, we can use the mode where there is a high chance of repetition. A small sketch of all three fills is given below.
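A minimal pandas sketch of the three fills on a hypothetical salary column (pandas and numpy assumed installed):

import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [40000, 40000, 40000, 1000000, np.nan]})

# mean gets pulled up by the billionaire-style outlier (2,80,000 here)
df["salary_mean"] = df["salary"].fillna(df["salary"].mean())
# median resists the outlier (40,000)
df["salary_median"] = df["salary"].fillna(df["salary"].median())
# mode fills with the most frequent value (40,000)
df["salary_mode"] = df["salary"].fillna(df["salary"].mode()[0])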

4. KNN Imputation – This is the best way to fill a missing value: here the k most similar neighbours of the incomplete row are searched. The similarity of two records is determined using a distance function.

In one of the hackathons, I had to impute the missing values of age, so I tried the following (in R):

# knnImputation() comes from the DMwR package
library(DMwR)
new_dataset <- knnImputation(data = df, k = 8)

k-nearest neighbours can impute both qualitative & quantitative attributes, but it consumes a lot of time and processing power.
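If you work in Python instead, scikit-learn ships a comparable imputer; here is a minimal sketch on hypothetical age/salary records (KNNImputer assumes scikit-learn 0.22+):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[25.0, 40000.0],
              [30.0, 42000.0],
              [np.nan, 41000.0]])

# each missing value is filled using the mean of its 2 nearest neighbours
print(KNNImputer(n_neighbors=2).fit_transform(X))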

For time series, the imputeTS package fills gaps by interpolation (newer imputeTS releases renamed na.interpolation to na_interpolation):

install.packages("imputeTS")
library(imputeTS)
x <- ts(c(12,23,41,52,NA,71,83,97,108))

na.interpolation(x)

na.interpolation(x, option = "spline")

na.interpolation(x, option = "stine")



5. Bonus type – Prediction
This is another way of fixing missing values. You can try linear regression, time series analysis, or any other method to fill in the missing values using prediction.

Median (answer to the question above) – You can use the median where there is low variance in age.


Came across KNN Imputation, so thought of sharing the same !!

Keep Learning 🙂
The Data Monk

Story of Bias, Variance, Bias-Variance Trade-Off

Why do we predict?
We predict in order to identify the trend of the future by using our sample data set. Whenever we create a model, we try to create a formula out of our sample data set. And the aim of this formula is to satisfy all the possible conditions of the universe.
Mathematicians and Statisticians all across the globe try to create a perfect model that can answer future questions.

Thus we create a model, and this model is bound to have some error. Why? Because we can't cover all the possible combinations in one formula. The difference between the actual and predicted value is called the prediction error.

Bias – It is the difference between the average prediction of the model and the actual values. A model with HIGH bias will be a very simple model, and it will land far away from the actual values in both the train and test datasets.

Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbors and SVM.

Examples of high-bias machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression

Variance – Variance refers to how much the model's predictions would change if it were trained on a different sample. A model with high variance becomes so specific to its training dataset, trying to cover every single point, that it ends up with high training accuracy but low test accuracy.

Examples of low-variance machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression.

Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines.

(Figure: a complicated model with high variance on the left, a simple model with high bias on the right.)

As you can see, the line on the left tries to cover all the points, so it creates a complicated model which is very accurate on the training data set.

Let's see what under-fitting, over-fitting, and good models look like.

As you can see, high variance occurs in a model that tries to create a complicated formula on the training data set.
A high bias model is very generic; meaning, it just spits out some rough average without much thought.

If you want to understand the mathematics behind these errors, here is the formula:

Total Error = (Bias)^2 + Variance + Irreducible Error

The formula has 3 terms: the first term is the bias squared, the second is the variance, and the third is the irreducible error.
No matter what, you can't remove the irreducible error. It is the measure of noise in the data, and you can't have a noiseless dataset.

When you have a very limited dataset, there is a high chance of ending up with an under-fitted model (High Bias and Low Variance).
When you have very noisy data, the model tries to fit a complicated formula, which might result in over-fitting on the training dataset (High Variance and Low Bias).

What is the bias-variance trade-off?
The trade-off between bias and variance is done to minimize the overall error (formula above).

Error = Reducible Error + Irreducible Error
Reducible Error = (Bias)^2 + Variance

(Figure: estimated mean squared error; picture from Quora.)

Let's try to ease out the formulas for Bias and Variance:
Bias = (average estimate of the target) - (true target)
Variance of estimates = average of (estimate - average estimate)^2
The variance error measures how much our estimated target function would differ if new training data were used.

To keep all the errors positive, we take the bias squared, the variance (which is itself a squared quantity), and the irreducible error squared.

The bias-variance trade-off is the property of a set of predictive models whereby models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa.

How do we actually make the bias-variance trade-off?
There are multiple methods for the B-V trade-off:-
-Separate training and testing datasets
-Cross-Validation
-Good performance metrics
-Fitting model parameters
A small sketch of the train/test idea in action is given below.
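As a rough illustration (a minimal sketch on hypothetical noisy sine data, scikit-learn assumed installed), watch the train and test errors diverge as the model grows more complex:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, 60).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)  # noisy target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # high bias -> balanced -> high variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training error keeps falling
          mean_squared_error(y_te, model.predict(X_te)))   # test error worsens for degree 15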

Keep Learning 🙂
The Data Monk

GRE Verbal | Barron’s 800 Destroyed | Day 11

Too much work in WFH 😛
Was already dozing, but let’s end the day with at least a few words in hand. Let’s go

281. quorum – the number of members necessary to conduct a meeting
To start the proceedings of the Parliament, the quorum needs to be maintained.

282. rail – to scold with bitter or abusive language

Tu kisi rail se gujarti hai.. sorry, tu kisi rail si gariyati hai, main kisi pool sa thartharata hun (a twist on a Hindi song lyric: "you scold like a rail, I tremble like a bridge"; gariyati means scolding abusively in my regional language)

I think everyone will remember this one 😛

283. seismic – related to earthquake

Seismic waves or vibrations are observed before an earthquake.

284. sensual – pleasing to the physical senses

She got aroused by his sensual moves. Can't come up with a more decent sentence for this word.

285. diffuse – to spread out

There are two words, diffuse and infuse, infuse means to spread in and diffuse means to spread out

286. dismiss – put away from consideration; reject

hehehe..easiest of the lot

287. errant – mistaken; straying away from the proper course

To err is human, to forgive is divine, i.e. humans are bound to make mistakes.

An "errant child" is one who misbehaves.

288. existential – having to do with existence

The word itself is the meaning

289. fallow – plowed but not sowed; uncultivated

Fallow land


290. heterodox – not widely accepted

Not sure about any sentence 🙁

Date – 8th April

291. virtuoso – extremely talented in music or art

He is a guitar virtuoso. He plays the guitar so well that it feels virtual 😛

292. vindictive – vengeful; unforgiving

Vindictive sounds like vengeance, which means taking revenge, i.e. being revengeful.

293. venerate – to adore; respect

It certainly sounds like accelerate, but it’s completely different. It means to respect

The Data Monk was venerated as a saint

294. venal – corrupt

you remember banal which meant boring, no you don’t remember 😛
Vijay Mallya is notoriously venal

295.

Multicollinearity in Simple Terms

We all know the definition of multicollinearity: when two or more explanatory variables in a multiple regression model are highly linearly related, it's called multicollinearity.

Example –
Age and Selling price of a Car
Education and Annual Income
Height and Weight

Why should we remove multicollinearity from our model?
For example, you are watching WWE and Batista is thrashing the Undertaker. You know that Batista is better.
But suppose it's a Royal Rumble where 5 wrestlers are beating the Undertaker simultaneously. Now you can't say which one is attacking with what intensity, and thus you can't say which wrestler among the five is better.

Thus, when you have multiple variables which are correlated, the model is unable to properly attribute the impact of each variable. So, we need to remove redundant variables.

What all methods are used to remove multi-collinearity?
There are two methods to do the same:-
1. VIF – It stands for Variance Inflation Factor. During regression analysis, VIF assesses whether factors are correlated to each other (multicollinearity), which can distort p-values and make the model less reliable.

Factors with a high VIF should be removed. A VIF of 1 suggests no correlation.

2. PCA – Principal component analysis (PCA) is a technique used to emphasise variation and bring out strong patterns in a dataset. It's often used to make data easier to explore and visualise. A quick VIF illustration is given below.
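Here is a minimal sketch of the VIF check using statsmodels, on hypothetical height/weight/age data (statsmodels and pandas assumed installed):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.RandomState(0)
height = rng.normal(170, 10, 200)
X = pd.DataFrame({
    "height": height,
    "weight": 0.9 * height + rng.normal(0, 5, 200),  # strongly tied to height
    "age": rng.normal(35, 8, 200),                   # unrelated column
})
X = sm.add_constant(X)  # VIF is computed against a model with an intercept

# VIF for each predictor; the correlated height/weight pair scores far above 1
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))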

How do we deal with multicollinearity in our model?
1. You can use feature engineering to convert the two variables into one and then use this new variable (see the sketch after this list)

2. Use VIF/PCA to eliminate one of the variables
You should eliminate the one which is less strongly correlated with the target variable
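For point 1, a tiny hypothetical illustration: height and weight can be folded into a single BMI feature, so only one column carries their shared signal:

import pandas as pd

df = pd.DataFrame({"height_m": [1.6, 1.75, 1.8], "weight_kg": [60, 72, 90]})
# one engineered feature replaces the two correlated ones
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
df = df.drop(columns=["height_m", "weight_kg"])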

I think this is pretty much it about multicollinearity. Let me know if you have any questions.

Keep Learning 🙂
The Data Monk

GRE Verbal | Barron’s 800 Destroyed | Day 10

We are already good with 220 words, let's pass that 250 mark.
All the words given below are directly from Barron’s 800 most frequent words.

251. abeyance – temporary suspension

If you have ever created an Angle Priya profile, there is a high chance that your profile must have fallen into abeyance 😛

252. accretion – growth in size

What is acceleration?
Growth in speed, and it sounds like accretion

253. aggrandize – to make larger or greater

Can you spot the word GRAND in aggrandize? It's because it grows in size

254. allure – the power to attract by charm

Easy

255. amalgamate – to combine into one unit

This was last used in Chemistry, where you amalgamate two elements to make an alloy or something.

You also amalgamate ideas

256. ambiguous – unclear

Easy

257. ambivalence – the state of having conflicting emotional attitudes

Okay, so ambi means both; for example, ambidextrous means a person who uses both hands equally well.

So, ambi means two equal things and valence is emotion, so when you have two emotions, they clash

258. amenity – something that increases comfort or convenience

All that is fine, but what all amenities are there in the hotel?

259. ardor – great emotion or passion

Ardor 2.1 is an awesome pub in Gurgaon (India), and what is it famous for? Its awesome whiskey, which will fill your spirit with emotion and passion

260. argot – a specialised vocabulary used by a specific group

Heyy.. wassup man? I ain't doing that thing no more bro

This is the teenage argot

261. beneficent – doing good; generous

A beneficent landowner or a beneficent democracy

262. burgeon – to grow and flourish

Berger is a paint company which helps your apartment flourish

263. burnish – to polish

burnish sounds like furnish which means to polish

Highly burnished armour

264. castigation – punishment

If you don’t castigate him when he actually makes the mistake, he will get away with it every time and never improve himself

265. catalyst – something which ameliorates (speeds up) a reaction

you know the meaning of catalyst, now with this you know ameliorate as well..Yeah, I am awesome 🙂

266. chasten – to correct by castigation/punishment

Try to use the words which you are learning on the way.
chasten is to correct something by giving punishment, example Mohammad Asif (match-fixing) 😛

267. chicanery – trickery or fraud

The word itself looks like a tricky one !!
Back in the day, a horse trade was often tinged with fraud and chicanery

268. cozen – To mislead by trick

chicanery and cozen are brothers (synonyms)
Please learn and remember all the synonyms

269. craven – cowardly

braven is someone who is brave, thus craven is someone who is coward

270. defame – to malign (remember the Malinga example), to harm someone's reputation

A certain set of people tried to defame Sachin Tendulkar by dragging his name in match-fixing

271. demur – to express doubt

After some demur, Nitin accepted the food offered to him by the fellow passenger in the train.

Don’t be like Nitin, you should accept the cold drink as well. It goes well with the food

272. denizen – a regular visitor; inhabitant

Remember, every Den has a denizen

Easyyyy

273. Desiccate – to dry completely

In summer our lips desiccate, so keep yourself hydrated

274. discrete – distinct

Please mention the discrete responsibilities of the UN

275. doggerel – poor verse

I will try to make one, you should also comment your doggerel

When life gives you aata
Don’t make a lachcha paratha

276. dross – waste; worthless matter

An alchemist tries to create gold from dross

277. enhance – to increase; improve

Work on the enhancement of the model

278. exculpate – to clear of blame; disabuse

See, we will always get questions where we need to fill one blank with two synonyms. So, please keep the synonyms in mind.

279. exigency – crisis; urgent requirement

Dude, exigency itself sounds just like emergency 😛

280. filibuster – use of obstructive tactics in a legislature to block passage of law

This is too specific. Couldn’t find a good way to learn it, so I learnt it 😛

Keep Learning 🙂
Target 330