Data Science Terms which are often confused

Data Science = Maths+Code+Business Understanding

Many a time you come across terminologies which sound confusing. We will try to make them easier for you to understand and remember

1. Data Scientist vs Data Analyst

A Data Scientist helps you understand the “what-ifs” associated with a problem, whereas a Data Analyst extracts insights, builds reports and presents them to the client.

Data Scientists mostly work on long-term problems like building an image processor, optimizing sales routes, forecasting, etc.
Data Analysts, on the other hand, are mostly occupied with urgent and ad hoc requests. They do have the liberty to work on ML/AI, but it will not be the major chunk of their work.

A Data Scientist tries to find answers to their own questions, whereas a Data Analyst answers the questions that are asked.

And then there are Decision Scientists 😛

2. Linear Regression vs Logistic Regression

-Linear regression is used when the dependent variable is continuous and its relationship with the independent variables is roughly linear, whereas Logistic Regression is used when the dependent variable is binary in nature.
Example – Forecasting the sales of McD is a Linear Regression problem. Estimating whether a person is depressed is a Logistic Regression problem.

-Linear Regression gives you a value whereas Logistic Regression gives you the probability of success or failure

-The pre-requisite or assumption of Linear Regression is a “linear relationship between the dependent and independent variables”. Logistic Regression does not assume this; it assumes instead that the log-odds of the outcome are linear in the independent variables

-Both models work best when the independent variables are not highly correlated with each other. Strong multicollinearity makes the coefficients unstable in Linear as well as Logistic Regression

-Logistic Regression is used for binary classification
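
To make the "value versus probability" difference concrete, here is a minimal sketch. The coefficients are hypothetical numbers picked only for illustration, and the 0.5 cut-off is an assumption, not something fitted from data.

```python
import math

def linear_predict(x, slope, intercept):
    """Linear regression output: an unbounded continuous value."""
    return slope * x + intercept

def logistic_predict(x, slope, intercept):
    """Logistic regression output: a probability between 0 and 1."""
    z = slope * x + intercept            # the same kind of linear score...
    return 1.0 / (1.0 + math.exp(-z))    # ...squashed by the sigmoid function

# Hypothetical coefficients, chosen only for illustration
sales = linear_predict(10, slope=120.0, intercept=500.0)
prob = logistic_predict(10, slope=0.3, intercept=-2.0)
label = 1 if prob >= 0.5 else 0          # 0.5 is an assumed cut-off
print(sales, round(prob, 3), label)
```

The linear model returns a number on the scale of the target, while the logistic model returns a probability that you then turn into a class with a cut-off of your choice.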

3. Dependent vs Independent variable

The dependent variable is your “Y” and the independent variable is your “X” in the equation
Y = mX+constant

An independent variable is a variable that is changed in a scientific experiment to test the effects on the dependent variable.

Example – If I want to forecast the temperature of Bangalore on 15th August using variables like the temperature of the previous week, humidity, wind speed, etc., then Y is the temperature which I want to forecast and the X variables are the previous week's temperature, humidity, wind speed, etc.

4. Forecasting vs Prediction

These two look fairly similar, but there is a striking difference between them.
-Forecasting is a way of estimating future values by looking at historical data, whereas a prediction is a statement about the future that need not be backed by historical data.
-Forecasting is scientific whereas prediction is subjective (and vague)

Example – You can forecast the number of biryanis sold at Mani’s Dum Biryani on a weekend by looking at the historical data.
An astrologer can predict your future by looking at your ever-changing palm lines.

5. K-NN vs K means

– kNN stands for K Nearest Neighbors, whereas K-means gets its name from the K cluster means (centroids) that it computes
– kNN is a supervised learning algorithm used for classification and regression problems, whereas K-means is an unsupervised learning algorithm used for clustering problems
– Basically, in kNN, we set the value of K, which tells the algorithm how many neighbors of a new data point it has to consider before classifying that point into a bucket.

Example – Suppose we want to classify images into two groups, i.e. Cat and Dog. Since kNN is a supervised learning technique, there will be some co-ordinates for the already classified images

Here, the red circles are cats, the blue ones are dogs and the black rectangle is the new data point. The number of lines is equal to the value of K

Above we have a really messy co-ordinate system. You have already specified the value of K for this dataset, i.e. the number of nearest neighbors, as 6. Now whenever a new data point arrives, it draws 6 connections to the nearest points. Here you can see that the number of blue circles is 4 and the number of red circles is 2. So, the new data point will be classified as blue.
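
The voting idea can be sketched in a few lines of plain Python. The co-ordinates below are made up to mirror the figure (4 dogs and 2 cats among the 6 nearest neighbors); `knn_classify` is a hypothetical helper written for this article, not a library function.

```python
import math
from collections import Counter

def knn_classify(training_points, new_point, k):
    """Label new_point by majority vote among its k nearest labelled points."""
    by_distance = sorted(
        training_points, key=lambda item: math.dist(item[0], new_point)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up training data: ((x, y), label)
training = [
    ((5, 6), "dog"), ((6, 5), "dog"), ((4, 5), "dog"), ((5, 4), "dog"),
    ((6, 6), "cat"), ((4, 4), "cat"),
    ((0, 0), "cat"), ((1, 0), "cat"), ((0, 1), "cat"), ((10, 10), "dog"),
]

print(knn_classify(training, (5, 5), k=6))  # 4 dog votes against 2 cat votes
```

A different value of K can change the answer, which is why K is usually tuned on held-out data.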

Since K-means is an unsupervised learning method, we don’t have a training dataset with the correct output. If you are unsure about the difference between supervised and unsupervised learning methods, then go through the next point first.
K-means belongs to the family of moving centroid algorithms, i.e. at every iteration the centroid of each cluster moves slightly to minimize the objective function.

Basically, you start by placing K initial centroids on the co-ordinates. Each data point is assigned to its nearest centroid, and each centroid is then moved to the mean of the points assigned to it. These two steps are repeated until the centroids stop moving.
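
Here is a minimal sketch of that assign-then-move loop in plain Python. The points and the starting centroids are made up, and real implementations usually pick the starting centroids more carefully than this.

```python
import math

def k_means(points, centroids, iterations=10):
    """Tiny K-means: assign points to the nearest centroid, then move centroids."""
    clusters = []
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 2: move each centroid to the mean of its assigned points
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:     # stop once nothing moves
            break
        centroids = new_centroids
    return centroids, clusters

# Made-up data with two obvious groups and two assumed starting centroids
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, [(0, 0), (10, 10)])
print(centroids)
```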

6. Supervised Learning vs Unsupervised Learning

Supervised means a thing which you can monitor. Supervised learning includes all the algorithms where you know the output for some of the data. You train your model on these data points, assuming that they are correct, and then you use the model on new data.

Example – I want to know the number of customers who will come to my restaurant in November. I have the number of customers who visited my restaurant in the last 3 years. So, I have some data points from the past; I can build a forecasting model using these data points and then predict the number of customers visiting in the coming November.

Anything for which we know the output for a few data points will fall under supervised learning

A supervised learning algorithm needs some known output to build a model. An unsupervised learning algorithm needs none. It will build a model on your training dataset by finding connections between different values, and it will keep iterating until all the data points are considered. An example will help you understand better:-

Example – You have things with different geometric shapes: some are circular, some are oval, square, rectangular, etc. You need to bucket these into 4 groups. The algorithm which you will use does not know anything about the buckets; it only knows that you need 4 of them. It will most probably take the first 4 items and place them on a co-ordinate system. Each object coming in will then be allocated to the nearest of the four buckets. The algorithm will keep iterating till you are done with all the items. By the end of the run, you will have 4 buckets. This is unsupervised learning

7. Training vs Test Dataset

Suppose you have 1000 rows of data and you want to create a forecasting model. You start with an LSTM model.

Since forecasting is a supervised learning problem and you have 1000 rows of historical data, you will split your dataset 80:20, 70:30 or in any other ratio depending on the business need.

Now you would like to build your model, i.e. you want to train your model to behave in a similar manner when it sees a new dataset. The model will create a lot of “ifs and buts”; the dataset on which it builds its set of rules is called the training dataset. This dataset gives a gist of the rules governing the model.

Now you have trained your model on 80% of the data. You would not like to test the model on real-time data, right?
Because you would not like your model to break down on new data points. So, you have to test the model on a set for which you already know the output, i.e. the 20% of the data set which you have not included in training the model.
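
A minimal sketch of an 80:20 split in plain Python. The helper name `train_test_split` and the seed are choices made for this example; it shuffles the rows, which is fine for many problems, but for time-ordered forecasting data you would usually keep the most recent rows as the test set instead.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

rows = list(range(1000))          # stand-in for 1000 rows of data
train, test = train_test_split(rows)
print(len(train), len(test))
```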

This is not that difficult to understand once you build any statistical model.

8. Random Forest vs Decision Tree

If you have participated in Hackathons, then you must know these algorithms. These two, along with XGB, are among the best combinations to get a good rank on the leaderboard.

– A decision tree is built on an entire dataset, using all the features/variables of interest, whereas a random forest randomly selects observations/rows and specific features/variables to build multiple decision trees from and then averages the results.

– Random Forest is made of many Decision “Trees”. So, a Decision Tree is more like a flow chart which finally gets you to a decision. When you combine these trees, you get a forest.

– The reason why a Random Forest can be almost 99 to 100% accurate on the training dataset is that its deep trees can practically memorize the combinations of attributes that lead to the already provided output. But it might fail to give the same result on the testing dataset whenever there is a new combination of attributes. This is called overfitting.

We will keep on updating the article.

Keep Learning 🙂

The Data Monk

Linear Regression Part 3 – Evaluation of the model

Check out Part 1 and Part 2 of the series before going further

Linear Regression Part 1 – Assumption and Basics of LR
Linear Regression Part 2 – Code and implementation of model

Accuracy is not the only measure to evaluate a Linear Regression model. There are multiple ways in which you can evaluate an LR model; we will discuss four of these:-
1. R Square
2. Adjusted R-Square
3. F-Test
4. Root Mean Square Error (RMSE)

1. R Square
SST i.e. Sum of Squares Total – How far the data are from the mean
SSE i.e. Sum of Squares Error – How far the data are from the model’s predicted values

R Squared = (SST-SSE)/SST
It indicates the goodness of fit of the model.
R-squared has the useful property that its scale is intuitive: on the data used to fit the model it ranges from zero to one, with zero indicating that the proposed model does not improve prediction over the mean model, and one indicating perfect prediction. Any improvement in the regression model shows up as an increase in R-squared.
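
The formula above can be checked by hand with a small sketch. The actual and predicted values are made-up numbers, and `r_squared` is a helper written for this article.

```python
def r_squared(actual, predicted):
    """R Squared = (SST - SSE) / SST."""
    mean_actual = sum(actual) / len(actual)
    sst = sum((y - mean_actual) ** 2 for y in actual)            # distance from the mean
    sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))   # distance from the model
    return (sst - sse) / sst

# Made-up values for illustration
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]
print(r_squared(actual, predicted))          # close to 1: a good fit
print(r_squared(actual, [6.0] * 4))          # predicting the mean everywhere gives 0
```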

2. Adjusted R-Square
The more predictors you add, the better the R-squared gets. Is this statement true? If yes, then how do we counter this?
This is true: R-squared never decreases when a predictor is added, which is why we do not use R-squared as a success metric for models with a lot of predictor variables. The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it’s usually not. It is never higher than the R-squared.

Hence, if you are building a Linear Regression on multiple variables, it is always suggested that you use the Adjusted R-squared to judge the goodness of the model.

R-squared measures the proportion of the variation in your dependent variable (Y) explained by your independent variables (X) for a linear regression model. Adjusted R-squared adjusts the statistic based on the number of independent variables in the model.

Adjusted R-squared will decrease as predictors are added if the increase in model fit does not make up for the loss of degrees of freedom.
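
The usual formula is Adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of rows and p the number of predictors. The sketch below uses made-up R-squared values to show how a small gain in R-squared can still lower the adjusted value when many predictors are added.

```python
def adjusted_r_squared(r2, n_rows, n_predictors):
    """Penalize R-squared for the number of predictors used."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)

# Hypothetical models fitted on the same 50 rows
few = adjusted_r_squared(0.900, n_rows=50, n_predictors=2)
many = adjusted_r_squared(0.905, n_rows=50, n_predictors=10)
print(round(few, 4), round(many, 4))   # the 10-predictor model scores lower
```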

3. F-Test
The F-test evaluates the null hypothesis that all regression coefficients are equal to zero versus the alternative that at least one is not. An equivalent null hypothesis is that R-squared equals zero


4. Root Mean Square Error (RMSE)
Root Mean Square Error takes the difference between each predicted and actual value, squares it, averages the squared differences over the total number of terms, and then takes the square root of that average.

As you can observe, the RMSE penalizes large differences in prediction quite heavily because each difference is squared.
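
A minimal sketch of the calculation, with made-up values. The mean absolute error is printed next to it only to show how the single large miss pulls the RMSE up.

```python
import math

def rmse(actual, predicted):
    """Square the errors, average them, then take the square root."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mae(actual, predicted):
    """Mean absolute error, for comparison."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up values: two small misses and one large miss
actual = [100, 200, 300]
predicted = [110, 190, 330]
print(round(rmse(actual, predicted), 2), round(mae(actual, predicted), 2))
```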

There are other methods also to determine the performance of the Linear Model. These three articles will definitely help you to kick-start your “modeling career”


Linear Regression Part 2 – Implementation of LR

We already know the assumptions of Linear Regression. We will quickly go through the implementation part of Linear Regression. We will be using R for this article (Read – sometimes I am more comfortable in R 😛)

Remember – The easiest part in any modeling project is to implement the model. The major pain point is cleaning the data, understanding important variables, and checking the performance of the model.

We will quickly go through the packages which you need, the functions and terminologies you need to understand in order to run a Linear Regression Model.

Packages in R
You don’t need any specific package to run the lm() function in R which is used to create a Linear Regression model.

Step 1 – Get your train and test dataset in a variable.
Remember – The names of the columns and the number of columns in both the datasets should be the same.

# Forward slashes avoid having to escape backslashes in a Windows path
pizza <- read.csv("C:/Users/User/Desktop/TDM Book/PaulPizza.csv")
pizzaTest <- read.csv("C:/Users/User/Desktop/TDM Book/PaulPizzaTest.csv")

pizza contains the training dataset and pizzaTest contains the testing dataset

Top rows of the training dataset
Top rows of the testing dataset

You can also do a multiple fold validation (k-fold cross-validation) to randomly decide the training and test datasets. But we will try to keep it straight and simple, so we have manually picked the training and testing datasets. We do not encourage you to do it this way. Anyway, we now have the training and testing datasets with us.

LinearModel <- lm(NumberOfPizzas ~ Spring+WorkingDays,
data = pizza)

You are creating a LR model by the name of LinearModel. The function lm() takes the dependent variable and two independent variables i.e Spring and Working Days. The dataset is pizza i.e. the training dataset of the problem statement.

Let’s look into the summary to analyze the model


What are the residuals?
In regression analysis, the difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual. For a least-squares model with an intercept, both the sum and the mean of the residuals are equal to zero.

What is coefficient-estimate?
The first row, the intercept, is the expected number of pizzas sold in a month when all the independent variables are zero. Since the accuracy of the model is poor, the number is quite off: it shows that on average 156691 pizzas are predicted to be sold. The remaining rows show the impact of each variable on this estimate; these are called the slope terms. For example, the slope of Summer is 169.29, which is the estimated effect that Summer has on the predicted value.

What is coefficient-standard error?
It suggests the average amount by which the coefficient estimate varies from the actual value. The standard error can be used to compute an estimate of the expected difference, for example a confidence interval around the coefficient.

What is coefficient-t value?
The t-value tells you how many standard errors the coefficient is away from 0. The farther it is, the easier it is to reject the null hypothesis, i.e. we can declare a relationship between Y and x1, x2, etc. In the above case only the Working Days coefficient has a t-value relatively far away from 0, which suggests a significant relationship between the number of pizzas sold and Working Days. This also helps us in understanding the p-value of that variable.

What should be the value of a good variable which we should include in our model?
A good variable for the model will have the following attributes:-
i.   High absolute t value of the coefficient
ii.  Very low Pr(>|t|) value

What according to you is the most important variable in the model summary given above?
Looking at the t value and Pr value, Working Days seems to be the most relevant variable for the model

Now you need to use this model to predict the values for the testing dataset, which is stored in pizzaTest


pred_model <- predict(LinearModel,pizzaTest)

You can check the accuracy of the predicted values by comparing them with the actual data for these three months.

There are more ways to check the performance of a Linear Regression model which we will discuss in the next article.

Linear Regression Part 1 – Assumptions and basics of LR
Linear Regression Part 3 – Evaluation of the model


Linear Regression Part 1 – Assumptions of LR

Here we will talk about the concepts which will let you understand Linear Regression and will help you tackle questions in your interview. I personally know a handful of candidates who have boasted of being masters of Linear Regression but have failed to answer basic questions (Read – Me :P)

That’s why we have tried to consolidate all the learnings from my peer group.

To understand a Data Science algorithm you need to cover at least these three things:-
1. Assumptions
2. Code and Mathematical Formulas
3. Evaluating Performance of your model

Regression is a method of modeling a dependent value based on independent variables. This model is used extensively to solve a basic forecasting problem.

Do you want to know what the next question asked by the panel was (Lowe’s Technical Interview Round)?

What is the difference between forecasting and prediction?
These two do look the same, but there is a striking difference between them. Forecasting is based on historical data, and it’s mostly about projecting a line of future values by extending the historical trend.

Prediction is a judgment. It takes into account changes which may take place in the future.

When you say that it will rain tomorrow by looking at the historical data, that’s forecasting, but reading a palm and telling your future is an example of prediction. So, be ready for all types of questions.

Let’s get back to Linear Regression. In layman’s terms, we can say that Regression is a method to predict the outcome of a variable in the best possible way given the past data and its outcomes.

Example – Can you forecast the salary of the age group 40-45 using the given data?


You can guess that the Salary would be somewhere in the range of $4,000 – $4,500 looking at the already given outcomes. This is how a basic Linear Regression works.

The core of Linear Regression is to understand how the outcome variable is dependent on the independent variables.

What are the assumptions of Linear Regression?

This is one of the most asked interview questions around Linear Regression. There are five major assumptions of a Linear Regression:-
1. Linear Relationship
2. Multivariate Normality
3. Low or No Multicollinearity
4. No autocorrelation
5. Homoscedasticity

You don’t have to memorize these points; you need to understand each of them before diving into the implementation part.

1. Linear Relationship – A linear regression assumes that there is a straight-line relationship between X and Y in the equation given below

Y = Bo + B1X + Epsilon

Y = Bo + B1X is nothing but an equation of straight line

To do any statistical inference, we need to make some assumptions about the error term, which is represented as Epsilon. The first assumption comes into the picture here, where we assume three things about this random error term:-

a. Mean of the error is 0
b. Error is normally distributed
c. Error has a constant variance

Read the following line at least thrice to understand it

“Error Term ~ N(0, Variance) = This shows that every error term is normally distributed and has a mean of 0 and a constant variance”

Remember the equation Y = Bo+B1X+Epsilon ??

Now we can re-write the equation as Y ~ N(Bo+B1X,Variance)

This means that the Y is normally distributed with a mean of Bo+B1X and a constant variance.

So the first assumption goes like this “The dependent variable is a linear combination of the independent variable and the error term”

You can check the relationship between the dependent and independent variable by a scatter plot

A low or little linearity present in the dataset

2. Multivariate Normality

The second assumption states that every linear combination of Y and the independent variables needs to have a univariate Normal distribution. In practice this is usually checked on the residuals of the model, for example with a histogram or a Q-Q plot. Multivariate normality is not a fancy term; it is just a generalization of the one-dimensional normal distribution to higher dimensions

3. Low or No Multicollinearity

This one is easy to understand. Suppose you want to find out the selling price of a car and your independent variables are the age of the car, kilometers driven, health of the engine, etc.

Now, we know that the number of kilometers and the age of the car will have a high correlation (generally). The number of kilometers traveled by the car will increase with the age of the car.

Linear Regression assumes that the independent variables are not highly correlated. Using two variables with a high correlation makes the coefficient estimates unstable and adds little new information to the model. So you need to chuck one of these. We will talk about two ways in which you can decide which of the two variables, age or kilometers, to remove

1. Variance Inflation Factor (VIF)
As a common rule of thumb, a VIF above 10 (some practitioners use 5) is an indication of high multicollinearity. In layman’s terms, remove the variable with the high VIF value

2. Correlation Matrix – Plot a correlation matrix to understand the strength of the correlation between two variables. Take out one of the two variables at a time and check the performance of the model
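
For the special case of just two predictors, the VIF reduces to 1 / (1 - r²), where r is the correlation between them. The sketch below uses made-up age and kilometer values; with more than two predictors you would regress each predictor on all the others instead.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Made-up cars: age in years and distance driven in thousands of kilometers
age_years = [1, 2, 3, 4, 5]
kilometers = [12, 19, 33, 41, 52]

r = pearson_r(age_years, kilometers)
vif = 1 / (1 - r ** 2)        # valid as written only for two predictors
print(round(r, 3), round(vif, 1))
```

Here the two variables move almost in lockstep, so the VIF comes out far above the rule-of-thumb threshold and one of them should be dropped.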

4. No Autocorrelation (low or no correlation within the data)

In the last point we talked about the multicollinearity between the independent variables. Now, we want to check if there is a correlation within the data itself.

In simple language, if the value of f(x+1) is dependent on f(x), then the data is autocorrelated. A classic example is a share price, where the price depends on the previous value. You can check the autocorrelation of the data with either a plot of the residuals in time order or a Durbin-Watson test. The null hypothesis of the Durbin-Watson test is that “the residuals are not linearly autocorrelated”

The test statistic d lies between 0 and 4. As a rule of thumb, if 1.5<d<2.5 then the values are not auto-correlated.

See, if the data itself is correlated, then it would be hard to know the impact of the other variables on Y. So it is assumed that there is little or no correlation within the data
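
The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A minimal sketch with made-up residuals:

```python
def durbin_watson(residuals):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2); values near 2 mean no autocorrelation."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Made-up residual series
trending = [1, 2, 3, 4]          # each residual follows the previous one
alternating = [1, -1, 1, -1]     # each residual flips the previous one
print(durbin_watson(trending), durbin_watson(alternating))
```

A value well below 2 points to positive autocorrelation and a value well above 2 to negative autocorrelation.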

5. Homoscedasticity

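If the variance of the residuals is equal across the regression line, then the data is homoscedastic. If the spread of the residuals grows or shrinks along the line, the data is heteroscedastic and this assumption is violated.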

We will look into the implementation part in the next article which will be followed by the evaluation of performance.

Linear Regression Part 2 – Implementation of LR
Linear Regression Part 3 – Evaluation of the model


Guesstimate – Price of one Kilogram Potato in India ?

Let’s try to guesstimate the price of one kg potato in India. There could be multiple ways to do it, but your aim should be to keep the “scope of error” to the minimum.

To estimate the price of one kg potato, you can take up any food item which has potato in it.

You can estimate the price by using french fries, Potato burger, etc.

I will choose the classic Indian snack, the samosa, to guesstimate the price of a kg of potato

Why Samosa and why not french fries?

Advantages of choosing Samosa –
– We know the lowest price of a Samosa i.e. Rs. 10 from any roadside vendor
– You can easily estimate the operational cost and profit of a roadside vendor

Disadvantages of French fries:-
– The price range of French fries is huge; it ranges from Rs. 50 to Rs. 200
– If you want to estimate the price of potato using French fries from McDonald’s, then you need to estimate its far more complex operational cost.

We pick the item with the fewest unknowns to reduce the error percentage and arrive at a close number.

– One samosa weighs 150 gms
– Price of one samosa is Rs. 10

We will divide the price of a samosa in 4 parts
– Profit
– Cost Price of Potato
– Price of Flour, oil,spice and gas
– Labor cost(if any)

We can safely assume that flour will weigh around 30 gm in each samosa and the rest will be potato, so we have 120 gm of potato

Profit – 30% (You can take it anywhere in the range of 20 to 40%; it seems logical to expect such a high profit percentage considering the fact that the Selling Price is on the lower side)

Profit – Rs. 3

Flour, oil, spices, and gas – The flour and oil used by roadside vendors are generally of cheap quality, so we can assume a cost of 20% for these items, i.e. Rs. 2 for the other ingredients

Operational Cost – There will be some operational cost like the salary of one or two permanent staff and the rent of the place. Let’s take it as 20% i.e. Rs. 2

Now we are left with only Potato, which is the main ingredient of Samosa

Rs. 10 - Rs. 3 - Rs. 2 - Rs. 2 leaves Rs. 3 for 120 gm of Potato. Use the unitary method to solve it further

120 gm = Rs. 3
1000 gm = Rs((3/120)*1000) = Rs. 25 per Kg

Roadside vendors mostly get potatoes at the wholesale rate. So, you can also assume that the normal market rate of one kilogram of potato is 1.2 times the wholesale rate

Rs.(25*1.2) = Rs. 30 per Kg
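
The whole chain of assumptions can be written down as a small sketch, which makes it easy to see how the answer moves if you change any one of them. Every number is an assumption taken from the text above, not measured data.

```python
# All inputs are the assumptions made in this guesstimate
samosa_price = 10.0          # Rs. per samosa
profit_pct = 30              # vendor profit
ingredients_pct = 20         # flour, oil, spices and gas
operations_pct = 20          # staff and rent
potato_grams = 120           # potato in one samosa
retail_markup = 1.2          # market rate relative to wholesale

profit = samosa_price * profit_pct / 100
ingredients = samosa_price * ingredients_pct / 100
operations = samosa_price * operations_pct / 100
potato_cost = samosa_price - profit - ingredients - operations

wholesale_per_kg = potato_cost * 1000 / potato_grams   # unitary method
retail_per_kg = wholesale_per_kg * retail_markup
print(potato_cost, wholesale_per_kg, retail_per_kg)
```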

Enjoy your samosa /_\

Happy Learning

The Data Monk

The measure of Spread in layman terms

Data Science is a combination of Statistics and Technology. In this article, we will try to understand some basic terminologies in Layman’s language.

Suppose I run a chain of pizza outlets across Bangalore and have around 500 delivery boys. We have assured “less than 30 minutes delivery time” to our customers, but while going through the feedback forms, we get the feeling that the delivery executives are taking more than the promised time.

Null Hypothesis – The average delivery time is at most 30 minutes, i.e. the promise is being kept. It is represented by Ho

Alternate Hypothesis – The average delivery time is more than 30 minutes. It is represented as Ha.

We start by assuming that the Null Hypothesis is true and then check whether the data gives us enough evidence to reject it.

Population – Your total population is 500, which is the number of delivery boys

Sample – It’s not feasible to test the delivery time of each delivery boy, so we randomly take a small fragment of the population which is termed as Sample

You must have heard that a ‘p-value of 0.05 is good’, but what does that actually mean?

The p-value helps you in determining the significance of your result in a hypothesis test. It is the probability of seeing a result at least as extreme as yours if the Null Hypothesis were true. So, when you say that the p-value of the test is less than 0.05, you are saying “There is strong evidence against the Null Hypothesis and we can reject it”

Similarly, when the p-value is more than 0.05, we fail to reject the Null Hypothesis, as there is only weak evidence against it.

In layman’s terms, if the hypothesis test results in a p-value less than 0.05 for the case mentioned above, then we will reject the Null Hypothesis and conclude that the average time to deliver a pizza is more than 30 minutes.
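
As a rough sketch of how such a p-value could be computed, here is a one-sided z-test using only the Python standard library. The sample size, mean and standard deviation are hypothetical numbers, and the normal approximation is an assumption; with a small sample a t-test would be the more usual choice.

```python
import math
from statistics import NormalDist

# Hypothetical sample summary: 40 deliveries, mean 32.5 min, std dev 5 min
sample_size = 40
sample_mean = 32.5
sample_std = 5.0
promised_time = 30.0   # the delivery time under the Null Hypothesis

standard_error = sample_std / math.sqrt(sample_size)
z_score = (sample_mean - promised_time) / standard_error

# One-sided test: how likely is a mean this high if the promise were kept?
p_value = 1 - NormalDist().cdf(z_score)
reject_null = p_value < 0.05
print(round(z_score, 2), reject_null)
```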

You must have got a fair bit of idea about population, sample, null hypothesis, alternate hypothesis, and p-value.

Let’s get back to sampling. There are four methods to get a segment out of a population i.e. sampling of a population:-

a. Random Sampling – Completely random selection
b. Systematic Sampling – A systematic way to sample a population, like taking every kth record from the population
c. Stratified Sampling – Suppose the complete population is divided into multiple groups; stratified sampling will take a sample from each group. This reduces the bias of the sample.
If we have a data set of people of different age groups, then a random sample might be biased towards a particular group. But stratified sampling takes care of this
d. Cluster Sampling – When a population is divided into different clusters, we randomly pick a few whole clusters and study the members of only those clusters
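
The four methods can be sketched with the standard `random` module. The population of 500 people and the four zones are made up for illustration, as are the sample sizes.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
zones = ("north", "south", "east", "west")          # hypothetical groups
population = [{"id": i, "zone": zones[i % 4]} for i in range(500)]

# a. Random sampling: 50 people picked completely at random
simple = rng.sample(population, 50)

# b. Systematic sampling: every 10th record from a random starting point
k = 10
systematic = population[rng.randrange(k)::k]

# Group the population by zone for the next two methods
groups = {}
for person in population:
    groups.setdefault(person["zone"], []).append(person)

# c. Stratified sampling: roughly 10% drawn from every zone
stratified = [p for members in groups.values()
              for p in rng.sample(members, len(members) // 10)]

# d. Cluster sampling: pick 2 whole zones and keep everyone in them
chosen_zones = rng.sample(sorted(groups), 2)
cluster = [p for zone in chosen_zones for p in groups[zone]]

print(len(simple), len(systematic), len(stratified), len(cluster))
```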

We have the data set i.e. the population and we have taken a sample from it.
Now, we need to know the spread of the sample or the population.

I assume that you already know about mean, median, mode, etc.

The measure of spread describes how similar or varied a set of observed values are for a variable. The measure of spread includes:-

a. Standard Deviation
b. Variance
c. Range
d. Inter Quartile Range
e. Quartile

You can easily find a copy-book definition on the internet. Let’s try to understand it in simple terms.

The mean gives you an idea of the average of the data.
Suppose the average salary of the 5000 employees at Walmart is $100,000.

The variance will give you an idea about the spread of the salary, i.e. how far the data points are from the mean. We calculate the variance on either the complete population or a sample of it.

Population variance = Sum of (x - population mean)^2 / N
Sample variance = Sum of (x - sample mean)^2 / (N-1)

Both the formulas are almost the same; the only difference is the denominator. If you just want to memorize the formulas, that is also fine. But to understand the denominator, you need to go through the concept of degrees of freedom. Let’s try

Degree of freedom is nothing but the number of observations in the data that are free to vary when estimating a parameter. In simple words, if you know the mean or average of 5 numbers, then all you need to know is 4 numbers and you can easily get the 5th number. Here the degree of freedom is n-1 i.e. 4. This is an example of the degree of freedom.

Now, why does the Population variance have N and the Sample variance have N-1 as the denominator?

When we have a population of size 1000 and we have to calculate the Population variance, we already know the true mean of the population, thus we divide by N.

Read this aloud – When we only know the mean of the sample, the data points sit closer to that sample mean than to the true population mean, so dividing by N would underestimate the spread. We divide by N-1 to “compensate for the fact that we don’t have concrete information about the population, thus we keep the overall value slightly larger by dividing it by N-1”
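
Python’s standard `statistics` module has both versions, which makes the difference in the denominator easy to see. The salary figures below are made up.

```python
import statistics

# Made-up salaries (in thousands of dollars)
salaries = [90, 95, 100, 105, 110]

population_variance = statistics.pvariance(salaries)   # divides by N
sample_variance = statistics.variance(salaries)        # divides by N-1
print(population_variance, sample_variance)
```

The sum of squared distances from the mean is 250 in both cases; dividing by 5 or by 4 is the only difference.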

Quartile is a number and not a group of values. It’s more like a cut-off value. You have 3 quartiles in statistics

Quartile 1 – Also called 25th percentile. This is the middle number between the smallest number and the median of the data set.

Quartile 2 – Also called the median and the 50th percentile

Quartile 3 – Also called the 75th percentile and upper quartile. This is the middle value between the median and the highest value of the data set

To find the quartiles of a data set, use the following steps:

  1. Order the data from least to greatest.
  2. Find the median of the data set and divide the data set into two halves.
  3. Find the median of the two halves.

Interquartile Range = Q3-Q1
It is the midspread, i.e. the range covered by the middle 50 percent of the dataset. It’s also a single value and not a group of numbers

Bonus knowledge

How to identify an outlier?

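A basic rule is that whenever a data point is more than 1.5 times the Interquartile Range above the third quartile (greater than Q3 + 1.5*IQR) or more than 1.5 times the Interquartile Range below the first quartile (less than Q1 - 1.5*IQR), it’s termed as an outlier.

Have a look at an example and it will make sense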

Numbers – 2,4,5,7,9,11,13

Median = 4th term as we have 7 terms and the numbers are arranged in ascending order. Thus median(Q2) is 7

Quartile 1(Q1) = Median of the dataset containing the lower half of the data i.e. calculate the median of 2,4,5. Thus Q1 will be the 2nd term i.e. 4

Quartile 3(Q3) = Median of the upper half of the data i.e. median of 9,11,13. Thus median is 11

(2,4,5),7,(9,11,13) ~ (Q1),Q2,(Q3)

Inter Quartile Range = Q3-Q1 = 11-4 = 7
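
The same steps can be sketched in Python using the median-of-halves method described above. The `quartiles` helper is written for this article, and the extra value 40 is added only to show the common 1.5 × IQR outlier rule in action.

```python
import statistics

def quartiles(values):
    """Q1, Q2, Q3 using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # With an odd count, the median itself is left out of both halves
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return (statistics.median(lower),
            statistics.median(data),
            statistics.median(upper))

numbers = [2, 4, 5, 7, 9, 11, 13]
q1, q2, q3 = quartiles(numbers)
iqr = q3 - q1

# Outlier fences under the 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in numbers + [40] if x < lower_fence or x > upper_fence]
print(q1, q2, q3, iqr, outliers)
```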


100 Natural Language Processing Questions in Python

  1. What is NLP?
    NLP stands for Natural Language Processing and it is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from the text data in a smart and efficient manner.

  2. What are the uses of NLP?
    Natural Language Processing is useful in various domains like Chat bots, Extracting insights from feedback and surveys, text-classification, etc.

  3. What are the different algorithms in NLP?
    NLP is used to analyze text, allowing machines to understand how humans speak.
    This human-computer interaction enables real-world applications like
    a. automatic text summarization
    b. sentiment analysis
    c. topic extraction
    d. named entity recognition
    e. parts-of-speech tagging
    f. relationship extraction
    g. stemming, and more.
    NLP is commonly used for text mining, machine translation, and automated question answering.
  4. What problems can NLP solve?
    NLP can solve many problems like, automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation etc.
  5. What is Regular Expression?
    A regular expression (sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. “find and replace”-like operations.
    Regular expressions are a generalized way to match patterns with sequences of characters.
  6. What are the different applications of Regular Expression in Data Science?
    a. Search engines like Google, Yahoo, etc. The Google search engine understands that you are a tech person, so it shows you results related to your interests.
    b. Social websites feed like the Facebook news feed. The news feed algorithm understands your interests using natural language processing and shows you related Ads and posts more likely than other posts.
    c. Speech engines like Apple Siri.
    d. Spam filters like Google spam filters. It’s not just about the usual spam filtering, now spam filters understand what’s inside the email content and see if it’s a spam or not.
  7. What are the packages in Python to help in Regular Expression?
    The package which we commonly use for regular expressions is re. We can import the package using the following command

    import re
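  8. What is the match function?
    re.match() checks whether a pattern matches at the beginning of a string. It returns a match object if it does and None otherwise.

    import re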

  9. What are the common patterns used in regular expression?
    \w+ -> word
    \d -> digit
    \s -> space
    .* -> wildcard
    + or * -> greedy match
    \S -> anti space i.e. it matches anything which is not a space
    [A-Z] – matches any single character in the range of capital A to capital Z
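    The patterns above can be tried out with a short sketch like the one below. The sample sentence and variable names are made up for illustration; only the standard re module is used.

```python
import re

text = "Order 66 shipped to Bangalore on Day 3"

words = re.findall(r"\w+", text)        # runs of letters, digits, underscore
digits = re.findall(r"\d+", text)       # runs of digits
capitals = re.findall(r"[A-Z]", text)   # single uppercase letters
non_space = re.findall(r"\S+", text)    # runs of non-whitespace characters
greedy = re.match(r".*", text).group()  # .* greedily matches the whole line

print(words)
print(digits)
print(capitals)
```

    Here \S+ gives the same pieces as splitting on whitespace, and the greedy .* swallows the whole line because nothing tells it to stop earlier.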
  10. What are the important functions to use in Regular Expression?
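    findall() – It finds all the non-overlapping matches of a pattern in a string and returns them as a list
    search() – It searches for the first occurrence of a pattern anywhere in the string
    match() – It matches a pattern only at the beginning of the string
    split() – It splits a string wherever the pattern matches. It returns a list object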
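    A minimal sketch of the four functions side by side, assuming a made-up sample sentence:

```python
import re

sentence = "The Data Monk has 30 books and 100 questions"

all_numbers = re.findall(r"\d+", sentence)    # every match, as a list
first_number = re.search(r"\d+", sentence)    # first match anywhere
at_start = re.match(r"The", sentence)         # only checks the beginning
not_at_start = re.match(r"Data", sentence)    # None: 'Data' is not at index 0
pieces = re.split(r"\s+", sentence)           # split on whitespace

print(all_numbers)
print(first_number.group())
print(not_at_start)
```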
  11. What is the difference between match and search function?
    Match tries to match the string from beginning whereas search matches it wherever it finds the pattern. The below example will help you understand better

    import re
    print(re.match('kam', 'kamal'))
    print(re.match('kam', 'nitin kamal'))
    print(re.search('kam', 'nitin kamal'))
    <re.Match object; span=(0, 3), match='kam'>
    None
    <re.Match object; span=(6, 9), match='kam'>

  12. Guess the output of the following
    import re
    re.split(r'\s', 'The Data Monk is cool')

    ['The', 'Data', 'Monk', 'is', 'cool']

  13. Find the output of the following
    import re
    regx = r"\w+"
    strx = "This isn't my pen"
    print(re.findall(regx, strx))

    ['This', 'isn', 't', 'my', 'pen']
  14. How to write a regular expression to match some specific set of characters in a string?
    special_char = r"[?/}{';]"
    The above regular expression matches any single character listed between the square brackets

  15. Write a regular expression to split a paragraph every time it finds an exclamation mark

    import re
    exclamation = r"[!]"
    strr = "Data Science comprises of innumerable topics! The aim of this 100 Days series is to get you started assuming ! that you have no prior! knowledge of any of these topics. "
    excla = re.split(exclamation, strr)
    print(excla)

    ['Data Science comprises of innumerable topics', ' The aim of this 100 Days series is to get you started assuming ', ' that you have no prior', ' knowledge of any of these topics. ']
  16. Get all the words starting with a capital letter

    capital = r"[A-Z]\w+"
    print(re.findall(capital, strr))

    ['Data', 'Science', 'The', 'Days']

  17. Find the output of the following code
    digit = "12 34 98"
    find_digit = r"\d+"
    print(re.findall(find_digit, digit))

    ['12', '34', '98']

  18. What is tokenization?
    Tokenization is one of the most important parts of NLP. It simply means breaking down the string into smaller chunks. It breaks the paragraph into sentences, words, etc.
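    To make the idea concrete, here is a rough regex-only sketch of word and sentence tokenization. It is an illustration with a made-up sentence, not how NLTK's tokenizers work internally, and it will mis-split abbreviations such as "Dr." that a real tokenizer handles.

```python
import re

text = "This is awesome! NLP is fun. Is it hard?"

# Word-level tokens: runs of word characters, or single punctuation marks
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Sentence-level tokens: split on whitespace that follows ., ! or ?
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

print(word_tokens)
print(sentence_tokens)
```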
  19. What is NLTK?
    NLTK stands for Natural Language Toolkit and it is a package in Python which is very commonly used for tokenization.

    from nltk.tokenize import word_tokenize
    word_tokenize("This is awesome!")

    ['This', 'is', 'awesome', '!']

  20. What are the important nltk tokenizers?

    sent_tokenize – Tokenizes a document into sentences
    TweetTokenizer – This one is exclusively for tweets, which can come in handy if you are trying to do sentiment analysis by looking at a particular hashtag or tweets
    regexp_tokenize – Tokenizes a string or document based on a regular expression pattern

  21. What is the use of the function set() ?
    The data type set is a collection. It contains an unordered collection of unique and immutable objects. So when you extract a set of words from a novel, then it will get you the distinct words from the complete novel. It is a very important function and it will continue to come in the book as you go ahead.
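    A tiny sketch of how set() collapses duplicates, using a made-up word list:

```python
words = ["data", "science", "data", "monk", "science", "data"]

unique_words = set(words)          # duplicates collapse, order is not preserved
print(len(words), len(unique_words))
print(sorted(unique_words))        # sort for a stable display order
```

    Because a set is unordered, sort it (or convert it to a list) whenever you need a repeatable order.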
  22. Tokenize the paragraph given below into sentences.

    from nltk.tokenize import sent_tokenize
    from nltk.tokenize import word_tokenize

    para = "This is the story about Piyush,29, Senior Data Scientist at Imagine Incorporation and myself, Pihu,24, Junior Data Scientist at the same organization. This is about the journey of Piyush once he retired from his job, after being unsatisfied with the way his career was moving ahead. Be with Piyush and Pihu to understand Data Science and Machine Learning."
    sent = sent_tokenize(para)
    print(sent)

      [‘This is the story about Piyush,29, Senior Data Scientist at Imagine Incorporation and myself, Pihu,24, Junior Data Scientist at the same organization.’, ‘This is about the journey of Piyush once he retired from his job, after being unsatisfied with the way his career was moving ahead.’, ‘Be with Piyush and Pihu to understand Data Science and Machine Learning.’]
  23. Now get all the words from the above paragraph

    word = word_tokenize(para)
    print(word)

    [‘This’, ‘is’, ‘the’, ‘story’, ‘about’, ‘Piyush,29’, ‘,’, ‘Senior’, ‘Data’, ‘Scientist’, ‘at’, ‘Imagine’, ‘Incorporation’, ‘and’, ‘myself’, ‘,’, ‘Pihu,24’, ‘,’, ‘Junior’, ‘Data’, ‘Scientist’, ‘at’, ‘the’, ‘same’, ‘organization’, ‘.’, ‘This’, ‘is’, ‘about’, ‘the’, ‘journey’, ‘of’, ‘Piyush’, ‘once’, ‘he’, ‘retired’, ‘from’, ‘his’, ‘job’, ‘,’, ‘after’, ‘being’, ‘unsatisfied’, ‘with’, ‘the’, ‘way’, ‘his’, ‘career’, ‘was’, ‘moving’, ‘ahead’, ‘.’, ‘Be’, ‘with’, ‘Piyush’, ‘and’, ‘Pihu’, ‘to’, ‘understand’, ‘Data’, ‘Science’, ‘and’, ‘Machine’, ‘Learning’, ‘.’]

  24. Now get the unique words from the above paragraph

    print(set(word))

    {‘retired’, ‘ahead’, ‘the’, ‘about’, ‘with’, ‘Piyush,29’, ‘Senior’, ‘Piyush’, ‘being’, ‘Science’, ‘was’, ‘Imagine’, ‘at’, ‘journey’, ‘way’, ‘same’, ‘and’, ‘Pihu’, ‘Pihu,24’, ‘Learning’, ‘from’, ‘story’, ‘he’, ‘Be’, ‘Machine’, ‘once’, ‘to’, ‘unsatisfied’, ‘Junior’, ‘of’, ‘career’, ‘Data’, ‘moving’, ‘is’, ‘understand’, ‘.’, ‘myself’, ‘after’, ‘job’, ‘,’, ‘Incorporation’, ‘Scientist’, ‘organization’, ‘This’, ‘his’}

  25. What is the use of .start() and .end() function?

    Basically .start() and .end() help you find the starting and ending index of a match. Below is an example:

    x = re.search("Piyush", para)
    print(x.start(), x.end())

    24 30
  26. What is the OR method?
    The OR operator, written as a pipe (|), lets a regular expression match one pattern or another. See the example below:-

    x = r"\d+|\w+"

    The above regex will get you all the words and numbers, but it will ignore other characters like punctuation, ampersand, etc. Spaces inside a pattern are matched literally, so do not put spaces around the pipe.
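    The effect of the pipe, and of stray spaces around it, can be checked with a short sketch. The sentence is made up for illustration.

```python
import re

text = "He bought 11 cats & 2 dogs."

# Numbers or words; punctuation and the ampersand are skipped
tokens = re.findall(r"\d+|\w+", text)

# Same pattern with spaces around the pipe: the spaces become part of the match
spaced = re.findall(r"\d+ | \w+", text)

print(tokens)
print(spaced)
```

    With the spaces in place, "He" is lost because it is not preceded by a space, and every match carries a space with it.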

  27. What are the advanced tokenization techniques?
    Advanced tokenization combines character classes and quantifiers in a regex. Take for example [A-Za-z]+, which matches runs of letters regardless of upper or lower case

  28. How to write a regex to match spaces or commas?
    (\s+|,) – The \s+ will match one or more spaces, and the pipe acts as an OR operator to take the comma into consideration

  29. How to include special characters in a regex?
    If you have any experience with regular expressions or SQL queries, then this syntax will look familiar. You need to put a backslash before any special character, as shown below

    (\,|\.|\?) – This will match a comma, a full stop or a question mark in the text
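    A small sketch of why the backslash matters, using made-up strings:

```python
import re

text = "Really? Yes. 3.14 is pi, roughly."

any_char = re.findall(r".", "a.b")          # unescaped dot matches every character
literal_dot = re.findall(r"\.", "a.b")      # escaped dot matches only '.'
marks = re.findall(r"\,|\.|\?", text)       # comma, full stop or question mark

print(any_char)
print(literal_dot)
print(marks)
```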
  30. What is the difference between (a-z) and [A-Z]?
    This is a very important concept. When you specify (a-z), it will only match the literal string "a-z". But when you specify [A-Z], it matches any single letter between upper case A and Z.
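    The difference shows up quickly in a sketch like this one (the sample string is invented):

```python
import re

sample = "Grades run a-z but we print A to Z"

group_match = re.findall(r"(a-z)", sample)   # parentheses only group; matches 'a-z' literally
class_match = re.findall(r"[A-Z]", sample)   # character class; any one uppercase letter

print(group_match)
print(class_match)
```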
  31. Once again go through the difference between search() and match() function
    search() will find your regex anywhere in the string, but match() always looks only at the beginning of the string. If the pattern does not match right at the start, match() returns None and does not look any further. Be careful when choosing between these two functions
  32. What is topic modeling?
    In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.
  33. What is bag-of-words?
    Bag-of-words is a simple way to represent a text by the frequency of each token in it, ignoring word order. The most frequent tokens often hint at the topic of the text. The example below illustrates the concept

    para = "The game of cricket is complicated. Cricket is more complicated than Football"

    The – 1
    game – 1
    of – 1
    cricket – 1
    is – 2
    complicated – 2
    Cricket – 1
    more – 1
    than – 1
    Football – 1

    As you can see, "cricket" and "Cricket" are counted as two different words because bag-of-words is case sensitive.
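    The counts can be reproduced with a short sketch. It uses a plain regex instead of an NLTK tokenizer to stay self-contained, which is an assumption of this example rather than the approach used elsewhere in this list.

```python
import re
from collections import Counter

para = "The game of cricket is complicated. Cricket is more complicated than Football"

tokens = re.findall(r"\w+", para)            # simple word tokens, punctuation dropped
case_sensitive = Counter(tokens)
case_insensitive = Counter(t.lower() for t in tokens)

print(case_sensitive["cricket"], case_sensitive["Cricket"])
print(case_insensitive["cricket"])
```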

  34. How to counter the case sensitive nature of bag-of-words?
    It's a logical question. Just convert every word to lower or upper case and then count the words. See question 37, which converts every word to lower case using a list comprehension.
  35. What is counter?
    A counter is a container that keeps count of number of times equivalent values are added. It looks similar to dictionary in Python. Counter supports three forms of initialization. Its constructor can be called with a sequence of items, a dictionary containing keys and counts, or using keyword arguments mapping string names to counts.
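    The three forms of initialization can be sketched as follows (the letters and counts are arbitrary):

```python
from collections import Counter

from_sequence = Counter(["a", "b", "a", "c", "a"])   # a sequence of items
from_mapping = Counter({"a": 3, "b": 1, "c": 1})     # a dictionary of counts
from_keywords = Counter(a=3, b=1, c=1)               # keyword arguments

print(from_sequence == from_mapping == from_keywords)
print(from_sequence.most_common(1))
```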

  36. How to import Counter in Python?
    Counter is present in the collections module of the standard library. You can use it directly by importing it like below:

    from collections import Counter

  37. Use the same paragraph used above and print the top 3 most common words
    The code is given below. Here para is the paragraph from question 22:

    word2 = word_tokenize(para)
    lower_case = [t.lower() for t in word2]
    bag_of_words = Counter(lower_case)
    print(bag_of_words.most_common(3))

    [('the', 4), (',', 4), ('data', 3)]
  38. What is text preprocessing?
    Text preprocessing is the complete process of making the text ready for analysis by removing stop words, common punctuation, spelling mistakes, etc. Before any analysis you are supposed to preprocess the text.
  39. What are the commonly used methods of text preprocessing?
    Converting the complete text in either lower or upper case
    Removing stop words
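    Both steps can be sketched without any external package. The stop-word set below is a hypothetical, deliberately tiny list for illustration; real projects would normally use a fuller list such as the one shipped with NLTK.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "was"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

cleaned = preprocess("The food was great and the staff is polite!")
print(cleaned)
```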
  40. How to tokenize only words from a paragraph while ignoring the numbers and other special characters?

    from nltk.tokenize import word_tokenize
    x = "Here is your text. Your 1 text is here"
    only_alphabet = [w for w in word_tokenize(x.lower())
                     if w.isalpha()]
    print(only_alphabet)

    The w.isalpha() check keeps only the tokens made up entirely of letters, so numbers and punctuation are removed

    ['here', 'is', 'your', 'text', 'your', 'text', 'is', 'here']
  41. What are stop words?
    Stop words are commonly occurring words in a text which have high frequency but little importance. Words like the, are, is, also, he, she, etc. are some examples of English stop words.

  42. How to remove stop words from my text?
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    para = "Your text here. Here is your text"
    tokens = [w for w in word_tokenize(para.lower())
                      if w.isalpha()]
    stoppy = [t for t in tokens
                      if t not in stopwords.words('english')]

  43. What is Lemmatization?
    Lemmatization is a technique to reduce a word to its base form or dictionary form. An example will help you understand better

    The lemma of "better" is "good".
    The word "walk" is the base form of the word "walking"
  44. Give an example of Lemmatization in Python
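    x = "running"
    import nltk
    nltk.download('wordnet')
    from nltk.stem import WordNetLemmatizer
    # pos='v' tells the lemmatizer to treat the word as a verb
    print(WordNetLemmatizer().lemmatize(x, pos='v'))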

  45. How to lemmatize the texts in your paragraph?
    Use the module WordNetLemmatizer from nltk.stem

    from nltk.stem import WordNetLemmatizer
    lower_tokens = word_tokenize(para)
    lower_case = [t.lower() for t in lower_tokens]
    only_alphabet = [t for t in lower_case if t.isalpha()]
    without_stops = [x for x in only_alphabet if x not in stopwords.words("english")]
    lemma = WordNetLemmatizer()
    lemmatized = [lemma.lemmatize(t) for t in without_stops]

  46. What is gensim?
    Gensim is a very popular open-source NLP library. It is used to perform complex tasks like:-
    a. Building document or word vectors
    b. Topic identification
  47. What is a word vector?
    A word vector is a representation of a word which helps us observe relationships between words and documents. Based on how the words are used in text, word vectors help us get the meaning and context of the words. For example, word vectors will connect Bangalore to Karnataka and Patna to Bihar, where Bangalore and Patna are the capitals of the Indian states Karnataka and Bihar.

    These are multi-dimensional mathematical representations of words created using deep learning methods. They give us insight into relationships between words in a corpus.

  48. What is LDA?
    LDA is used for topic analysis and modeling. It is used to extract the main topics from a dataset. LDA stands for Latent Dirichlet Allocation. Topic Modelling is the task of using unsupervised learning to extract the main topics (represented as a set of words) that occur in a collection of documents.

  49. What is gensim corpus?
    A gensim corpus represents each document as a bag of words. Each document becomes a list of (token id, token count) pairs, where the ids come from a gensim dictionary. The gensim dictionary can be updated and reused

  50. What is stemming?
    Stemming is the process of reducing a word to its word stem by stripping affixes such as suffixes and prefixes. Unlike a lemma, the stem need not be a dictionary word. Stemming is important in natural language understanding (NLU) and natural language processing (NLP). Stemming is also used in queries and Internet search engines.
  51. Give an example of stemming in Python
    from nltk.stem.porter import PorterStemmer
    stem = PorterStemmer()
    x = "running"
    print(stem.stem(x))

  52. What is tf-idf?
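    tf-idf stands for term frequency-inverse document frequency. It weighs each word in a document by how often it occurs in that document and how rare it is across the corpus, so words other than stop words that are common everywhere get a low weight. The weight is document specific. The weight of term i in document j is tfij * log(N / dfi), where tfij is the frequency of the term in the document, N is the total number of documents and dfi is the number of documents containing the term.
    The weight will be low in two cases:-
    a. When the term frequency is low i.e. number of occurrence of a word is low
    b. When N is close to dfi, then the log will be close to zero

    So, using (b), if a word occurs in all the documents, then the log value will be zero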

    Suppose the word "abacus" is present 5 times in a document containing 100 words, and the corpus has 200 documents, with 20 documents mentioning the word "abacus". Assuming a base-10 logarithm, the tf-idf will be:-

    tf-idf = (5 / 100) * log(200 / 20) = 0.05 * 1 = 0.05
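    The abacus example can be checked with a few lines of Python. This sketch assumes the common tf * log(N / df) weighting with a base-10 logarithm; the function and argument names are made up for illustration, and libraries such as gensim apply their own variants and normalisation.

```python
import math

def tf_idf(term_count, doc_length, n_docs, docs_with_term):
    """Term frequency times log10 of the inverse document frequency."""
    tf = term_count / doc_length
    idf = math.log10(n_docs / docs_with_term)
    return tf * idf

# "abacus": 5 occurrences in 100 words, present in 20 of 200 documents
abacus_weight = tf_idf(term_count=5, doc_length=100, n_docs=200, docs_with_term=20)

# A word present in every document gets weight zero, whatever its frequency
everywhere = tf_idf(term_count=5, doc_length=100, n_docs=200, docs_with_term=200)

print(abacus_weight, everywhere)
```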


    53. How to create a tf-idf model using gensim?
    from gensim.models.tfidfmodel import TfidfModel
    # corpus is a list of bag-of-words documents and doc is one of them
    tfidf = TfidfModel(corpus)
    tfidf_weights = tfidf[doc]

    # Sort the weights from highest to lowest: sorted_tfidf_weights
    sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)
    # Print the top 5 weighted words
    for term_id, weight in sorted_tfidf_weights[:5]:
        print(dictionary.get(term_id), weight)

    54. What is Named Entity Recognition?
    It is the process of identifying important named entities in a document, e.g. organizations, dashboard names, works of art, etc.
    It is available through the ne_chunk_sents() function in the nltk package. It can be used as below:-

    chunk_Sent = nltk.ne_chunk_sents(Part_Of_Speech_sentence_token, binary = True)

    55. What is POS?
    Part of Speech (POS) tagging in Natural Language Processing labels each word according to its use in the sentence, i.e. as a noun, verb, adjective, etc.
    It is available as pos_tag() in the nltk package. pos_tag() takes a list of tokens, so you can feed each tokenized sentence in a loop to get the POS tags like below:-

    pos = [nltk.pos_tag(x) for x in tokenized_word_variable]

    56. What is the difference between lemmatization and stemming?
    Lemmatization gets to the dictionary base of the word whereas stemming just chops the tail of the word to get a base form. The example below will serve you better:

    See is the lemma of saw, but a crude stemmer may chop saw down to just 's'.
    See is the lemma of seeing, and stemming seeing will also get you see.

    54. What is spacy package?
    Spacy is a very efficient package present in Python which helps in easy pipeline creation and finding entities in tweets and chat messages.

    55. How to initiate the English module in spacy?
    import spacy
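    # In recent spaCy versions the 'en' shortcut is replaced by a model package name such as en_core_web_sm
    x = spacy.load('en_core_web_sm', disable=['tagger', 'parser'])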

    56. Why should one prefer spacy over nltk for named entity recognition?
    Spacy provides some extra categories, other than the ones provided by nltk.

    These categories are:-
    -Work of art

    So, you can try spacy for NER according to your need

    57. What are the different packages which use word vectors?
    Spacy and gensim are the two packages covered so far that use word vectors.

    58. What if your text is in several different languages? Which package can help you in Named Entity Recognition for most of the widely spoken languages?
    Polyglot is one of the packages which supports more than 100 languages and uses word vectors for Named Entity Recognition

    59.What is supervised learning?
    Supervised learning is a form of Machine Learning where your model is trained by looking at a given output for all the inputs. The model is trained on this input-output combination and then the learning of the model is tested on the test dataset. Linear Regression and Classification are two examples of supervised learning.

    60. How can you use Supervised Learning in NLP?
    Suppose you have chat data and, looking at the keywords, you have labelled the sentiment of the customer. Now you have a dataset which has the complete chat and the sentiment associated with each chat. You can use supervised learning to train a model on this dataset and then use it during a live chat to identify the ongoing sentiment of the customer.

    61. What is Naïve-Bayes model?
    Naive Bayes classifiers are linear classifiers that are known for being simple yet very efficient. The probabilistic model of naive Bayes classifiers is based on Bayes’ theorem, and the adjective naive comes from the assumption that the features in a dataset are mutually independent.

    62.What is the flow of creating a Naïve Bayes model?
    from sklearn import metrics
    from sklearn.naive_bayes import MultinomialNB
    # Instantiate a Multinomial Naive Bayes classifier: nb_classifier
    nb_classifier = MultinomialNB()
    # Fit the classifier to the training data
    # (count_train and count_test are the vectorized train and test texts)
    nb_classifier.fit(count_train, y_train)
    # Create the predicted tags: pred
    pred = nb_classifier.predict(count_test)
    # Calculate the accuracy score: score
    score = metrics.accuracy_score(y_test, pred)
    # Calculate the confusion matrix: cm
    cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])

    Let’s take some sample text and try to implement basic algorithms first

    63. What is POS?
    POS stands for Parts of Speech tagging and it is used to tag the words in your document according to Parts of Speech. So, noun, pronoun, verb, etc. will be tagged accordingly and then you can filter what you need from the dataset. If I am just looking for names of people mentioned in the comment box then I will look for mainly Nouns. This is a basic but very important algorithm to work with.

    64. Take an example to take a sentence and break it into tokens i.e. each word
    from nltk import word_tokenize
    text = "The Data Monk will help you learn and understand Data Science"
    tokens = word_tokenize(text)
    print(tokens)
    ['The', 'Data', 'Monk', 'will', 'help', 'you', 'learn', 'and', 'understand', 'Data', 'Science']

    65. Take the same sentence and get the POS tags

    from nltk import word_tokenize, pos_tag
    text = "The Data Monk will help you learn and understand Data Science"
    tokens = word_tokenize(text)
    print(pos_tag(tokens))

    [('The', 'DT'), ('Data', 'NNP'), ('Monk', 'NNP'), ('will', 'MD'), ('help', 'VB'), ('you', 'PRP'), ('learn', 'VB'), ('and', 'CC'), ('understand', 'VB'), ('Data', 'NNP'), ('Science', 'NN')]

    66. Take the following line and break it into tokens and tag POS using function
    import nltk
    data = "The Data Monk was started in Bangalore in 2018. Till now it has more than 30 books on Data Science on Amazon"

    #Tokenize the words and apply POS
    def token_POS(token):
        token = nltk.word_tokenize(token)
        token = nltk.pos_tag(token)
        return token
    token = token_POS(data)


    67. What is NER?
    NER stands for Named Entity Recognition and the work of this algorithm is to extract specific chunks of data from your text, for example all the names of people and places mentioned in a dataset. It is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, etc.

    68. What are some of the common tags in POS? You need to know the meaning of the tags to use them in your regular expression
    DT – Determiner
    FW – Foreign word
    JJ – Adjective
    JJR – Comparative Adjective
    NN – Singular Noun
    NNS – Plural Noun
    RB – Adverb
    RBS – Superlative Adverb
    VB – Verb

    You can get the complete list on the internet.

    69. Implement NER on the tokenized and POS tagged sentence used above.
    nltk.download('maxent_ne_chunker')
    nltk.download('words')
    ne_chunked_sents = nltk.ne_chunk(token)
    named_entities = []
    for tagged_tree in ne_chunked_sents:
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
            entity_type = tagged_tree.label() # get NE category
            named_entities.append((entity_name, entity_type))
    print(named_entities)

    [(‘Data Monk’, ‘ORGANIZATION’), (‘Bangalore’, ‘GPE’), (‘Data Science’, ‘PERSON’), (‘Amazon’, ‘ORGANIZATION’)]

    Code explanation: nltk.download('maxent_ne_chunker') will download maxent_ne_chunker, which is used to break the sentence into named entity chunks, and nltk.download('words') will download the dictionary

    We already have a variable token which contains POS tagged tokens. nltk.ne_chunk(token) will group the tokens into named entity chunks.

    The hasattr() function is used to check if an object has the given named attribute. It returns True if present, else False.

    The .leaves() function is used to get the leaves of the node and label() will get you the NER label

    70.What are n-grams?
    A combination of N consecutive words is called an N-gram. N-grams (N > 1) are generally more informative as features than single words (unigrams). Bigrams (N = 2) are often considered the most useful of these.
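    As a quick illustration, bigrams can be generated by pairing each word with its successor. This is a minimal sketch using whitespace splitting; the function name is made up for this example.

```python
def bigrams(text):
    """Return adjacent word pairs from a whitespace-tokenized text."""
    tokens = text.split()
    return list(zip(tokens, tokens[1:]))

pairs = bigrams("The Data Monk was started in Bangalore")
print(pairs)
```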

    71. Create a 3-gram of the sentence below
    “The Data Monk was started in Bangalore in 2018″

    def ngrams(text, n):
        token = text.split()
        final = []
        for i in range(len(token)-n+1):
            # Take n consecutive words starting at position i
            final.append(token[i:i+n])
        return final
    ngrams("The Data Monk was started in Bangalore in 2018", 3)


72. What is the right order for the components of a text classification model?

a. Text cleaning
b. Text annotation
c. Text to predictors
d. Gradient descent
e. Model tuning

73. What is CountVectorizer?
CountVectorizer is a class from sklearn.feature_extraction.text. It converts a collection of text documents to a matrix of token counts.


Let's take up a project and try to solve it using NLP. Here we will create a small dataset and apply NLP and Random Forest to train a model that identifies the sentiment of a review

The objective of the project is to predict the correct tag, i.e. whether people liked the food or not, using NLP and Random Forest.

74. How to create a dataset? What to write in it?
Open an excel file and save it as Reviews (in the csv format). Now make two columns in the sheet like the one given below

Review                         Liked
This restaurant is awesome     1
Food not good                  0
Ambience was wow               1
The menu is good               1
Base was not good              0
Very bad                       0
Wasted all the food            0
Delicious                      1
Great atmosphere               1
Not impressed with the food    0
Nice                           1
Bad taste                      0
Great presentation             1
Lovely flavor                  1
Polite staff                   1
Bad management                 0

Basically you can write the review of anything like Movies, food, restaurant, etc. Just make sure to keep the format like this. Thus your dataset is ready.

75. What all packages do I need to import for this project?
It’s always good to start with importing all the necessary packages which you might use in the project

import re
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

We will discuss each of these as we tackle the problem

76. How to import a csv file in Python?
Importing a csv file in Python requires importing the pandas library and using the read_csv function

review = pd.read_csv('C://Users//User//Downloads//Restaurant_Reviews.csv')

77. Let’s view the top and bottom 5 lines of the file to make sure we are good to go with the analysis
Use the commands given below
review.head() and review.tail()

78. Now we will clean the dataset, starting with removing numbers and punctuation. Write a regular expression for removing special characters and numbers

review is the name of the data set and Review is the name of the column

final = []
for i in range(0,16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])

79. What is sub() method?
The re.sub() function in the re module can be used to replace substrings.

The syntax for re.sub() is re.sub(pattern,repl,string).

That will replace the matches in string with repl.
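A small sketch of re.sub() on a single made-up review shows what the cleaning step does before it is applied to the whole column:

```python
import re

raw = "Food not good!! 10/10"

# Replace every character that is not a letter with a space
letters_only = re.sub("[^a-zA-Z]", " ", raw)
words = letters_only.lower().split()

print(repr(letters_only))
print(words)
```

Each removed character becomes one space, so the string keeps its length, and split() then discards the extra spaces.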

80. Convert all the text into lower case and split the words
final = []
for i in range(0,16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
    x = x.lower()
    x = x.split()

81. Now we want to stem the words. Do you remember the definition of stemming?
Stemming is the process of reducing a word to its word stem by stripping affixes such as suffixes and prefixes. Unlike a lemma, the stem need not be a dictionary word. Stemming is important in natural language understanding (NLU) and natural language processing (NLP). Stemming is also used in queries and Internet search engines.

final = []
for i in range(0,16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
    x = x.lower()
    x = x.split()
    port = PorterStemmer()
    x = [port.stem(words) for words in x
         if not words in set(stopwords.words('english'))]

82. What does the above snippet do?
port = PorterStemmer() assigns a stemmer object to the variable port
port.stem(words) for words in x –
It takes each word individually and stems it. The condition also removes the words which are stop words.

x = [port.stem(words) for words in x
         if not words in set(stopwords.words('english'))]

The above list comprehension will keep all the non-stop words and stem them

83. Create the final dataset with only stemmed words.
final = []
for i in range(0,16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
    x = x.lower()
    x = x.split()
    port = PorterStemmer()
    x = [port.stem(words) for words in x
         if not words in set(stopwords.words('english'))]
    x = ' '.join(x)
    # Collect the cleaned review so that final holds the whole dataset
    final.append(x)

Print final to see what the final dataset looks like after removing the stop words and stemming the text

84. How to use the CountVectorizer() function? Explain using an example
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['The Data Monk helps in providing resource to the users',
         'It is useful for people making a career in Data Science',
         'You can also take the 100 days Challenge of TDM']
counter = CountVectorizer()
X = counter.fit_transform(corpus)
print(counter.get_feature_names_out())
print(X.toarray())

get_feature_names_out() (get_feature_names() in older scikit-learn versions) will take all the words from the above dataset and will arrange them in alphabetical order
fit_transform() will transform each line of the dataset into counts of the words returned by get_feature_names_out()
toarray() will convert the sparse result into a dense array

Let's understand the output

['100', 'also', 'can', 'career', 'challenge', 'data', 'days', 'for', 'helps', 'in', 'is', 'it', 'making', 'monk', 'of', 'people', 'providing', 'resource', 'science', 'take', 'tdm', 'the', 'to', 'useful', 'users', 'you']

         [[0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 2 1 0 1 0]
         [0 0 0 1 0 1 0 1 0 1 1 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0]
         [1 1 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 1]]

The first output is the 26 unique words from the 3 lines of the document arranged in alphabetical order.
The next three rows contain the counts of the above words in each line. The 0s in the 1st, 2nd, 3rd and 4th places of the first row show that the words 100, also, can, and career are not present in the first line of the input.
Similarly, the 2 in the 22nd position shows that the word "the" is present twice in the first row of input
The first row of input is "The Data Monk helps in providing resource to the users"
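To see where these numbers come from, the same matrix can be rebuilt by hand. This is a simplified pure-Python sketch, not scikit-learn's implementation. It assumes lowercasing and tokens of two or more word characters, which is why the one-letter word "a" does not appear in the vocabulary.

```python
import re
from collections import Counter

corpus = [
    "The Data Monk helps in providing resource to the users",
    "It is useful for people making a career in Data Science",
    "You can also take the 100 days Challenge of TDM",
]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

# One row per document, one column per vocabulary word
matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    matrix.append([counts[word] for word in vocabulary])

print(len(vocabulary))
print(matrix[0][vocabulary.index("the")])
```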

85. Now let’s apply CountVectorizer on our dataset
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(final).toarray()

max_features = 1500 will make sure that at most 1500 words are put into the master array. In case you are planning to apply this on a huge dataset, then do increase the max_features value.
X will have the same array of occurrence across all the features as we have seen in the above example

86. How to separate the dependent variable?
As we know we want to see whether the review was positive or not. So the dependent variable here is the second column and we have put the value of the second column in a different variable i.e. y

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(final).toarray()
y = review.iloc[:,1].values

So, X holds the array of word counts for every review and y has the binary value where 1 denotes liked and 0 denotes did not like

87. Now we need to split the complete data set into train and test

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

You already know about X and y, the test_size will divide the train and test dataset in 75:25 ratio respectively
Now you will have to train the model on X_train and y_train.

88. Random forest is one of the best models to work with in supervised learning. By the way, what is Random forest?
Before we start with explaining a forest, we need to know what a tree is. Random forest is made of decision trees. To illustrate the concept, we'll use an everyday example: predicting tomorrow's maximum temperature for our city. To keep things straight, I'll use Seattle, Washington, but feel free to pick your own city.
In order to answer the single max temperature question, we actually need to work through an entire series of queries. We start by forming an initial reasonable range given our domain knowledge, which for this problem might be 30–70 degrees (Fahrenheit) if we do not know the time of year before we begin. Gradually, through a set of questions and answers we reduce this range until we are confident enough to make a single prediction.

Since temperature is highly dependent on time of year, a decent place to start would be: what is the season? In this case, the season is winter, and so we can limit the prediction range to 30–50 degrees because we have an idea of what the general max temperatures are in the Pacific Northwest during the winter. This first question was a great choice because it has already cut our range in half. If we had asked something non-relevant, such as the day of the week, then we could not have reduced the extent of predictions at all and we would be back where we started. Nonetheless, this single question isn’t quite enough to narrow down our estimate so we need to find out more information. A good follow-up question is: what is the historical average max temperature on this day? For Seattle on December 27, the answer is 46 degrees. This allows us to further restrict our range of consideration to 40–50 degrees. Again, this was a high-value question because it greatly reduced the scope of our estimate.

We keep asking similar questions, and once we arrange all of them in a flow we get a decision tree.
So, to arrive at an estimate, we used a series of questions, with each question narrowing our possible values until we were confident enough to make a single prediction. We repeat this decision process over and over again in our daily lives with only the questions and answers changing.
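
The question-by-question narrowing described above can be sketched as a tiny hand-written decision tree. This is only an illustration: the function name and the 6-degree window around the historical average are assumptions chosen so that the numbers match the Seattle example, and a real decision tree learns its questions from data.

```python
def predict_max_temp(season, historical_avg):
    """Narrow a temperature range by asking one question at a time."""
    # Initial range from domain knowledge (degrees Fahrenheit)
    low, high = 30, 70

    # Question 1: what is the season? Only winter is described in the text.
    if season == "winter":
        low, high = 30, 50

    # Question 2: what is the historical average max temperature on this day?
    # Assumption: keep only the part of the range within 6 degrees of it.
    low = max(low, historical_avg - 6)
    high = min(high, historical_avg + 6)

    # Final prediction: the middle of the remaining range
    return low, high, (low + high) / 2


print(predict_max_temp("winter", 46))
```

For winter with a historical average of 46 degrees, the range narrows to 40–50 and the single prediction is the midpoint of that range.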

89. What is Random Forest?
Every person comes to the problem with different background knowledge and may interpret the exact same answer to a question entirely differently. In technical terms, the predictions have variance because they will be widely spread around the right answer. Now, what if we take predictions from hundreds or thousands of individuals, some of which are high and some of which are low, and decided to average them together? Well, congratulations, we have created a random forest! The fundamental idea behind a random forest is to combine many decision trees into a single model.
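
The effect of averaging many noisy guesses can be simulated in a few lines. This is a simplified sketch, not a real random forest: each "person" is modelled as the true temperature plus random noise (the noise level of 8 degrees and the crowd size of 100 are assumptions), whereas a real forest averages decision trees trained on different samples of the data.

```python
import random
import statistics

random.seed(0)          # fixed seed so the run is repeatable
TRUE_TEMP = 46          # the answer everybody is trying to guess


def one_guess():
    # One person's guess: the right answer plus individual noise (assumed sd = 8)
    return random.gauss(TRUE_TEMP, 8)


def crowd_guess(n_people):
    # Average the guesses of n_people individuals
    return statistics.mean([one_guess() for _ in range(n_people)])


single_guesses = [one_guess() for _ in range(500)]
crowd_guesses = [crowd_guess(100) for _ in range(500)]

spread_single = statistics.pstdev(single_guesses)
spread_crowd = statistics.pstdev(crowd_guesses)
print(round(spread_single, 2), round(spread_crowd, 2))
```

The averaged guesses are spread much more tightly around the right answer than the individual guesses, which is the variance reduction the paragraph above describes.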

You can read a lot more about Decision Tree and Random Forest explained in layman’s terms.

90. Let’s create our Random forest model here
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators = 10,
                            criterion = 'entropy')
model.fit(X_train, y_train)

91. Define n_estimators
n_estimators is basically the number of trees you want to create in your forest. Try varying the number of trees in the forest.
In general, the more trees you use, the better the results. However, the improvement decreases as the number of trees increases, i.e. at a certain point the benefit in prediction performance from learning more trees will be lower than the cost in computation time for learning these additional trees.

Random forests are ensemble methods, and you average over many trees. Similarly, if you want to estimate the average of a real-valued random variable (e.g. the average height of a citizen in your country) you can take a sample. The standard error of the estimate shrinks only with the square root of the sample size (the variance shrinks as 1/n), and at a certain point the cost of collecting a larger sample will be higher than the benefit in accuracy obtained from such a larger sample.
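
The diminishing return from a bigger sample follows from the standard error of a sample mean, sigma / sqrt(n). A minimal sketch, where the standard deviation of 10 cm for height is an assumed value for illustration:

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean of n independent samples with std dev sigma."""
    return sigma / math.sqrt(n)


sigma = 10.0  # assumed standard deviation of height in cm
for n in (10, 100, 1000, 10000):
    print(n, round(standard_error(sigma, n), 3))
```

Each tenfold increase in sample size only shrinks the error by a factor of about 3.16, which is why adding more and more samples (or trees) eventually stops being worth the cost.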

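92. Define criterion. Why did you use entropy and not gini?
criterion is the function used to measure the quality of a split. Both gini and entropy measure how mixed the classes are in a node, and in practice they usually lead to very similar trees.
Gini is to minimize misclassification and is the default in scikit-learn

Entropy is often preferred for exploratory analysis
Entropy is a little slower to compute because it uses logarithms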
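
To make the two criteria concrete, here is a small sketch that computes Gini impurity and entropy for the labels in a node. The helper names are made up for this example; scikit-learn computes these internally.

```python
import math
from collections import Counter


def class_proportions(labels):
    # Share of each class among the labels in a node
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    # Entropy in bits: sum of -p * log2(p) over the classes
    return sum(-p * math.log2(p) for p in class_proportions(labels))


mixed_node = [1, 1, 0, 0]   # perfectly mixed node
pure_node = [1, 1, 1, 1]    # node with a single class
print(gini(mixed_node), entropy(mixed_node))
print(gini(pure_node), entropy(pure_node))
```

Both measures are 0 for a pure node and reach their maximum for a perfectly mixed one; the log2 call is the reason entropy is slightly more expensive to compute.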

93. What is fit()? It helps you create your model. Its two parameters are the training data, i.e. X_train and y_train. It takes the input features and the output labels of the reviews and builds a lot of decision trees that map the input to the output. These rules are then applied to your testing dataset to get the results.

94. Let’s predict the output for the testing dataset
y_pred = model.predict(X_test)

You have just created the model on X_train and y_train. Now you need to predict the output for X_test. We already have the true output for these rows in y_test, but we want our model to predict it so that we can compare the predictions with the actual answers.

95. Now let’s check the confusion matrix to see how many of our outputs were correct
from sklearn.metrics import confusion_matrix  

cm = confusion_matrix(y_test, y_pred)

96. Lastly, what is a confusion matrix and how to know the accuracy of the model?
A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.

Let’s take an example of a confusion matrix

So, our rows contain real values for a binary classifier and the columns have our predicted values. 50 and 100 show that the predicted and actual values were correctly identified. 10 and 5 show that the predicted values were not correct. Explore precision, recall, etc.

As far as accuracy is concerned, the formula is simple: accuracy = (50+100)/(50+10+5+100) ≈ 0.91,
i.e. total correct predictions divided by all the predictions.
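
The same calculation as a small sketch. Which of the two wrong counts (10 and 5) sits in which off-diagonal cell is an assumption here; it does not affect accuracy, because accuracy only needs the diagonal and the total.

```python
def accuracy(cm):
    """Accuracy = correct predictions (the diagonal) / all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Rows are actual values, columns are predicted values.
# Assumption: the placement of 10 and 5 off the diagonal is illustrative.
example_cm = [[50, 10],
              [5, 100]]

print(round(accuracy(example_cm), 3))
```

The function also works on the array returned by confusion_matrix, since it only indexes rows and columns.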

Our model had a very small dataset. The confusion matrix resulted in the following

Therefore accuracy = (1+3)/(1+0+0+3) = 100% accuracy
Yeahhhh..we are perfect

Complete code

import re
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

review = pd.read_csv('C://Users//User//Downloads//Restaurant_Reviews.csv')
final = []

# Clean the first 16 reviews: keep letters only, lowercase, drop stopwords, stem
for i in range(0,16):
    x = re.sub('[^a-zA-Z]',' ',review['Review'][i] )
    x = x.lower()
    x = x.split()
    port = PorterStemmer()
    x = [port.stem(words) for words in x
         if not words in set(stopwords.words('english'))]
    x = ' '.join(x)
    final.append(x)

# Bag-of-words features for the cleaned reviews
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(final).toarray()
# Labels for the same 16 rows that were cleaned above
y = review.iloc[0:16,1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 501,
                            criterion = 'entropy')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

Damn !! I got out in the nervous 90’s 😛

This is all you need to hop on to a real-life problem or a hackathon. Do comment if you find any flaw in the code.

Keep Learning 🙂

The Data Monk

Kaggle Titanic Solution

Kaggle is a Data Science community which hosts hackathons, both for practice and recruitment. You should try at least 5-10 hackathons before applying for a proper Data Science post.

Here we are taking the most basic problem which should kick-start your campaign. This hackathon will make sure that you understand the problem and the approach.

To download the dataset and submit your solution, visit the Titanic competition page on Kaggle

P.S. –
1. We have used an intermediate level of feature engineering, you might have to create more features to boost your rank, but it’s a good way to start the journey
2. You need to have Python installed in your system and very basic knowledge of Python
3. We have deliberately put the screenshots and not the actual code because we want you to write the code yourself

Problem Description – The ship Titanic met with an accident and a lot of passengers died in it. The dataset describes passenger information such as Age, Sex, Ticket Fare, etc.

Aim – We have to make a model to predict whether a person survived this accident. So, your dependent variable is the column named as ‘Surv

Let’s start with importing the data

-Check the dataset by the following commands


-Check the number of rows and columns in each of the datasets by the following command
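
A minimal sketch of loading and inspecting the data with pandas. With the real Kaggle files you would call pd.read_csv on the downloaded train and test CSV files; the two tiny in-memory files below are toy rows made up for illustration so that the example runs on its own.

```python
import io
import pandas as pd

# Toy stand-ins for the downloaded files (illustration only)
train_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
    "3,1,3,female,26\n"
)
test_csv = io.StringIO(
    "PassengerId,Pclass,Sex,Age\n"
    "892,3,male,34.5\n"
    "893,3,female,47\n"
)

train = pd.read_csv(train_csv)
test = pd.read_csv(test_csv)

print(train.head())              # first few rows of the dataset
print(train.shape, test.shape)   # (rows, columns) of each dataset
```

Note that the test dataset has one column fewer, because it does not contain the column you have to predict.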


-The first thing which you need to do before starting any hackathon or project is to import the following important libraries

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

Following is a brief description of the columns in the dataset

-You need to know which columns have missing values. The most basic way is to check the description of the dataset with the following command

You can see we have 891 rows and there are missing values in Age, Cabin, and Embarked.
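
One way to get this kind of summary is sketched below, using info() and isnull().sum() on a toy DataFrame. The rows are made up for illustration, and the original screenshot may have used a different command.

```python
import pandas as pd

# Toy rows (made up) with gaps in the same columns as the real data
train = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

train.info()                            # non-null count and dtype per column
missing_counts = train.isnull().sum()   # number of missing values per column
print(missing_counts)
```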

– It’s time to identify the important variables

Pclass is the class of the passenger, let’s see how many passengers were there in each class

Most of the passengers were in Class 3, followed by Class 1 and Class 2.

-We will be creating a variable to store the survived and not survived passengers to check how many passengers died from each Class

-Now let’s check how many males and females died in this accident

75% of the females and 25% of the males survived the accident

-Let’s check if the class of the passenger was also given a priority. Class 1 is the rich class, followed by 2 and 3

40% of Class 1 passengers and 25% of Class 3 passengers survived the accident

-Let’s check the Embarked column i.e. the point of boarding. This column has 2 missing values

Most of the passengers boarded from point S. So we can directly fill the 2 missing values with S

More than 66% of the passengers who boarded from the point S died in the incident.

-Parch is the number of parents or children traveling along with a passenger

More than 65% of the passengers travelling alone died in the accident

-SibSp is the number of siblings or spouse traveling along with a passenger

More than 65% of the passengers travelling without a sibling or spouse died in the accident.
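
Survival percentages like the ones quoted above can be computed with a groupby. A minimal sketch on made-up rows, so the numbers it prints are not the real Titanic figures; the helper name survival_rate is an assumption for this example.

```python
import pandas as pd

# Toy rows (made up for illustration)
train = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 3, 1, 3, 2],
    "Survived": [1, 1, 0, 1, 0, 0],
})


def survival_rate(data, column):
    # Survived is 0 or 1, so its mean per group is the share of survivors
    return data.groupby(column)["Survived"].mean() * 100


by_sex = survival_rate(train, "Sex")
by_class = survival_rate(train, "Pclass")
print(by_sex)
print(by_class)
```

The same helper works for Embarked, Parch and SibSp by changing the column name.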

-Understanding the correlation between two variables gives you an understanding of whether the features are directly or indirectly related to each other.

-We will be putting the train and test datasets together in a list so that each change can be applied to both of them in one loop

final_data = [train,test]

Changing Data Types

1. Change male and female to binary value

2. Age has some missing values, right now we are replacing the missing values with the mean. But, you can very well replace it with random values in the range of mean+standard deviation and mean-standard deviation

3. Since there are only 2 missing values in Embarked, we are replacing them with the most common value i.e. S

Let’s now fix Embarked and convert the categorical variables into numeric variables

4. We will fix the missing values present in the Fare column with the median value

5. Let’s create one more variable i.e. Family Size which will have the following formula:-

Family Size = Parch + SibSp + 1

This will include the family size of a passenger traveling in the ship
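
The data type changes listed above can be sketched in pandas as follows. The rows are made up for illustration, and the numeric codes chosen for Sex and Embarked are assumptions; any consistent coding would do.

```python
import pandas as pd

# Toy rows (made up) with one gap each in Age, Embarked and Fare
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 30.0],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, None, 8.05],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 2],
})

# 1. Male and female to a binary value (the 0/1 coding is an assumption)
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# 2. Missing Age replaced with the mean
df["Age"] = df["Age"].fillna(df["Age"].mean())

# 3. Missing Embarked replaced with the most common value S, then made numeric
df["Embarked"] = df["Embarked"].fillna("S")
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# 4. Missing Fare replaced with the median
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

# 5. Family Size = Parch + SibSp + 1
df["FamilySize"] = df["Parch"] + df["SibSp"] + 1

print(df)
```

To apply the same steps to both train and test, you can run them inside a loop over the list that holds the two datasets.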

Do keep checking the head of train and test to make sure that dataset is getting modified

We will be removing Ticket and Cabin: Ticket because the ticket number is a unique ID, so it won’t have any relation with whether the person survived, and Cabin because of its heavy missing values
Though you are free to apply your mind to getting something out of the Ticket Number

We are also not using the Name column, though a lot of Kaggle solutions extract the title from each name. You should try it once you complete the basic submission

-Check the head of train1


Drop PassengerId from both train1 and test1

-Put the survived column in the variable y_train1
-Keep every column other than Survived in X_train1
-Keep all the test columns in a new variable X_test1

Why are we creating these new variables?
The idea is to keep the dependent variable i.e. the one which you want to predict in y_train1.
Put all the independent variables in X_train1 which will be used to create a model

Once the model is ready, you have to predict the value for the passengerId given in the test dataset, so we have kept it in a separate variable i.e. X_test1

Just to reiterate, before we move forward with the models
X_train1 – All the independent columns which you need in the model. Drop the unnecessary columns
y_train1 – The dependent variable
X_test1 – The dataset on which you want to make the prediction
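
A minimal sketch of creating these three variables, assuming train1 and test1 are already cleaned DataFrames. The toy rows below are made up and contain only two feature columns.

```python
import pandas as pd

# Toy stand-ins for the cleaned datasets (made up for illustration)
train1 = pd.DataFrame({"Survived": [0, 1, 1], "Pclass": [3, 1, 3], "Sex": [0, 1, 1]})
test1 = pd.DataFrame({"Pclass": [2, 3], "Sex": [0, 1]})

y_train1 = train1["Survived"]               # the dependent variable
X_train1 = train1.drop("Survived", axis=1)  # every column other than Survived
X_test1 = test1.copy()                      # the rows we want predictions for

print(list(X_train1.columns), list(X_test1.columns))
```

X_train1 and X_test1 must have the same columns in the same order, otherwise the model cannot be applied to the test data.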

Creating models
This will include a set of steps

Step 1 – Import the package
Step 2 – Put the algorithm in a variable
Step 3 – Fit the dependent variable(y_train1) and the independent variable(X_train1)
Step 4 – Do the prediction using the predict function on the X_test1
Step 5 – Get the accuracy of the model by using the score function
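
The five steps can be sketched with Logistic Regression as follows. The tiny feature lists are made up for illustration (two columns standing in for Sex and Pclass). Since the Kaggle test data has no Survived column, the score here is computed on the training data, so it is a training accuracy.

```python
# Step 1 - Import the package
from sklearn.linear_model import LogisticRegression

# Toy data (made up): each row is [Sex, Pclass]
X_train1 = [[0, 3], [1, 1], [1, 3], [0, 1], [1, 2], [0, 2]]
y_train1 = [0, 1, 1, 0, 1, 0]
X_test1 = [[1, 1], [0, 3]]

# Step 2 - Put the algorithm in a variable
model = LogisticRegression()

# Step 3 - Fit the independent variables and the dependent variable
model.fit(X_train1, y_train1)

# Step 4 - Predict on X_test1
pred = model.predict(X_test1)

# Step 5 - Accuracy on the training data via the score function
train_accuracy = model.score(X_train1, y_train1)

print(pred, train_accuracy)
```

The other algorithms follow the same pattern; only the import and the class in Step 2 change.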

1. Logistic Regression

2. Support Vector Machine

3. K-Nearest Neighbor – We will try the value of K as 2, 3, and 4

K-Nearest Neighbor with neighbor = 2

K-Nearest Neighbor with neighbor = 4

K-Nearest Neighbor with neighbor = 3

4. Decision Tree – An unrestricted Decision Tree keeps splitting until it has memorized the training dataset, so it tends to overfit, and a Random Forest can also score very high on the data it was trained on. That’s why the accuracy of DT on the training data is 100%

5. Random Forest – n_estimators is the number of trees you want in the Forest

6. Perceptron

We tried these algorithms
1. Logistic Regression
2. SVM
3. KNN
4. Decision Tree
5. Random Forest
6. Perceptron

Make your first submission using Random Forest

You need to get the pred_RF column from the model and combine it with PassengerId from the test dataset
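
A minimal sketch of building the submission table. The PassengerId values and predictions below are made up; in practice they come from the test dataset and from the Random Forest predictions. The Titanic competition expects exactly two columns, PassengerId and Survived.

```python
import pandas as pd

# Hypothetical values for illustration
test_ids = [892, 893, 894]   # PassengerId from the test dataset
pred_RF = [0, 1, 0]          # predictions from the Random Forest model

submission = pd.DataFrame({"PassengerId": test_ids, "Survived": pred_RF})

# Without a file name, to_csv returns the CSV text instead of writing a file
csv_text = submission.to_csv(index=False)
print(csv_text)
```

To write an actual file, pass a file name of your choice to to_csv and keep index=False so that no extra index column is added.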

Submit it on Kaggle.

You can also try submitting results from other algorithms. Following is the example of Logistic Regression

1. This article is just to make sure that you understand how to start exploring Data Science Hackathons
2. Feature Engineering is the key
3. Try more algorithms to climb the Leader Board

Keep Learning 🙂

The Data Monk

Python – Part 1/10

5-6 years back Java was said to be everlasting. Everyone wanted a Java developer in their team. Looking at the current scenario, we can safely assume that Python is and will be one of the most used programming languages across multiple domains ranging from software development to web development and Data Science.

Talking particularly about Data Science, Python is blessed by a humongous community of Data Scientists who contribute a lot to the development and betterment of the language. Apart from the community, the libraries and packages which are regularly developed are making it easier for people to explore Data Science.

Python is not the only language which can be used for Data Science. A few other languages are:-
1. R
2. SAS
4. C

We will try to cover everything in Python so that you get fluent in at least one language and in the current era if you have to choose one language to better your career, then do give a shot to Python.


At the time of writing this blog, two versions of Python are popular
Python 2.7
Python 3.*

Start with downloading Anaconda
Once you have the Anaconda installer in your system, execute it. It will take ~10 mins to get the installation done.

From the start itself, try to use Jupyter notebook for your Python programming. 

How to launch Jupyter Notebook?
Once you have installed Anaconda, you will get an Anaconda Navigator in your start menu or on your desktop.
Double click to open it.

This is what Anaconda Navigator will look like. Click on the Launch button below the Jupyter Notebook icon

Running your first Python program

1. Write the simple code print("The Data Monk") and press Shift+Enter to run the line of code. The output will be shown just below the code.

2. print("Hello"+" World") – Plus(+) operator to add two strings

3. You can directly use a variable

4. There are three numeric types supported in Python:-
a. int
b. float
c. complex

Use the type() command to know the data type
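
A short sketch of type() on the three numeric types. The variable names and values are made up for this example.

```python
whole_number = 7
decimal_number = 7.5
complex_number = 3 + 4j

values = [whole_number, decimal_number, complex_number]

# type(value).__name__ gives the name of the data type as a string
type_names = [type(value).__name__ for value in values]

for value, name in zip(values, type_names):
    print(value, name)
```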

5. Following are a few string-related functions, pay attention to the typecasting in the first print statement
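
The original screenshot is not reproduced here, so the following is only a sketch of a few common string functions, chosen as assumptions. The first print statement shows the typecasting: len() returns an int, and it has to be converted with str() before it can be added to a string.

```python
name = "The Data Monk"

# str() is needed here, because a string and an int cannot be added directly
print("Length of the string is " + str(len(name)))

print(name.upper())                    # all upper case
print(name.lower())                    # all lower case
print(name.replace("Monk", "Monks"))   # replace a part of the string
print(name.split(" "))                 # split on spaces into a list of words
```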

Guesstimate 3 – What are the number of smartphones sold in India per year?

Population of India : 1200 mn

Population above poverty line: 70%, i.e. 840 mn

Population below 14 years: 30%

Hence, proxy figure (above poverty line and aged 14 or more): 588 mn

Rural Population (70%) : 410 mn

Rural Households (assuming about 5 people per household): 82 Mn

Rural Mobile Penetration: Avg 2 per household- 164 Mn

In rural areas assume that a new mobile is bought once in 3 years. Hence, new mobiles bought In current year- 55 Mn

Urban (30%) :176 Mn

Assume Avg No of Mobiles per person : 1.5

Urban Mobile Penetration: 265 Mn

Assuming that a new mobile is bought once in 1.5 years. Hence new mobiles in current year- 176 Mn

Total New Mobiles: 231 mn

Assuming 3 out of 10 new mobiles are smartphones.

No. of smartphones sold=70 Mn
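
The whole chain of assumptions can be checked with a few lines of arithmetic. All figures are in millions. The household size of 5 is an assumption implied by going from 410 Mn rural people to 82 Mn households. The steps above round at each stage, so the unrounded result here lands slightly below 70.

```python
population = 1200                       # population of India, in millions
above_poverty = population * 0.70       # 840
target = above_poverty * (1 - 0.30)     # 588, after excluding those below 14

# Rural side
rural = target * 0.70                   # about 412 (rounded to 410 above)
rural_households = rural / 5            # assumed 5 people per household
rural_mobiles = rural_households * 2    # 2 mobiles per household
rural_new = rural_mobiles / 3           # replaced once in 3 years

# Urban side
urban = target * 0.30                   # about 176
urban_mobiles = urban * 1.5             # 1.5 mobiles per person
urban_new = urban_mobiles / 1.5         # replaced once in 1.5 years

total_new = rural_new + urban_new       # about 231
smartphones = total_new * 0.30          # 3 out of 10 are smartphones

print(round(total_new), round(smartphones, 1))
```

Changing any single assumption, for example the share of smartphones, immediately shows how sensitive the final estimate is to it.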