NLP Interview Questions

1. What problems can NLP solve?
NLP can solve many problems, such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

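2. What are the common patterns used in regular expressions?
\w+ -> word (one or more word characters)
\d -> digit
\s -> whitespace character
. -> wildcard (any character except a newline)
+ or * -> quantifiers (one or more / zero or more); both are greedy by default
\S -> anti-space, i.e. it matches anything which is not whitespace
[A-Z] -> matches any single character in the range capital A to capital Z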
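The patterns above can be tried out with a short sketch. The sample sentence is made up for illustration, and the last two lines contrast a greedy quantifier with its lazy form (`*?`), which is not in the list above.

```python
import re

# Hypothetical sample text, used only to exercise the patterns.
text = "Order 66 shipped to Bangalore on Day 7"

words = re.findall(r"\w+", text)       # runs of word characters
digits = re.findall(r"\d", text)       # single digits
spaces = re.findall(r"\s", text)       # single whitespace characters
capitals = re.findall(r"[A-Z]", text)  # single capital letters

# Greedy: .* grabs as much as possible, so the match ends at the last 'r'.
greedy = re.search(r"O.*r", text).group()
# Lazy: .*? grabs as little as possible, so the match ends at the first 'r'.
lazy = re.search(r"O.*?r", text).group()

print(words)
print(digits, capitals, len(spaces))
print(repr(greedy), repr(lazy))
```

Greedy matching is the default, which is why `O.*r` runs all the way to the `r` in "Bangalore".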

3. What is the difference between match and search function?
re.match() only matches at the beginning of the string, whereas re.search() scans the whole string and returns the first place where the pattern matches. The example below shows the difference:

import re
print(re.match('kam', 'kamal'))
print(re.match('kam', 'nitin kamal'))
print(re.search('kam', 'kamal'))
print(re.search('kam', 'nitin kamal'))

Output:
<re.Match object; span=(0, 3), match='kam'>
None
<re.Match object; span=(0, 3), match='kam'>
<re.Match object; span=(6, 9), match='kam'>

4. How do you write a regular expression to match a specific set of characters in a string?
special_char = r"[?/}{';]"
The regular expression above matches any single character listed between the square brackets.

5. Write a regular expression to split a paragraph every time it finds an exclamation mark

import re
exclamation = r"[!]"
strr = "Data Science comprises of innumerable topics! The aim of this 100 Days series is to get you started assuming ! that you have no prior! knowledge of any of these topics. "
excla = re.split(exclamation, strr)
print(excla)

['Data Science comprises of innumerable topics', ' The aim of this 100 Days series is to get you started assuming ', ' that you have no prior', ' knowledge of any of these topics. ']

6. What are the important NLTK tokenizers?
word_tokenize – splits a string into word and punctuation tokens
sent_tokenize – splits a document into sentences
TweetTokenizer – a tokenizer class built for tweets, which comes in handy if you are doing sentiment analysis on a particular hashtag or set of tweets
regexp_tokenize – tokenizes a string or document based on a regular expression pattern

7. What is the use of the .start() and .end() functions?

.start() and .end() return the starting and ending index of a match. Below is an example, where para is a string in which "Piyush" begins at index 24:

x = re.search("Piyush", para)
print(x.start(), x.end())

24 30

8. Once again, go through the difference between the search() and match() functions
search() finds the pattern anywhere in the string, but match() only looks at the beginning of the string. If the pattern does not match starting at the very first character, match() returns None, even when the pattern occurs later in the string. Be careful when choosing between the two.

9. What is bag-of-words?
Bag-of-words represents a text by the frequency of each token in it, ignoring word order. The most frequent tokens often hint at the topic of the text. The example below illustrates the concept:

para = "The game of cricket is complicated. Cricket is more complicated than Football"

The – 1
game – 1
of – 1
cricket – 1
is – 2
complicated – 2
Cricket – 1
more – 1
than – 1
Football – 1

As you can see, "cricket" and "Cricket" are counted as two separate tokens, because a bag-of-words count is case sensitive unless you lower-case the text first.
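The counts above can be reproduced with the standard library alone. This is a minimal sketch that assumes a token is any run of word characters, so punctuation is dropped; a real tokenizer such as NLTK's may split the text differently.

```python
import re
from collections import Counter

para = ("The game of cricket is complicated. "
        "Cricket is more complicated than Football")

# Assumption: a token is any run of word characters (punctuation is dropped).
tokens = re.findall(r"\w+", para)

# Case-sensitive count: "cricket" and "Cricket" are different keys.
case_sensitive = Counter(tokens)
# Lower-casing first merges them into one key.
case_insensitive = Counter(t.lower() for t in tokens)

print(case_sensitive["cricket"], case_sensitive["Cricket"])
print(case_insensitive["cricket"])
```

Lower-casing before counting is the usual way to make the two spellings count as one word.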

10. Use the same paragraph used above and print the top 3 most common words
The code is given below. The tokens are lower-cased first so that "cricket" and "Cricket" are counted together:

from collections import Counter
from nltk.tokenize import word_tokenize

word2 = word_tokenize(para)
lower_case = [t.lower() for t in word2]
bag_of_words = Counter(lower_case)
print(bag_of_words.most_common(3))

[('cricket', 2), ('is', 2), ('complicated', 2)]

11. Give an example of Lemmatization in Python
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
x = "running"
lem = WordNetLemmatizer()
lem.lemmatize(x, "v")
Output
'run'

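12. What is tf-idf?
tf-idf stands for term frequency and inverse document frequency. It weights each word by how often it occurs in a document and down-weights words that occur in many documents of the corpus, so common words that are not on the stop-word list still get a low score. The resulting weights are document specific.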

13. What is the difference between lemmatization and stemming?
Lemmatization uses vocabulary and morphological analysis to return the dictionary form (lemma) of a word, whereas stemming just chops off the end of the word with heuristic rules. The example below shows the difference:

See is the lemma of saw (when saw is used as a verb), but a crude stemmer may cut saw down to just 's'.
See is also the lemma of seeing, and stemming seeing gives you see as well.

14. What is the flow of creating a Naïve Bayes model?

from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()
# Fit the classifier to the training data
nb_classifier.fit(count_train,y_train)
# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)
# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test,pred)
print(score)
# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)

15. Take the following line, break it into tokens, and tag the POS using a function
data = "The Data Monk was started in Bangalore in 2018. Till now it has more than 30 books on Data Science on Amazon"

import nltk

# Tokenize the words and apply POS tags
def token_POS(text):
    tokens = nltk.word_tokenize(text)
    return nltk.pos_tag(tokens)

token = token_POS(data)
print(token)

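16. Create a 3-gram of the sentence below
"The Data Monk was started in Bangalore in 2018"

def ngrams(text, n):
    token = text.split()
    final = []
    for i in range(len(token)-n+1):
        final.append(token[i:i+n])
    return final

print(ngrams("The Data Monk was started in Bangalore in 2018", 3))

Output

[['The', 'Data', 'Monk'], ['Data', 'Monk', 'was'], ['Monk', 'was', 'started'], ['was', 'started', 'in'], ['started', 'in', 'Bangalore'], ['in', 'Bangalore', 'in'], ['Bangalore', 'in', '2018']]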

17. What is the right order of components for a text classification model?

Text cleaning
Text annotation
Text to predictors
Gradient descent
Model tuning

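18. Write a regular expression for removing special characters and numbers

review is the name of the data set and Review is the name of the column. The loop below processes the first 16 rows.

import re

final = []

for i in range(0, 16):
    # Replace everything except letters with a space
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
    final.append(x)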

19. Convert all the text into lower case and split the words
final = []

for i in range(0, 16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
    x = x.lower()
    x = x.split()
    final.append(x)

20. What is CountVectorizer?
CountVectorizer is a class from sklearn.feature_extraction.text. It converts a collection of text documents to a matrix of token counts, with one row per document and one column per token in the vocabulary.
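To see what "a matrix of token counts" means, here is a rough pure-Python sketch of the idea. It is not scikit-learn's implementation: the two documents are made up, and the tokenization rule (lower-cased tokens of two or more word characters) is an assumption chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-corpus, used only for illustration.
docs = ["The Data Monk", "Data Science with the Data Monk"]

def tokenize(doc):
    # Assumption: lower-cased tokens of two or more word characters.
    return re.findall(r"\w\w+", doc.lower())

# Vocabulary: sorted unique tokens across all documents (one column each).
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# Document-term matrix: one row per document, one count per vocabulary term.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row can then be fed to a model such as the Naive Bayes classifier from question 14.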

If you feel comfortable with NLP, then you can go through 80 more interview questions which are available on Amazon

100 Questions to Understand NLP using Python

Keep coding 🙂

XtraMous


















TF-IDF and Word correlation

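What is tf-idf?
tf-idf stands for term frequency and inverse document frequency. It weights each word by how often it occurs in a document and down-weights words that occur in many documents of the corpus, so common words that are not on the stop-word list still get a low score. The resulting weights are document specific.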

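The weight of a term i in a document is tf × log(N/dfi), where tf is the term frequency, N is the number of documents in the corpus and dfi is the number of documents that contain the term. The weight will be low in two cases:-
a. When the term frequency is low, i.e. the word occurs only a few times in the document
b. When dfi is close to N, because then the log is close to zero (and exactly zero when N is equal to dfi)

So, using (b), if a word occurs in all the documents, its weight will be zero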

Suppose the word "abacus" is present 5 times in a document containing 100 words, and the corpus has 200 documents, 20 of which mention the word "abacus". The tf-idf of "abacus" in this document will be :-

(5/100)*log(200/20)

With a base-10 logarithm this is 0.05 × 1 = 0.05.
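The same arithmetic can be checked in a few lines of Python. This is a sketch only: the function name and parameter names are made up for illustration, and the base-10 logarithm is an assumption (libraries often use the natural log and extra smoothing).

```python
import math

def tf_idf(term_count, doc_length, n_docs, docs_with_term):
    # Term frequency: share of the document taken up by the term.
    tf = term_count / doc_length
    # Inverse document frequency; the base-10 log is an assumption.
    idf = math.log10(n_docs / docs_with_term)
    return tf * idf

# "abacus": 5 occurrences in 100 words, found in 20 of 200 documents.
abacus = tf_idf(term_count=5, doc_length=100, n_docs=200, docs_with_term=20)
# A word found in every document gets idf = log(1) = 0.
everywhere = tf_idf(term_count=5, doc_length=100, n_docs=200, docs_with_term=200)

print(round(abacus, 4))
print(everywhere)
```

The second call shows why a word that appears in every document carries no weight, however often it occurs.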

Take a sentence and break it into tokens, i.e. individual words
from nltk import word_tokenize

text = "The Data Monk will help you learn and understand Data Science"

tokens = word_tokenize(text)
print(tokens)

['The', 'Data', 'Monk', 'will', 'help', 'you', 'learn', 'and', 'understand', 'Data', 'Science']

Take the same sentence and get the POS tags
from nltk import word_tokenize, pos_tag

text = "The Data Monk will help you learn and understand Data Science"
tokens = word_tokenize(text)
print(pos_tag(tokens))

[('The', 'DT'), ('Data', 'NNP'), ('Monk', 'NNP'), ('will', 'MD'), ('help', 'VB'), ('you', 'PRP'), ('learn', 'VB'), ('and', 'CC'), ('understand', 'VB'), ('Data', 'NNP'), ('Science', 'NN')]

Take the following line, break it into tokens, and tag the POS using a function
data = "The Data Monk was started in Bangalore in 2018. Till now it has more than 30 books on Data Science on Amazon"

import nltk

# Tokenize the words and apply POS tags
def token_POS(text):
    tokens = nltk.word_tokenize(text)
    return nltk.pos_tag(tokens)

token = token_POS(data)
print(token)

Natural Language Processing in Python? What and Hows about NLP

  • NLP stands for Natural Language Processing, a subdomain of Data Science that helps you extract insights from structured or unstructured text.

    Have you ever wondered how a chatbot interacts with you so efficiently that half of the time you don’t even know whether it is a bot or a human? That is the power of NLP.

    Every organization asks for feedback and reviews in its survey forms or on its website. Do you think they have the time to go through all the text to extract the sentiment of each customer?
    The short answer is ‘NO’. Most of the time they will hire someone to work on this text and get them the required information. This is where NLP is useful for us.

    Have you ever wondered how some emails are moved directly to your spam folder, and how most of the time these emails really are spam? NLP makes this happen for you.

    You search for something on Google and you get a lot of relevant suggestions. This is NLP running in the back-end.

    There are many such examples where NLP is directly making our lives easier.

Why Python?
Python has really strong library support, especially for NLP, and its community support is also strong. So, if you don’t have a constraint on language selection, do choose Python for any NLP project.

What are the important algorithms of NLP?
The following are important algorithms and processes used in Natural Language Processing. We will cover most of these in the upcoming days:
1. TF-IDF
2. N-Gram
3. Word Correlation
4. Stemming
5. Lemmatization
6. Sentiment Analysis
7. Parts of Speech Tagging
8. Named Entity Recognition
9. Semantic Text Similarity
10. Language Identification
11. Text Summarisation

What are the important Python libraries which help in NLP?
The most important library is no doubt NLTK, followed by SpaCy, TextBlob, CoreNLP, Gensim, and Polyglot.

Where to use which library?
Just knowing the name of a library will not help you. Here is a rough idea of which type of work is more convenient in which library:-

1. NLTK – This is your go-to library for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
2. TextBlob – Sentiment analysis, POS tagging, or noun phrase extraction
3. SpaCy – It was designed for production usage, which is why it is more accessible than NLTK
4. polyglot – It offers a broad range of analysis and impressive language coverage. It can also be run from the command line through its pipeline mechanism
5. Gensim – It can handle large text collections with the help of efficient data streaming and incremental algorithms, which is more than we can say about other packages that only target batch and in-memory processing
6. CoreNLP – The library is really fast and works well in product development environments

What does a standard NLP project look like?
Though the steps below are not mandatory, we mostly follow this approach:
Step 1 – Get the raw data
Step 2 – Remove the special characters
Step 3 – Remove the stop words
Step 4 – Perform a TF-IDF, which gets you the most important words of the document. TF stands for Term Frequency, which simply calculates the frequency of each word. IDF stands for Inverse Document Frequency, which down-weights the words that occur in most documents. So, the words left with a high score are the important words of the document. Easy peasy
Step 5 – Depending on the aim of the project, we look for bi-grams or n-grams, which give you words that occur together, like Revenue Dashboard, Online Activity, etc.
A bi-gram is a pair of adjacent words; similarly, a 3-gram will get you phrases like TheDataMonk Revenue Dashboard.
Step 6 – Depending on the requirement, we move forward with clustering the data, looking at sentiment or users, etc.

Steps 1 to 5 will give you an overview of the important terms and associated terms. This is the most basic Exploratory Data Analysis with which we start 🙂
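Steps 2, 3 and 5 can be sketched with the standard library alone. The raw text and the tiny stop-word list below are made up for illustration; a real project would use a proper stop-word list (for example NLTK's) and would add the TF-IDF step.

```python
import re
from collections import Counter

# Hypothetical raw text and a tiny hand-written stop-word list (assumptions).
raw = "The Revenue Dashboard is live! Check the revenue dashboard for Online Activity."
stop_words = {"the", "is", "for"}

# Step 2: replace special characters with spaces, then lower-case.
cleaned = re.sub(r"[^a-zA-Z\s]", " ", raw).lower()

# Step 3: drop the stop words.
tokens = [t for t in cleaned.split() if t not in stop_words]

# Step 5: count bi-grams (pairs of adjacent tokens).
bigrams = Counter(zip(tokens, tokens[1:]))

print(tokens)
print(bigrams.most_common(1))
```

The most frequent bi-gram here is ("revenue", "dashboard"), which is the kind of phrase Step 5 is meant to surface.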

We would like to cover Regular Expression in brief here

What is Regular Expression?
A regular expression (sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. “find and replace”-like operations.

Regular expressions are a generalized way to match patterns with sequences of characters.

Many of you must have come across SQL questions where you need to get the data of customers whose name starts with A and in the WHERE condition you write something like,

WHERE Customer_Name LIKE 'A%'

Well !! This is a basic form of pattern matching, where you ask the query to return only the rows that fit a pattern. The way we write Regular Expressions in Python is a bit different. Check out the table below.

To use Regular Expressions, you first need to import the re package ("import re"). The following 4 functions are quite useful when working with a regex:
1. findall – It returns a list of all the matches
2. search – It returns a match object for the first match
3. split – Splits the string wherever there is a match
4. sub – It replaces one or many matches of the regex

Following are some important metacharacters and special sequences

RegEx    Description
\w+      Get all the words
\d       Digits
\s       Spaces
\S       Anything but white spaces
^        Starts with
$        Ends with
*        Zero or more occurrences
+        One or more occurrences
|        Either Or
[]       A set of characters

Let’s get down to some questions to understand the basics of how to write a regex

1. re.split(r'\s+', 'My name is Data Monk')
['My', 'name', 'is', 'Data', 'Monk'] – The function above used the regex \s+ to split the string on every run of white space, which gives back the individual words

2. end_Sentence = r'[.?!]'
print(re.split(end_Sentence, String))
The above lines of code will split the document wherever a sentence ends with a full stop, question mark, or exclamation mark

3. [a-zA-Z0-9.-]
This will match any single upper case letter, lower case letter, digit, hyphen, or full stop. Keep spaces out of the brackets unless you want to match them, and put the hyphen last so that it is not read as a range

4. r"[.*]"
Inside square brackets the full stop and the asterisk lose their special meaning, so this matches a single literal . or *. To match anything and everything, use r".*" without the brackets

You can find many more RegEx exercise questions on different websites. Do practice a few
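The four exercises can be run as a quick check. The sample strings for exercises 2 to 4 are made up for illustration, and the character class in exercise 3 is written without spaces and with the hyphen last so that it is treated as a literal character.

```python
import re

# 1. Split on runs of white space.
words = re.split(r"\s+", "My name is Data Monk")

# 2. Split at sentence-ending punctuation (sample text is hypothetical).
parts = re.split(r"[.?!]", "Hi there! How are you? Fine.")

# 3. Letters, digits, full stop and hyphen; hyphen last so it is literal.
allowed = re.findall(r"[a-zA-Z0-9.-]+", "v1.2-beta (draft)")

# 4. Inside brackets, . and * are literal characters ...
literal = re.findall(r"[.*]", "a.b*c")
# ... while outside brackets .* matches any run of characters.
anything = re.search(r".*", "a.b*c").group()

print(words)
print(parts)
print(allowed, literal, anything)
```

Note that re.split leaves an empty string at the end when the text finishes with a delimiter, as in exercise 2.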

Keep Learning 🙂

XtraMous





Understanding Linear Regression

Linear Regression is one of the simplest, most basic algorithms you could wish to start your learning with. But let me remind you, the concept of linear regression was with us long before machine learning and AI caught our attention! I am sure you can recall studying its basic version in your high school or engineering mathematics or statistics class.
So let me share with you one simple scenario where linear regression comes into the picture!

We are a family of two living in Bangalore city. Every day we approximately make 5 chapatis for dinner. I can cook them easily. Let’s say we purchased a house of our own here (Haha..In my dreams!).

Anyway, now every day either friends, colleagues or nosey relatives visit us to see our house and we courteously offer them to have dinner with us. Since I am new to cooking and I don’t want to be known as a bad host, every day I keep fretting about the availability of sufficient oil, ghee or flour once the guests arrive. How do I plan the dinner ahead? Once I know how many people are coming, I can try and estimate how many chapatis I’ll need to cook, can keep track of the required materials and be happy!

Figure (1) below shows my record of the past week.

What do we observe here?
Our data consists of two counts: the count of guests arriving and the count of chapatis I cook for dinner, and the two share a roughly linear relationship. So does the blue line shooting straight across the data points in the above X-Y plot ring a bell? Yes, it is the “best-fit line”, plotted to help us estimate approximately how many chapatis are cooked each day. It can help us find the number of chapatis we need to cook on any future day based on the guests’ head count. Plotting this line was easy in MS PowerPoint, but what actually happens behind the scenes is called simple linear regression. This best-fit line tells us that when no guests arrive I’ll be cooking approximately 6.66 chapatis, rounding up to 7, and that each guest adds about 3 chapatis. The best-fit line helps us identify the trend in the data.


Simple Linear Regression


In standard form, the “best-fit” line is given by 𝑦=𝑚𝑥+𝑐, where 𝑚 is the slope of the line, which tells us how much 𝑦 changes for each unit change in 𝑥. This makes 𝑦 a dependent variable and 𝑥 an independent variable. Here 𝑐 is the intercept, which gives the value of 𝑦 when the input 𝑥 is zero.
Calculation of 𝑚 and 𝑐:
So how do we find this “best-fit” line for the given data?

We can start with some random values (by an intelligent guess!) for 𝑚 and 𝑐 and plot an initial line. Then we compute the vertical distance of each point from the line, which is known as the residual: the difference between the true value of 𝑦 and the approximate value given by the line. It’s a no-brainer that the residuals should be as small as possible. Figure (2) shows the residual values marked by red lines. These values can be positive or negative and simply indicate the error in approximation. Because positive and negative residuals would cancel each other in a plain sum, we minimize the sum of squares of the residuals (the cost) by changing the values of 𝑚 and 𝑐 step by step. When the cost stops decreasing, we fix that 𝑚 and 𝑐 as our final result. This entire calculation can be done in an Excel sheet when we have only a handful of data points.
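To make the cost concrete, the sketch below computes the sum of squared residuals for two candidate lines. The one-week record is hypothetical (the original figure is not reproduced here), and the function name is made up for illustration.

```python
# Hypothetical one-week record: guests per day and chapatis cooked.
guests = [0, 1, 2, 3, 4, 5, 6]
chapatis = [7, 10, 12, 16, 19, 21, 25]

def sum_squared_residuals(m, c, xs, ys):
    # Residual = true y minus the value the line predicts for the same x.
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

# A rough first guess versus a line close to the one described in the text.
rough_guess = sum_squared_residuals(1.0, 5.0, guests, chapatis)
better_guess = sum_squared_residuals(3.0, 6.66, guests, chapatis)

print(rough_guess, better_guess)
```

The line with the smaller cost fits the data better, and the search for 𝑚 and 𝑐 is simply a search for the smallest such cost.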

In the simplest words, linear regression means inferring the relationship between dependent and independent variables. Knowingly or unknowingly, we use simple linear regression to make decisions and draw conclusions in our day-to-day life: estimating the salary for a new joiner based on years of experience, estimating commute time based on the traffic in the area and the weather conditions, estimating the availability of parking in a shopping center based on the day being a weekday, weekend or festival, or estimating sales of sarees during the wedding and festival season versus the off-season. All of these are examples where linear regression can help make an estimate or inference based on the data gathered during similar past events.
When linear regression is used for machine learning, its dynamics change because of Big Data. Making a prediction from thousands of data points obviously makes the computations more tedious, and more importance is given to the predictive power of the algorithm than to the underlying feature dependencies. Let’s look into that now.
Machine learning stands by its name. It is how a machine learns something useful and productive from the given data.

For a 3-5 year old boy, buying chocolate is all about enjoying its taste. As he grows older and accompanies his father for chocolate shopping, he learns how his father pays more for bigger chocolates or for a larger number of chocolates. He simply learns by observation and curiosity, and soon he will learn how to buy chocolates for himself, how much to pay and so on. Similarly, the computer needs to see and learn how one variable in the data is related to another. For example, it learns how the price of a house is related to the area of the house and the number of bedrooms, so that in future it can predict the price of a house whose price it has not seen before. The data it uses for learning is known as training data, and the data on which its learning is evaluated is known as test data. While teaching the machine, we need to give it the “true answers” as well. This is known as supervised learning. It observes what it needs to learn. The linear regression algorithm comes under supervised machine learning because the algorithm needs both the 𝑥 and 𝑦 variables to learn the relationship between them and find the best-fitting line.

Multiple Linear Regression

Earlier we saw how we can estimate one variable based on another variable. What if a particular event depends on multiple variables? Simple. It can be solved by multiple linear regression. For example, suppose you are estimating a yearly advertising budget for your product sales, and you have different modes of advertisement like Radio, TV, and Newspaper. How will you divide the budget? As a statistical consultant, you will have to answer various questions:
Is there any relationship between advertising budget and sales? If so, how strong is it? Which of the three media contributes the most to product sales? If you find that there is a linear relationship between advertising expenses in the media outlets and sales, then linear regression is an appropriate tool for advising the client on adjusting the advertising budgets, thereby indirectly improving sales. The figure below shows the linear relationship between sales and TV advertising. We can analyze Newspaper and Radio in the same way. So here we have 3 features/independent variables to take care of.
Observe the plot below for one variable:

Fig3: Advertising data, best line fitting for the regression of sales onto TV for nearly 200 markets

General Equation for Linear Regression
Now you must have got a pretty good idea about linear regression in general and how it rules our world and helps statisticians and analysts make business decisions. So let’s talk about how to train a linear regression algorithm for supervised learning. Trust me, we can walk through it seamlessly!
For supervised learning we need to provide a dataset in the following format:

Equation (1) tells us how the weights/parameters determine the effect of the features on the prediction. If weight 𝑤𝑖 is very large, then feature 𝑥𝑖 will have a larger impact on our prediction 𝑦̂, and if 𝑤𝑖=0 then the feature 𝑥𝑖 will have no impact on our prediction 𝑦̂. Similarly, if 𝑤𝑖 is positive then 𝑦̂ increases as 𝑥𝑖 increases, and if 𝑤𝑖 is negative then 𝑦̂ decreases as 𝑥𝑖 increases.
Now the model needs to learn from the data which features influence the target value and by how much, and the weights are updated accordingly so that the predictions get close to the true values.
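The effect of the weights can be seen in a small sketch. It assumes that Equation (1) is the usual weighted sum of the features plus an intercept term; the function name, the weights and the feature values are all hypothetical.

```python
def predict(weights, bias, features):
    # Assumed form of the model: intercept plus a weighted sum of features.
    return bias + sum(w * x for w, x in zip(weights, features))

# Hypothetical weights for three advertising features: TV, Radio, Newspaper.
weights = [0.05, 0.2, 0.0]
bias = 3.0

base = predict(weights, bias, [100.0, 20.0, 10.0])
# Newspaper has weight 0, so changing it does not move the prediction.
more_newspaper = predict(weights, bias, [100.0, 20.0, 50.0])
# Radio has a positive weight, so raising it raises the prediction.
more_radio = predict(weights, bias, [100.0, 30.0, 10.0])

print(base, more_newspaper, more_radio)
```

A negative weight would work the other way round: raising that feature would lower the prediction.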

Figure (4) shows that supervised linear regression model takes in the dataset of the form (𝑖𝑛𝑝𝑢𝑡 𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠,𝑡𝑟𝑢𝑒 𝑜𝑢𝑡𝑝𝑢𝑡 𝑣𝑎𝑙𝑢𝑒) and determines appropriate model weights or parameters that can give most accurate prediction for the output.

So how are these weights/parameters determined?

Remember the calculation of 𝑚 and 𝑐 we studied earlier? Something of that sort only!
We want the “predicted value 𝑦̂” to be closer to the “true value 𝑦”. So while learning, if the residual (𝑦̂(𝑖)−𝑦(𝑖)), also known as the prediction error, is larger, the model should suffer a higher cost.


The goal of our model is to do whatever it takes to minimize the loss/cost it suffers. It has to reach a set of parameter values such that for the given data it can perform the best possible prediction. This process is known as model optimization.
One of the most popular algorithms which can minimize a function governed by a set of parameters is Gradient Descent.

Gradient Descent Algorithm

I know it sounds eccentric but its concept is very simple. Let’s understand it with an analogy.
Suppose you are on a mountain, exhausted after trekking on rough terrain. It’s dawn, and there is fog and thin air. You wish to climb down and reach the lake in the beautiful green valley. Since the visibility is poor, you have to place your steps slowly, probably with the help of a stick or some kind of support. The best way is to feel the ground nearby, check where the land seems to descend, and move in that direction. After every step, you will smartly work out the direction like this, and slowly and steadily reach the lake. Finally, you have fresh air, water and lots of greenery to soothe your eyes!
Now take the cost function and plot it with respect to model parameters. It will be some sort of peaks and valleys curve and your goal is to find the parameters where the cost function is minimum, so you have to reach the valley point in the curve. Look at the figure below for understanding in simpler terms.

Here, J represents the cost function and w represents the parameter
Bear with a little calculus concept!

The gradient of the cost function at a given point is the slope of the tangent to the curve at that point. The slope of the tangent tells us the direction in which the cost rises. The model’s aim is to reach the minimum, so it moves in the opposite direction of the gradient.

The weight is updated as w := w − 𝛼 · ∂J/∂w. Here, 𝛼 is a real number called the learning rate, which controls how large each weight update is.

Once the model processes through the entire training data, it has learned the optimal values of parameters and its performance can now be evaluated using the test data.
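Here is a minimal sketch of gradient descent for the line 𝑦=𝑚𝑥+𝑐. The data is hypothetical and lies exactly on y = 3x + 7, the cost is assumed to be the mean squared error, and the learning rate and number of steps are arbitrary choices that happen to converge for this data.

```python
# Hypothetical data lying exactly on y = 3x + 7 (guests -> chapatis).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [7.0, 10.0, 13.0, 16.0, 19.0]

def cost(m, c):
    # Mean squared error between predictions and true values.
    return sum((m * x + c - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

m, c = 0.0, 0.0   # initial guess
alpha = 0.05      # learning rate (assumed value)
initial_cost = cost(m, c)

for _ in range(5000):
    # Partial derivatives of the mean squared error w.r.t. m and c.
    grad_m = sum(2 * (m * x + c - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_c = sum(2 * (m * x + c - y) for x, y in zip(xs, ys)) / len(xs)
    # Step against the gradient, scaled by the learning rate.
    m -= alpha * grad_m
    c -= alpha * grad_c

print(round(m, 3), round(c, 3), cost(m, c) < initial_cost)
```

With a learning rate that is too large the updates would overshoot and the cost would grow instead of shrinking, which is why 𝛼 has to be chosen with some care.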

Of course, linear regression is an extremely simple and limited learning algorithm, but I hope that this article has got you curious about more such learning models, about how they work and how they can be trained with the help of a given data to later perform complex tasks. That was my goal. Happy Learning 🙂

Keep Coding 🙂
Vishwa Dadhania

How to calculate the performance of your model?

To understand the performance of a model you need to start with the confusion matrix. If you skipped it earlier, you can go through it again here.

A confusion matrix is a summary of prediction results on a classification problem.

The number of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix.

The confusion matrix shows the ways in which your classification model is confused when it makes predictions.

It gives you insight not only into the errors being made by your classifier but more importantly the types of errors that are being made.

It is this breakdown that overcomes the limitation of using classification accuracy alone.

In a two-class problem, we are often looking to discriminate between observations with a specific outcome, from normal observations.

Such as a disease state or event from no disease state or no event.

In this way, we can assign the event row as “positive” and the no-event row as “negative“. We can then assign the event column of predictions as “true” and the no-event as “false“.

This gives us:

  • “true positive” for correctly predicted event values.
  • “false positive” for incorrectly predicted event values.
  • “true negative” for correctly predicted no-event values.
  • “false negative” for incorrectly predicted no-event values.

A confusion matrix can be computed for any machine learning classifier, such as logistic regression, decision tree, support vector machine, naive Bayes etc., as a cross-tabulation of observed (true) and predicted (model) classes. There are several metrics, such as precision and recall, that help us interpret the accuracy of the model and choose the best model. In the formulas below, A is the number of true positives, B the number of false positives, C the number of false negatives and D the number of true negatives.

Sensitivity = A/(A+C)

Specificity = D/(B+D)

Prevalence = (A+C)/(A+B+C+D)

Positive Predictive Value (PPV) = (sensitivity * prevalence)/((sensitivity*prevalence) + ((1-specificity)*(1-prevalence)))

Negative Predictive Value (NPV) = (specificity * (1-prevalence))/(((1-sensitivity)*prevalence) + ((specificity)*(1-prevalence)))

Detection Rate = A/(A+B+C+D)

Detection Prevalence = (A+B)/(A+B+C+D)

Balanced Accuracy = (sensitivity+specificity)/2

Precision = A/(A+B)

Recall = A/(A+C)

We are using the Thyroid example to understand how this confusion matrix is important to us. Suppose our test data set has 100 rows and the values in the confusion matrix are

true positive – 45

false positive – 5

true negative – 45

false negative – 5

So, the accuracy of your model will be (45+45)/(45+5+45+5), i.e. the number of correct predictions divided by the total number of predictions, which is 90%.

The false positives show that there were 5 people who did not have Thyroid but whom our model predicted as suffering from it.
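The formulas listed earlier can be applied directly to such counts. The sketch below uses counts consistent with the 90% accuracy worked out above (45 true positives, 45 true negatives, 5 false positives, 5 false negatives); the F1 score is added as an extra, since it appears in the classification report later on.

```python
# Counts consistent with the 90% accuracy worked out above.
tp, fp, tn, fn = 45, 5, 45, 5
total = tp + fp + tn + fn

accuracy = (tp + tn) / total           # correct predictions / all predictions
precision = tp / (tp + fp)             # A/(A+B)
recall = tp / (tp + fn)                # sensitivity, A/(A+C)
specificity = tn / (fp + tn)           # D/(B+D)
balanced_accuracy = (recall + specificity) / 2
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, balanced_accuracy, round(f1, 3))
```

With these symmetric counts every metric comes out at 0.9; on imbalanced data the numbers drift apart, which is exactly when they become more informative than accuracy alone.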

This was a revision of the things which we have already discussed. There are four other ways to evaluate a model

1. Classification accuracy
2. Logarithmic Loss
3. Area under ROC curve
4. Classification report

1. Classification accuracy

Classification accuracy is the number of correct predictions made as a ratio of all predictions made.

This is the most common evaluation metric for classification problems, and it is also the most misused. It is really only suitable when there are an equal number of observations in each class (which is rarely the case) and when all predictions and prediction errors are equally important, which is often not the case.

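from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
seed = 143
# shuffle=True is required when random_state is set
kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=seed)
model = LogisticRegression()
scoring = 'accuracy'
# X and Y are the feature matrix and target vector, assumed to be loaded already
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())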

a. model is Logistic Regression
b. kfold value is set to 5 splits i.e. cross-validation of the result will be done on 5 partitions. We will learn more about cross validation later
c. scoring parameter is accuracy

2. Logarithmic Loss

Logarithmic loss (or logloss) is a performance metric for evaluating the predictions of probabilities of membership to a given class.

The scalar probability between 0 and 1 can be seen as a measure of confidence for a prediction by an algorithm. Predictions that are correct or incorrect are rewarded or punished proportionally to the confidence of the prediction.
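To see how confidence is rewarded and punished, here is a small hand-rolled sketch of binary log loss. The function and the sample probabilities are for illustration only and are not scikit-learn's implementation; the clipping constant eps is an assumption to avoid taking the log of zero.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    # Average negative log-likelihood of the true labels (binary case).
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident_right = log_loss(labels, [0.9, 0.1, 0.9, 0.1])
unsure = log_loss(labels, [0.6, 0.4, 0.6, 0.4])
confident_wrong = log_loss(labels, [0.1, 0.9, 0.1, 0.9])

print(round(confident_right, 4), round(unsure, 4), round(confident_wrong, 4))
```

Lower is better here. Note that scikit-learn's 'neg_log_loss' scorer reports the negative of this value, so that a higher score is still the better one.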

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
seed = 143
# shuffle=True is required when random_state is set
kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=seed)
model = LogisticRegression()
scoring = 'neg_log_loss'
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

3. Area Under ROC curve

Area under ROC Curve (or AUC for short) is a performance metric for binary classification problems.

The AUC represents a model’s ability to discriminate between positive and negative classes. An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model as good as random.

ROC can be broken down into sensitivity and specificity. A binary classification problem is really a trade-off between sensitivity and specificity.

  • Sensitivity is the true positive rate, also called the recall. It is the proportion of instances from the positive (first) class that were actually predicted correctly.
  • Specificity is also called the true negative rate. It is the proportion of instances from the negative (second) class that were actually predicted correctly.
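One way to build intuition for AUC is to compute it as the share of (positive, negative) pairs in which the positive instance gets the higher score. The sketch below does this by brute force; it is an illustration with made-up scores, not scikit-learn's implementation, and it is only practical for small data sets.

```python
def roc_auc(y_true, scores):
    # AUC as the share of (positive, negative) pairs ranked correctly;
    # ties count as half. Quadratic in the data size, fine for a sketch.
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
perfect = roc_auc(labels, [0.9, 0.8, 0.2, 0.1])
uninformative = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])
mixed = roc_auc(labels, [0.9, 0.3, 0.4, 0.1])

print(perfect, uninformative, mixed)
```

The three cases match the description above: 1.0 for a perfect ranking, 0.5 for a model that cannot tell the classes apart, and something in between otherwise.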

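from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
seed = 143
# shuffle=True is required for random_state to have any effect in KFold
kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=seed)
model = LogisticRegression()
scoring = 'roc_auc'
# X and Y are assumed to hold the features and the binary target
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)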
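
One way to build intuition for the numbers 1.0 and 0.5: AUC equals the probability that a randomly chosen positive row gets a higher score than a randomly chosen negative row. The sketch below computes it pair by pair. The function name roc_auc and the scores are made up for illustration, and this brute-force approach is only meant for tiny examples.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
perfect = roc_auc(y_true, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])   # every positive outranks every negative
constant = roc_auc(y_true, [0.5] * 6)                       # model cannot tell the classes apart
mixed = roc_auc(y_true, [0.9, 0.4, 0.7, 0.6, 0.2, 0.1])     # one positive is ranked below a negative

print(perfect, constant, mixed)
```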

4. Classification Report

Scikit-learn provides a convenience report for classification problems that gives you a quick idea of the accuracy of a model using a number of measures.

The classification_report() function displays the precision, recall, f1-score and support for each class.

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
report = classification_report(Y_test, predicted)
print(report)


The code is mostly self-explanatory. For more background you can look up the Wikipedia article on each metric. If any part of the code is unclear, don’t worry. We will deal with these in the upcoming days.

Keep Coding 🙂

XtraMous

Before you start Modeling – Feature Engineering

Feature Engineering is one place where you have to put in a lot of effort. At the beginning of any project you will have very little data, but then you need to dig in and torture the data set to get more columns. Let’s take a few examples to see how feature engineering is done.

Let’s take the famous Titanic data set, which has the following columns and data types

PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object

This data set is already complete, and a supervised learning model built on it to predict who survived the accident will get decent accuracy. But as a Data Scientist, your job is to make the model as good as possible. Let’s see which columns we can create

First of all, you must know the concept of one-hot encoding, where you turn a categorical column with, say, 3 categories into 3 separate columns with binary (0/1) values. We will deal with it below
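
As a minimal sketch of one-hot encoding, the snippet below turns a column with 3 categories into 3 binary columns using plain Python. The function name one_hot, the is_ prefix, and the sample values are illustrative assumptions. In practice pandas offers pd.get_dummies for the same job.

```python
def one_hot(values):
    """Return one dict per row with a 0/1 entry for every category seen."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

# Embarked has three categories in the Titanic data: S, C, and Q
embarked = ["S", "C", "Q", "S"]
encoded = one_hot(embarked)

for original, row in zip(embarked, encoded):
    print(original, row)
```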

1. The names of the passengers are given in the form Nitin, Mr. Kamal. So we can extract the title of each passenger and create a new categorical column with only titles. There are 10+ titles like Mr, Miss, Dr, Mrs, etc. We first take the frequency of each title and then merge the low-frequency categories into one ‘other’ category. Congrats, you created your first column.

2. A lot of passengers have no cabin recorded, while others have a cabin number in the ‘Cabin’ column. You can create a new column with value 1 if the passenger has a cabin and 0 otherwise. So you have another categorical variable. Yeahhh

3. You can also bin Age into categories (age bands)

4. You can bin Fare into categories in the same way

5. You can add Parch and SibSp to get the family size of the passenger. Maybe a person travelling alone survived more often than a passenger with a larger family

#Creating new family_size column
df['Family_Size']=df['SibSp']+df['Parch']

6. We have a ‘Cabin’ column that is not doing much. Mostly 1st class passengers have a cabin recorded, and the rest are ‘Unknown’. A cabin number looks like ‘C123’. The letter refers to the deck, so we are going to extract it just like the titles.

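#Turning cabin number into Deck
#substrings_in_string(x, cabin_list) is assumed to be a helper, defined elsewhere, that returns the first entry of cabin_list found in x
cabin_list = ['A', 'B', 'C', 'D', 'E', 'F', 'T', 'G', 'Unknown']
df['Deck']=df['Cabin'].map(lambda x: substrings_in_string(x, cabin_list))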
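
The ideas in points 1, 2, 3, and 6 can be sketched with pandas as follows. The sample rows are made up, and the rare-title threshold, the age bin edges, and the body of substrings_in_string are my assumptions for illustration, not the original helper.

```python
import pandas as pd

# Made-up rows in the Titanic layout, only for illustration
df = pd.DataFrame({
    "Name": ["Kamal, Mr. Nitin", "Rao, Mr. Arun", "Roy, Miss. Asha",
             "Paul, Miss. Tina", "Sen, Dr. Amit"],
    "Age": [22.0, 35.0, 4.0, 16.0, 61.0],
    "Cabin": ["C123", None, "E46", None, "B20"],
})

# 1. Title: the text between the comma and the first full stop
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
# Merge titles seen fewer than 2 times into 'Other' (threshold picked for this toy sample)
counts = df["Title"].value_counts()
rare = counts[counts < 2].index
df.loc[df["Title"].isin(rare), "Title"] = "Other"

# 2. Has_Cabin: 1 when a cabin number is recorded, else 0
df["Has_Cabin"] = df["Cabin"].notna().astype(int)

# 3. Age bands (the bin edges are arbitrary choices)
df["Age_Band"] = pd.cut(df["Age"], bins=[0, 12, 18, 60, 120],
                        labels=["Child", "Teen", "Adult", "Senior"])

# 6. Deck: hypothetical version of the helper used in the post
def substrings_in_string(value, substrings):
    """Return the first entry of `substrings` found in `value`, else 'Unknown'."""
    if isinstance(value, str):
        for s in substrings:
            if s in value:
                return s
    return "Unknown"

cabin_list = ['A', 'B', 'C', 'D', 'E', 'F', 'T', 'G', 'Unknown']
df["Deck"] = df["Cabin"].map(lambda x: substrings_in_string(x, cabin_list))

print(df[["Title", "Has_Cabin", "Age_Band", "Deck"]])
```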

There was one more project where we had to predict the number of burgers sold by McDonald’s, and we only had monthly data. We tried an ARIMA model, but we also needed to test Linear Regression and ARIMAX, so we needed a few more columns. These are the columns we came up with

1. We created 4 binary columns for seasons i.e. Summer, Winter, Spring, and Monsoon.

2. We created a separate column for Month number

3. A column for the year, which we converted into a factor so that the model does not interpret 2018 as higher than 2005, because these are categories and not numbers

4. We created a column with the number of weekends in the month. The hypothesis was that a month with more weekends will have more burgers sold

5. We also created a number-of-days column in the data set, since a month with 31 days should sell more burgers than one with 28 days
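
A minimal sketch of how the month, year, number-of-days, and weekend columns could be derived with Python’s standard calendar module. The function name month_features is hypothetical, and counting individual Saturdays and Sundays is my assumption about what “number of weekends” means. The season flags are left out because the month-to-season mapping is not given here.

```python
import calendar

def month_features(year, month):
    """Derive calendar-based features for one month of monthly sales data."""
    _, n_days = calendar.monthrange(year, month)
    weekend_days = sum(
        1 for day in range(1, n_days + 1)
        if calendar.weekday(year, month, day) >= 5  # 5 = Saturday, 6 = Sunday
    )
    return {
        "year": str(year),          # kept as a string so it is treated as a category
        "month": month,
        "n_days": n_days,
        "weekend_days": weekend_days,
    }

february = month_features(2018, 2)
march = month_features(2018, 3)
print(february)
print(march)
```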

We ran an ensemble model using Linear Regression, ARIMA, and ARIMAX to get a good accuracy.

To practice more, pick up any data set and understand the problem statement. Then you can create more columns to boost the performance of the model.

If you want to learn a complete modeling experience on real data set, then you can go through the book given below

Complete Linear Regression and ARIMA forecasting using R

You can also learn the complete project in a conversational way below

The Monk who knew Linear Regression

Keep coding 🙂

XtraMous

What are training and test datasets?

Suppose you have 1000 rows of data and you want to make a model to predict whether a person is suffering from Thyroid. There are 15 columns, with the 15th being a binary variable: 0 (Normal) or 1 (Thyroid).

First you have to decide which model you want to build, and then you need to identify the important features for it. Once you have these, you have to train your model. This is when you decide how many rows you want to use for training.

Typically, the training dataset is 70-80% of the total. When you provide these data to your model, the model will start making rules accordingly.

For example, suppose weight, blood sugar, and waist size are the important features, and there is a person with Thyroid and attributes [100, 300, 42], where 100 is the weight in kilograms, 300 is the blood sugar level, and 42 is the waist size. If this row is selected in the training dataset, the model will create a rule saying that high weight, high blood sugar, and a large waist size result in Thyroid. We will have some 700-800 such rows in our training dataset to make multiple rules.
The dataset on which you train your model, i.e. the dataset the model uses to make rules, is called the training dataset.

We started with 1000 rows and trained our model on 750 rows. The remaining 250 rows will be used to test the accuracy of the model. We already know the output for these 250 rows, but we feed them to the model without the output column, so that it applies the rules it created from the training dataset. You will then have 16 columns in your test dataset, where the 16th one holds the predicted values. The 15th and 16th columns contain the actual and predicted values, so you can create a confusion matrix to understand the accuracy and performance of your model.

Bottom line is that you divide the total dataset into 80:20 or 70:30 or anything around this (train:test) and then build your model on the larger chunk of data and check it against the smaller one. Once you have the result for testing the data, then make a confusion matrix to understand the result.

The code in Python to split the complete dataset into train and test is given below

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

train_test_split is a function in the sklearn.model_selection module. Here we pass it 3 arguments.
X is the dataset with all the rows and the independent variables, y is the dependent variable (Thyroid in this case), and test_size is 0.25, i.e. the training dataset will have 75% of the rows and the test dataset 25%.

This function will create 4 datasets, i.e. X_train, X_test, y_train, and y_test.
X_train – 75% of the data and 14 columns (excluding the Outcome column)
y_train – the same 75% of the rows with only one column, i.e. the Outcome column
X_test – the remaining 25% of the data with the same columns as X_train
y_test – the actual outcomes for the test rows, against which you match the model’s predictions
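
To make the four pieces concrete, here is a hand-rolled stand-in for the split. It is a sketch of the idea, not scikit-learn’s implementation, and the function name simple_train_test_split, the seed, and the placeholder data are assumptions.

```python
import random

def simple_train_test_split(rows, labels, test_size=0.25, seed=0):
    """Shuffle row indices once, then cut them into a test part and a train part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    # The same indices are used for rows and labels so they stay aligned
    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# 1000 placeholder rows with 14 feature columns and a 0/1 outcome
X = [[i] * 14 for i in range(1000)]
y = [i % 2 for i in range(1000)]

X_train, X_test, y_train, y_test = simple_train_test_split(X, y, test_size=0.25)
print(len(X_train), len(X_test), len(y_train), len(y_test))
```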

Now let’s talk about the confusion matrix in brief

A confusion matrix is a summary of prediction results on a classification problem.

The number of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix.

The confusion matrix shows the ways in which your classification model is confused when it makes predictions.

It gives you insight not only into the errors being made by your classifier but more importantly the types of errors that are being made.

It is this breakdown that overcomes the limitation of using classification accuracy alone.

In a two-class problem, we are often looking to discriminate between observations with a specific outcome and normal observations, such as a disease state or event versus no disease state or no event.

In this way, we can assign the event row as “positive” and the no-event row as “negative”. We can then assign the event column of predictions as “true” and the no-event as “false”.

This gives us:

  • “true positive” for correctly predicted event values.
  • “false positive” for incorrectly predicted event values.
  • “true negative” for correctly predicted no-event values.
  • “false negative” for incorrectly predicted no-event values.

A confusion matrix can be computed for any machine learning classifier, such as logistic regression, decision tree, support vector machine, naive Bayes, etc., as a cross-tabulation of observed (true) and predicted (model) classes. Several metrics derived from it, such as precision and recall, help us interpret the accuracy of the model and choose the best model.

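In the formulas below, A is the number of true positives, B the false positives, C the false negatives, and D the true negatives.

Sensitivity = A/(A+C)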

Specificity = D/(B+D)

Prevalence = (A+C)/(A+B+C+D)

Positive Predictive Value (PPV) = (sensitivity * prevalence)/((sensitivity*prevalence) + ((1-specificity)*(1-prevalence)))

Negative Predictive Value (NPV) = (specificity * (1-prevalence))/(((1-sensitivity)*prevalence) + ((specificity)*(1-prevalence)))

Detection Rate = A/(A+B+C+D)

Detection Prevalence = (A+B)/(A+B+C+D)

Balanced Accuracy = (sensitivity+specificity)/2

Precision = A/(A+B)

Recall = A/(A+C)

We are using the Thyroid example to understand why the confusion matrix is important to us. Suppose our test data set has 100 rows and the values in the confusion matrix are

true positive – 45

false positive – 5

true negative – 45

false negative – 5

So, the accuracy of your model will be (45+45)/(45+5+45+5), i.e. the number of correct predictions (true positives plus true negatives) divided by the total number of predictions, which is 90%.

The false positive count shows that there were 5 people who did not have Thyroid but whom our model predicted as suffering from it.
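
The formulas above can be checked with a few lines of Python. This sketch assumes the counts that give 90% accuracy on 100 test rows, i.e. 45 true positives, 5 false positives, 5 false negatives, and 45 true negatives. The function name confusion_metrics is hypothetical.

```python
import math

def confusion_metrics(tp, fp, fn, tn):
    """Compute the metrics listed above from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    prevalence = (tp + fn) / total
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv = (specificity * (1 - prevalence)) / (
        (1 - sensitivity) * prevalence + specificity * (1 - prevalence))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "prevalence": prevalence,
        "ppv": ppv,
        "npv": npv,
        "detection_rate": tp / total,
        "detection_prevalence": (tp + fp) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "precision": tp / (tp + fp),
        "recall": sensitivity,
    }

metrics = confusion_metrics(tp=45, fp=5, fn=5, tn=45)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Note that PPV computed from sensitivity, specificity, and prevalence comes out equal to precision, as it should.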

To learn a lot more about what interview questions are asked in Data Science interviews(Myntra, Flipkart, Accenture, Bookmyshow, Oyo, etc.), you can go through our best seller

What do they ask in Data Science interview
What do they ask in Data Science interview Part 2

Keep coding 🙂

XtraMous

Flipkart Interview Questions

Position – Data Scientist
Location – Bangalore

Number of Rounds – 5

Round 1 – Aptitude and Logical Reasoning
Round 2 – Case Study (Non-Elimination Round)
Round 3 – Technical Interview (SQL and Python)
Round 4 – Project discussion
Round 5 – Human Resource

Round 1 – Aptitude and Logical Reasoning
Sample Questions are given below
1. A bag contains 7 white, 3 red and 5 blue balls. Three balls are drawn at random from the bag. What is the probability that all of them are red?
2. Two unbiased coins are tossed. What is the probability of getting at most one head?
3. Two cards are drawn from a pack of 52 cards. What is the probability that both cards are Kings?
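
If you want to check your answers to the three sample questions, they can be worked out with exact fractions. This is a quick sketch using the standard library, and the variable names are my own.

```python
from fractions import Fraction
from math import comb

# 1. All three drawn balls are red: choose 3 of the 3 red balls out of 15 balls
p_all_red = Fraction(comb(3, 3), comb(15, 3))

# 2. At most one head in two tosses = 1 - P(two heads)
p_at_most_one_head = 1 - Fraction(1, 4)

# 3. Both cards are Kings: choose 2 of the 4 Kings out of 52 cards
p_two_kings = Fraction(comb(4, 2), comb(52, 2))

print(p_all_red, p_at_most_one_head, p_two_kings)
```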

Questions were mostly around time and work, puzzles, and there was a guesstimate question.

The first round was an elimination round and the cut-off was somewhere around 12 out of 15 questions
Around 80% were eliminated in this round.

Round 2 – Case Study (Non-Elimination Round)
The case study was to increase the number of conversions of freemium customers to premium customers for a telecom company in India.

Round 3 – Technical Interview (SQL and Python)

1. Employee Table

Emp_Number
Employee_Name
Job
Manager_ID
Hire_Date
Salary
Commission
Department_Number

2. Department Table

Department_Number
Department_Name
Location

a. List the employees who joined before 2018
b. List the employees whose annual salary is between 25000 and 50000
c. List the employees who joined in January
d. List the employees who are senior to their own Manager
e. Find details of the highest-paid employee
f. Get details of the senior-most employee
g. Find the total salary given to Managers
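
To practice a few of these, you can load a tiny Employee table into an in-memory SQLite database from Python. This is only a sketch: the three rows are made up, dates are assumed to be stored as ISO strings, and the queries for a, c, d, and e are one possible answer each, written in SQLite’s dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Employee (
    Emp_Number INTEGER PRIMARY KEY, Employee_Name TEXT, Job TEXT,
    Manager_ID INTEGER, Hire_Date TEXT, Salary REAL,
    Commission REAL, Department_Number INTEGER)""")
# Made-up rows for illustration
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?, ?, ?, ?, ?)", [
    (1, "Asha", "Manager", None, "2016-03-10", 90000, None, 10),
    (2, "Ravi", "Analyst", 1, "2015-01-20", 40000, None, 10),
    (3, "Meena", "Analyst", 1, "2019-07-01", 45000, None, 10),
])

# a. Employees who joined before 2018 (ISO date strings compare correctly as text)
before_2018 = conn.execute(
    "SELECT Employee_Name FROM Employee "
    "WHERE Hire_Date < '2018-01-01' ORDER BY Emp_Number").fetchall()

# c. Employees who joined in January
january = conn.execute(
    "SELECT Employee_Name FROM Employee "
    "WHERE strftime('%m', Hire_Date) = '01'").fetchall()

# d. Employees who are senior to their own manager (self-join on Manager_ID)
senior_to_manager = conn.execute(
    "SELECT e.Employee_Name FROM Employee e "
    "JOIN Employee m ON e.Manager_ID = m.Emp_Number "
    "WHERE e.Hire_Date < m.Hire_Date").fetchall()

# e. Details of the highest-paid employee
highest_paid = conn.execute(
    "SELECT Employee_Name, Salary FROM Employee "
    "WHERE Salary = (SELECT MAX(Salary) FROM Employee)").fetchall()

print(before_2018, january, senior_to_manager, highest_paid)
```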

Python:-

What graphs do you use for basic EDA?
How useful is box-plot graph?
What are the basic checks you do for cleaning the data?
How to join tables in python?
How to access the items of a dictionary?
When to use list, set or dictionaries ?

There were more questions on plotting and basic EDA


Round 4 – Project discussion
A Complete overview of the project was asked first and then there were some coding questions. Every question was related to the Project

Round 5 – Human Resource
Why do you want to quit the current job?
Have you led a team in the past?
You might have to work during the festive season, are you okay with it?
Salary negotiation

Do go through the following books to get the complete list of questions and answers to various Data Science Interviews
1. What do they ask in a Data Science Interview – Flipkart, Myntra, OYO Rooms, Tredence, and Meredith India
2. What do they ask in a Data Science Interview Part 2 – Sapient, Deloitte, Amazon, Accenture, and Book My Show









Supply Chain Analytics in Python

Let’s take a case study of Supply Chain optimization.

There is a restaurant which serves Mega Pizza (40”). It has 1 oven, 3 bakers, and 2 packers. Following is the time required by each pizza

Resource  Number     Pizza A  Pizza B  Pizza C  Working Days
Oven      1 Oven     1 Day    0.5 Day  1 Day    30 Days
Baker     3 Bakers   1 Day    2 Days   2 Days   30 Days
Packer    2 Packers  1 Day    1 Day    1 Day    20 Days
Profit               $30      $40      $50

Now you have to maximize the profit using the PuLP library. Use decision variables, an objective function, and constraints.

How many pizzas of each type should we make in 30 days?

First let’s look into the coding part in Python

from pulp import *
model = LpProblem("Maximize_Pizza_Profit", LpMaximize)

#Declare Decision Variable
A = LpVariable('A', lowBound=0, upBound=None, cat='Integer')
B = LpVariable('B', lowBound=0, upBound=None, cat='Integer')
C = LpVariable('C', lowBound=0, upBound=None, cat='Integer')

#Define Objective function (profit per pizza from the table)
model += 30*A + 40*B + 50*C

#Define Constraints
#For Oven
model += 1*A + 0.5*B + 1*C <= 30
#For Baker
model += 1*A + 2*B + 2*C <= 90
#For Packer
model += 1*A + 1*B + 1*C <= 40

#Solve Model
model.solve()
print("Produce {} Pizza A".format(A.varValue))
print("Produce {} Pizza B".format(B.varValue))
print("Produce {} Pizza C".format(C.varValue))


Now let’s understand the code

from pulp import *
Here you are importing the complete package

model = LpProblem("Maximize_Pizza_Profit", LpMaximize)
Here you are defining the model using the LpProblem function. LpMaximize tells the solver to maximize the objective, i.e. Profit. If you want the minimum value instead, use LpMinimize, for example when the goal is to reduce wastage.

A = LpVariable('A', lowBound=0, upBound=None, cat='Integer')
Here we define each variable using the LpVariable function. lowBound is the lowest possible value of the variable.
The number of pizzas cannot be negative, so we set it to 0. upBound is the maximum value of the variable.
None means there is no upper bound.
cat is the category of the variable. It can be 'Continuous', 'Integer', or 'Binary'

model += 1*A + 0.5*B + 1*C <=  30
This is the constraint for Oven. A requires 1 day, B requires 0.5 Day, and C requires 1 Day. The <=30 is the constraint which is because there is one oven which will work for 30 days

model += 1*A+2*B+2*C <=90
Similar to the above, the Baker will need 1, 2, and 2 days for A, B, and C respectively. There are 3 Bakers who work 30 days each, so the constraint is 30*3 = 90

model += 1*A+1*B+1*C <= 40
A Packer takes 1, 1, and 1 day for pizza A, B, and C. There are 2 Packers who work 20 days each, so the constraint is 2*20 = 40
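
Because the problem is small, you can sanity-check the solver’s answer with a brute-force search in plain Python. This sketch assumes the objective is the profit row of the table (30, 40, and 50 dollars per pizza) and simply tries every whole-number plan that the three constraints allow.

```python
best_profit, best_plan = -1, None

# Upper limits come from the constraints: A and C <= 30 (oven), B <= 40 (packer)
for a in range(0, 31):
    for b in range(0, 41):
        for c in range(0, 31):
            oven_ok = 2 * a + b + 2 * c <= 60    # A + 0.5B + C <= 30, doubled to stay in integers
            baker_ok = a + 2 * b + 2 * c <= 90
            packer_ok = a + b + c <= 40
            if oven_ok and baker_ok and packer_ok:
                profit = 30 * a + 40 * b + 50 * c
                if profit > best_profit:
                    best_profit, best_plan = profit, (a, b, c)

print("Best plan (A, B, C):", best_plan)
print("Profit:", best_profit)
```

The search settles on 0 of Pizza A and 20 each of Pizza B and Pizza C for a profit of $1800, which a correctly specified PuLP model should reproduce.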


Supervised Learning Overview

The word “Supervised” means monitored. A supervised learning algorithm is one in which you train a model on data whose output is already known, and the model then takes new inputs and predicts the outcome. Confusing?

Let’s try an example
You own a restaurant and you have collected various information about the customers like Name, Status, Job, Salary, Address, Home town, Food item they ordered, etc.
Now you want to make a recommendation engine where a new customer’s data is used to give that customer a free dish. You took the data of all the customers and fed it into your model. Now this model knows that if a person is from Punjab (a state in India) and is 26 years old, then there is a high chance of him ordering Paratha (sorry if I am stereotyping :P)

So, you already have the historic data and, most importantly, you know the output for each row of data. Using this historic data you created a model which learns and makes recommendations in real time. This whole process is based on the fact that “the model creates a set of rules which enables it to understand the nature of the data, and it can then use this set of rules for further prediction”

Interestingly most of the work you will do in your Data Science job will revolve around Supervised Learning.

Supervised learning is where you have input variables (X) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.

Y = f(X)

The goal is to approximate the mapping function so well that when you have new input data (X), you can predict the output variable (Y) for that data.

It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers, the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance
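
As a tiny illustration of learning Y = f(X) from known answers, the sketch below fits a straight line to five labeled points with ordinary least squares and then predicts the output for an unseen input. The data and the names fit_line and predict are made up, and a real project would use a library such as scikit-learn.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Training data where the correct answers are known (here y = 2x + 1)
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)

def predict(x):
    """Apply the learned mapping to new input."""
    return slope * x + intercept

prediction = predict(6)
print(slope, intercept, prediction)
```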

The most important Supervised Learning algorithms are:-
1. Support Vector Machines
2. Linear Regression
3. Logistic Regression
4. Naive Bayes
5. Linear Discriminant Analysis (LDA)
6. Decision Tree
7. K-Nearest Neighbor
8. Neural Network
9. Similarity Learning

You will learn about each of these algorithms one by one, but first let’s look into the process involved in building these models

Step 1 – Gather your data
Step 2 – Clean the data. It will occupy a lot of your time
Step 3 – Feature Engineering. You might need to create or derive new features from the already present data set. The input object is transformed into a feature vector, which contains a number of features that are descriptive of the object.
Step 4 – Determine which algorithm you want to implement on your data set
Step 5 – Run the model on the training data set. Some supervised learning algorithms require the user to determine certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation.
Step 6 – Evaluate the performance or accuracy of the model. If everything is fine, then run the model on the test dataset

Above we saw the list of Supervised Learning Algorithms. Supervised Learning problems can further be divided into two categories:-
a. Classification – A classification problem is one where the output variable is categorical. If you are predicting different diseases on the basis of symptoms, that falls under Classification

b. Regression – Regression is used when you need to predict continuous values like Number of customers coming to a restaurant, the number of visitors on a website, etc.

Some of the applications of Supervised Learning:-

1. Use a predictive algorithm to estimate how many marks each student will get
2. Use Logistic Regression to find out which customers will encash their insurance policy
3. Predicting house prices
4. Weather forecasting
5. Classification of emails (Spam and non-spam)
6. In supervised learning for image processing, for example, an AI system might be provided with labeled pictures of vehicles in categories such as cars and trucks. After a sufficient amount of observation, the system should be able to distinguish between and categorize unlabeled images, at which time training can be said to be complete.

Supervised Learning is like learning from a teacher. He will teach you the ways to answer questions and will evaluate your learning. You can expect the same types of questions to appear in the examination, i.e. your testing condition, and you answer according to your understanding. Your marks are your accuracy.

Courtesy – Big Data Made Simple

We will use Python to train our Supervised Learning algorithm in the next few Days.

Keep Learning 🙂

XtraMous