100 Natural Language Processing Questions in Python

  1. What is NLP?
NLP stands for Natural Language Processing. It is a branch of data science that provides systematic processes for analyzing, understanding, and deriving information from text data in a smart and efficient manner.

  2. What are the uses of NLP?
Natural Language Processing is useful in various domains such as chatbots, extracting insights from feedback and surveys, text classification, etc.

  3. What are the different algorithms in NLP?
NLP is used to analyze text, allowing machines to understand how humans speak.
    This human-computer interaction enables real-world applications like
    a. automatic text summarization
    b. sentiment analysis
    c. topic extraction
    d. named entity recognition
    e. parts-of-speech tagging
    f. relationship extraction
    g. stemming, and more.
    NLP is commonly used for text mining, machine translation, and automated question answering.
  4. What problems can NLP solve?
NLP can solve many problems such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.
  5. What is Regular Expression?
A regular expression (sometimes called a rational expression) is a sequence of characters that defines a search pattern, mainly for use in pattern matching with strings, i.e. “find and replace”-like operations.
    Regular expressions are a generalized way to match patterns with sequences of characters.
  6. What are the different applications of Regular Expression in Data Science?
    a. Search engines like Google, Yahoo, etc. The Google search engine understands that you are a tech person, so it shows you results relevant to you.
    b. Social website feeds such as the Facebook news feed. The news feed algorithm understands your interests using natural language processing and shows you related ads and posts more prominently than other posts.
    c. Speech engines like Apple's Siri.
    d. Spam filters like Google's spam filter. It's not just about the usual spam filtering; spam filters now understand what's inside the email content and check whether it is spam or not.
  7. What are the packages in Python that help with Regular Expressions?
    The package we commonly use for regular expressions is re. We can import it using the following command:

    import re
  8. What is the match function?
    re.match() checks for the pattern only at the beginning of a string. If the string starts with the pattern, it returns a match object; otherwise it returns None.

    import re
    re.match('ni', 'nitin')


    <re.Match object; span=(0, 2), match='ni'>
  9. What are the common patterns used in regular expressions?
    \w+ -> one or more word characters (a word)
    \d -> a digit
    \s -> a whitespace character
    . -> wildcard (matches any single character)
    + or * -> greedy quantifiers (one or more / zero or more of the preceding pattern)
    \S -> non-whitespace, i.e. it matches anything that is not a space
    [A-Z] – matches any single character in the range capital A to capital Z
  10. What are the important functions to use in Regular Expressions?
    findall() – finds all non-overlapping occurrences of the pattern in a string and returns them as a list
    search() – searches for the pattern anywhere in the string and returns the first match
    match() – matches the pattern only at the beginning of the string
    split() – splits a string on every match of the pattern and returns a list
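    Below is a minimal sketch of all four on a made-up string:

    import re

    text = "Data Science in 100 days"
    print(re.findall(r"\w+", text))   # ['Data', 'Science', 'in', '100', 'days']
    print(re.search(r"\d+", text))    # a match object for '100', found anywhere in the string
    print(re.match(r"Data", text))    # a match object, because the string starts with 'Data'
    print(re.split(r"\s", text))      # ['Data', 'Science', 'in', '100', 'days']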
  11. What is the difference between match and search function?
match() tries to match the pattern at the beginning of the string, whereas search() matches it wherever it first finds the pattern. The example below will help you understand better.

    import re
    print(re.match('kam', 'kamal'))
    print(re.match('kam', 'nitin kamal'))
    print(re.search('kam', 'kamal'))
    print(re.search('kam', 'nitin kamal'))

    <re.Match object; span=(0, 3), match='kam'>
    None
    <re.Match object; span=(0, 3), match='kam'>
    <re.Match object; span=(6, 9), match='kam'>

  12. Guess the output of the following
    import re
    re.split(r'\s', 'The Data Monk is cool')


    ['The', 'Data', 'Monk', 'is', 'cool']
  13. Work out the output of the following
    regx = r"\w+"
    strx = "This isn't my pen"
    re.findall(regx, strx)

    ['This', 'isn', 't', 'my', 'pen']
  14. How to write a regular expression to match a specific set of characters in a string?
    special_char = r"[?/}{';]"
    The above regular expression will match any single character listed inside the square brackets.

  15. Write a regular expression to split a paragraph every time it finds an exclamation mark

    import re
    exclamation = r"[!]"
    strr = "Data Science comprises of innumerable topics! The aim of this 100 Days series is to get you started assuming ! that you have no prior! knowledge of any of these topics. "
    excla = re.split(exclamation, strr)
    print(excla)


    ['Data Science comprises of innumerable topics', ' The aim of this 100 Days series is to get you started assuming ', ' that you have no prior', ' knowledge of any of these topics. ']
  16. Get all the words starting with a capital letter

    capital = r"[A-Z]\w+"
    print(re.findall(capital, strr))


    ['Data', 'Science', 'The', 'Days']

  17. Find the output of the following code?
    digit = "12 34 98"
    find_digit = r"\d+"
    print(re.findall(find_digit, digit))

    ['12', '34', '98']

  18. What is tokenization?
Tokenization is one of the most important parts of NLP. It simply means breaking a string down into smaller chunks: paragraphs into sentences, sentences into words, and so on.
  19. What is NLTK?
    NLTK stands for Natural Language Toolkit. It is a Python package that is very commonly used for tokenization.

    from nltk.tokenize import word_tokenize
    word_tokenize("This is awesome!")


    ['This', 'is', 'awesome', '!']

  20. What are the important nltk tokenizers?

    sent_tokenize – splits a document into sentences
    TweetTokenizer – built specifically for tweets, which comes in handy if you are doing sentiment analysis on a particular hashtag or set of tweets
    regexp_tokenize – tokenizes a string or document based on a regular expression pattern

  21. What is the use of the function set()?
    The set data type is a collection. It contains an unordered collection of unique, immutable objects. So when you extract a set of words from a novel, you get the distinct words of the whole novel. It is a very useful function and it will keep coming up as you go ahead.
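    A tiny illustration (the word list is made up):

    words = ["data", "monk", "data", "science", "monk"]
    print(set(words))      # {'data', 'monk', 'science'} – duplicates removed, order not guaranteed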
  22. Tokenize the paragraph given below into sentences.

    from nltk.tokenize import sent_tokenize
    from nltk.tokenize import word_tokenize

    para = "This is the story about Piyush,29, Senior Data Scientist at Imagine Incorporation and myself, Pihu,24, Junior Data Scientist at the same organization. This is about the journey of Piyush once he retired from his job, after being unsatisfied with the way his career was moving ahead. Be with Piyush and Pihu to understand Data Science and Machine Learning."
    sent = sent_tokenize(para)
    print(sent)


    ['This is the story about Piyush,29, Senior Data Scientist at Imagine Incorporation and myself, Pihu,24, Junior Data Scientist at the same organization.', 'This is about the journey of Piyush once he retired from his job, after being unsatisfied with the way his career was moving ahead.', 'Be with Piyush and Pihu to understand Data Science and Machine Learning.']
  23. Now get all the words from the above paragraph

    word = word_tokenize(para)
    print(word)

    ['This', 'is', 'the', 'story', 'about', 'Piyush,29', ',', 'Senior', 'Data', 'Scientist', 'at', 'Imagine', 'Incorporation', 'and', 'myself', ',', 'Pihu,24', ',', 'Junior', 'Data', 'Scientist', 'at', 'the', 'same', 'organization', '.', 'This', 'is', 'about', 'the', 'journey', 'of', 'Piyush', 'once', 'he', 'retired', 'from', 'his', 'job', ',', 'after', 'being', 'unsatisfied', 'with', 'the', 'way', 'his', 'career', 'was', 'moving', 'ahead', '.', 'Be', 'with', 'Piyush', 'and', 'Pihu', 'to', 'understand', 'Data', 'Science', 'and', 'Machine', 'Learning', '.']

  24. Now get the unique words from the above paragraph
    word = set(word_tokenize(para))
    print(word)


    {'retired', 'ahead', 'the', 'about', 'with', 'Piyush,29', 'Senior', 'Piyush', 'being', 'Science', 'was', 'Imagine', 'at', 'journey', 'way', 'same', 'and', 'Pihu', 'Pihu,24', 'Learning', 'from', 'story', 'he', 'Be', 'Machine', 'once', 'to', 'unsatisfied', 'Junior', 'of', 'career', 'Data', 'moving', 'is', 'understand', '.', 'myself', 'after', 'job', ',', 'Incorporation', 'Scientist', 'organization', 'This', 'his'}

  25. What is the use of .start() and .end() function?

    .start() and .end() give you the starting and ending index of a match. Below is an example:

    x = re.search("Piyush", para)
    print(x.start(), x.end())


    24 30
  26. What is the OR method?
    The OR operator (written as a pipe, |) lets a regular expression match one pattern or another. See the example below:

    x = r"\d+|\w+"

    The above regex will get you all the numbers and words, but it will ignore other characters like punctuation, ampersands, etc.
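    For example (the sentence is made up):

    import re
    print(re.findall(r"\d+|\w+", "He scored 90 runs & 2 wickets!"))

    ['He', 'scored', '90', 'runs', '2', 'wickets']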

  27. What are the advanced tokenization techniques?
    You can combine character ranges and other regex patterns in the tokenizer itself. Take [A-Za-z]+, for example: it matches runs of alphabetic characters regardless of whether they are upper or lower case.
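    For instance, nltk's regexp_tokenize can take such a pattern directly (a minimal sketch; the sentence is made up):

    from nltk.tokenize import regexp_tokenize
    print(regexp_tokenize("SOS!! Please help me!!", r"[A-Za-z]+"))

    ['SOS', 'Please', 'help', 'me']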

  28. How to write a regex to match spaces or commas?
    (\s+|,) – \s+ matches one or more whitespace characters, and the pipe acts as an OR operator so that a comma is matched as well

  29. How to include special characters in a regex?
    If you have any experience with regular expressions or SQL queries, this syntax will look familiar. You need to put a backslash before the special character, like below:

    (\,|\.|\?) – This will match a comma, full stop, or question mark in the text
  30. What is the difference between (a-z) and [A-Z]?
    This is an important distinction: (a-z) matches only the literal string “a-z” (the parentheses just group it), whereas [A-Z] is a character class that matches any single upper-case letter between A and Z.
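    A quick check of the difference (the test strings are made up):

    import re
    print(re.findall(r"(a-z)", "a-z is not a range here"))   # ['a-z'] – only the literal text matches
    print(re.findall(r"[A-Z]", "The Data Monk"))             # ['T', 'D', 'M'] – any single capital letter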
  31. Once again go through the difference between search() and match() function
    search() will find the pattern anywhere in the string, whereas match() always looks at the beginning of the string; if the pattern is not at the start, match() returns None. Be deliberate when choosing between the two.
  32. What is topic modeling?
    In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.
  33. What is bag-of-words?
    Bag-of-words is a simple way to represent text for identifying topics: it counts the frequency of each token in the text. The example below illustrates the concept.

    para = “The game of cricket is complicated. Cricket is more complicated than Football”

    The – 1
    game – 1
    of – 1
    cricket – 1
    is – 2
    complicated – 2
    more – 1
    Cricket – 1
    than – 1
    Football – 1

    As you can see, “cricket” appears as two separate entries (cricket and Cricket) because bag-of-words is case sensitive.

  34. How to counter the case-sensitive nature of bag-of-words?
    It's a logical step: convert every word to lower (or upper) case before counting. See question 37 for converting every word to lower case in a loop.
  35. What is counter?
    A Counter is a container that keeps track of how many times equivalent values are added. It looks similar to a dictionary in Python. Counter supports three forms of initialization: its constructor can be called with a sequence of items, with a dictionary of keys and counts, or with keyword arguments mapping string names to counts.
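    The three forms of initialization look like this:

    from collections import Counter
    print(Counter(['data', 'monk', 'data']))   # from a sequence of items
    print(Counter({'data': 2, 'monk': 1}))     # from a dictionary of counts
    print(Counter(data=2, monk=1))             # from keyword arguments

    All three print Counter({'data': 2, 'monk': 1}).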

  36. How to import Counter in Python?
    Counter lives in the collections module; you can import it directly like below:

    from collections import Counter

  37. Use the same paragraph used above and print the top 3 most common words
    The code is self explanatory and is given below:

    word2 = word_tokenize(para)
    lower_case = [t.lower() for t in word2]
    bag_of_words = Counter(lower_case)
    print(bag_of_words.most_common(3))

    [('the', 4), (',', 4), ('data', 3)]
  38. What is text preprocessing?
    Text preprocessing is the process of making text ready for analysis by removing stop words, punctuation, spelling mistakes, etc. Before any analysis you are supposed to preprocess the text.
  39. What are the commonly used methods of text preprocessing?
    Converting the complete text in either lower or upper case
    Tokenization
    Lemmatization/Stemming
    Removing stop words
  40. How to tokenize only words from a paragraph while ignoring the numbers and other special characters?

    from nltk.tokenize import word_tokenize
    x = "Here is your text. Your 1 text is here"
    only_alphabet = [w for w in word_tokenize(x.lower())
                     if w.isalpha()]
    print(only_alphabet)


    The w.isalpha() check keeps only tokens made up entirely of letters, so numbers and punctuation are dropped.

    Output
    ['here', 'is', 'your', 'text', 'your', 'text', 'is', 'here']
  41. What are stop words?
    Stop words are commonly occurring words in a text that have high frequency but carry little meaning. Words like the, are, is, also, he, and she are examples of English stop words.

  42. How to remove stop words from my text?
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    para = "Your text here. Here is your text"
    tokens = [w for w in word_tokenize(para.lower())
                      if w.isalpha()]
    stoppy = [t for t in tokens
                      if t not in stopwords.words('english')]

  43. What is Lemmatization?
    Lemmatization is a technique to reduce a word to its base or dictionary form (the lemma). Examples:

    The lemma of “better” is “good”.
    The word “walk” is the base form of the word “walking”.
  44. Give an example of Lemmatization in Python
    import nltk
    nltk.download('wordnet')
    from nltk.stem import WordNetLemmatizer

    lem = WordNetLemmatizer()
    x = "running"
    lem.lemmatize(x, "v")

    Output
    'run'
  45. How to lemmatize the texts in your paragraph?
    Use the WordNetLemmatizer module from nltk.stem

    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    lower_tokens = word_tokenize(para)
    lower_case = [t.lower() for t in lower_tokens]
    only_alphabet = [t for t in lower_case if t.isalpha()]
    without_stops = [x for x in only_alphabet if x not in stopwords.words("english")]
    lemma = WordNetLemmatizer()
    lemmatized = [lemma.lemmatize(t) for t in without_stops]


  46. What is gensim?
    Gensim is a very popular open-source NLP library. It is used to perform tasks such as:
    a. building document or word vectors
    b. topic identification
  47. What is a word vector?
    A word vector is a multi-dimensional mathematical representation of a word, created using deep learning methods. Based on how words are used in text, word vectors let us capture the meaning and context of words and observe relationships between words and documents. For example, word vectors will connect Bangalore to Karnataka and Patna to Bihar, since Bangalore and Patna are the capitals of the Indian states Karnataka and Bihar.

    They give us insight into the relationships between words in a corpus.
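    As a minimal sketch, spaCy exposes word vectors through its similarity API (this assumes a model that ships with vectors, such as en_core_web_md, has been downloaded):

    import spacy

    nlp = spacy.load('en_core_web_md')          # medium English model with word vectors
    doc = nlp('Bangalore Karnataka Patna')
    print(doc[0].similarity(doc[1]))            # similarity between related words
    print(doc[0].similarity(doc[2]))            # usually lower for less related words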

  48. What is LDA?
    LDA stands for Latent Dirichlet Allocation and is used for topic analysis and modeling, i.e. extracting the main topics from a dataset. Topic modeling is the task of using unsupervised learning to extract the main topics (represented as sets of words) that occur in a collection of documents.
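    A minimal gensim sketch, assuming you already have tokenized documents (the toy corpus below is made up):

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    docs = [['cricket', 'bat', 'ball'],
            ['election', 'vote', 'minister'],
            ['cricket', 'match', 'score']]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
    print(lda.print_topics())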

  49. What is gensim corpus?
    A gensim corpus converts tokens into a bag-of-words representation: each document becomes a list of (token id, token count) pairs. The gensim dictionary can be updated and reused.
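    For example (continuing with a toy document):

    from gensim.corpora import Dictionary

    tokens = [['the', 'data', 'monk', 'is', 'cool', 'data']]
    dictionary = Dictionary(tokens)
    corpus = [dictionary.doc2bow(doc) for doc in tokens]
    print(corpus)   # something like [[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1)]] – (token id, count) pairs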

  50. What is stemming?
    Stemming is the process of reducing an inflected word to its stem by chopping off affixes (prefixes and suffixes). Stemming is important in natural language understanding (NLU) and natural language processing (NLP), and it is also used by search engines when processing queries.
  51. Give an example of stemming in Python
    from nltk.stem.porter import PorterStemmer
    stem = PorterStemmer()
    x = "running"
    stem.stem(x)

    Output
    'run'
  52. What is tf-idf?
    tf-idf stands for term frequency–inverse document frequency. It down-weights words that are common across the documents of a corpus (beyond the usual stop words), so the weighting is document specific. The weight of term i in document j is

    w(i, j) = tf(i, j) * log(N / df(i))

    where tf(i, j) is the frequency of the term in the document, N is the number of documents, and df(i) is the number of documents containing the term.

    The weight will be low in two cases:
    a. when the term frequency is low, i.e. the word occurs rarely in the document
    b. when df(i) is close to N, because log(N / df(i)) is then close to zero

    So, using (b), if a word occurs in nearly every document, its weight will be low.

    If the word “abacus” appears 5 times in a document of 100 words, and the corpus has 200 documents of which 20 mention “abacus”, the tf-idf weight is:

    (5/100) * log(200/20)
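    Working that out (assuming log base 10, which is common for idf):

    import math

    tf  = 5 / 100                  # term frequency of "abacus" in the document
    idf = math.log10(200 / 20)     # log(N / df) = log(10) = 1
    print(tf * idf)                # 0.05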

    53. How to create a tf-idf model using gensim?

    from gensim.models.tfidfmodel import TfidfModel

    # corpus, doc, and dictionary are assumed to come from the gensim Dictionary built earlier
    tfidf = TfidfModel(corpus)
    tfidf_weights = tfidf[doc]

    # Sort the weights from highest to lowest: sorted_tfidf_weights
    sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)
    # Print the top 5 weighted words
    for term_id, weight in sorted_tfidf_weights[:5]:
        print(dictionary.get(term_id), weight)

    54. What is Named Entity Recognition?
    It is the process of identifying important named entities in a document, e.g. organizations, dashboard names, works of art, etc.
    In nltk it is available through the ne_chunk_sents() function and can be used as below:

    chunk_sent = nltk.ne_chunk_sents(Part_Of_Speech_sentence_token, binary=True)

    55. What is POS?
    A Part of Speech (POS) tag in Natural Language Processing labels each word according to its role in the sentence.
    It is available as pos_tag() in the nltk package. You can feed the tokenized words in a loop to get the POS tags for each sentence, like below:

    pos = [nltk.pos_tag(x) for x in tokenized_word_variable]

    56. What is the difference between lemmatization and stemming?
    Lemmatization gets to the base (dictionary) form of the word, whereas stemming just chops off the ends of words to approximate the base form. The examples below make the difference clear:

    “See” is the lemma of “saw”, but a stemmer confronted with “saw” may return just “s”.
    “See” is the lemma of “seeing”, and stemming “seeing” will also get you “see”.
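    A small sketch to compare the two in nltk (the actual outputs depend on the stemmer and on the WordNet entries, so they may not match the idealized examples above exactly):

    from nltk.stem import PorterStemmer, WordNetLemmatizer   # requires nltk.download('wordnet')

    stem = PorterStemmer()
    lemma = WordNetLemmatizer()
    for word in ['saw', 'seeing']:
        print(word, '->', stem.stem(word), '|', lemma.lemmatize(word, 'v'))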

    54. What is the spacy package?
    spaCy is a very efficient Python package that makes it easy to build NLP pipelines and to find entities in tweets and chat messages.

    55. How to initiate the English module in spacy?
    import spacy
    nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser'])   # older spaCy versions used spacy.load('en', tagger=False, parser=False, matcher=False)


    56. Why should one prefer spacy over nltk for named entity recognition?
    spaCy provides some extra entity categories beyond the ones provided by nltk.

    These categories include:
    - NORP (nationalities, religious and political groups)
    - CARDINAL
    - MONEY
    - WORK_OF_ART
    - LANGUAGE
    - EVENT

    So you can try spaCy for NER according to your needs.
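    A minimal sketch (this assumes the small English model en_core_web_sm is installed):

    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp('The Data Monk was started in Bangalore in 2018')
    print([(ent.text, ent.label_) for ent in doc.ents])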

    57. What are the different packages that use word vectors?
    spaCy and gensim are the two packages we have covered so far that use word vectors.

    58. What if your text is in several different languages? Which package can help you with Named Entity Recognition for most widely spoken languages?
    Polyglot is a package that supports more than 100 languages and uses word vectors for Named Entity Recognition.

    59. What is supervised learning?
    Supervised learning is a form of machine learning where the model is trained by looking at the given output for each input. The model is trained on these input-output combinations and its learning is then evaluated on a test dataset. Linear regression and classification are two examples of supervised learning.

    60. How can you use Supervised Learning in NLP?
    Suppose you have chat data and, based on keywords, you have labeled each chat with the customer's sentiment. You now have a dataset containing complete chats along with their sentiment. You can train a supervised model on this dataset and then use it during a live chat to identify the ongoing sentiment of the customer.

    61. What is Naïve-Bayes model?
    Naive Bayes classifiers are linear classifiers that are known for being simple yet very efficient. The probabilistic model of naive Bayes classifiers is based on Bayes’ theorem, and the adjective naive comes from the assumption that the features in a dataset are mutually independent.

    62.What is the flow of creating a Naïve Bayes model?
    from sklearn import metrics
    from sklearn.naive_bayes import MultinomialNB

    # count_train/count_test are document-term matrices (e.g. from CountVectorizer)
    # and y_train/y_test are the corresponding labels
    # Instantiate a Multinomial Naive Bayes classifier: nb_classifier
    nb_classifier = MultinomialNB()
    # Fit the classifier to the training data
    nb_classifier.fit(count_train, y_train)
    # Create the predicted tags: pred
    pred = nb_classifier.predict(count_test)
    # Calculate the accuracy score: score
    score = metrics.accuracy_score(y_test, pred)
    print(score)
    # Calculate the confusion matrix: cm
    cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
    print(cm)


    Let’s take some sample text and try to implement basic algorithms first

    63. What is POS?
    POS stands for Parts of Speech tagging, and it is used to tag the words in your document according to their part of speech. Nouns, pronouns, verbs, etc. are tagged accordingly, and you can then filter what you need from the dataset. If I am just looking for names of people mentioned in a comment box, then I will mainly look for nouns. This is a basic but very important algorithm to work with.

    64. Take a sentence and break it into tokens, i.e. individual words
    from nltk.tokenize import word_tokenize

    text = "The Data Monk will help you learn and understand Data Science"
    tokens = word_tokenize(text)
    print(tokens)

    ['The', 'Data', 'Monk', 'will', 'help', 'you', 'learn', 'and', 'understand', 'Data', 'Science']


    65. Take the same sentence and get the POS tags

    from nltk import word_tokenize, pos_tag

    text = "The Data Monk will help you learn and understand Data Science"
    tokens = word_tokenize(text)
    print(pos_tag(tokens))


    [('The', 'DT'), ('Data', 'NNP'), ('Monk', 'NNP'), ('will', 'MD'), ('help', 'VB'), ('you', 'PRP'), ('learn', 'VB'), ('and', 'CC'), ('understand', 'VB'), ('Data', 'NNP'), ('Science', 'NN')]

    66. Take the following line, break it into tokens, and tag POS using a function
    data = "The Data Monk was started in Bangalore in 2018. Till now it has more than 30 books on Data Science on Amazon"

    # Tokenize the words and apply POS tags
    def token_POS(token):
        token = nltk.word_tokenize(token)
        token = nltk.pos_tag(token)
        return token

    token = token_POS(data)

    The output is a list of (word, POS tag) tuples, one for every token in the sentence.

    67. What is NER?
    NER stands for Named Entity Recognition, and the job of this algorithm is to extract specific chunks of information from your text data, for instance all the names mentioned in the dataset. It is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, etc.

    68. What are some of the common POS tags? You need to know their meanings to use them in your regular expressions
    DT – Determiner
    FW – Foreign word
    JJ – Adjective
    JJR – Comparative adjective
    NN – Singular noun
    NNS – Plural noun
    RB – Adverb
    RBS – Superlative adverb
    VB – Verb

    You can get the complete list on the internet.

    69. Implement NER on the tokenized and POS tagged sentence used above.
    nltk.download('maxent_ne_chunker')
    nltk.download('words')
    ne_chunked_sents = nltk.ne_chunk(token)
    named_entities = []
    for tagged_tree in ne_chunked_sents:
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
            entity_type = tagged_tree.label()   # get NE category
            named_entities.append((entity_name, entity_type))
    print(named_entities)

    [('Data Monk', 'ORGANIZATION'), ('Bangalore', 'GPE'), ('Data Science', 'PERSON'), ('Amazon', 'ORGANIZATION')]

    Code explanation
    nltk.download('maxent_ne_chunker') downloads the chunker used to break the sentence into named entity chunks, and nltk.download('words') downloads the word list it relies on.

    We already have a variable token which contains the POS-tagged tokens. nltk.ne_chunk(token) groups those tokens into named entity chunks.

    The function hasattr() checks whether an object has the given named attribute; it returns True if present, else False.

    The .leaves() function gets the leaves of a node, and label() gets you the NER label.

    70. What are n-grams?
    A combination of N consecutive words is called an N-gram. N-grams (N > 1) are generally more informative features than single words (unigrams), and bigrams (N = 2) are often considered the most useful of all. The next question generates 3-grams of a text.

    71. Create a 3-gram of the sentence below
    "The Data Monk was started in Bangalore in 2018"


    def ngrams(text, n):
        token = text.split()
        final = []
        for i in range(len(token)-n+1):
            final.append(token[i:i+n])
        return final

    ngrams("The Data Monk was started in Bangalore in 2018", 3)

    Output

    [['The', 'Data', 'Monk'], ['Data', 'Monk', 'was'], ['Monk', 'was', 'started'], ['was', 'started', 'in'], ['started', 'in', 'Bangalore'], ['in', 'Bangalore', 'in'], ['Bangalore', 'in', '2018']]

72. What is the right order of components for a text classification model?

Text cleaning
Text annotation
Text to predictors
Gradient descent
Model tuning

73. What is CountVectorizer?
CountVectorizer is a class from sklearn.feature_extraction.text. It converts a collection of text documents to a matrix of token counts.

———————————————————

Let's take up a project and try to solve it using NLP. We will create a small dataset and then use NLP and a Random Forest to train a model that identifies the sentiment of a review.

The objective of the project is to predict the correct tag, i.e. whether people liked the food or not, using NLP and Random Forest.


74. How to create a dataset? What to write in it?
Open an excel file and save it as Reviews (in the csv format). Now make two columns in the sheet like the one given below

Review,Liked
This restaurant is awesome,1
Food not good,0
Ambience was wow,1
The menu is good,1
Base was not good,0
Very bad,0
Wasted all the food,0
Delicious,1
Great atmosphere,1
Not impressed with the food,0
Nice,1
Bad taste,0
Great presentation,1
Lovely flavor,1
Polite staff,1
Bad management,0


Basically you can write reviews of anything: movies, food, restaurants, etc. Just make sure to keep this format. With that, your dataset is ready.

75. Which packages do I need to import for this project?
It's always good to start by importing all the packages you might use in the project

import re
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix


We will discuss each of these as we tackle the problem

76. How to import a csv file in Python?
Importing a csv file in Python requires the pandas library and its read_csv function

review = pd.read_csv('C://Users//User//Downloads//Restaurant_Reviews.csv')

77. Let’s view the top and bottom 5 lines of the file to make sure we are good to go with the analysis
Use the commands given below
review.head() and review.tail()

78. Now we will clean the dataset, starting with removing numbers and punctuation. Write a regular expression that removes special characters and numbers

review is the name of the dataset and Review is the name of the column

final = []
for i in range(0,16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])

79. What is sub() method?
The re.sub() function in the re module is used to replace substrings.

The syntax is re.sub(pattern, repl, string).

It replaces every match of pattern in string with repl.
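For example (the review string is made up):

import re
x = re.sub('[^a-zA-Z]', ' ', 'Food not good!!! 2/10')
print(x)    # 'Food not good' followed by spaces – every non-letter is replaced by a space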

80. Convert all the text into lower case and split the words
final = []
for i in range(0,16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
    x = x.lower()
    x = x.split()

81. Now we want to stem the words. Do you remember the definition of stemming?
Stemming is the process of reducing an inflected word to its stem by chopping off affixes (prefixes and suffixes). It is important in natural language understanding (NLU) and natural language processing (NLP), and it is also used by search engines when processing queries.

final = []
for i in range(0,16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
    x = x.lower()
    x = x.split()
    port = PorterStemmer()
    x = [port.stem(words) for words in x
         if not words in set(stopwords.words('english'))]

82. What does the above snippet do?
port = PorterStemmer() assigns a PorterStemmer instance to the variable port.

x = [port.stem(words) for words in x
         if not words in set(stopwords.words('english'))]

The list comprehension above takes the words one by one, drops the ones that are stop words, and stems the rest.

83. Create the final dataset with only stemmed words.
final = []
for i in range(0,16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
    x = x.lower()
    x = x.split()
    port = PorterStemmer()
    x = [port.stem(words) for words in x
         if not words in set(stopwords.words('english'))]
    x = ' '.join(x)
    final.append(x)

Let's see what the final dataset looks like after removing the stop words and stemming the text

84. How to use the CountVectorizer() function? Explain using an example
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The Data Monk helps in providing resource to the users',
         'It is useful for people making a career in Data Science',
         'You can also take the 100 days Challenge of TDM']
counter = CountVectorizer()
X = counter.fit_transform(corpus)
print(counter.get_feature_names())   # use get_feature_names_out() on newer scikit-learn versions
print(X.toarray())


get_feature_names() takes all the words from the corpus and arranges them in alphabetical order (newer scikit-learn versions call this get_feature_names_out()).
fit_transform() learns that vocabulary and transforms each line of the corpus into a vector of counts over it.
toarray() converts the sparse result into a dense NumPy array.

Let's understand the output

['100', 'also', 'can', 'career', 'challenge', 'data', 'days', 'for', 'helps', 'in', 'is', 'it', 'making', 'monk', 'of', 'people', 'providing', 'resource', 'science', 'take', 'tdm', 'the', 'to', 'useful', 'users', 'you']

         [[0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 2 1 0 1 0]
         [0 0 0 1 0 1 0 1 0 1 1 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0]
         [1 1 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 1]]

The first output is the 26 unique words from the 3 lines of the corpus, arranged in alphabetical order.
The next three rows record the presence of those words in each document. The 0s in the 1st, 2nd, 3rd, and 4th positions of the first row show that the words 100, also, can, and career are not present in the first line of the input.
Similarly, the 2 in the 22nd position shows that the word "the" is present twice in the first row of input.
The first row of input is "The Data Monk helps in providing resource to the users".

85. Now let’s apply CountVectorizer on our dataset
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1000)
X = cv.fit_transform(final).toarray()

max_features = 1000 makes sure that at most the 1,000 most frequent words are put into the master array. If you are planning to apply this on a huge dataset, do increase the max_features value.
X holds the same kind of array of occurrences across all the features as we saw in the example above.

86. How to separate the dependent variable?
We want to predict whether the review was positive or not, so the dependent variable is the second column, and we put its values in a separate variable, y.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(final).toarray()
y = review.iloc[:,1].values


So X holds the array of word occurrences across all the features, and y holds the binary label, where 1 means liked and 0 means did not like.

87. Now we need to split the complete data set into train and test

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)


You already know about X and y; test_size = 0.25 splits the data into train and test sets in a 75:25 ratio.
Now you will train the model on X_train and y_train.

88. Random Forest is one of the best models for supervised learning. By the way, what is a Random Forest?
Before explaining a forest, we need to know what a tree is, because a random forest is made of decision trees. To illustrate the concept, we'll use an everyday example: predicting tomorrow's maximum temperature for our city. To keep things straight, I'll use Seattle, Washington, but feel free to pick your own city.
In order to answer the single max temperature question, we actually need to work through an entire series of queries. We start by forming an initial reasonable range given our domain knowledge, which for this problem might be 30–70 degrees (Fahrenheit) if we do not know the time of year before we begin. Gradually, through a set of questions and answers we reduce this range until we are confident enough to make a single prediction.

Since temperature is highly dependent on time of year, a decent place to start would be: what is the season? In this case, the season is winter, and so we can limit the prediction range to 30–50 degrees because we have an idea of what the general max temperatures are in the Pacific Northwest during the winter. This first question was a great choice because it has already cut our range in half. If we had asked something non-relevant, such as the day of the week, then we could not have reduced the extent of predictions at all and we would be back where we started. Nonetheless, this single question isn’t quite enough to narrow down our estimate so we need to find out more information. A good follow-up question is: what is the historical average max temperature on this day? For Seattle on December 27, the answer is 46 degrees. This allows us to further restrict our range of consideration to 40–50 degrees. Again, this was a high-value question because it greatly reduced the scope of our estimate.

We need to have similar questions and once we put everything in a flow we will get a decision tree.
So, to arrive at an estimate, we used a series of questions, with each question narrowing our possible values until we were confident enough to make a single prediction. We repeat this decision process over and over again in our daily lives with only the questions and answers changing.

89. What is Random Forest?
Every person comes to the problem with different background knowledge and may interpret the exact same answer to a question entirely differently. In technical terms, the predictions have variance because they will be widely spread around the right answer. Now, what if we take predictions from hundreds or thousands of individuals, some of which are high and some of which are low, and decided to average them together? Well, congratulations, we have created a random forest! The fundamental idea behind a random forest is to combine many decision trees into a single model.


You can read a lot on Medium.com for explanations of Decision Trees and Random Forests in layman's terms.

90. Let’s create our Random forest model here
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 10,
                            criterion = 'entropy')

model.fit(X_train, y_train)

91. Define n_estimator
n_estimators is the number of trees you want in your forest. Try varying it.
In general, the more trees you use, the better the results. However, the improvement decreases as the number of trees increases; at a certain point the benefit in prediction performance from learning more trees will be lower than the cost in computation time of learning them.

Random forests are ensemble methods: you average over many trees. Similarly, if you want to estimate the average of a real-valued random variable (e.g. the average height of a citizen in your country), you can take a sample. The standard error of the estimate decreases with the square root of the sample size, and at a certain point the cost of collecting a larger sample will be higher than the benefit in accuracy obtained from it.

92. Define criterion. Why did you use entropy and not gini?
Gini is intended for continuous attributes, while entropy suits attributes that occur in classes.
Gini tends to minimize misclassification.

Entropy is more useful for exploratory analysis.
Entropy is also a little slower to compute.

93. What is model.fit()?
model.fit() trains your model. The two parameters are the training data, X_train and y_train. The classifier takes the review vectors and their labels and builds many decision trees that fit the output to the input. These learned rules are then applied to your testing dataset to get predictions.

94. Let’s predict the output for the testing dataset
y_pred = model.predict(X_test)

You have just trained the model on X_train and y_train. Now you predict the output for X_test. We already have the true labels for these rows, but we want the model to predict them so that we can compare its answers with the actual output.

95. Now let’s check the confusion matrix to see how many of our outputs were correct
from sklearn.metrics import confusion_matrix  

cm = confusion_matrix(y_test, y_pred)

96. Lastly, what is a confusion matrix and how do we measure the accuracy of the model?
A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.

Let's take an example of a binary confusion matrix with 50 and 100 on the diagonal and 10 and 5 off the diagonal.

The rows contain the real values and the columns the predicted values. The 50 and 100 are the cases where the predicted and actual values agree; the 10 and 5 are the cases where the prediction was wrong. Explore precision, recall, etc. on your own.

As far as accuracy is concerned, the formula is simple: (50 + 100) / (50 + 10 + 5 + 100),
i.e. the total number of correct predictions divided by all predictions.
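scikit-learn can also compute this directly; a small sketch using our test labels and predictions:

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))   # fraction of correct predictions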

Our model had a very small dataset. The confusion matrix came out with 1 and 3 on the diagonal and zeros off the diagonal.

Therefore accuracy = (1+3)/(1+0+0+3) = 100% accuracy
Yeahhhh..we are perfect

Complete code

import re
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

review = pd.read_csv('C://Users//User//Downloads//Restaurant_Reviews.csv')
review.tail()
final = []

for i in range(0,16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
    x = x.lower()
    x = x.split()
    port = PorterStemmer()
    x = [port.stem(words) for words in x
         if not words in set(stopwords.words('english'))]
    x = ' '.join(x)
    final.append(x)


cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(final).toarray()
y = review.iloc[:,1].values
print(X)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 501,
                            criterion = 'entropy')
                        

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred
cm = confusion_matrix(y_test, y_pred)
cm

Damn !! I got out in the nervous 90’s 😛

This is all you need to hop on a real-life problem or a hackathon. Do comment if you find any flaw in the code.

Keep Learning 🙂

The Data Monk