Let's start with the 4th set of Python Interview Questions. Product-based companies in India often ask these questions in the first round of interviews for Business Intelligence Engineer, Data Analyst, Business Analyst, and Data Scientist roles.
18. Check whether a particular element is present in a list

l = ['a', 'b', 'd']
count = l.count('d')   # number of occurrences; greater than 0 means it is present
print(count)
or
l_set = set(l)
if 'b' in l_set:
    print("Found it")
or use a naive loop
for i in range(0, len(l)):
    if l[i] == 'd':
        print("Found it at index", i)
19. Get the sum of elements in a list
l = [1, 3, 5, 6]
s = 0
for i in range(0, len(l)):
    s = s + l[i]
print(s)
or
list1 = [9, 8, 7, 6]

def sum_of_list(lst, size):   # renamed to avoid shadowing the built-in list
    if size == 0:
        return 0
    return lst[size - 1] + sum_of_list(lst, size - 1)

total = sum_of_list(list1, len(list1))
print("Sum of all elements in given list:", total)
20. Multiply all the numbers in a list

l = [1, 3, 5, 6]
m = 1
for i in range(0, len(l)):
    m = m * l[i]   # the original snippet used an undefined variable s here
print(m)
The Data Monk services
We are well known for our interview books and have 70+ e-books across Amazon and The Data Monk e-shop page. Following are the best-seller combo packs and services that we currently provide:
YouTube channel covering all the interview-related important topics in SQL, Python, MS Excel, Machine Learning Algorithms, Statistics, and direct interview questions. Link – The Data Monk YouTube Channel
Website – ~2000 completely solved interview questions in SQL, Python, ML, and Case Study. Link – The Data Monk website
E-book shop – We have 70+ e-books available on our website and 3 bundles covering 2000+ solved interview questions. Do check it out. Link – The Data E-shop Page
Instagram page – It covers the most asked questions and concepts (100+ posts), explained in simple terms. Link – The Data Monk Instagram page
Today we will take some Python Array Interview Questions. We will try to cover the basic questions and then will move towards some interview questions that are frequently asked in the analytics domain.
Companies like Zynga, Myntra, Housing, Ola, Oyo, etc. emphasize quite a bit on the Python skills of the candidate (at least moderate skills) for the first round. In our series of 200 questions, we will try to cover all the types of questions asked with solutions.
Python Array Interview Questions
11. Finding the sum of elements of the array
a = [1, 4, 5, 7]
s = 0
for i in a:
    s = s + i
print("Sum of array is =", s)
12. The Largest element of an array
a = [1, 5, 7, 3, 4, 9, 18, 222]
num = len(a)

def lar(a):
    largest = a[0]
    for i in range(1, num):   # start from index 1; the original started at 2 and skipped a[1]
        if largest <= a[i]:
            largest = a[i]
    return largest

print("The largest element is =", lar(a))
13. Rotate an array
x = [1,2,5,6,7]
y = len(x)
z = []
for i in range(0, y):
    z.append(x[y - i - 1])   # appends elements from the end, i.e. this reverses the list
print(z)
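Note that the loop above actually builds the reverse of the array. A left rotation by k positions (k is a hypothetical parameter here, not part of the original question) can be sketched with slicing:

```python
def rotate_left(arr, k):
    """Rotate a list to the left by k positions using slicing."""
    k = k % len(arr)            # handle k larger than the list length
    return arr[k:] + arr[:k]

x = [1, 2, 5, 6, 7]
print(rotate_left(x, 2))        # [5, 6, 7, 1, 2]
```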
14. Split the array at a particular point
a = [1, 4, 6, 12, 54]
d = 3   # split point
x = []
y = []
for i in range(0, len(a)):
    if i < d:
        x.append(a[i])
    else:
        y.append(a[i])
print(x, y)   # the original printed x+y, which just reconstructs the input list
15. Interchange the first and the last element in a list
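No snippet accompanies this question in the original post; a minimal sketch using tuple unpacking (the sample list is an assumption):

```python
l = [1, 2, 3, 4]
# swap the first and last elements in place with tuple unpacking
l[0], l[-1] = l[-1], l[0]
print(l)  # [4, 2, 3, 1]
```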
Python Interview Questions for Analysts will help you understand the type of questions you can expect in an analytics interview. Product-based companies usually ask at least 5-7 Python questions around basic logic, like palindromes, rotating an array, sum of a diagonal, strings, Armstrong numbers, etc., to check the candidate's experience in Python. We will build a complete set of 200 questions to make sure you can handle these questions like a breeze.
6. Check whether a number is prime

x = int(input("Enter a number = "))
if x > 1:
    for i in range(2, int(x/2) + 1):
        if x % i == 0:
            print("Not a Prime Number")
            break
    else:                        # for-else: runs only if the loop never hit break
        print("Prime Number")
else:
    print("Not a prime number")  # numbers less than or equal to 1 are not prime
7. Prime numbers in a list with a given starting and ending point

x = int(input("Enter a starting point = "))
y = int(input("Enter an ending point = "))

def prime(x, y):
    primes = []                  # renamed to avoid shadowing the function name
    for i in range(x, y):
        if i == 0 or i == 1:
            continue
        for j in range(2, int(i/2) + 1):
            if i % j == 0:
                break
        else:                    # no divisor found, so i is prime
            primes.append(i)
    return primes

print("The list of prime numbers is =", prime(x, y))
8. Fibonacci series: 0, 1, 1, 2, 3, 5, 8, 13…
x = int(input("Enter the n-th fibonacci series number = "))
def fib(x):
if x<=0:
print("Incorrect number")
elif x==1:
return 0
elif x==2:
return 1
else:
return fib(x-1)+fib(x-2)
print("Fibonacci number on the n-th place is ", fib(x))
9. Check if a given number is a perfect square or not

import math   # needed for math.sqrt; the original snippet was missing this import

n = int(input("Enter a number = "))

def sq(n):
    s = int(math.sqrt(n))
    return s * s == n

print("The status of number =", sq(n))
10. Print the ASCII value of a character in Python

x = input("Enter a character = ")
print("The ASCII value of the character is", ord(x))
Python Analytics interview questions. In this series, we will start with some basic Python questions, then slowly move to moderate-level DSA questions, followed by Pandas, NumPy, and the OS module. All you need to do is install Python on your system and start from scratch.
We will try to solve each question in multiple ways. If you are missing out on the basics, do consult TutorialsPoint or W3Schools. Invest around 6 hours on these websites and you should be good to go.
Python Analytics interview questions
Python Basic Question
1. Take two numbers as input from the user and find the maximum

a = int(input("Enter first number: "))   # cast to int; input() returns a string
b = int(input("Enter second number: "))
if a > b:
    print(a, "is greater than", b)
else:
    print(b, "is greater than", a)
or
a = 30
b = 40
print(a if a > b else b)
or
a = 10
b = 30
c = max(a, b)
print(c)
2. Print the factorial of any number
a = 10

def fact(n):
    if n == 1 or n == 0:
        return 1
    return n * fact(n - 1)

print(fact(a))
or
import math
print(math.factorial(5))
3. Get square of the number if the number is odd and cube if even
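No code accompanies question 3 in the original post; a minimal sketch, assuming the number is passed as an argument rather than read from the user:

```python
def square_or_cube(n):
    # square if the number is odd, cube if it is even
    return n ** 2 if n % 2 != 0 else n ** 3

print(square_or_cube(3))  # 9
print(square_or_cube(4))  # 64
```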
4. Find the sum of squares of the first n natural numbers

x = int(input("Enter a natural number = "))

def sum_of_natural(x):
    ss = 0
    for i in range(1, x + 1):   # include x itself; the original range stopped one short
        ss = ss + (i * i)
    return ss

print("Sum of squares of natural numbers =", sum_of_natural(x))
5. Find if a number is an Armstrong number.
Example – 153 is an Armstrong number as 1^3+5^3+3^3 = 153
num = int(input("Enter a number = "))
s = 0
x = num
while x > 0:
    digit = x % 10
    s = s + (digit * digit * digit)   # the asterisks were lost in the original formatting
    x = x // 10
print(s)
print("Armstrong" if s == num else "Not Armstrong")
What is NLP? NLP stands for Natural Language Processing and it is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from the text data in a smart and efficient manner.
What are the uses of NLP? Natural Language Processing is useful in various domains like Chat bots, Extracting insights from feedback and surveys, text-classification, etc.
What are the different algorithms in NLP? NLP is used to analyze text, allowing machines to understand how humans speak. This human-computer interaction enables real-world applications like:
a. automatic text summarization
b. sentiment analysis
c. topic extraction
d. named entity recognition
e. parts-of-speech tagging
f. relationship extraction
g. stemming, and more.
NLP is commonly used for text mining, machine translation, and automated question answering.
What problems can NLP solve? NLP can solve many problems, like automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.
What is Regular Expression? A regular expression (sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. “find and replace”-like operations. Regular expressions are a generalized way to match patterns with sequences of characters.
What are the different applications of Regular Expression in Data Science?
a. Search engines like Google, Yahoo, etc. The Google search engine understands that you are a tech person, so it shows you results relevant to you.
b. Social website feeds like the Facebook news feed. The news feed algorithm understands your interests using natural language processing and shows you related ads and posts more than other posts.
c. Speech engines like Apple Siri.
d. Spam filters like Google spam filters. It's not just about the usual spam filtering; spam filters now understand what's inside the email content and decide whether it's spam or not.
What are the packages in Python that help with Regular Expressions? The package we commonly use for regular expressions is re. We can import it using the following command:
import re
What is the match function? It tries to match a pattern at the beginning of a string:

import re
re.match('ni', 'nitin')

<re.Match object; span=(0, 2), match='ni'>
What are the common patterns used in regular expression?
\w+ -> word
\d -> digit
\s -> space
* -> wildcard (zero or more of the preceding token)
+ -> greedy match (one or more of the preceding token)
\S -> non-space, i.e. it matches anything which is not a space
[A-Z] -> matches any character in the range of capital A to capital Z
What are the important functions to use in Regular Expression?
findall() – finds all occurrences of the pattern in a string
search() – searches for the pattern anywhere in the string
match() – matches the pattern at the beginning of the string
split() – splits a string on the pattern and returns a list
What is the difference between the match and search functions? match() tries to match the pattern from the beginning of the string, whereas search() matches it wherever it finds the pattern. The example below will help you understand better.
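A small illustration of the difference (the sample string "nitin" follows the match example used earlier):

```python
import re

text = "nitin"
print(re.match("ti", text))          # None, because "nitin" does not start with "ti"
print(re.search("ti", text).span())  # (2, 4): search finds the pattern anywhere
```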
Guess the output of the following:

import re
re.split(r'\s', 'The Data Monk is cool')

['The', 'Data', 'Monk', 'is', 'cool']
Work out the output of the following:

regx = r"\w+"
strx = "This isn't my pen"
re.findall(regx, strx)

['This', 'isn', 't', 'my', 'pen']
How to write a regular expression to match a specific set of characters in a string?

special_char = r"[?/}{';]"

The character class [...] matches any one of the characters between the square brackets.
Write a regular expression to split a paragraph every time it finds an exclamation mark
import re
exclamation = r"[!]"
strr = "Data Science comprises of innumerable topics! The aim of this 100 Days series is to get you started assuming ! that you have no prior! knowledge of any of these topics. "
excla = re.split(exclamation, strr)
print(excla)

['Data Science comprises of innumerable topics', ' The aim of this 100 Days series is to get you started assuming ', ' that you have no prior', ' knowledge of any of these topics. ']
Get all the words starting with a capital letter
capital = r"[A-Z]\w+"
print(re.findall(capital, strr))

['Data', 'Science', 'The', 'Days']
Find the output of the following code:

digit = "12 34 98"
find_digit = r"\d+"
print(re.findall(find_digit, digit))

['12', '34', '98']
What is tokenization? Tokenization is one of the most important parts of NLP. It simply means breaking a string down into smaller chunks. It breaks a paragraph into words, sentences, etc.
What is NLTK? NLTK stands for Natural Language Toolkit Library and it is a package in Python which is very commonly used for tokenization.
from nltk.tokenize import word_tokenize
word_tokenize("This is awesome!")

['This', 'is', 'awesome', '!']
What are the important nltk tokenizers?
sent_tokenize – tokenizes a document into sentences
tweet_tokenize – exclusively for tweets; handy if you are doing sentiment analysis on a particular hashtag or set of tweets
regexp_tokenize – tokenizes a string or document based on a regular expression pattern
What is the use of the function set()? The data type set is a collection. It contains an unordered collection of unique and immutable objects. So when you extract a set of words from a novel, it will get you the distinct words from the complete novel. It is a very important function and it will keep coming up as you go ahead.
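A quick sketch of set() deduplicating tokens (the word list here is made up for illustration):

```python
tokens = ["data", "monk", "data", "science", "monk"]
distinct = set(tokens)     # keeps only the unique words; order is arbitrary
print(len(distinct))       # 3
print(sorted(distinct))    # ['data', 'monk', 'science']
```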
Tokenize the paragraph given below in sentence. para = “This is the story about Piyush,29, Senior Data Scientist at Imagine Incorporation and myself, Pihu,24, Junior Data Scientist at the same organization. This is about the journey of Piyush once he retired from his job, after being unsatisfied with the way his career was moving ahead. Be with Piyush and Pihu to understand Data Science and Machine Learning.”
para = "This is the story about Piyush,29, Senior Data Scientist at Imagine Incorporation and myself, Pihu,24, Junior Data Scientist at the same organization. This is about the journey of Piyush once he retired from his job, after being unsatisfied with the way his career was moving ahead. Be with Piyush and Pihu to understand Data Science and Machine Learning."
sent = sent_tokenize(para)
print(sent)

['This is the story about Piyush,29, Senior Data Scientist at Imagine Incorporation and myself, Pihu,24, Junior Data Scientist at the same organization.', 'This is about the journey of Piyush once he retired from his job, after being unsatisfied with the way his career was moving ahead.', 'Be with Piyush and Pihu to understand Data Science and Machine Learning.']
What are .start() and .end()? Basically, .start() and .end() help you find the starting and ending index of a search. Below is an example:
x = re.search("Piyush", para)
print(x.start(), x.end())

24 30
What is the OR method? The OR method, as the name suggests, is used to provide alternatives in a regular expression. See the example below:

x = r"\d+|\w+"

The above regex will get you all the words and numbers, but it will ignore other characters like punctuation, ampersands, etc. Note that spaces inside the pattern are matched literally, so the pipe is written without surrounding spaces.
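A quick check of the alternation pattern (the sample sentence is an assumption):

```python
import re

text = "Piyush scored 95 marks, twice!"
# \d+ matches runs of digits, \w+ matches words; punctuation is skipped
print(re.findall(r"\d+|\w+", text))  # ['Piyush', 'scored', '95', 'marks', 'twice']
```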
What are the advanced tokenization techniques? Take for example [A-Za-z]+: this will match runs of alphabetic characters, regardless of upper or lowercase.
How to write a regex to match spaces or commas? (\s+|,) – \s+ will match one or more spaces, and the pipe acts as an OR operator to take the comma into consideration.
How to include special characters in a regex? If you have any experience with regular expression or SQL queries, then this syntax will look familiar. You need to give a backward slash before any special character like below
[\,\.\?] – this character class will match a comma, full stop, or question mark in the text
What is the difference between (a-z) and [A-Z]? This is a very important concept: when you specify (a-z), it will only match the literal string "a-z". But when you specify [A-Z], it covers any single character between uppercase A and Z.
Once again, go through the difference between the search() and match() functions: search() will find your desired pattern anywhere in the string, but match() always looks from the beginning of the string. If match() fails to match at the very start, it returns None then and there. Be very particular when selecting between these two functions.
What is topic modeling? In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.
What is bag-of-words? Bag-of-words is a process to identify topics in a text. It basically counts the frequency of the token in a text. Example below to help you understand the simple concept of bag-of-words
para = "The game of cricket is complicated. Cricket is more complicated than Football"

The – 1
game – 1
of – 1
cricket – 1
is – 2
complicated – 2
Cricket – 1
more – 1
than – 1
Football – 1

As you can see, the word cricket is counted twice ('cricket' and 'Cricket') because bag-of-words is case-sensitive.
How to counter the case-sensitive nature of bag-of-words? It's a logical question: just convert every word to lower or upper case and then count the words. Look at question 35 to convert every word to lower case using a loop.
What is counter? A counter is a container that keeps count of number of times equivalent values are added. It looks similar to dictionary in Python. Counter supports three forms of initialization. Its constructor can be called with a sequence of items, a dictionary containing keys and counts, or using keyword arguments mapping string names to counts.
How to import Counter in Python? Counter is present in the Collection package, you can use it directly by importing it like below:
from collections import Counter
Use the same paragraph used above and print the top 3 most common words. The code is self-explanatory and is given below:

word2 = word_tokenize(para)
lower_case = [t.lower() for t in word2]
bag_of_words = Counter(lower_case)
print(bag_of_words.most_common(3))

[('the', 4), (',', 4), ('data', 3)]
What is text preprocessing? Text preprocessing is the complete process of making text ready for analysis by removing stop words, common punctuation, spelling mistakes, etc. Before any analysis, you are supposed to preprocess the text.
What are the commonly used methods of text preprocessing?
Converting the complete text to either lower or upper case
Tokenization
Lemmatization/Stemming
Removing stop words
How to tokenize only words from a paragraph while ignoring the numbers and other special character?
x = "Here is your text. Your 1 text is here"
only_alphabet = [w for w in word_tokenize(x.lower()) if w.isalpha()]
print(only_alphabet)
The w.isalpha() check keeps only purely alphabetic tokens, dropping the numbers and special characters.
What are stop words? Stop words are common occurring words in a text which have high frequency but less importance. Words like the, are, is, also, he, she, etc. are some of the examples of English stop words.
How to remove stop words from my text?

from nltk.corpus import stopwords
para = "Your text here. Here is your text"
tokens = [w for w in word_tokenize(para.lower()) if w.isalpha()]   # note the () after lower
stoppy = [t for t in tokens if t not in stopwords.words('english')]
What is Lemmatization? Lemmatization is a technique to reduce words to their base form, i.e. the dictionary form of the word. An example will help you understand better:

The lemma of "better" is "good". The word "walk" is the base form of the word "walking".
Give an example of Lemmatization in Python

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()   # the original snippet used lem without defining it
x = "running"
lem.lemmatize(x, "v")

Output: 'run'
How to lemmatize the texts in your paragraph? Use the module WordNetLemmatizer from nltk.stem
from nltk.stem import WordNetLemmatizer
lower_tokens = word_tokenize(para)
lower_case = [t.lower() for t in lower_tokens]
only_alphabet = [t for t in lower_case if t.isalpha()]
without_stops = [x for x in only_alphabet if x not in stopwords.words("english")]   # closing bracket and lowercase "english" fixed
lemma = WordNetLemmatizer()
lemmatized = [lemma.lemmatize(t) for t in without_stops]
What is gensim? Gensim is a very popular open-source NLP library. It is used to perform complex tasks like:
a. Building document or word vectors
b. Topic identification
What is a word vector? A word vector is a representation of a word which helps us observe relationships between words and documents. Based on how the words are used in text, word vectors help us get the meaning and context of the words. For example, word vectors will connect Bangalore to Karnataka and Patna to Bihar, where Bangalore and Patna are the capitals of the Indian states Karnataka and Bihar.
These are multi-dimensional mathematical representations of words created using deep learning methods. They give us insight into relationships between words in a corpus.
What is LDA? LDA is used for topic analysis and modeling. It is used to extract the main topics from a dataset. LDA stands for Latent Dirichlet Allocation. Topic Modelling is the task of using unsupervised learning to extract the main topics (represented as a set of words) that occur in a collection of documents.
What is a gensim corpus? A gensim corpus converts the tokens into bags of words. It gives the result as a list of (token id, token count) pairs. The gensim dictionary can be updated and reused.
What is stemming? Stemming is the process of reducing a word to its word stem by chopping off affixes (suffixes and prefixes). Stemming is important in natural language understanding (NLU) and natural language processing (NLP), and is also a part of queries and Internet search engines.
Give an example of stemming in Python

from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()
x = "running"
stem.stem(x)

Output: 'run'
What is tf-idf? Term frequency–inverse document frequency. It is used to down-weight the most common words (other than stop words) across a collection of documents, so the weighting is document-specific.
The weight will be low in two cases:
a. When the term frequency is low, i.e. the word occurs rarely in the document
b. When N (the total number of documents) is equal to df_i (the number of documents containing word i), since log(N/df_i) is then zero

So, using (b), if a word occurs in every document, its idf (and hence its weight) will be zero.
If the word "abacus" is present 5 times in a document containing 100 words, and the corpus has 200 documents with 20 documents mentioning the word "abacus", the tf-idf weight will be:

(5/100) * log(200/20)
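Working the arithmetic out in Python (assuming a base-10 logarithm; the base is a matter of convention):

```python
import math

tf = 5 / 100                 # "abacus" appears 5 times in a 100-word document
idf = math.log10(200 / 20)   # 200 documents in the corpus, 20 contain "abacus"
weight = tf * idf
print(weight)                # 0.05, since log10(10) = 1
```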
53. How to create a tf-idf model using gensim?
from gensim.models.tfidfmodel import TfidfModel

tfidf = TfidfModel(corpus)   # corpus is a gensim bag-of-words corpus
tfidf_weights = tfidf[doc]   # index the model with one bag-of-words document

# Sort the weights from highest to lowest
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)
54. What is Named Entity Recognition? It is the process of identifying important named entities in a document, e.g. organizations, people, works of art, etc. It is a subtask of information extraction. nltk provides it through the ne_chunk() and ne_chunk_sents() functions; a full example appears in question 69 below.
55. What is POS? Part of Speech tag in Natural Language Processing is used to tag a word according to its use in the sentence. It tags the word as a part of speech. It is present as pos_tag() in nltk package. You can feed the tokenized word in a loop to get the POS tag for each word like below:-
pos = [nltk.pos_tag(x) for x in tokenized_word_variable]
56. What is the difference between lemmatization and stemming? Lemmatization gets to the base of the word whereas stemming just chops the tail of the word to get the base form. Below example will serve you better:
See is the lemma of saw, but if you try to get the stem of saw, then it will return ‘s’ as the stem. See is the lemma of seeing, stemming seeing will get you see.
54. What is spacy package? Spacy is a very efficient package present in Python which helps in easy pipeline creation and finding entities in tweets and chat messages.
55. How to initiate the English module in spacy?

import spacy
x = spacy.load('en', tagger=False, parser=False, matcher=False)
56. Why should one prefer spacy over nltk for named entity recognition? Spacy provides some extra categories, other than the one provided by nltk.
These categories are:
NORP
CARDINAL
MONEY
WORK_OF_ART
LANGUAGE
EVENT
So, you can try spacy for NER according to your need
57. What are the different packages which uses word vectors? Spacy and gensim are the two packages which we have covered so far that uses word vectors.
58. What if your text is in various different languages? Which package can help you with Named Entity Recognition for most of the widely spoken languages? Polyglot is one of the packages which supports more than 100 languages and uses word vectors for Named Entity Recognition.
59.What is supervised learning? Supervised learning is a form of Machine Learning where your model is trained by looking at a given output for all the inputs. The model is trained on this input-output combination and then the learning of the model is tested on the test dataset. Linear Regression and Classification are two examples of supervised learning.
60. How can you use Supervised Learning in NLP? Suppose you have chat data and, by looking at keywords, you have labeled the sentiment of each customer. Now you have a dataset with the complete chat and the sentiment associated with it. You can use supervised learning to train a model on this dataset and then use it during a live chat to identify the ongoing sentiment of the customer.
61. What is Naïve-Bayes model? Naive Bayes classifiers are linear classifiers that are known for being simple yet very efficient. The probabilistic model of naive Bayes classifiers is based on Bayes’ theorem, and the adjective naive comes from the assumption that the features in a dataset are mutually independent.
62. What is the flow of creating a Naïve Bayes model?

from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
# (count_train, count_test, y_train, y_test are assumed to be prepared beforehand)
nb_classifier.fit(count_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)
Let’s take some sample text and try to implement basic algorithms first
63. What is POS? POS stands for Parts of Speech tagging and it is used to tag the words in your document according to Parts of Speech. So, noun, pronoun, verb, etc. will be tagged accordingly and then you can filter what you need from the dataset. If I am just looking for names of people mentioned in the comment box then I will look for mainly Nouns. This is a basic but very important algorithm to work with.
64. Take a sentence and break it into tokens, i.e. each word

text = "The Data Monk will help you learn and understand Data Science"
tokens = word_tokenize(text)
print(tokens)

['The', 'Data', 'Monk', 'will', 'help', 'you', 'learn', 'and', 'understand', 'Data', 'Science']
65. Take the same sentence and get the POS tags
from nltk import word_tokenize, pos_tag
text = "The Data Monk will help you learn and understand Data Science"
tokens = word_tokenize(text)
print(pos_tag(tokens))
66. Take the following line, break it into tokens, and tag the POS using a function
data = "The Data Monk was started in Bangalore in 2018. Till now it has more than 30 books on Data Science on Amazon"
# Tokenize the words and apply POS tags
def token_POS(token):
    token = nltk.word_tokenize(token)
    token = nltk.pos_tag(token)
    return token

token = token_POS(data)
Output: a list of (word, POS tag) tuples.
67. What is NER? NER stands for Named Entity Recognition, and the job of this algorithm is to extract specific chunks of data from your text. Suppose you want to get all the nouns from the dataset. It is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, etc.
68. What are some of the common tags in POS? You need to know the meaning of the tags to use them in your regular expressions.
DT – Determiner
FW – Foreign word
JJ – Adjective
JJR – Comparative adjective
NN – Singular noun
NNS – Plural noun
RB – Adverb
RBS – Superlative adverb
VB – Verb
You can get the complete list on the internet.
69. Implement NER on the tokenized and POS-tagged sentence used above.

nltk.download('maxent_ne_chunker')
nltk.download('words')
ne_chunked_sents = nltk.ne_chunk(token)
named_entities = []
for tagged_tree in ne_chunked_sents:
    if hasattr(tagged_tree, 'label'):
        entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
        entity_type = tagged_tree.label()   # get the NE category
        named_entities.append((entity_name, entity_type))
print(named_entities)
Code explanation: nltk.download('maxent_ne_chunker') fetches the chunker used to break the sentence into named-entity chunks, and nltk.download('words') downloads the dictionary.
We already have a variable token which contains POS tagged tokens. nltk.ne_chunk(token) will tag the tokens to Named entity chunks.
The function hasattr() is used to check whether an object has the given named attribute; it returns True if present, else False.
The .leaves() function is used to get the leaves of the node, and label() gets you the NER label.
70. What are n-grams? A combination of N words together is called an N-gram. N-grams (N > 1) are generally more informative than single words (unigrams) as features, and bigrams (N = 2) are often considered the most important features of all. The question below generates a 3-gram of a text.
71. Create a 3-gram of the sentence below: "The Data Monk was started in Bangalore in 2018"
def ngrams(text, n):
    token = text.split()
    final = []
    for i in range(len(token) - n + 1):
        final.append(token[i:i + n])
    return final

ngrams("The Data Monk was started in Bangalore in 2018", 3)
Output:
[['The', 'Data', 'Monk'], ['Data', 'Monk', 'was'], ['Monk', 'was', 'started'], ['was', 'started', 'in'], ['started', 'in', 'Bangalore'], ['in', 'Bangalore', 'in'], ['Bangalore', 'in', '2018']]
72. What is the right order of components for a text classification model?
1. Text cleaning
2. Text annotation
3. Text to predictors
4. Gradient descent
5. Model tuning
73. What is CountVectorizer? CountVectorizer is a class from sklearn.feature_extraction.text. It converts a collection of text documents to a matrix of token counts.
———————————————————
Let's take up a project and try to solve it using NLP. We will create a small dataset and apply Random Forest and NLP to train a model on it to identify the sentiment of a review.
The objective of the project is to predict the correct tag, i.e. whether people liked the food or not, using NLP and Random Forest.
74. How to create the dataset? What to write in it? Open an Excel file and save it as Reviews (in the csv format). Now make two columns in the sheet like the ones given below:
Review | Liked
This restaurant is awesome | 1
Food not good | 0
Ambience was wow | 1
The menu is good | 1
Base was not good | 0
Very bad | 0
Wasted all the food | 0
Delicious | 1
Great atmosphere | 1
Not impressed with the food | 0
Nice | 1
Bad taste | 0
Great presentation | 1
Lovely flavor | 1
Polite staff | 1
Bad management | 0
Basically, you can write reviews of anything, like movies, food, restaurants, etc. Just make sure to keep this format. Your dataset is now ready.
75. What all packages do I need to import for this project? It's always good to start by importing all the necessary packages which you might use in the project.
import re
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
We will discuss each of these as we tackle the problem
76. How to import a csv file in Python? Importing a csv file in Python requires importing the pandas library and using its read_csv function.
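A minimal sketch of the idea. The file path in the project is hypothetical, so here an in-memory string stands in for the csv file; in the real project you would pass the path to Reviews.csv instead:

```python
import pandas as pd
from io import StringIO

# stand-in for pd.read_csv('Reviews.csv'); same format as our dataset
csv_data = StringIO("Review,Liked\nThis restaurant is awesome,1\nFood not good,0\n")

review = pd.read_csv(csv_data)
print(review.shape)             # (rows, columns)
print(review.columns.tolist())  # column names parsed from the header row
```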
77. Let's view the top and bottom 5 lines of the file to make sure we are good to go with the analysis. Use the commands given below:
review.head()
review.tail()
78. Now we will clean the dataset. We will start by removing numbers and punctuation. Write a regular expression for removing special characters and numbers.
Here review is the name of the dataset and Review is the name of the column.
final = []
for i in range(0, 16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
79. What is the sub() method? The re.sub() function in the re module can be used to replace substrings. The syntax is re.sub(pattern, repl, string), which replaces the matches of pattern in string with repl.
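For example, the pattern used in the cleaning step above keeps only letters and turns everything else into spaces:

```python
import re

# every character that is NOT a-z or A-Z (digits, punctuation, even
# existing spaces) is replaced by a single space
cleaned = re.sub('[^a-zA-Z]', ' ', 'Food was 5/5!')
print(cleaned)  # 'Food was' followed by spaces where 5/5! used to be
```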
80. Convert all the text into lower case and split the words

final = []
for i in range(0, 16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
    x = x.lower()
    x = x.split()
81. Now we want to stem the words. Do you remember the definition of stemming? Stemming is the process of reducing a word to its word stem, i.e. the base form that remains after stripping suffixes and prefixes (for example, "waited" and "waiting" both reduce to "wait"). Stemming is important in natural language understanding (NLU) and natural language processing (NLP), and it is also used in query processing by Internet search engines.
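The project uses NLTK's PorterStemmer (next snippet), but the basic idea can be illustrated with a toy suffix-stripper in plain Python. This is NOT the Porter algorithm, just a naive sketch:

```python
def toy_stem(word):
    # naive suffix stripping, for illustration only; the real Porter
    # algorithm applies ordered rewrite rules with extra conditions
    for suffix in ('ing', 'ed', 'es', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ['wasted', 'flavors', 'impressed']])
```

Even this toy version maps different surface forms onto a shared stem, which is what lets the model treat "waste" and "wasted" as the same feature.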
final = []
for i in range(0, 16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
    x = x.lower()
    x = x.split()
    port = PorterStemmer()
    x = [port.stem(words) for words in x if not words in set(stopwords.words('english'))]
82. What does the above snippet do? port = PorterStemmer() assigns the stemming function to the variable port. port.stem(words) for words in x takes each word individually and stems it, while the if condition drops the words that are stopwords.

x = [port.stem(words) for words in x if not words in set(stopwords.words('english'))]

The list comprehension above keeps all the non-stopwords and stems them.
83. Create the final dataset with only stemmed words.

final = []
for i in range(0, 16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
    x = x.lower()
    x = x.split()
    port = PorterStemmer()
    x = [port.stem(words) for words in x if not words in set(stopwords.words('english'))]
    x = ' '.join(x)
    final.append(x)
Let's see what the final dataset looks like after removing the stop words and stemming the text.
84. How to use the CountVectorizer() function? Explain using an example.

from sklearn.feature_extraction.text import CountVectorizer
corpus = ['The Data Monk helps in providing resource to the users',
          'It is useful for people making a career in Data Science',
          'You can also take the 100 days Challenge of TDM']
counter = CountVectorizer()
X = counter.fit_transform(corpus)
print(counter.get_feature_names())  # get_feature_names_out() in newer scikit-learn
print(X.toarray())
get_feature_names() takes all the words from the above corpus and arranges them in alphabetical order. fit_transform() transforms each line of the corpus against that vocabulary, and toarray() changes the result to an array.
The first output is the 26 unique words from the 3 lines of the document, arranged in alphabetical order. The next three rows contain the counts of those words in each document. The 0s in the 1st, 2nd, 3rd, and 4th places of the first row indicate that the words 100, also, can, and career are not present in the first line of the input. Similarly, the 2 in the 22nd position shows that the word "the" is present twice in the first row of input, which is "The Data Monk helps in providing resource to the users".
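Conceptually, CountVectorizer is just a sorted vocabulary plus per-document counts. A rough pure-Python sketch of the same idea (ignoring CountVectorizer's real tokenizer, which among other things drops single-character tokens):

```python
from collections import Counter

corpus = ['The Data Monk helps in providing resource to the users',
          'It is useful for people making a career in Data Science',
          'You can also take the 100 days Challenge of TDM']

# sorted vocabulary over lower-cased whitespace tokens
vocab = sorted({word for line in corpus for word in line.lower().split()})

# one count row per document, in vocabulary order
rows = [[Counter(line.lower().split())[word] for word in vocab]
        for line in corpus]

print(vocab)
print(rows[0])  # counts for the first document
```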
85. Now let's apply CountVectorizer on our dataset

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(final).toarray()

max_features = 1500 makes sure that at most 1500 words are put into the master array. In case you are planning to apply this on a huge dataset, do increase the max_features parameter. X will have the same kind of array of occurrences across all the features as we have seen in the above example.
86. How to separate the dependent variable? As we know, we want to see whether the review was positive or not. So the dependent variable here is the second column, and we put its values in a different variable, y.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(final).toarray()
y = review.iloc[:, 1].values
So, X has the array of occurrences of the different words across all the documents, and y has the binary value where 1 denotes liked and 0 denotes did not like.
87. Now we need to split the complete data set into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
You already know about X and y; test_size = 0.25 divides the data into train and test sets in a 75:25 ratio. Now you will train the model on X_train and y_train.
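What train_test_split does can be mimicked in plain Python: shuffle the row indices and cut at 75%. This is only a toy sketch of the idea, not sklearn's implementation:

```python
import random

random.seed(0)                 # fixed seed so the split is reproducible
indices = list(range(16))      # our 16 reviews
random.shuffle(indices)

cut = int(len(indices) * 0.75)  # 75:25 split, like test_size = 0.25
train_idx, test_idx = indices[:cut], indices[cut:]

print(len(train_idx), len(test_idx))  # 12 4
```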
88. Random forest is one of the best models for supervised learning. By the way, what is a Random forest? Before we can explain a forest, we need to know what a tree is: a random forest is made of decision trees. To illustrate the concept, we'll use an everyday example: predicting tomorrow's maximum temperature for our city. To keep things concrete, I'll use Seattle, Washington, but feel free to pick your own city. In order to answer the single max-temperature question, we actually need to work through an entire series of queries. We start by forming an initial reasonable range given our domain knowledge, which for this problem might be 30–70 degrees (Fahrenheit) if we do not know the time of year before we begin. Gradually, through a set of questions and answers, we reduce this range until we are confident enough to make a single prediction.
Since temperature is highly dependent on time of year, a decent place to start would be: what is the season? In this case, the season is winter, and so we can limit the prediction range to 30–50 degrees because we have an idea of what the general max temperatures are in the Pacific Northwest during the winter. This first question was a great choice because it has already cut our range in half. If we had asked something non-relevant, such as the day of the week, then we could not have reduced the extent of predictions at all and we would be back where we started. Nonetheless, this single question isn’t quite enough to narrow down our estimate so we need to find out more information. A good follow-up question is: what is the historical average max temperature on this day? For Seattle on December 27, the answer is 46 degrees. This allows us to further restrict our range of consideration to 40–50 degrees. Again, this was a high-value question because it greatly reduced the scope of our estimate.
We need to ask similar questions, and once we put everything into a flow we get a decision tree. So, to arrive at an estimate, we used a series of questions, with each question narrowing our possible values until we were confident enough to make a single prediction. We repeat this decision process over and over again in our daily lives, with only the questions and answers changing.
89. What is Random Forest? Every person comes to the problem with different background knowledge and may interpret the exact same answer to a question entirely differently. In technical terms, the predictions have variance because they will be widely spread around the right answer. Now, what if we take predictions from hundreds or thousands of individuals, some of which are high and some of which are low, and decided to average them together? Well, congratulations, we have created a random forest! The fundamental idea behind a random forest is to combine many decision trees into a single model.
You can read a lot on Medium.com for explanations of Decision Trees and Random Forests in layman's terms.
90. Let's create our Random forest model here

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators = 10, criterion = 'entropy')
model.fit(X_train, y_train)
91. Define n_estimators
n_estimators is the number of trees you want to create in your forest. Try varying the number of trees. In general, the more trees you use, the better the results. However, the improvement decreases as the number of trees increases; at a certain point the benefit in prediction performance from learning more trees will be lower than the cost in computation time of learning those additional trees.
Random forests are ensemble methods: you average over many trees. Similarly, if you want to estimate the average of a real-valued random variable (e.g. the average height of a citizen in your country) you can take a sample. The standard deviation of the sample mean decreases as the square root of the sample size, and at a certain point the cost of collecting a larger sample will be higher than the benefit in accuracy obtained from it.
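The variance-reduction effect of averaging can be demonstrated with a toy simulation (this is only an illustration of the statistical idea, not a random forest): each "tree" makes a noisy guess around the true temperature 46, and the average of many guesses lands much closer to 46 than a single guess typically does.

```python
import random

random.seed(42)  # fixed seed for reproducibility

# one noisy estimator: the true value 46 plus uniform noise in [-5, 5]
def noisy_estimate():
    return 46 + random.uniform(-5, 5)

# averaging 500 independent noisy estimates shrinks the spread around 46
avg_of_500 = sum(noisy_estimate() for _ in range(500)) / 500
print(round(avg_of_500, 2))
```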
92. Define criterion. Why did you use entropy and not gini?
Both Gini impurity and entropy measure how mixed the classes are at a node, and in practice they usually produce very similar trees. Gini directly targets minimizing misclassification and is a little faster to compute because it avoids the logarithm; entropy comes from information theory and is sometimes preferred for exploratory analysis. Either choice is acceptable here.
93. What is model.fit()? model.fit() trains your model. The two parameters are the training dataset, i.e. X_train and y_train. It takes the reviews and their outputs and builds many decision trees to fit the output on the basis of the input. The learned rules are then applied to your testing dataset to get the results.
94. Let's predict the output for the testing dataset

y_pred = model.predict(X_test)

You have just created the model on X_train and y_train. Now you need to predict the output for X_test. We already have the true outputs for these rows, but we want our model to predict them so that we can compare the answers.
95. Now let's check the confusion matrix to see how many of our outputs were correct

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
96. Lastly, what is a confusion matrix and how to know the accuracy of the model? A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.
Let’s take an example of a confusion matrix
So, our rows contain the real values for a binary classifier and the columns have our predicted values. The 50 and 100 show the cases where predicted and actual values match. The 10 and 5 show the cases where the predicted values were wrong. Do explore precision, recall, etc. as well.
As far as accuracy is concerned, the formula is simple: (50+100)/(50+10+5+100), i.e. the total correct predictions divided by all predictions.
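The accuracy calculation above can be written out directly, using the same example matrix (rows = actual, columns = predicted):

```python
# confusion matrix from the example above
cm = [[50, 10],
      [5, 100]]

correct = cm[0][0] + cm[1][1]            # diagonal = correct predictions
total = sum(sum(row) for row in cm)      # every prediction made
accuracy = correct / total
print(accuracy)                          # 150/165, roughly 0.909
```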
Our model had a very small dataset. The confusion matrix came out as follows.
Therefore accuracy = (1+3)/(1+0+0+3) = 100% accuracy. Yeahhhh.. we are perfect
Complete code
import re
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

review = pd.read_csv('C://Users//User//Downloads//Restaurant_Reviews.csv')
review.tail()

final = []
for i in range(0, 16):
    x = re.sub('[^a-zA-Z]', ' ', review['Review'][i])
    x = x.lower()
    x = x.split()
    port = PorterStemmer()
    x = [port.stem(words) for words in x if not words in set(stopwords.words('english'))]
    x = ' '.join(x)
    final.append(x)

cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(final).toarray()
y = review.iloc[:, 1].values
print(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

model = RandomForestClassifier(n_estimators = 501, criterion = 'entropy')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred

cm = confusion_matrix(y_test, y_pred)
cm
Damn !! I got out in the nervous 90’s 😛
This is all you need to hop onto a real-life problem or a hackathon. Do comment if you find any flaw in the code.