TF-IDF and Word correlation

What is tf-idf?
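TF-IDF stands for term frequency-inverse document frequency. It scores a word by how often it occurs in a particular document, scaled down by how many documents in the corpus contain that word. Words that are common across the whole corpus, beyond the usual stop words, end up with low weights, so the score is document specific.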

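The weight of term i in document j is tf(i,j) * log(N / df_i), where tf(i,j) is the frequency of term i in document j, N is the total number of documents in the corpus, and df_i is the number of documents that contain term i.

The weight will be low in two cases:
a. When the term frequency is low, i.e. the word occurs only a few times in the document
b. When df_i is close to N, because log(N / df_i) is then close to zero

So, using (b), if a word occurs in every document, df_i equals N, the log is exactly zero, and the weight is zero no matter how often the word appears.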
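
The two cases above can be checked with a small sketch. It assumes the standard weighting tf * log(N / df), which matches the worked example in this post. The function name `tf_idf` and the toy corpus are hypothetical, made up only for illustration.

```python
import math

def tf_idf(term, doc, corpus):
    """Return tf * log(N / df) for `term` in `doc`; `corpus` is a list of token lists."""
    tf = doc.count(term) / len(doc)            # term frequency within this document
    df = sum(1 for d in corpus if term in d)   # number of documents containing the term
    if df == 0:
        return 0.0                             # term does not occur anywhere in the corpus
    return tf * math.log(len(corpus) / df)

# Hypothetical toy corpus of three tokenized documents
corpus = [
    ["data", "science", "is", "fun"],
    ["data", "monk", "teaches", "data"],
    ["data", "helps", "business"],
]

in_all_docs = tf_idf("data", corpus[1], corpus)  # df == N, so log(N / df) == log(1) == 0
rare_word = tf_idf("monk", corpus[1], corpus)    # df == 1, so the idf factor is log(3)
print(in_all_docs, rare_word)
```

Although "data" is the most frequent word in the second document, it gets a weight of zero because it appears in every document, while the rarer "monk" gets a positive weight.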

Suppose the word "abacus" appears 5 times in a document containing 100 words, and the corpus has 200 documents, 20 of which mention "abacus". The tf-idf of "abacus" in that document is:

(5/100)*log(200/20)
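
The numeric value of this expression depends on the base of the logarithm, which the post does not state. A quick check, assuming either base 10 or the natural log (the variable names are illustrative only):

```python
import math

tf = 5 / 100          # "abacus" occurs 5 times in a 100-word document
idf_ratio = 200 / 20  # 200 documents in the corpus, 20 of them mention "abacus"

score_log10 = tf * math.log10(idf_ratio)  # assuming a base-10 logarithm
score_ln = tf * math.log(idf_ratio)       # assuming the natural logarithm

print(round(score_log10, 4))  # 0.05
print(round(score_ln, 4))     # 0.1151
```

Either base gives the same ranking of terms, since changing the base only rescales every score by a constant factor.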

As an example, take a sentence and break it into tokens, i.e. individual words.
# word_tokenize needs NLTK's Punkt tokenizer data, downloaded once via nltk.download()
from nltk import word_tokenize

text = "The Data Monk will help you learn and understand Data Science"

tokens = word_tokenize(text)
print(tokens)

['The', 'Data', 'Monk', 'will', 'help', 'you', 'learn', 'and', 'understand', 'Data', 'Science']

Take the same sentence and get the part-of-speech (POS) tag of each token.
from nltk import word_tokenize, pos_tag

text = "The Data Monk will help you learn and understand Data Science"
tokens = word_tokenize(text)
print(pos_tag(tokens))

[('The', 'DT'), ('Data', 'NNP'), ('Monk', 'NNP'), ('will', 'MD'), ('help', 'VB'), ('you', 'PRP'), ('learn', 'VB'), ('and', 'CC'), ('understand', 'VB'), ('Data', 'NNP'), ('Science', 'NN')]

Take the following text, break it into tokens, and tag the POS of each token using a function.
import nltk

data = "The Data Monk was started in Bangalore in 2018. Till now it has more than 30 books on Data Science on Amazon"

# Tokenize the text and apply POS tags
def token_POS(text):
    tokens = nltk.word_tokenize(text)
    return nltk.pos_tag(tokens)

token = token_POS(data)
print(token)

Output

