I am the Co-Founder of The Data Monk. I have a total of 6+ years of analytics experience
3+ years at Mu Sigma
2 years at OYO
1 year and counting at The Data Monk
I am an active trader and a logically sarcastic idiot :)
Answers ( 3 )
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
Limitations of BOW
Semantic meaning: The basic BOW approach does not consider the meaning of a word in the document. It completely ignores the context in which the word is used, even though the same word can mean different things depending on the context or nearby words.
Vector size: For a large document, the vector can be huge, which costs a lot of computation and time. You may need to drop words that are not relevant to your use case.
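To make the first limitation concrete, here is a minimal sketch in Python. The helper name bag_of_words and the two example sentences are made up for illustration, and the tokenizer is just a lowercase whitespace split. It shows two sentences with opposite meanings ending up with exactly the same bag.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())


# Two sentences with opposite meanings but exactly the same words.
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a == b)  # True: the bags are identical, so the difference in meaning is lost
```

Since only the counts are stored, any model built on these bags cannot tell the two sentences apart.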
Consider the following two sentences.
1. “John likes to watch movies. Mary likes movies too.”
2. “John also likes to watch football games.”
These two sentences can also be represented as collections of words.
1. [‘John’, ‘likes’, ‘to’, ‘watch’, ‘movies’, ‘Mary’, ‘likes’, ‘movies’, ‘too’]
2. [‘John’, ‘also’, ‘likes’, ‘to’, ‘watch’, ‘football’, ‘games’]
The vocabulary built from all the words in the two sentences, together with each word's count, is used to create the vector for each sentence.
Now the final BOW is:
{“John”: 2, “likes”: 3, “to”: 2, “watch”: 2, “movies”: 2, “Mary”: 1, “too”: 1, “also”: 1, “football”: 1, “games”: 1}
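The counts above can be reproduced with a few lines of Python. This is only a sketch: the tokenize helper is a hypothetical name, and it assumes that punctuation is dropped and that case is kept (so “John” stays capitalised).

```python
import re
from collections import Counter

sentences = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]


def tokenize(text):
    """Return the words in text, dropping punctuation (case is kept)."""
    return re.findall(r"[A-Za-z]+", text)


# Accumulate word counts over both sentences.
bow = Counter()
for sentence in sentences:
    bow.update(tokenize(sentence))

print(dict(bow))
```

A different tokenizer (for example, one that lowercases everything or removes stop words) would give a different vocabulary, so this choice is worth making explicitly.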
The “Bag of Words” approach is used in NLP to represent a particular document or a corpus of text.
Example:
He lives in Mumbai. He is in Pune.
Bag of words will represent the above document as:
He - 2
lives - 1
in - 2
Mumbai - 1
is - 1
Pune - 1
This approach is used to find the similarity between documents: every document is represented as a vector of word counts, and the vectors are compared with a measure such as cosine similarity.
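As a rough illustration of the similarity step, the sketch below computes cosine similarity directly on word-count bags. The function name cosine_similarity and the three short documents are assumptions made for this example; the documents are already lowercased and free of punctuation to keep the tokenizing trivial.

```python
import math
from collections import Counter


def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two word-count vectors stored as Counters."""
    # Only words present in both bags contribute to the dot product.
    dot = sum(bag_a[word] * bag_b[word] for word in bag_a.keys() & bag_b.keys())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = Counter("he lives in mumbai".split())
doc2 = Counter("he is in pune".split())
doc3 = Counter("john likes movies".split())

print(round(cosine_similarity(doc1, doc2), 2))  # 0.5 (shares "he" and "in")
print(cosine_similarity(doc1, doc3))  # 0.0 (no words in common)
```

A score of 1 means the two bags point in the same direction, and 0 means the documents share no words at all.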
The “Bag of Words” approach is used in NLP to convert text into a numerical vector.
Suppose you have 4 reviews (r1, r2, r3, r4). Each review is called a “document”.
The collection of all the documents is a corpus.
From the corpus, select the unique words and make a dictionary where the words are the keys and each value is the number of occurrences of that word in the corpus.
E.g., taking the first two reviews:
Documents:
r1: The datamonk is very helpful.
r2: The datamonk has a good collection of questions.
Corpus:
[“The datamonk is very helpful”,
“The datamonk has a good collection of questions”]
Unique Words (lowercased and sorted alphabetically; the one-letter word “a” is left out):
[‘collection’, ‘datamonk’, ‘good’, ‘has’, ‘helpful’, ‘is’, ‘of’, ‘questions’, ‘the’, ‘very’]
The “Unique Words” are now used as the “features”.
Based on how often each feature occurs in a review, you can create a vector for each review.
v1: [0 1 0 0 1 1 0 0 1 1]
v2: [1 1 1 1 0 0 1 1 1 0]
Basically, BoW can be thought of as counting how often each unique word occurs.
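The review example can be sketched end to end in plain Python. The tokenize helper is a hypothetical name, and keeping only words of two or more letters is an assumption made here so that the output matches the unique-word list above (scikit-learn's CountVectorizer drops one-letter tokens by default in a similar way).

```python
import re

corpus = [
    "The datamonk is very helpful.",
    "The datamonk has a good collection of questions.",
]


def tokenize(text):
    """Lowercase and keep words of two or more letters (so "a" is dropped)."""
    return re.findall(r"[a-z]{2,}", text.lower())


tokenized = [tokenize(doc) for doc in corpus]

# The sorted set of unique words acts as the feature list.
vocabulary = sorted({word for doc in tokenized for word in doc})

# One count per feature, in vocabulary order, for every review.
vectors = [[doc.count(word) for word in vocabulary] for doc in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each position in a vector corresponds to one word of the vocabulary, so all reviews get vectors of the same length no matter how long the review is.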