BOX8 | What is bag-of-words?

With example

Machine Learning · TheDataMonk (Grand Master) · 3 Answers · 1132 views

Answers ( 3 )

Ramya Mamidipaka · Answer 1 · June 21, 2020

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
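
To make the "multiset" idea concrete, here is a minimal sketch using Python's `collections.Counter`. The helper name `bag_of_words` and the two sample sentences are made up for illustration, and the tokenizer is just a lowercase whitespace split. It shows that two texts with the same words in a different order get the same bag, while repeated words keep their counts.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())


# Same words, different order (illustrative sentences).
a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")

print(a == b)    # word order is discarded, so the bags are equal
print(a["the"])  # multiplicity is kept: "the" occurs twice
```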

Limitations of BOW
Semantic meaning: The basic BOW approach does not consider the meaning of a word in the document. It completely ignores the context in which the word is used, even though the same word can mean different things depending on its context or nearby words.

Vector size: For a large document, the vocabulary (and therefore the vector) can be huge, which costs a lot of computation and time. You may need to drop words that are not relevant to your use case.
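
One possible way to keep the vector small is to drop stop words and keep only the most frequent remaining words. The sketch below is an assumption about how this could be done, not a prescribed method: the stop-word set, the sample documents, and the helper names `top_k_vocabulary` and `vectorize` are all hypothetical.

```python
from collections import Counter

# Illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "was"}


def top_k_vocabulary(documents, k, stop_words=STOP_WORDS):
    """Return the k most frequent non-stop words across all documents."""
    counts = Counter()
    for doc in documents:
        for token in doc.lower().split():
            if token not in stop_words:
                counts[token] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts, key=lambda w: (-counts[w], w))[:k]


def vectorize(doc, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]


documents = [
    "the movie was good",
    "the movie was bad",
    "the acting was good",
]

vocab = top_k_vocabulary(documents, k=2)
vectors = [vectorize(doc, vocab) for doc in documents]
print(vocab)
print(vectors)
```

With `k=2` every document becomes a vector of length 2, regardless of how many distinct words the corpus contains. The trade-off is that words outside the kept vocabulary (here "bad" and "acting") are no longer represented.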

Consider the below two sentences.

1. “John likes to watch movies. Mary likes movies too.”
2. “John also likes to watch football games.”
These two sentences can also be represented as collections of words (with punctuation removed).

1. [‘John’, ‘likes’, ‘to’, ‘watch’, ‘movies’, ‘Mary’, ‘likes’, ‘movies’, ‘too’]
2. [‘John’, ‘also’, ‘likes’, ‘to’, ‘watch’, ‘football’, ‘games’]

The vocabulary is built from all the words across both sentences. Together with each word's count, it is used to create the vector for each sentence.

Counting every word across both sentences, the final BOW is:
{“John”: 2, “likes”: 3, “to”: 2, “watch”: 2, “movies”: 2, “Mary”: 1, “too”: 1, “also”: 1, “football”: 1, “games”: 1}
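
The counts above can be reproduced with a short sketch. The regex tokenizer (letters only, case preserved) and the variable names are assumptions chosen to match this example. The sketch also builds the per-sentence vectors over the shared vocabulary.

```python
import re
from collections import Counter


def tokenize(text):
    """Keep runs of letters only, so 'movies.' becomes 'movies'."""
    return re.findall(r"[A-Za-z]+", text)


s1 = "John likes to watch movies. Mary likes movies too."
s2 = "John also likes to watch football games."

tokens1 = tokenize(s1)
tokens2 = tokenize(s2)

# Combined counts over both sentences.
total = Counter(tokens1 + tokens2)

# Vocabulary in order of first appearance.
vocab = list(dict.fromkeys(tokens1 + tokens2))

# One count vector per sentence, aligned with the vocabulary.
counts1, counts2 = Counter(tokens1), Counter(tokens2)
v1 = [counts1[w] for w in vocab]
v2 = [counts2[w] for w in vocab]

print(dict(total))
print(vocab)
print(v1)  # [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
print(v2)  # [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```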

swap007 Grand Master · Answer 2 · June 21, 2020

The “Bag of Words” approach is used in NLP to represent a particular document or a corpus of text.
Example:
He lives in Mumbai. He is in Pune.
Bag of words will represent the above document as
He - 2
lives - 1
in - 2
Mumbai - 1
is - 1
Pune - 1
This approach is used to find the similarity between documents by representing every document as a vector and applying a measure such as cosine similarity.
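
As a rough sketch of the similarity idea, the example sentences can be treated as two separate documents (an assumption made here for illustration; the answer above counts them as one). The helper names `tokenize`, `bow_vector`, and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())


def bow_vector(text, vocabulary):
    """Count vector of the text, aligned with the given vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)


doc1 = "He lives in Mumbai."
doc2 = "He is in Pune."

# Shared vocabulary: ['he', 'in', 'is', 'lives', 'mumbai', 'pune']
vocabulary = sorted(set(tokenize(doc1)) | set(tokenize(doc2)))

u = bow_vector(doc1, vocabulary)  # [1, 1, 0, 1, 1, 0]
v = bow_vector(doc2, vocabulary)  # [1, 1, 1, 0, 0, 1]

similarity = cosine_similarity(u, v)
print(similarity)  # 0.5
```

The two documents share 2 of their 4 words each ("he" and "in"), so the dot product is 2 and each vector has length 2, giving 2 / (2 × 2) = 0.5.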

alokgarg4 Newbie · Answer 3 · June 22, 2020

The “Bag of Words” approach is used in NLP to convert text into a numerical vector.
Suppose you have 4 reviews (r1, r2, r3, r4); each review is called a “document”.
The collection of all the documents is the corpus.
From the corpus, select the unique words and make a dictionary where the words are the keys and each value is the number of times that word occurs in the corpus.
E.g., taking just two reviews:
Documents:
r1: The datamonk is very helpful.
r2: The datamonk has a good collection of questions.
Corpus:
[“The datamonk is very helpful”,
“The datamonk has a good collection of questions”]
Unique Words (lowercased and sorted; the one-letter word “a” is left out):
[‘collection’, ‘datamonk’, ‘good’, ‘has’, ‘helpful’, ‘is’, ‘of’, ‘questions’, ‘the’, ‘very’]

Now the “Unique Words” will be used as “features”.
Based on how often each unique word occurs in a review, you can create a vector for that review.
v1: [0 1 0 0 1 1 0 0 1 1]
v2: [1 1 1 1 0 0 1 1 1 0]

Basically, BoW can be thought of as counting how many times each unique word occurs.
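
The steps above can be sketched in plain Python. The tokenizer is an assumption chosen to reproduce the word list in this answer: it lowercases the text and keeps only tokens of two or more word characters, which is why “a” drops out. The helper names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase, then keep tokens of at least two word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


corpus = [
    "The datamonk is very helpful.",
    "The datamonk has a good collection of questions.",
]

# Unique words across the corpus, sorted alphabetically; these are the features.
features = sorted({token for doc in corpus for token in tokenize(doc)})


def to_vector(doc):
    """Count each feature word in the document."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in features]


v1, v2 = (to_vector(doc) for doc in corpus)

print(features)
print(v1)  # [0, 1, 0, 0, 1, 1, 0, 0, 1, 1]
print(v2)  # [1, 1, 1, 1, 0, 0, 1, 1, 1, 0]
```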
