Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens.
Natural language processing is used for building applications such as text classification, intelligent chatbots, sentiment analysis, and language translation. To achieve these, it is vital to understand the patterns in the text. Tokens are very useful for finding such patterns, and tokenization is also considered a base step for stemming and lemmatization.
Some NLTK tokenizers: TweetTokenizer, MWETokenizer, sent_tokenize, word_tokenize.
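A minimal sketch of a few of these in action, assuming NLTK is installed and the punkt sentence models have been downloaded (the sample text and the multi-word expression are made up for illustration):

```python
# Sentence, word, and multi-word-expression tokenization with NLTK.
import nltk
nltk.download("punkt")  # one-time download of the sentence tokenizer models

from nltk.tokenize import sent_tokenize, word_tokenize, MWETokenizer

text = "Tokenization splits text into tokens. It is a base step for stemming."

print(sent_tokenize(text))
# ['Tokenization splits text into tokens.', 'It is a base step for stemming.']

print(word_tokenize(text))
# ['Tokenization', 'splits', 'text', 'into', 'tokens', '.', 'It', 'is', 'a',
#  'base', 'step', 'for', 'stemming', '.']

# MWETokenizer merges chosen multi-word expressions back into single tokens.
mwe = MWETokenizer([("data", "monk")], separator="_")
print(mwe.tokenize("the data monk teaches nlp".split()))
# ['the', 'data_monk', 'teaches', 'nlp']
```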
Tokenization is basically splitting a string into different parts (tokens) based upon a particular delimiter. Each word is a token when a sentence is tokenized into words, and each sentence is a token when a paragraph is tokenized into sentences.
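Before reaching for a library, the delimiter idea can be shown with plain Python string splitting; a minimal sketch (the sample paragraph is made up):

```python
# Naive delimiter-based tokenization with plain Python string splitting.
paragraph = "Tokenization splits text. Each sentence becomes a token."

# Word tokens: split the string on the space delimiter.
words = paragraph.split(" ")
print(words)
# ['Tokenization', 'splits', 'text.', 'Each', 'sentence', 'becomes', 'a', 'token.']

# Sentence tokens: split the paragraph on the full stop.
sentences = [s.strip() for s in paragraph.split(".") if s.strip()]
print(sentences)
# ['Tokenization splits text', 'Each sentence becomes a token']
```

Notice that punctuation sticks to the words ('text.') with naive splitting; handling such cases properly is exactly why NLTK ships dedicated tokenizers, listed next.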
Different types of tokenizers (see the sketch after this list):
SpaceTokenizer() – tokenizes on the basis of spaces.
WordPunctTokenizer() – splits text into alphabetic and non-alphabetic tokens, so each run of punctuation becomes its own token.
TweetTokenizer() – designed for tweets; it keeps Twitter-specific items such as @mentions, #hashtags, emoticons, and contractions together as single tokens.
StanfordTokenizer() – an interface to the Stanford tokenizer, following its tokenization standard (it requires the Stanford Java tools).
TabTokenizer() – tokenizes on the basis of tabs.
LineTokenizer() – tokenizes text line by line.
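A quick sketch comparing these tokenizers on the same text (outputs shown as comments; the sample tweet is made up, and StanfordTokenizer is omitted because it needs the external Stanford Java tools):

```python
# Comparing several NLTK tokenizers on the same text.
from nltk.tokenize import (
    SpaceTokenizer,
    WordPunctTokenizer,
    TweetTokenizer,
    TabTokenizer,
    LineTokenizer,
)

tweet = "Don't miss this #NLP post by @TheDataMonk!"

print(SpaceTokenizer().tokenize(tweet))
# ["Don't", 'miss', 'this', '#NLP', 'post', 'by', '@TheDataMonk!']

print(WordPunctTokenizer().tokenize(tweet))
# ['Don', "'", 't', 'miss', 'this', '#', 'NLP', 'post', 'by', '@', 'TheDataMonk', '!']

print(TweetTokenizer().tokenize(tweet))
# ["Don't", 'miss', 'this', '#NLP', 'post', 'by', '@TheDataMonk', '!']

print(TabTokenizer().tokenize("col1\tcol2\tcol3"))
# ['col1', 'col2', 'col3']

print(LineTokenizer().tokenize("line one\nline two"))
# ['line one', 'line two']
```

Note how WordPunctTokenizer breaks the contraction and the hashtag apart, while TweetTokenizer keeps them intact, which is why the latter is preferred for social media text.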