BOX8 | What is tokenization and What are the important nltk tokenizer?
Question
Lost your password? Please enter your email address. You will receive a link and will create a new password via email.
It will take less than 1 minute to register for lifetime. Bonus Tip - We don't send OTP to your email id Make Sure to use your own email id for free books and giveaways
Answers ( 2 )
Tokenization is the process by which big quantity of text is divided into smaller parts called tokens.
Natural language processing is used for building applications such as Text classification, intelligent chatbot, sentimental analysis, language translation, etc. It becomes vital to understand the pattern in the text to achieve the above-stated purpose. These tokens are very useful for finding such patterns as well as is considered as a base step for stemming and lemmatization.
Some nltk tokenizers: TweetTokenizer,MWETokenizer, sent_tokenize, word_tokenize
Tokenization is basically splitting a string into different parts (tokens) based
upon a particular delimiter.
Each word is a token when a sentence is tokenized into words. Each sentence can also be a token,
if you tokenized the sentences out of a paragraph.
Different types of tokenizers:
SpaceTokenizer – Tokenizes on the basis of space
WordPunctTokenizer() – Tokenizes on the basis of Alphabets and Non-alphabets
TweetTokenizer() – we are able to convert the stream of words into small small tokens
so that we can analyse the audio stream.
StanfordTokenizer() – Follows Stanford Standard for generating tokens.
TabTokenizer() – Tokenizes on the basis of TAB
LineTokenizer() – Tokenizes every line.