Answers (2)

  1. Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens.

    Natural language processing is used for building applications such as text classification, intelligent chatbots, sentiment analysis, language translation, etc. Understanding the patterns in the text is vital to achieving these purposes. Tokens are very useful for finding such patterns, and tokenization is also considered a base step for stemming and lemmatization.

    Some NLTK tokenizers: TweetTokenizer, MWETokenizer, sent_tokenize, word_tokenize (see the sketch below).
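    A minimal usage sketch of the tokenizers above (assuming NLTK is installed and the punkt sentence-tokenizer data has been downloaded; the sample text is made up):

        from nltk.tokenize import sent_tokenize, word_tokenize, TweetTokenizer, MWETokenizer
        # import nltk; nltk.download("punkt")  # one-time download needed by sent_tokenize/word_tokenize

        text = "NLTK makes tokenization easy. Tokens are a base step for stemming!"

        print(sent_tokenize(text))
        # ['NLTK makes tokenization easy.', 'Tokens are a base step for stemming!']
        print(word_tokenize(text))
        # ['NLTK', 'makes', 'tokenization', 'easy', '.', 'Tokens', 'are', 'a', 'base', 'step', 'for', 'stemming', '!']

        # TweetTokenizer keeps hashtags, mentions and emoticons intact
        print(TweetTokenizer().tokenize("Loving #NLTK :) @nltk_org"))
        # ['Loving', '#NLTK', ':)', '@nltk_org']

        # MWETokenizer re-merges chosen multi-word expressions after an initial word split
        mwe = MWETokenizer([("New", "York")], separator="_")
        print(mwe.tokenize(word_tokenize("I live in New York")))
        # ['I', 'live', 'in', 'New_York']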

  2. Tokenization is basically splitting a string into different parts (tokens) based on a particular delimiter.
    When a sentence is tokenized into words, each word is a token; likewise, each sentence is a token
    when you tokenize the sentences out of a paragraph.
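    To make the delimiter idea concrete, a tiny plain-Python sketch (the sample string is invented):

        sentence = "Tokenization splits a string into tokens"
        tokens = sentence.split(" ")  # the delimiter here is a single space
        print(tokens)
        # ['Tokenization', 'splits', 'a', 'string', 'into', 'tokens']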
    Different types of tokenizers (a sketch demonstrating them follows after this list):
    SpaceTokenizer() – tokenizes on the basis of spaces
    WordPunctTokenizer() – splits text into alphabetic and non-alphabetic (punctuation) tokens
    TweetTokenizer() – designed for tweets and other social-media text; keeps hashtags, @-mentions, and emoticons together as single tokens
    StanfordTokenizer() – wraps the Stanford tokenizer (an external Java tool) and follows its conventions for generating tokens
    TabTokenizer() – tokenizes on the basis of tabs
    LineTokenizer() – tokenizes every line
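    A short sketch of the tokenizers listed above (StanfordTokenizer is omitted since it needs the external Stanford jar files; the sample strings are invented):

        from nltk.tokenize import (SpaceTokenizer, WordPunctTokenizer,
                                   TweetTokenizer, TabTokenizer, LineTokenizer)

        s = "Don't stop"
        print(SpaceTokenizer().tokenize(s))       # ["Don't", 'stop'] – split on spaces
        print(WordPunctTokenizer().tokenize(s))   # ['Don', "'", 't', 'stop'] – alphabetic vs. non-alphabetic runs
        print(TweetTokenizer().tokenize("#nlp rocks :)"))  # ['#nlp', 'rocks', ':)']
        print(TabTokenizer().tokenize("a\tb\tc"))          # ['a', 'b', 'c'] – split on tabs
        print(LineTokenizer().tokenize("line1\nline2"))    # ['line1', 'line2'] – one token per line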
