NLP stands for Natural Language Processing, a subdomain of Data Science that helps you extract insights from structured or unstructured data.
Have you ever wondered how a chatbot interacts with you so efficiently that half of the time even you don’t know if it is a bot or a human? That is the power of NLP.
Every organization asks for feedback and reviews in their survey forms or on their website. Do you think they have the time to go through all the texts to extract the sentiment of a customer?
The short answer is ‘NO’. Most of the time they will hire someone to work on these texts and get them the required information. This is where NLP is useful for us.
Have you ever wondered how some emails go straight to your spam folder, and most of the time these emails really are spam? NLP makes this happen for you.
You search for something on Google and you get a lot of relevant suggestions. This is NLP running in the back-end.
There are many such examples where NLP directly makes our lives easier.
Python has really strong library support, especially for NLP, and its community support is also strong. So, if you don’t have a constraint on language selection, do choose Python for any NLP project.
What are the important algorithms of NLP?
The following are important algorithms and processes used in Natural Language Processing. We will cover most of these in the upcoming days:
3. Word Correlation
6. Sentiment Analysis
7. Parts of Speech Tagging
8. Named Entity Recognition
9. Semantic Text Similarity
10. Language Identification
11. Text Summarisation
What are the important Python libraries which help in NLP?
The most important library is no doubt NLTK, followed by SpaCy, TextBlob, CoreNLP, Gensim, and Polyglot.
Where to use which library?
Just knowing the names of the libraries will not help you. Here is a rough idea of which type of work is more convenient in which library:
1. NLTK – This is your go-to library for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
2. TextBlob – Sentiment analysis, POS tagging, or noun phrase extraction
3. SpaCy – It was designed for production usage, which is why it is often considered more accessible than NLTK
4. polyglot – It offers a broad range of analysis and impressive language coverage. It requires running a dedicated command on the command line through its pipeline mechanism
5. Gensim – It can handle large text collections with the help of efficient data streaming and incremental algorithms, unlike packages that only target batch and in-memory processing
6. CoreNLP – The library is really fast and works well in product development environments
What does a standard NLP project look like?
The steps below are not mandatory, but we mostly follow this approach:
Step 1 – Get the raw data
Step 2 – Remove the special characters
Step 3 – Remove the stop words
Step 4 – Perform a TF-IDF, which gets you the most important words of the document. TF stands for Term Frequency, which simply counts how often each word occurs in a document. IDF stands for Inverse Document Frequency, which down-weights the words that occur in most of the documents. So, the words left with the highest scores are the important words of the document. Easy peasy
Step 5 – Depending on the aim of the project, we look for bi-grams or n-grams, which give you words that occur together, like Revenue Dashboard, Online Activity, etc.
A bi-gram is a pair of 2 adjacent words. Similarly, a 3-gram will get you phrases like TheDataMonk Revenue Dashboard, etc.
Step 6 – Depending on the requirement, we move forward in either clustering data or looking for sentiments or users, etc.
Steps 1 to 5 will give you an overview of the important terms and associated terms. This is the most basic Exploratory Data Analysis that we start with 🙂
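Steps 2 to 5 can be sketched with nothing but the Python standard library. This is only a minimal illustration, not a production pipeline: the three documents and the stop-word list are made up for the example, and the score used here (term frequency multiplied by log(N / document frequency)) is just one common TF-IDF variant. In a real project you would more likely rely on a library for this.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus and stop-word list, invented for illustration only.
DOCS = [
    "The revenue dashboard shows the online activity of users!",
    "Online activity is up, and the revenue dashboard is updated daily.",
    "The support team answers customer questions about the dashboard.",
]
STOP_WORDS = {"the", "of", "is", "and", "a", "an", "about", "up"}


def clean(text):
    """Step 2: lower-case, then replace anything that is not a letter, digit or space."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def tokenize(text):
    """Step 3: split on whitespace and drop the stop words."""
    return [word for word in clean(text).split() if word not in STOP_WORDS]


def tf_idf(docs_tokens):
    """Step 4: one {word: score} dict per document, score = tf * log(N / df)."""
    n_docs = len(docs_tokens)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in docs_tokens for word in set(tokens))
    scores = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def ngrams(tokens, n=2):
    """Step 5: every run of n neighbouring tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens_per_doc = [tokenize(doc) for doc in DOCS]
scores = tf_idf(tokens_per_doc)
bigram_counts = Counter(bg for tokens in tokens_per_doc for bg in ngrams(tokens, 2))

# Top 3 words per document, then the most frequent bi-grams in the corpus.
for doc_scores in scores:
    top = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print([word for word, _ in top])
print(bigram_counts.most_common(2))
```

In this toy corpus the word "dashboard" appears in every document, so its IDF is log(1) = 0 and it drops out of the important words, while pairs such as "revenue dashboard" show up as repeated bi-grams.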
We would like to cover Regular Expressions in brief here.
What is Regular Expression?
A regular expression (sometimes called a rational expression) is a sequence of characters that defines a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. “find and replace”-like operations.
Regular expressions are a generalized way to match patterns with sequences of characters.
Many of you must have come across SQL questions where you need to get the data of customers whose names start with A, and in the WHERE condition you write something like,
WHERE Customer_Name LIKE 'A%'
Well! This is the same basic idea as a Regular Expression: you ask the query to get you everything that matches a pattern. The way we write Regular Expressions in Python is a bit different. Check out the table below.
To use Regular Expressions, you first need to import the re package with “import re”.
And the following 4 functions are quite useful for working with your regex:
1. findall – It returns a complete list of all the matches
2. search – It returns a match object for the first match, or None if there is no match
3. split – Splits the string wherever there is a match
4. sub – It replaces one or many matches of the regex
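Here is a small sketch of the four functions side by side. The sample sentence is made up for the example, and the pattern \d+ (one or more digits) is only one possible choice.

```python
import re

# Made-up sample string, for illustration only.
text = "Order 66 shipped on 12 May, order 67 ships on 3 June"

# findall: a list of every non-overlapping match
numbers = re.findall(r"\d+", text)

# search: a match object for the first match (None when nothing matches)
first = re.search(r"\d+", text)

# split: cut the string wherever the pattern matches
parts = re.split(r",\s*", text)

# sub: replace every match with the replacement string
masked = re.sub(r"\d+", "#", text)

print(numbers)
print(first.group(), first.start())
print(parts)
print(masked)
```

Note that search hands back a match object and not a string, so you call .group() on it to get the matched text and .start() to get its position.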
Following are some important metacharacters and special sequences:
|\w+|Get all the words|
|\S|Anything but white spaces|
|+|One or more occurrences|
|*|Zero or more occurrences|
|[]|A set of characters|
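To see these in action, here is a short sketch with one line per entry. In Python the special sequences are written with a backslash (\w, \S), and it is safest to put the pattern in a raw string (r"..."). The sample strings are invented for the example.

```python
import re

# Made-up sample string, for illustration only.
sample = "NLP  rocks: 3 libs, 100% fun!"

# \w+ : runs of letters, digits or underscores, i.e. the words
words = re.findall(r"\w+", sample)

# \S+ : runs of anything but white space, so punctuation stays attached
non_space = re.findall(r"\S+", sample)

# +   : one or more occurrences of the preceding character
one_or_more = re.findall(r"o+", "goood go")

# *   : zero or more occurrences, so "gd" matches go*d but not go+d
star_matches = bool(re.fullmatch(r"go*d", "gd"))
plus_matches = bool(re.fullmatch(r"go+d", "gd"))

# []  : a set of characters, any one of which may match
vowels = re.findall(r"[aeiou]", "regex")

print(words)
print(non_space)
print(one_or_more, star_matches, plus_matches, vowels)
```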
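Let’s get down to some questions to understand the basics of how to write a regex.
1. re.split(r'\s+', 'My name is Data Monk')
['My', 'name', 'is', 'Data', 'Monk'] – The regex \s+ matches one or more white space characters, so the above function splits the given string at every gap and gets you all the words
2. end_Sentence = r'[.?!]'
Used with re.split, the above pattern will split the document wherever a sentence ends with a full stop, question mark, or an exclamation mark
3. [a-zA-Z0-9.-]
This will match all the upper case letters, lower case letters, digits, - and . The hyphen is placed last so that it is read as a literal character and not as a range, and there are no spaces inside the brackets because a space there would also be matched
4. r"[.*]"
Inside square brackets the full stop and the asterisk lose their special meaning, so this matches a single literal . or *. It is the pattern .* without the brackets that matches anything and everything
You can find many more RegEx exercise questions on different websites. Do practice a few.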
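If you want to check answers like these yourself, a quick sketch along the following lines can help. The input strings are made up, and the third pattern is written without spaces and with the hyphen last, which is an assumption about what the exercise intends. The last two lines show that . and * are treated as literal characters inside square brackets, while .* outside brackets matches any text.

```python
import re

# Exercise 1: split on runs of white space
words = re.split(r"\s+", "My name is Data Monk")

# Exercise 2: split a document at sentence-ending punctuation,
# then drop the empty piece left after the final full stop
end_sentence = r"[.?!]"
pieces = re.split(end_sentence, "Hi there! How are you? Fine.")
sentences = [piece.strip() for piece in pieces if piece.strip()]

# Exercise 3: letters, digits, hyphen and full stop
allowed = re.findall(r"[a-zA-Z0-9.-]+", "e-mail: data.monk-01 @ home")

# Exercise 4: '.' and '*' are literal inside brackets ...
literal = re.findall(r"[.*]", "a.b*c")
# ... while .* without brackets matches the whole string
anything = re.fullmatch(r".*", "a.b*c")

print(words)
print(sentences)
print(allowed)
print(literal, anything.group())
```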
Keep Learning 🙂