Text Analytics in R

Text analytics is the crunching of text in order to extract insights from it. The target text could be tweets, a WhatsApp group chat, website content, or any other source with a lot of text.

I am planning to write a book on the topic, but before that I want to show you how to go about text analytics.

The process of text analytics in R involves the following steps:

1. Installing all the required packages
2. Get the data into R
3. Clean the data by removing special characters
4. Stem the document
5. Remove stop words – Stop words are very common words such as I, we, the, a, an, etc. (mostly pronouns, articles, and prepositions) that bias your algorithm without adding insight. You can also remove specific words of your own; a sketch for this is included after the cleaning steps in the script below
6. Create a word cloud
7. Get the term frequency and inverse document frequency (TF-IDF) of the text file – a TF-IDF sketch follows the frequency table in the script below
8. Apply sentiment analysis algorithm

Below is the code for the whole workflow. Try to work through it, and ask in the comments section below if anything is unclear.

# Install and load the required packages
install.packages("tm")
require(tm)
require(NLP)
install.packages("wordcloud")
require(wordcloud)
install.packages("syuzhet")
require(syuzhet)
install.packages("SnowballC")
require(SnowballC)
install.packages("ggplot2")
require(ggplot2)

# Read the raw text file (replace the path with the location of your own file)
dataset <- readLines("C:\\Users\\Location\\Downloads\\chat.txt")
print(dataset)

## A corpus is nothing but a collection of documents
docs <- Corpus(VectorSource(dataset))
docs

## Replace special characters with a space
trans <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, trans, "/")
docs <- tm_map(docs, trans, "@")
docs <- tm_map(docs, trans, "\\|")

## Convert to lower case, then remove numbers, punctuation and extra whitespace
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)

## Stem the documents and remove English stop words
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, removeWords, stopwords("english"))
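
Step 5 of the list also mentions removing specific words of your own. Here is a minimal sketch; the words in custom_words are just placeholders for the kind of artifacts a WhatsApp export tends to contain, so replace them with whatever counts as noise in your own file.

## Remove custom, file-specific words (placeholder examples, not from the original post)
custom_words <- c("media", "omitted", "message", "deleted")
docs <- tm_map(docs, removeWords, custom_words)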

## Create the term-document matrix and compute each word's total frequency
dtm <- TermDocumentMatrix(docs)
mat <- as.matrix(dtm)
v <- sort(rowSums(mat), decreasing = TRUE)

# Put the sorted word frequencies into a data frame

d <- data.frame(words=names(v),freq=v)
head(d,10)
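
Step 7 of the list also calls for inverse document frequency, while the script above only computes raw term frequencies. Below is a minimal sketch of a TF-IDF weighted matrix using tm's built-in weightTfIdf, treating each line of the chat as a separate document; the variable names are mine, not from the original post.

# Term-document matrix with TF-IDF weighting instead of raw counts
dtm_tfidf <- TermDocumentMatrix(docs, control = list(weighting = weightTfIdf))
v_tfidf <- sort(rowSums(as.matrix(dtm_tfidf)), decreasing = TRUE)
head(v_tfidf, 10)  # words with the highest aggregate TF-IDF weight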

#wordcloud
set.seed(1234)
wordcloud(words = d$words, freq = d$freq, min.freq = 1, max.words = 50,
          random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
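
If you prefer a plain list to a word cloud, tm also provides findFreqTerms(), which returns every term that appears at least a given number of times. The threshold of 10 below is an arbitrary placeholder, so adjust it to the size of your file.

# Every word that occurs at least 10 times in the corpus
findFreqTerms(dtm, lowfreq = 10)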

#get sentiments
sentiment <- get_nrc_sentiment(dataset)
text <- cbind(dataset,sentiment)
text

#Get the total score of each sentiment category
Total_Sentiment <- data.frame(colSums(text[, c(2:11)]))
names(Total_Sentiment) <- "count"
Total_Sentiment <- cbind("sentiment" = rownames(Total_Sentiment), Total_Sentiment)
rownames(Total_Sentiment) <- NULL

#Plot the total sentiment score per category
ggplot(data = Total_Sentiment, aes(x = sentiment, y = count)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Score")

Please do comment below if you run into any errors. As for the link to the book, keep checking this page.

