R – The language of Data Science – TheDataMonk

Text Analytics in R

Text analytics is the crunching of text to extract insights from it. The target text could come from Twitter, a WhatsApp group, a website, or any other source with a lot of text.

I am planning to write a book on this, but before that, here is how to go about text analytics.

The process of text analytics in R involves the following:

1. Install all the required packages
2. Get the data into R
3. Clean the data by removing special characters
4. Stem the document
5. Remove stop words – Stop words are very common words such as I, we, the, a, an, etc. that bias your algorithm. They are mostly articles, pronouns, prepositions and conjunctions. You can also remove specific words of your own choosing
6. Create a word cloud
7. Get the term frequency and inverse document frequency of the text file
8. Apply a sentiment analysis algorithm
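
As a rough illustration of step 7, the sketch below computes term frequency and inverse document frequency in base R on a three-line toy corpus. The sample sentences, the `tokenise` helper and the plain log(N / df) weighting are assumptions made for this example; the tm package also offers its own TF-IDF weighting, which may scale the values differently.

```r
# Toy corpus: three short "documents" (hypothetical sample text)
toy_docs <- c(
  "the match was great",
  "the food was bad",
  "great food and great match"
)

# Hypothetical helper: lower-case, keep only letters and spaces, split on whitespace
tokenise <- function(x) {
  x <- gsub("[^a-z ]", " ", tolower(x))
  words <- strsplit(x, "\\s+")[[1]]
  words[words != ""]
}

tokens <- lapply(toy_docs, tokenise)
vocab <- sort(unique(unlist(tokens)))

# Term frequency: raw count of every vocabulary term in every document
# (rows = terms, columns = documents)
tf <- vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
)
rownames(tf) <- vocab

# Document frequency: number of documents that contain each term
doc_freq <- rowSums(tf > 0)

# Inverse document frequency: rarer terms get a larger weight
idf <- log(ncol(tf) / doc_freq)

# TF-IDF: idf has one value per row, so it is recycled down each column
tfidf <- tf * idf

print(round(tfidf, 3))
```

A word that appears in only one document gets the largest weight, while a word that appears in every document would get a weight of zero.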

Below is the code for the same. Try to understand it, or ask in the comments section below.

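# Packages used for cleaning (install them once with install.packages() if needed)
library(tm)         # Corpus, tm_map, stopwords
library(SnowballC)  # required by stemDocument

# Read the chat export, one element per line
dataset <- readLines("C:\\Users\\Location\\Downloads\\chat.txt")

## A corpus is nothing but a collection of documents
docs <- Corpus(VectorSource(dataset))
# Helper that replaces every match of a pattern with a space
trans <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, trans, "/")
docs <- tm_map(docs, trans, "@")
docs <- tm_map(docs, trans, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
# Note: stemming first can alter some stop words so that they no longer match
# the stop word list; swap the next two lines if that matters for your data
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, removeWords, stopwords("english"))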

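library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()

## Create a term-document matrix (terms in rows, documents in columns)
dtm <- TermDocumentMatrix(docs)
mat <- as.matrix(dtm)
# Total frequency of each term, most frequent first
v <- sort(rowSums(mat), decreasing = TRUE)

# Convert the term frequencies into a data frame
d <- data.frame(words = names(v), freq = v)

wordcloud(words = d$words, freq = d$freq, min.freq = 1, max.words = 50,
          random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))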

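library(syuzhet)  # get_nrc_sentiment()
library(ggplot2)  # plotting

# Get the NRC sentiment scores for every line of the chat
sentiment <- get_nrc_sentiment(dataset)
text <- cbind(dataset, sentiment)

# Total the scores by category (columns 2 to 11 hold the ten NRC categories)
Total_Sentiment <- data.frame(colSums(text[, c(2:11)]))
names(Total_Sentiment) <- "count"
Total_Sentiment <- cbind("sentiment" = rownames(Total_Sentiment), Total_Sentiment)
rownames(Total_Sentiment) <- NULL

# Plot the totals. The opening ggplot() and geom_bar() calls were missing from
# the original listing; the next two lines are an assumed reconstruction.
ggplot(data = Total_Sentiment, aes(x = sentiment, y = count)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Score")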
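
To get a feel for what a lexicon-based sentiment step does, here is a minimal base R sketch that counts emotion words per message. The five-word lexicon, the sample messages and the `count_categories` helper are all hypothetical and made up for this example; this is not the real NRC lexicon, and it is not how the syuzhet package is implemented internally.

```r
# Hypothetical toy lexicon: each word is mapped to one emotion category
lexicon <- data.frame(
  word = c("happy", "love", "angry", "hate", "afraid"),
  category = c("joy", "joy", "anger", "anger", "fear"),
  stringsAsFactors = FALSE
)
categories <- unique(lexicon$category)

# Made-up sample messages
messages <- c("I love this group", "So angry, I hate delays", "Happy birthday!")

# Hypothetical helper: count how many words of each category a message contains
count_categories <- function(msg) {
  words <- strsplit(gsub("[^a-z ]", " ", tolower(msg)), "\\s+")[[1]]
  # Look every word up in the lexicon; words not found become NA and are ignored
  hits <- lexicon$category[match(words, lexicon$word)]
  counts <- table(factor(hits, levels = categories))
  setNames(as.integer(counts), names(counts))
}

# One column per message, one row per category
scores <- vapply(messages, count_categories, integer(length(categories)),
                 USE.NAMES = FALSE)

# Total count per category across all messages
totals <- rowSums(scores)
print(totals)
```

The totals play the same role as the count column in Total_Sentiment above: one number per category, ready to be plotted as a bar chart.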

Please comment if you run into any errors. For the link to the book, keep checking here.

Learn to program in R in 6 hours with interview questions

R is an open-source programming language used mainly for statistical computing and graphics. It is a tool that helps in building machine learning models and visualisations, and machines are being used more and more all over the world.

As the use of machines grows every single day, the use of R will grow too. It is one of the most popular languages in the field of statistics. Data experts use it to develop financial and climate models that help drive our economies and communities.

R is used in many professions such as software development, business analysis, statistical reporting, scientific research, etc. People working in these areas will have to deal with this language at some point, especially in data-related jobs. As the number of data-intensive jobs increases, more data processing will take place, and this in turn increases the need for R.

For the people who think this might be boring, hold on! It is a really fun tool for experimenting with graphics, generating different types of charts and creating plots. For those who still think it is boring, let's come to the salary part. The Dice technology salary survey conducted last year ranked R as the highest-paying skill. The pay is high, and you don't need to break your head to get going and learn this language.

You do not need any advanced skills to learn R. You need to know some concepts of mathematics, such as probability and statistics. It's a very easy language to learn, and it's possible to learn it while pursuing any degree. You don't have to be a hardcore programmer. Many of the people who are currently doing very well in R are not computer science students. They come from varied professional backgrounds and have become successful because of their interest, not because they took up computer science at degree level.

It isn't a complicated language, and it is becoming increasingly important for future jobs. Learning this language will definitely boost your resume and your career. R is worth learning for these reasons and more. Thedatamonk.com provides a platform to understand this topic well. It will give you an insight into whether you should spend your time learning this language or not. The Data Monk has published a book that will give you an insight into the world of R. The book's name is 100 questions to learn R in 6 hours. It's available on Amazon at a very reasonable price and is worth buying for your future. Here's the link to the book:


Do check out the book and leave your comments and reviews on http://thedatamonk.com/. You can also leave your review and comments on the Amazon site after purchasing the book!

"Planning is bringing the future into the present so that you can do something about it now." - Alan Lakein