# Author Archives: The Data Monk

## Forecasting in R using different algorithms like Holt Winters, ARIMA and UCM

Hi All,

Today we will talk about forecasting algorithms. Before we dig deep into the topic, let’s understand why we actually need forecasting.
Forecasting is the process of predicting the future by looking at past patterns (historical data). In the back of his mind, a businessman knows when to build up stock and when to play safe. Everyone has used forecasting at some time or other.
Remember when you used to go through past years’ question papers to predict which chapters would dominate the coming exam? That is forecasting too.

Let’s talk about the different terminologies in forecasting:
1. Stationarity – Many forecasting models, ARIMA among them, assume that the data is stationary, so you usually need to make the series stationary before modelling. A stationary time series is one whose statistical properties such as mean, variance and autocorrelation are constant over time.
2. Trend – Look at the graph below. Trend is the overall movement of the data. Here the trend is slightly increasing over time.
3. Seasonality – A definite rise or dip that repeats in a fixed pattern is called seasonality. Here we have 12 seasonal values, where first there is an increase and then a decrease in values.
4. Cyclicity – It is similar to seasonality, but it plays out over a longer time period. Here there is clearly a cyclic pattern every 4 time periods.
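To make these terms concrete, here is a minimal sketch in base R. The series is synthetic (an assumed trend, a 12-month seasonal wave and some noise), so treat the numbers as illustration only.

```r
# Hypothetical monthly series: upward trend + 12-month seasonal pattern + noise
set.seed(42)
n <- 48
month_index <- seq_len(n)
volume <- 100 + 0.8 * month_index +
  10 * sin(2 * pi * month_index / 12) +
  rnorm(n, sd = 2)
sales <- ts(volume, start = c(2015, 1), frequency = 12)

# Split the series into trend, seasonal and random components
parts <- decompose(sales, type = "additive")
round(parts$figure, 2)   # one seasonal effect per calendar month

# First-order differencing removes the linear trend,
# a common first step towards stationarity
sales_diff <- diff(sales)
c(mean_first_half  = mean(sales_diff[1:23]),
  mean_second_half = mean(sales_diff[24:47]))
```

If the two half-means of the differenced series are close to each other, the trend has largely been removed. The seasonal pattern is still there and would need seasonal differencing.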

Now we have the following data: the month and the number of KFC buckets sold at one outlet in India.
Schema – MonthStartDate and BucketVolume
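A rough sketch of how a Holt-Winters forecast could be run on data with this schema. The values below are simulated, not the real outlet data, and the smoothing parameters `alpha`, `beta` and `gamma` are fixed at assumed values to keep the example deterministic. Normally you would leave them out and let `HoltWinters()` estimate them.

```r
# Hypothetical data following the MonthStartDate / BucketVolume schema
set.seed(1)
kfc <- data.frame(
  MonthStartDate = seq(as.Date("2016-01-01"), by = "month", length.out = 36)
)
m <- seq_len(nrow(kfc))
kfc$BucketVolume <- round(500 + 5 * m + 60 * sin(2 * pi * m / 12) +
                            rnorm(36, sd = 10))

# Monthly time series object (12 observations per year)
bucket_ts <- ts(kfc$BucketVolume, start = c(2016, 1), frequency = 12)

# Additive Holt-Winters with assumed, fixed smoothing parameters
hw_fit <- HoltWinters(bucket_ts, alpha = 0.3, beta = 0.1, gamma = 0.2)

# Forecast the next 6 months
hw_forecast <- predict(hw_fit, n.ahead = 6)
hw_forecast
```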

## Data set and type of data sets for modeling

Q.) What is a data set?
A.) A data set is the complete data which you use for your project. It can include data from multiple databases and tables combined together. The data set for modeling can be divided into 2 parts: the train and test data sets.

Q.) What is a train dataset?
A.) When you are building a model, you use some part of the dataset to train your model. This train dataset provides the examples from which your model learns to behave in a consistent manner.

For example, if you have restaurant data for the last 13 months, then this is your complete data set. You can divide the dataset in an 80:20 ratio and take around 11 months of data for training your model. You build the model on this 80% of the dataset.

Q.) What is a test dataset?
A.) The remaining 20% of the data is used to test your model before it goes into real use. Taking the above example forward, the last 2 months of data out of the 13 months will not be used in training the model. So whatever your model predicts will be tested against the already known values for the last 2 months, to check the effectiveness of the algorithm.
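The 13-month example can be sketched in R as follows. The sales figures are made up, and a simple straight-line `lm()` fit stands in for whatever model you actually build. Note that for time-ordered data the split is chronological rather than random.

```r
# Hypothetical restaurant data: 13 months of sales
sales <- data.frame(
  month  = 1:13,
  amount = c(120, 125, 131, 128, 140, 146, 150, 149, 158, 163, 170, 172, 181)
)

# 80:20 split in time order: first 11 months train, last 2 months test
n_train <- ceiling(0.8 * nrow(sales))
train <- sales[seq_len(n_train), ]
test  <- sales[-seq_len(n_train), ]

# A straight-line trend model stands in for "the model" here
fit  <- lm(amount ~ month, data = train)
pred <- predict(fit, newdata = test)

# Mean absolute percentage error on the held-out months
mape <- mean(abs(test$amount - pred) / test$amount) * 100
mape
```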

Q.) What comes under implementation of models?
A.) Implementation mainly means understanding the requirements of the stakeholders and molding the model to meet the business requirement. For example, you can build a forecasting algorithm in R, but then you might have to implement it in Power BI (the Business Intelligence tool from Microsoft) to make it more consumable, or you might have to develop an app to meet the requirement.

Q.) Why does data cleaning play an important role?
A.) We are back to cleaning data. I once participated in a Kaggle competition which required applying different text analytics algorithms to find the sentiment of a text. I had done a similar project in the past on clean data and I had the code ready for it. But it took me almost a couple of days to clean the data and only a couple of hours to run the model.

Cleaning is important because you won’t get a good result on a dirty dataset. Chances are that you might reject a particular algorithm just because it does not show you the expected result, when in fact the algorithm was correct and your unclean data was ruining the case.

For more such questions, go here

## Statistics gyan

What is a confusion matrix?
A confusion matrix is a 2×2 matrix that compares actual and predicted classes. It is typically used in the prediction world to understand the effectiveness of a classification algorithm. In each cell name, the first word (True/False) says whether the prediction was correct and the second word (Positive/Negative) is the predicted class. A True-Positive, for example, is a case that was predicted positive and really is positive.

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True – Positive | False – Negative |
| Actual Negative | False – Positive | True – Negative |

Q.) What is True-Positive?
A.) This means the actual value is positive and the predicted value is also positive. For example, suppose we have to predict with some model whether a disease is present in a patient. Then the 1st block counts the cases for which we predicted yes and the patients actually were suffering from the disease.

For a predictive model or a classifier – This value should be high

Q.) What is True-Negative?
A.) This means the actual value is negative and the predicted value is also negative. For example, we predicted that a patient is not suffering from a disease, and he is indeed found not to be suffering from it.

For a predictive model or a classifier – This value should also be high

Q.) What is False-Positive?
A.) This is also known as a Type-1 error. Here we predicted yes, but the patients don’t actually have the disease.

This indicates an error in your algorithm. Since we almost always deal with sensitive data, this value should be as low as possible. Suppose we predicted that a patient is suffering from diabetes, the doctor prescribed treatment based on our algorithm, and it later turned out that the patient was not suffering from diabetes. This will raise concern.

Q.) What is False-Negative?
A.) This is also known as a Type-2 error. Here we predicted no, but the patients actually have the disease.

This is the major concern of an algorithm. No matter how accurate the model is overall, if it produces many False-Negatives, then the model should not be introduced.

This indicates an error in your algorithm. Suppose we predicted that a patient is not suffering from cancer and later found out that the patient was suffering; then there is not much use for the algorithm.
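Here is a small sketch of how the four counts can be computed in R. The `actual` and `predicted` vectors are made-up labels (1 = has the disease, 0 = does not), not output from a real model.

```r
# Hypothetical labels: 1 = disease present, 0 = disease absent
actual    <- c(1, 1, 1, 0, 0, 0, 1, 0, 1, 0)
predicted <- c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0)

tp <- sum(actual == 1 & predicted == 1)  # predicted yes, actually yes
tn <- sum(actual == 0 & predicted == 0)  # predicted no, actually no
fp <- sum(actual == 0 & predicted == 1)  # Type-1 error
fn <- sum(actual == 1 & predicted == 0)  # Type-2 error

# Arrange the counts as a 2x2 matrix: rows = actual, columns = predicted
confusion <- matrix(c(tp, fn, fp, tn), nrow = 2, byrow = TRUE,
                    dimnames = list(actual    = c("positive", "negative"),
                                    predicted = c("positive", "negative")))
confusion

accuracy <- (tp + tn) / length(actual)
recall   <- tp / (tp + fn)   # share of real positives the model caught
c(accuracy = accuracy, recall = recall)
```

Recall is the number to watch when False-Negatives are the major concern, since it drops as missed positives go up.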

Let me know if you need more examples to understand this. For more such questions, go here

## Text Analytics in R

Text Analytics is the crunching of text in order to get some insights out of it. The target text could come from Twitter, a WhatsApp group, a website or anywhere else where there is a lot of text.

I am planning to write a book on it, but before that I want to show you how to go about text analytics:-

The process of text analytics in R involves the following:-

1. Installing all the required packages
2. Get the data in R
3. Clean the data by removing special characters
4. Stem the document
5. Remove stop words – Stop words are very common words like I, we, the, a, an, etc. that bias your algorithm. Mostly these are articles, pronouns and prepositions. You can also remove specific words of your own
6. Create a word cloud
7. Get term-frequency and inverse document frequency of the text file
8. Apply sentiment analysis algorithm

Below is the code for the same. Try to understand it, or ask in the comment section below

# Install the packages once, then load them
install.packages(c("tm", "wordcloud", "syuzhet", "SnowballC"))
library(tm)          # also loads NLP
library(wordcloud)   # also loads RColorBrewer, needed for brewer.pal()
library(syuzhet)
library(SnowballC)

# `dataset` is assumed to be a character vector, one document per element
print(dataset)

# A corpus is nothing but a collection of documents
docs <- Corpus(VectorSource(dataset))
docs
# Helper that replaces a given pattern with a space
trans <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, trans, "/")
docs <- tm_map(docs, trans, "@")
docs <- tm_map(docs, trans, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, removeWords, stopwords("english"))

# Create a term-document matrix (terms in rows, documents in columns)
dtm <- TermDocumentMatrix(docs)
mat <- as.matrix(dtm)
v <- sort(rowSums(mat), decreasing = TRUE)

# Convert the term frequencies into a data frame
d <- data.frame(words = names(v), freq = v)

# Word cloud
set.seed(1234)
wordcloud(words = d$words, freq = d$freq, min.freq = 1, max.words = 50,
          random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

# Get sentiments
sentiment <- get_nrc_sentiment(dataset)
text <- cbind(dataset, sentiment)
text

# Get the sentiment word counts by category (columns 2 to 11 hold the NRC scores)
Total_Sentiment <- data.frame(colSums(text[, c(2:11)]))
names(Total_Sentiment) <- "count"
Total_Sentiment <- cbind("sentiment" = rownames(Total_Sentiment), Total_Sentiment)
rownames(Total_Sentiment) <- NULL

# ggplot2 is needed for the bar chart (install.packages("ggplot2") if missing)
library(ggplot2)
ggplot(data = Total_Sentiment, aes(x = sentiment, y = count)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Score")
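Step 7 of the process (term frequency and inverse document frequency) can be sketched in base R as below. The three short documents are made up, and the weighting used here (relative term frequency times `log(N / document frequency)`) is one common TF-IDF variant, assumed for illustration.

```r
# Three hypothetical documents
reviews <- c("the food was great",
             "the service was slow",
             "great food and great service")

tokens <- strsplit(tolower(reviews), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Term frequency: share of each document's words taken up by each term
tf <- t(vapply(tokens,
               function(tok) as.numeric(table(factor(tok, levels = vocab))) / length(tok),
               numeric(length(vocab))))
colnames(tf) <- vocab

# Inverse document frequency: rarer terms get a higher weight
doc_freq <- colSums(tf > 0)
idf <- log(length(reviews) / doc_freq)

# TF-IDF: scale every column of tf by that term's idf
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

Words that appear in only one document, such as "slow", get the highest weight, while words spread across documents are pushed down.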

Please do comment if you run into any error. For the link to the book, keep looking here

## Basic statistics terms definitions in layman language

A population is any specific collection of objects of interest.

A sample is any subset or subcollection of the population, including the case that the sample consists of the whole population, in which case it is termed a census.

A measurement is a number or attribute computed for each member of a population or of a sample. The measurements of sample elements are collectively called the sample data.

“N” is usually used to indicate the number of subjects in a study. Example: If you have 76 participants in a study, N=76.

A parameter is a number that summarizes some aspect of the population as a whole. A statistic is a number computed from the sample data.

Quantitative data are numerical measurements that arise from a natural numerical scale.

Statistics is a collection of methods for collecting, displaying, analyzing, and drawing conclusions from data.

Correlation – It is the degree to which two factors appear to be related. Correlation should not be confused with causation. Just because two factors are reported as being correlated, you cannot say that one factor causes the other. For example, you might find a correlation between going to the library at least 40 times per semester and getting high scores on tests. However, you cannot say from these findings what about going to the library, or what about people who go to libraries often, is responsible for higher test scores.

Median – The score that divides the results in half – the middle value.

Descriptive statistics is the branch of statistics that involves organizing, displaying, and describing data.

Inferential statistics is the branch of statistics that involves drawing conclusions about a population based on information contained in a sample taken from that population.

r-value is the way in which correlation is reported statistically (a number between -1 and +1). As a rough rule of thumb, an r-value beyond ±0.3 is treated as a correlation worth reporting, although whether it is statistically significant also depends on the sample size.
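As a quick illustration of the r-value and the median, here is a small base R sketch built on the library example. The visit counts and test scores are invented for the purpose.

```r
# Hypothetical data: library visits per semester and test scores
visits <- c(5, 12, 20, 33, 41, 48)
scores <- c(52, 60, 58, 71, 80, 78)

# Pearson correlation coefficient (the r-value), always between -1 and +1
r <- cor(visits, scores)
r

# The median is the middle value; with an even count it is the
# average of the two middle values
median(scores)
```

A high r here would only say that visits and scores move together, not that one causes the other.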

Qualitative data are measurements for which there is no natural numerical scale, but which consist of attributes, labels, or other nonnumerical characteristics.

Stay tuned to our website for more Statistics gyan. For puzzles, case studies and statistics questions:

100 puzzles and case studies to crack data science interview

## Data Science puzzles in interview questions

In the current scenario, getting your first break into analytics can be difficult. Around 30% of analytics companies (especially the top ones) evaluate candidates on their prowess at solving puzzles. Doing well at puzzles implies that you are logical, creative and good with numbers.

The ability to bring a unique perspective to solving business problems can give you a huge advantage over other candidates. Such abilities can only be developed with regular practice and consistent effort.

Below are some common puzzles which are generally asked during interviews.

Two trains X and Y (80 km from each other) are running towards each other on the same track, each at a speed of 40 km/hr. A bird starts from train X and flies towards train Y at a constant speed of 100 km/hr. Once it reaches train Y, it turns and starts flying back toward train X. It keeps doing this till the two trains collide with each other. Find the total distance traveled by the bird.

Solution : Velocity of approach for the two trains = (40 + 40) km/hr

Total time the trains will take to collide = 80km/80km/hr = 1 hour

Total distance travelled by the bird = 100km/hr * 1hr = 100 km.
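If you want to convince yourself that the shortcut is right, the sketch below computes the answer both ways in R: once with the closing-speed shortcut, and once by adding up the bird’s individual legs until the gap between the trains is (numerically) gone. The variable names are only for illustration.

```r
gap         <- 80    # km between the trains at the start
train_speed <- 40    # km/hr, each train
bird_speed  <- 100   # km/hr

# Shortcut: the trains close the gap at 40 + 40 km/hr
closing_speed     <- 2 * train_speed
time_to_collision <- gap / closing_speed
bird_distance     <- bird_speed * time_to_collision

# Long way: sum the bird's back-and-forth legs
remaining <- gap
total     <- 0
while (remaining > 1e-9) {
  # The bird and the oncoming train approach each other at 100 + 40 km/hr
  leg_time  <- remaining / (bird_speed + train_speed)
  total     <- total + bird_speed * leg_time
  remaining <- remaining - closing_speed * leg_time
}

c(shortcut = bird_distance, leg_by_leg = total)
```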

You have two beakers, one of 4 liters and the other of 5 liters. You are expected to pour exactly 7 liters into a bucket. How will you complete the task?

Step 1 : Fill the 5-liter beaker and pour from it into the 4-liter beaker until that is full. You are left with 1 liter in the 5-liter beaker. Pour this 1 liter into the bucket.

Step 2 : Empty the 4-liter beaker and repeat step 1, and you will have 2 liters in the bucket.

Step 3 : Fill the 5-liter beaker and add it to the bucket. You now have 7 liters in the bucket.

There are 5 pirates on a ship. The pirates have a hierarchy C1, C2, C3, C4 and C5, where C1 is the highest designation and C5 is the lowest. These pirates have three characteristics: a. Every pirate is so greedy that he can even take lives to make more money. b. Every pirate desperately wants to stay alive. c. They are all very intelligent. There are 100 gold coins in total on the ship. The pirate with the highest designation on the deck is expected to propose the distribution. If the majority of the deck does not agree to the proposed distribution, the highest-designation pirate will be thrown off the ship (or simply killed). Only the pirate with the highest designation can be killed at any moment. What distribution of the coins should the captain propose so that he is not killed and keeps the maximum amount?

The solution to this problem lies in thinking through what would happen if the pirates were thrown off one by one, and then reasoning in reverse order.

Let us name the pirates A, B, C, D and E in order of hierarchy (A being the highest).

If only D and E are left at the end, D will simply give 0 coins to E and still escape, because E alone cannot form a majority against him. Hence E will give his vote to any earlier distributor who offers him even 1 coin.

If C, D and E are on the deck, C will simply give one coin to E to get his vote, and D gets nothing. Hence D will give his vote to any earlier distributor who offers him even 1 coin.

If B, C, D and E are on the deck, B will simply give one coin to D to get his vote. C and E get nothing, so each of them will vote for an earlier distributor who offers him even 1 coin.

If A, B, C, D and E are on the deck, A simply gives 1 coin each to C and E to get their votes.

Hence, in the final solution A gets 98 coins and only C and E get 1 coin each.
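The reverse reasoning can be written as a short backward-induction loop in R. This is only a sketch under two assumptions taken from the solution above: a proposal survives when at least half of the deck (proposer included) votes for it, and a pirate votes yes only if he gets strictly more than he would get after the proposer is thrown off. The function name `solve_pirates` is made up for this example.

```r
solve_pirates <- function(n_pirates = 5, coins = 100) {
  # alloc holds the payoffs of the pirates on deck, most senior first
  alloc <- coins                      # a single pirate keeps everything
  if (n_pirates == 1) return(alloc)
  for (k in 2:n_pirates) {
    # What the k - 1 juniors would get if the proposer were thrown off
    fallback <- alloc
    # Votes needed besides the proposer's own (ties pass)
    votes_needed <- ceiling(k / 2) - 1
    # Buy the cheapest votes: one coin more than the fallback payoff
    cheapest <- order(fallback)[seq_len(votes_needed)]
    bribes <- numeric(k - 1)
    bribes[cheapest] <- fallback[cheapest] + 1
    alloc <- c(coins - sum(bribes), bribes)
  }
  alloc
}

# Payoffs for A, B, C, D, E
solve_pirates(5, 100)
```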

There are 3 jars, all of them mislabeled. One jar contains only apples, another contains only oranges, and the third contains a mixture of apples and oranges. You can pick as many fruits as required to label each jar correctly. Determine the minimum number of fruits you need to pick in the process of labeling the jars.

This is another tricky puzzle where you must really churn your brain. The key is that every label is wrong, which means the jar labelled A+O (Apple + Orange) cannot hold the mixture: it contains either only apples or only oranges. The candidate picks one fruit from the jar labelled A+O, and let’s assume he gets an apple. He relabels that jar as Apple. The jar labelled Orange cannot contain oranges (its label is wrong) and cannot be the apple jar (that one is already found), so it must be the A+O jar. The remaining jar, labelled Apple, must then contain the oranges. So picking only one fruit is enough.

To crack Data Science/Business Analyst interviews, you need to be good at puzzles and case studies.

You can take a look at my book for more such puzzles before your interview

100 Puzzles and case studies to crack data science interview

## Data Science Interview Questions

Data Science is not an easy field to get into. This is something all data scientists will agree on. Apart from having a degree in mathematics/statistics or engineering, a data scientist also needs to go through intense training to develop all the skills required for this field. Apart from the degree/diploma and the training, it is important to prepare the right resume for a data science job and to be well versed with the data science interview questions and answers. So we have put some important questions below.

How would you create a taxonomy to identify key customer trends in unstructured data?

The best way to approach this question is to mention that it is good to check with the business owner and understand their objectives before categorizing the data. Having done this, it is always good to follow an iterative approach: pull new data samples, validate the model for accuracy by soliciting feedback from the business stakeholders, and improve it accordingly. This helps ensure that your model is producing actionable results and improving over time.

Python or R – Which one would you prefer for text analytics?

The best possible answer for this would be Python, because it has the Pandas library, which provides easy-to-use data structures and high-performance data analysis tools.

Which technique is used to predict categorical responses?

Classification techniques are widely used in data mining for classifying data sets, that is, for predicting categorical responses.

What is logistic regression? Or State an example when you have used logistic regression recently.

Logistic Regression, often referred to as the logit model, is a technique to predict a binary outcome from a linear combination of predictor variables. For example, suppose you want to predict whether a particular political leader will win the election or not. In this case, the outcome of the prediction is binary, i.e. 1 or 0 (Win/Lose). The predictor variables here would be the amount of money spent on a particular candidate’s election campaign, the amount of time spent campaigning, etc.
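A minimal sketch of the election example with base R’s `glm()`. The campaign data is simulated, and the single predictor `spend` and the coefficients used to generate it are assumptions made purely for illustration.

```r
# Simulated campaigns: spend (in some money unit) and win/lose outcome
set.seed(7)
n <- 200
campaign <- data.frame(spend = runif(n, 0, 10))
campaign$win <- rbinom(n, 1, plogis(-3 + 0.6 * campaign$spend))

# Logistic regression: binary outcome from a linear combination of predictors
fit <- glm(win ~ spend, data = campaign, family = binomial)
coef(fit)

# Predicted probability of winning for a low and a high spender
new_candidates <- data.frame(spend = c(2, 8))
win_prob <- predict(fit, newdata = new_candidates, type = "response")
win_prob
```

With `type = "response"` the predictions come back as probabilities between 0 and 1, which you can then cut at a threshold such as 0.5 to get Win/Lose.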

What are Recommender Systems?

A subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product. Recommender systems are widely used in movies, news, research articles, products, social tags, music, etc.

Why does data cleaning play a vital role in the analysis?

Cleaning data from multiple sources to transform it into a format that data analysts or data scientists can work with is a cumbersome process, because as the number of data sources increases, the time taken to clean the data grows steeply with the number of sources and the volume of data they generate. Cleaning alone might take up to 80% of the time, which makes it a critical part of the analysis task.

Differentiate between univariate, bivariate and multivariate analysis.

These are descriptive statistical analysis techniques which can be differentiated based on the number of variables involved at a given point in time. For example, a pie chart of sales by territory involves only one variable and can be referred to as univariate analysis.

If the analysis attempts to understand the relationship between 2 variables at a time, as in a scatterplot, then it is referred to as bivariate analysis. For example, analyzing the volume of sales against spending can be considered bivariate analysis. Analysis that involves more than two variables at a time is multivariate analysis.

What do you understand by the term Normal Distribution?

Data can be distributed in different ways, with a bias to the left or to the right, or it can all be jumbled up. However, data can also be distributed around a central value without any bias to the left or right, in which case it follows a normal distribution: the values of the random variable form a symmetrical bell-shaped curve.

What is Linear Regression?

Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.

What are Interpolation and Extrapolation?

Interpolation is estimating a value that lies between 2 known values in a list of values. Extrapolation is approximating a value by extending a known set of values or facts beyond its range.
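A tiny base R sketch of the difference, using four made-up points that lie on a straight line. `approx()` does linear interpolation between known points, and a fitted `lm()` line is one simple way to extrapolate beyond them.

```r
# Known points (hypothetical): y grows by 2 for every step in x
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)

# Interpolation: estimate y at x = 2.5, between two known values
interpolated <- approx(x, y, xout = 2.5)$y

# Extrapolation: extend the fitted line to x = 6, outside the known range
fit <- lm(y ~ x)
extrapolated <- unname(predict(fit, newdata = data.frame(x = 6)))

c(interpolated = interpolated, extrapolated = extrapolated)
```

Extrapolated values deserve more caution, because nothing in the data confirms that the pattern continues outside the observed range.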

What is power analysis?

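An experimental design technique for determining the sample size needed to detect an effect of a given size with a given level of confidence, or conversely the power that a given sample size provides.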
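Base R ships `power.t.test()` for this. The sketch below asks how many observations per group a two-sample t-test needs; the effect size, standard deviation, significance level and target power are assumed values chosen for illustration.

```r
# How many observations per group to detect a difference of 0.5
# (with standard deviation 1) at 5% significance and 80% power?
result <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
result

# Round up, since you cannot recruit a fraction of a participant
n_per_group <- ceiling(result$n)
n_per_group
```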

What is Collaborative filtering?

The filtering process used by most recommender systems to find patterns or information by combining the viewpoints of many users, various data sources, and multiple agents.

Are expected value and mean value different?

They are not different, but the terms are used in different contexts. Mean is generally used when talking about a probability distribution or a sample population, whereas expected value is generally used in the context of a random variable.

Do gradient descent methods always converge to the same point?

No, they do not, because in some cases they reach a local minimum or a local optimum and never get to the global optimum. It depends on the data and the starting conditions.

For more such questions, do give these books a try
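To see the effect of starting conditions, here is a small sketch in R. The function f(x) = x^4 - 3x^2 + x is an assumed toy example with two separate minima; the learning rate and step count are likewise arbitrary choices.

```r
# Derivative of the toy function f(x) = x^4 - 3x^2 + x
f_grad <- function(x) 4 * x^3 - 6 * x + 1

gradient_descent <- function(x0, learning_rate = 0.01, steps = 1000) {
  x <- x0
  for (i in seq_len(steps)) {
    x <- x - learning_rate * f_grad(x)   # step against the gradient
  }
  x
}

# Same method, two starting points, two different end points
left  <- gradient_descent(-2)
right <- gradient_descent(2)
c(start_at_minus_2 = left, start_at_plus_2 = right)
```

Both runs stop where the gradient is zero, but in different valleys, and only one of them is the global minimum.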

100 Questions to crack data science interview

100 Questions to crack business analyst interview

## Is data science a risky career opportunity ?

Many people see a data science career as an easy path to wealth, fame, and glory, but the reality is that data science is hard to master, although it is also one of the most worthwhile things to do.

• You need to know some math and statistics, since you’ll be analysing data.
• You need some programming skills, since you’ll be writing programs or at least composing queries and scripts to perform that analysis.
• You need communications skills, since your work is likely to be highly collaborative and cross-functional.

Here’s a complex flowchart showing what a data scientist basically does:

Before you dive into the data science wonderland, you must know that you are going to deal with these concepts. The flowchart is blurred because I don’t want you to get demotivated thinking about the hefty work ahead.

1. Fundamentals
2. Statistics
3. Programming
4. Machine Learning
5. Text Mining / Natural Language Processing
6. Data Visualisation
7. Big Data
8. Data Ingestion
9. Data Munging
10. Toolbox

Each area / domain is represented as a “metro line”, with the stations depicting the topics you must learn / master / understand in a progressive fashion. The idea is you pick a line, catch a train and go through all the stations (topics) till you reach the final destination (or) switch to the next line.

Data-savvy youngsters who are thinking about where to take their skills may want to take note. There is little risk if what you want is more job opportunities, and perhaps more job security: becoming a data scientist might be the better career choice. But remember that unless analytics drives business impact, it is not analytics, it is just statistics.

With its phenomenal growth, the significance of big data will only grow bigger. The stack of data will keep piling up at a quick pace, and it is anticipated that our capability to turn big data into structured information that businesses can use will likewise improve dramatically in the coming years.

It is risky because the data science field is relatively young and evolving fast, which could make some skills obsolete or less useful in rather disruptive ways.

It is not risky because the demand for data scientists (with different skills) from many different industries will stay very strong in the near future.

If the job were easy, there wouldn’t be such a demand for people who can do it. But if you are a fresher with the right aptitude and attitude, the rest can all be learned.

And if you have the skills and have already proven yourself, just knuckle down and get the work done.

Keep Learning.

## What is data science and why it’s important for you?

Data science is a multidisciplinary blend of data inference, algorithm development, and technology, used to solve relatively complex problems.

At the core is data. Stacks of raw information, streaming in and stored in enterprise data warehouses. Much to learn by mining it. Advanced capabilities we can build with it. Data science is ultimately about using this data in creative ways to generate business value and eventually make something out of it.

Data science and discovery of data insight

This aspect of data science is all about uncovering findings from data: diving in at a raw level to mine and understand complex behaviours, trends, and inferences. It’s about surfacing hidden insight that can help companies make smarter business decisions and increase their profit. For example:

Netflix mines movie viewing patterns to understand what drives user interest, and uses that to decide which Netflix original series to produce and which to follow up with a sequel.

Target identifies the major customer segments within its base and the unique shopping behaviours within those segments, which helps to guide messaging to different market audiences.

Procter & Gamble utilises time series models to understand future demand more clearly, which helps it plan production levels more optimally.

How do data scientists mine out insights? It starts with data exploration. When given a challenging question, data scientists become detectives. They investigate leads and try to understand patterns or characteristics within the data. This requires a big dose of analytical creativity.

Then, as needed, data scientists may apply quantitative techniques in order to go a level deeper, e.g. inferential models, segmentation analysis, time series forecasting, synthetic control experiments, etc. The intent is to scientifically piece together a forensic view of what the data is really saying.

This data-driven insight is central to providing strategic guidance. In this sense, data scientists act as consultants, guiding business stakeholders on how to act on findings.

How data mining and sorting algorithms find and shape your decisions

Amazon’s recommendation engines suggest items for you to buy, determined by their complex algorithms. Netflix recommends movies to you. Spotify recommends music to you, and so on.

Gmail’s spam filter is a data product: an algorithm behind the scenes processes incoming mail and decides whether a message is junk or not.

Computer vision used for self-driving cars is also a data product: machine learning algorithms are able to recognise traffic lights, other cars on the road, pedestrians, or any other obstacle.

Data scientists play a central role in developing data products. This involves building out algorithms, as well as testing, refinement, and technical deployment into production systems. In this sense, data scientists serve as technical developers, building assets that can be leveraged at wide scale.

## 100 Questions to Crack Business Analyst Interview

What is the difference between data science and business analysis?

Which profile are you appearing for in your interview next week?

Data science is about dealing with data, getting insights from it and changing the way your client has been using it. Business analysis is about understanding both the data and the business. As a business analyst you are supposed to take a more holistic view of the problem and solve for the bigger picture.

The data science and business analytics job profiles might look the same, but they are actually quite different.

100 Questions To Crack Business Interview contains exactly 100 questions which you should try to master before going through any Business Analyst, Big Data or Data Science interview. The questions are in random order, so you have to cover all 30 pages without skipping any. This book will be a great help in preparing for any interview on a related topic.

Know the difference between the two – Here

Thanks,

The Data Monk