Basic Statistics

Data Science is 60% Mathematics and 40% coding and other things. Obviously, the numbers are random, but it makes complete sense.
You need to defend your analysis with statistics.
On a serious note, your honeymoon period is over.
The statistics material will require more concentration. We will try to make sure that you understand each concept inside out, but if you are unable to grasp something, feel free to comment or google it. There are plenty of good resources on the internet, so try those out as well. We will also post useful links wherever required to help you understand things better.

Statistics has a very basic and effective definition:

“It is a science of collection, analysis, interpretation, presentation, and organization of data to help in making EFFECTIVE DECISION”

Suppose you sell books online on Kindle (much like TheDataMonk). You have a set of customers who have bought your book, and you know their email IDs. One day you decide to run a campaign and email those customers, maybe with some discount coupon codes. You can’t send a coupon to everyone, so you select a group of customers for this campaign. To know how effective the campaign was, you need to apply statistics to your organized data, which helps you make a better decision in the future.

One more example – You told your client that you will be using a particular algorithm for a problem. You already know that this algorithm gives good results, but you still need to prove that to the client. In this situation, statistics will come to your rescue.

Basically, there are two types of statistics:-
1. Descriptive Statistics – It is a method of organizing, summarizing, and presenting data. If you are answering questions like the ones below, then it’s a part of descriptive statistics:-
a. How many SQL books were sold?
b. How many unique visitors come to the website daily?
c. How many pages of the books were read on Kindle?
d. What value is the middle of the range of revenue from Kindle?

Graphical representation is also a part of descriptive statistics. Histograms, pie charts, etc. are used in descriptive analysis.

2. Inferential Statistics – To infer means to reach a conclusion from evidence. This part of statistics deals with drawing an inference from a sample and then applying it to the population.
Population is the complete set of data points in an analysis and
Sample is the subset of data points which are randomly picked from the population.

Example – Your office has a capacity of 10000 employees and you need to know the average age of the office. In this case you will not ask each one of them. Instead you will take a random sample of maybe 2000 employees and treat the average of this sample as an estimate of the average age of the office.
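The sampling idea above can be sketched in Python. This is only an illustration under stated assumptions: the ages are synthetic (drawn uniformly between 22 and 60), and the seed and variable names are made up for the example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 10000 employees (synthetic data).
population_ages = [random.randint(22, 60) for _ in range(10000)]

# Random sample of 2000 employees, picked without replacement.
sample_ages = random.sample(population_ages, 2000)

population_mean = statistics.mean(population_ages)  # summarizes everyone
sample_mean = statistics.mean(sample_ages)          # summarizes the sample only

print(f"Population mean age: {population_mean:.2f}")
print(f"Sample mean age:     {sample_mean:.2f}")
```

The two averages should land close to each other, which is why asking 2000 people can stand in for asking all 10000.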

Some of the basic statistical terms are given below:-
1. Experiment – It is a planned activity which is carried out on a sample to get a result.
2. Parameter – A numerical value which summarizes the entire population
3. Variable – A characteristic of each individual item in a population. Example – salary, age, experience, etc. of an employee
4. Statistic – A numerical value which summarizes the sample data
5. Data – A value associated with one element of a population

If we consider our book selling website, we can assume that the variables here are –
a. Number of people reading a book online
b. Number of people buying a book online
c. Number of people putting it in the cart but not buying

The above are examples of variables.

Now, if I say we sell around 30 books per day, then that’s data. The campaign which we discussed is an experiment.
A parameter is something which summarizes the whole population. For example, what is the average number of unique visitors on the website www.thedatamonk.com?
What is the average time spent on the website by a visitor?
These parameters take every data point into consideration.

Now, try to understand the term statistic. You work in a 14-floor building and you need to know the average age of the office. You take 50 employees from each floor and compute their average age. Inferential statistics lets you treat this sample average as an estimate of the average age of the population.
A statistic is the mean, median, variance, etc. of this sample alone.

Variable is one of the most important terms used in statistics. Variables are of two types:-
a. Quantitative -> Measurable. Example – Time spent by a user on the website. We can divide measurable quantities into two parts:-
  1. Discrete – These are countable whole numbers like the number of users visiting a website or the number of cars owned by a businessman
  2. Continuous – These are real numbers like the weight of a person (88.3 Kg), the power of a lens, etc.
b. Qualitative -> Non-measurable. Example – The country a visitor comes from. We can divide non-measurable quantities into two parts:-
  1. Nominal – Suppose you ask a sample their favorite color and you get three responses – Red, Green, and Black. You can replace these colors with numbers like 1, 2, and 3, but you can’t say that 3 is better than 1 or more than 1.
  2. Ordinal – A Zomato delivery person brings your food and you rate him 5. The next day a different person delivers your food and you rate him 2. It does mean that the first person was better, but it does not mean that the first one was 2.5 times better than the second.
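A rough Python sketch of how these variable types behave in practice is given below. All the values and variable names are made up for illustration. The point is only that counting works for nominal data, ordering works for ordinal data, and arithmetic works for quantitative data.

```python
from collections import Counter
import statistics

# Nominal: labels with no order, so counting is the only sensible summary.
favorite_colors = ["Red", "Green", "Black", "Red", "Green", "Red"]
color_counts = Counter(favorite_colors)
most_common_color = color_counts.most_common(1)[0][0]

# Ordinal: ratings have an order, so sorting and the median make sense,
# but the gaps between ratings are not guaranteed to be equal.
delivery_ratings = [5, 2, 4, 4, 3]
median_rating = statistics.median(delivery_ratings)

# Quantitative: discrete values are counts, continuous values are measurements.
daily_visitors = 1200   # discrete
weight_kg = 88.3        # continuous

print(most_common_color, median_rating)
```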

Measure of Central Tendency

When you start exploring data, you will always need something which can summarize your complete dataset. There are three measures in statistics which are used to summarize a dataset:-

1. Mean – You already know that the mean is nothing but the average of the complete dataset. It is also one of the most used terms in the world of statistics.

2. Median – Arrange the numbers in increasing or decreasing order and pick the middle term if the number of terms is odd, or calculate the average of the two middle terms if the number of terms is even. This gives you the median of a dataset.

3. Mode – There are 100 students and they scored in the range 40 to 90. There were 40 students who scored exactly 45 which is the maximum number of students scoring the same marks. 45 will be the Mode of the dataset.
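The three measures can be tried out with Python's built-in statistics module. The marks below are made up for illustration and are not from any real class.

```python
import statistics

# Hypothetical marks of 7 students.
marks = [45, 45, 45, 55, 60, 80, 90]

mean_marks = statistics.mean(marks)      # sum / count = 420 / 7
median_marks = statistics.median(marks)  # middle value of the sorted list
mode_marks = statistics.mode(marks)      # most frequent value

print(mean_marks, median_marks, mode_marks)
```

Here three students scored 45, so 45 is the mode even though both the mean and the median are higher.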

It’s easy to understand the meaning of mean, median, and mode. But there are two questions which need to be answered:-
1. When to use mean, median, and mode?
2. Is there a relationship between these three terms?

Let’s start with the second question. Yes, there is a rough empirical relationship between the three terms, which holds approximately for moderately skewed data:-

                                    Mean – Mode ≈ 3(Mean – Median)

Mean gives you the average of a dataset. So if you want to estimate the bill of the next customer coming to your restaurant, you can safely use the mean of all the bills.

Suppose you are working in a startup and you need to know the salary of the average employee. The salaries of a peon and the CEO will distort the mean, so you can’t use the mean. The median is barely affected by outliers on either side (lower and higher) and presents a better picture of the dataset.
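The effect of an outlier on the mean and the median can be checked with a few made-up salaries. The figures below are hypothetical, with the last value standing in for the CEO.

```python
import statistics

# Hypothetical monthly salaries in a small startup; the last one is the CEO.
salaries = [20000, 30000, 30000, 35000, 40000, 45000, 500000]

mean_salary = statistics.mean(salaries)      # pulled up by the CEO's salary
median_salary = statistics.median(salaries)  # middle value of the sorted list

print(f"Mean:   {mean_salary}")
print(f"Median: {median_salary}")
```

Six of the seven people earn far less than the mean, so the median is the more honest summary here.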

Suppose you are in a class, distributing answer sheets with marks to the students. So far, 60% of the students have scored 80 marks, so we can say that the Mode is 80 and there is a high chance that the next person will get 80 as well.

Use Mean when your data is not skewed by outliers or when the sample is huge.
Use Median when your data could be skewed by outliers or the sample is small.
Use Mode when you need to say the most likely value of the upcoming data point.

One interesting example we found on Quora was that if a lottery ticket costs you 1 dollar, then the mean return will be ~60 cents because of the heavy prize money. But the median and mode will both be zero, which also makes sense considering the very low chance of winning a lottery.
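A toy version of the lottery example can be written in a few lines. The payouts below are invented so that the mean comes out at 60 cents. A real lottery has far more tickets and a far smaller chance of winning.

```python
import statistics

# Hypothetical payouts (in dollars) of 10 tickets that cost 1 dollar each:
# nine tickets win nothing and one ticket wins 6 dollars.
payouts = [0] * 9 + [6]

mean_payout = statistics.mean(payouts)      # pulled up by the single winner
median_payout = statistics.median(payouts)  # what a typical ticket returns
mode_payout = statistics.mode(payouts)      # the most common outcome

print(mean_payout, median_payout, mode_payout)
```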

Next we will go through the Measure of Dispersion.

Measure of dispersion shows how variable the data is i.e. how spread out the distribution is. There are two measures of dispersion:-
1. Range – Difference between the maximum and the minimum value 
2. Variance and Standard Deviation

You should definitely know how to calculate some of the basic terms manually. While coding you will have in-built functions in every language, but it’s good to know the actual meaning of these building blocks of statistics.

Variance is the arithmetic mean of the squared deviation from the mean. This is a very standard definition. We will try to break this down to make it easier to understand and remember.

Take 5 numbers – 1,2,3,4,5
What is the mean? 15/5 = 3
SQUARED DEVIATION from the mean means (1-3)(1-3), (2-3)(2-3), (3-3)(3-3), (4-3)(4-3), (5-3)(5-3), i.e. 4, 1, 0, 1, 4

Now you have to take the mean of the numbers above to compute the variance. One small catch: for a sample, you will have to divide by N-1 instead of N. You should not just learn the formula, but also understand the reason behind it.

In the population variance formula, you divide the sum of the squared deviations by N, whereas in the sample variance formula you divide it by N-1. The reason is given below.
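The manual calculation for the numbers 1 to 5 can be written out in Python and compared with the built-in functions. This is a minimal sketch, and the variable names are our own.

```python
import statistics

numbers = [1, 2, 3, 4, 5]
n = len(numbers)
mean = sum(numbers) / n  # 15 / 5 = 3.0

# Squared deviation of each value from the mean: 4, 1, 0, 1, 4
squared_deviations = [(x - mean) ** 2 for x in numbers]

data_range = max(numbers) - min(numbers)             # 5 - 1 = 4
population_variance = sum(squared_deviations) / n    # 10 / 5 = 2.0
sample_variance = sum(squared_deviations) / (n - 1)  # 10 / 4 = 2.5

# Built-in equivalents: pvariance divides by N, variance divides by N-1.
print(population_variance, statistics.pvariance(numbers))
print(sample_variance, statistics.variance(numbers))
```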

A sample mean is not the same as the population mean because it’s not necessary to have the same values in the sample as in the population.

 Let’s add one more term and make it more confusing 😛

Degree of Freedom – The degree of freedom for a calculation is the number of values in the final calculation of a statistic/parameter that are free to vary.

Suppose there is a dataset with 1000 numbers and the mean of the complete population is 10. Now you select a sample from the population:-

10,12,8,9,12 – The mean here is 10.2. This is understandable as the sample is not a replica of the population but just a representation of it. The sum of the Squared Deviations from the sample mean will be

= (10-10.2)(10-10.2)+(12-10.2)(12-10.2)… = 12.8

Whereas when we use the population mean instead of the sample mean, we get the following value:-
=(10-10)(10-10)+(12-10)(12-10).. = 13

The sum of squared deviations around the sample mean is never larger than the sum around the population mean, so dividing it by N leads to a downward bias in the variance. To counter this, we use N-1 instead of N in the sample variance formula, which corrects the downward bias.
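The two sums from the example can be reproduced with a short Python check. The sample and the population mean of 10 are taken from the text, and the variable names are our own.

```python
sample = [10, 12, 8, 9, 12]
population_mean = 10                     # given for the full population
sample_mean = sum(sample) / len(sample)  # 51 / 5 = 10.2

# Sum of squared deviations around each of the two means.
ss_sample_mean = sum((x - sample_mean) ** 2 for x in sample)
ss_population_mean = sum((x - population_mean) ** 2 for x in sample)

print(round(ss_sample_mean, 2))  # 12.8
print(ss_population_mean)        # 13
```

The sum around the sample mean comes out smaller, which is the downward bias that dividing by N-1 is meant to offset.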

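Why is it downward biased?
The sample mean is computed from the sample itself, so it sits as close as possible to the sample values and makes the sum of squared deviations as small as it can be. The population mean is usually some distance away from the sample mean, so deviations measured from it are larger.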

We can also understand why we are using N-1 instead of N by understanding the concept of Degree of Freedom in both Sample and Population variance.

Suppose you have 4 values and you know their mean. How many of the numbers do you need to know to complete the set of 4 numbers? Example – The numbers are 1, 2, y and x and the mean is 8. Here you just need to know the value of either x or y to find out the complete dataset.
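The same example can be worked through in Python. The value picked for x is arbitrary and only there for illustration. Once it is chosen, y is forced by the mean.

```python
n = 4
mean = 8
total = n * mean  # the four numbers must add up to 32

known = [1, 2]
x = 10                      # hypothetical value chosen for x
y = total - sum(known) - x  # y is no longer free: 32 - 3 - 10 = 19

values = known + [x, y]
print(values, sum(values) / n)
```

With the mean fixed, only N-1 of the N values can vary freely, and the last one is always determined by the rest.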

Since the sample mean is calculated from the sample itself, once that mean is known only N-1 of the N values in the sample are free to vary.
This does not happen with the population variance, since the population mean is a fixed value and not an estimate made from a sample.

Basically, N-1 is used to remove the bias from the sample variance. We use N when calculating the population variance because there is no bias.
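One way to see the bias is a small simulation. The sketch below assumes a normally distributed population with mean 10 and standard deviation 2 (so the true variance is 4). The sample size, number of trials, and seed are arbitrary choices for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_VARIANCE = 4.0  # assumed population: normal, mean 10, std dev 2
SAMPLE_SIZE = 5
TRIALS = 20000

sum_div_n = 0.0
sum_div_n_minus_1 = 0.0
for _ in range(TRIALS):
    sample = [random.gauss(10, 2) for _ in range(SAMPLE_SIZE)]
    sample_mean = sum(sample) / SAMPLE_SIZE
    ss = sum((x - sample_mean) ** 2 for x in sample)
    sum_div_n += ss / SAMPLE_SIZE                # divide by N
    sum_div_n_minus_1 += ss / (SAMPLE_SIZE - 1)  # divide by N-1

avg_div_n = sum_div_n / TRIALS
avg_div_n_minus_1 = sum_div_n_minus_1 / TRIALS

print(f"True variance:            {TRUE_VARIANCE}")
print(f"Average when dividing N:   {avg_div_n:.2f}")
print(f"Average when dividing N-1: {avg_div_n_minus_1:.2f}")
```

Averaged over many samples, dividing by N should settle noticeably below 4, while dividing by N-1 should settle close to 4.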

Standard Deviation is nothing but the square root of the variance

Mu is the Population Mean
A bar on X is used to denote the Sample Mean

Population Variance is denoted by Sigma square
Sample Variance is denoted by S square

Population Standard Deviation is denoted by Sigma
Sample Standard Deviation is denoted by S
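The standard deviation can be computed from the variance in Python as a quick check. The variable names sigma and s follow the notation above, and the numbers are the same 1 to 5 used in the variance example.

```python
import math
import statistics

numbers = [1, 2, 3, 4, 5]

# Population standard deviation: square root of the population variance.
sigma = math.sqrt(statistics.pvariance(numbers))  # sqrt(2)

# Sample standard deviation: square root of the sample variance.
s = math.sqrt(statistics.variance(numbers))       # sqrt(2.5)

# The built-in functions pstdev and stdev do the same thing in one step.
print(sigma, statistics.pstdev(numbers))
print(s, statistics.stdev(numbers))
```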

Try to grasp the basics and only then move forward.

Keep learning 🙂

XtraMous

Author: TheDataMonk

I am the Co-Founder of The Data Monk. I have a total of 6+ years of analytics experience: 3+ years at Mu Sigma, 2 years at OYO, and 1 year and counting at The Data Monk. I am an active trader and a logically sarcastic idiot :)