Why is the variance formula different for a population and a sample?
If you don’t understand the question, look up the terms first: variance, sample, population, and so on. Once you understand the concepts, come back to the question.
The question checks your understanding of degrees of freedom and whether you have understood variance.
When you calculate the variance of a population, the denominator is N, whereas when you calculate the variance of a sample of size n, the denominator is n-1. Why?
Answers ( 18 )
We use N for population variance and n-1 for sample variance (n is the sample size) to get an unbiased sample variance. Variance is generally calculated as sum[(obs – mean)^2]/(number of obs.), but the sample mean usually differs from the population mean, and deviations measured from the sample mean are smaller in total than deviations measured from the population mean. The numerator of the equation is therefore too small for a sample, so dividing by n underestimates the population variance. To correct this we use n-1 instead of n.
The sample mean is an unbiased estimator of the population mean, and it is computed from the sample itself. Once the sample mean is fixed, only n-1 of the n sample points can vary independently; the last value is a linear combination of the sample mean and the other n-1 values, which leaves n-1 degrees of freedom.
Thus, while calculating the variance of a sample, we use n-1 as the denominator.
When we calculate the population variance, we already have the population mean, so there is no constraint that we have to estimate. That is why we divide the sum of squared deviations of the observations from the population mean by N (the population size).
In contrast, when we calculate the sample variance, we first have to find the sample mean, which acts as a constraint, so one degree of freedom is lost from the sample size. That is why we divide the sum of squared deviations of the sample observations from the sample mean by n-1 (sample size - 1).
The n-1 in the sample variance is used to remove bias. Population variance uses the population mean and the population size, whereas for sample variance we use the sample mean: the deviations of the observations are calculated from the sample mean and not from the population mean. Dividing by n would therefore give a biased (too small) estimate of the population variance, and using n-1 corrects this.
When calculating the variance of a sample, we divide by (n-1) instead of n as we do for a population. With a sample we take only a portion of the population, and the spread measured around the sample's own mean tends to underestimate the spread of the whole population, which can be shown by experiment.
To offset this effect, we divide by a slightly smaller value, in this case (n-1). When the denominator shrinks, the value of the variance increases, and on average it comes closer to the value of the population variance.
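One way to run such an experiment exactly, rather than by random draws, is to list every possible sample from a tiny population. The sketch below is only an illustration: the population of six values and the sample size of 2 are made up, and it assumes samples are drawn with replacement so that the draws are independent.

```python
from fractions import Fraction
from itertools import product


def mean(values):
    """Exact mean as a Fraction."""
    return Fraction(sum(values), len(values))


def sum_sq_dev(values):
    """Sum of squared deviations of the values from their own mean."""
    m = mean(values)
    return sum((Fraction(v) - m) ** 2 for v in values)


# Hypothetical population, chosen only for illustration.
population = [1, 2, 3, 4, 5, 6]
pop_var = sum_sq_dev(population) / len(population)

# Every ordered sample of size n, drawn with replacement (an assumption).
n = 2
samples = list(product(population, repeat=n))

# Average the two candidate formulas over all possible samples.
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(pop_var)            # 35/12
print(avg_div_n)          # 35/24, too small
print(avg_div_n_minus_1)  # 35/12, matches the population variance
```

Under these assumptions the n-1 version averages out to the population variance exactly, while the n version comes out smaller by the factor (n-1)/n.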
For population variance we use the full population, and degrees of freedom do not come into play. For sample variance, n-1 degrees of freedom come into play because we work with a sample, which is smaller than the population, and measure deviations from the sample mean. The reason behind the denominator n-1 is that the first n-1 observations are free to vary independently, while the last one can be calculated from those n-1 observations once the mean is given. Note that, for the same data, dividing by n-1 gives a larger value than dividing by n.
Degrees of freedom are defined as the maximum number of logically independent values, that is, values that have the freedom to vary, in the data sample.
For example, suppose we have a table containing data on 6 people. We know their mean height and the heights of 5 of them. We can calculate the height of the 6th person from the mean height and the heights of the other 5. Here there is 1 constraint, so the degrees of freedom are n-1, i.e., 6-1 = 5.
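The height example can be worked through in a few lines. This is a minimal sketch: the five known heights, the mean, and the helper name `missing_value` are all invented for illustration.

```python
def missing_value(known_values, overall_mean):
    """Recover the one unknown value from the mean and the other values.

    The mean of n values fixes their total at n * mean, so the last value
    is whatever is left after subtracting the known ones.
    """
    n = len(known_values) + 1
    return n * overall_mean - sum(known_values)


# Hypothetical data: heights in cm of 5 people and the mean of all 6.
known_heights = [160.0, 165.0, 170.0, 175.0, 180.0]
mean_height = 171.0

sixth_height = missing_value(known_heights, mean_height)
print(sixth_height)  # 176.0
```

The sixth height is not free to vary: once the mean and five heights are known, it is fully determined.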
So, when we calculate the sample variance, we use sample mean and (n-1) instead of n. We want sample variance to be an unbiased estimator of population variance.
Mathematical fact:
The deviations around the sample mean tend to be a bit smaller than the deviations around the population mean. So, if we use n instead of n-1, we would be underestimating the population variance.
So, we use (n-1) instead of n, while calculating the sample variance to get an unbiased estimate of population variance.
Variance is defined as the sum of squared distances of the observations from the mean, divided by the total number of observations.
Within a sample, the sum of squared distances from the population mean is always greater than or equal to the sum of squared distances from the sample mean. Calculating the sample variance in the same way as the population variance will therefore give an underestimated result (especially in cases where the population mean lies outside the range of the sample observations).
So, to get an unbiased sample variance, we divide the sum of squared distances by the number of observations minus 1 (n-1) instead of n.
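The inequality behind this argument can be written out. The notation is an assumption added here, not part of the answer: x_1, ..., x_n are the sample observations, \bar{x} is the sample mean, and \mu is the population mean.

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \mu)^2
  &= \sum_{i=1}^{n} \bigl[(x_i - \bar{x}) + (\bar{x} - \mu)\bigr]^2 \\
  &= \sum_{i=1}^{n} (x_i - \bar{x})^2
     + 2(\bar{x} - \mu)\sum_{i=1}^{n} (x_i - \bar{x})
     + n(\bar{x} - \mu)^2 \\
  &= \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2
  \;\ge\; \sum_{i=1}^{n} (x_i - \bar{x})^2 .
\end{align*}
```

The middle term vanishes because deviations from the sample mean always sum to zero. The gap between the two sums is n(\bar{x} - \mu)^2, so it grows as the sample mean moves away from the population mean.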
When I calculate population variance, I then divide the sum of squared deviations from the mean by the number of items in the population (in example 1 I was dividing by 12). When I calculate sample variance, I divide it by the number of items in the sample less one. In our example 2, I divide by 99. So (n−1) is a smaller number than (n). When you divide by a smaller number you get a larger number. Therefore when you divide by (n−1) the sample variance will work out to be a larger number.
A sample is a small chunk of the population.
The population can be considered the whole data set on which the study is to be carried out, whereas a sample is a subset of the population to which statistics are applied so that the findings can be generalised to the larger population.
Variance is how the data values are spread around the mean value of the data.
Since a sample does not contain the complete data, in short not all of the population, we deduct 1 from the denominator to compensate for the missing data.
Deducting 1 from the denominator pushes the estimate upward, giving a higher variance value for the sample than dividing by n would.
(sum of squared differences between the values and their mean) / n is the variance of a population
(sum of squared differences between the values and their mean) / (n-1) is the variance of a sample
A sample loses one degree of freedom in describing the distribution, so we use n-1 instead of n as a correction factor
In population variance, there are n degrees of freedom due to the absence of any kind of constraint.
In sample variance, there are (n-1) degrees of freedom because of the presence of one constraint in the formula, namely the sample mean. Since the value of the sample mean is fixed, the number of values that are still free to vary (the degrees of freedom) is (n-1).
Variance is calculated in following 5 steps:
1. Calculate mean
2. Calculate deviations from the mean.
3. Deviations are squared
4. Squared deviations are summed up
5. Sum is divided by number of items
The difference in the formula occurs in 5th step.
For a population we divide by N (the size of the population). For a sample we divide by n-1 (the size of the sample – 1).
This is done to compensate for lack of information in sample data. Due to lack of complete information in a sample, we reduce the denominator to increase the variance.
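The five steps above can be sketched as a single function, with the fifth step switching the divisor. The function name, the `sample` flag, and the data are illustrative choices; `statistics.pvariance` and `statistics.variance` from the Python standard library are used only as a cross-check.

```python
import statistics


def variance(data, sample=False):
    """Variance in five steps; `sample=True` divides by n-1 instead of n."""
    n = len(data)
    if sample and n < 2:
        raise ValueError("sample variance needs at least two values")
    mean = sum(data) / n                      # 1. calculate the mean
    deviations = [x - mean for x in data]     # 2. deviations from the mean
    squared = [d ** 2 for d in deviations]    # 3. square the deviations
    total = sum(squared)                      # 4. sum the squared deviations
    divisor = n - 1 if sample else n          # 5. divide by n or n-1
    return total / divisor


# Hypothetical data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(variance(data))               # 4.0 (divide by 8)
print(variance(data, sample=True))  # 32/7, about 4.571 (divide by 7)

# Cross-check against the standard library.
print(statistics.pvariance(data), statistics.variance(data))
```

Only the last line of the function differs between the two formulas; steps 1 to 4 are identical.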
In statistics, Bessel’s correction is the use of n-1 instead of n in the formula for the sample variance. This correction removes the bias in the estimation of the population variance.
With the n-1 denominator, the sample variance is an unbiased estimator of the population variance.
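A short derivation shows where the n-1 comes from. It assumes, as is usual for this result, that x_1, ..., x_n are independent draws from a population with mean \mu and finite variance \sigma^2, so that the sample mean \bar{x} has variance \sigma^2 / n.

```latex
\begin{align*}
E\!\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]
  &= E\!\left[\sum_{i=1}^{n} (x_i - \mu)^2\right] - n\,E\!\left[(\bar{x} - \mu)^2\right] \\
  &= n\sigma^2 - n \cdot \frac{\sigma^2}{n} \\
  &= (n-1)\,\sigma^2 ,
\end{align*}
\qquad\text{so}\qquad
E\!\left[\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \sigma^2,
\quad
E\!\left[\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2 .
```

Dividing by n gives an estimate that is too small on average by the factor (n-1)/n, and dividing by n-1 removes exactly that factor.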
Why? To get an unbiased sample variance.
Explanation:
(n−1) is a smaller number than (n). When we divide by a smaller number we get a larger number. Therefore when you divide by (n−1) the sample variance will work out to be a larger number.
Dividing by (n−1) makes the sample variance larger by just the amount needed for its average over many samples to equal the true population variance. That is why the version divided by (n−1) is called an unbiased sample variance, whereas the version divided by (n) is called a biased sample variance.
Degrees of freedom refer to the ‘number of values that are free to vary’ .
In the case of the variance of a sample, the sample mean prevents one value of the sample from varying freely, because the values must average to the sample mean. For example, if we know the mean of 3 values is 6, and we choose the first two values randomly, the third one is forced so that the mean matches. Hence, one value is ‘not free to vary’ in the case of sample variance, so we divide by n-1 instead of n. However, this is not the case for population variance.
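The "mean of 3 values is 6" example can be tried with a few random picks. This is a small sketch: the helper name `complete_to_mean`, the seed, and the range of the random values are arbitrary choices made for illustration.

```python
import random


def complete_to_mean(free_values, target_mean):
    """Return the single value that makes the overall mean equal target_mean."""
    n = len(free_values) + 1
    return n * target_mean - sum(free_values)


rng = random.Random(0)  # fixed seed so repeated runs give the same picks
for _ in range(3):
    # Two values chosen freely ...
    first_two = [rng.randint(0, 12), rng.randint(0, 12)]
    # ... and the third is forced by the requirement that the mean is 6.
    third = complete_to_mean(first_two, 6)
    values = first_two + [third]
    deviations = [v - 6 for v in values]
    # The deviations from the mean always sum to zero.
    print(values, deviations, sum(deviations))
```

Whatever the first two values are, the deviations from the mean sum to zero, so only two of the three deviations carry independent information.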
Variance - Variance is the average spread of a data set. It is calculated from the mean of the data set.
Sample - Suppose we want to know the average height of the population. We cannot measure the entire population, so we take sample data and use it to estimate the population value.
Population - The population is the entire group about which we want to draw a conclusion.
Relation between Population and sample?
Degree of freedom
eg: When we calculate the standard deviation of a sample, we have already calculated the sample mean, which reduces the degrees of freedom by 1, from n to n-1.
If we divide by n when computing the sample variance, we underestimate the actual variance: when we take many samples, compute each variance with n as the denominator, and average the results, the average is not close to the population variance.
But when we divide by n-1 and average the sample variances of the different samples, the result is close to the population variance, because the sample variance is no longer underestimated.
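This averaging argument can be checked with a quick simulation. The sketch below makes several assumptions: the population is 1000 made-up values from a normal distribution, samples of size 5 are drawn with replacement, and the seeds and trial count are arbitrary.

```python
import random


def sum_sq_dev(values):
    """Sum of squared deviations of the values from their own mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


def average_sample_variances(population, n, trials, seed=0):
    """Average the n and n-1 versions of the variance over many random samples."""
    rng = random.Random(seed)
    total_div_n = 0.0
    total_div_n_minus_1 = 0.0
    for _ in range(trials):
        # Draw with replacement (an assumption of this sketch).
        sample = [rng.choice(population) for _ in range(n)]
        ss = sum_sq_dev(sample)
        total_div_n += ss / n
        total_div_n_minus_1 += ss / (n - 1)
    return total_div_n / trials, total_div_n_minus_1 / trials


# Hypothetical population.
pop_rng = random.Random(42)
population = [pop_rng.gauss(50, 10) for _ in range(1000)]
true_var = sum_sq_dev(population) / len(population)

avg_n, avg_n_minus_1 = average_sample_variances(population, n=5, trials=20000)

print("population variance:     ", round(true_var, 2))
print("average with n:          ", round(avg_n, 2))
print("average with n-1:        ", round(avg_n_minus_1, 2))
```

With a sample size of 5, the average of the n versions should land near 4/5 of the population variance, while the average of the n-1 versions should land near the population variance itself.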