## Guesstimate 9 : The number of people wearing watches in Bangalore

Ans: Let’s assume the population of Bangalore is 10 million and that today is a working day for every age group.

The assumed age-wise split of the population is:

0–15 yrs: 30%; 15–25 yrs: 20%; 25–50 yrs: 30%; >50 yrs: 20%

The assumed income-wise split is:

Above poverty line: 75%

People below the poverty line have a negligible chance of wearing a watch, so I am eliminating this 25% of the population from every age group.

That leaves 75% of each age group. The 0–15 group (3 million × 75% = 2.25 million) is mostly school children and infants, so I assume only 5% of them are wearing watches today. The count is 2.25 million × 5% = 112,500.

The 15–25 group (2 million × 75% = 1.5 million) is mostly higher-education students who spend most of their time on their mobile phones, so perhaps only 25% wear watches, which gives 1.5 million × 25% = 375,000.

The 25–50 group (3 million × 75% = 2.25 million) is mostly working professionals whose generation only recently entered the smartphone culture, so most of them still wear watches. If 90% of this group are married couples, roughly 20% of the group will be homemakers, who are less likely to be wearing a watch, so we can neglect them. If 90% of the remaining 80% wear watches, the count is (80% × 2.25 million) × 90% = 1,620,000.

In the >50 group (2 million × 75% = 1.5 million), say 25% are still working and 90% of them wear watches, while 20% of the remaining 75% wear watches (many of our grandfathers wear a watch even at home). The count is (25% × 1.5 million) × 90% + (75% × 1.5 million) × 20% = 562,500.

`Total = 112,500 + 375,000 + 1,620,000 + 562,500 = 2,670,000, i.e. about 2.67 million people wearing watches`
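The segment-by-segment arithmetic can be checked with a short script. This is only a sketch of the estimate as stated; the constant and function names are made up for illustration, and every percentage is an assumption taken from the text.

```python
# All inputs are the assumptions stated in the estimate, not real data.
POPULATION = 10_000_000
ABOVE_POVERTY_PCT = 75

# Each entry: (share of population %, share of the group considered %, watch wearers %).
# The >50 group is split into working (25%) and non-working (75%) people.
SEGMENTS = {
    "0-15": [(30, 100, 5)],
    "15-25": [(20, 100, 25)],
    "25-50": [(30, 80, 90)],
    ">50": [(20, 25, 90), (20, 75, 20)],
}


def watch_wearers(population, above_poverty_pct, segments):
    """Return the estimated number of watch wearers per age group."""
    counts = {}
    for group, parts in segments.items():
        total = 0
        for age_pct, sub_pct, wear_pct in parts:
            # People in this age group who are above the poverty line.
            people = population * age_pct * above_poverty_pct // 10_000
            total += people * sub_pct * wear_pct // 10_000
        counts[group] = total
    return counts


counts = watch_wearers(POPULATION, ABOVE_POVERTY_PCT, SEGMENTS)
total_watch_wearers = sum(counts.values())
print(counts)
print(total_watch_wearers)  # 2670000
```

Changing any single percentage in `SEGMENTS` shows how sensitive the final figure is to that assumption.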

## Guesstimate 8 : How many red colour Swift cars are there in Delhi ?

Let’s start with the population of Delhi, which is about 2 crore (20 million).

We will divide this population into two groups-

1. Family(80%) = 0.8*20000000 = 16000000 family members

2. Bachelors(Individuals) (20%) = 0.2*20000000 = 4000000

Number of families(assuming 4 members in each) = 16000000/4 = 4000000.

Guessing that 50% of families have cars, so the number of families with cars = 2000000.

In Delhi, we can assume that 25% of the families belong to high class society so they can afford 2 cars on an average and the rest can afford only one car.

`Therefore number of cars with families = 0.25*2000000*2 + 0.75*2000000*1 = 2500000.`

Now let’s say only 10% of the individual population can afford a single car.

`Therefore, number of cars with individuals = 0.10*4000000 = 400000.`

`So the total number of cars in Delhi can be estimated as 2500000+400000 = 2900000, which can be rounded off to 3000000 for simple calculations.`

Since Maruti is the market leader in the Indian car sector, we can safely assume that 50% of the cars on the roads of Delhi are Marutis, i.e., 1500000 cars.

Swift is one of the most common and affordable models, along with the Alto, WagonR, Omni and 800. So let’s assume there are 200000 Swifts.

White, silver, grey and black are the most common colors, so approximately 75% of Swifts will be in these colors.

Now we are left with 50000 Swifts of other colors. Considering red, yellow, blue, maroon and orange as the possible other colors, and splitting evenly among the five, we can guess that there are nearly 10000 red Swifts in Delhi.
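The chain of assumptions can be written out as a small script. This is a sketch only; the variable names are illustrative, the 200000 Swifts figure is taken directly from the text rather than derived, and the even split across five colors is an assumption.

```python
# Every number below is an assumption from the estimate, not measured data.
population = 20_000_000                      # 2 crore
family_members = population * 80 // 100      # 16,000,000
individuals = population - family_members    # 4,000,000

families = family_members // 4               # 4 members per family
families_with_cars = families * 50 // 100    # 2,000,000

two_car_families = families_with_cars * 25 // 100
one_car_families = families_with_cars - two_car_families
family_cars = two_car_families * 2 + one_car_families

individual_cars = individuals * 10 // 100
total_cars = family_cars + individual_cars   # 2,900,000

rounded_cars = 3_000_000                     # rounded for simple calculations
maruti_cars = rounded_cars * 50 // 100       # 1,500,000

swifts = 200_000                             # assumed directly in the text
other_color_swifts = swifts * 25 // 100      # 75% are white/silver/grey/black
red_swifts = other_color_swifts // 5         # five other colors, split evenly

print(total_cars, maruti_cars, red_swifts)   # 2900000 1500000 10000
```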

Keep Learning 🙂

The Data Monk

## Guesstimate 7 : How to estimate the number of ambulances on the road ?

Ans: Let’s start with the population of the country, ~1.3 billion (1,300 million).

Rural: 70% ≈ 900 million; Urban: 30% ≈ 400 million

Let’s divide the urban population into three groups based on income level: Low | High | Upper High

I would divide it as Low: 30%, High: 50%, Upper High: 20%.

So Urban Low: 120 million, High: 200 million, Upper High: 80 million.

Out of Urban Low, I would assume 10% have driving as their occupation (given the rise of taxi services etc., and considering only four-wheeler drivers) = 12 million (1,20,00,000).

Out of these, I would assume 1–1.5% are ambulance drivers; taking 1% gives 120,000.

So we can say almost 1,20,000 ambulances are available in the country at any time.

Ambulances are generally available on call, which means at least half of them are always on standby. Let’s assume that 70% of them are idle at any given point in time.

Also, since their average journey time is only 20–25 minutes, we can say that not more than 10% of ambulances would be on the road at any time.

So 10% are on the road at any time = 12,000 ambulances on the road across urban India.
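The funnel from urban population down to ambulances on the road can be sketched as follows. The variable names are illustrative, the urban population uses the rounded 400 million from the text, and the 1% ambulance-driver share is the lower end of the assumed 1–1.5% range.

```python
# All shares are assumptions from the estimate above.
urban_population = 400_000_000              # ~30% of 1.3 billion, rounded

urban_low = urban_population * 30 // 100    # low-income urban group
drivers = urban_low * 10 // 100             # four-wheeler drivers
ambulances = drivers * 1 // 100             # 1% of drivers drive ambulances
on_road = ambulances * 10 // 100            # at most 10% on the road at once

print(urban_low, drivers, ambulances, on_road)
# 120000000 12000000 120000 12000
```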

## Guesstimate 6 : How do we estimate the area of an Airport ?

Ans: Let’s assume the airport has 5 runways and can accommodate 10 planes at once.

Average length of a runway = 2000 m

Width of a runway = 50 m

The total area of the runways is 2000*50*5 = 500000 sq.m

The length of an average plane is assumed to be 50 m and its wingspan 40 m, so one plane occupies about 50*40 = 2000 sq.m and 10 planes occupy 20000 sq.m.

Total area of runways and planes = 500000+20000 = 520000 sq.m

Now an assumption: the total area occupied by the buildings is equal to the area of the runways and planes, i.e. 520000 sq.m

Total built-up area of the airport = 520000+520000 = 1040000 sq.m.

There is a lot of empty space at an airport, so we assume it makes up 40% of the total. Hence the final area = 1040000/0.6 ≈ 1700000 sq.m.

You can think of different methods to calculate the same thing; this is just one quick way to estimate the area of an airport.
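The area calculation can be reproduced in a few lines. This is a sketch with illustrative variable names; it assumes the built-up area is runways plus planes plus an equal area of buildings (1,040,000 sq.m), and that empty space is 40% of the final total.

```python
# Dimensions and counts are the assumptions from the estimate above.
runway_area = 2000 * 50 * 5          # 5 runways of 2000 m x 50 m
plane_area = 50 * 40 * 10            # 10 planes of 50 m x 40 m
runways_and_planes = runway_area + plane_area

building_area = runways_and_planes   # assumed equal to runways + planes
built_up_area = runways_and_planes + building_area

# Empty space is assumed to be 40% of the total, so built-up area is 60%.
total_area = built_up_area * 100 / 60

print(built_up_area)                 # 1040000
print(round(total_area))             # 1733333, i.e. about 1.7 million sq.m
```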

## Guesstimate 5 : How much is the surf excel detergent usage in a day in India?

Approach:
India has a population of approx. 1.2B people.

About 20% are below the poverty line (BPL) and would therefore not use Surf Excel. Remaining population: 0.8*1.2B = 0.96B people.

Assuming a family of 4 people, that is 0.24B families.

Rural:Urban = 70:30 (0.168B:0.072B)

Assuming only about 10% of families use Surf Excel in rural areas, due to the availability of cheaper alternatives, that will be about 16M families.

Due to competition and the availability of substitutes in urban areas, assuming Surf Excel has a market share of 40%, that will be about 28M families.

Total user base: 44M families.

`Assuming each family uses at least 10 grams a day, total usage = 440 million grams (about 440 tonnes) of Surf Excel every day.`
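A quick script makes the rounding in this estimate visible. It is a sketch that assumes a 70:30 rural:urban split of families, which is the split that reproduces the 16M and 28M figures, and 10 grams per family per day. Without the intermediate rounding the total comes out slightly higher than 440 million grams.

```python
# All shares are assumptions from the estimate; names are illustrative.
population = 1_200_000_000
above_poverty = population * 80 // 100        # 960,000,000 people
families = above_poverty // 4                 # 240,000,000 families

rural_families = families * 70 // 100         # 168,000,000
urban_families = families - rural_families    # 72,000,000

rural_users = rural_families * 10 // 100      # 16,800,000 (text rounds to 16M)
urban_users = urban_families * 40 // 100      # 28,800,000 (text rounds to 28M)
user_families = rural_users + urban_users

grams_per_family_per_day = 10
daily_grams = user_families * grams_per_family_per_day

print(user_families)   # 45600000
print(daily_grams)     # 456000000, close to the rounded 440 million grams
```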

## Guesstimate 4 : Estimate the total length of roads in your city

Approach:

If we take Mumbai and a courier company such as Blue Dart, assume it takes 1 delivery truck from the regional distribution centre to cover a region like Andheri.

Andheri is assumed to be approximately 25 km^2 in area.

A delivery truck driver would drive for 7 hours a day at a speed of 30–40 kmph (Andheri being congested).

`So kilometers he clocks in a day: 30*7 = 210 km.`

Generally the delivery schedule is planned in such a way that all deliveries happen along the most efficient route. (Operations teams at the firm are paid well to ensure this, and technology now does much of the work.)

`So assuming an up and down journey, I would take the length of the roadways in the Andheri area as 210/2 = 105 km.`

`But allowing for some redundancy during the travel, I would take a 70% uniqueness factor. That gives us ~70 km.`

`Also some of the areas would be covered by bike delivery personnel: say 30% of what the truck covers (as bikers cover short distances) = ~20 km.`

`Also 10–15 km for postman/walking delivery personnel: ~10 km.`

`So total road length for Andheri ~ 100 km.`

`Now, Mumbai city area: 600 km^2. So Andheri-like regions = 600 km^2 / 25 km^2 = 24.`

`Assuming a coverage ratio of the courier service of 90%, regions covered: ~22.`

`So total road length ~ 22*100 = 2200 km.`
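The same steps, including the rounding to the nearest 10 km, can be sketched in code. The helper `round_to` and all variable names are illustrative, and every input is an assumption from the approach above.

```python
def round_to(value, step):
    """Round value to the nearest multiple of step."""
    return round(value / step) * step


# Truck: 7 hours at 30 kmph, half of it counted as unique one-way distance.
daily_km = 30 * 7                                  # 210
one_way_km = daily_km / 2                          # 105.0
truck_km = round_to(one_way_km * 70 / 100, 10)     # 73.5 -> 70

# Bikes cover about 30% of what the truck covers; walkers about 10 km.
bike_km = round_to(truck_km * 30 / 100, 10)        # 21 -> 20
walker_km = 10
region_km = truck_km + bike_km + walker_km         # 100

# Scale up from one Andheri-sized region to the whole city.
regions = 600 / 25                                 # 24 regions of 25 km^2
covered_regions = round(regions * 90 / 100)        # 21.6 -> 22
total_km = covered_regions * region_km

print(region_km, covered_regions, total_km)        # 100 22 2200
```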


## Guesstimate 3 – How many paan shops are there in India ?

Ans: We will use the method of demand and supply.

Total population of India = 1.2bn or 1200 Mn

Male ≈ 700 Mn & Female ≈ 600 Mn (roughly 900 women per 1000 men; both figures are rounded up)

The ratio of female to male paan consumption is very small.

Females in India who consume paan on a regular basis = 2% = 600Mn*.02 -> 12Mn

Divide males on the basis of age groups:

`0–15 -> Neglect this section`

`16–22 -> 15% of 700Mn = 105Mn -> 5Mn (this is the college age, and going by my own college, about 5 students out of 100 consume paan)`

`23–50 -> 35% of 700Mn ≈ 240Mn -> 20% of people consume -> .2*240 -> 48Mn`

`51–80 -> 20% of 700Mn = 140Mn -> older age, with problems such as bad teeth -> .05*140 -> 7Mn`

`Total demand for paan = 12+5+48+7 = 72 Mn (assuming each consumer has one paan a day)`

Using supply:

Time taken to make 1 paan = 2 min -> 30 in 1 hour

Let’s say a paan shop is open for 10 hours -> 30*10 = 300 paan a day

Number of paan shops in India = 72 Mn / 300 = 2,40,000 ~ 2.5 lakh
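The demand and supply sides can be put into one short script. This is a sketch using the rounded group sizes from the estimate (in millions); the variable names are illustrative, and it assumes each consumer has one paan a day.

```python
# Group sizes in millions and consumption shares are assumptions from the text.
female_consumers = round(600 * 0.02)      # 2% of 600 Mn women -> 12
college_consumers = round(105 * 0.05)     # 5% of 105 Mn (ages 16-22) -> 5
adult_consumers = round(240 * 0.20)       # 20% of ~240 Mn (ages 23-50) -> 48
senior_consumers = round(140 * 0.05)      # 5% of 140 Mn (ages 51-80) -> 7

demand_mn = (female_consumers + college_consumers
             + adult_consumers + senior_consumers)      # 72 Mn paan a day

# Supply: one paan every 2 minutes, shop open 10 hours a day.
paan_per_hour = 60 // 2
paan_per_shop_per_day = paan_per_hour * 10              # 300

shops = demand_mn * 1_000_000 // paan_per_shop_per_day
print(demand_mn, paan_per_shop_per_day, shops)          # 72 300 240000
```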

## Guesstimate 1 – Number of Office chairs sold in India

Q1. Estimate the number of office chairs sold in India.

Approach:-

To estimate the number of office chairs sold in India, we will start by estimating the working population of India.

Let’s say that, out of a population of 120 crore, 60% are above the age of 15, i.e. 72 crore.
Approximately 50% of them are employed, which gives a working population of 36 crore.

Let’s now assume that 20% of the working population works from an office (this excludes farmers, street vendors, drivers, etc.).
That means approximately 7.2 crore people in India work from an office.

Thus, there are about 7 crore office chairs in use in India. Now we can either assume the time frame or ask the interviewer.

If we assume a time frame of one year, and that roughly one chair in ten is bought or replaced each year (the rate implied by the final figure), we can calculate the chairs sold in a year as 10% of 7 crore.

Hence, we can conclude that approximately 70 lakh office chairs are sold in India every year.
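The estimate can be traced step by step in code. This is a sketch with illustrative names; the 10% yearly replacement rate is not stated in the approach and is only the rate implied by going from about 7 crore chairs in use to about 70 lakh sold.

```python
# 1 crore = 10,000,000 and 1 lakh = 100,000.
population = 1_200_000_000                     # 120 crore
above_15 = population * 60 // 100              # 72 crore
employed = above_15 * 50 // 100                # 36 crore
office_workers = employed * 20 // 100          # 7.2 crore

chairs_in_use = office_workers                 # one chair per office worker

# Assumption implied by the final figure: about 10% are bought per year.
yearly_sales = chairs_in_use * 10 // 100

print(office_workers)   # 72000000
print(yearly_sales)     # 7200000, i.e. roughly 70 lakh
```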

Q2. How many total Gmail users are there in India?

Ans: The population of India is about 1,300,000,000 (1.3 billion).

Internet penetration in India is assumed to be 30%, which gives 390,000,000 internet users.

Population distribution:

• 0-20 : 25%
• 21-60 : 60%
• >60 : 15%

People aged 0-20 using the internet = 0.25*390000000 = 97500000 ≈ 100000000.

Similarly, people aged 21-60 and >60 using the internet will be about 230000000 and 60000000 respectively.

People aged 0-20 usually do not need an email account, but we may assume that 5% of them have one. Therefore email users in the 0-20 age group = 0.05*100000000 = 5000000.

The major proportion of email users are employed people. Assume the employment rate in the 21-60 group is 75%; not all of them need an email account, so we will further assume that 75% of employed people use one.

Therefore email users aged 21-60 = 0.75*0.75*230000000 ≈ 130000000.

Out of the 60000000 people aged above 60, assume 25% use an email account. So email users in this age group = 0.25*60000000 = 15000000.

So total email users in India = 5000000 + 130000000 + 15000000 = 150000000.

Gmail is free and the most prominent email service provider, so assume nearly 85% of email users use Gmail.

Therefore the number of Gmail users in India = 0.85*150000000 = 127500000.
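The calculation, including the rounding to the nearest 10 million at each step, can be sketched as below. The helper `round_to_10m` and the variable names are illustrative, and all percentages are the assumptions made in the answer.

```python
TEN_MILLION = 10_000_000


def round_to_10m(value):
    """Round to the nearest 10 million, as the answer does at each step."""
    return round(value / TEN_MILLION) * TEN_MILLION


population = 1_300_000_000
internet_users = population * 30 // 100                         # 390,000,000

# Internet users per age group, rounded as in the answer.
young_online = round_to_10m(internet_users * 25 // 100)         # 100,000,000
adult_online = round_to_10m(internet_users * 60 // 100)         # 230,000,000
senior_online = round_to_10m(internet_users * 15 // 100)        # 60,000,000

# Email users per age group.
young_email = young_online * 5 // 100                           # 5,000,000
adult_email = round_to_10m(adult_online * 75 * 75 // 10_000)    # 130,000,000
senior_email = senior_online * 25 // 100                        # 15,000,000

email_users = young_email + adult_email + senior_email          # 150,000,000
gmail_users = email_users * 85 // 100

print(email_users, gmail_users)   # 150000000 127500000
```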

3. Estimate how many items are in a jar.

Ans: We will assume a jar with a circular base and treat it as a simple cylinder, so that we can use the simplest formula to calculate the number of candy pieces.

A real jar changes its diameter along its height, so we will correct for that at the end.

Let’s assume the diameter is 5 cm; hence, the radius is 2.5 cm. The area of the base is r^2 times Pi; therefore, area = 2.5*2.5*3.14 = 19.63 sq. cm.

For the volume, we will assume a cylinder and later add a few pieces.

We take the area and multiply it by the height of the cylinder. The height is again an estimate, so the accuracy of the final result depends on how good we are at estimating.

Say the height of the jar is 7 cm; hence V = 7*19.63 = 137.4 cubic cm.

Let’s use 0.8 cubic cm as the volume of one candy, as that is a typical size. That gives 137.4/0.8 ≈ 172 pieces in the cylinder.

If we split the jar into four sides and count the candy pieces, there are about 25 extra pieces on each side that the cylinder does not account for. So we add 100 to the estimate. Ans: 272 pieces.
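The cylinder approximation can be checked with a few lines of code. This is a sketch; the variable names are illustrative, and the dimensions, the 0.8 cubic cm candy volume and the 25 extra pieces per side are all the assumptions used in the answer.

```python
PI = 3.14                        # the rounded value used in the answer

radius_cm = 5 / 2                # assumed diameter of 5 cm
height_cm = 7                    # assumed height of the jar
candy_volume_cc = 0.8            # assumed volume of one candy

base_area = PI * radius_cm ** 2              # about 19.63 sq. cm
volume = base_area * height_cm               # about 137.4 cubic cm
pieces_in_cylinder = round(volume / candy_volume_cc)

extra_pieces = 4 * 25            # 25 extra pieces on each of four sides
total_pieces = pieces_in_cylinder + extra_pieces

print(pieces_in_cylinder, total_pieces)      # 172 272
```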

## 50 Statistics Questions

Q1. What is a Sample?

A. A data sample is a set of data collected and/or selected from a statistical population by a defined procedure. The elements of a sample are known as sample points, sampling units or observations.

Q2. Define Population.

A. In statistics, a population is the total set of observations that can be made. For example, if we are studying the weight of adult women, the population is the set of weights of all the adult women in the world.

Q3. What is a Data Point?

A. In statistics, a data point (or observation) is a set of one or more measurements on a single member of a statistical population.

Q4. Explain Data Sets.

A. Data sets usually come from actual observations obtained by sampling a statistical population, and each row corresponds to the observations on one element of that population. Data sets may further be generated by algorithms for the purpose of testing certain kinds of software.

Q5. What is meant by the term Inferential Statistics?

A. Inferential statistics use a random sample of data taken from a population to describe and make inferences about the population. Inferential statistics are valuable when examination of each member of an entire population is not convenient or possible.

Q6. Give an example of Inferential Statistics

A. You asked five of your classmates about their height. On the basis of this information, you stated that the average height of all students in your university or college is 67 inches.

Q7. What is Descriptive Statistics?

A. Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can be either a representation of the entire or a sample of a population. Descriptive statistics are broken down into measures of central tendency and measures of variability (spread).

Q8. What is the range of data?

A. It tells us how widely the data in a set is spread. It is defined as the difference between the highest and the lowest value present in the set.

X=[2 3 4 4 3 7 9]

Range(X) % returns (9-2)=7

Q9. Define Measurement.

A. Data can be classified as being on one of four scales:

• nominal
• ordinal
• interval
• ratio

Each level of measurement has some important properties that are useful to know. For example, only the ratio scale has meaningful zeros.

Q10. What is a Nominal Scale?

A. Nominal variables (also called categorical variables) can be placed into categories. They don’t have a numeric value and so cannot be added, subtracted, divided or multiplied. They also have no order; if they appear to have an order, then you probably have ordinal variables instead.

Q11. What is an Ordinal Scale?

A. The ordinal scale contains things that you can place in order. For example, hottest to coldest, lightest to heaviest, richest to poorest. Basically, if you can rank data by 1st, 2nd, 3rd place (and so on), then you have data that’s on an ordinal scale.

Q12. What is an Interval Scale?

A. An interval scale has ordered numbers with meaningful divisions. Temperature is on the interval scale: a difference of 10 degrees between 90 and 100 means the same as 10 degrees between 150 and 160. Compare that to high school ranking (which is ordinal), where the difference between 1st and 2nd might be .01 and between 10th and 11th .5. If you have meaningful divisions, you have something on the interval scale.

Q13. Explain Ratio Scale.

A. The ratio scale is exactly the same as the interval scale with one major difference: zero is meaningful. For example, a height of zero is meaningful (it means you don’t exist). Compare that to a temperature of zero degrees Celsius or Fahrenheit: it exists, but it does not mean there is no temperature.

Q14. What do you mean by Bayesian?

A. Bayesians condition on the data observed and consider the probability distribution on the hypothesis. Bayesian statistics provides us with mathematical tools to rationally update our subjective beliefs in light of new data or evidence.

Q15. What is Frequentist?

A. Frequentists condition on a hypothesis of choice and consider the probability distribution on the data, whether observed or not. Frequentist statistics uses rigid frameworks, the type of frameworks that you learn in basic statistics, like:

Q16. What is P-Value?

A. In statistical significance testing, it is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.
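As a small worked example, consider a made-up experiment: a coin is flipped 10 times and lands heads 9 times, and the null hypothesis is that the coin is fair. The sketch below computes the one-sided p-value exactly from the binomial distribution; the function name and the experiment are illustrative assumptions.

```python
from math import comb


def binomial_p_value(n, k, p=0.5):
    """One-sided p-value: P(X >= k) for X ~ Binomial(n, p) under the null."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# Probability of seeing 9 or more heads in 10 flips of a fair coin.
p_value = binomial_p_value(10, 9)
print(round(p_value, 4))   # 0.0107
```

Since 0.0107 is below the common threshold of 0.05, a result this extreme would be unlikely if the coin were fair.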

Q17. What is a Confidence Interval?

A. A confidence interval is a range of values, calculated from sample data, that is expected to contain the true population parameter in a stated proportion of repeated samples (the confidence level, for example 95%). Confidence intervals measure the degree of uncertainty or certainty in a sampling method.

Q18. Explain Hypothesis Testing.

A. Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population parameter. The methodology employed by the analyst depends on the nature of the data used and the reason for the analysis. Hypothesis testing is used to infer the result of a hypothesis performed on sample data from a larger population.

Q19. What is likelihood?

A. The probability of some observed outcomes given a set of parameter values is regarded as the likelihood of the set of parameter values given the observed outcomes.

Q20. What is sampling?

A. Sampling is that part of statistical practice concerned with the selection of an unbiased or random subset of individual observations within a population of individuals intended to yield some knowledge about the population of concern.

Q21. What are Sampling Methods?

A. There are 4 sampling methods:

• Simple Random
• Systematic
• Cluster
• Stratified

Q22. What is Mode?

A. The mode of a data sample is the element that occurs the greatest number of times in the data collection.

X=[1 2 4 4 4 4 5 5]

Mode(X) % returns 4

Q23. What is Median?

A. It is described as the numeric value that separates the lower half of a sample, population or probability distribution from the upper half. It can be easily calculated by arranging all the values from lowest to highest (or vice versa) and picking the middle one, or the average of the two middle values when the count is even.

X=[2 4 1 3 4 4 3]

X=[1 2 3 3 4 4 4]

Median(x)% return 3

Q24. What is meant by Quartile?

A. It is a type of quantile that divides the data points into four more or less equal parts (quarters). Each quarter contains about 25% of the total observations. Generally, the data is arranged from smallest to largest.

Q25. What is Moment?

A. It is a quantitative measure of the shape of a set of points. Moments are a set of statistical parameters used to describe a distribution. Four moments are commonly used:

• Mean
• Variance
• Skewness
• Kurtosis

Q26. What is the Mean of data?

A. The statistical mean refers to the mean or average that is used to derive the central tendency of the data in question. It is determined by adding all the data points in a population and then dividing the total by the number of points.

X=[1 2 3 3  6]

Sum=1+2+3+3+6=15

Mean(x)%return (sum/5)=3
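The small examples for range, mode, median and mean can be checked with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative and the data sets are the ones used in the answers.

```python
import statistics

# Range: difference between the largest and smallest value.
range_data = [2, 3, 4, 4, 3, 7, 9]
data_range = max(range_data) - min(range_data)

# Mode: the most frequent value.
mode_value = statistics.mode([1, 2, 4, 4, 4, 4, 5, 5])

# Median: the middle value after sorting.
median_value = statistics.median([2, 4, 1, 3, 4, 4, 3])

# Mean: sum of the values divided by their count.
mean_value = statistics.mean([1, 2, 3, 3, 6])

print(data_range, mode_value, median_value, mean_value)   # 7 4 3 3
```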

Q27. Define Skewness.

A. Skewness is a measure of the asymmetry of the data around the sample mean. If it is negative, the data are spread out more to the left of the mean than to the right; if it is positive, the data are spread out more to the right.

Q28. What is Variance?

A. It describes how far the value lies from the Mean. A small variance indicates that the data points tend to be very close to the mean, and to each other. A high variance indicates that the data points are very spread out from the mean, and from one another. Variance is the average of the squared distances from each point to the mean.

Q29. Define Standard Deviation.

A. In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.
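Variance and standard deviation can be computed for the same small data set used in the mean example. The sketch below uses the population versions from the standard `statistics` module, which divide by the number of points; the sample versions (`statistics.variance`, `statistics.stdev`) divide by one less.

```python
import statistics

data = [1, 2, 3, 3, 6]           # mean is 3

# Squared distances from the mean: 4, 1, 0, 0, 9 -> average 14 / 5 = 2.8
population_variance = statistics.pvariance(data)

# Standard deviation is the square root of the variance.
population_std = statistics.pstdev(data)

print(population_variance)
print(round(population_std, 4))  # 1.6733
```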

Q30. What is Kurtosis?

A. Kurtosis is a measure of how outlier-prone a distribution is. In other words, kurtosis identifies whether the tails of a given distribution contain extreme values.

Q31. What is meant by Covariance?

A. Covariance gives a measure of how much two variables change together. In finance, for example, it measures the directional relationship between the returns on two assets: a positive covariance means that the asset returns move together, while a negative covariance means they move inversely. Covariance is calculated by analyzing at-return surprises (deviations from the expected return) or by multiplying the correlation between the two variables by the standard deviation of each variable.

Q32. What is Alternative Hypothesis?

A. The Alternative hypothesis (denoted by H1 ) is the statement that must be true if the null hypothesis is false.

Q33. Explain Significance Level.

A. The probability of rejecting the null hypothesis when it is true is called the significance level α; very common choices are α = 0.05 and α = 0.01.

Q34. Do you know what is Binary search?

A. For binary search, the array should be arranged in ascending or descending order. In each step, the algorithm compares the search key value with the key value of the middle element of the array. If the keys match, then a matching element has been found and its index, or position, is returned. Otherwise, if the search key is less than the middle element’s key, then the algorithm repeats its action on the sub-array to the left of the middle element or, if the search key is greater, on the sub-array to the right.
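The steps described above can be sketched as a short function. It assumes the list is sorted in ascending order and returns -1 when the key is absent; the function name and the return convention are illustrative choices.

```python
def binary_search(items, key):
    """Return the index of key in the ascending-sorted list items, or -1."""
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2
        if items[mid] == key:
            return mid               # keys match: return the position
        if key < items[mid]:
            high = mid - 1           # continue in the left sub-array
        else:
            low = mid + 1            # continue in the right sub-array
    return -1                        # key is not in the list


sorted_values = [1, 3, 5, 7, 9, 11]
print(binary_search(sorted_values, 7))   # 3
print(binary_search(sorted_values, 4))   # -1
```

Each step halves the remaining search range, so the number of comparisons grows with the logarithm of the list length.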

Q35. Explain Hash Table.

A. A hash table is a data structure used to implement an associative array, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.

Q36. What is Null Hypothesis?

A. The null hypothesis (denoted by H0) is a statement about the value of a population parameter (such as the mean). It must contain the condition of equality and must be written with the symbol =, ≤, or ≥.

Q37. When You Are Creating A Statistical Model How Do You Prevent Over-fitting?

A.  It can be prevented by cross-validation

Q38. What do you mean by Cross-validation?

A. Cross-validation is a model validation technique for assessing how the results of a statistical analysis (model) will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.

Q39. What is Linear regression?

A. A linear regression is a good tool for quick predictive analysis: for example, the price of a house depends on a myriad of factors, such as its size or its location. In order to see the relationship between these variables, we need to build a linear regression, which predicts the line of best fit between them and can help conclude whether or not these two factors have a positive or negative relationship.

Q40. What are the assumptions required for linear regression?

A. There are four major assumptions:

1. There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data.
2. The errors or residuals of the data are normally distributed and independent of each other.
3. There is minimal multicollinearity between the explanatory variables.
4. Homoscedasticity: the variance around the regression line is the same for all values of the predictor variable.

Q41. What is Multiple Regression?

A. Multiple regression generally explains the relationship between multiple independent or predictor variables and one dependent or criterion variable.  A dependent variable is modeled as a function of several independent variables with corresponding coefficients, along with the constant term.  Multiple regression requires two or more predictor variables, and this is why it is called multiple regression.

Q42. What is a Statistical Interaction?

A. Basically, an interaction is when the effect of one factor (input variable) on the dependent variable (output variable) differs among levels of another factor.

Q43. What is an example of a data set with a non-Gaussian distribution?

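A. Examples of non-Gaussian data include counts of events in a fixed interval (Poisson), the number of successes in a series of yes/no trials (binomial), and waiting times between events (exponential). The Gaussian distribution is part of the exponential family of distributions, but there are many more distributions in that family with the same sort of ease of use, and a person doing machine learning with a solid grounding in statistics can use them where appropriate.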

Q44. Define Correlation.

A. Correlation is a statistical technique that can show whether and how strongly pairs of variables are related.

For example: height and weight are related; taller people tend to be heavier than shorter people. The relationship isn’t perfect. People of the same height vary in weight, and you can easily think of two people you know where the shorter one is heavier than the taller one. Nonetheless, the average weight of people 5’5” is less than the average weight of people 5’6”, and their average weight is less than that of people 5’7”, etc.

Correlation can tell you just how much of the variation in peoples’ weights is related to their heights.
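The strength of such a relationship is usually summarized by the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it by hand; the function name is illustrative and the height and weight values are made up purely for demonstration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / sqrt(sq_dev_x * sq_dev_y)


# Hypothetical heights (inches) and weights (pounds), not real data.
heights = [63, 65, 66, 67, 69, 71]
weights = [127, 140, 135, 150, 148, 170]

r = pearson_r(heights, weights)
print(round(r, 2))
```

A value close to 1 means taller people in this toy sample tend to be heavier; the relationship is strong but not perfect.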

Q45. What is primary goal of A/B Testing?

A. A/B testing is a statistical hypothesis test comparing two variants, A and B. The primary goal of A/B testing is to identify which changes to a web page maximize or increase the outcome of interest. A/B testing is a useful method for finding the most suitable online promotional and marketing strategies for a business.

Q46. What is meaning of Statistical Power of Sensitivity?

A. The statistical power of sensitivity is used to validate the accuracy of a classifier, which can be logistic regression, SVM, random forest, etc. Sensitivity is the number of correctly predicted true events divided by the total number of actual events.

Q47. Explain Over-fitting.

A. In the case of over-fitting, the model is too complex, for example having too many parameters relative to the number of observations. An overfit model has poor predictive performance because it overreacts to minor fluctuations in the training data.

Q48. Explain Under-fitting

A. In the case of under-fitting, the underlying trend of the data cannot be captured by the statistical model or machine learning algorithm. Such a model also has poor predictive performance.

Q49. What is Long Format Data?

A. In the long format, each row is one time point per subject, so each subject has data in multiple rows. By contrast, data in the wide format can be recognized by the fact that the columns generally represent the groups.

Q50. What is Wide Format Data?

A. In the wide format, the repeated responses of the subject will fall in a single row, and each response will go in a separate column.
