Q1. Estimate the number of office chairs sold in India.
Approach:-
To estimate the number of office chairs sold in India, we start by estimating the working population of India.
Let’s say that, out of India’s population of 120 crore, 60% are above the age of 15, i.e. 72 crore. Approximately 50% of them are employed, which gives a working population of 36 crore.
Let’s now assume that 20% of the working population works from an office (this excludes farmers, street vendors, drivers, etc.). So approximately 7.2 crore people in India work from an office.
Thus, there are about 7 crore existing office chairs in India, assuming one chair per office worker. Now we can either assume the time frame or ask the interviewer.
If we assume a time frame of one year, and that about 10% of the existing chairs are bought or replaced each year (the rate implied by the final figure), we can calculate the chairs sold in a year.
Hence, we can conclude that approximately 70 lakh office chairs are sold in India every year.
Q2. How many total Gmail users are there in India?
Ans: The population of India is about 1300000000.
Internet penetration in India is about 30%, which gives 390000000 internet users.
Population distribution (applied to internet users) –
0–20 : 25%
21–60 : 60%
>60 : 15%
Internet users aged 0–20 = 0.25*390000000 = 97500000 = 100000000 (approx.)
Similarly, internet users aged 21–60 and >60 will be about 230000000 and 60000000 respectively.
People aged 0–20 usually do not need an email account, but we may assume that 5% of them have one. Therefore email users in the 0–20 age group = 0.05*100000000 = 5000000.
The major proportion of email users are employed people. Assume the employment rate in this age group is 75%. Not all employed people need an email account, so we will assume 75% of them use one.
Therefore email users aged 21–60 = 0.75*0.75*230000000 = 130000000 (approx.)
Out of the 60000000 people aged above 60, assume 25% use an email account. So email users in this age group = 0.25*60000000 = 15000000.
So total email users in India = 5000000 + 130000000 + 15000000 = 150000000.
Gmail is free and is the most prominent email service provider, so assume nearly 85% of email users use Gmail.
Therefore number of Gmail users in India = 0.85*150000000 = 127500000.
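The chain of multiplications above can be checked with a short script. This is only a sketch of the arithmetic: every percentage is an assumption taken from the answer, and the variable names are made up for illustration. Without the intermediate rounding, the result comes out slightly under the 127.5 million quoted.

```python
# Top-down estimate of Gmail users in India, using the answer's assumed figures.
population = 1_300_000_000
internet_users = population * 30 // 100  # 30% internet penetration

# Internet users per age band, rounded as in the answer.
online = {"0-20": 100_000_000, "21-60": 230_000_000, ">60": 60_000_000}

# Assumed share of each band's internet users with an email account.
# For 21-60: 75% employed, and 75% of the employed use email.
email_share = {"0-20": 0.05, "21-60": 0.75 * 0.75, ">60": 0.25}

email_users = {band: online[band] * email_share[band] for band in online}
total_email = sum(email_users.values())
gmail_users = 0.85 * total_email  # assumed 85% Gmail share

print(f"Email users: {total_email / 1e6:.1f} million")
print(f"Gmail users: {gmail_users / 1e6:.1f} million")
```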
3. Estimate how many items are in a jar?
Ans: We will assume a jar with a circular base. To keep the mathematics simple, we use this basic shape of the jar to calculate the number of candy pieces.
Note that a real jar changes its diameter along its height.
Let’s assume the diameter is 5 cm, hence the radius is 2.5 cm. The area of the base is Pi times r^2.
Therefore, area = 2.5*2.5*3.14 = 19.63 sq. cm.
For the volume, we will assume a cylinder and later add a few pieces.
We take the area and multiply it by the height of the cylinder. The height is again an estimate, so the final result is only as accurate as our estimates.
The height of the jar is 7 cm, hence V = 7*19.63 = 137.4 cubic cm.
Let’s use 0.8 cubic cm as the volume of one candy, as that is the normal size of a candy. This gives 137.4/0.8 ≈ 172 pieces in the cylinder.
If we split the jar into four sides and count the candy pieces, there are about 25 extra pieces on each side. So we add 100 to the estimate. Ans: 172+100 = 272 pieces.
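As a rough check of the jar arithmetic, the sketch below treats 0.8 as the volume of one candy in cubic cm, which is the reading that reproduces the answer of 272. All dimensions are the assumed values from the answer, and the variable names are illustrative.

```python
import math

diameter_cm = 5.0        # assumed jar diameter
height_cm = 7.0          # assumed jar height
candy_volume_cm3 = 0.8   # assumed volume of one candy
extra_pieces = 4 * 25    # about 25 extra pieces on each of four sides

base_area = math.pi * (diameter_cm / 2) ** 2   # area of the circular base
jar_volume = base_area * height_cm             # cylinder volume
pieces = round(jar_volume / candy_volume_cm3) + extra_pieces

print(f"Base area: {base_area:.2f} sq. cm, volume: {jar_volume:.1f} cubic cm")
print(f"Estimated pieces: {pieces}")
```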
4. How many paan shops are there in India?
Ans: We will use the method of demand and supply.
Using Demand:
Total population of India ≈ 1.2–1.3bn. For round numbers, take
Male = 700 Mn & Female = 600 Mn (roughly 900 women per 1000 men)
Females are far less likely than males to consume paan.
Females in India who consume paan on a regular basis = 2% = 600Mn*0.02 -> 12Mn
Divide males on the basis of age groups:
0–15 -> neglect this section
16–22 -> 15% of 700Mn = 105Mn -> 5Mn (this is the college age, and in my own college about 5 students out of 100 consumed paan)
23–50 -> the working-age group, which accounts for most consumers -> 48Mn
51–80 -> 20% of 700Mn = 140Mn -> older age, with various problems such as dental issues -> 0.05*140Mn -> 7Mn
Total demand for paan = 12+5+48+7 = 72 Mn paan per day (assuming one paan per consumer per day)
Using Supply:
Time taken to make 1 paan = 2 min -> 30 in 1 hour
Let’s say a paan shop opens for 10 hours -> 30*10 = 300 paan per day
Number of paan shops in India = 72 Mn / 300 = 2,40,000 ~ 2.5 lakh
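The demand and supply steps can be sketched as follows. The per-group consumer counts are the figures used in the total above; the 48 Mn entry is labelled here as the 23–50 group, which is an assumption based on the gap between the other age bands. The script also assumes one paan per consumer per day.

```python
# Demand side: regular paan consumers, in millions (figures from the answer).
female_consumers = 600 * 2 // 100          # 2% of 600 Mn women
male_consumers = {"16-22": 5, "23-50": 48, "51-80": 7}  # "23-50" label is assumed
demand_mn = female_consumers + sum(male_consumers.values())

# Supply side: one shop makes a paan every 2 minutes for 10 hours a day.
paan_per_hour = 60 // 2
paan_per_shop_per_day = paan_per_hour * 10

# Assumes each consumer buys one paan per day.
shops = demand_mn * 1_000_000 // paan_per_shop_per_day

print(f"Daily demand: {demand_mn} Mn paan")
print(f"Paan shops needed: {shops:,}")
```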
5. Estimate the total length of roads in your city
Ans: If we take Mumbai and, say, Blue Dart, it will take 1 delivery truck from the regional distribution center to cover a region like Andheri.
Andheri is assumed to be approximately 25km^2 in area.
A delivery truck driver would drive for 7 hours a day at a speed of 30–40 kmph (Andheri being congested).
So kilometers he clocks in a day: 30*7 = 210 km.
Now generally the delivery schedule is planned in such a way that all deliveries happen in the most efficient way. (The operations team at the firm is paid lakhs to ensure this, but then technology rules!)
So assuming an up and down journey, I would take the length of the roadways in the Andheri area as 210/2 = 105 km.
But to allow for some redundancy during the travel, I would take a 70% uniqueness factor.
That gives us ~ 70km
Also, some of the areas would be covered by bike delivery personnel: say 30% of what the truck covers (as bikers cover short distances) = ~20km
Also 10–15 km for postmen and delivery personnel on foot: ~10km
So total road length for Andheri ~ 100km
Now, Mumbai city area: 600km^2
So number of Andheri-like regions = 600km^2 / 25km^2 = 24
Assuming the coverage ratio of the courier service to be 90%
Regions covered: ~22
So total road length ~ 22*100 = 2200km.
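A minimal sketch of the road-length estimate, following the same rounding as the answer. The speeds, percentages and areas are the assumed inputs from the text, and the variable names are illustrative.

```python
# Distance one delivery truck covers in a day in an Andheri-sized region.
truck_km_per_day = 30 * 7               # 30 kmph for 7 hours
one_way_km = truck_km_per_day / 2       # up-and-down journey
unique_truck_km = 0.7 * one_way_km      # 70% uniqueness factor, ~70 km

# Rounded components used in the answer: truck, bike and foot coverage.
andheri_km = 70 + 20 + 10

# Scale up to the whole city.
regions = 600 / 25                      # Andheri-like regions in Mumbai
covered_regions = round(0.9 * regions)  # 90% courier coverage
total_km = covered_regions * andheri_km

print(f"Unique truck coverage: {unique_truck_km:.1f} km")
print(f"Estimated road length: {total_km} km")
```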
6. How much is the Surf Excel detergent usage in a day in India?
Ans: India has a population of approx. 1.2B people.
About 20% are BPL (below the poverty line) and would therefore not use Surf Excel.
Remaining population: 0.8*1.2B = 0.96B people.
Assuming a family of 4 people, that is 0.24B families.
Rural:Urban = 70:30 (0.168B:0.072B)
Assuming only about 10% of families use Surf Excel in rural areas, due to the availability of other cheaper options, that will be about 16M families.
Due to competition and the availability of substitutes in urban areas, assuming Surf Excel has a market share of 40%, that will be about 28M families.
Total user base: 44M families.
Everyday usage per family must be at least 10 grams, so total usage = 440 million grams (about 440 tonnes) of Surf Excel every day.
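The same estimate without intermediate rounding is sketched below. It treats the larger 0.168B group as rural, which is what the 16M figure implies, and all shares are the assumed values from the answer. The unrounded total comes out a little above the 440 million grams quoted because the answer rounds 16.8M and 28.8M down.

```python
# Families above the poverty line, in millions.
population_m = 1200
families_m = population_m * 80 // 100 // 4   # 20% BPL removed, 4 people per family

rural_families_m = families_m * 70 // 100    # assumed 70:30 rural:urban split
urban_families_m = families_m - rural_families_m

rural_users_m = 0.10 * rural_families_m      # assumed 10% usage in rural areas
urban_users_m = 0.40 * urban_families_m      # assumed 40% market share in urban areas
total_users_m = rural_users_m + urban_users_m

grams_per_family_per_day = 10                # assumed minimum daily usage
daily_usage_million_g = total_users_m * grams_per_family_per_day

print(f"User base: {total_users_m:.1f}M families")
print(f"Daily usage: {daily_usage_million_g:.0f} million grams")
```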
7. How do we estimate the area of an airport?
Ans: Let’s assume the airport can accommodate 10 planes at once and has 5 runways.
Average length of a runway = 2000 m
Width of a runway = 50 m
The total area of the runways is 2000*50*5 = 500000 sq.m
The length of an average plane is assumed to be 50 m and its wingspan 40 m.
Total area of 10 planes = 10*50*40 = 20000 sq.m
Area of runways and planes = 500000+20000 = 520000 sq.m
Now an assumption: the total area occupied by the buildings is equal to the area of the runways and planes, i.e. 520000 sq.m
Total used area of the airport = 520000+520000 = 1040000 sq.m.
There is a lot of empty space at an airport, so we assume it is 40% of the total. Hence the final area = 1040000/0.6 ≈ 1700000 sq.m.
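Working the airport figures through step by step gives a used area of 1,040,000 sq.m and a final area of roughly 1.7 million sq.m. The sketch below uses only the assumed dimensions from the answer; the variable names are illustrative.

```python
# Assumed dimensions from the answer, in metres.
runways, runway_length, runway_width = 5, 2000, 50
planes, plane_length, plane_wingspan = 10, 50, 40

runway_area = runways * runway_length * runway_width    # sq.m
plane_area = planes * plane_length * plane_wingspan     # sq.m
operations_area = runway_area + plane_area

building_area = operations_area        # assumed equal to runways plus planes
used_area = operations_area + building_area

empty_share = 0.40                     # assumed share of empty space
total_area = used_area / (1 - empty_share)

print(f"Used area: {used_area:,} sq.m")
print(f"Total area: {total_area:,.0f} sq.m")
```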
8. How to estimate the number of ambulances on the road?
Ans: Let’s start with the population of the country ~ 1.3 billion (1300 million).
Rural – 70% ≈ 900 million; Urban ≈ 400 million.
Let’s divide the urban population into three groups based on income level – Low | High | Upper High.
I would divide as: Low – 30%, High – 50%, Upper High – 20%.
So Urban Low: 120 million, High: 200 million, Upper High: 80 million.
Now out of Urban Low, I would assume 10% have driving as their occupation (given the rise of taxi services etc.; only considering four-wheeler drivers) = 12 million (1,20,00,000).
Out of these, I would assume 1–1.5% are ambulance drivers; taking 1% gives 1,20,000.
So we can say almost 1,20,000 ambulances are available in the country at any time, assuming one ambulance per driver.
Now, ambulances are generally available on call, which means at least half of them are always on standby. Let’s assume that 70% of them are idle at any given point of time.
Also, since their average journey time is only 20–25 minutes, we can say that not more than 10% of ambulances would be on the road at any time.
So 10% on the road at any time = 12,000 ambulances on the road across urban India.
9. How many red colour Swift cars are there in Delhi?
Ans: Let’s start with the population of Delhi, which is 2 crore.
We will divide this population into two groups:
1. Families (80%) = 0.8*20000000 = 16000000 family members
2. Individuals (20%) = 0.2*20000000 = 4000000 people
Number of families (assuming 4 members in each) = 16000000/4 = 4000000.
Guessing that 50% of families have cars, the number of families with cars = 2000000.
In Delhi, we can assume that 25% of these families belong to high-class society, so they can afford 2 cars on average, and the rest can afford only one car.
Therefore number of cars with families = 0.25*2000000*2 + 0.75*2000000*1 = 2500000.
Now let’s say only 10% of the individual population can afford a single car.
Therefore, number of cars with individuals = 0.10*4000000 = 400000.
So the total number of cars in Delhi can be estimated as 2500000+400000 = 2900000, which can be rounded off to 3000000 for simple calculations.
Since Maruti is the Indian market leader in the car sector, we can safely assume 50% of cars on the roads of Delhi are Marutis, i.e., 1500000 cars.
Swift is one of the most common and affordable models, along with Alto, WagonR, Omni and 800. So let’s assume there are 200000 Swifts.
White, silver, grey and black are the most common colors. So approximately 75% of Swifts will be of these colors.
Now we are left with 50000 Swifts of different colors. Considering red, yellow, blue, maroon and orange as the possible other colors, and splitting evenly between them, we can guess that there are nearly 10000 red Swifts in Delhi.
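The funnel from population to red Swifts can be written out as below. All shares are the guesses from the answer. Two figures are assumptions implied by the numbers rather than stated outright: the 4,000,000 individuals are the 20% of the population not counted in families, and the five other colors are split evenly.

```python
population = 20_000_000                     # Delhi, 2 crore

# Cars owned by families.
family_members = population * 80 // 100
families = family_members // 4
families_with_cars = families * 50 // 100
family_cars = (families_with_cars * 25 // 100) * 2 + (families_with_cars * 75 // 100) * 1

# Cars owned by individuals (the remaining 20% of the population).
individuals = population - family_members
individual_cars = individuals * 10 // 100

total_cars = family_cars + individual_cars  # 2.9 million, rounded to 3 million in the text
maruti_cars = 3_000_000 * 50 // 100

swifts = 200_000                            # direct guess from the answer
other_color_swifts = swifts * 25 // 100     # 75% are white, silver, grey or black
red_swifts = other_color_swifts // 5        # even split over five other colors

print(f"Total cars: {total_cars:,}; Maruti: {maruti_cars:,}")
print(f"Red Swifts: {red_swifts:,}")
```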
10. Guess-estimate the number of people wearing watches in Bangalore.
Ans: Let’s assume the population of Bangalore is 10 million and that today is a working day for every age group.
Assume the age distribution implied by the counts below: 0–15: 30%, 15–25: 20%, 25–50: 30%, >50: 20%.
People below the poverty line have a negligible chance of using a watch, so I am eliminating this 25% of the population from every age group.
Now we have 75% of the aforesaid % in every age group, i.e. 2.25, 1.5, 2.25 and 1.5 million people respectively.
With those below the poverty line eliminated, the 0–15 group mostly consists of school children and infants. I assume 5% of them are wearing watches today. So the count would be 2.25 million*5% = 112500
The 15–25 age group consists mostly of higher education students who spend most of their time using mobile phones, so people using watches may be nearly 25%, which gives 1.5 million*25% = 375000
The 25–50 group consists of working professionals whose generation only recently entered the smartphone culture, so most of them use watches. If 90% of this group are couples, we will have approximately 20% housewives in it. As there is less chance of housewives wearing watches, we can neglect them. Of the remaining 80%, say 90% wear watches; then the count would be (80%*2.25 million)*90% = 1620000
Coming to the >50 group, say 25% are working somewhere, with 90% of them wearing watches again, and in the remaining 75% let 20% wear watches, as we see many of our grandfathers wearing watches even at home. So the count would be (25%*1.5 million)*90%+(75%*1.5 million)*20% = 562500
Total = sum of all these = 112500+375000+1620000+562500 = 2.67 million people wear watches
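Adding up the four age groups gives about 2.67 million watch wearers. A minimal sketch of that sum, using the group sizes (after removing the 25% below the poverty line) and the wearing rates assumed in the answer:

```python
# People per age group after removing the 25% below the poverty line.
group_size = {"0-15": 2_250_000, "15-25": 1_500_000, "25-50": 2_250_000, ">50": 1_500_000}

wearers = {
    "0-15": group_size["0-15"] * 5 // 100,                  # 5% of children
    "15-25": group_size["15-25"] * 25 // 100,               # 25% of students
    "25-50": group_size["25-50"] * 80 // 100 * 90 // 100,   # 90% of the 80% non-housewives
    ">50": group_size[">50"] * 25 // 100 * 90 // 100        # 90% of the 25% still working
           + group_size[">50"] * 75 // 100 * 20 // 100,     # 20% of the remaining 75%
}
total_wearers = sum(wearers.values())

for group, count in wearers.items():
    print(f"{group}: {count:,}")
print(f"Total: {total_wearers:,}")
```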
Q1. What is a Data Sample?
A. A data sample is a set of data collected or selected from a statistical population by a defined procedure. The elements of a sample are known as sample points, sampling units or observations.
Q2. Define Population.
A. In statistics, population refers to the total set of observations that can be made. For example, if we are studying the weight of adult women, the population is the set of weights of all the adult women in the world.
Q3. What is a Data Point?
A. In statistics, a data point (or
observation) is a set of one or more measurements on a single member of a
statistical population.
Q4. Explain Data Sets.
A. Data sets usually
come from actual observations obtained by sampling a statistical population,
and each row corresponds to the observations on one element of that
population. Data sets may further be generated by algorithms for the
purpose of testing certain kinds of software.
Q5. What is meant by the term Inferential Statistics?
A. Inferential statistics use a random sample of
data taken from a population to describe and make inferences about the
population. Inferential statistics are valuable when examination of each member
of an entire population is not convenient or possible.
Q6. Give an example of Inferential Statistics.
A. You asked five of your
classmates about their height. On the basis of this information, you stated
that the average height of all students in your university or college is 67
inches.
Q7. What is Descriptive Statistics?
A. Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can be either a representation of the entire population or a sample of it. Descriptive statistics are broken down into measures of central tendency and measures of variability (spread).
Q8. What is the range of data?
A. It tells us how much the data is spread across a set. In other words, it is defined as the difference between the highest and the lowest value present in the set.
X = [2 3 4 4 3 7 9]
range(X) % returns 9 - 2 = 7
Q9. Define Measurement.
A. Data can be classified as being on one of four scales:
nominal
ordinal
interval
ratio
Each level of measurement has some important properties that are useful
to know. For example, only the ratio scale has meaningful zeros.
Q10. What is a Nominal Scale?
A. Nominal variables (also called categorical variables) can be placed into categories. They don’t have a numeric value and so cannot be added, subtracted, divided or multiplied. They also have no order; if they appear to have an order, then you probably have ordinal variables instead.
Q11. What is an Ordinal Scale?
A. The ordinal scale contains things that you can place in order. For
example, hottest to coldest, lightest to heaviest, richest to poorest.
Basically, if you can rank data by 1st, 2nd, 3rd place (and so on), then you
have data that’s on an ordinal scale.
Q12. What is an Interval Scale?
A. An interval scale has ordered numbers with
meaningful divisions. Temperature is on the interval scale: a difference of 10
degrees between 90 and 100 means the same as 10 degrees between 150 and 160.
Compare that to high school ranking (which is ordinal), where the difference
between 1st and 2nd might be .01 and between 10th and 11th .5. If you have
meaningful divisions, you have something on the interval scale.
Q13. Explain Ratio Scale.
A. The ratio scale is exactly the same as the interval scale with one major difference: zero is meaningful. For example, a height of zero is meaningful (it means you don’t exist). Compare that to a temperature of zero degrees Celsius or Fahrenheit, which exists but does not mean an absence of temperature.
Q14. What do you mean by Bayesian?
A. Bayesians condition on the data observed and consider the probability distribution on the hypothesis. Bayesian statistics provides us with mathematical tools to rationally update our subjective beliefs in light of new data or evidence.
Q15. What is Frequentist?
A. Frequentists condition on a hypothesis of choice and consider the probability distribution on the data, whether observed or not. Frequentist statistics uses rigid frameworks, the type of frameworks that you learn in basic statistics.
Q16. What is a P-value?
A. In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.
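As a small worked example of "at least as extreme", the sketch below computes an exact one-sided p-value for a hypothetical coin experiment: 8 heads in 10 flips under the null hypothesis that the coin is fair. The scenario and the function name are illustrative assumptions, not part of the original answer.

```python
from math import comb


def binomial_upper_p_value(successes, trials, p_null=0.5):
    """P(X >= successes) when X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical data: 8 heads in 10 flips, null hypothesis of a fair coin.
p_value = binomial_upper_p_value(8, 10)
print(f"p-value = {p_value:.4f}")
```

Here the p-value is 56/1024, a little above 0.05, so a one-sided test at α = 0.05 would not reject the null hypothesis.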
Q17. What is a Confidence Interval?
A. A confidence interval, in statistics, is a range of values, computed from sample data, that is constructed to contain the true population parameter for a certain proportion of repeated samples (the confidence level, e.g. 95%). Confidence intervals measure the degree of uncertainty or certainty in a sampling method.
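A minimal sketch of a 95% confidence interval for a mean, using a normal approximation. The sample values are hypothetical heights in inches; for a sample this small, a t-distribution would usually be more appropriate, so treat the result as illustrative only.

```python
from statistics import NormalDist, mean, stdev

sample = [64, 66, 67, 67, 68, 69, 70, 71]    # hypothetical heights in inches
n = len(sample)

sample_mean = mean(sample)
standard_error = stdev(sample) / n ** 0.5    # sample standard deviation / sqrt(n)

z = NormalDist().inv_cdf(0.975)              # two-sided 95% critical value, about 1.96
lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error

print(f"mean = {sample_mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```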
Q18. Explain Hypothesis Testing.
A. Hypothesis testing is a procedure in statistics whereby an analyst tests an assumption regarding a population parameter. The methodology employed by the analyst depends on the nature of the data used and the reason for the analysis. Hypothesis testing is used to assess the plausibility of a hypothesis using sample data drawn from a larger population.
Q19. What is likelihood?
A. The probability of some observed outcomes given a set of
parameter values is regarded as the likelihood of the set of parameter values
given the observed outcomes.
Q20. What is sampling?
A. Sampling is that part of statistical practice concerned
with the selection of an unbiased or random subset of individual observations
within a population of individuals intended to yield some knowledge about the
population of concern.
Q21. What are Sampling Methods?
A. There are 4 common sampling methods:
Simple Random
Systematic
Cluster
Stratified
Q22. What is Mode?
A. The mode of a data sample is the element that occurs the most times in the data collection.
X = [1 2 4 4 4 4 5 5]
mode(X) % returns 4
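For the array above, the value 4 occurs four times, more often than any other element, so the mode is 4. A quick way to confirm this by counting occurrences:

```python
from collections import Counter

x = [1, 2, 4, 4, 4, 4, 5, 5]

# Count how often each value occurs and take the most frequent one.
counts = Counter(x)
mode_value, mode_count = counts.most_common(1)[0]

print(f"mode = {mode_value} (occurs {mode_count} times)")
```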
Q23. What is Median?
A. It is described as the numeric value that separates the lower half of a sample, a population or a probability distribution from the upper half. It can be easily calculated by arranging all the samples from highest to lowest (or vice versa) and picking the middle one.
X = [2 4 1 3 4 4 3]
sort(X) % gives [1 2 3 3 4 4 4]
median(X) % returns 3
Q24. What is meant by Quartile?
A. It is a type of quantile that divides the data points into four more or less equal parts (quarters). Each quartile contains 25% of the total observations. Generally, the data is arranged from smallest to largest.
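A short sketch of computing the three quartile cut points with Python’s standard library. The data values are hypothetical. Note that several conventions exist for interpolating quartiles, so other tools may give slightly different cut points for the same data.

```python
from statistics import quantiles

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 13]   # hypothetical, already sorted

# n=4 splits the data into four groups and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4)

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
```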
Q25. What is Moment?
A. It is a quantitative measure of the shape of a set of points. It comprises a set of statistical parameters to measure a distribution. Four moments are commonly used:
Mean
Variance
Skewness
Kurtosis
Q26. What is the Mean of data?
A. The statistical mean refers to the average that is used to derive the central tendency of the data in question. It is determined by adding all the data points in a population and then dividing the total by the number of points.
X = [1 2 3 3 6]
sum(X) % 1+2+3+3+6 = 15
mean(X) % returns 15/5 = 3
Q27. Define Skewness.
A. Skewness is a measure of the asymmetry of the data around the sample mean. If it is negative, the data are spread out more to the left side of the mean than to the right. If it is positive, the data are spread out more to the right.
Q28. What is Variance?
A. It describes how far the values lie from the mean. A small variance indicates that the data points tend to be very close to the mean, and to each other. A high variance indicates that the data points are very spread out from the mean, and from one another. Variance is the average of the squared distances from each point to the mean.
Q29. Define Standard Deviation.
A. In statistics, the standard deviation is a
measure of the amount of variation or dispersion of a set of values. A low
standard deviation indicates that the values tend to be close to the mean of
the set, while a high standard deviation indicates that the values are spread
out over a wider range.
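Mean, variance and standard deviation are closely linked: the variance is the average squared distance from the mean, and the standard deviation is its square root. A minimal sketch using a small five-point data set and the population (divide by n) versions of the formulas:

```python
from statistics import mean, pstdev, pvariance

x = [1, 2, 3, 3, 6]

mu = mean(x)                                   # 15 / 5
squared_distances = [(v - mu) ** 2 for v in x]
variance_by_hand = sum(squared_distances) / len(x)

variance = pvariance(x)                        # population variance
std_dev = pstdev(x)                            # square root of the variance

print(f"mean = {mu}, variance = {variance}, standard deviation = {std_dev:.4f}")
```

For a sample drawn from a larger population, `statistics.variance` and `statistics.stdev` divide by n - 1 instead.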
Q30. What is Kurtosis?
A. Kurtosis is a measure of how outlier-prone a distribution
is. In other words, kurtosis identifies
whether the tails of a given distribution contain extreme values.
Q31. What is meant by Covariance?
A. Covariance gives a measure of how much two variables change together. In finance, for example, covariance measures the directional relationship between the returns on two assets. A positive covariance means that asset returns move together, while a negative covariance means they move inversely. Covariance is calculated by analyzing at-return surprises (standard deviations from the expected return) or by multiplying the correlation between the two variables by the standard deviation of each variable.
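The relationship "covariance = correlation × standard deviation of each variable" can be checked numerically. The sketch below uses two small hypothetical series and the sample (n - 1) formulas; the function name is illustrative.

```python
from statistics import mean, stdev


def sample_covariance(xs, ys):
    """Average product of deviations from the means, with n - 1 in the denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


x = [1, 2, 3, 4, 5]      # hypothetical values
y = [2, 1, 4, 3, 5]

cov = sample_covariance(x, y)
corr = cov / (stdev(x) * stdev(y))       # Pearson correlation

print(f"covariance = {cov:.2f}, correlation = {corr:.2f}")
print(f"correlation * sd(x) * sd(y) = {corr * stdev(x) * stdev(y):.2f}")
```

A positive value means the two series tend to move together; dividing by the standard deviations rescales it to the range -1 to 1.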
Q32. What is Alternative Hypothesis?
A. The Alternative hypothesis
(denoted by H1 ) is the statement that must be true if the null hypothesis is
false.
Q33. Explain Significance Level.
A. The probability of rejecting the null hypothesis when it is true is called the significance level α, and very common choices are α = 0.05 and α = 0.01.
Q34. Do you know what Binary Search is?
A. For binary search, the array should be sorted in ascending or descending order. In each step, the
algorithm compares the search key value with the key value of the middle
element of the array. If the keys match, then a matching element has been found
and its index, or position, is returned. Otherwise, if the search key is less
than the middle element’s key, then the algorithm repeats its action on the
sub-array to the left of the middle element or, if the search key is greater,
on the sub-array to the right.
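The steps described above can be sketched as an iterative function. It assumes the list is sorted in ascending order and returns -1 when the key is absent; the function name and that return convention are choices made for this example.

```python
def binary_search(sorted_items, key):
    """Return the index of key in an ascending sorted list, or -1 if absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == key:
            return mid                 # keys match: element found
        if key < sorted_items[mid]:
            high = mid - 1             # continue in the left sub-array
        else:
            low = mid + 1              # continue in the right sub-array
    return -1


values = [1, 3, 4, 7, 9, 12, 15]
print(binary_search(values, 9))    # index of 9
print(binary_search(values, 5))    # not present
```

Each comparison halves the remaining search range, so the number of steps grows with the logarithm of the array length.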
Q35. Explain Hash Table.
A. A hash table is a data structure used to implement an
associative array, a structure that can map keys to values. A hash table uses a
hash function to compute an index into an array of buckets or slots, from which
the correct value can be found.
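To make the idea of buckets concrete, here is a toy hash table that resolves collisions by chaining (each bucket holds a small list of key-value pairs). The class and method names are illustrative; in practice Python’s built-in `dict` already provides this behaviour.

```python
class ChainedHashTable:
    """Toy hash table using separate chaining for collisions."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _index(self, key):
        # The hash function maps a key to a bucket index.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default


table = ChainedHashTable()
table.put("a", 1)
table.put("b", 2)
table.put("a", 3)            # replaces the earlier value for "a"
print(table.get("a"), table.get("b"), table.get("missing"))
```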
Q36. What is Null Hypothesis?
A. The null hypothesis (denoted by H0) is a statement about the value of a population parameter (such as the mean), and it must contain the condition of equality and must be written with the symbol =, ≤, or ≥.
Q37. When you are creating a statistical model, how do you prevent over-fitting?
A. It can be detected and limited by cross-validation, which checks the model on data it was not fitted to.
Q38. What do you mean by Cross-validation?
A. Cross-validation is a model validation technique for assessing how the results of a statistical analysis (model) will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.
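One common form is k-fold cross-validation: split the data into k folds, hold each fold out in turn, fit on the rest and measure the error on the held-out fold. The sketch below uses hypothetical data and a deliberately trivial "model" (predict the training mean) so that it stays self-contained. In practice the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 6.0, 9.0, 7.0]   # hypothetical target values

fold_errors = []
for train_idx, test_idx in k_fold_indices(len(y), k=5):
    prediction = sum(y[i] for i in train_idx) / len(train_idx)   # "fit" on the training folds
    mse = sum((y[i] - prediction) ** 2 for i in test_idx) / len(test_idx)
    fold_errors.append(mse)

cv_error = sum(fold_errors) / len(fold_errors)
print(f"Mean squared error per fold: {[round(e, 2) for e in fold_errors]}")
print(f"Cross-validated error: {cv_error:.2f}")
```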
Q39. What is Linear regression?
A. A linear regression is a good tool for quick predictive analysis: for
example, the price of a house depends on a myriad of factors, such as its size
or its location. In order to see the relationship between these variables, we
need to build a linear regression, which predicts the line of best fit between
them and can help conclude whether or not these two factors have a positive or
negative relationship.
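A line of best fit for one predictor can be computed with ordinary least squares. The house sizes and prices below are made-up numbers chosen to lie exactly on a line, and the function name is illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


sizes = [50, 70, 90, 110]        # hypothetical house sizes
prices = [150, 200, 250, 300]    # hypothetical prices

slope, intercept = fit_line(sizes, prices)
predicted = intercept + slope * 100

print(f"price = {intercept:.1f} + {slope:.2f} * size")
print(f"Predicted price for size 100: {predicted:.1f}")
```

A positive slope indicates a positive relationship between the two variables, and a negative slope a negative one.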
Q40. What are the assumptions required for linear regression?
A. There are four major assumptions:
There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data.
The errors or residuals of the data are normally distributed and independent of each other.
There is minimal multicollinearity between explanatory variables.
Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variable.
Q41. What is Multiple Regression?
A. Multiple regression generally explains the
relationship between multiple independent or predictor variables and one
dependent or criterion variable. A dependent variable is modeled as a
function of several independent variables with corresponding coefficients,
along with the constant term. Multiple regression requires two or more
predictor variables, and this is why it is called multiple regression.
Q42. What is a Statistical Interaction?
A. Basically, an interaction is when the effect of one factor (input variable)
on the dependent variable (output variable) differs among levels of another
factor.
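Q43. What is an example of a data set with a non-Gaussian distribution?
A. The Gaussian distribution is part of the exponential family of distributions, but there are many more members of that family with the same sort of ease of use in many cases. Examples of non-Gaussian data include counts of events in a fixed interval (often modeled as Poisson) and waiting times between events (often modeled as exponential). If the person doing the machine learning has a solid grounding in statistics, these distributions can be utilized where appropriate.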
Q44. Define Correlation.
A. Correlation is a statistical technique
that can show whether and how strongly pairs of variables are related.
For example: height and weight are related;
taller people tend to be heavier than shorter people. The relationship isn’t
perfect. People of the same height vary in weight, and you can easily think of
two people you know where the shorter one is heavier than the taller one.
Nonetheless, the average weight of people 5’5” is less than the average weight
of people 5’6”, and their average weight is less than that of people 5’7”,
etc.
Correlation can tell you just how much of the variation in people’s weights is related to their heights.
Q45. What is the primary goal of A/B Testing?
A. A/B testing refers to a statistical hypothesis test that compares two variants, A and B. The primary goal of A/B testing is to identify changes to a web page that maximize or increase the outcome of interest. A/B testing is a fantastic method for finding the most suitable online promotional and marketing strategies for the business.
Q46. What is the meaning of Statistical Power of Sensitivity?
A. The statistical power of sensitivity is used to validate the accuracy of a classifier, which can be Logistic, SVM, Random Forest, etc. Sensitivity is Predicted True Events / Total Events, i.e. the number of actual events that the classifier correctly predicted (true positives) divided by the total number of actual events.
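Sensitivity can be computed directly from the actual and predicted labels. The labels below are hypothetical, with 1 marking an event and 0 a non-event.

```python
actual    = [1, 1, 1, 0, 0, 1, 0, 1]   # hypothetical true labels
predicted = [1, 0, 1, 0, 1, 1, 0, 1]   # hypothetical classifier output

# True positives: actual events that were predicted as events.
true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
# False negatives: actual events the classifier missed.
false_negatives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

sensitivity = true_positives / (true_positives + false_negatives)
print(f"sensitivity = {true_positives}/{true_positives + false_negatives} = {sensitivity:.2f}")
```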
Q47. Explain Over-fitting.
A. In the case of over-fitting, the model is highly complex, for example having too many parameters relative to the number of observations. The overfit model has poor predictive performance, and it overreacts to minor fluctuations in the training data.
Q48. Explain Under-fitting
A. In the case of under-fitting, the underlying trend of the data cannot be captured by the statistical model or the machine learning algorithm. Such a model also has poor predictive performance.
Q49. What is Long Format Data?
A. In the long format, each row is one time point per subject, so each subject has data in multiple rows. Data in the wide format, by contrast, can be recognized by the fact that the columns generally represent the groups.
Q50. What is Wide Format Data?
A. In the wide format, the repeated responses of the subject will fall in a single row, and each response will go in a separate column.
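The difference between the two layouts is easiest to see by converting one into the other. The sketch below uses plain Python structures and hypothetical measurements for two subjects at two time points.

```python
# Long format: one row per subject per time point.
long_rows = [
    ("s1", "t1", 5.1),
    ("s1", "t2", 5.4),
    ("s2", "t1", 6.0),
    ("s2", "t2", 6.3),
]

# Wide format: one row (here, one dict entry) per subject, one column per time point.
wide = {}
for subject, time_point, value in long_rows:
    wide.setdefault(subject, {})[time_point] = value

# And back again: unpack each subject's columns into separate rows.
back_to_long = [
    (subject, time_point, value)
    for subject, columns in wide.items()
    for time_point, value in columns.items()
]

print(wide)
print(back_to_long == long_rows)
```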