Q1. What is a Sample?
A. A data sample is a set of data collected and/or selected from a statistical
population by a defined procedure. The elements of a sample are known as sample
points, sampling units or observations.
Q2. Define Population.
A. In statistics, population refers
to the total set of observations that can be made. For example, if we are
studying the weight of adult women, the population is the set of
weights of all adult women in the world.
Q3. What is a Data Point?
A. In statistics, a data point (or
observation) is a set of one or more measurements on a single member of a
statistical population.
Q4. Explain Data Sets.
A. Data sets usually
come from actual observations obtained by sampling a statistical population,
and each row corresponds to the observations on one element of that
population. Data sets may further be generated by algorithms for the
purpose of testing certain kinds of software.
Q5. What is meant by the term Inferential Statistics?
A. Inferential statistics use a random sample of
data taken from a population to describe and make inferences about the
population. Inferential statistics are valuable when examination of each member
of an entire population is not convenient or possible.
Q6. Give an example of Inferential Statistics.
A. You asked five of your
classmates about their height. On the basis of this information, you stated
that the average height of all students in your university or college is 67
inches.
Q7. What is Descriptive Statistics?
A. Descriptive statistics are
brief descriptive coefficients that summarize a given data set, which can be
either a representation of the entire population or a sample of it. Descriptive
statistics are broken down into measures of central tendency and measures of
variability (spread).
Q8. What is the range of data?
A. It tells us how much the data in a set is spread out.
In other words, it is defined as the difference between the highest and the
lowest value present in the set.
x = [2 3 4 4 3 7 9];
range(x)  % returns 9 - 2 = 7
Q9. Define Measurement.
A. Data can be classified as being on one of four scales:
- nominal
- ordinal
- interval
- ratio
Each level of measurement has some important properties that are useful
to know. For example, only the ratio scale has meaningful zeros.
Q10. What is a Nominal Scale?
A. Nominal variables (also called categorical variables) can be placed into
categories. They don’t have a numeric value and so cannot be added,
subtracted, divided or multiplied. They also have no order; if they appear to
have an order then you probably have ordinal variables instead.
Q11. What is an Ordinal Scale?
A. The ordinal scale contains things that you can place in order. For
example, hottest to coldest, lightest to heaviest, richest to poorest.
Basically, if you can rank data by 1st, 2nd, 3rd place (and so on), then you
have data that’s on an ordinal scale.
Q12. What is an Interval Scale?
A. An interval scale has ordered numbers with
meaningful divisions. Temperature is on the interval scale: a difference of 10
degrees between 90 and 100 means the same as 10 degrees between 150 and 160.
Compare that to high school ranking (which is ordinal), where the difference
between 1st and 2nd might be .01 and between 10th and 11th .5. If you have
meaningful divisions, you have something on the interval scale.
Q13. Explain Ratio Scale.
A. The ratio scale is exactly the same as the
interval scale with one major difference: zero is meaningful. For example, a
height of zero is meaningful (it means you don’t exist). Compare that to a
temperature of zero degrees Celsius or Fahrenheit, which exists but does not
mean an absence of temperature.
Q14. What do you mean by Bayesian?
A. Bayesians condition on the observed data and consider
the probability distribution on the hypothesis. Bayesian
statistics provides us with mathematical tools to rationally update our
subjective beliefs in light of new data or evidence.
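As a minimal sketch of such a belief update, the Python snippet below applies Bayes' rule to two hypothetical hypotheses about a coin: fair, or biased with an assumed 0.8 chance of heads. The prior weights and the bias value are illustrative assumptions, not part of the answer above.

```python
def bayes_update(priors, likelihoods):
    """Return posterior probabilities given priors and P(data | hypothesis)."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalized.values())  # P(data), the normalizing constant
    return {h: value / evidence for h, value in unnormalized.items()}


# Hypothetical prior: both hypotheses equally plausible before seeing data.
priors = {"fair": 0.5, "biased": 0.5}
# Probability of observing one head under each hypothesis (assumed values).
likelihood_of_head = {"fair": 0.5, "biased": 0.8}

posterior = bayes_update(priors, likelihood_of_head)
print(posterior)
```

After one observed head, the belief in the biased coin rises above its prior of 0.5, and the posterior could serve as the prior for the next observation.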
Q15. What is Frequentist?
A. Frequentists condition on a hypothesis of choice and
consider the probability distribution on the data, whether observed or not.
Frequentist statistics uses rigid frameworks, the type of frameworks that you
learn in basic statistics.
Q16. What is a P-Value?
A. In statistical significance testing, it is the
probability of obtaining a test statistic at least as extreme as the one that
was actually observed, assuming that the null hypothesis is true.
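To make the definition concrete, here is a small Python sketch of an exact one-sided p-value for a coin-flip experiment. The scenario (9 heads in 10 flips, null hypothesis of a fair coin) and the function name are hypothetical choices for illustration.

```python
from math import comb


def binomial_p_value_upper(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): a one-sided (upper-tail) p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical observation: 9 heads in 10 flips; H0 says the coin is fair.
p_value = binomial_p_value_upper(n=10, k=9)
print(p_value)  # 11/1024, about 0.0107
```

Results "at least as extreme" here means 9 or 10 heads, so the p-value sums the probabilities of those two outcomes under the null hypothesis.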
Q17. What is a Confidence Interval?
A. A confidence interval,
in statistics, is a range of values computed from sample data that is expected
to contain the population parameter in a certain proportion of repeated samples
(the confidence level, for example 95%). Confidence
intervals measure the degree of uncertainty or certainty in a sampling
method.
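One common way to compute such an interval for a mean is the normal approximation, sketched below in Python. The height values are hypothetical, and for a sample this small a t-distribution would usually be the more appropriate choice; the normal quantile is used here only to keep the sketch within the standard library.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * stdev(sample) / sqrt(len(sample))  # z times the standard error
    center = mean(sample)
    return center - margin, center + margin


# Hypothetical heights in inches.
heights = [64, 66, 67, 67, 68, 69, 70, 71, 65, 68]
low, high = mean_confidence_interval(heights)
print(low, high)
```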
Q18. Explain Hypothesis Testing.
A. Hypothesis testing is a procedure
in statistics whereby an analyst tests an
assumption regarding a population parameter. The methodology employed by the
analyst depends on the nature of the data used and the reason for the analysis.
Hypothesis testing uses sample data to assess the plausibility of a hypothesis
about the larger population.
Q19. What is likelihood?
A. The probability of some observed outcomes given a set of
parameter values is regarded as the likelihood of the set of parameter values
given the observed outcomes.
Q20. What is sampling?
A. Sampling is that part of statistical practice concerned
with the selection of an unbiased or random subset of individual observations
within a population of individuals intended to yield some knowledge about the
population of concern.
Q21. What are Sampling Methods?
A. Four common sampling methods are:
- Simple Random
- Systematic
- Cluster
- Stratified
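The four methods listed above can be sketched in a few lines of Python. The population of 100 numbered units, the even/odd strata and the blocks-of-ten clusters are all hypothetical choices made only for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(100))  # hypothetical unit IDs 0..99


def simple_random(units, k):
    """Every unit has the same chance of selection."""
    return rng.sample(units, k)


def systematic(units, step):
    """Pick a random start, then take every step-th unit."""
    start = rng.randrange(step)
    return units[start::step]


def stratified(units, stratum_of, k_per_stratum):
    """Draw a simple random sample inside each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    chosen = []
    for members in strata.values():
        chosen.extend(rng.sample(members, k_per_stratum))
    return chosen


def cluster(units, cluster_of, n_clusters):
    """Randomly choose whole clusters and keep every unit in them."""
    cluster_ids = sorted({cluster_of(u) for u in units})
    chosen = set(rng.sample(cluster_ids, n_clusters))
    return [u for u in units if cluster_of(u) in chosen]


print(simple_random(population, 10))
print(systematic(population, 10))
print(stratified(population, lambda u: u % 2, 5))  # strata: even / odd IDs
print(cluster(population, lambda u: u // 10, 2))   # clusters: blocks of 10 IDs
```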
Q22. What is Mode?
A. The mode of a data sample is the element that occurs most often in the
data collection.
x = [1 2 4 4 4 4 5 5];
mode(x)  % returns 4
Q23. What is Median?
A. It is described as the numeric value that separates the
lower half of a sample or probability distribution from the upper half. It can
be easily calculated by arranging all the samples from lowest to highest (or
vice versa) and picking the middle one; with an even number of samples, the two
middle values are averaged.
x = [2 4 1 3 4 4 3];
sort(x)    % returns [1 2 3 3 4 4 4]
median(x)  % returns 3
Q24. What is meant by Quartile?
A. It is a type of quantile that divides the data points
into four more or less equal parts (quarters). Each
quartile contains 25% of the total observations. Generally, the data is
arranged from smallest to largest.
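As a short illustration, Python's standard library can compute the three cut points that separate the four quarters. The data values are hypothetical, and other interpolation methods can give slightly different cut points.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # hypothetical, already sorted

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = quantiles(data, n=4)
print(q1, q2, q3)  # 3.0 6.0 9.0 with the default "exclusive" method
```

The middle cut point, `q2`, is the median of the data.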
Q25. What is Moment?
A. It is a quantitative measure of the shape of a set of
points. Moments comprise a set of statistical parameters used to describe a
distribution. Four moments are commonly used:
- Mean
- Variance
- Skewness
- Kurtosis
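A rough Python sketch of how these four quantities can be computed from central moments follows. It uses population (divide-by-n) formulas and non-excess kurtosis, which is one convention among several, and the helper names are hypothetical.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order over all values."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** order for x in values) / len(values)


def describe_shape(values):
    """Return mean, variance, skewness and kurtosis (population formulas)."""
    mu = sum(values) / len(values)
    variance = central_moment(values, 2)
    skewness = central_moment(values, 3) / variance ** 1.5
    kurtosis = central_moment(values, 4) / variance ** 2
    return mu, variance, skewness, kurtosis


print(describe_shape([1, 2, 3, 4, 5]))  # (3.0, 2.0, 0.0, 1.7)
```

The skewness of zero reflects that the example data is symmetric around its mean.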
Q26. What is the Mean of data?
A. The statistical
mean refers to the mean or average that is used to derive the
central tendency of the data in question. It is determined by adding all the
data points in a population and then dividing the total by the number of
points.
x = [1 2 3 3 6];
sum(x)   % returns 1 + 2 + 3 + 3 + 6 = 15
mean(x)  % returns 15 / 5 = 3
Q27. Define Skewness.
A. Skewness is a measure of the asymmetry of the data
around the sample mean. If it is negative, the data are spread out more to the
left side of the mean than to the right. If it is positive, the data are spread
out more to the right.
Q28. What is Variance?
A. It describes how far the values lie from the mean. A small variance indicates
that the data points tend to be very close to the mean, and to each other. A
high variance indicates that the data points are very spread out from the mean,
and from one another. Variance is the average of the squared distances from
each point to the mean.
Q29. Define Standard Deviation.
A. In statistics, the standard deviation is a
measure of the amount of variation or dispersion of a set of values. A low
standard deviation indicates that the values tend to be close to the mean of
the set, while a high standard deviation indicates that the values are spread
out over a wider range.
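The relationship between the two measures can be sketched in Python: the standard deviation is the square root of the variance, which is the average squared distance from the mean. The data values are hypothetical, and the population (divide-by-n) versions are used.

```python
from statistics import pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values with mean 5

# Population variance: mean of squared distances from the mean.
mean_value = sum(data) / len(data)
variance_by_hand = sum((x - mean_value) ** 2 for x in data) / len(data)

print(variance_by_hand)  # 4.0
print(pvariance(data))   # the same quantity from the standard library
print(pstdev(data))      # square root of the variance
```

For a sample rather than a whole population, `statistics.variance` and `statistics.stdev` divide by n - 1 instead.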
Q30. What is Kurtosis?
A. Kurtosis is a measure of how outlier-prone a distribution
is. In other words, kurtosis identifies
whether the tails of a given distribution contain extreme values.
Q31. What is meant by Covariance?
A. Covariance measures the directional relationship between two variables, such
as the returns on two assets. A positive covariance means that asset returns
move together, while a negative covariance means they move inversely.
Covariance is calculated by analyzing at-return surprises (standard deviations
from the expected return) or by multiplying the correlation between the two
variables by the standard deviation of each variable. It gives a measure of how
much two variables change together.
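A minimal Python sketch of the calculation, using hypothetical paired values rather than real asset returns, is shown below. It uses the sample (n - 1) denominator and checks that covariance equals correlation times the two standard deviations.

```python
from math import sqrt


def sample_covariance(xs, ys):
    """Average product of deviations from the means (n - 1 denominator)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


# Hypothetical paired observations that rise together.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

cov_xy = sample_covariance(xs, ys)
std_x = sqrt(sample_covariance(xs, xs))  # covariance of x with itself is its variance
std_y = sqrt(sample_covariance(ys, ys))
correlation = cov_xy / (std_x * std_y)

print(cov_xy)  # 5.0, positive because x and y move together
print(correlation)
```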
Q32. What is Alternative Hypothesis?
A. The Alternative hypothesis
(denoted by H1) is the statement that must be true if the null hypothesis is
false.
Q33. Explain Significance Level.
A. The probability of rejecting
the null hypothesis when it is true is called the significance level α, and very
common choices are α = 0.05 and α = 0.01.
Q34. What is Binary Search?
A. For binary search, the array
should be arranged in ascending or descending order. In each step, the
algorithm compares the search key value with the key value of the middle
element of the array. If the keys match, then a matching element has been found
and its index, or position, is returned. Otherwise, if the search key is less
than the middle element’s key, then the algorithm repeats its action on the
sub-array to the left of the middle element or, if the search key is greater,
on the sub-array to the right.
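The steps described above can be sketched in Python as follows. This version assumes the list is sorted in ascending order and returns -1 when the key is absent, which is a convention chosen here for illustration.

```python
def binary_search(sorted_items, key):
    """Return the index of key in an ascending list, or -1 if absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == key:
            return mid            # keys match: report the position
        if key < sorted_items[mid]:
            high = mid - 1        # continue in the left sub-array
        else:
            low = mid + 1         # continue in the right sub-array
    return -1


values = [1, 3, 5, 7, 9, 11]
print(binary_search(values, 7))   # 3
print(binary_search(values, 4))   # -1
```

Each comparison halves the remaining search range, so the number of steps grows logarithmically with the length of the list.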
Q35. Explain Hash Table.
A. A hash table is a data structure used to implement an
associative array, a structure that can map keys to values. A hash table uses a
hash function to compute an index into an array of buckets or slots, from which
the correct value can be found.
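As an illustrative sketch only, the toy Python class below shows the idea with separate chaining, where each bucket is a list of key-value pairs. The class and method names are hypothetical, and production code would normally use the built-in `dict` instead.

```python
class HashTable:
    """Toy hash table using separate chaining (a list of buckets)."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash function maps the key to a bucket index.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default


table = HashTable()
table.put("apple", 3)
table.put("pear", 5)
table.put("apple", 4)
print(table.get("apple"), table.get("pear"), table.get("plum"))  # 4 5 None
```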
Q36. What is Null Hypothesis?
A. The null hypothesis
(denoted by H0) is a statement about the value of a population parameter
(such as the mean), and it must contain the condition of equality and must be
written with the symbol =, ≤, or ≥.
Q37. When You Are Creating A Statistical Model, How Do You Prevent Over-fitting?
A. It can be prevented by using cross-validation, which checks the model
against data that was not used to fit it.
Q38. What do you mean by Cross-validation?
A. Cross-validation is a
model validation technique for assessing how the results of
a statistical analysis (model) will generalize to an independent data
set. It is mainly used in settings where the goal is prediction, and one wants
to estimate how accurately a predictive model will perform in practice.
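One widely used variant is k-fold cross-validation, sketched below in plain Python. The target values are hypothetical, the "model" is simply the training mean, and the folds are contiguous; in practice the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


# Hypothetical targets; the "model" just predicts the training mean.
targets = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 10.0, 12.0, 11.0]

fold_errors = []
for train, test in k_fold_indices(len(targets), k=5):
    prediction = sum(targets[i] for i in train) / len(train)
    mse = sum((targets[i] - prediction) ** 2 for i in test) / len(test)
    fold_errors.append(mse)

cv_error = sum(fold_errors) / len(fold_errors)  # estimate of out-of-sample error
print(cv_error)
```

Every observation is used for testing exactly once, and the model is never scored on data it was fitted to.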
Q39. What is Linear regression?
A. A linear regression is a good tool for quick predictive analysis: for
example, the price of a house depends on a myriad of factors, such as its size
or its location. In order to see the relationship between these variables, we
need to build a linear regression, which finds the line of best fit between
them and can help conclude whether two variables have a positive or
negative relationship.
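For a single predictor, the line of best fit can be computed with ordinary least squares, as in this Python sketch. The house sizes and prices are made-up numbers chosen so that the result is easy to verify by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = sum((x - mean_x) ** 2 for x in xs)
    slope = cross / spread
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: house size and price in arbitrary units.
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(sizes, prices)
print(slope, intercept)  # 2.0 1.0: a positive slope means price rises with size
```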
Q40. What are the assumptions required for linear regression?
A. There are four major assumptions:
- There is a linear relationship between the dependent variable and the
regressors, meaning the model you are creating actually fits the data.
- The errors or residuals of the data are normally distributed and independent
from each other.
- There is minimal multicollinearity between explanatory variables.
- Homoscedasticity. This means the variance around the regression line is the
same for all values of the predictor variable.
Q41. What is Multiple Regression?
A. Multiple regression generally explains the
relationship between multiple independent or predictor variables and one
dependent or criterion variable. A dependent variable is modeled as a
function of several independent variables with corresponding coefficients,
along with the constant term. Multiple regression requires two or more
predictor variables, and this is why it is called multiple regression.
Q42. What is a Statistical Interaction?
A. Basically, an interaction is when the effect of one factor (input variable)
on the dependent variable (output variable) differs among levels of another
factor.
Q43. What is an example of a data set with a non-Gaussian distribution?
A. The Gaussian distribution is part of the exponential family of distributions,
but there are many more of them, often with the same sort of ease of use, and a
person doing machine learning who has a solid grounding in statistics can use
them where appropriate. For example, counts of events in a fixed time interval
are commonly modeled with a Poisson distribution, and waiting times between
events with an exponential distribution.
Q44. Define Correlation.
A. Correlation is a statistical technique
that can show whether and how strongly pairs of variables are related.
For example: height and weight are related;
taller people tend to be heavier than shorter people. The relationship isn’t
perfect. People of the same height vary in weight, and you can easily think of
two people you know where the shorter one is heavier than the taller one.
Nonetheless, the average weight of people 5’5” is less than the average weight
of people 5’6”, and their average weight is less than that of people 5’7”,
etc.
Correlation can tell you just how much of
the variation in peoples’ weights is related to their heights.
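The height and weight example can be sketched in Python with the Pearson correlation coefficient. The measurements below are invented for illustration, and they deliberately include one shorter person who is heavier than a taller one.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient, between -1 and 1."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(spread_x * spread_y)


# Hypothetical heights (inches) and weights (pounds); the person at index 1
# is shorter but heavier than the person at index 2.
heights = [63, 65, 66, 68, 70, 72]
weights = [125, 150, 140, 155, 170, 180]

r = pearson_r(heights, weights)
print(r)       # strength and direction of the linear relationship
print(r ** 2)  # share of weight variation linearly related to height
```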
Q45. What is the primary goal of A/B Testing?
A. A/B
testing refers to a statistical hypothesis test comparing two variants, A and
B. The primary goal of A/B testing is to identify changes to a web
page that maximize or increase the outcome of interest. A/B testing is a
fantastic method for finding the most suitable online promotional and marketing
strategies for the business.
Q46. What is the meaning of Statistical Power of Sensitivity?
A. The statistical power of sensitivity is used to validate the accuracy of a
classifier, which can be Logistic Regression, SVM, Random Forest, etc.
Sensitivity is the number of correctly predicted true events divided by the
total number of actual true events.
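A small Python sketch of the calculation follows, with hypothetical labels where 1 marks a true event and 0 a non-event. Sensitivity is computed as true positives divided by all actual positives.

```python
def sensitivity(actual, predicted):
    """True positives divided by all actual positives (also called recall)."""
    pairs = list(zip(actual, predicted))
    true_positives = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_negatives = sum(1 for a, p in pairs if a == 1 and p == 0)
    return true_positives / (true_positives + false_negatives)


# Hypothetical labels: 1 marks a true event, 0 a non-event.
actual = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 0, 1]
print(sensitivity(actual, predicted))  # 0.75: 3 of the 4 true events were found
```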
Q47. Explain Over-fitting.
A. In
the case of over-fitting, the model is overly complex, for example having too
many parameters relative to the number of observations. The overfit model has
poor predictive performance, and it overreacts to minor fluctuations in the
training data.
Q48. Explain Under-fitting.
A. In the case of under-fitting, the underlying trend of the data
cannot be captured by the statistical model or machine learning
algorithm. Such a model also has poor predictive performance.
Q49. What is Long Format Data?
A. In the long format,
each row holds one time point per subject, so each subject has data in multiple
rows. By contrast, data in the wide format can be recognized by the fact that
the columns generally represent groups.
Q50. What is Wide Format Data?
A. In
the wide format, the repeated responses of the subject will fall in a single
row, and each response will go in a separate column.
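The difference between the two layouts can be illustrated with a short Python sketch that reshapes long-format records into wide-format rows. The field names (`subject`, `time`, `score`) and values are hypothetical.

```python
# Hypothetical long-format records: one row per subject per time point.
long_rows = [
    {"subject": "s1", "time": "t1", "score": 10},
    {"subject": "s1", "time": "t2", "score": 12},
    {"subject": "s2", "time": "t1", "score": 9},
    {"subject": "s2", "time": "t2", "score": 11},
]


def long_to_wide(rows, id_key, column_key, value_key):
    """Collapse repeated rows into one row per id, one column per column_key."""
    wide = {}
    for row in rows:
        wide_row = wide.setdefault(row[id_key], {id_key: row[id_key]})
        wide_row[row[column_key]] = row[value_key]
    return list(wide.values())


wide_rows = long_to_wide(long_rows, "subject", "time", "score")
print(wide_rows)
# [{'subject': 's1', 't1': 10, 't2': 12}, {'subject': 's2', 't1': 9, 't2': 11}]
```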