Statistics Interview Questions

Q1. What is a Sample?

A. A data sample is a set of data collected and/or selected from a statistical population by a defined procedure. The elements of a sample are known as sample points, sampling units or observations.

Q2. Define Population.

A. In statistics, population refers to the total set of observations that can be made. For example, if we are studying the weight of adult women, the population is the set of weights of all the adult women in the world.

Q3. What is a Data Point?

A. In statistics, a data point (or observation) is a set of one or more measurements on a single member of a statistical population.

Q4. Explain Data Sets.

A. Data sets usually come from actual observations obtained by sampling a statistical population, and each row corresponds to the observations on one element of that population. Data sets may further be generated by algorithms for the purpose of testing certain kinds of software.

Q5. What is meant by the term Inferential Statistics?

A. Inferential statistics use a random sample of data taken from a population to describe and make inferences about the population. Inferential statistics are valuable when examination of each member of an entire population is not convenient or possible.

Q6. Give an example of Inferential Statistics.

A. Suppose you ask five of your classmates about their height. On the basis of this sample, you state that the average height of all students in your university or college is 67 inches.

Q7. What is Descriptive Statistics?

A. Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can represent either an entire population or a sample of it. Descriptive statistics are broken down into measures of central tendency and measures of variability (spread).

Q8. What is the range of data?

A. The range tells us how widely the data in a set is spread. It is defined as the difference between the highest and the lowest value present in the set.

x = [2 3 4 4 3 7 9];

range(x) % returns 9 - 2 = 7

Q9. Define Measurement.

A. Data can be classified as being on one of four scales: 

  • nominal
  • ordinal
  • interval
  • ratio

Each level of measurement has some important properties that are useful to know. For example, only the ratio scale has meaningful zeros.

Q10. What is a Nominal Scale?

A. Nominal variables (also called categorical variables) can be placed into categories. They don’t have a numeric value and so cannot be added, subtracted, divided or multiplied. They also have no order; if they appear to have an order, then you probably have ordinal variables instead.

Q11. What is an Ordinal Scale?

A. The ordinal scale contains things that you can place in order. For example, hottest to coldest, lightest to heaviest, richest to poorest. Basically, if you can rank data by 1st, 2nd, 3rd place (and so on), then you have data that’s on an ordinal scale.

Q12. What is an Interval Scale?

A. An interval scale has ordered numbers with meaningful divisions. Temperature is on the interval scale: a difference of 10 degrees between 90 and 100 means the same as 10 degrees between 150 and 160. Compare that to high school ranking (which is ordinal), where the difference between 1st and 2nd might be .01 and between 10th and 11th .5. If you have meaningful divisions, you have something on the interval scale.

Q13. Explain Ratio Scale.

A. The ratio scale is exactly the same as the interval scale with one major difference: zero is meaningful. For example, a height of zero is meaningful (it means you don’t exist). Compare that to a temperature of zero degrees Celsius or Fahrenheit, which exists but does not mean the absence of temperature.

Q14. What do you mean by Bayesian?

A. Bayesians condition on the data observed and consider the probability distribution on the hypothesis. Bayesian statistics provides us with mathematical tools to rationally update our subjective beliefs in light of new data or evidence.

Q15. What is Frequentist?

A. Frequentists condition on a hypothesis of choice and consider the probability distribution on the data, whether observed or not. Frequentist statistics uses rigid frameworks, the type of frameworks that you learn in basic statistics.

Q16. What is a P-Value?

A. In statistical significance testing, it is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.
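
As a rough illustration of this definition, the sketch below computes an exact two-sided p-value for a coin-flip experiment. The function name `binomial_two_sided_p_value` and the scenario (8 heads in 10 flips, null hypothesis of a fair coin) are assumptions made for the example. "At least as extreme" is taken here to mean "no more likely under the null than the observed outcome", which is one common convention among several.

```python
from math import comb

def binomial_two_sided_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """Exact two-sided p-value for a binomial outcome under the null hypothesis."""
    def pmf(k: int) -> float:
        # Probability of exactly k successes if the null hypothesis is true
        return comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)

    observed = pmf(successes)
    # Add up every outcome that is no more likely than the observed one
    # (a small tolerance guards against floating-point rounding).
    return sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))

# Hypothetical experiment: 8 heads in 10 flips of a supposedly fair coin
p_value = binomial_two_sided_p_value(8, 10)
print(round(p_value, 4))
```

With a significance level of 0.05, a p-value above 0.05 would mean the null hypothesis of a fair coin is not rejected.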

Q17. What is a Confidence Interval?

A. A confidence interval, in statistics, is a range of values computed from sample data that is expected to contain the true population parameter in a stated proportion of repeated samples. That proportion is the confidence level, for example 95%. Confidence intervals measure the degree of uncertainty or certainty in a sampling method.
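
A minimal sketch of an approximate 95% confidence interval for a mean is shown below. The `heights` values are made up, and the normal critical value 1.96 is an assumption that suits large samples; for a sample this small, a t-distribution critical value would normally be used instead.

```python
import math
import statistics

def mean_confidence_interval_95(sample):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    margin = 1.96 * standard_error
    return mean - margin, mean + margin

heights = [66, 68, 67, 70, 65, 69, 67, 68]  # hypothetical heights in inches
low, high = mean_confidence_interval_95(heights)
print(f"{low:.2f} to {high:.2f}")
```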

Q18. Explain Hypothesis Testing.

A. Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population parameter. The methodology employed by the analyst depends on the nature of the data used and the reason for the analysis. Hypothesis testing uses sample data to assess whether a hypothesis about the larger population is plausible.

Q19. What is likelihood?

A. The probability of some observed outcomes given a set of parameter values is regarded as the likelihood of the set of parameter values given the observed outcomes.
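
To make this concrete, here is a small sketch that treats the observed outcome as fixed and evaluates the likelihood for different parameter values. The coin-flip data (7 heads, 3 tails) and the function name `bernoulli_likelihood` are assumptions for illustration.

```python
def bernoulli_likelihood(p: float, heads: int, tails: int) -> float:
    """Likelihood of heads-probability p given the observed heads and tails."""
    return p**heads * (1 - p) ** tails

# Try a grid of candidate parameter values and keep the most likely one
candidates = [i / 100 for i in range(1, 100)]
best_p = max(candidates, key=lambda p: bernoulli_likelihood(p, heads=7, tails=3))
print(best_p)
```

The candidate with the highest likelihood matches the observed proportion of heads, which is the idea behind maximum likelihood estimation.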

Q20. What is sampling?

A. Sampling is that part of statistical practice concerned with the selection of an unbiased or random subset of individual observations within a population of individuals intended to yield some knowledge about the population of concern.

Q21. What are Sampling Methods?

A. There are four common sampling methods:

  • Simple Random
  • Systematic
  • Cluster
  • Stratified
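
The four methods above can be sketched roughly as follows. The population of 100 numbered units, the sample sizes, and the two strata are all assumptions chosen to keep the example short.

```python
import random

population = list(range(1, 101))   # hypothetical population of 100 unit IDs
rng = random.Random(0)             # fixed seed so the sketch is repeatable

# Simple random: every unit has the same chance of being selected.
simple_random = rng.sample(population, 10)

# Systematic: pick a random start, then take every k-th unit.
step = len(population) // 10
start = rng.randrange(step)
systematic = population[start::step]

# Cluster: split the population into groups and keep whole groups chosen at random.
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
cluster_sample = [unit for cluster in rng.sample(clusters, 2) for unit in cluster]

# Stratified: sample separately within each stratum (here: low and high IDs).
strata = {"low": population[:50], "high": population[50:]}
stratified = [unit for units in strata.values() for unit in rng.sample(units, 5)]
```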

Q22. What is Mode?

A. The mode of a data sample is the element that occurs most often in the data collection.

       x = [1 2 4 4 4 4 5 5];

       mode(x) % returns 4

Q23. What is Median?

A. It is described as the numeric value that separates the lower half of a sample, population, or probability distribution from the upper half. It can be easily calculated by arranging all the samples from lowest to highest (or vice versa) and picking the middle one. With an even number of samples, the median is the average of the two middle values.

      x = [2 4 1 3 4 4 3];

      sort(x) % returns [1 2 3 3 4 4 4]

      median(x) % returns 3

Q24. What is meant by Quartile?

A. It is a type of quantile that divides the data points into four more or less equal parts (quarters). Each quarter contains about 25% of the total observations. Generally, the data is arranged from smallest to largest.
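
A short sketch using Python's standard `statistics.quantiles` is given below. The data values are made up, and different tools use slightly different interpolation rules, so the first and third cut points may not match other software exactly.

```python
import statistics

data = [2, 4, 4, 5, 7, 8, 9, 10]   # hypothetical observations

# n=4 asks for the three cut points that split the data into quarters
q1, q2, q3 = statistics.quantiles(data, n=4)
print(q1, q2, q3)
```

The middle cut point `q2` is the median of the data.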

Q25. What is Moment?

A. Moments are quantitative measures of the shape of a set of points. Together they form a set of statistical parameters that describe a distribution. Four moments are commonly used:

  • Mean
  • Variance
  • Skewness
  • Kurtosis
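
As a rough sketch, the four quantities can be computed directly from their definitions. The function name `moments` is an assumption, and the population (divide by n) versions of the formulas are used here; sample versions apply small corrections.

```python
def moments(values):
    """Return mean, variance, skewness and kurtosis (population formulas)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5
    # Standardized third moment: 0 for perfectly symmetric data
    skewness = sum(((v - mean) / std) ** 3 for v in values) / n
    # Standardized fourth moment: about 3 for a normal distribution
    kurtosis = sum(((v - mean) / std) ** 4 for v in values) / n
    return mean, variance, skewness, kurtosis

mean, variance, skewness, kurtosis = moments([1, 2, 3, 3, 6])
print(mean, variance, skewness, kurtosis)
```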

Q26. What is the Mean of data?

A. The statistical mean is the average that is used to describe the central tendency of the data in question. It is determined by adding all the data points in a population and then dividing the total by the number of points.

x = [1 2 3 3 6];

total = sum(x) % returns 1+2+3+3+6 = 15

mean(x) % returns total/5 = 3

Q27. Define Skewness.

A. Skewness is a measure of the asymmetry of the data around the sample mean. If it is negative, the data are spread out more to the left side of the mean than to the right. If it is positive, the data are spread out more to the right.

Q28. What is Variance?

A. It describes how far the values lie from the mean. A small variance indicates that the data points tend to be very close to the mean, and to each other. A high variance indicates that the data points are very spread out from the mean, and from one another. Variance is the average of the squared distances from each point to the mean.

Q29. Define Standard Deviation.

A. In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

Q30. What is Kurtosis?

A. Kurtosis is a measure of how outlier-prone a distribution is. In other words, kurtosis identifies whether the tails of a given distribution contain extreme values.

Q31. What is meant by Covariance?

A. Covariance measures how much two variables change together. In finance, for example, it measures the directional relationship between the returns on two assets: a positive covariance means that the asset returns move together, while a negative covariance means they move inversely. Covariance can be calculated by averaging the products of each variable's deviations from its mean, or by multiplying the correlation between the two variables by the standard deviation of each variable.
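
A minimal sketch of the sample covariance follows. The return series for the two assets are invented for the example, and the `n - 1` divisor is the usual choice for a sample rather than a full population.

```python
def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Average product of deviations from each mean (n - 1 for a sample)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

returns_a = [0.02, -0.01, 0.03, 0.00]   # hypothetical returns of asset A
returns_b = [0.01, -0.02, 0.04, 0.01]   # hypothetical returns of asset B
print(covariance(returns_a, returns_b))
```

A positive result here indicates that the two hypothetical assets tend to move in the same direction.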

Q32. What is Alternative Hypothesis?

A. The alternative hypothesis (denoted by H1) is the statement that must be true if the null hypothesis is false.

Q33. Explain Significance Level.

A. The probability of rejecting the null hypothesis when it is true is called the significance level α, and very common choices are α = 0.05 and α = 0.01.

Q34. Do you know what is Binary search?

A. For binary search, the array should be arranged in ascending or descending order. In each step, the algorithm compares the search key value with the key value of the middle element of the array. If the keys match, then a matching element has been found and its index, or position, is returned. Otherwise, if the search key is less than the middle element’s key, then the algorithm repeats its action on the sub-array to the left of the middle element or, if the search key is greater, on the sub-array to the right.
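
The steps described above can be sketched as follows. This version assumes the list is sorted in ascending order and returns -1 when the key is absent; both choices are assumptions of the example.

```python
def binary_search(sorted_items, key):
    """Return the index of key in an ascending sorted list, or -1 if not found."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        middle = (low + high) // 2
        if sorted_items[middle] == key:
            return middle            # keys match: report the position
        if key < sorted_items[middle]:
            high = middle - 1        # continue in the left sub-array
        else:
            low = middle + 1         # continue in the right sub-array
    return -1

index = binary_search([1, 3, 5, 7, 9, 11], 7)
print(index)
```

Each step halves the remaining search range, so the number of comparisons grows with the logarithm of the list length.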

Q35. Explain Hash Table.

A. A hash table is a data structure used to implement an associative array, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
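
A toy version of this idea is sketched below. The class name `HashTable`, the fixed bucket count, and the use of lists to handle collisions (chaining) are assumptions for illustration; in practice Python's built-in `dict` already plays this role.

```python
class HashTable:
    """Toy hash table that resolves collisions by chaining."""

    def __init__(self, bucket_count=8):
        self.buckets = [[] for _ in range(bucket_count)]

    def _index(self, key):
        # The hash function turns the key into a bucket index
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default

table = HashTable()
table.put("mean", 3)
table.put("mode", 4)
print(table.get("mode"))
```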

Q36. What is Null Hypothesis?

A. The null hypothesis (denoted by H0) is a statement about the value of a population parameter (such as the mean), and it must contain the condition of equality and must be written with the symbol =, ≤, or ≥.

Q37. When You Are Creating A Statistical Model How Do You Prevent Over-fitting?

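A. Cross-validation is a common safeguard: it estimates how the model performs on data it was not trained on, so an over-fit model can be detected and a simpler one chosen.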

Q38. What do you mean by Cross-validation?

A. Cross-validation is a model validation technique for assessing how the results of a statistical analysis (model) will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.
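
One common variant is k-fold cross-validation, sketched below as plain index splitting. The function name `k_fold_splits` is an assumption, and real projects would usually shuffle the data first and rely on a library implementation.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_splits(10, 5))
for train, test in splits:
    print(train, test)
```

Each observation appears in exactly one test fold, so the model is always evaluated on data it did not see during training.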

Q39. What is Linear regression?

A. Linear regression is a good tool for quick predictive analysis. For example, the price of a house depends on a myriad of factors, such as its size or its location. To see the relationship between such variables, we fit a linear regression, which estimates the line of best fit between them and can help conclude whether the factors have a positive or negative relationship.
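
For a single predictor, the line of best fit can be sketched with ordinary least squares as below. The house sizes and prices are invented and deliberately lie on an exact line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

sizes = [50, 70, 90, 110]       # hypothetical house sizes
prices = [150, 190, 230, 270]   # hypothetical prices, exactly 2 * size + 50
slope, intercept = fit_line(sizes, prices)
print(slope, intercept)
```

A positive slope indicates a positive relationship between size and price.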

Q40. What are the assumptions required for linear regression?

A. There are four major assumptions:

  1. There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data
  2. The errors or residuals of the data are normally distributed and independent from each other
  3. There is minimal multicollinearity between explanatory variables
  4. Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variable.

Q41. What is Multiple Regression?

A. Multiple regression generally explains the relationship between multiple independent or predictor variables and one dependent or criterion variable.  A dependent variable is modeled as a function of several independent variables with corresponding coefficients, along with the constant term.  Multiple regression requires two or more predictor variables, and this is why it is called multiple regression.

Q42. What is a Statistical Interaction?

A. Basically, an interaction is when the effect of one factor (input variable) on the dependent variable (output variable) differs among levels of another factor.

Q43. What is an example of a data set with a non-Gaussian distribution?

A. The Gaussian distribution is part of the exponential family of distributions, but there are many more of them, often with the same sort of ease of use. If the person doing the machine learning has a solid grounding in statistics, they can be utilized where appropriate. For example, count data such as the number of events in a fixed time period is commonly modeled with a Poisson distribution rather than a Gaussian one.

Q44. Define Correlation.

A. Correlation is a statistical technique that can show whether and how strongly pairs of variables are related.

For example: height and weight are related; taller people tend to be heavier than shorter people. The relationship isn’t perfect. People of the same height vary in weight, and you can easily think of two people you know where the shorter one is heavier than the taller one. Nonetheless, the average weight of people 5’5” is less than the average weight of people 5’6”, and their average weight is less than that of people 5’7”, etc.

Correlation can tell you just how much of the variation in people’s weights is related to their heights.
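
A small sketch of the Pearson correlation coefficient for the height and weight example follows. The measurements are invented, and the function name `pearson_correlation` is an assumption.

```python
def pearson_correlation(xs, ys):
    """Pearson correlation coefficient, between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return covariation / (spread_x * spread_y)

heights = [63, 64, 66, 69, 71]        # hypothetical heights in inches
weights = [127, 121, 142, 157, 162]   # hypothetical weights in pounds
r = pearson_correlation(heights, weights)
print(round(r, 3))
```

A value near 1 indicates a strong positive relationship, a value near -1 a strong negative one, and a value near 0 little linear relationship.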

Q45. What is the primary goal of A/B Testing?

A. A/B testing refers to a statistical hypothesis test that compares two variants, A and B. The primary goal of A/B testing is to identify changes to a web page that maximize or increase the outcome of interest. A/B testing is a fantastic method for finding the most suitable online promotional and marketing strategies for the business.

Q46. What is the meaning of Statistical Power of Sensitivity?

A. The statistical power of sensitivity refers to the validation of the accuracy of a classifier, which can be Logistic, SVM, Random Forest, etc. Sensitivity is basically Correctly Predicted True Events / Total Actual True Events.
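
As a quick sketch, sensitivity can be computed from actual and predicted labels as below. The label lists are invented, with 1 marking a true event, and the function name `sensitivity` is an assumption.

```python
def sensitivity(actual, predicted):
    """Share of actual events (label 1) that the classifier also predicted as 1."""
    true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    actual_positives = sum(1 for a in actual if a == 1)
    return true_positives / actual_positives

actual = [1, 1, 1, 1, 0, 0, 0, 1]      # hypothetical ground truth
predicted = [1, 0, 1, 1, 0, 1, 0, 1]   # hypothetical classifier output
score = sensitivity(actual, predicted)
print(score)
```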

Q47. Explain Over-fitting.

A. In the case of over-fitting, the model is overly complex, for example having too many parameters relative to the number of observations. The overfit model has poor predictive performance, and it overreacts to minor fluctuations in the training data.

Q48. Explain Under-fitting

A. In the case of under-fitting, the underlying trend of the data cannot be captured by the statistical model or the machine learning algorithm. Such a model also has poor predictive performance.

Q49. What is Long Format Data?

A. In the long format, each row holds one time point per subject, so each subject has data in multiple rows. By contrast, data in the wide format can be recognized by the fact that the columns generally represent the groups.

Q50. What is Wide Format Data?

A. In the wide format, the repeated responses of the subject will fall in a single row, and each response will go in a separate column.
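
The difference between the two layouts can be sketched by reshaping a few long-format rows into wide format. The subjects, time points, scores, and the `score_t1` style column names are all made up for the example.

```python
# Long format: one row per subject per time point
long_rows = [
    {"subject": "A", "time": 1, "score": 10},
    {"subject": "A", "time": 2, "score": 12},
    {"subject": "B", "time": 1, "score": 7},
    {"subject": "B", "time": 2, "score": 9},
]

# Wide format: one row per subject, one column per repeated response
wide = {}
for row in long_rows:
    column = f"score_t{row['time']}"
    wide.setdefault(row["subject"], {})[column] = row["score"]

print(wide)
```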

Author: TheDataMonk

