Statistics Guide for Data Science Interviews

Overview
Statistics is a core component of any data scientist's toolkit. Since many commercial layers of a data science pipeline are built from statistical foundations (for example, A/B testing), knowing foundational topics of statistics is essential. This post will serve as a basic guide for core topics in statistics, with some sample problems and solutions at the end.
Properties of Random Variables
For any given random variable X, the following properties hold (below we assume X is continuous, but analogous results hold for discrete random variables).
The expectation (average value, or mean) of a random variable is given by the integral of the value of X with its probability density function (PDF):
$$\mu = E[X] = \int_{-\infty}^{\infty} xf_X(x)dx$$
and the variance is given by:
$$Var(X) = E[(X-E[X])^2] = E[X^2] - (E[X])^2$$
The variance is always non-negative, and its square root is called the standard deviation, which is heavily used in statistics.
$$\sigma = \sqrt{Var(X)} = \sqrt{E[(X-E[X])^2]} = \sqrt{E[X^2] - (E[X])^2}$$
Conditional versions of both the expectation and the variance can also be defined. For example, the conditional expectation of X, given that Y = y, is:
$$E[X|Y=y] = \int_{-\infty}^{\infty} xf_{X|Y}(x|y)dx$$
For any given random variables X and Y, the covariance, a linear measure of relationship between the two variables, is defined by the following:
$$Cov(X,Y) = E[(X-E[X])(Y-E[Y])] = E[XY] - E[X]E[Y]$$
and the normalization of covariance, represented by the Greek letter ρ, is the correlation between X and Y:
$$\rho(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X)Var(Y)}}$$
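If the integral definitions feel abstract, each of these quantities has a direct sample analogue that is easy to check numerically. Below is a minimal sketch in Python (NumPy assumed; the distributions and the linear relationship between X and Y are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Simulated pair: Y = 0.5*X + noise, so the two are linearly related
x = rng.normal(loc=1.0, scale=2.0, size=100_000)
y = 0.5 * x + rng.normal(scale=1.0, size=100_000)

mean_x = x.mean()                                   # E[X], true value 1
var_x = np.mean((x - mean_x) ** 2)                  # E[(X - E[X])^2], true value 4
var_x_alt = np.mean(x ** 2) - mean_x ** 2           # E[X^2] - (E[X])^2, same value
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # E[(X-E[X])(Y-E[Y])], true value 2
rho = cov_xy / np.sqrt(var_x * y.var())             # correlation, true value ~0.71

print(mean_x, var_x, var_x_alt, cov_xy, rho)
```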
All of these properties are commonly tested in interviews, so it helps to be able to understand the mathematical details behind each and walk through an example for each.
Law of Large Numbers
The Law of Large Numbers (LLN) states that if you sample a random variable independently a large number of times, the measured average value should converge to the random variable's true expectation. Stated more formally,
$$\bar{X}_n = \frac{X_1+...+X_n}{n} \rightarrow \mu, \text{ as } n \rightarrow \infty$$
This is important in studying the longer-term behavior of random variables over time. As an example, a coin might land on heads 5 times in a row, but over a much larger n we would expect the proportion of heads to be approximately half of the total flips. Similarly, a casino might experience a loss on any individual game, but over the long run should see a predictable profit over time.
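A quick simulation makes the convergence concrete. The sketch below (Python with NumPy; the coin example and sample sizes are illustrative choices) tracks the running average of fair-coin flips:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Running average of fair-coin flips (true mean = 0.5)
flips = rng.integers(0, 2, size=100_000)
running_mean = np.cumsum(flips) / np.arange(1, len(flips) + 1)

for n in [10, 100, 1_000, 100_000]:
    print(f"n = {n:>7}: sample mean = {running_mean[n - 1]:.4f}")
# The early averages can wander, but the later ones cluster near 0.5.
```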
Central Limit Theorem
The Central Limit Theorem (CLT) states that if you repeatedly sample a random variable a large number of times, the distribution of the sample mean will approach a normal distribution regardless of the initial distribution of the random variable.
Recall that the normal distribution takes on the form:
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
The CLT states that:
$$\bar{X}_n = \frac{X_1+...+X_n}{n} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \text{ approximately, for large } n$$
and hence
$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
The CLT provides the basis for much of hypothesis testing, which is discussed shortly. At a very basic level, you can consider the implications of this theorem for coin flipping: the probability of getting some number of heads over a large number of flips should be approximately given by a normal distribution. Whenever you're asked to reason about any particular distribution over a large sample size, whether it is binomial, Poisson, or anything else, remember to think of the CLT.
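To see the CLT in action, the following sketch standardizes the sample means of a heavily skewed (exponential) distribution and checks that they behave like N(0, 1); the distribution and the choice of n = 50 draws per sample are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Sample means of a skewed (exponential) distribution with mu = sigma = 1
n, trials = 50, 10_000
sample_means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

# Standardize: (X_bar - mu) / (sigma / sqrt(n)) should be ~ N(0, 1)
z = (sample_means - 1.0) / (1.0 / np.sqrt(n))
print(z.mean(), z.std())          # approximately 0 and 1
print(np.mean(np.abs(z) < 1.96))  # approximately 0.95
```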
Hypothesis Testing
General setup
The process of testing whether or not a sample of data supports a particular hypothesis is called hypothesis testing. Generally, hypotheses concern properties of interest for a given population, such as its parameters, like μ (for example, the mean conversion rate among a set of users).
The steps in testing a hypothesis are as follows:
- State a null hypothesis and an alternative hypothesis. Either the null hypothesis will be rejected (in favor of the alternative hypothesis) or it will fail to be rejected (although failing to reject the null hypothesis does not necessarily mean it is true, but rather that there is not sufficient evidence to reject it).
- Use a particular test statistic of the null hypothesis to calculate the corresponding p-value.
- Compare the p-value to a certain significance level α.
Since the null hypothesis typically represents a baseline (e.g., the marketing campaign did not increase conversion rates), the goal is usually to reject it with statistical significance, demonstrating that the observed effect is unlikely to be due to chance.
Hypothesis tests are either one-tailed or two-tailed tests. A one-tailed test has the following types of null and alternative hypothesis:
$$H_0: \mu= \mu_0 \text{ versus } H_1: \mu < \mu_0 \text{ or } H_1: \mu > \mu_0$$
whereas a two-tailed test has these types:
$$H_0: \mu = \mu_0 \text{ versus } H_1: \mu \ne \mu_0$$
where H_0 is the null hypothesis and H_1 is the alternative hypothesis.
Understanding hypothesis testing is the basis of A/B testing, a topic commonly covered in tech companies’ interviews. In A/B testing, various versions of a feature are shown to a sample of different users, and each variant is tested to determine if there was an uplift in core engagement metrics.
Test Statistics
A test statistic is a numerical summary designed to determine whether the null hypothesis should be rejected in favor of the alternative hypothesis. More specifically, a test statistic is constructed so that it follows a particular sampling distribution under the null hypothesis.
For example, the number of heads in a series of coin flips may be distributed as a binomial distribution, but with a large enough sample size, the sampling distribution should be approximately normally distributed. Hence, the sampling distribution for the total number of heads in a large series of coin flips would be considered normally distributed.
Several variations in test statistics and their distributions are the following:
- Z-test: assumes the test statistic follows a normal distribution under the null hypothesis
- t-test: uses a student's t-distribution rather than a normal distribution
- Chi-squared: used to assess goodness of fit, and check whether two categorical variables are independent
The Z-Test
Generally the Z-test is used when the sample size is large (to invoke the CLT) or when the population variance is known, and a t-test is used when the sample size is small and when the population variance is unknown. The Z-test for a population mean is formulated as:
$$z = \frac{\bar{x}- \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$$
in the case where the population variance is known.
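As a worked sketch of this formula (Python with NumPy and SciPy; the null value, known σ, and simulated data are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)

# H0: mu = 100 with known sigma = 15; the simulated true mean here is 103
sigma, mu_0 = 15.0, 100.0
x = rng.normal(loc=103.0, scale=sigma, size=200)

z = (x.mean() - mu_0) / (sigma / np.sqrt(len(x)))
p_value = 2 * stats.norm.sf(abs(z))  # two-tailed p-value
print(z, p_value)
```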
The t-test
The t-test is structured similarly to the Z-test but uses the sample variance s^2 in place of the population variance. The t-test is parametrized by the degrees of freedom, the number of independent pieces of information available for estimating the variance, denoted below by n-1:
$$t = \frac{\bar{x}- \mu_0}{s/\sqrt{n}} \sim t_{n-1}$$
where
$$s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}$$
As stated earlier, the t-distribution is similar in shape to the normal distribution but has heavier tails (i.e., extreme values occur with greater frequency than a normal distribution would predict), a phenomenon commonly observed in fields such as economics and earth sciences.
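The t-statistic can be computed by hand from the formulas above and cross-checked against SciPy's built-in one-sample t-test; the sample below is simulated purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)

# Small sample, population variance unknown: use a t-test of H0: mu = 5
x = rng.normal(loc=5.4, scale=2.0, size=15)

# By hand, following the formulas above
s = x.std(ddof=1)  # sample standard deviation, with n-1 in the denominator
t = (x.mean() - 5.0) / (s / np.sqrt(len(x)))
p_manual = 2 * stats.t.sf(abs(t), df=len(x) - 1)

# SciPy's built-in equivalent
t_scipy, p_scipy = stats.ttest_1samp(x, popmean=5.0)
print(t, p_manual, t_scipy, p_scipy)  # the two pairs should match
```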
Hypothesis Testing for Population Proportions
Note that, due to the CLT, the Z-test can be applied to random variables of any distribution. For example, when estimating the proportion of a population having a characteristic of interest, we can view the members of the population as Bernoulli random variables, with those having the characteristic represented by “1s” and those lacking it represented by “0s”. The sample proportion is then the sum of these Bernoulli random variables divided by the sample size n; we can compute its mean and variance and form the following set of hypotheses about the population proportion p:
$$H_0: p = p_0 \text{ versus } H_1: p \ne p_0$$
and the corresponding test statistic to conduct a Z-test would be:
$$z = \frac{\hat{p}- p_0}{\sqrt{p_0(1-p_0)/n}}$$
In practice, these test statistics form the core of A/B testing. When asked about A/B testing or related topics, you should always cite the relevant test statistic and the cause of its validity (usually the CLT).
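As a concrete sketch, the following computes the proportion Z-test by hand; the numbers are chosen to match the coin-bias question in the problem set below (550 heads in 1,000 flips, with H0: p = 0.5):

```python
import numpy as np
from scipy import stats

# Observed: 550 heads out of 1000 flips; H0: p = 0.5
n, p_hat, p_0 = 1000, 0.55, 0.5

z = (p_hat - p_0) / np.sqrt(p_0 * (1 - p_0) / n)
p_value = 2 * stats.norm.sf(abs(z))  # two-tailed
print(z, p_value)                    # z ~ 3.16, p ~ 0.0016: reject H0 at alpha = 0.05
```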
p-values and confidence intervals
Both p-values and confidence intervals are commonly covered topics during interviews. Put simply, a p-value is the probability, under the null hypothesis, of observing a test statistic at least as extreme as the one calculated. Usually, the p-value is assessed relative to some pre-determined level of significance (0.05 is often chosen).
In conducting a hypothesis test, an alpha value, the acceptable probability of rejecting a true null hypothesis, is typically chosen prior to conducting the test. Then, a confidence interval can also be calculated to assess the test statistic. This is a range of values that, across repeated samples, would contain the parameter value of interest 100(1 − α)% of the time. For instance, with α = 0.05, a 95% confidence interval would contain the true value 95% of the time. If the null-hypothesis value (e.g., 0 for a difference in means) is included in the confidence interval, then we cannot reject the null hypothesis (and vice versa).
The general form for a confidence interval for the population mean looks like the following, where the interval is centered at the sample mean and the z term is the critical value (for the standard normal distribution):
$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
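A minimal sketch of computing such an interval (the simulated data and known σ are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)

sigma = 15.0
x = rng.normal(loc=100.0, scale=sigma, size=50)

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)  # ~1.96 for a 95% interval
half_width = z_crit * sigma / np.sqrt(len(x))
print(x.mean() - half_width, x.mean() + half_width)
```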
Knowing how to explain p-values and confidence intervals, in technical and non-technical terms, is very useful during interviews, so be sure to practice these. If asked about the technical details, always remember to make sure you correctly identify the mean and variance at hand.
Type I and II errors
Two types of errors are frequently assessed: Type I error, also known as a "false positive", and Type II error, also known as a "false negative". Specifically, a Type I error is rejecting a null hypothesis when that hypothesis is correct, whereas a Type II error is failing to reject a null hypothesis when its alternative hypothesis is correct.
The α mentioned above is the probability of Type I error, i.e., the probability of rejecting a true null hypothesis, and 1-α is then referred to as the confidence level. The probability of Type II error, or a false negative, is denoted by β, and 1-β is then referred to as the power of the statistical test. Plotting sample size versus power generally reveals that a larger sample size corresponds to a larger 1-β, or power, and sample size can be selected to achieve a desired level of power. Generally, tests are set up in such a way as to have both 1-α and 1-β be relatively high (say, at 0.95 and 0.80, respectively).
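The sample-size-versus-power relationship can be sketched directly for a two-tailed Z-test. In the sketch below, the helper `power` and the effect size of 0.3 standard deviations are illustrative choices, not standard library functions:

```python
import numpy as np
from scipy import stats

# Power of a two-tailed Z-test to detect a shift of `effect` standard deviations
def power(n, effect, alpha=0.05):
    z_crit = stats.norm.ppf(1 - alpha / 2)
    # Under H1, the test statistic is shifted by effect * sqrt(n)
    shift = effect * np.sqrt(n)
    return stats.norm.sf(z_crit - shift) + stats.norm.cdf(-z_crit - shift)

for n in [10, 30, 100, 200]:
    print(n, round(power(n, effect=0.3), 3))
# Power (1 - beta) rises toward 1 as the sample size grows.
```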
Generally most interview questions concerning Type I and II errors are qualitative in nature, for instance, requesting explanations of terms or of how you would go about assessing errors/power in an experimental setup.
MLE and MAP
Probability distributions are characterized by their parameters, so fitting parameters is a crucial part of data analysis. There are two general methods for doing so. In maximum likelihood estimation (MLE), the goal is to find the parameter values that maximize the likelihood function:
$$\theta_{MLE} = \argmax_{\theta} L(\theta), \text{ where } L(\theta) = f_n(x_1...x_n|\theta)$$
Since the values of X are assumed to be i.i.d., then the likelihood function becomes the following:
$$L(\theta) = \prod_{i=1}^{n}f(x_i|\theta)$$
The natural log of L(θ) is then taken prior to calculating the maximum; since log is a monotonically increasing function, maximizing the log-likelihood log L(θ) is equivalent to maximizing the likelihood.
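As a sketch of this procedure, the following maximizes a log-likelihood numerically for simulated exponential data and compares the result to the known closed-form MLE for the exponential rate (the distribution, sample size, and helper `neg_log_likelihood` are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(seed=5)
x = rng.exponential(scale=2.0, size=1_000)  # true rate theta = 0.5

# Log-likelihood of Exponential(theta): n*log(theta) - theta*sum(x)
def neg_log_likelihood(theta):
    return -(len(x) * np.log(theta) - theta * x.sum())

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10), method="bounded")
print(result.x, 1 / x.mean())  # numeric MLE vs. closed form theta_hat = 1/x_bar
```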
Another way of fitting parameters is through maximum a posteriori estimation (MAP), which assumes a prior distribution.
$$\theta_{MAP} = \argmax_{\theta} g(\theta)f(x_1...x_n|\theta)$$
where the same log trick is again employed, and g(θ) is the prior density function of θ.
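For a concrete comparison of the two estimators, the sketch below computes the MLE and MAP estimates for a Bernoulli parameter; the Beta(2, 2) prior is an assumed illustrative choice (under a Beta prior, the MAP estimate has a closed form as the mode of the Beta posterior):

```python
# Coin flips: 7 heads out of 10. MLE vs. MAP with a Beta(a, b) prior.
heads, n = 7, 10
a, b = 2.0, 2.0  # prior pseudo-counts (an assumed choice, centered on fairness)

theta_mle = heads / n                           # maximizes the likelihood alone
theta_map = (heads + a - 1) / (n + a + b - 2)   # mode of the Beta posterior

print(theta_mle, theta_map)  # 0.7 vs. ~0.667: the prior pulls the estimate toward 0.5
```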
Both MLE and MAP are especially relevant in statistics and machine learning, and knowing these is recommended, especially for more technical interviews. For instance, a common question in such interviews is to derive the MLE for a particular probability distribution. Thus, understanding the above steps, along with the details of the relevant probability distributions, is crucial.
Sample Interview Questions
Easy Problems:
- Say you flip a coin 10 times and observe only one heads. What would be your null hypothesis and p-value for testing whether the coin is fair or not?
- Describe what Type I and Type II errors are, and the tradeoffs between them.
- Explain the statistical background behind power.
- What is a Z-test and when would you use it versus a t-test?
- Say you are testing hundreds of hypotheses, each with a t-test. What considerations would you take into account when doing this?
Medium Problems:
- A coin was flipped 1000 times, and 550 times it showed heads. Do you think the coin is biased? Why or why not?
- You are drawing from a normally distributed random variable X ~ N(0, 1) once a day. What is the approximate expected number of days until you get a value greater than 2?
- Say we have X ~ Uniform(0, 1) and Y ~ Uniform(0, 1) and the two are independent. What is the expected value of the minimum of X and Y?
- How many cards would you expect to draw from a standard deck before seeing the first ace?
- Say you draw n samples from a uniform distribution U(a, b). What are the MLE estimates of a and b?
Hard Problems:
- What are MLE and MAP? What is the difference between the two?
- Say you are given a random Bernoulli trial generator. How would you generate values from a standard normal distribution?
- Say we have a random variable X ~ D, where D is an arbitrary distribution. What is the distribution of F(X), where F is the CDF of X?
- How do you uniformly sample points at random from a circle with radius R?
- Say you continually sample from some i.i.d. uniformly distributed (0, 1) random variables until the sum of the variables exceeds 1. How many samples do you expect to make?
Sample Selected Solutions
Easy:
The primary consideration is that, as the number of tests increases, the chance that at least one p-value appears statistically significant due to chance alone becomes very high. As an example, with 100 tests performed and a significance threshold of α = 0.05, you would expect 5 of the experiments to be statistically significant due to chance alone. Therefore, the chance of incorrectly rejecting a null hypothesis (i.e., committing a Type I error) increases.
To correct for this effect, we can use a method called the Bonferroni correction, wherein we set the significance threshold to α/m, where m is the number of tests being performed. In the above scenario having 100 tests, we can set the significance threshold to instead be 0.05/100 = 0.0005. While this correction helps to protect from Type I error, it is still prone to Type II error (i.e., failing to reject the null hypothesis when it should be rejected). In general, the Bonferroni correction is mostly useful when there is a smaller number of multiple comparisons of which a few are significant. If the number becomes sufficiently high that many tests yield statistically significant results, the number of Type II errors may also increase significantly.
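A small simulation (illustrative parameters) shows the effect: with 100 true null hypotheses, whose p-values are uniform on [0, 1], the naive threshold typically produces a few false positives, while the Bonferroni-corrected threshold usually produces none:

```python
import numpy as np

rng = np.random.default_rng(seed=6)

# 100 p-values, all from true null hypotheses (uniform on [0, 1])
m, alpha = 100, 0.05
p_values = rng.uniform(size=m)

naive_rejections = np.sum(p_values < alpha)           # ~5 false positives expected
bonferroni_rejections = np.sum(p_values < alpha / m)  # usually 0
print(naive_rejections, bonferroni_rejections)
```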
Medium:
Let Z = min(X, Y). Then we know the following:
$$P(Z \le z) = P(\min(X, Y) \le z) = 1 - P(X > z, Y > z)$$
For a uniform distribution, the following is true for a value of z between 0 and 1:
$$P(X > z) = 1-z \space \text{ and } \space P(Y>z) = 1 - z$$
Since X and Y are i.i.d., this yields:
$$P(Z \le z) = 1 - P(X > z, Y > z) = 1 - (1-z)^2$$
Now we have the cumulative distribution function for z. We can get the probability density function by taking the derivative of the CDF to obtain the following:
$$f_Z(z)= 2(1-z)$$
Then, solving for the expected value by taking the integral yields the following:
$$E[Z] = \int_{0}^{1} zf_Z(z)dz = 2\int_{0}^{1} z(1-z)dz = 2\left(\frac{1}{2}-\frac{1}{3}\right) = \frac{1}{3}$$
Therefore, the expected value for the minimum of X and Y is 1/3.
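This result is easy to verify with a quick Monte Carlo simulation (the sample size is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Monte Carlo check of E[min(X, Y)] for independent Uniform(0, 1) draws
x = rng.uniform(size=1_000_000)
y = rng.uniform(size=1_000_000)
print(np.minimum(x, y).mean())  # ~1/3
```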
Hard:
MLE stands for maximum likelihood estimation and MAP for maximum a posteriori. Both are methods for estimating the parameters of a probability distribution by producing a single point estimate.
Assume that we have a likelihood function P(X|θ). Given n i.i.d. samples, the MLE is as follows:
$$MLE(\theta) = \max_\theta P(X|\theta) = \max_\theta \prod_{i=1}^{n} P(x_i|\theta)$$
Since the product of multiple numbers all valued between 0 and 1 might be very small, maximizing the log function of the product above is more convenient. This is an equivalent problem since the log function is monotonically increasing. Since the log of a product is equivalent to the sum of logs, the MLE becomes the following:
$$MLE_{log}(\theta) = \max_\theta \sum_{i=1}^{n} \log P(x_i|\theta)$$
Relying on Bayes' rule, MAP uses the fact that the posterior P(θ|X) is proportional to the likelihood multiplied by a prior P(θ), i.e., to P(X|θ)P(θ). The MAP estimate of θ is thus the following:
$$MAP(\theta) = \max_\theta P(X|\theta)P(\theta) = \max_\theta P(\theta)\prod_{i=1}^{n} P(x_i|\theta)$$
Employing the same math as used in calculating the MLE, the MAP becomes:
$$MAP_{log}(\theta) = \max_\theta \sum_{i=1}^{n} \log P(x_i|\theta) + \log P(\theta)$$
Therefore, the only difference between the MLE and MAP is the inclusion of the prior in MAP; otherwise, the two are identical. Moreover, MLE can be seen as a special case of the MAP with a uniform prior.
Are you interviewing for data science jobs or are you trying to hone your data science skills? Check out our newsletter, Data Science Prep, to get data science interview questions straight to your inbox.