Business Data Science - Experiments // TheStatsDude's blog

As a statistician whose PhD work was concentrated on computational statistics, my expertise and mindset has gravitated towards how we can develope novel algorithms to enable faster estimations for statistical models over the years. Though I very much enjoyed reading books and papers about topics like variance reduction techniques for Monte Carlo simulations, Markov Chain Monte Carlo, approximate Bayesian computation, I always believe the prominence of my research builds upon the assumption that those statistical models are useful to the broad scientific community to be worth being computed faster.

Now working as a statistician in industry, I still read computational statistics work for fun at times but my mission has switched to how I can effectively extract information from data to better quantify the relationships between variables and to drive better business decisions. Therefore, the ability to use data to understand if certain product changes can better user experience, higher revenues, etc becomes critical. Throughout my statistics education from college to grad school, I’ve got exposed to a good portfolio of statistcal methods, including but not limited to two-sample tests, analysis of variance, linear models, generalized linear models and time series modeling, but a more systematic knowledge of these methods and even a philosophical understanding of causal inference becomes necessary. To this end, I plan to read the following books in the years of 2022

Business Data Science - by Matt Taddy
Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction - by Donald Rubin and Guido Imbens
Data Analysis Using Regression and Multilevel/Hierarchical Models - by Andrew Gelman and Jennifer Hill

As a starting point, I’d like to summarize my learning from reading through the Chapter 5 (Experiments) of the book Business Data Science during my Christmas holiday.

Why experiments?

Although correlation might be all that we need in some scenarios (for example, we might recommend movie to those who watched movie without recognizing the plausible underlying fact that these preferences can be caused by a third unseen signal, which is they all have very young kids who like cartoon musicals), and for many real-world business problems it is critical for us to know if/how an product change or price move alone can affect the outcome of interest (e.g., customer behavior, revenue, etc.). In scientific research, experiments are used for measuring the effect of an action is an experiment as we are able to manipulate the treatment variable independently from all other upstream influences to assess its impact on the outcome of interest. In the sections below, I shall first cover the gold-standard for establish causal statements, randomized controlled trials, followed by several near experimental methods, including diff-in-diff analysis, regression discontinuity and instrumental variables. In particular, I plan to leave the summary of instrumental variable method in a separate blog due to its intricacy.

Randomized Controlled Trials (RCTs)

The RCTs are arguably the gold standard among various possible forms of experiments as we can assign the treatment to subjects from the population of interest in a completely random manner to eliminate any dependence between the treatment variable and all other upstream influences.

Estimation and Inference of average treatment effect

Under RCT, it is easy to estimate the (ATE) as the mean difference between the groups (indicated by the treatment indicator) under comparison \(ATE = \mathbb{E}(y|d=1) - \mathbb{E}(y|d=0)\) by

\[ \widehat{ATE} = \bar{y}_{1} - \bar{y}_{0} \]

where \(\bar{y}_{k} = \sum_{i: d_{i} = k} y_{i}\), \(k = 0, 1\), corresponds the sample means for two groups.

Under the assumptions of experimental units being independent from each other, we can estimate the variance of our estimator as

\[ Var(\bar{y}_{1} - \bar{y}_{0}) = Var(\bar{y}_{1}) + Var(\bar{y}_{0}) = \frac{1}{n_{0}} \frac{\sum_{i:d_{i}=0}(y_{i}-\bar{y}_{0})^2}{n_{0}-1} + \frac{1}{n_{1}} \frac{\sum_{i:d_{i}=1}(y_{i}-\bar{y}_{1})^2}{n_{1}-1} \] and therefore construct the \((1-\alpha)\%\) confidence interval as

\[ (\bar{y}_{1} - \bar{y}_{0}) \pm z_{\alpha} * \sqrt{(Var(\bar{y}_{1} - \bar{y}_{0}))} \]

where \(z_{\alpha}\) denotes the upper-\(\alpha\) quantile of a standard normal distribution.

Practical knowledge

The RCTs may appear to be easy conceptually, but one must be very aware of many practical considerations for the successful practice. Although we have to acknowledge perfect experiments are rare, this does not imply that we will just stay there and do nothing.

Checking randomization

Being able to randomly assign the treatment to individuals serves as the fundamental building block for RCTs, but it is difficult to make sure that treatment is truly randomly assigned to subjects in practice. A common practice for online platforms is to first run ``AA’’ tests (both groups receive the same treatment) and then conduct further investigations if the result turns out to be statistically significant.

In some cases, the root cause of imperfect randomization might be the selection bias along known characteristics - it is important to include those characteristics as covariates in the analysis in order to get an unbiased estimate of ATE.

Dependence between subjects

The dependence between subjects mainly affect the uncertainty estimates and three commonly used techniques include

Aggregate the data to enable treatment effect estimation at higher level under which the assumption of independent errors holds.
Blocked bootstrap to resample at higher level under which the assumption of independent errors holds.
Huber-White heteroskedastic consistent variance estimator.

Unrepresentative samples

Weighting is typically used to map from the sample of individuals in the experiment to the future treatment population if we know certain subpopulations are harder to be sampled.

Improving sensitivity

It is always desirable to achieve trustworthy results with less data to accelerate innovations, and below are two possible directions to improve sensitivity in experiments

Designing more sensitive metric (e.g., https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/beyond-power-analysis-metric-sensitivity-in-a-b-tests/)
Variance reduction for the mean estimator (e.g., CUPED technique)

Diff-in-diff analysis

The diff-in-diff analysis applies when you have two groups whose pretreatment differences can be isolated and modeled so that post-treatment differences are the basis for causal estimates of the treatment effect. The key assumptions here include

Treatment independence
After taking into account the pretreatment difference, the remaining gap is not because of external shocks.

A famous class of questions that can be solved by diff-in-diff analysis is the optimization of search engine marketing - it is critical for search engines / advertisers to understand the effect of paid search advertising. However, this is not an easy problem for the following reasons -

Difficulty in isolating the effect of paid search. Typically, the results from large companies show up in the organic results anyway. How do we know if the user would have clicked the organic results in the absence of paid search?
Can’t randomize the treatment status - Google will only display the additional ads link only if Google believes the possibility of being clicked is higher. Therefore, a simple comparison on the pages with ads and without ads is problematic due to the unequal baseline.

To tackle this problem, Blake et al. managed to convince the powers at eBay to run a large-scale experiment where SEM was turned off for a portion of users - eBay decided to stop biddihng on any AdWords for 65 of 210 ``designated market areas’’ (DMA) in the US for about eight weeks in 2012. The model they use to estimate the effect is linear model as follows

\[ \mathbb{E}(y_{it}) = \alpha + \beta_{d} d_{i} + \beta_{t} t + \gamma d_{i} t \] and

\[ \mathbb{E}(y_{it}) = \alpha_{i} + \beta_{d} d_{i} + \beta_{t} t + \gamma d_{i} t \] where \(i\) is the index of DMA, \(d_{i}=1\) indicates the corresponding DMA is on the treatment group, and \(t = 0, 1\) denotes the pre and post treatment era, respectively. Note \(\gamma\) here represents the treatment effect of interest. It is worth noting that the difference between the above two methods is that the latter takes the baseline difference across DMAs into account and hence makes the assumption of independent errors more plausible.

Regression discontinuity (RD)

Regression discontinuity (RD) takes advantage of another common near-experimental design where the treatment assignment is determined by a threshold on some ``forcing variable’’, and subjects that are close to the threshold, on either side, are comparable for causal estimation purposes. The nice thing about a RD is that we know the only confounding variable that we need to control for is the the forcing variable.

One famous class of problems that can be analyzed using RD is digital marketing under which the ad display depends upon whether the highest ad auction ``score’’ passes a reserve threshold. In strict RD, the treatment is determined entirely by the forcing variable so we just need to control for that variable. The key assumption underlying the RDS analysis is the continuity assumption, that is, if the threshold were to move slightly, subjects switching treatment groups will behave similarly to those near them in their treatment group. In other words, we can look at the relationship between the forcing variable and the response on one side of the threshold and extrapolate to the other side.

Using the potential outcomes notation, let \(y_{i}(d)\) denotes the potential outcome for the \(i\)-th customer under treatment \(d\). Given the actual realized treatment group \(d_{i}\), we have \(y_{i}(d_{i})\) as its observed response. Consider the forcing variable \(r_{i}\), the location of which relative to the treatment threshold determines the treatment status. Limiting the question here to binary treatments, the treatment allocation with forcing variable \(r_{i}\) is then

\[ d_{i} = I(r_{i} > 0) \]

By construction, we have treatment allocation is independent of the response given \(r_{i}\)

\[ [y_{i}(0), y_{i}(1)] \perp d_{i} | r_{i} \] Therefore, the important continuity assumption can be written as follows

\[ \mathbb{E}[y_{i}(d) | r = -\epsilon] \approx \mathbb{E}[y_{i}(d) | r = 0] \approx \mathbb{E}[y_{i}(d) | r = \epsilon] \] where \(\epsilon\) is a small number that specifies the range within which the continuity assumption holds.

Given a fixed \(\epsilon\), we filter out those data with \(r_{i}\) falling outside \([-\epsilon, \epsilon]\), and then fit an ordinary least squares line to the remaining data under the following model specification

\[ \mathbb{E}(y_{i}) = \alpha + \gamma I(r_{i} > 0) + r_{i} (\beta_{0} + \beta_{1} I(r_{i} > 0)) \] Note that the treatment effect is parameterized by \(\gamma\) in the model above, and we are only able to learn about the treatment effect at the threshold. As any RD analysis is sensitive to the size of the window size \(\epsilon\), it is advisable to try a range of window sizes and assess the stability of the results within a range of plausible values.