10  Frameworks, ‘models’, inference, tests

  • Willem’s blog and vignettes

  • Reinstein notes and beyond

    • This should not include extensive discussion of causality (which comes later), only an introduction to regression as a 'fitted descriptive line/plane'
  • Lakens on statistical inference

  • McElreath

10.1 Types and goals of 'statistical models'

As discussed in our posts and linked material, we broadly imagine three categories of modeling (see discussion in the fold/footnote):

  1. Descriptive modeling: Essentially, offering a “dimension reduction” of the data, presenting 'which features relate to donation behavior?'
  2. Predictive modeling: Training a model to produce the best out-of-sample (or out-of-time) fit for current (or future) donations based on individual characteristics. This model should not include 'leaks', i.e., it should exclude individual characteristics and outcomes that occur after the time we would be 'doing the prediction'. (A minimal sketch follows this list.)
  3. Causal: Consider identification of 'causal paths', i.e., 'what actually determines amounts donated'.1
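
For concreteness, here is a minimal sketch of the 'predictive' case: an out-of-time fit evaluated on later survey waves, using only predictors known before the prediction point. The data frame svy and its columns are hypothetical, purely for illustration.

Code
# Minimal out-of-time prediction sketch; `svy` and its columns are hypothetical
library(dplyr)

train <- svy %>% filter(year <= 2019)  # fit only on earlier waves
test  <- svy %>% filter(year >= 2020)  # evaluate on later waves

# Only predictors known *before* the prediction point; no post-outcome 'leaks'
fit <- lm(donation_usd ~ age + income_k + gwwc_pledge, data = train)

# Out-of-sample (out-of-time) fit: RMSE on the held-out later waves
sqrt(mean((test$donation_usd - predict(fit, newdata = test))^2))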

10.2 Discussion: statistics, probability and inference

Frequentist, Bayesian, ‘randomization inference’, ‘likelihood-ist’

DR: I am not saying this should be a major focus. We probably don’t want to get too deep here. However, if we do end up discussing these issues, I propose we put it or link it here.

10.3 Hypothesis testing, statistical comparisons and inferences

10.3.1 ‘Common statistical tests are linear models’

Many of the ‘univariate’ tests presented below can be extended to multiple-variable models (e.g., regression coefficients).

Oska Fentem provides further discussion, examples, and tables comparing these statistics in his Notion, here.

10.4 Randomization and permutation-based tests and inference2

Basic description: (still looking for the best source for this)

  • Discussion of the difference between randomization inference and bootstrapping here

Bootstrapped p-values are about uncertainty over the specific sample of the population you drew, while randomization inference p-values are about uncertainty over which units within your sample are assigned to the treatment.

The infer package vignette gives a good walk-through; this package is useful for doing these tests (some also recommend the coin package).3
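
As a rough illustration of the infer workflow (not the EAS code itself): a permutation test for a difference in medians between two groups, using a hypothetical data frame don with a grouping column wave ('planned' vs 'actual') and an amount column.

Code
# Permutation ('randomization') test for a difference in medians with infer;
# the data frame `don` and its columns are hypothetical
library(dplyr)
library(infer)

obs_diff <- don %>%
  specify(amount ~ wave) %>%
  calculate(stat = "diff in medians", order = c("planned", "actual"))

null_dist <- don %>%
  specify(amount ~ wave) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%  # reshuffle the group labels
  calculate(stat = "diff in medians", order = c("planned", "actual"))

null_dist %>% get_p_value(obs_stat = obs_diff, direction = "two-sided")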


We use this, and give some explanation, in the 2021 EAS donations post; see the bookdown (and the folds within).

We use permutation tests to test whether the median (and mean) of planned donations exceeded or fell short of that of actual donations, using data from different years' surveys (which do not link the same individuals). …

The code we use for these tests is permalinked here


Why use these techniques?

  • Tractable for testing differences in medians

  • Fairly easy to explain, fairly intuitive

  • Do not depend on strong assumptions about underlying distributions or ‘large sample asymptotics’

  • Some statisticians and econometricians (e.g., Athey and Imbens) argue for their value and robustness; it also seems close to what a lot of 'data science' people do (they love simulation)

DR concerns about ‘population vs super-population’, possibly misinformed:

… we nearly always want to make inferences about the population that the treatment and control groups are taken from (even thinking about a hypothetical super-population), not about the impact on the sampled groups themselves. So, with this in mind, when would I still want to use randomization inference?


10.5 Particular ‘experimetrics’ issues

Should we include controls (covariates) in analyzing ‘treatment effects’ from randomized experiments?

“In the conventional sampling paradigm… controlling for observable heterogeneity using a regression model is required for the assumptions to be justified with this approach. With the randomisation approach it makes more sense to put data into strata by covariates, analyse within-group experiments and average results.” - (?) Athey and Imbens
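
A minimal sketch of this stratify-then-average idea: estimate the difference in means within each covariate stratum, then average the stratum estimates, weighting by stratum size. The data frame exp_dat and its columns (treat, donated, age_band) are hypothetical.

Code
# Post-stratified treatment effect; `exp_dat` and its columns are hypothetical
library(dplyr)

strata_effects <- exp_dat %>%
  group_by(age_band) %>%
  summarize(
    effect = mean(donated[treat == 1]) - mean(donated[treat == 0]),
    n      = n()
  )

strata_effects %>% summarize(ate = weighted.mean(effect, w = n))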


11 Equivalence tests and Bayesian approaches

11.1 Equivalence tests and simple frequentist approaches

From Wikipedia (accessed 12 Apr 2022)

In equivalence tests, the null hypothesis is defined as an effect large enough to be deemed interesting, specified by an equivalence bound. The alternative hypothesis is any effect that is less extreme than said equivalence bound. The observed data is statistically compared against the equivalence bounds. If the statistical test indicates the observed data is surprising, assuming that true effects are at least as extreme as the equivalence bounds, a Neyman-Pearson approach to statistical inferences can be used to reject effect sizes larger than the equivalence bounds with a pre-specified Type 1 error rate.

Several tests exist for equivalence analyses; however, more recently the Two-one-sided t-tests (TOST) procedure has been garnering considerable attention. As outlined below, this approach is an adaptation of the widely known t-test.

“A very simple equivalence testing approach is the 'two one-sided t-tests' (TOST) procedure [Schuirmann, 1987]. In the TOST procedure an upper (\(\Delta_U\)) and lower (\(-\Delta_L\)) equivalence bound is specified based on the smallest effect size of interest (e.g., a positive or negative difference of \(d = 0.3\)). Two composite null hypotheses are tested: \(H_{01}: \Delta \leq -\Delta_L\) and \(H_{02}: \Delta \geq \Delta_U\). When both these one-sided tests can be statistically rejected, we can conclude that \(-\Delta_L < \Delta < \Delta_U\), or that the observed effect falls within the equivalence bounds … [Seaman et al., 1998]” [Lakens 2017]

Lakens offers pointers, and the TOSTER package, for this approach.
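
To fix ideas, here is a minimal, hand-rolled version of the TOST logic using two one-sided t-tests against raw-unit equivalence bounds of ±5; the data and the bound are made up, and TOSTER automates this (including standardized bounds and plots).

Code
# TOST ('two one-sided tests') by hand; data and equivalence bound are hypothetical
set.seed(1)
x <- rnorm(100, mean = 50, sd = 20)  # 'treatment' outcomes
y <- rnorm(100, mean = 51, sd = 20)  # 'control' outcomes
bound <- 5                           # smallest difference of interest (raw units)

p_lower <- t.test(x, y, mu = -bound, alternative = "greater")$p.value  # H01: diff <= -bound
p_upper <- t.test(x, y, mu =  bound, alternative = "less")$p.value     # H02: diff >= +bound

max(p_lower, p_upper)  # TOST p-value: if below alpha, reject both nulls and conclude equivalence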

Bayes factors

Semi-aside: Unresolved discussion of ‘Bayes factors’ (Nik and Reinstein)

Clear DataColada argument against the use of ‘Bayes Factors’ for ‘accepting the null’

11.2 Some simple Bayesian testing examples

We return to consider a very simple and common case of interest:

  • We have a population divided into two treatment conditions (e.g., ‘Impact’ vs ‘Emotion’ language in an email donation appeal)

  • We wish to focus on a binary outcome (e.g., ‘whether clicked to donate’), which may be rare in each treatment.

    • We want to understand the impact of the treatment condition on the binary outcome, putting probability bounds on our belief about this


This is a classic example for Bayesian inference. Some good discussions and vignettes:

Discussions and walk-throughs


Ready packages

  • bayesAB is easy to apply, allows flexible specification of priors (and has tools to help you visualize these), and generates very clear output and graphs. See the vignette (and the sketch after this list)
    • It (seems to) save the simulation data for you to summarize as you like
    • Limitations: Requires raw data for each treatment (not just summarized data), cannot address more than two groups, does not (?) share the ‘raw code doing the work’ for tweaking
  • BayesianFirstAid::bayes.prop.test is also easy to use (see vignette), and it works with either vectors or counts.
    • It uses a uniform prior, but you can get it to spit out the rjags code and then adjust this
    • … or adjust to multiple groups
    • The plots it gives you by default are pretty messy, but it also preserves the simulation data, so you could make your own plots
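
A minimal bayesAB sketch for the two-arm binary-outcome case described above; the 0/1 vectors here are simulated, but with real data you would pass the raw per-individual outcome vectors for each arm.

Code
# bayesAB sketch for a rare binary outcome in two arms; data are simulated
library(bayesAB)

set.seed(7)
impact  <- rbinom(1000, 1, 0.030)  # hypothetical 'Impact' arm: clicked to donate (0/1)
emotion <- rbinom(1000, 1, 0.022)  # hypothetical 'Emotion' arm

ab <- bayesTest(impact, emotion,
                priors = c("alpha" = 1, "beta" = 1),  # Beta(1, 1) prior on each arm's rate
                distribution = "bernoulli")

summary(ab)  # includes P(Impact > Emotion) and credible intervals for the lift
plot(ab)     # priors, posteriors, and the posterior of the (relative) lift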

11.3 A single framework? “Significance and equivalence testing” with randomization inference/simulation; building to Bayes

This is an outline/discussion. I (DR) try to implement and demonstrate this in eamt_data_analysis/oftw_giving_season_upsell_2021_trial/oftw_upsell_input_first_analysis.Rmd

Note: there may be good ways to hijack all sorts of existing tools, such as the bayesAB package or BayesianFirstAid

Suppose we see a ‘small difference’ between treatment groups and it is ‘not significant in standard tests’ (tests not shown here yet).

How can we:

  • Put meaningful bounds on this?

  • Statistically 'rule out large effects'?

  • Do this in the same context as our 'null hypothesis tests'?

  • Do and communicate this in a way that is acceptable to a range of audiences?

(This parallels the analysis done HERE, which includes some further explanation of the methods.)


I (David) propose taking the following approach (a rough simulation sketch follows at the end of this subsection):

  1. Construct (by simulation or analytical formulae) the probability of 'some function of our data' given a range of true parameters, i.e., given a range of relevant 'true rates of (relative) incidence' under each treatment (1/2, 3/4, equal, 4/3, 2, etc.). This includes (and perhaps focuses on) the case where 'the true rates of incidence are equal to one another and equal to the average empirical incidence in our sample'.

What “function of our data”?

  • The exact proportional difference in incidence rates that we see

  • A proportional difference in incidence rates as large as we see (in absolute value) or larger (I believe this is the ‘two-tailed frequentist p-value’)

  • A proportional difference in incidence rates as large as we see (in absolute value) or smaller (this gives us a sense of ‘how unlikely is a large true effect … can we probabilistically rule out a big difference’)

  • A proportional difference in incidence rates as large or larger in favor of the Treatment

  • A proportional difference in incidence rates as large or larger in favor of the Control

  • … Perhaps similar measures for other statistics such as 'difference in counts (not proportional)' or 'average amount (donated)' (for the latter, we'd need to consider distinct true distributions of contributions)

  2. Plot the above over the parameter range to get a visual picture of the maximum likelihood parameter, and (eyeballing) the relative probability of different ranges

  3. For a few important elements of the above, consider the 'relative likelihood of the data' under different true rates of incidence (or distributions of donations), for important comparisons such as

  • A relative incidence of 1.5 versus a relative incidence of 1

… If the ratio is very small we might suspect that 'no difference' is far more likely than a strong difference in favor of the treatment.

  4. Implicitly considering a 'flat prior', integrate (average) and compare important ranges of the above

E.g….

  • the probability mass on the parameter range 'the incidence is 1.25x or higher in favor of the Treatment', versus 1.25x or lower (closer to zero, or higher incidence under the Control)…

  • … If the ratio is very small, then, given a fairly flat prior, our posterior should put a low probability on ‘a large difference in favor of the treatment’ …

  • And similarly, reversing the control and treatment

  • And similarly, for both extremes relative to a middle incidence range…

  • … here, if the ratio is very small, we will say “we can probabilistically rule out an effect of 1.5x or larger in either direction”

Note:

Jamie:

“To put it simply, likelihood is 'the likelihood of \(\theta\) having generated \(D\)' and the posterior is essentially 'the likelihood of \(\theta\) having generated \(D\)' further multiplied by the prior distribution of \(\theta\). If the prior distribution is flat (or non-informative), the likelihood is exactly the same as the posterior.”

So in essence, if you wanted, you could just put a prior distribution over the differences for which you have simulated likelihoods, and multiply it with your likelihood. To get a proper posterior you would need a continuous distribution for the prior over the parameter space, and probably more simulations across the parameter space as well, to fill in the shape of the posterior. I guess the key difference between what you are suggesting and just running it as a Bayesian binomial regression is that your likelihood function is produced by simulation, whereas the one in the Bayesian regression would be generated through the model assumptions of the regression.

  5. Optionally, repeat the above with a Bayesian tool, considering more than one 'explicit prior'.

Ongoing question: Do we need or want a separate prior for ‘incidence under treatment’ and ‘incidence under control’, or can our prior simply focus on the relative incidence rates?

Jamie suggests: Just do a Bayesian analysis and consider the posterior distribution. You can test the sensitivity to different priors.
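
Here is a rough simulation sketch of steps 1 and 4 above, with made-up counts (30/1000 incidences under Treatment, 24/1000 under Control): over a grid of true relative incidence rates, compute both the likelihood of the exact observed counts and the simulated probability of a proportional difference at least as large as observed, then compare (flat-prior) mass across parameter ranges.

Code
# Likelihood over a grid of true relative incidence rates; counts are hypothetical
set.seed(123)
n_t <- 1000; x_t <- 30   # Treatment: trials and incidences
n_c <- 1000; x_c <- 24   # Control: trials and incidences
obs_ratio <- (x_t / n_t) / (x_c / n_c)  # observed relative incidence (1.25)
base_rate <- (x_t + x_c) / (n_t + n_c)  # pooled incidence, held fixed across scenarios

grid <- c(1/2, 3/4, 1, 4/3, 3/2, 2)     # candidate true relative incidence rates
res <- sapply(grid, function(true_ratio) {
  # true rates with the given ratio, keeping their average at base_rate
  p_c <- 2 * base_rate / (1 + true_ratio)
  p_t <- true_ratio * p_c
  # (a) likelihood of the exact observed counts under these true rates
  lik <- dbinom(x_t, n_t, p_t) * dbinom(x_c, n_c, p_c)
  # (b) simulated probability of a relative difference at least as large as observed
  r_t <- rbinom(1e4, n_t, p_t) / n_t
  r_c <- rbinom(1e4, n_c, p_c) / n_c
  c(likelihood = lik, p_as_large = mean(r_t / r_c >= obs_ratio, na.rm = TRUE))
})
data.frame(true_ratio = grid, t(res))

# Treating the grid as a flat prior, compare relative (posterior) mass, e.g.
# 'true ratio of 3/2 or larger' versus 'true ratio of 1 or smaller':
sum(res["likelihood", grid >= 3/2]) / sum(res["likelihood", grid <= 1])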

11.4 ‘Empirical Bayes’

Reference: David Robinson, “Introduction to Empirical Bayes: Examples from Baseball Statistics”, downloaded 28 Mar 2022

Empirical Bayes estimation, where a beta distribution fit on all observations is then used to improve each individually. What’s great about this method is that as long as you have a lot of examples, you don’t need to bring in prior expectations.

Some data cleaning

Code
library(pacman)
p_load(dplyr, tidyr, Lahman)

# Filter out pitchers and aggregate each batter's career hits and at-bats
career <- Batting %>%
  filter(AB > 0) %>%
  anti_join(Pitching, by = "playerID") %>%
  group_by(playerID) %>%
  summarize(H = sum(H), AB = sum(AB)) %>%
  mutate(average = H / AB)

# Include names along with the player IDs
# (in more recent versions of the Lahman package, `Master` is called `People`)
career <- Master %>%
  as_tibble() %>%
  dplyr::select(playerID, nameFirst, nameLast) %>%
  unite(name, nameFirst, nameLast, sep = " ") %>%
  inner_join(career, by = "playerID") %>%
  dplyr::select(-playerID)

career_filtered <- career %>%
  filter(AB > 100) # Robinson uses 500; I think 100 is enough

11.5 Step 1: Estimate a [Beta] prior from all your data

We set up a likelihood function… the probability of the data given different parameters. It sums the log of 'how likely is each row (batter), given that batting averages are drawn from a particular true beta distribution, and hits are then drawn binomially from these'. At-bats are taken as given (a constant), I guess.

Code
library(stats4)
p_load(VGAM)

# Negative log-likelihood of all batters' hit counts, assuming true batting
# averages are drawn from a Beta(alpha, beta) distribution and hits are then
# binomial given the at-bats -- i.e., a beta-binomial likelihood
ll <- function(alpha, beta) {
  x <- career_filtered$H
  total <- career_filtered$AB
  # Note: this returns the *negative* log-likelihood, as required by stats4::mle()
  -sum(VGAM::dbetabinom.ab(x, total, alpha, beta, log = TRUE))
}

Next we search for the 'most likely parameters' … the alpha and beta that maximize the likelihood above (by minimizing the negative log-likelihood with stats4::mle, using the L-BFGS-B algorithm).

Code
# Maximum likelihood estimates of the Beta prior's parameters
m <- mle(ll, start = list(alpha = 1, beta = 10), method = "L-BFGS-B",
         lower = c(0.0001, .1))
ab <- coef(m)
alpha0 <- ab[1]
beta0 <- ab[2]

11.6 Step 2: Use that distribution as a prior for each individual estimate

With a Beta prior and a Binomial likelihood, the posterior for each player's true average is again a Beta distribution: \(\text{Beta}(\alpha_0 + H, \beta_0 + AB - H)\). Its mean gives a simple updating rule: take the numerator (hits) and add \(\alpha_0\); take the denominator (at-bats) and add \(\alpha_0 + \beta_0\).

So if someone got 30 hits out of 100, our (posterior mean) estimate of their true 'data-generating' batting average is \(\frac{30+\alpha_0}{100+\alpha_0+\beta_0}\)

We can do this across all the data, and find the ‘best and worst’ batters:

Code
career_eb <- career %>%
  mutate(eb_estimate = (H + alpha0) / (AB + alpha0 + beta0))


career_eb %>% arrange(eb_estimate) %>% head(5)
name                 H     AB   average   eb_estimate
Bill Bergen        516   3028   0.17          0.178
Ray Oyler          221   1265   0.175         0.191
John Vukovich       90    559   0.161         0.195
John Humphries      52    364   0.143         0.195
George Baker        74    474   0.156         0.196
Code
career_eb %>% arrange(-eb_estimate) %>% head(5)
name                     H     AB   average   eb_estimate
Rogers Hornsby        2930   8173   0.358         0.355
Shoeless Joe Jackson  1772   4981   0.356         0.35
Ed Delahanty          2597   7510   0.346         0.342
Billy Hamilton        2164   6283   0.344         0.34
Willie Keeler         2932   8591   0.341         0.338

As noted (and plotted) in the book, this method tends to ‘shrink’ the averages towards the overall midpoint. The lowest averages are raised a bit, and the highest averages are reduced a bit … with more ‘shrinkage’ where there are fewer observations (at-bats).

12 Factor analysis, dimension reduction, and ‘carving reality at its joints’

Integrate…

Willem’s work link: ‘How to Science’

Code
knitr::include_url("https://willemsleegers.github.io/how-to-science/chapters/statistics/factor-analysis/EFA.html")

  1. However, causal inference requires strong assumptions and/or exogenous and random shifts in potential determinants of donation behavior.↩︎

  2. Thought: the simulations done in randomization inference may overlap the simulations done in computing likelihoods as an input to Bayesian approaches … see ‘single framework’↩︎

  3. See also Reinstein’s notes/work on the vignette.↩︎