Methodology: Representativeness; weighting and sensitivity testing

This chapter integrates with the ‘Demographics: considering representativeness’ chapter. (The present version could be released as a ‘frozen’ EA Forum post if necessary, but it will continue to be updated in the hosted Bookdown.)

Getting surveys/polls right is difficult, as recent US electoral outcomes and polling exercises suggest. It is hard to produce statistics that we can be confident are representative of a population-of-interest, particularly when:

  1. there are high non-response/non-participation rates, and these rates likely vary by groups and subgroups,

  2. we are considering a hidden/rare and fluctuating population, with ‘coverage’ limitations*

* We may not have complete ‘coverage’ of relevant groups, e.g., some important parts of the EA population may not be reading the outlets that advertise the survey.

  3. there are few or no benchmarks to target and train on (such as actual voting records), nor ‘complete enumerations’ of the population of interest.

1.1 The challenge for the EA Survey (EAS)

Considering representativeness of a survey with potentially substantial under-coverage, low participation rates, and a rare population

In measuring the demographics, psychometrics, economic status, attitudes, behaviors, preferences, etc., of “the global population of those who identify with Effective Altruism and/or are influenced by EA ideas” we face the above trifecta.

Selection issues are endemic to surveys. If the ‘propensity to complete a survey’ is related to any individual characteristic or response (and we don’t know the exact nature of this relationship), then the measured outcomes will be biased measures of the population outcomes. It is difficult or impossible to correct for such biases without a benchmark (such as an election result, a legal national census, or a survey that offers an incentive so strong that it guarantees a near-100% response rate).


We may not be able to ensure that we have ‘reasonable confidence that we have a reasonably representative sample.’ Without a benchmark, how would we know?

If certain groups are not being reached at all (e.g., those not on the internet), there is very little we can do. However, suppose the survey merely under-represents groups with certain (observed) characteristics; e.g., suppose ‘mods’ are half as likely to notice or respond to the survey as ‘rockers.’ To recover unbiased estimates, we would simply need to weight each group by the inverse of its response probability, i.e., count each rocker as half as important as each mod in constructing summary statistics.
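To illustrate, here is a minimal sketch of such inverse-probability weighting in R. The groups, response propensities, and outcomes are entirely made up; in practice the propensities are exactly what we do not know.

# Hypothetical respondents: 'mods' respond at half the rate of 'rockers'
group   <- c(rep("mod", 50), rep("rocker", 100))
outcome <- c(rnorm(50, mean = 1), rnorm(100, mean = 0))
# Assumed response propensities (these would need to be known or estimated)
propensity <- c(mod = 0.25, rocker = 0.50)
# Inverse-probability weights: each mod counts twice as much as each rocker
w <- 1 / propensity[group]
mean(outcome)              # unweighted mean, tilted towards the over-represented rockers
weighted.mean(outcome, w)  # reweighted estimate of the population mean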

But for the EAS (and many other situations), we simply don’t know the extent to which each group is under-responding. The best we may be able to do is to consider the sensitivity of our results to the extent of under/oversampling along observable lines. (For 2020, this is the approach we take in considering demographics and key outcomes.)

1.1.1 Considering academic/formal literature

I discuss this issue in a very general methodological sense here (a work-in-progress), in my broad set of methodological notes, and offer some (rough) discussion of relevant academic work. I present some key points and references below.

I did a fairly shallow dive into the methodological literature to put our problem in the context of survey sampling methodology, and to consider approaches to similar problems.

‘Probability sampling’ has been the standard approach in survey sampling since random-digit dialing became possible.*
*This paragraph draws from the Wikipedia entry on ‘survey sampling’, as well as cited and connected articles.

Probability sampling identifies a population of interest and a sample frame meant to capture this population. Rather than appealing to this entire population/frame, probability sampling randomly samples a ‘probability share’ (e.g., 1/1000) from this frame (possibly using stratification or clustering). As only a smaller number of people are selected, you can spend more time and money/incentives trying to make sure they respond, and track and adjust for rates of response. Probability sampling also allows ‘stratification’ and the oversampling of harder-to-reach groups: one can divide up (‘stratify’) the frame by observable groups and randomly draw (sample) within each stratum with a chosen probability. If we have an informative estimate of the TRUE shares in each stratum, we can sample and re-weight so that the heterogeneous parameter of interest can be said to represent the average value for the true population of interest.
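As a toy illustration of stratified sampling and design weighting (the frame, strata sizes, and sampling fractions below are invented for the example):

set.seed(42)
# A hypothetical sampling frame with a common stratum A and a rare stratum B
stratum <- c(rep("A", 9000), rep("B", 1000))
outcome <- c(rnorm(9000, mean = 0), rnorm(1000, mean = 2))
# Oversample the rare stratum: 1% of A but 10% of B
sampled <- c(sample(which(stratum == "A"), 90), sample(which(stratum == "B"), 100))
# Design weights = inverse sampling probability (N_h / n_h within each stratum)
w <- ifelse(stratum[sampled] == "A", 9000 / 90, 1000 / 100)
weighted.mean(outcome[sampled], w)  # design-weighted estimate
mean(outcome)                       # benchmark: the true frame-level mean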

In contrast, the EA Survey is a ‘non-probability sample’ (Baker et al. 2013a). We have collected survey responses from self-selected ‘convenience’ samples (‘internet surveys’) across several years. Our current approach may be described as a combination of ‘convenience sampling’ (‘river sampling’ and ‘opt-in’) and ‘snowball sampling.’*

I have heard claims that ‘internet surveys,’ if done right, with proper adjustments, can be as reliable as or more reliable than traditional polling. However, these adjustments depend on external measures and ‘gold standards,’ such as actual electoral outcomes and large-scale repeated census enumerations.


In our context there seems to be little potential to reweight or ‘post-stratify’ to recover results that represent the EA population as a whole.

Consider the Wikipedia entry on ‘convenience sampling’:

Another example would be a gaming company that wants to know how one of their games is doing in the market one day after its release. Its analyst may choose to create an online survey on Facebook to rate that game.

Bias The results of the convenience sampling cannot be generalized to the target population because of the potential bias of the sampling technique due to under-representation of subgroups in the sample in comparison to the population of interest. The bias of the sample cannot be measured. [emphasis added] Therefore, inferences based on the convenience sampling should be made only about the sample itself. (Wikipedia, on ‘Convenience sampling,’ cites Borenstein et al, 2017)

This entry is deeply pessimistic… for our case.

We might consider alternate or additional sampling approaches in future EA Surveys. These approaches will not, by themselves, do anything to lessen the problem of differential non-participation. However, it is conceivable that we might improve our representativeness through some combination of …

  1. Surveying the general (non-EA) population as part of larger representative surveys to get a sense of the overall composition of EAs (e.g., the gender ratio). However, differential non-response to these larger surveys would again throw this into doubt. Standard corrections may not be easy: relative non-response among EAs (e.g., male versus female EAs) may differ from the relative non-response to such surveys in other populations.

  2. Tracking, and attempting to adjust for, particular rates of non-participation among groups with known compositions, and extending this, by inference, to groups with unknown composition.^*

    E.g., suppose we knew the true gender composition of EA Global participants was 80/20 male/female, but we see a 90/10 split among EAG participants in the EAS. We might then assign a double weight to female EAG participants in the EAS, and perhaps also assign a double weight to females in other groups where we expect a similar pattern (see the sketch after this list).

  3. Additionally, taking probability samples from within known groups we expect to be ‘broadly representative of EA’ (or at least of particular interest) and offering much stronger incentives to these individuals. If this led to very high participation rates among these probability samples, it would bring us closer to a gold-standard measure, at least for these groups of interest.
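To make the weighting arithmetic in the EAG example above concrete, a minimal sketch (the 80/20 ‘true’ split and 90/10 observed split are the hypothetical numbers from that example, and the variable names in the final comment are placeholders rather than actual EAS columns):

# Hypothetical benchmark: true EAG gender split vs. the split observed in the EAS
true_share <- c(male = 0.80, female = 0.20)
obs_share  <- c(male = 0.90, female = 0.10)
# Weight for each gender = true share / observed share
w_gender <- true_share / obs_share
w_gender  # male ~0.89, female 2: female EAG respondents get roughly double weight
# These could then be applied to an outcome among (hypothetical) EAG respondents:
# weighted.mean(outcome_eag, w = w_gender[gender_eag])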


In our posts for the 2020 survey, we aim to focus (as we have, to some extent in the past) on ‘sensitivity testing.’

We might also consider “respondent-driven sampling” in combination with a careful measurement of the network structure of the EA population, and the sharing of the EAS in this network.

As noted, the EAS considers a hidden/rare and fluctuating population, with ‘coverage’ limitations. Salganik and Heckathorn (2004a) suggest using “respondent-driven sampling” (or ‘snowball sampling’ or ‘chain referral’) in such contexts, and making adjustments to recover representativeness (which will work only under particular conditions). The process involves

a small number of seeds who are the first people to participate in the study. These seeds then recruit others to participate in the study. This process of existing sample members recruiting future sample members continues until the desired sample size is reached.

They note that

This research is fairly well summarized by Berg (1988) when he writes, “as a rule, a snowball sample will be strongly biased toward inclusion of those who have many interrelationships with or are coupled to, a large number of individuals.” In the absence of knowledge of individual inclusion probabilities in different waves of the snowball sample, unbiased estimation is not possible.

This motivates the authors’ approach, involving measuring the likelihood that an individual is reached as a function of their network, and then downweighting those individuals that are more likely to be sampled.

They claim that through the process they propose, “it is possible to make unbiased estimates about hidden populations from these types of samples … asymptotically unbiased no matter how the seeds are selected.”
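As a heavily simplified illustration of the downweighting idea, consider weighting each respondent by the inverse of their reported network degree (in the spirit of later inverse-degree, ‘RDS-II’-style, estimators; this is not the full Salganik-Heckathorn procedure, and the data below are invented):

# Hypothetical chain-referral sample: each respondent reports how many EAs they know
degree  <- c(50, 40, 30, 10, 5, 5, 3)   # well-connected people are over-sampled
outcome <- c(1, 1, 1, 0, 0, 0, 0)
# Downweight respondents in proportion to how likely they were to be reached
w <- 1 / degree
mean(outcome)              # naive estimate, dominated by the well-connected
weighted.mean(outcome, w)  # degree-adjusted estimate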

While this approach is compelling, and may hold some potential for future EAS work, it is not feasible with the data we have now. Furthermore, it is not clear whether it would be an improvement. Relying on chain referral may increase the extent to which EAS participants are skewed towards certain groups. Even if we adopt the adjustments advocated in Salganik and Heckathorn (2004a), this may add more ‘bias’ on net.


Given the data we have, we thus focus on ‘sensitivity-testing,’ as discussed below.*

* We also sketch and discuss some alternative analytical approaches, perhaps for future years, below.

1.2 Sensitivity-checks

In our posts for the 2020 EA survey, we will focus (as we have, to some extent in the past) on ‘sensitivity testing.’

For key outcomes of interest, we will consider how our reported estimates vary …

  1. by ‘referrer’ (the link that brought the respondent to the EAS), varying the weights assigned to referrers with different characteristics (‘large pool’ vs ‘small pool,’ level of EA-alignment, etc.)

  2. by level of self-reported engagement,

(We will explore this further in our future ‘Engagement’ posts.)


  3. and by time-to-respond and ‘agreement to respond to future surveys.’


Why these groupings?

  1. Some referrers may be promoting the EAS more than others, and EAs who identify with certain referrers may visit them more often than EAs in other milieus.

  2. Those less engaged are presumably less interested in filling out the survey, and thus ‘under-represented.’ While we might imagine that the views of less-engaged EAs are less relevant to consider, this may not always be the case, and relevance might not track one-to-one with their lower response rates.

  3. Reasonably, ‘being less eager to do something’ may lead to procrastination, or to requiring more reminders before doing it. If so, those who took the longest to respond to the survey (at least relative to when they first learned about it) may better reflect, within each group, those ‘less likely to take the survey’ (or to respond to follow-up surveys). Arguably, this group is under-represented.

In future…*

*In future we also aim to consider this by demographics, and by ‘clusters/vectors of demographics’ that we particularly anticipate having differential response rates, in light of past survey research. Some demographic groups (particularly considering career and family status) may face a greater time-cost, making them less likely to complete the survey and leading to under-representation.

1.3 Modeling and bounding the biases

In the sensitivity checks below we report values of key demographics and other important outcomes for each of the groupings mentioned above. Taking the extreme values (minima and maxima) of the outcomes across these groupings could be seen to embody fairly extreme and ad-hoc assumptions.

For example, we report a series of key outcomes for each referrer (and for some groupings of these). We also do this for groupings by level of self-reported engagement.*

# Overall mean of reported 2019 donations, ignoring missing values
mean_don <- mean(eas_20$donation_2019_c, na.rm=TRUE)

# Numbers of respondents arriving via the 'shared link' and the email opt-in from the previous EAS
n_shared_link <- sum(eas_20$referrer=="Shared link")
n_optin <- sum(eas_20$referrer=="Email; opt-in from prev. EAS")

# Share identifying as male within each of these referrer groups,
# and the binomial standard deviation sqrt(p*(1-p)) for each share
male_rate_shared_link <-  mean(eas_20$d_male[eas_20$referrer=="Shared link"], na.rm=TRUE)
sd_male_rate_shared_link <- sqrt(male_rate_shared_link*(1-male_rate_shared_link))
male_rate_optin <-  mean(eas_20$d_male[eas_20$referrer=="Email; opt-in from prev. EAS"], na.rm=TRUE)
sd_male_rate_optin <- sqrt(male_rate_optin*(1-male_rate_optin))


* Consider the largest and smallest values, across all referrers, of each measure, e.g., the ‘share identifying as Male.’ This gives a low of roughly 59.4% male (from the ‘shared link’) and a high of roughly 82.6% male (the opt-in from last year’s survey). Constructing 80% statistical confidence intervals around each of these measures, we have a lower bound for the ‘shared link’ of roughly 55.0% and an upper bound for the ‘opt-in link’ of roughly 85.8%.
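For transparency, a sketch of how such intervals can be computed from the quantities in the chunk above, assuming a normal approximation to the binomial (the exact construction used for the reported figures may differ slightly):

# 80% normal-approximation confidence bounds for the male share, by referrer
z80 <- qnorm(0.90)  # a two-sided 80% interval uses the 90th percentile of the normal
ci_lower_shared_link <- male_rate_shared_link - z80 * sd_male_rate_shared_link / sqrt(n_shared_link)
ci_upper_optin <- male_rate_optin + z80 * sd_male_rate_optin / sqrt(n_optin)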

These bounds are clearly overly wide, i.e., overly conservative, at least in considering the impact of over/under-sampling from each referrer. Clearly, neither the ‘opt-in link’ (the most male group) nor the ‘shared link’ (the most female group) is so under-represented that it should constitute the virtual entirety of the EA population. We know that, e.g., 465 people completed the survey from the 80K Hours link, and 235 reported a level of engagement of 3 or more. This is not a drop in the bucket: estimates of the total number of “active EAs” range from about 4700 to about 13000, or perhaps a few hundred thousand if we take the widest view (cf. ‘over 150,000 subscribers’ to the 80K Hours newsletter).*

* Also note, from the same linked post, some benchmarks:

  • 19% of GWWC members are present in the 2018 survey, but we might disagree about whether all GWWC members are EEA in terms of ‘being influenced by EA thought’

  • Informal polls of relatively heavily-engaged audiences record 30-50% response to the EAS (39% in the CEA/80K/FHI Oxford offices, 40% for the EA Forum, 43% of local group members, and 31-50% for specific local groups)

In future we might also use the total known numbers in particular groups as upper benchmarks.


1.4 Other possible approaches (for future consideration)

We are working to address these methodological issues more carefully, and work towards a theoretically grounded approach, both in considering the data we have, and our future survey design and implementation. Below, we present some ideas in this direction.

Re-weighting approaches?

We might consider ‘re-weighting’ (or ‘post-stratification’) to match key demographics of some known group. Suppose we knew the gender and country composition of the members of the Effective Altruism Facebook Group. Under standard assumptions, reweighting the EAS results to match this might reduce the bias of our estimates – at least as a measure of the ‘average responses for those on EA Facebook.’ However:

  1. The ‘EA Facebook group’ may not be representative of the EA movement as a whole, and

  2. The re-weighting need not reduce the bias from differential response rates, even taking the EA FB group as the sampling frame of interest. For a particular outcome and a particular subgroup weighting, unbiasedness would only be guaranteed if the ‘nonresponders within each subgroup’ had the same distribution of this outcome as the responders (and hence the overall population) within that subgroup, as the sketch below illustrates.
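A small simulation sketch of this second caveat (all numbers invented): even after reweighting to the correct gender shares, the estimate stays biased if, within each gender, those who respond differ on the outcome from those who do not.

set.seed(1)
# A hypothetical 'EA Facebook' population: 70% male, 30% female
gender   <- sample(c("m", "f"), 10000, replace = TRUE, prob = c(0.7, 0.3))
donation <- rlnorm(10000, meanlog = 6)
# Nonignorable nonresponse: within each gender larger donors respond more,
# and women respond at half the rate of men
p_respond <- ifelse(donation > median(donation), 0.6, 0.2) * ifelse(gender == "f", 0.5, 1)
responded <- runif(10000) < p_respond
# Post-stratification weights that exactly match the known gender shares
target <- c(m = 0.7, f = 0.3)
obs    <- prop.table(table(gender[responded]))
w      <- as.numeric(target[gender[responded]] / obs[gender[responded]])
mean(donation)                         # true population mean
mean(donation[responded])              # unweighted respondent mean (biased upward)
weighted.mean(donation[responded], w)  # still biased: weighting only fixes the gender shares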

Less extreme (but ad-hoc) sensitivity checking

We might also consider sensitivity tests that are ‘less extreme’ than the bounds that might be implied by the tests below. E.g., we could consider….

Some ideas

As a first pass at these sensitivity checks, we might report bounds under ‘somewhat less extreme’ possibilities, under ad-hoc grouping assumptions:

  • Each individual referrer’s response rate (among true EAs) is Normally distributed, with mean 40% and sd 20%.
  • The ‘response rate draws’ may not be independent; particular groupings of referrers may each have correlated rates.

Or drop (or downweight by 50%) half of the referrers, considering all combinations, and report the most extreme result?

Or drop each single referrer, one by one, and find the most extreme results.
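A rough sketch of this leave-one-referrer-out idea, using the eas_20 data frame and the referrer and d_male columns from the chunk above (the male share is just an example outcome):

# Recompute the male share after dropping each referrer in turn
referrers <- unique(as.character(eas_20$referrer))
loo_male_share <- sapply(referrers, function(r) {
  mean(eas_20$d_male[as.character(eas_20$referrer) != r], na.rm = TRUE)
})
range(loo_male_share)  # the most extreme leave-one-referrer-out results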

Try to classify these groups (or a ‘vector’ of groups) and downweight along these lines:

  • By ‘time to respond’ (how to present/test this?) or some ‘likelihood of response?’

  • For those willing/unwilling to share an email address for future surveys

  • Another formal test?


Consider sensitivity to downweighting (a rough sketch follows this list):

  1. by ‘Referrer’ (the link that brought the respondent to the EAS), varying the weights assigned to referrers with different characteristics (‘large pool’ vs ‘small pool,’ level of EA-alignment, etc.)
  • Each referrer/grouping with over 100 responses

  • Ad-hoc groupings:

    • 80K, EA Forum, Email opt-in+shared link+Newsletter, Local Groups, Other
    • 80K, EA Organisations/link (EA Forum, EA Newsletter, opt-in from prev. EAS), Social media/rationalist/blogs (Reddit, LW, SSC, FB, memes), Personal (Groups, Shared link)
  2. by level of self-reported engagement
  • 0-2, 3, 4, 5
  • 0-2, 3, 4-5
  3. and by time-to-respond and ‘agreement to respond to future surveys.’
  • Time to respond relative to survey start date: four quartiles, with assumed response rates (RR) of 80%, 60%, 40%, and 20%

  • Time to respond relative to others from the same referrer: quartiles and RRs as above

  • Agreed to respond to future surveys (no/EAS only/EAS and followups) (20% vs 50% vs 80% RR assumed)
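A rough sketch of the downweighting idea for one of these groupings, here the time-to-respond quartiles (the assumed response rates are purely illustrative, and days_to_respond is a placeholder name rather than an actual EAS column):

# Downweight by assumed response rates across time-to-respond quartiles
rr_assumed <- c(0.8, 0.6, 0.4, 0.2)  # assumed response rates, fastest to slowest quartile
n_resp   <- sum(!is.na(eas_20$days_to_respond))
quartile <- ceiling(4 * rank(eas_20$days_to_respond, na.last = "keep") / n_resp)
w_rr <- 1 / rr_assumed[quartile]     # weight = inverse of the assumed response rate
ok <- !is.na(w_rr) & !is.na(eas_20$d_male)
mean(eas_20$d_male[ok])                     # unweighted male share
weighted.mean(eas_20$d_male[ok], w_rr[ok])  # downweighted version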

Bayesian modeling of outcomes in light of (distributions) of response rates


Perhaps the best approach would be a model that
  • makes our beliefs over the possible response rates for different groups explicit,
  • adjusts these in light of new information, and
  • summarizes our uncertainty over possible outcome values in a ‘posterior distribution.’

A very rough sketch of such a ‘Bayesian model’ is folded below.

Model components and assumptions:

Probability distribution over the total number of (E)EAs (does this matter?)

Suppose all (3+ engaged) respondents from each referrer are in fact EA-affiliated.

Critical component: A probability distribution over the ‘true response rate’ for each (referrer) group

Benchmarks for this (as reported in an earlier post):

  • 19% of GWWC members present in 2018 survey (but are all GWWC members EEA?)

  • 30-50% response rates in informal polls of relatively heavily-engaged audiences*

* From an informal poll, 39% in the CEA/80K/FHI Oxford office. Comparing response numbers in the EAS with the EA Groups survey implies 40% response rates for EA Forum members, 43% for local group members, and 31-50% for a set of specific local groups.

  • Total ‘members’ of specific referrers.

At first pass, we might converge on a conservative guesstimate of:

  • a normal distribution over EEA response rates for each referrer
  • a 40 percent mean response rate (by referrer, and thus overall)
  • A 90% CI for the response rate (of EEAs) for each referrer of 15 - 80 percent (this is totally ad-hoc)
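A minimal Monte Carlo sketch of this guesstimate, drawing response rates independently across referrers (a truncated version of the Normal(40%, 20%) assumption above; the truncation bounds are ad hoc, and this summarizes uncertainty under the prior rather than a full posterior):

set.seed(2021)
referrer_chr <- as.character(eas_20$referrer)
referrers <- unique(referrer_chr)
sim_male_share <- replicate(2000, {
  # Draw a response rate per referrer from the Normal(0.40, 0.20) guesstimate, truncated
  rr <- pmin(pmax(rnorm(length(referrers), mean = 0.40, sd = 0.20), 0.05), 0.95)
  names(rr) <- referrers
  w <- 1 / rr[referrer_chr]  # inverse assumed response rate for each respondent
  ok <- !is.na(w) & !is.na(eas_20$d_male)
  weighted.mean(eas_20$d_male[ok], w[ok])  # male share under this draw of response rates
})
quantile(sim_male_share, c(0.05, 0.50, 0.95))  # uncertainty implied by response-rate uncertainty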

However, the hardest component to consider is the correlation between the response rates of sets of referrers. We might expect, e.g., the response rates of SSC and LW to be correlated.

We can only partially deal with this by choosing larger groups and defining priors over grouped sets of referrers. Thus we may need to define a prior over the variance/covariance matrix (e.g., if we assume multivariate normality).

We do have some benchmark datasets that might help us with this … the SSC survey, EAGx demographics … we might calculate weights to minimize the error between our survey and these others …


1.5 References (work considered above and linked)

(To be integrated above as well)

Salganik and Heckathorn (2004b)

Baker et al. (2013b)

Särndal, Swensson, and Wretman (2003)

Schwarcz et al. (2007)

Wright and Peugh (2012)

References

Baker, Reg, J. Michael Brick, Nancy A. Bates, Mike Battaglia, Mick P. Couper, Jill A. Dever, Krista J. Gile, and Roger Tourangeau. 2013b. “Summary Report of the AAPOR Task Force on Non-Probability Sampling.” Journal of Survey Statistics and Methodology 1 (2): 90–143. https://doi.org/ggmdn5.
———. 2013a. “Summary Report of the AAPOR Task Force on Non-Probability Sampling.” Journal of Survey Statistics and Methodology 1 (2): 90–143. https://doi.org/ggmdn5.
Salganik, Matthew J., and Douglas D. Heckathorn. 2004b. “Sampling and Estimation in Hidden Populations Using Respondent-Driven Sampling.” Sociological Methodology 34 (1): 193–240. https://doi.org/bzv8kr.
———. 2004a. “Sampling and Estimation in Hidden Populations Using Respondent-Driven Sampling.” Sociological Methodology 34 (1): 193–240. https://doi.org/bzv8kr.
Särndal, Carl-Erik, Bengt Swensson, and Jan Wretman. 2003. Model Assisted Survey Sampling. Springer Science & Business Media.
Schwarcz, Sandra, Hilary Spindler, Susan Scheer, Linda Valleroy, and Amy Lansky. 2007. “Assessing Representativeness of Sampling Methods for Reaching Men Who Have Sex with Men: A Direct Comparison of Results Obtained from Convenience and Probability Samples.” AIDS and Behavior 11 (4): 596. https://doi.org/10.1007/s10461-007-9232-9.
Wright, Graham, and Jordon Peugh. 2012. “Surveying Rare Populations Using a Probability-Based Online Panel.”