To weight, or not to weight, that is the question:
Whether ’tis nobler in the mind to suffer
Discrepancies between sample and population proportions,
Or to take arms against the gaps
But by weighting them, to distort the truth…
Weighting plays an important role in how we measure, and even in what we consider, the truth. Its role is not limited to contemporary consumer research: weights and measures were important enough in commerce to merit mention in the deliberately short U.S. Constitution.
In Article 1, Section 8, Clause 5 of the U.S. Constitution, Congress is granted the power to “fix the standard of weights and measures.” This is an important government function performed by the NIST Office of Weights and Measures (OWM) to “ensure that consumers get what they pay for and sellers get fair payment for the goods and services they sell by promoting a uniform and technically sound system of weights and measures. This, in turn, promotes consumer confidence and helps ensure fair competition for U.S. commerce that spans from local business operations to a global scale.”
Weighting can play a similar role in UX research when a sample fails to match a standard with regard to its composition. For example, although it’s not an official government standard, some researchers want the demographics of their samples to be consistent with the U.S. population, using the census as a standard (census-matching sampling). Ideally, the sampling strategy (stratified random sampling or quota sampling) produces an appropriate sample. However, if the composition of a sample deviates from the standard on key variables (as it often does to some degree), researchers can use weights to adjust the means and proportions to better match the U.S. population (or any other reference population).
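To make that adjustment concrete, here is a minimal Python sketch of post-stratification weighting: each group’s weight is the ratio of its reference (e.g., census) proportion to its sample proportion, and the weighted mean recombines the group means in the reference proportions. The group labels, proportions, sample sizes, and scores below are invented for illustration, not taken from any real study.

```python
import numpy as np

# Hypothetical example: a sample that over-represents younger respondents
# relative to a census-based standard. All values are made up.
groups      = np.array(["18-34", "35-54", "55+"])
target_prop = np.array([0.30, 0.35, 0.35])   # reference (census) proportions
sample_n    = np.array([120, 60, 20])        # respondents per group (n = 200)
group_mean  = np.array([72.0, 68.0, 61.0])   # mean score (e.g., SUS) per group

sample_prop = sample_n / sample_n.sum()
weights = target_prop / sample_prop          # post-stratification weight per group

unweighted_mean = np.average(group_mean, weights=sample_prop)
# sample_prop * weights equals target_prop, so the weighted mean is just the
# group means recombined in the reference proportions.
weighted_mean = np.average(group_mean, weights=sample_prop * weights)

print(dict(zip(groups, np.round(weights, 2))))    # {'18-34': 0.5, '35-54': 1.17, '55+': 3.5}
print(f"Unweighted mean: {unweighted_mean:.1f}")  # 69.7
print(f"Weighted mean:   {weighted_mean:.1f}")    # 66.8
```

The same ratio-of-proportions logic applies whether the standard is the census or a company’s internal data about its customers.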
But just because you can use weights, should you?
The fundamental reason to use weights is that cases in a dataset are not equally important, so there is a need to increase the influence of some cases and decrease the influence of others on an overall statistic (e.g., a mean or proportion). For example:
- Unintended disproportionate sampling: Without carefully constructed quotas (and sometimes even with them), samples may deviate from a standard such as the census or a company’s internal data about customer characteristics (e.g., tenure as a customer).
- Deliberate oversampling: When there is a large difference in measurement variability between two groups, one way to get a more precise estimate of the mean of the group with higher variability is to increase its sample size. For example, it’s well known that novice performance is more variable than expert performance. To compensate for the effect of oversampling on overall measures (means or proportions), analysts weight cases from the oversampled group less than other cases.
- Defined differences in importance: Sometimes there are a priori reasons to weight values differently. For example, to calculate an overall grade, a teacher might give less weight to quizzes than to exams and give the final exam the most weight. In UX research, this may mean weighting often-used tasks or products more heavily than less frequently used ones when comparing task metrics or study-level metrics such as SUS scores (see the first sketch after this list).
- Weighted least squares regression: More specifically (and technically), when the variance of the residuals in an ordinary least-squares regression is not constant, cases with low noise variance are weighted more heavily than cases with high noise variance (see the second sketch after this list).
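For the third case (defined differences in importance), here is a minimal sketch of combining study-level scores with a priori importance weights; the product SUS scores and usage shares are invented.

```python
import numpy as np

# Hypothetical example: SUS scores for three products, weighted by each
# product's share of use. Scores and shares are made up for illustration.
product_sus = np.array([80.0, 65.0, 70.0])   # mean SUS per product
usage_share = np.array([0.70, 0.20, 0.10])   # a priori importance weights

simple_mean   = product_sus.mean()           # each product counts equally
weighted_mean = np.average(product_sus, weights=usage_share)

print(f"Simple mean:   {simple_mean:.1f}")    # 71.7
print(f"Weighted mean: {weighted_mean:.1f}")  # 76.0
```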
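And for the fourth case, here is a sketch of weighted least squares on simulated data in which the noise variance grows with x. It assumes the noise variances are known and weights each case by their inverse, so low-noise cases count more; all numbers are invented.

```python
import numpy as np

# Simulated heteroscedastic data: y = 2 + 1.5x plus noise that grows with x.
rng = np.random.default_rng(1)
x = np.linspace(1, 10, 40)
sigma = 0.5 + 0.3 * x                        # noise standard deviation per case
y = 2.0 + 1.5 * x + rng.normal(0, sigma)

X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept
w = 1.0 / sigma**2                           # WLS weights: inverse noise variance

# Ordinary least squares: solve (X'X) b = X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Weighted least squares: solve (X'WX) b = X'Wy, with W = diag(w)
XtW = X.T * w                                # same as X.T @ np.diag(w)
beta_wls = np.linalg.solve(XtW @ X, XtW @ y)

print("OLS intercept, slope:", np.round(beta_ols, 2))
print("WLS intercept, slope:", np.round(beta_wls, 2))
```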
The first two situations are the most relevant for UX research. The greater the difference between the sample and reference proportions, the more important it is to use weights. Unfortunately, there are no clear guidelines on how much difference is enough to justify weighting. As discussed below, whatever you might achieve with weighting comes at a cost, which is why it’s usually preferable not to weight.
Unless there is a clear need for weighting, researchers should avoid it. In particular, avoid weighting when:
- There is no need to match a reference population. In UX research, the focus is often on understanding group differences rather than interpreting overall statistics. The practice of weighting to adjust overall statistics is more associated with social, political, and market research than with UX research, although some projects blend UX and market research.
- The demographic variables that differ between the sample and the reference population don’t affect important outcome variables. For example, the System Usability Scale (SUS) is one of the best-researched standardized UX metrics. Evaluations of the effect of demographic variables on SUS scores have found no significant effect of gender (five studies), age (two studies), or U.S. geography (one study).
- The sample proportions closely match the reference population. When estimating overall statistics is an important research goal and your sample closely matches the reference population, there’s no need to introduce the complication of weighting. This is the benefit of properly balancing the sample during data collection (pre-stratification) rather than adjusting data after the fact (post-stratification).
- It’s more important for estimates to be precise than to match a reference population. Weighting tends to increase the variability of estimates relative to unweighted data, making them less precise (illustrated in the sketch after this list).
- There is no reliable reference population. If there is a need to match against a reference population but there are no good estimates of the proportions in that reference population, or the researcher has to choose one of several reference populations, then weighting is as likely to distort as improve estimates of overall statistics (e.g., the polling paradox).
- Sample sizes are very small for key groups. When a key group’s sample size is small relative to its share of the reference population, weighting it up will substantially increase its influence on the overall statistics. This is problematic because small-sample estimates can be unreliable, so increasing the weight of these cases can amplify any error. Some analysts recommend restricting weights to no more than 2.0 as a hedge against this amplification, but then the weights no longer completely compensate for the differences between the sample and the reference population (also illustrated in the sketch after this list).
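Two of these concerns, the loss of precision and the trade-off involved in capping large weights, can be illustrated with the Kish approximation of effective sample size (the squared sum of the weights divided by the sum of the squared weights). The respondent weights below are hypothetical, chosen so that a small group receives a very large weight.

```python
import numpy as np

# Hypothetical per-respondent weights: a small key group gets a large weight.
w = np.concatenate([np.full(180, 0.9), np.full(5, 6.0)])   # n = 185

def kish_effective_n(weights):
    """Kish approximation: how many equally weighted cases the weighted
    sample is worth when estimating a mean or proportion."""
    return weights.sum() ** 2 / (weights ** 2).sum()

print(f"Nominal n:   {len(w)}")                    # 185
print(f"Effective n: {kish_effective_n(w):.0f}")   # ~113: precision lost to weighting

# A common hedge is to cap (trim) weights, e.g., at 2.0, accepting that the
# capped weights no longer fully match the reference proportions.
w_capped = np.minimum(w, 2.0)
print(f"Effective n, capped: {kish_effective_n(w_capped):.0f}")   # ~178
```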
The key risks of weighting are loss of precision due to increased variability of estimates, uncertainty in selecting the right reference population, and amplification of measurement error when group sample sizes are small. But what if you decide you have a good reason to weight? What’s the best way to do that? We’ll cover how to use weights in upcoming articles.
Due to these risks, the consensus is that weighting is a method of last resort. You should consider weighting only when:
- It is critical for proportions of sample groups to match a reference population.
- Problems with the sampling plan have resulted in a failure to achieve the proper proportions (unintended disproportionate sampling).
Furthermore, weighting to correct disproportionate sampling is not prudent unless:
- There is an appropriate reference population.
- There are actual differences between the group proportions of the reference population and the sample.
- The study measurements are affected by the variables on which the sample and the reference population differ.
- Group sample sizes are large enough to produce stable estimates.
To avoid the risks associated with weighting, researchers should default to analyzing unweighted data. Even if the plan is to present weighted data, good practice is to compare the weighted and unweighted results to see which conclusions, if any, have been significantly affected by weighting.
Good night, sweet data,
And may researchers analyze thee properly to the benefit of all.