
Five users are enough. Or do you need a large sample size to make statistically significant claims?
One of the enduring controversies and sources of confusion in UX research concerns sample size.
Part of the reason for the confusion is that there are different perspectives; some are more vocal than others. This isn’t different from other enduring controversies, such as between frequentist and Bayesian statisticians.
Even among frequentist statisticians, there’s been a long debate about how to interpret the p-value as either a continuous measure of evidence (Fisher) or as a cutoff for significance decisions (Neyman-Pearson). In fact, debates and philosophical disagreements exist in most scientific fields (such as dark matter and dark energy in cosmology or adaptationism vs. structuralism in evolution). And, of course, one of the original philosophical debates was between Plato’s emphasis on ideals and reason versus Aristotle’s focus on empiricism.
We’ve noticed there seem to be two schools of thought when discussing sample sizes in UX. Both are too extreme, but both have kernels (or more) of truth. One school preaches that small sample sizes are always enough; the other, that you always need very large sample sizes.
As is often the case with vocal opinions, the truth lies somewhere in between. First, let’s understand the philosophies of each school.
If this school had a motto, it would be “Five is just fine.”
The first school of thought is that small sample sizes are more than enough. This school believes that a few users suffice for just about any UX research study: uncovering problems, determining preferences, interviewing prospective users, or assessing a new design’s effectiveness.
Also common with this school of thought is the prohibition on quantification. No numbers, no questions about statistical significance.
Five is always enough. Regardless of the method or objective, a few people are all you need. More than five is a waste of resources. Five will provide saturation, and five will prove preferences.
If this school had a motto, it would be “Large is in charge. Five is never enough.”
In this school, five is a rounding error. When there is a five in the sample size, it’s usually followed by two or three zeros.
This school believes that strong statements about generalizations (what users prefer, which design is better) require large sample sizes because using statistics requires large sample sizes.
Five users are meaningless. Like the small-sample school, this school holds that you can’t quantify data from sample sizes of five. And you certainly can’t use statistics on small samples. Using large sample sizes allows you to use statistics and brings respect to the field.
But there is another school, one we invite you to join.
The third school on sample size is less dogmatic and more pragmatic. In this school, the right sample size is based on objectives, not objections. Sometimes five is fine, and sometimes it’s far too small. The pragmatic school of thought is not based on picking and choosing what sounds good. It’s based on starting from the goal and defining the approach that works best mathematically in the long term: starting with Plato, you could say, and applying Aristotle.
We recommend and teach that finding the right sample size starts with the research goal, and the optimal size will vary. UX research goals typically fall into one of three approaches.
Finding Problems/Insights
One research goal is to discover problems in an interface or glean insights from interviews or real-world observations. Here, the sample size is based on the expected frequency of the unknown problem/insight (or a range of possible frequencies) and how high a chance you need of uncovering it. Typically, when a serious problem is discovered once (like tripping on a carpet or other safety issue), you fix it and don’t need to see it again. But the likelihood of discovering a critical issue, like any other issue, depends on how frequently it occurs. Problems that impact many people (e.g., more than 30% of the population) can be detected with as few as five users. Less common problems (e.g., those that affect 10% or less of the population) require larger sample sizes (20+) to detect.
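The arithmetic behind these discovery numbers can be sketched with the standard binomial detection model, which assumes each participant independently encounters a problem with probability p. This is a simplified illustration, not a full sample-size procedure:

```python
import math

def detection_chance(p, n):
    """Chance of observing a problem at least once across n users,
    assuming each user independently hits it with probability p."""
    return 1 - (1 - p) ** n

def users_needed(p, goal=0.85):
    """Smallest n giving at least `goal` probability of one detection."""
    return math.ceil(math.log(1 - goal) / math.log(1 - p))

# A problem affecting 31% of users: five users give ~84% detection odds.
print(round(detection_chance(0.31, 5), 2))   # → 0.84
# A problem affecting only 10% of users needs ~19 users for the same odds.
print(users_needed(0.10, goal=0.85))         # → 19
```

The model shows why the "five is fine" claim holds only for common problems: detection probability decays quickly as the problem frequency drops.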
Estimating a Population Parameter
Population parameters include the average completion rate, average rating scale score, average NPS, median time, or association/correlation between metrics. This method uses a confidence interval, and sample size strongly influences the interval’s margin of error. Precision and sample size follow an inverse square relationship. To cut the margin of error in half (e.g., 20% to 10% margin of error), you need to roughly quadruple the sample size. Interestingly, this is the same law that governs why you need four times the transmission power to double the distance a signal needs to travel. The sample size you need for estimating population parameters depends on three things: the variability of the measurement, the desired confidence level, and the desired precision of measurement (margin of error).
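The inverse square relationship between precision and sample size can be illustrated with the margin of error for a proportion. The sketch below uses the simple Wald formula for clarity (at small sample sizes, adjusted-Wald intervals are more accurate):

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p measured
    on a sample of n (simple Wald formula; a rough sketch only)."""
    return z * math.sqrt(p * (1 - p) / n)

# Quadrupling the sample size roughly halves the margin of error:
print(round(margin_of_error(0.5, 25), 3))    # → 0.196 (about ±20%)
print(round(margin_of_error(0.5, 100), 3))   # → 0.098 (about ±10%)
```

Note how n appears under a square root: to halve the margin of error, n must grow by a factor of four, which is the inverse square law described above.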
Making Comparisons
Is one design rated as easier than another? Which design do users prefer? Is the conversion rate higher with the new design than with the older one? Did the NPS change this quarter compared to last quarter? This sample size is driven largely by the same factors that matter when estimating parameters (variability, confidence, and error limit) plus two additional factors: power and experimental design (between- or within-subjects). The conception of error limit is similar but not exactly the same for estimation and comparison. Estimation focuses on controlling the margin of error around a measurement, while comparison focuses on controlling the minimum size of the difference (the critical difference) between two measurements (although this can also be conceptualized as controlling the margin of error around a difference). Also, like estimation, sample sizes for comparisons follow the inverse square law (to cut the critical difference in half requires quadrupling the sample size).
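For comparisons, the same factors plus power drive the sample size. A common normal-approximation sketch for a between-subjects comparison of two conversion rates is below; the z values assume 95% confidence (two-sided) and 80% power, and the formula is an illustrative approximation, not a substitute for a proper power analysis:

```python
import math

def n_per_group(p1, p2, z_alpha=1.96, z_beta=0.8416):
    """Rough per-group sample size to detect a difference between two
    conversion rates p1 and p2 (between-subjects, two-sided test,
    95% confidence, 80% power). Normal-approximation sketch only."""
    p_bar = (p1 + p2) / 2               # pooled proportion
    d = abs(p1 - p2)                    # critical difference to detect
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / d ** 2)

# Detecting a 5-point lift (10% → 15%) needs far more than five users:
print(n_per_group(0.10, 0.15))    # → 687 per group
# Halving the detectable difference roughly quadruples the sample size:
print(n_per_group(0.10, 0.125))   # → 2508 per group
```

The second call isn’t exactly four times the first because the pooled variance term also shifts, but it shows the inverse square law at work: small detectable differences demand very large samples.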
In a previous article, we warned against using the absolute words “always” or “never” in item stems of survey questions. The same applies to thinking about sample size requirements for UX research. We have seen pronouncements in peer-reviewed literature and on social media about why a specific sample size is or is not appropriate for UX research. These pronouncements may hold a germ of truth in narrow contexts, but they do not generalize outside of specific research situations.
Table 1 lists the differences between the two absolute schools and the pragmatic school on three key research practices.
| Research Practice | Small Sample School | Large Sample School | Pragmatic School |
|---|---|---|---|
| Sample size | Small n always good enough | Small n never good enough | Low to high n, depending on objective |
| Quantification | Never report numbers | Always report numbers | Numbers often help, even at small sample sizes |
| Assessing statistical significance on small sample sizes | Don’t do it | Don’t do it | Small samples limit what you can detect and project, but they don’t prohibit the use of statistics |
Table 1: Positions of the two absolute schools and the pragmatic school on three research practices.
Absolute claims about sample sizes and related research practices are easy to follow but are often wrong. When it comes to sample size estimation, we recommend pragmatic practices that are consistent with the math that underlies the relevant research goal (e.g., discovery, estimation, comparison).