
Who cares what happened 15 or 20 years ago? Though technology changes fast, some of the most important questions in UX research are enduring. Preparing for the future means understanding the past.
We’re celebrating our 20th anniversary at MeasuringU (2005–2025). For us, it’s less about popping the champagne and more about reflecting on how the UX industry has changed and how we have helped shape some of that change through measurement.
Some things have changed a lot while others haven’t. We looked back at key moments, reviewing influential publications and events to describe the story of how our company and the industry have evolved.
We’ve divided the MeasuringU timeline into epochs. In each epoch, we briefly describe the industry trends, our company milestones, and the state of the art (including our contributions) for the enduring UX topics of sample size estimation, online UX tools, usability testing, UX data analysis, and UX metrics.
In this article, we cover the foundational years of both MeasuringU and the measurement of UX. In two upcoming articles, we’ll cover the middle and most recent epochs. We (Jeff and Jim) will provide our perspectives in an April 2, 2025 webinar. And while we both see a lot of work ahead, there’s still some time to sip the champagne!
The MeasuringU timeline starts before the company was incorporated in 2005. Our story began in 1998, when Jeff Sauro, the founder of MeasuringU, was trained in statistical process control for manufacturing at General Electric. During this period, GE was the largest company in the world (by market capitalization), with dozens of business units including Healthcare, Appliances, Aircraft, Credit Cards, and even the TV network NBC.
1998–2004
Industry Trends
Dot-Com Boom, then Bust
The largest macroeconomic trend during this period was the dot-com bubble and burst. GE itself was a bit of a poster child for the peak and fall of industrial conglomerates, a transition indicative of the end of the 20th century’s industrial revolution and the beginning of the dominance of big tech.
After major investments in web companies by venture capitalists in the mid-to-late 1990s, the party came to an end in 2000, leading to a major economic recession in 2001. Consequently, there was a strong need to justify usability budgets and headcounts (a need that persists, though it’s nothing like it was in the early 2000s). Many practitioners and researchers benefited from Bias and Mayhew’s 1994 book on cost-justifying usability.
Don Norman is usually credited with coining the term “user experience” at Apple around 1993. It took a while for that term to supplant earlier descriptions of the discipline (e.g., usability engineering, human factors engineering, human-computer interaction), but it continued to gain ground during this time. During this period, social media moved from text-based Listservs and chat rooms to multimedia platforms like Friendster, MySpace, and Facebook.
Company Milestones
The Need for MU
Jeff Sauro worked in the Information Management Leadership Program at GE, focusing on human interactions with industrial products such as IT systems for power generation and medical imaging devices like MRI and CT scanners.
Working at an industrial conglomerate provided early exposure to the need for measurement for better management, and to the weakness in our ability to statistically measure human behavior with industrial products. By the late 1990s, to help with measurement at scale, about two-thirds of Fortune 500 companies had adopted Six Sigma and expanded its concepts to all aspects of corporate work, including UX activities.
A clear transition from the old industrial economy to the newer tech-driven economy was well illustrated by the then newly appointed CEO of Intuit, Steve Bennett. Bennett came from GE and brought with him an enthusiasm for Six Sigma.
The idea behind Six Sigma was that, instead of having a few statisticians try to measure everything in a company, including products and processes that were foreign to them, subject matter experts should know enough statistics to improve products and processes on their own (Figure 1).
Figure 1: Booklet used during a 2003 Intuit presentation on how to use Six Sigma with traditional usability metrics.
There were more questions than answers. As Jeff researched ways to improve and teach usability measurement, he wanted to document his work publicly in the hope of getting help (and helping others), so in 2004, he started measuringusability.com (Figure 2). The first articles emphasized the importance of measurement in usability research and practice (e.g., What’s a Z-Score and Why Use It in Usability Testing?).
Figure 2: The Measuring Usability website in 2004.
Only a few publications were available online; Jeff could access most articles only through university libraries. That’s where he noticed an author who kept coming up—Jim Lewis. Jeff saw that Jim was at IBM, found his number, and called him from a desk phone in 2003 while working at Intuit. He and Jim started talking about Jim’s findings, discovered they had a lot of common interests, and began discussing possible collaboration.
Many of the early articles on the measuringusability.com website were about usability metrics, but one of the earliest gaps was how to determine sample sizes for usability studies.
Sample Size
Finding Problems is Easier than Finding Sample Sizes
The methods for calculating sample size requirements for standard statistics (e.g., confidence intervals, correlations, tests of significance) had been known since the early part of the 20th century, but these methods didn’t apply to a key UX method, the problem-discovery usability study. From 1982 through the 1990s, people researching this specialized topic had settled on using the formula 1 − (1 − p)^n to model expected percentages of problem discovery as a function of the sample size, n, and the expected discovery rate, p. Using an equivalent variant of this formula, in 2000, Jakob Nielsen published “Why You Only Need to Test with 5 Users.”
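To make the model concrete, here’s a minimal Python sketch (our illustration, not code from any of the cited papers) that computes expected discovery from the formula and inverts it to estimate a sample size for a discovery goal:

```python
import math

def discovery_rate(p: float, n: int) -> float:
    """Expected proportion of problems discovered by n participants when each
    problem has probability p of being observed in any single session."""
    return 1 - (1 - p) ** n

def sample_size(p: float, goal: float) -> int:
    """Smallest n for which 1 - (1 - p)^n meets the discovery goal."""
    return math.ceil(math.log(1 - goal) / math.log(1 - p))

print(f"{discovery_rate(0.31, 5):.0%}")  # with p = .31 (Nielsen's average), 5 users -> 84%
print(sample_size(0.31, 0.80))           # participants needed to expect 80% discovery -> 5
```

Note how much the headline claim that five is enough depends on the assumed value of p: with p = .31, five users are expected to uncover about 84% of discoverable problems, but with a lower rate like p = .10, they’d be expected to find only about 41%.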
Following the publication of Nielsen’s article, there was a flurry of papers criticizing the implication of its title that “five is enough” (Perfetti, 2001; Spool & Schroeder, 2001) and, more technically, the idea of using the mean of a set of discovery rates for p without accounting for its variability (Caulton, 2001; Woolrych & Cockton, 2001). Another criticism of simply averaging values of p obtained from small-sample usability studies was raised by Hertzum and Jacobsen (2001) in a paper in which they proved that these estimates are necessarily inflated relative to measurements made with large samples. Participants (including Jakob Nielsen and Jim Lewis) in a panel at the 2002 annual conference of the Usability Professionals Association (UPA) discussed these criticisms and ways to potentially deal with them. We’ll see that the challenges and discussions around sample size weren’t going anywhere.
UX Online Tools
The Digital Lab Moves Online
The prevailing method for conducting UX research such as usability testing was the in-person moderated session with one participant at a time. Digital tools like Morae streamlined data collection by synchronizing video with event logging.
In 2002, McFadden et al. described the state of the art in remote usability evaluation using collaboration software like Microsoft NetMeeting and Lotus Sametime. These products (barely) supported moderated remote evaluation, and products that could support unmoderated remote usability evaluation were just starting to appear toward the end of this period (e.g., WebEffective).
Usability Testing
Usability Is Defined but Its Reliability Is Questioned
The ISO standard for usability definitions and concepts (9241-11) was published in 1998. It defined usability as the combination of effectiveness, efficiency, and satisfaction in a specified context, codifying recommended practice for summative usability tests. During this period, however, research and practice in formative usability evaluation faced a replicability crisis. In 1998, Gray and Salzman published “Damaged Merchandise? A Review of Experiments That Compare Usability Methods,” which was deeply critical of the reliability of these types of experiments. That was also the year in which Rolf Molich et al. published the first of the comparative usability evaluation (CUE) studies, reporting wide variability in usability test methodologies and little overlap in discovered problems, and in which Jacobsen, Hertzum, and John reported the evaluator effect in usability testing (low overlap of problem detection and problem severity judgments across evaluators). Jeff addressed some of these criticisms in a 2004 article for ACM Interactions (Premium Usability: Getting the Discount without Paying the Price).
Data Analysis
Small Sample Sizes Push the Limits of Traditional Statistical Tests
Much of the statistics instruction in US universities focuses on preparing students to publish their research in scientific journals. Unfortunately, a number of statistical practices that are appropriate for publication are not appropriate for UX research. During this time, applied statisticians working in UX-related fields began publishing articles explaining how the different goals of industrial research affect good practice in statistical analysis, notably the 1998 paper “Commonsense Statistics” by the well-known engineering psychologist Christopher Wickens. Except for the t-test, most well-known statistical methods required larger sample sizes than were typically available to UX researchers and practitioners.
UX Metrics
Questionnaires Were Standard but More Was Needed
By this time, the use of classical test theory (CTT) to create standardized UX metrics was fairly well established (e.g., SUMI, PSSUQ, QUIS, SUS). The first standardized UX questionnaire for evaluating websites was published by Jurek Kirakowski in 1998 (WAMMI). Some standardized questionnaires (e.g., Marc Hassenzahl’s AttrakDiff) increased the number of items addressing the emotional consequences of interaction (i.e., hedonic quality). In 2003, Fred Reichheld introduced the Net Promoter Score.
The ISO standard for usability definitions and concepts (9241-11) defined usability as the combination of effectiveness, efficiency, and satisfaction in a specified context but didn’t precisely define metrics. That fell to the 2001 ANSI Common Industry Format for Usability Test Reports, which defined metrics for effectiveness and efficiency (based on completion rates and times) and listed numerous existing questionnaires for the measurement of “satisfaction,” many of which we now think of as measures of perceived usability (e.g., ASQ, CUSI, PSSUQ, QUIS, SUMI, and SUS).
2005–2008
Industry Trends
From Boom to the Great Recession
The macroeconomy saw growth, low interest rates, and a real estate boom from 2005 to 2007, only to suffer a housing crash and financial market instability that, by late 2008, had become the global financial crisis now known as the Great Recession.
The field of UX was shaped by the increased interactivity of Web 2.0 and by the growth of mobile UX fueled by the release of the iPhone in 2007. MySpace was the most visited website in June 2006 but was overtaken by Facebook in 2008. Tom Tullis and Bill Albert published the first edition of Measuring the User Experience (2008).
Company Milestones
Measuring Usability LLC Is Incorporated
Jeff received his Master’s in Learning, Design, and Technology from Stanford, and it was fitting that his commencement speaker that year was Steve Jobs (Figure 3). One theme of that now famous (and Jobs’s only) commencement speech was not necessarily to “do what you love” but to love what you do. That same year, Measuring Usability LLC was incorporated, with an initial focus on training in applied statistics and UX measurement.
Figure 3: The 2005 Stanford graduation ceremony with Jeff in attendance (yellow star). Image credit: Stanford University and Errol Sandler.
There was resistance in the industry to quantification, with some seeing it as a threat to qualitative methods and a blurring of the distinctions between research roles. But there clearly were gaps in methods and metrics, so Jeff continued his collaborations with Jim Lewis (IBM) and Erika Kindlund (Intuit) on measurement research and peer-reviewed publications.
Sample Size
Getting the Right Message Out about Sample Size
A few years had passed since the 2000 article on sample size, but in talking with colleagues at the time, it was clear that because of the conflicting messages (5 is enough vs. 5 is nowhere near enough), most people simply tried to avoid the discussion altogether.
In 2006, Jeff was the guest editor of a special Waits & Measures issue of ACM Interactions magazine as part of his outreach to the UX community about the fundamentals of statistical analysis in the context of UX research and practice. One of the articles in that issue, Jim Lewis’s Determining Usability Test Sample Size, demonstrated the usefulness of the binomial probability formula for modeling sample sizes for formative usability testing. 2006 also saw the publication of Jim Lewis’s chapter on Usability Testing in the third edition of the Handbook of Human Factors and Ergonomics, which included detailed coverage of sample size estimation for summative and formative usability studies.
UX Online Tools
Unmoderated Research Becomes More Available
RelevantView and UserZoom entered the market of remote unmoderated UX research platforms as low-cost alternatives to WebEffective.
Usability Testing
Metrics in Formative Tests
Jeff and Jim both contributed to the UPA 2005 workshop on reporting formative test results, led by the NIST Industry Usability Reporting project. The workshop’s goal was to understand the different approaches UX researchers and practitioners were taking to report the results of formative usability studies. A key finding from that workshop was a wide division between practitioners who used small-sample quantitative analysis and those who felt that quantification was inconsistent with the goals of qualitative research—a division that persists to the current day.
Data Analysis
2+4 Equals a Better Result
Jeff and Jim had their first collaborative publication in the proceedings of the Human Factors and Ergonomics Society in 2005, in which they compared different methods for estimating completion rates from small samples. They identified the adjusted-Wald method as the best for UX research due to its accuracy and the standard Wald—the approach most often taught in introductory statistics classes—as the worst (Figure 4). The adjustment is deceptively simple: add approximately 2 to the numerator and 4 to the denominator before computing a 95% confidence interval. They included the adjusted-Wald in the mix of methods because Jeff had found a paper in the medical literature that used it for patient outcome studies, which often have small samples. Jeff and Jim followed up on this in 2006 with “When 100% Really Isn’t 100%,” evaluating different methods for adjusting the point estimate of a percentage to increase its accuracy.
Figure 4: The abstract from Jeff and Jim’s first collaborative publication.
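To show how simple the adjustment is, here’s a minimal Python sketch (our illustration with made-up data, not code from the paper):

```python
import math

def adjusted_wald_ci(successes: int, trials: int, z: float = 1.96):
    """Adjusted-Wald (Agresti-Coull) confidence interval for a binomial
    proportion such as a task completion rate (z = 1.96 for 95%)."""
    # The adjustment: add z^2/2 (about 2) successes and z^2 (about 4) trials.
    n_adj = trials + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# Hypothetical small-sample result: 9 of 10 participants completed the task.
low, high = adjusted_wald_ci(9, 10)
print(f"{low:.0%} to {high:.0%}")  # -> 57% to 100%
```

With 9 of 10 completions, the plausible range for the population completion rate is wide (roughly 57% to 100%), which is exactly the kind of honest reporting of uncertainty that small-sample UX research needs.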
Jeff published “The User Is in the Numbers” in the 2006 Waits & Measures issue of ACM Interactions to explain how to use hypothesis testing (p-values) along with appropriate confidence intervals (such as the adjusted-Wald) to support practical decision-making in UX research. In the UPA Voice, he reached another audience with “Why a Completion Rate Is Better with a Confidence Interval.”
UX Metrics
Summarizing Metrics
From 2005 to 2006, Jeff (in collaboration with Erika Kindlund) conducted research and published papers on how to compute a single usability metric (SUM) from component UX metrics that were measured on different scales. They explained how to use a Six Sigma method to standardize the different scores (Making Sense of Usability Metrics: Usability and Six Sigma; How Long Should a Task Take?), how to combine them (A Method to Standardize Usability Metrics into a Single Score), and how to use SUM to compare competitive products (Using a Single Usability Metric (SUM) to Compare the Usability of Competing Products). This was an important step toward a unified measure of usability.
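The core idea is easy to sketch. Here’s a minimal Python illustration with hypothetical numbers and assumed specification limits (the published SUM method is more careful, also handling errors and deriving specifications empirically):

```python
import math

def z_to_pct(z: float) -> float:
    """Convert a z-score to the area under the standard normal curve below it,
    the Six Sigma-style translation of performance into a percentage."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical task data (all values made up for illustration).
completion = 9 / 10                         # 9 of 10 users completed the task
satisfaction = z_to_pct((5.6 - 4.0) / 1.2)  # mean 5.6 vs. spec of 4 on a 7-point scale, sd 1.2
time_score = z_to_pct((100 - 65) / 30)      # mean 65 s vs. 100 s spec limit, sd 30 s

# Once each metric is on the same 0-1 scale, SUM is (in essence) their average.
sum_score = (completion + satisfaction + time_score) / 3
print(f"SUM = {sum_score:.0%}")  # -> SUM = 90%
```

Standardizing first is what makes combining legitimate: a completion rate, a seven-point rating, and a task time can’t be averaged directly, but their standardized equivalents can.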
Table 1 summarizes the key topics for the MeasuringU timeline from 1998 to 2008.
| Topics | 1998–2004 | 2005–2008 |
|---|---|---|
| Industry Trends | Dot-com crash | iPhone, social media |
| Company Milestones | Identify need for MU | Measuring Usability LLC |
| Sample Size | “5 is enough” controversy | Some resolution of “5 is enough” |
| UX Online Tools | Mostly meeting software plus WebEffective | RelevantView, UserZoom |
| Usability Testing | Summative defined, formative in crisis | NIST project, CUE studies |
| Data Analysis | Need for common-sense UX statistics | Adjusted-Wald binomial confidence intervals |
| UX Metrics | ISO 9241-11, ANSI CIF, Six Sigma, NPS | SUM |
Table 1: Summary timeline of key topics.
We’re planning to publish two more articles, one covering events from 2009 through 2015 (e.g., recession to recovery, first employee to renaming as MeasuringU, more books, early days of MUiQ®, construct of usability, development of SUPR-Q) and another from 2016 through 2025 (e.g., economic expansion, pandemic and recovery, more books, sample size tables, maturation of MUiQ, AI, PURE, NPS, SUPR-Qm®, UX-Lite®).