
Who cares what happened 15 or 20 years ago? While technology changes fast, some of the most important questions in UX research are more enduring. Preparing for the future means understanding the past.
We’re celebrating our 20th anniversary at MeasuringU (2005–2025). For us, it’s less about popping the champagne and more about reflecting on how the UX industry has changed and how we have helped shape some of that change through measurement.
Some things have changed a lot, while others haven’t. We looked back at key moments, reviewing influential publications and events to describe the story of how our company and the industry have evolved.
We’ve divided the MeasuringU timeline into three epochs. In each epoch, we briefly describe the industry trends, our company milestones, and the state of the art, including our contributions to enduring UX topics of sample size estimation, online UX tools, usability testing, UX data analysis, and UX metrics.
In our first article on the history of MeasuringU, we covered the foundational years from 1998 to 2008. This period included the founding of the company, the launch of the website, and the first of many collaborations and publications between Jeff Sauro and Jim Lewis, including the recommendation to adjust binary confidence intervals for more accurate analysis.
In our second article, we reviewed how the expansion of UX research teams led to the need for well-documented ways of UX measurement. This period culminated in several books, including A Practical Guide to Measuring Usability, Quantifying the User Experience, and Customer Analytics for Dummies.
In this final article, we cover the years from 2016 to the present, a period in which we made major investments in our research platform (MUiQ®), grew our research team, and developed new research in UX methods and metrics.
2016–2019
Industry Trends
Moderate but Steady Economic Growth
The world economy saw moderate but steady growth while the U.S. experienced one of its longest periods of economic expansion. The U.S. unemployment rate dropped to 3.5% in 2019, but wage growth remained slow. Companies began to invest heavily in AI to reduce the need for low-skilled labor.
Major trends in UX research included the increased use of web analytics, automated usability testing platforms, conversational interactions (e.g., chatbots, voice agents), and standardized UX metrics (e.g., SUS, NPS, SUPR-Q).
Company Milestones
The MeasuringU Press Keeps Printing
In 2016, we released the second edition of Quantifying the User Experience with a new chapter on correlation, regression, and analysis of variance. MeasuringU outgrew its original space and built a new office with custom labs, including one-way mirrors (Figure 1). While one-way mirrors were becoming less common as much UX research moved online, they proved invaluable a few years later when testing new physical products, including AR and VR devices.
Figure 1: MeasuringU lab with a one-way mirror.
As more UX teams began to measure their user experience with benchmarking, we got more calls to help with planning and analysis. Across these requests, we saw a lot of the same questions and challenges related to benchmarking. What sample size, what metrics, how many tasks? How do you assess task completion rates when you can’t directly observe the participants? To answer those questions, in 2018 Jeff wrote Benchmarking the User Experience (Figure 2), which was the first of a series of books by the newly created MeasuringU Press.
Figure 2: Cover of Benchmarking the User Experience.
We continued to collect more SUPR-Q® data, periodically updating the percentiles, and we developed a version focused on mobile experiences, the SUPR-Qm® (more on that below). We published three papers on the SUS as well as Jim’s book, Using the PSSUQ and CSUQ in User Experience Research and Practice. We also developed and published the PURE (Practical Usability Rating by Experts) usability inspection method and a more scientific approach to UX maturity measurement.
Dissatisfied with the existing online UX research platforms, we made major investments in MUiQ to extend its capabilities from analysis to online data collection. The MUiQ platform became an integral part of our annual UX measurement bootcamps.
Starting in 2017, our blog articles received over a million views a year, demonstrating our increasing reach into the UX community through this educational channel.
Sample Size
Sample Sizes in Practice
With the release of the second edition of Quantifying the User Experience, we added sample size estimation methods for correlations and simple regression. In Chapter 6 of Benchmarking the User Experience, Jeff provided numerous tables to simplify sample size planning (many of which are also available in MeasuringU blog articles published during this time).
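For readers who want a feel for how these estimates work, here’s a minimal sketch of one case, the sample size needed to estimate a correlation with a given precision. It uses the standard Fisher-z approximation and is our illustration of the general approach, not the exact procedure from the book (the function name and numbers are ours).

```python
import math

def n_for_correlation_ci(r, margin, z=1.96):
    """Rough n so a confidence interval around an expected correlation r
    has about +/- `margin` half-width on the r scale.  Uses the Fisher-z
    transform, whose standard error is 1/sqrt(n - 3), with the
    delta-method slope 1/(1 - r^2) to convert the margin to z units."""
    dz = margin / (1 - r ** 2)            # margin expressed in Fisher-z units
    return math.ceil((z / dz) ** 2 + 3)   # invert SE = 1/sqrt(n - 3)

print(n_for_correlation_ci(0.5, 0.1))  # about 220 to estimate r = .50 within +/- .10
```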
UX Online Tools
MUiQ Gets Its First Licensed Customers
During this period, we conducted many studies that pushed the limits of existing online UX research platforms. In 2016 we enhanced the MUiQ platform, originally just an analytical platform, to enable UX data collection, and in 2017 we introduced the first licensable version. After that, we continued to enhance the MUiQ platform with features like website intercepts and URL-based tasks with screen recording.
Usability Testing
Facilitation and Facilities
Because finding and fixing usability problems is the cornerstone of usability testing, we published blog articles in 2016 that discussed seven ways to uncover usability problems, how to assign severity ratings, and steps for conducting an effective expert review. We focused on facilitation and prototype testing in 2017. Usability testing topics in 2018 included the cost of usability tests, how to build a dedicated lab, and further investigation of the evaluator effect.
Data Analysis
PURE Fun
At CHI 2016, Jeff (in collaboration with Christian Rohrer and other UX researchers from Intel Security) published the first description of PURE (Practical Usability Rating by Experts), a new usability inspection method that included analysis of data across multiple expert inspectors (Figure 3). They validated the method with data from usability tests on three software products. Following up on this research, Jeff wrote additional blog articles in 2017 and 2018 on the PURE method, including an overview, practical tips, and predicting UX metrics.
Figure 3: Example PURE scorecard.
UX Metrics
Innovating and Separating the Data from the Diatribes
In this period, we conducted significant research on the SUPR-Qm, SUS, and NPS.
SUPR-Qm
In 2017, Jeff and Paree Zarolia published “SUPR-Qm: A Questionnaire to Measure the Mobile App Experience.” The paper described the development of the SUPR-Qm, a 16-item instrument that assesses a user’s experience of a mobile application. They used Rasch analysis to assess the psychometric properties of items collected from four independent surveys (N = 1,046) with ratings on 174 unique apps. The final version had very high reliability, high concurrent validity, and good predictive validity (its scores significantly correlated with the number of app reviews in the Google Play Store and Apple’s App Store).
SUS
Using their very large dataset of completed SUS questionnaires, in 2017 and 2018 Jeff and Jim published three papers in the Journal of User Experience: “Revisiting the Factor Structure of the System Usability Scale,” “Can I Leave This One Out? The Effect of Dropping an Item from the SUS,” and “Item Benchmarks for the System Usability Scale.” The first paper presented evidence that an earlier finding of Usability and Learnability subscales for the SUS was an artifact caused by the mixed tone of the SUS items. The second paper demonstrated that removing any one item from the SUS has little effect on its score as long as the nine-item sum is properly interpolated to the 0–100-point scale. The third paper presented benchmarks for the interpretation of individual SUS items. Jeff also published one of our most popular blog articles on five ways to interpret the SUS.
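To make the second paper’s interpolation concrete, here’s a minimal sketch using the standard SUS scoring rules; the function names and the response vector are ours, for illustration.

```python
def sus_score(responses):
    """Standard SUS scoring: ten items rated 1-5; odd-numbered items are
    positive tone (score = response - 1), even-numbered items negative
    tone (score = 5 - response); the 0-40 sum is multiplied by 2.5."""
    adjusted = [(r - 1) if i % 2 == 0 else (5 - r)  # 0-based i: even index = odd item
                for i, r in enumerate(responses)]
    return sum(adjusted) * 2.5

def sus_drop_one(responses, drop_index):
    """SUS with one item dropped: the nine remaining adjusted items sum
    to at most 36, so multiply by 100/36 to interpolate back to 0-100."""
    adjusted = [(r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses) if i != drop_index]
    return sum(adjusted) * 100 / 36

responses = [4, 2, 4, 1, 5, 2, 4, 2, 4, 2]
print(sus_score(responses))        # 80.0
print(sus_drop_one(responses, 1))  # ~80.6, close to the full-scale score
```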
NPS
Between 2016 and 2019, the NPS maintained significant popularity as a key metric for assessing customer loyalty (nearly half the firms surveyed by Forrester in 2019 used the NPS). During this time, a backlash against the NPS emerged in the UX community, with some legitimate complaints about corporate misuse but also some over-the-top polemics claiming the NPS was not only useless but harmful and should be replaced by a generally unspecified “something else” without any proof that the “something else” would be better.
A quick review of the NPS: People rate their likelihood to recommend on a 0–10 point scale and are classified as detractors (0–6), passives (7–8), or promoters (9–10). The NPS is the percentage of promoters minus the percentage of detractors.
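To make the arithmetic concrete, here’s a minimal sketch of that computation (the function name and sample ratings are ours, for illustration):

```python
def nps(ratings):
    """Net Promoter Score from 0-10 likelihood-to-recommend ratings."""
    n = len(ratings)
    promoters = sum(1 for r in ratings if r >= 9)   # 9-10
    detractors = sum(1 for r in ratings if r <= 6)  # 0-6
    # Passives (7-8) count only in the denominator.
    return 100 * (promoters - detractors) / n

# 4 promoters and 3 detractors out of 10 -> NPS of 10
print(nps([10, 9, 9, 9, 8, 7, 7, 6, 5, 2]))
```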
As its popularity grew, the NPS controversies got loud, and the first of many screaming screeds proclaiming the uselessness of the NPS appeared. We were never strong advocates of the NPS, but we generally saw the value in having corporate commitment to measuring likelihood-to-recommend, which is the basis of the NPS. So, it was a bit surprising to read these strong condemnations from prominent UX voices like Jared Spool.
Jeff was working at Intuit in 2003 when the NPS was introduced. Intuit’s founder and former CEO had come from Bain and quickly introduced the NPS at the company. Jeff’s software division at the time made nonprofit and state/local government accounting software. Despite working on products where recommendation behavior seemed unlikely, Jeff’s team got an edict from upper management to start using the NPS. The content of the question didn’t seem right (who would recommend government accounting software to friends?), but, to Jeff’s surprise, it correlated highly with the measures of satisfaction and loyalty his team was using, and they were able to slightly modify the wording so it made sense in their research context (e.g., “If a friend or colleague asked about your experience with government accounting software, how likely would you be to recommend the xyz product?”).
Upon even mild scrutiny, some of the basic reasons provided for why the NPS was harmful were easily dismissed. We realized this was less about data than about diatribes (Figure 4). Some people simply saw (and continue to see) an opportunity to make a name for themselves in social media by signaling that they were in the anti-NPS tribe.
Figure 4: Graphic from MeasuringU’s three webinars on NPS Data, Drivers, and Diatribes.
This led to a lot of MeasuringU research (primary and secondary) into claims about the NPS. For example, people might reference an article as “proof” that the NPS was debunked, but careful reading of the article showed the data did not support that conclusion. Several articles showed that the NPS wasn’t necessarily superior, but it performed about as well as other similar metrics (e.g., customer satisfaction). It was clear that more data were needed.
To address that need for more data, from 2018 to 2019 Jeff published 13 blog articles on the NPS reporting key findings from that body of research.
2020–2025
Industry Trends
COVID Impacts Led to Zero Interest Rates, Fueling Growth in Tech and UX
The COVID-19 pandemic led to significant economic contraction from 2020 to 2022, including spikes in unemployment worldwide. With the rollout of vaccines and easing of restrictions, economies began to recover in 2022 and experienced robust growth rates, particularly in sectors like technology and healthcare, though with the specter of rising inflation. The pandemic accelerated the adoption of digital technologies, leading to growth in e-commerce, remote work, and digital services.
In UX, designers and researchers began exploring how to incorporate new generative AI tools into their work processes, striving to understand what would and wouldn’t improve their efficiency and quality, given the power and limitations of this new technology. This period also saw significant advancement in remote unmoderated testing.
Company Milestones
Jim Lewis Joins MeasuringU
From the beginning, Jim had been MeasuringU-adjacent while working at IBM, collaborating with Jeff on publishing research and books and jointly teaching tutorials and workshops at major conferences (e.g., HCII, CHI, UXPA). After 40 years at IBM, Jim retired on December 31, 2019, remained retired for one day, then joined MeasuringU on January 2, 2020. During this period, he and Jeff published six peer-reviewed journal articles, a chapter in the 5th edition of the Handbook of Human Factors and Ergonomics (“Usability and User Experience: Design and Evaluation”), and, in 2024, their latest book, Surveying the User Experience (Figure 5).
Figure 5: Cover of Surveying the User Experience.
The pandemic put an end to our in-person training, including our popular UX measurement bootcamp. We shifted to teaching over Zoom in 2020 (extending our training to a worldwide audience) and began creating courses for our MeasuringUniversity® online educational service. Our current bootcamps have a flipped-classroom format, reducing the time spent on live lectures to allow more dynamic interaction during the scheduled live sessions.
Over this period, our org chart really expanded. The research team doubled to ten researchers and managers, the software development team grew from two to ten developers and managers, and the customer engagement, research operations, and project management staff grew, too.
We published 37 industry reports and continued investing in the development of standardized UX metrics and methods (including careful incorporation of generative AI) and our MUiQ platform (details below).
Sample Size
Rating Scales Need Only about ¼ the Sample Size of Binary Metrics
In this period, we analyzed our historical data to estimate typical standard deviations for rating scales and standardized UX questionnaires, then used those estimates to create sample size tables that make it easier for UX researchers to do fundamental sample size estimation (for the SUS in 2022 and more generally in 2023). We also worked out sample size estimation methods for the NPS. Other key blog articles on this topic in 2024 showed that one sample size does not fit all usability studies and analyzed what you get with specific sample sizes in UX problem discovery studies.
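The arithmetic behind the heading’s roughly fourfold saving: the basic sample size formula for a target margin of error is n = (z × s / d)², so n grows with the square of the standard deviation, and a metric with half the (range-normalized) standard deviation needs about a quarter of the sample. Here’s a sketch, where the rating-scale standard deviation of 25% of the scale range is an illustrative value we chose to mirror that relationship, not a figure from the articles:

```python
import math

def n_for_margin(sd, margin, z=1.96):
    """Sample size for a confidence interval of half-width `margin`
    around a mean: n = (z * sd / margin)^2 (normal approximation)."""
    return math.ceil((z * sd / margin) ** 2)

margin = 0.05  # +/- 5% of the scale range
# Worst-case binary metric (p = .5): sd is 50% of the 0-1 range.
print(n_for_margin(0.50, margin))  # 385
# Illustrative rating scale with sd at 25% of its range.
print(n_for_margin(0.25, margin))  # 97, about a quarter of 385
```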
UX Online Tools
MUiQ Goes Moderated
To better serve our clients, we began licensing an enterprise-ready version of MUiQ. We expanded its feature set to include mobile studies, click tests, and, most recently, moderated studies.
Usability Testing
Thinking Aloud Does Affect Behavior
In 2022 and 2023, we published 12 blog articles about think-aloud usability testing, both in-lab and remote unmoderated. Some key findings were that asking participants to think aloud increases dropout rates and task times but leads to the discovery of more problems.
Using the new click-test capabilities of MUiQ, we found, with some caveats, that click tests were reasonably predictive of live site and product page clicks.
Our initial exploration in 2023 of using ChatGPT-4 to support usability testing found variability in multiple AI classifications of user comments, but only slightly lower interrater reliabilities between human coders and ChatGPT than among the human coders themselves. In 2024, we compared ChatGPT-4’s ability to sort items into groups with the outputs of multiple human researchers, finding moderate to substantial interrater reliability. In a separate experiment, we found that ChatGPT-4 was not good at estimating how well humans find items in a tree test but was reasonably good at predicting human ease ratings for the same task.
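For readers unfamiliar with interrater reliability, here’s a minimal sketch of one common agreement statistic, Cohen’s kappa, which corrects percent agreement for chance; the labels are invented, and we’re not claiming this is the exact statistic used in those articles.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected),
    where expected agreement comes from the raters' label frequencies."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

human = ["bug", "praise", "bug", "confusion", "praise", "bug"]
model = ["bug", "praise", "confusion", "confusion", "praise", "bug"]
print(round(cohens_kappa(human, model), 2))  # 0.75, conventionally "substantial"
```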
Data Analysis
Statistical Analysis of the NPS
An outstanding problem for the NPS was a lack of statistical methods for estimating confidence intervals, comparing scores with a benchmark, or comparing two scores. In 2021, we published nine blog articles describing how to do these computations. We also published articles on how to combine significance testing and confidence interval estimation to go beyond simple statistical significance to the evaluation of practical significance.
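To give a flavor of those computations, here’s a sketch of a basic confidence interval for the NPS that treats the three response categories as a multinomial. The published articles recommend an adjusted-Wald variant with better small-sample coverage; the simpler unadjusted version below shows the core idea (function name and data are ours).

```python
import math

def nps_wald_ci(promoters, passives, detractors, z=1.96):
    """Unadjusted Wald interval for the NPS.  With p = proportion of
    promoters and d = proportion of detractors,
    Var(p - d) = [p + d - (p - d)^2] / n."""
    n = promoters + passives + detractors
    p, d = promoters / n, detractors / n
    nps = p - d
    se = math.sqrt((p + d - nps ** 2) / n)
    return 100 * nps, 100 * (nps - z * se), 100 * (nps + z * se)

# 60 promoters, 25 passives, 15 detractors: NPS = 45, roughly +/- 15 points
print(nps_wald_ci(60, 25, 15))
```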
UX Metrics
From the SUS to the UX-Lite: UX Metrics That Span the Generations
In 2020 and 2021, we made major investments in studying the effects of item formats on the rating behaviors of participants in UX research studies. We published 19 blog articles on various topics like labeling scale points, using sliders instead of radio buttons, and using stars or emojis instead of scale numbers. Most of these alterations had little effect on respondent behavior. We researched SEQ® variants and published a new way to interpret SEQ scores based on an adjective scale.
We published “Measuring User Experience With 3, 5, 7, or 11 Points: Does It Matter?” in Human Factors (answer—don’t use three points; otherwise it doesn’t much matter) and “Comparison of Select-All-That-Apply Items with Yes/No Forced Choice Items” in the Journal of User Experience (SATA is better). In 2024, we published a new standardized questionnaire designed to assess the perceived clutter of websites.
During this time, we developed three especially useful new standardized UX metrics: UX-Lite, TAC-10, and SUPR-Qm V2.
UX-Lite
The UX-Lite® has its roots in the UMUX-LITE, which was itself derived in 2013 from the UMUX (2010). It’s a two-item questionnaire, essentially a miniature version of the Technology Acceptance Model (TAM), that assesses the perceived ease of use and perceived usefulness of products and services, and it is becoming an increasingly popular UX metric. From 2020 to 2024, we published 15 articles on the UX-Lite, many of which explored different ways to phrase the “usefulness” item because its original wording was overly complex. In addition to demonstrating the reliability and validity of the UX-Lite, this research showed it to be useful in regression and structural equation modeling of higher-level outcome metrics like ratings of overall experience, likelihood to recommend, likelihood to reuse, and actual user behaviors (2020: “Perceived Usability and the Modified Technology Acceptance Model”; 2023: “Effect of Perceived Ease of Use and Usefulness on UX and Behavioral Outcomes”).
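As a purely hypothetical sketch of how compact such a metric is, the snippet below scores two matching agreement items by rescaling each to 0–100 and averaging. The 5-point format and this exact scoring are assumptions for illustration; see the UX-Lite articles for the published specification.

```python
def ux_lite(ease, usefulness, points=5):
    """Hypothetical two-item scoring sketch: rescale each item to 0-100
    and average.  Assumed format, not the published UX-Lite scoring."""
    def rescale(x):
        return (x - 1) / (points - 1) * 100
    return (rescale(ease) + rescale(usefulness)) / 2

print(ux_lite(4, 5))  # (75 + 100) / 2 = 87.5
```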
TAC-10
We presented the TAC-10, a new measure of tech savviness, at UXPA 2024 based on research conducted at MeasuringU from 2015 through 2023. We published six blog articles in 2023 detailing its development, including why there was a need for a measure of tech savviness in UX research (to enable discrimination of interface and participant characteristics when analyzing UX data) and how to use the TAC-10 to classify participants into different levels of tech savviness.
SUPR-Qm V2
In 2025, we published the second version of the SUPR-Qm in the Journal of User Experience. In this paper, we described our analysis of large-sample datasets to (1) replicate the Rasch model reported in the initial publication of the SUPR-Qm in 2017, (2) identify redundant items from the original 16-item SUPR-Qm that could be removed to create a streamlined five-item version (SUPR-Qm V2), (3) demonstrate the stability of the SUPR-Qm and SUPR-Qm V2 over two multiyear periods in which data were collected for 155 mobile apps across 23 industries, and (4) develop interpretive norms (including curved grading scales) for the SUPR-Qm and SUPR-Qm V2. This new version enhances the usefulness of the SUPR-Qm for UX practitioners and researchers who need a standardized questionnaire that provides a quick five-item measure of the UX of mobile apps. The SUPR-Qm V2 is easy to interpret with norms that should remain stable for many years.
Table 1 summarizes the key topics for the MeasuringU timeline from its genesis to the present day.
| Topics | 1998–2004 | 2005–2008 | 2009–2012 | 2013–2015 | 2016–2019 | 2020–2025 |
|---|---|---|---|---|---|---|
| Industry Trends | Dot-com crash | iPhone, social media | Great Recession | Sluggish recovery | Economic expansion, slow wage growth | Pandemic, contraction, ZIRP, recovery, AI |
| Company Milestones | Identified need for MU | Measuring Usability, LLC | Hired first employee | MeasuringU | Benchmarking the User Experience, over 1 million blog views in 2017 | Surveying the User Experience, increased staff |
| Sample Size | “5 is enough” controversy | Some resolution of “5 is enough” | Quantifying the User Experience (Ch. 6–7) | For eight common research designs | 2nd ed. of Quantifying (correlations & regression) | Tables based on historical UX standard deviations |
| UX Online Tools | Mostly meeting software plus WebEffective | RelevantView, UserZoom | Mobile test capabilities | First version of MUiQ (analytics) | Enhanced MUiQ for data collection and licensing | Enterprise-ready version of MUiQ |
| Usability Testing | Summative defined, formative in crisis | NIST project, CUE studies | Practical Guide | Discovery frequency and problem severity | Severity ratings, facilitation, prototype testing | Think-aloud, click testing, ChatGPT-4 |
| Data Analysis | Need for common-sense UX statistics | Adjusted-Wald binomial confidence intervals | Quantifying the User Experience (Ch. 3–5) | Card sorting, icon testing, reliability | PURE | NPS, practical significance |
| UX Metrics | ISO 9241-11, ANSI CIF, Six Sigma, NPS | SUM | Construct of usability, SEQ, SUS | SUPR-Q | SUPR-Qm, SUS, NPS | UX-Lite, TAC-10, SUPR-Qm V2 |
Table 1: Summary timeline of key topics.
What’s to Come
This is the last article in our 20th-anniversary retrospective; we previously wrote about our foundational years (1998–2008) and the following period of growth and change (2009–2015). The first 20 years have been unbelievably rewarding for a research company dedicated to delivering empirically grounded and practical UX research, and we are looking forward to the next 20. Is it too early to start planning a 50th-anniversary retrospective?