Many organizations want a score to quantify the user experience of their products.
Having a score allows organizations to:
- Describe whether designs help or hinder an experience.
- See whether designs improve over time.
- Compare their users’ experience objectively to competitors or industry standards.
But the effort required to generate that score can be onerous. The logistics of empirical evaluations, especially when a company has multiple products, can present many challenges, including:
- Participants who are difficult to recruit.
- Products that can be difficult to set up in a controlled environment.
- Research staff who are already over-committed to existing formative evaluations.
When you need to score the UX of a product, we recommend these three approaches.
In a task-based benchmark study, users attempt prescribed tasks with the interface being evaluated, typically simulating actual usage in a controlled setting. When conducted in a physical lab, this looks a lot like the classic usability test setup (Figure 1). It’s also possible to run task-based studies remotely, moderated or unmoderated, using a service like our MUiQ® platform.
Figure 1: (Left) A classic usability test lab from IBM in the 1970s; the hairstyles and clothing have changed, but the basic setup has been remarkably stable. (Right) One of MeasuringU’s current Denver labs.
Task-based benchmarking is a great way to quantify the user experience at the task and study level. We’ve benchmarked a lot of interfaces, from physical products (TVs, remote controls, smart glasses) to websites, apps, and even service experiences. We did, after all, write the book on benchmarking.
You need access to a working product or a high-fidelity prototype with realistic data (this can be especially hard for some B2B companies) and access to users (also hard for some products). Typically, 5–10 tasks are derived that cover the core functionality for each user role or persona included in the study. Core task-level metrics include completion rates, task times, and ease ratings. Study-level metrics include standardized UX questionnaires and questions such as the SUS, UX-Lite®, SUPR-Q®, and NPS, plus any custom metrics developed for the study.
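To make the metrics concrete, here is a minimal sketch in Python of how task-level metrics and one participant’s SUS response might be scored. The data, variable names, and number of participants are hypothetical, not from an actual study; only the standard SUS scoring rule (odd items scored as the response minus 1, even items as 5 minus the response, summed and multiplied by 2.5) is taken as given.

```python
import math

# Hypothetical task-level data for one task (one entry per participant)
completions = [1, 1, 0, 1, 1, 1, 0, 1]            # 1 = success, 0 = failure
task_times_sec = [48, 62, 190, 55, 71, 44, 210, 80]
ease_ratings = [6, 5, 2, 6, 5, 7, 1, 5]            # e.g., a 1-7 ease item

completion_rate = sum(completions) / len(completions)

# Task times are typically right-skewed, so a geometric mean is often reported.
geo_mean_time = math.exp(sum(math.log(t) for t in task_times_sec) / len(task_times_sec))

mean_ease = sum(ease_ratings) / len(ease_ratings)

# Study-level example: scoring one participant's SUS responses (ten 1-5 items).
sus_responses = [4, 2, 5, 1, 4, 2, 5, 2, 4, 1]
sus_score = 2.5 * sum(
    (r - 1) if i % 2 == 0 else (5 - r)   # i is 0-based, so even i = odd-numbered item
    for i, r in enumerate(sus_responses)
)

print(f"Completion rate: {completion_rate:.0%}")
print(f"Geometric mean time: {geo_mean_time:.0f} s")
print(f"Mean ease: {mean_ease:.1f}")
print(f"SUS: {sus_score:.1f}")
```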
In a retrospective benchmark study, participants are asked to recall their most recent experience with an interface and answer related questions, typically in the form of a survey. We use this approach for our consumer and business software benchmark reports. Our regular SUPR-Q industry reports use a mix of retrospective and task-based metrics.
Retrospective benchmarking enables relatively easy collection of attitudinal data (e.g., perceived usability, perceived usefulness, and behavioral intention) but not objective performance data. These types of data are useful not only for benchmarking (Figure 2) but also for developing statistical measurement models and key-driver analyses. Retrospective surveys also typically include open-ended questions to collect what users liked and disliked about their experiences. By definition, retrospective studies require access to users who have prior experience with the product they are rating.
Figure 2: SUPR-Q percentiles for a 2019 retrospective study of pet supplies websites.
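As an illustration of the kind of key-driver analysis a retrospective dataset supports, the sketch below regresses likelihood to recommend on a few attitudinal attributes using ordinary least squares. The attributes, responses, and sample size are hypothetical and chosen only for demonstration; a real analysis would use the study’s own measures and a much larger sample.

```python
import numpy as np

# Hypothetical retrospective survey responses (one row per respondent):
# columns = perceived usability, perceived usefulness, visual appeal (1-7 scales)
drivers = np.array([
    [6, 5, 6],
    [4, 5, 3],
    [7, 6, 6],
    [3, 2, 4],
    [5, 6, 5],
    [6, 7, 6],
    [2, 3, 2],
    [5, 4, 5],
], dtype=float)

# Outcome: likelihood to recommend (0-10)
ltr = np.array([9, 6, 10, 3, 7, 9, 2, 6], dtype=float)

# Ordinary least squares with an intercept; the coefficients suggest each
# attribute's relative contribution to the outcome (the "key drivers").
X = np.column_stack([np.ones(len(ltr)), drivers])
coefs, *_ = np.linalg.lstsq(X, ltr, rcond=None)

labels = ["intercept", "usability", "usefulness", "visual appeal"]
for name, b in zip(labels, coefs):
    print(f"{name}: {b:+.2f}")
```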
In contrast to the empirical task-based and retrospective methods, PURE (Practical Usability Rating by Experts) is an analytic inspection method that builds on a hundred years of human factors research. Essentially, it’s like a cognitive walkthrough with a three-level scoring rubric that correlates with empirical benchmark data.
A PURE evaluation starts with the identification of target users or personas and their critical or top tasks. Next, each task is decomposed into the logical steps this type of user would take to complete it. Then two or more evaluators independently score the difficulty or mental effort of each task step using a scoring rubric with three levels (Figure 3).
Figure 3: PURE scoring rubric.
The PURE score for a given task is the sum of the scores of all the steps in that task, after reconciliation of any differences among the evaluators’ ratings (Figure 4). The overall color for the task and for the product is determined by the worst score of any step within that task or product, based on the rationale that no mature consumer product should have a step in which the target user is likely to fail a fundamental task.
Figure 4: Example of a PURE scorecard.
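A minimal sketch of that scoring arithmetic, using hypothetical reconciled step scores: each task’s PURE score is the sum of its step scores, and the task and product colors are driven by the worst step. The task names and scores below are made up for illustration.

```python
# Hypothetical reconciled step scores (1 = low effort, 2 = moderate, 3 = high)
# for a product with two tasks, after evaluators resolve their differences.
tasks = {
    "Create account": [1, 1, 2, 1],
    "Change shipping address": [1, 3, 2, 1, 1],
}

COLORS = {1: "green", 2: "yellow", 3: "red"}

def score_task(step_scores):
    """PURE task score = sum of step scores; color = worst (highest) step."""
    return sum(step_scores), COLORS[max(step_scores)]

task_results = {name: score_task(steps) for name, steps in tasks.items()}

# Product-level score: sum of the task scores; product color: worst step anywhere.
product_score = sum(score for score, _ in task_results.values())
product_color = COLORS[max(max(steps) for steps in tasks.values())]

for name, (score, color) in task_results.items():
    print(f"{name}: {score} ({color})")
print(f"Product: {product_score} ({product_color})")
```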
These methods are examples of three major types of UX studies: behavioral (task-based), attitudinal (retrospective), and analytic/inspection (PURE).
Table 1 summarizes the pros and cons of these three methods, which range in cost from low (PURE) to high (task-based). They also differ in setup effort. Task-based benchmarks take the most work to set up and execute because they need access to users and the development of realistic task scenarios. PURE has the second most effortful setup due to the detailed level of task analysis and the need for multiple evaluators. Retrospective studies are the easiest to set up but are usually more expensive than PURE due to the cost of data collection.
| Aspect | Task-based | Retrospective | PURE |
|---|---|---|---|
| Requires access to a working product | Yes | No | No |
| Requires access to at least screenshots or a demo | No | No | Yes |
| Requires definition of realistic tasks with success criteria | Yes | No | Yes |
| Requires access to users | Yes | Yes | No |
| Supports identification of task-level problems | Yes | No | Yes (inferred from evaluators) |
| Applicable to new and existing users | Yes (both) | No (existing only) | Yes (both) |
| Type of study | Empirical | Empirical | Analytic/inspection |
| Setup effort | High | Low | Medium |
| Cost | High | Medium | Low |
Table 1: Aspects (including pros and cons) of task-based, retrospective, and PURE studies.