How Stable is the SUPR-Qm After 8 Years? – MeasuringU


We spend a lot of time using our phones every day, but we hardly make any calls. A lot of our time is spent interacting with various mobile apps. Mobile apps are different enough from websites and hardware products for us to believe they deserve their own UX measurement instrument. We aren’t the only ones who think so, as there have been previous efforts to measure the mobile app experience using standardized questionnaires.

Building on prior work, we developed the SUPR-Qm® in 2017 to focus on the quality of the mobile app experience.

But a lot has changed since 2017, and mobile apps aren’t the same as they were. How stable is the SUPR-Qm model we generated over eight years ago? In this article, we review our latest results from a large-scale data collection effort to assess the stability of the SUPR-Qm.

Between February 2019 and May 2023, we collected retrospective mobile app data from 4,149 participants using our MUiQ® platform. The participants came from a U.S.-based panel with a mix of gender (48% male, 50% female) and age (42% less than 30 years old, and 58% 30 years or older). The data collection effort was part of our rolling SUPR-Q industry surveys, so we also asked participants about their website usage and items specific to an industry (such as dating, pets, and office supplies). In all, we collected ratings for 155 mobile apps across 23 industries.

In these surveys, the SUPR-Qm items were randomly assigned to one of two eight-item grids, and the items within each grid were presented in a randomized order for each participant.

We used Rasch analysis, one of the Item Response Theory (IRT) methods, to analyze the new data for comparison with the original analysis of the SUPR-Qm. Most questionnaires in UX research (including the SUPR-Q) were developed using Classical Test Theory (CTT). But when the measurement goal is a unidimensional measure, potentially with a large number of items focused on measuring individual differences, the more appropriate method is Item Response Theory.
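To make the Rasch model concrete: it places persons and items on a shared logit scale, where the probability that a person endorses an item depends only on the difference between the person's location (tendency to agree) and the item's difficulty. The sketch below shows the dichotomous form of the model; the SUPR-Qm items use five-point agreement scales, which a full analysis would handle with a polytomous extension such as the rating scale model.

```python
import math

def rasch_probability(theta, b):
    """Dichotomous Rasch model: probability that a person with
    location theta (tendency to agree) endorses an item with
    difficulty b. Both parameters are on the same logit scale."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A person located exactly at an item's difficulty has a
# 50% chance of endorsing it.
print(rasch_probability(0.0, 0.0))   # → 0.5

# An easy item (b = -2, like "Easy") is endorsed far more often
# than a hard item (b = +2, like "AppBest") by an average person.
print(rasch_probability(0.0, -2.0))  # ≈ 0.88
print(rasch_probability(0.0, 2.0))   # ≈ 0.12
```

Because only the difference theta − b matters, items and persons can be displayed on one common scale, which is exactly what a Wright map does.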

Interpreting a Wright Map

One key output of Rasch analysis is a Wright map (also called an item-person map), which places the difficulty of the items (how hard it was for respondents to agree with them) on the same measurement scale as the participants’ ratings. On the left side of the map, each # represents a set number of participants; on the right side, each label shows an item’s location.

A Wright map is organized as two vertical histograms, with the items and respondents (persons) arranged from easiest (most likely to agree) at the bottom to most difficult (least likely to agree) at the top. For example, most participants agreed or strongly agreed (4s and 5s) with the items “Easy” and “EasyNav.” In contrast, few participants strongly agreed with the item “AppBest.”

On the left side, the Wright map shows the mean (M) and two standard deviation points (S = one SD and T = two SD) for the measurement of participants’ tendency to agree. On the right side of the map, the mean difficulty of the items (M) and two standard deviation points (S = one SD and T = two SD) for the items are shown.
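As a rough illustration of how a Wright map lines items and persons up on one logit scale, here is a minimal text sketch. The logit values below are hypothetical placeholders chosen only to mimic the layout described above, not the published SUPR-Qm estimates.

```python
# Hypothetical item difficulties and person locations (logits).
items = {"Easy": -2.0, "EasyNav": -1.8, "Delightful": 0.1,
         "AppBest": 1.9, "CantLiveWithout": 2.1}
persons = [-1.5, -0.5, -0.5, 0.0, 0.0, 0.0, 0.5, 1.0]

# Bin persons (left) and items (right) onto the same scale,
# hardest/highest at the top, easiest/lowest at the bottom.
for level in [2, 1, 0, -1, -2]:
    lo, hi = level - 0.5, level + 0.5
    p = "#" * sum(1 for t in persons if lo <= t < hi)
    i = " ".join(name for name, b in items.items() if lo <= b < hi)
    print(f"{level:+d} | {p:<6} | {i}")
```

In a real Wright map the bins are much finer and the M/S/T markers (mean, one SD, two SD) are drawn on each side, but the principle is the same: persons and items share one axis.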

Original and Replication Wright Maps

Figure 1 shows the results from the original analysis conducted in 2017 on the left and the results from this replication study on the right. Table 1 shows the mapping between the labels that appear in Figure 1 and the wording of each item.


Figure 1: Original (left) and replication (right) Wright maps of the SUPR-Qm.

Original Label | Replication Label | Item Wording
CantLiveWo | CantLiveWithout | I can’t live without this app on my phone.
AppBest | AppBest | The app is the best app I’ve ever used.
CantImagineBetter | CantImagineBetter | I can’t imagine a better app than this one.
NeverDelete | NeverDelete | I would never delete the app.
EveryoneHave | EveryoneHave | Everyone should have the app.
Discover | Discover | I like discovering new features on the app.
AllEverWant | AllEverWant | The app has all the features and functions you could ever want.
Delightful | Delightful | The app is delightful.
Integrates | Integrates | The app integrates well with the other features of my mobile phone.
UseFreq | UseFreq | I like to use the app frequently.
DefFuture | DefFuture | I will definitely use this app many times in the future.
AppAttractive | AppAttractive | I find the app to be attractive.
FindInfo | FindInfo | The design of this app makes it easy for me to find the information I’m looking for.
AppMeetsNeeds | AppMeetsNeeds | The app’s features meet my needs.
EasyNav | EasyNav | It is easy to navigate within the app.
Easy | Easy | The app is easy to use.

Table 1: Mapping between Figure 1 labels and item wording.

As shown in the left Wright map in Figure 1, the alignment of the original 16 items on the logit scale ranged between −2 (Easy and EasyNav) on the very easy end of the scale and +2 (AppBest and CantLiveWo) on the very difficult end. There were three clusters of items in between those extremes:

  • Above 0 (hard to agree with): CantImagineBetter, NeverDelete, EveryoneHave, Discover
  • Around 0 (moderate): AllEverWant, Delightful, Integrates, UseFreq
  • Below 0 (easy to agree with): DefFuture, AppAttractive, FindInfo, AppMeetsNeeds

The right panel of Figure 1 shows the replication Wright map. As in the original research, all items were located between −2 and +2 on the logit scale, with items tending to be more separated than in the first panel. Consistent with the original findings, Easy was the easiest item to agree with, and CantLiveWithout was the most difficult. The other items were as follows:

  • Above 0 (hard to agree with): AppBest, CantImagineBetter, NeverDelete, EveryoneHave, UseFreq
  • Around 0 (moderate): AllEverWant, Delightful, Discover
  • Below 0 (easy to agree with): DefFuture, Integrates, AppAttractive, FindInfo, AppMeetsNeeds, EasyNav

Despite a few minor differences in the location of items on the Wright maps shown in Figure 1, the Spearman correlation of the rank order of items on the logit scales was statistically significant and nearly perfect (r(14) = .95).
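A Spearman correlation is simply a Pearson correlation computed on the ranks of the two orderings, so it captures how well the ordering of items (rather than their exact logit values) is preserved across studies. Here is a minimal sketch using hypothetical logit values; the published values appear in the paper.

```python
from math import sqrt

def spearman_rho(x, y):
    """Spearman rank correlation: the Pearson correlation of the
    ranks. Assumes no ties, which holds for distinct logit values."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical item logits from two studies: one adjacent pair
# swaps rank order, so rho is high but not quite 1.
original    = [-2.0, -1.9, -0.5, 0.1, 0.4, 1.8, 2.0]
replication = [-2.1, -1.7, -0.4, 0.5, 0.2, 1.9, 2.1]
print(round(spearman_rho(original, replication), 2))  # → 0.96
```

Because ranks are insensitive to the exact spacing of items, a high rho indicates that the easy-to-hard ordering of the SUPR-Qm items was preserved even where individual logit locations shifted slightly.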

The Wright map for this replication study was very similar to the Wright map published by Sauro and Zarolia (2017). In both maps, the logit scores for items fell between −2 and +2. The rank correlation of items on the two maps was almost perfect (r = .95). Taken together, these findings suggest a successful replication and demonstrate the stability of the SUPR-Qm model over eight years.

In future articles, we will discuss the research we’ve conducted to streamline the SUPR-Qm, verify the stability of full and streamlined versions of the SUPR-Qm, and develop norms for interpreting SUPR-Qm scores.

For more details about this research, see the paper we published in the Journal of User Experience (Lewis & Sauro, 2025).
