What Happens When You Test a Mobile Prototype on Desktops?


Early and often is not just advice for voting in Chicago; it’s also one of the key principles of designing a usable experience.

Testing an experience while it’s still a prototype allows you to find and fix problems before they become difficult and expensive to address. User experiences with prototypes (even low-fidelity ones) tend to be reasonable predictors of the final experience.

Not surprisingly, then, building and testing prototypes are common activities in the development process for websites, software, physical products, and mobile apps.

With rapidly moving development timelines, designers often want feedback as soon as possible. When testing prototypes for mobile (websites or apps), it can often be easier and faster to recruit and test users who participate in online research on desktop computers or laptops. There are two main reasons: participants don’t have to download an app if you want to record screens or clicks, and it’s easier to display longer task instructions that remain visible on a desktop (e.g., using the MUiQ® platform; see Figure 1).

Figure 1: Desktop and mobile presentations of task instructions.

Panel 1 of Figure 1 shows the desktop presentation, with the prototype, task description, and End Task button visible at the same time. Panel 2 shows the modifications needed for mobile devices: the task instructions appear in a pop-up over the prototype screen, which participants can close by tapping “Proceed with the task” and re-open by tapping Task Info in the lower-right corner of the screen (after which End Task is enabled).

But how realistic is testing a mobile prototype on desktop? A few things to consider about desktops:

  1. They have bigger screens than mobile devices.
  2. They’re operated with a mouse rather than with gestures such as tapping and swiping.
  3. They lack haptic feedback (e.g., vibrations), which is common on mobile devices.
  4. They can’t mimic the contextual interactions of using mobile devices, such as walking, driving, or one-handed use.

So, while it may be easier to collect data on desktop, how realistic is that data relative to data collected on mobile devices?

There is surprisingly little published research comparing assessments of the same mobile application on desktops and mobile devices. In this article, we’ll review the published research we found on these differences and then report the findings of an experiment we conducted.

In 2015, we compared the UX of mobile versus desktop presentations of seven live websites (n = 3,740). We expected that the richer UX of a desktop would lead to higher SUPR-Q® scores. Instead, six of the seven websites had higher SUPR-Q scores for the mobile presentations.

In an earlier published study, Betiol and Cybis (2005) used the mobile usability testing technologies available at the time. They had participants complete seven tasks for a mobile portal app using a mobile emulator on a computer screen, a mobile device fixed to a tripod in a usability lab, and a mobile device fixed to a sled with a wireless camera used outside the lab (12 participants per context). Despite expecting these different contexts to have dramatic effects on user performance and problem identification, they concluded, “The results … showed the existence of more similarities than significant differences” (p. 470).

These examples illustrate the importance of conducting this type of research. Even when you logically expect one result, “the only sure way to have knowledge of a context variable is to vary it” (Abelson’s 7th law: You can’t see the dust if you don’t move the couch).

Clearly, more data is needed, especially because we’ve observed many studies that test mobile prototypes in a desktop setting. So, we conducted an experiment to measure the differences in objective and subjective usability metrics when a mobile prototype is evaluated with a desktop emulator versus on a mobile device.

In January and February 2025, we used Figma to build a fake banking app called Capital Two (yes, creative; we know). We had 100 participants attempt three tasks using our MUiQ platform. Half the participants were assigned to take the study on a desktop with a mobile emulator, and the other half used the mobile version of MUiQ on their Android or iPhone (which required downloading our MUiQ app to record behavior). With 50 participants in each group, the study had enough power to reliably detect moderately large differences in rating-scale metrics (e.g., the SEQ®) and differences of about 27% in binary metrics (e.g., success rates).
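
As a rough check on the 27% figure for binary metrics, the sketch below runs a back-of-the-envelope power calculation. The alpha and power values (two-tailed .05 and 80%) and the assumption of proportions near 50% (the worst case) are ours, not stated in the article.

```python
from math import asin, sqrt

from scipy.stats import norm

n = 50                      # participants per group
alpha, power = 0.05, 0.80   # assumed values, not stated in the article
z_alpha = norm.ppf(1 - alpha / 2)   # ≈ 1.96
z_beta = norm.ppf(power)            # ≈ 0.84

# Minimum detectable Cohen's h for two independent proportions
h_required = (z_alpha + z_beta) * sqrt(2 / n)   # ≈ 0.56

def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions (arcsine transform)."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

# Smallest difference in proportions, centered at 50%, with h >= h_required
min_diff = next(d / 1000 for d in range(1, 1000)
                if cohens_h(0.5 + d / 2000, 0.5 - d / 2000) >= h_required)
print(f"Minimum detectable difference ≈ {min_diff:.1%}")   # about 27–28%
```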

The Three Tasks

Participants were told they would be asked to complete three tasks on a mobile prototype where some links or features may not be functional. The task orders were randomized for each participant.

The three task scenarios and validation criteria were:

Task 1: Check Balance

Description: Find the checking account balance and begin the bill payment process.

Validation: Select the correct balance amount and the first word of the placeholder text for the terms and conditions.

Task 2: Card Transaction

Description: Locate the Jan 7 transaction for the Savor credit card.

Validation: Identify the correct first word of the transaction description.

Task 3: Credit Score

Description: Find and note the credit score range displayed in the app.

Validation: Choose the correct score range from multiple-choice options.

Figure 2 shows the starting state of the prototype app and two ways to display controls that are not visible at the start. Those controls are the necessary first steps for successfully completing the tasks.

Figure 2: Initial paths to successful task completion for all three tasks in the study. From the initial state in Panel 1, users can display the controls below the fold by scrolling or dragging (Panel 2) or clicking on the hamburger menu (Panel 3).

Task 1: Check Balance

Figure 3 shows the flow for efficient completion of this task, assuming the user has selected Check Balance from the starting screen or Checking Account from the hamburger menu. To complete the task, the participant notes the balance in Panel 1, selects Pay Bills from the options in Panel 1, notes the first word in the terms and conditions (Panel 2), scrolls/drags to the bottom of that page, then selects the checkbox to agree (Panel 3).

Figure 3: Flow for the Check Balance task.

Task 2: Card Transaction

Figure 4 shows the flow for efficient completion of this task, assuming the user has selected View Cards from the starting screen or Credit Card Transactions from the hamburger menu. To complete the task, the participant swipes or clicks an arrow control to switch from the Platinum to the Savor card (Panel 1), then notes the first word of the final transaction (Panel 2).

Figure 4: Flow for the Card Transaction task.

Task 3: Credit Score

Figure 5 shows the final screen for the efficient completion of this task (assuming the user has selected View Score from the starting screen or Check Credit Score from the hamburger menu). To complete the task, the participant notes the credit score range.

Figure 5: The completion screen of the Credit Score task.

Task-Level Metrics

To compare performance, we used four task-level metrics: two objective and two attitudinal. Successful completion was assessed with multiple-choice questions, task time was automatically collected by the MUiQ platform, perceived task ease was assessed with the SEQ, and confidence in having completed the task correctly was assessed with our standard confidence item (Figure 6).

Figure 6: Our standard task completion confidence item.

Study-Level Metrics

At the study level, we measured perceived ease and usefulness with the UX-Lite® and the extent to which participants thought the prototype they evaluated was realistic (Figure 7).

Figure 7: The realism item.

Across most of the metrics collected, mobile tended to have slightly more favorable scores, with some differences reaching statistical significance. Before digging into the individual metrics, we compared the composition of the samples and found no significant differences in the distributions of participant age or gender between the two experimental conditions.
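
For sample-composition checks like this, a common approach is a chi-square test of independence on the condition-by-category counts. A minimal sketch with hypothetical gender counts (the article does not report the actual distributions):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts; rows are conditions (desktop, mobile),
# columns are gender categories reported by participants.
counts = np.array([[24, 26],
                   [27, 23]])

chi2, p, dof, expected = chi2_contingency(counts, correction=False)
print(f"chi-square({dof}) = {chi2:.2f}, p = {p:.2f}")  # nonsignificant for these counts
```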

Successful Task Completions

All success rates were high (over 80%), and two of the three completion rates were higher on mobile. Only the Credit Score difference was statistically significant (higher for the mobile condition), as shown in Table 1.

Task Desktop Mobile Difference p
Check Balance 90% 84%    6% 0.375
Card Transaction 86% 94%  −8% 0.185
Credit Score 82% 96% −14% 0.026

Table 1: Successful task completions with p-values generated using N−1 Two-Proportion tests.
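
Table 1’s p-values can be reproduced from the summary data. The sketch below implements the N−1 two-proportion test in its z-test form (the usual pooled z statistic multiplied by the square root of (N−1)/N); the success counts are inferred from the reported percentages and the 50 participants per condition.

```python
from math import sqrt

from scipy.stats import norm

def n_minus_1_two_proportion_test(x1, n1, x2, n2):
    """Two-tailed p-value for the N-1 two-proportion test."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    big_n = n1 + n2
    z = (p1 - p2) / se * sqrt((big_n - 1) / big_n)   # N-1 adjustment
    return 2 * (1 - norm.cdf(abs(z)))

# Successes inferred from Table 1 (50 participants per condition)
for task, x_desktop, x_mobile in [("Check Balance", 45, 42),
                                  ("Card Transaction", 43, 47),
                                  ("Credit Score", 41, 48)]:
    p = n_minus_1_two_proportion_test(x_desktop, 50, x_mobile, 50)
    print(f"{task}: p = {p:.3f}")
# Prints p = 0.375, 0.185, and 0.026, matching Table 1
```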

To understand the differences in completion rates, we reviewed the task videos (captured in MUiQ) of the participants who failed the Credit Score task (nine in the desktop condition and two in the mobile condition), focusing on the properties of the screens shown in Figure 8.

Figure 8: Two potential sources of error in the Credit Score task.

As shown in Panel 1 of Figure 8, the area on the home screen required to complete the first step of the Credit Score task is completely hidden from view (below the fold) and includes an image that participants might interpret as the target for the task. We were curious about whether there were differences in the error paths for the desktop and mobile conditions. We hypothesized several different potential error paths:

  • Participant did not find either path to the credit score screen (via scrolling/dragging or the hamburger menu) and incorrectly guessed at the correct answer to the multiple-choice question.
  • Participant scrolled/dragged to the bottom of the home screen, then estimated the score indicated by the static graphic (Figure 8, Panel 1).
  • Participant navigated to the credit score screen; then, instead of focusing on the range displayed toward the top of the screen, used the credit score graphic to estimate the score (Figure 8, Panel 2).

For those who found the credit score screen, we expected participants to make a note of the range displayed at the top of the page (690–750) rather than trying to interpret the credit score pointer graphic. The distractors in the multiple-choice question were 400–500, 650–700, and 725–800. None of the participants selected 400–500. Both mobile participants who were incorrect selected 650–700, as did seven of the nine participants in the desktop condition who were incorrect (the other two selected 725–800). The absence of any selection of 400–500 strongly suggests either that these participants were trying to interpret one of the credit score pointer graphics or that 650–700 seemed like the most plausible range for guessing.

Of the two incorrect participants in the mobile condition, one never manipulated the home page below the fold (see Video 1), and the other dragged the home page to reveal the button for checking credit scores and selected it to get to the credit score page.

Video 1: Participant in the mobile condition who never found the credit score page.

Of the nine incorrect participants in the desktop condition, only one used the hamburger menu to get to the credit score page. Two never navigated to the credit score page; it’s possible that both thought they had finished the task after revealing the static credit score graphic at the bottom of the home page. The other six all manipulated the home page to get to the bottom and selected the button to get to the credit score page. We suspect they noted the apparent (but actually ambiguous) location of the credit score pointer, which led them to select an incorrect option in the multiple-choice question (e.g., see where the participant in Video 2 seems to focus the mouse pointer on the credit score dial toward the end of the video).

Video 2: Participant in the desktop condition who found the credit score page but reported an incorrect range.

Task Times

The tasks were short (averaging roughly half a minute to a minute), but all times were nominally faster on mobile than on desktop, most notably for the Credit Score task. Because time is a continuous metric, it’s often the most sensitive of the three key usability metrics (completion rates, completion times, and ratings of perceived ease), but its sensitivity is reduced when the data are highly variable (often because some users take much longer than others).

The mean completion times for the three tasks and the associated t-test results are in Table 2.

Task Desktop Mobile Difference p
Check Balance 65.2 63.5  1.7 0.82
Card Transaction 57.0 54.8  2.2 0.77
Credit Score 49.3 36.2 13.0 0.15

Table 2: Task completion times (in seconds, t-tests had 98 degrees of freedom).
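
The article doesn’t publish per-participant times, but the 98 degrees of freedom indicate a standard pooled-variance (Student’s) t-test on two groups of 50. A minimal sketch with simulated times standing in for the raw data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated completion times (seconds); placeholders for the unpublished raw data
desktop_times = rng.lognormal(mean=3.8, sigma=0.5, size=50)
mobile_times = rng.lognormal(mean=3.6, sigma=0.5, size=50)

# Pooled-variance t-test: df = 50 + 50 - 2 = 98
result = stats.ttest_ind(desktop_times, mobile_times, equal_var=True)

# 95% confidence interval for the difference in means
n1, n2 = len(desktop_times), len(mobile_times)
diff = desktop_times.mean() - mobile_times.mean()
sp2 = ((n1 - 1) * desktop_times.var(ddof=1) +
       (n2 - 1) * mobile_times.var(ddof=1)) / (n1 + n2 - 2)
margin = stats.t.ppf(0.975, n1 + n2 - 2) * np.sqrt(sp2 * (1 / n1 + 1 / n2))
print(f"t(98) = {result.statistic:.2f}, p = {result.pvalue:.3f}, "
      f"difference = {diff:.1f} ± {margin:.1f} seconds")
```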

Ease and Confidence

Ease (Table 3) and confidence (Table 4) ratings were nominally higher for mobile in five of the six comparisons. Using an alpha criterion of .10, three of the differences were statistically significant in favor of the mobile condition (Check Balance and Card Transaction for the SEQ; Check Balance for confidence ratings).

Task Desktop Mobile Difference p
Check Balance 6.1 6.5 −0.3 0.06
Card Transaction 6.0 6.4 −0.5 0.05
Credit Score 6.4 6.7 −0.3 0.11

Table 3: SEQ results (t-tests had 98 degrees of freedom).

Task Desktop Mobile Difference p
Check Balance 6.1 6.6 −0.5 0.06
Card Transaction 6.5 6.5  0.0 0.87
Credit Score 6.4 6.7 −0.3 0.85

Table 4: Confidence ratings (t-tests had 98 degrees of freedom).

UX-Lite

Mobile UX-Lite ratings were seven points higher than desktop. The difference was statistically significant, primarily driven by the ease ratings (Table 5).

Metric Desktop Mobile Difference p
UX-Lite 84.8 91.8 −7.0 0.02
Ease 86.5 94.5 −8.0 0.01
Usefulness 83.0 89.0 −6.0 0.12

Table 5: UX-Lite scores (all five-point scales transformed to 0–100-point scales; t-tests had 98 degrees of freedom).
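
For context on the note about transforming five-point scales to 0–100 points: the UX-Lite items are typically rescaled by linear interpolation, (rating − 1) × 25, and the UX-Lite score is the mean of the two rescaled items. A sketch under that assumption (the example ratings are hypothetical):

```python
def rescale_5pt_to_100(rating: float) -> float:
    """Linearly interpolate a 1-5 rating to a 0-100 point scale."""
    return (rating - 1) / 4 * 100

def ux_lite(ease: float, usefulness: float) -> float:
    """UX-Lite score: mean of the two rescaled items."""
    return (rescale_5pt_to_100(ease) + rescale_5pt_to_100(usefulness)) / 2

# One hypothetical participant: ease = 5, usefulness = 4
print(ux_lite(ease=5, usefulness=4))   # (100 + 75) / 2 = 87.5
```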

Realism Ratings

Our final measure assessed whether participants perceived the experience as less realistic on desktop than on mobile. Surprisingly, we found almost no difference: participants rated the realism of the prototype nearly the same regardless of whether it was presented on desktop or on mobile.

The mean difference in realism scores was −0.08 (a little less than 1% of the 0–10-point scale, slightly favoring the mobile condition), which was not statistically significant (t(98) = −0.20, p = .85).

To compare UX data collected for a prototype mobile banking app displayed on a desktop computer versus on a mobile device, we had 100 participants (50 per condition) attempt three tasks: Check Balance, Card Transaction, and Credit Score. Overall, the results were generally better for the mobile condition (in some cases, significantly so).

Key Findings

Mobile had somewhat higher completion rates and faster times. Two of the three task completion rates were higher for the mobile condition, and one of those was statistically significant. For the task with significantly different completion rates (Credit Score), we reviewed the failure videos and found that the user behaviors that led to failure were consistent across conditions. Average task completion times were faster in the mobile condition, only slightly for two tasks, but 13 seconds faster for Credit Score (none of the differences were statistically significant).

Mobile tasks were rated as slightly easier. Participant ratings of task ease (SEQ) and confidence in successful task completion were higher in the mobile condition (significantly different SEQ for Check Balance and Card Transaction but not for Credit Score; significantly different confidence only for Check Balance).

Mobile had generally higher UX metrics and significantly higher UX-Lite scores. Out of 16 comparisons (including the UX-Lite and its components), the results favored the mobile condition 15 times, and six of those 15 comparisons were statistically significant (using an alpha criterion of .10).

The mobile advantage was not due to a difference in perceived realism. There was no significant difference in ratings of perceived realism (slightly favoring the mobile condition, but with an unstandardized effect size of less than 1%).

It’s Unclear Why Mobile Scores Are Slightly Better

It isn’t clear why the results so consistently favored the mobile condition, especially considering that participants in the mobile condition could not see the task instructions while attempting the task (Video 1), whereas the task instructions were always visible in the desktop condition (Video 2).

Nothing in the participant videos accounts for the differences in completion rates. Our review of the videos didn’t turn up anything in the stimuli or respondent behaviors that accounted for the difference in completion rates for the Credit Score task. There were no significant differences in the age or gender distributions of the two conditions, but it’s possible there were other differences in the characteristics of participants who chose to take the study on a desktop versus on a mobile device; we just don’t know. However, the favorable mobile scores are consistent with our earlier finding that SUPR-Q scores were also slightly higher for mobile websites.

Differences in the orientation of rating scales don’t explain the mobile advantage. One possible contributor to the difference in subjective assessments was the change in orientation of the rating scales: horizontal in the desktop condition and vertical in the mobile condition (e.g., Figures 6 and 7). We have examined that difference in previous studies with five-point and eleven-point scales, but in those studies the mobile advantage was small (about 1–2% of the scale range) and nonsignificant, so it doesn’t account for the magnitude of the significant rating differences in this study.

The results might change if we replicate the study. Finally, 50 participants per condition isn’t a small sample size, but it’s also not a very large one. We plan to continue to research this topic to better understand which effects are likely real and which may have been due to chance.

Participants were not randomly assigned to experimental conditions. One covariate that may play a role is that panel participants who take mobile studies might differ slightly from those who take desktop studies. Although our demographic data (age, gender) showed no differences, other factors may be at play. Because of the nature of panel recruitment, participants were not randomly assigned to the desktop or mobile condition; instead, we simultaneously recruited separate samples for a mobile study and a desktop study.

Bottom Line

Right now, we can only speculate about what factors drove the differences between the desktop and mobile conditions in this experiment.

The mobile advantage doesn’t appear to be due to differences in perceived realism or rating scale orientation. The clearly nonsignificant difference in realism ratings strongly suggests that a difference in perceived realism across conditions was NOT the cause. Our previous studies of rating scale orientation found a modest mobile advantage, but not one large enough to account for the differences in this experiment.

Is it OK to test mobile on desktop? Probably. This analysis suggests that if you decide to test mobile prototypes on a desktop, you’ll likely get similar results, though potentially with slightly lower ratings than if the study were administered on a mobile phone. While there’s more we can test, this first experiment rules out major differences.
