Measuring mobile apps performance in production

Gleb Tarasov on 2023-12-15

“People using your app expect it to perform well. An app that takes a long time to launch, or responds slowly to input, may appear as if it isn’t working or is sluggish. An app that makes a lot of large network requests may increase the user’s data charges and drain the device battery. Any of these behaviors can frustrate users and lead them to uninstall the app.”

Apple Documentation

App performance is an integral part of the user experience. An app that’s prone to freezing or takes ages to launch won’t satisfy our customers. If the waiting time to load search results or the hotel details screen is too long, it could detract from the excitement of planning upcoming vacations. This is something we would definitely prefer to avoid. However, every new feature can slightly degrade app performance, and certain changes can have a much greater impact, which can quickly get out of control.

The key to mitigating performance issues in mobile apps is proper monitoring; without it, any effort to improve or preserve performance would be a shot in the dark.

A brief history of the App Performance team

At Booking.com we’ve been monitoring app performance metrics for quite some time. For instance, the first iOS startup time metric was introduced in 2016. Around 2019, a team responsible for monitoring and improving performance was created.

By 2021, the team realized that the existing setup for performance monitoring was obsolete, unreliable, and didn’t fully fit our requirements, so it needed a revision.

While addressing the functional improvements in metrics, we also decided to rewrite the performance libraries completely, simultaneously transitioning from the older Objective-C/Java to the modern Swift/Kotlin languages. Throughout this process, we designed our libraries to be fully independent of other Booking infrastructures, injecting external dependencies like experimentation, storage, and networking.

Why not use existing third-party tools

There’s no shortage of free or paid tools to monitor app performance. Apple and Google offer some out-of-the-box monitoring solutions, and there are a few big third-party players such as Firebase Performance.

However, our monitoring tool had three primary requirements, all of which were related to integrating with the Booking infrastructure and covering specifics of our development culture:

What we measure

After recognizing the need for performance monitoring, we had to determine the metrics to track. Every metric should address a user pain point. We identified two primary user concerns: wait time and interface smoothness. This led us to focus on three primary metrics:

App startup time

The App Startup Time metric measures the time in milliseconds (ms) from the moment the user taps the app icon on their Home screen until the app draws its first frame.

Both platforms also differentiate between “cold” and “warm” app starts, but our main focus was to improve the “cold” launches, which is when the system cannot benefit from the app state previously cached in memory. That’s because the “warm” starts mostly depend on the performance of specific screens opened first when a user returns to the app (and as you will see in the next section we measure this metric separately anyway).
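
To make the definition concrete, here is a minimal sketch of how a cold start duration could be captured on Android in Kotlin. This is not the API of our library: it assumes API 24+ for Process.getStartElapsedRealtime(), approximates the first drawn frame with a Choreographer callback posted from the first activity, and the reporting call is hypothetical.

```kotlin
import android.app.Activity
import android.os.Bundle
import android.os.Process
import android.os.SystemClock
import android.view.Choreographer

class LaunchActivity : Activity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        // Schedule a callback for the next frame; it fires when the first frame
        // after this screen's creation is about to be drawn (an approximation
        // of "the app draws its first frame").
        Choreographer.getInstance().postFrameCallback {
            val startupTimeMs =
                SystemClock.elapsedRealtime() - Process.getStartElapsedRealtime()
            // reportStartupTime(startupTimeMs) -- hypothetical reporting call
        }
    }
}
```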

More details and official recommendations about the app startup process on each platform can be found in the developer documentation (iOS & Android).

Time to Interactive (TTI)

Time to Interactive (TTI) is the time in milliseconds (ms) from the start of the screen’s creation until the first frame of meaningful content is rendered, i.e. the screen shows real data rather than a loading state and the user can start interacting with it.

Initially, this metric was defined by Google for web development, but we’ve found it very useful and perfectly suitable for mobile apps as well.
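
In its simplest form, tracking TTI boils down to starting a clock when the screen is created and stopping it once meaningful content appears. The sketch below (Kotlin, Android) illustrates the idea; the class name and the reporting call are hypothetical and not part of our library.

```kotlin
import android.os.SystemClock

// A bare-bones TTI tracker: start the clock on screen creation and
// stop it once the first frame of meaningful content is rendered.
class TtiTracker(private val screenName: String) {
    private val startMs = SystemClock.elapsedRealtime()
    private var reported = false

    // Call this once real content (not a loading state) is on screen.
    fun onMeaningfulContentRendered() {
        if (reported) return
        reported = true
        val ttiMs = SystemClock.elapsedRealtime() - startMs
        // report("$screenName.tti", ttiMs) -- hypothetical reporting call
    }
}
```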

To investigate degradations in TTI, we need to understand the reasons behind them. For this purpose, we use supportive metrics. We monitor the wall-clock time and the latency of every network request related to screen loading, which helps us identify degradations caused by the backend. For screens that involve heavy read/write storage operations, it also makes sense to monitor storage performance separately.
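
As an example of such a supportive metric, per-request latency can be captured with a network interceptor. The sketch below assumes the app uses OkHttp (an assumption for this example; the reporting call is hypothetical):

```kotlin
import okhttp3.Interceptor
import okhttp3.Response

// Measures the wall-clock latency of every request going through the client.
class LatencyInterceptor : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
        val startNs = System.nanoTime()
        val response = chain.proceed(chain.request())
        val latencyMs = (System.nanoTime() - startNs) / 1_000_000
        // report(chain.request().url.encodedPath, latencyMs) -- hypothetical
        return response
    }
}
```

The interceptor is registered once via OkHttpClient.Builder().addInterceptor(...), so every request made while a screen is loading contributes a latency measurement.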

Additionally, we use the Time To First Render metric to pinpoint degradations caused by screen creation and rendering (see the next section).

Time To First Render (TTFR)

Time To First Render (TTFR) is the time in milliseconds (ms) spent from the screen’s creation start until the screen renders its first frame.

It starts at the same time as the general TTI measurement but may stop earlier. In the most common case, the screen should be ready to be drawn as soon as possible, but it doesn’t necessarily have to show meaningful content immediately. Usually, the screen can show a progress indicator and do heavy initializations in the background. We stop TTFR tracking once the very first frame is drawn, so the metric is pretty close to measuring the screen’s creation time. Tracking it allows us to catch cases where the UI thread is blocked during screen creation, and keeping it low leads to a better user experience.
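
On Android, the screen’s first frame can be detected with a pre-draw listener on the root view. The sketch below uses the doOnPreDraw extension from AndroidX Core KTX (an assumption for this example; the reporting call is hypothetical):

```kotlin
import android.os.SystemClock
import android.view.View
import androidx.core.view.doOnPreDraw

// Stops the TTFR clock at the screen's very first frame, even if that frame
// only shows a progress indicator.
fun trackTtfr(rootView: View, screenCreationStartMs: Long) {
    rootView.doOnPreDraw {
        val ttfrMs = SystemClock.elapsedRealtime() - screenCreationStartMs
        // report("screen.ttfr", ttfrMs) -- hypothetical reporting call
    }
}
```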

This metric directly impacts TTI and may impact Rendering Performance as well.

Rendering performance

To ensure that a user’s interaction with an app is smooth, the app should render frames in under 16 ms to achieve 60 frames per second (note: on many modern devices, the target might be set to 90 or 120 fps due to higher display frame rates, but we will refer to 60 fps in this article). If the frame rendering time exceeds 16 ms, then the system is forced to skip frames and the user will perceive stuttering in the app.

Let’s try to visualize how the app renders frames on the timeline:

There are 2 main factors that have an impact on how bad the rendering performance might be:

In the illustration above, we see six frames: three frames are good and three have freezes. That means the three good frames were rendered within 16 ms, while the other three took longer, producing freezes of different durations. We calculate a freeze duration as the difference between the actual frame duration and the 16 ms target frame duration. To calculate the total freeze time, we sum the durations of all freezes that happen on the screen.
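
As a rough illustration of this calculation, the sketch below accumulates Freeze Time on Android by comparing consecutive Choreographer frame timestamps against the 16 ms target. It is a simplification of the idea rather than our library’s implementation: it assumes a fixed 60 fps target and ignores warm-up and display refresh-rate changes.

```kotlin
import android.view.Choreographer
import java.util.concurrent.TimeUnit

// Accumulates total Freeze Time: for every frame that takes longer than the
// 16 ms target, the excess over 16 ms is added to the total.
class FreezeTimeTracker : Choreographer.FrameCallback {
    private val targetFrameMs = 16L
    private var lastFrameTimeNanos = 0L
    var totalFreezeMs = 0L
        private set

    fun start() {
        Choreographer.getInstance().postFrameCallback(this)
    }

    override fun doFrame(frameTimeNanos: Long) {
        if (lastFrameTimeNanos != 0L) {
            val frameMs = TimeUnit.NANOSECONDS.toMillis(frameTimeNanos - lastFrameTimeNanos)
            if (frameMs > targetFrameMs) {
                totalFreezeMs += frameMs - targetFrameMs
            }
        }
        lastFrameTimeNanos = frameTimeNanos
        Choreographer.getInstance().postFrameCallback(this)
    }
}
```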

The total Freeze Time can be the same for very different patterns: one freeze of 1000 ms, or a hundred freezes of 10 ms each. Also, freeze time can increase without any code change, simply because the session gets longer (e.g. when every item of a scrollable list generates slow frames and the user starts to scroll more, the total freeze time goes up).

To catch such situations, we also use two additional metrics:

Both Google and Apple offer metrics for assessing rendering performance. Initially, we adopted a method implemented by Firebase for our rendering performance monitoring, which involved tracking slow frames (>16ms to render) and frozen frames (>700ms). However, we discovered that these metrics did not adequately capture rendering performance degradation.

For instance, consider a scenario where views in a list are already slow, requiring 20ms for rendering. If the rendering time increases to 300ms, the metrics would still report one slow frame per view without any frozen frames, failing to indicate a significant deterioration in rendering time.

Moreover, there is an inconsistency in how performance changes are reflected. A view’s rendering time increasing from 15ms to 20ms is recorded as the same metric change as an increase from 15ms to 300ms, which does not accurately represent the severity of the slowdown.

Apple’s “Hang rate” metric, which is calculated as seconds of hang time per hour, appeared to be more in line with what we needed. It resembles our Freeze Time metric but is normalized by dividing the total freeze time by the session duration. However, this normalization caused the metric to become overly sensitive to changes in user behavior.

For instance, if a product feature causes users to spend more time scrolling through a slow list, the Hang rate may show an improvement because the session duration increased, even though the user experience has degraded due to more freezes.

After encountering various scenarios where the relative metric did not provide a clear picture of performance, we decided to use an absolute metric instead. This allows us to measure rendering performance more accurately, not just for the entire application but for each screen session, without the results being skewed by user behavior or session length.

The absolute metric has certain limitations too. Take the same example: if a product feature results in users scrolling through a slow list more frequently, the rendering metric will worsen even though there hasn’t been a technical decline in performance. However, incorporating a supplementary metric, Session Duration, allows us to manage these situations effectively.

The main idea behind this is that we consider any increase in Freeze Time as a negative performance change, regardless of the reason (though ideally, the user shouldn’t see any freezes at all). Of course, it is important to react to new performance issues caused by a new feature, but it is also important to detect the old screen producing more freezes because users start to interact with it more actively.

Show me the code!

Having wrapped up the theoretical knowledge and clear metric definitions, we could finally implement a working solution for collecting these metrics. We’ve recently open-sourced our performance tracking libraries for both platforms, which you can find on GitHub:

Feel free to try it out, leave feedback, or even better, contribute!

Vadim Chepovsky and Gleb Tarasov