NOW LET US – AI RAG SaaS Studio TP.HCM

Alice is impatient


An exploration of the inspection paradox in system performance, explaining why users experience much higher latency and recovery times than what standard system metrics report.

Meet Alice. Alice uses your web service. Alice, like most humans, measures her time in seconds and minutes. Alice says your service is slow. You tell Alice that the mean request to your service completes in 100ms, but Alice says that her mean wait time is 1s.

You’re both right.

Meet Alex. Alex uses your web service. Alex, like most humans, measures his time in seconds and minutes. Alex says that when you have outages, they last a long time and he gets really annoyed. You tell Alex that your MTTR is less than 1 minute. Alex says that he sees the mean outage lasting 1 hour.

Again, you’re both right.

What’s going on? What’s going on is that you’re measuring time in requests, or in outages, and Alex and Alice are measuring time in seconds and minutes. When you have a long request or a long outage, Alex and Alice count that as a long time, with a heavy weight. But you only count that as one.
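To make the two ways of counting concrete, here is a minimal sketch. The workload is hypothetical (999 fast requests and one very slow one, chosen so the numbers land near Alice's story), and the function names are illustrative only.

```python
# Hypothetical workload, in milliseconds: 999 fast requests and one slow one.
durations_ms = [90.5] * 999 + [9590.5]


def mean_per_request(durations):
    """What the service sees: every request counts once."""
    return sum(durations) / len(durations)


def mean_per_unit_time(durations):
    """What a waiting user sees: each request is weighted by its own duration."""
    return sum(d * d for d in durations) / sum(durations)


service_view = mean_per_request(durations_ms)
user_view = mean_per_unit_time(durations_ms)

print(f"service sees: {service_view:.1f} ms")
print(f"user experiences: {user_view:.1f} ms")
```

With this made-up data the per-request mean is 100 ms, while the time-weighted mean is about 1 s, because the single slow request accounts for roughly a tenth of all the time spent waiting.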

More technically, what’s going on here is the inspection paradox. Alex and Alice don’t experience your latency distribution $f(t)$, they experience a t-weighted version of it, in which each duration counts in proportion to its length. If you have an MTTR or mean request time of $\mathbb{E}[X]$, Alex and Alice experience $\mathbb{E}_a[X] = \frac{\mathbb{E}[X^2]}{\mathbb{E}[X]} = \mathbb{E}[X] + \frac{\mathrm{Var}(X)}{\mathbb{E}[X]}$.
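As a short sketch of where that expression comes from: write $f_a(t)$ for the t-weighted density (the symbol $f_a$ is introduced here for illustration and is not from the original). Weighting each duration by its length and renormalizing gives

$$f_a(t) = \frac{t\,f(t)}{\int_0^\infty s\,f(s)\,ds} = \frac{t\,f(t)}{\mathbb{E}[X]}$$

and taking the mean under that density,

$$\mathbb{E}_a[X] = \int_0^\infty t\,f_a(t)\,dt = \frac{\int_0^\infty t^2 f(t)\,dt}{\mathbb{E}[X]} = \frac{\mathbb{E}[X^2]}{\mathbb{E}[X]} = \frac{\mathrm{Var}(X) + \mathbb{E}[X]^2}{\mathbb{E}[X]} = \mathbb{E}[X] + \frac{\mathrm{Var}(X)}{\mathbb{E}[X]}$$

The last step uses $\mathbb{E}[X^2] = \mathrm{Var}(X) + \mathbb{E}[X]^2$. The gap between the two means is therefore driven entirely by the variance relative to the mean.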

Most of the time they’re waiting, they’re waiting for things that take a long time. This is (roughly) how humans experience time.

Let’s play with this with a little simulation. Plug in your median latency (or recovery time) and your 99th percentile latency (or recovery time). We’ll fit a log-normal distribution to those two points, and then plot both what your service metrics see and what your customers see.

[Interactive simulation. Inputs: median (ms) and p99 (ms). Outputs: the mean your service sees (ms) and the mean your customers experience (ms).]

For example, put in 30 as the median (let’s ignore the milliseconds and pretend these are minutes for now) for a 30 minute median TTR (i.e. in half of your postmortems you see a recovery time of $\leq 30$ minutes), and 600 as the p99 (in one of every 100 events, recovery takes 10 hours or more). Your MTTR is just over an hour. Your customers experience a mean time to recovery of around 6 hours!
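The arithmetic behind that example can be sketched in a few lines. This is an illustrative reconstruction under the log-normal assumption, not the code behind the original simulation; the function names are made up here. It uses the facts that a log-normal's median is $e^{\mu}$, its mean is $e^{\mu + \sigma^2/2}$, and its t-weighted mean is $e^{\mu + 3\sigma^2/2}$.

```python
import math
from statistics import NormalDist


def fit_lognormal(median, p99):
    """Return (mu, sigma) of the log-normal with the given median and p99."""
    z99 = NormalDist().inv_cdf(0.99)  # standard normal 99th percentile
    mu = math.log(median)             # median of a log-normal is exp(mu)
    sigma = (math.log(p99) - mu) / z99
    return mu, sigma


def service_mean(mu, sigma):
    """Plain mean E[X]: what per-event metrics report."""
    return math.exp(mu + sigma**2 / 2)


def customer_mean(mu, sigma):
    """Time-weighted mean E[X^2] / E[X]: what someone waiting experiences."""
    return math.exp(mu + 1.5 * sigma**2)


mu, sigma = fit_lognormal(median=30, p99=600)
mttr = service_mean(mu, sigma)
experienced = customer_mean(mu, sigma)

print(f"MTTR: {mttr:.0f} min, customer-experienced: {experienced:.0f} min")
```

For a median of 30 and a p99 of 600 this gives roughly 69 minutes for the MTTR and roughly 361 minutes (about 6 hours) for the customer-experienced mean.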

There are many arguments for why tail latency (and long recovery times) are so important to understand (e.g. multiple samples), but this is the one that I think is the least widely understood. For service times, timeout-and-retry can hide this latency some of the time (as long as the running request doesn’t hold locks or other exclusive resources). But, for recovery time, no such hiding is possible. The heaviness of the tail matters a great deal. This is also one of the reasons I don’t like trimmed measurements (like trimmed means) as a way of thinking about service latency or recovery time. They throw out some really critical context about the shape of the right tail that dominates the customer experience (the other reason is related to Little’s Law and capacity usage, which I’ve written about before).
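A small sketch of the trimmed-mean point, under stated assumptions: the samples are synthetic log-normal draws (median 30, $\sigma = 1.29$, roughly what a median of 30 and a p99 of 600 imply), and "trimmed" here means a one-sided trim that drops the slowest 1%. Both choices are illustrative, not a recommendation.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic recovery times (minutes): log-normal with median 30.
mu, sigma = math.log(30), 1.29
samples = sorted(random.lognormvariate(mu, sigma) for _ in range(100_000))


def mean(xs):
    return sum(xs) / len(xs)


# One-sided trim: discard the slowest 1% of events.
trimmed = samples[: int(len(samples) * 0.99)]

plain_mean = mean(samples)
trimmed_mean = mean(trimmed)
# Time-weighted mean: each event weighted by its own duration.
experienced_mean = sum(x * x for x in samples) / sum(samples)

print(f"trimmed mean:     {trimmed_mean:8.1f}")
print(f"plain mean:       {plain_mean:8.1f}")
print(f"experienced mean: {experienced_mean:8.1f}")
```

The trim moves the summary further from what customers experience: the events it discards are exactly the ones that carry most of the weight in the time-weighted mean.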

A note on log-normal: I chose log-normal here for numerical convenience. It has the nice property that, under this t-weighting, $\mathrm{lognormal}(\mu, \sigma^2)$ becomes $\mathrm{lognormal}(\mu + \sigma^2, \sigma^2)$. Also it’s well-behaved around 0. I don’t believe that log-normal is a particularly good choice of distribution for latency or recovery time metrics, and generally would approach these problems entirely non-parametrically.
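For readers who want to check that property, here is a brief sketch. With $y = \ln t$, the log-normal density is $f(t) = \frac{1}{t\,\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)$ and $\mathbb{E}[X] = e^{\mu + \sigma^2/2}$. Multiplying by $t = e^{y}$ and completing the square in the exponent:

$$y - \frac{(y - \mu)^2}{2\sigma^2} = -\frac{\left(y - (\mu + \sigma^2)\right)^2}{2\sigma^2} + \mu + \frac{\sigma^2}{2}$$

so that

$$\frac{t\,f(t)}{\mathbb{E}[X]} = \frac{1}{t\,\sigma\sqrt{2\pi}} \exp\!\left(-\frac{\left(\ln t - (\mu + \sigma^2)\right)^2}{2\sigma^2}\right)$$

which is the density of $\mathrm{lognormal}(\mu + \sigma^2, \sigma^2)$. Its mean is $e^{\mu + 3\sigma^2/2}$, which agrees with $\mathbb{E}[X^2]/\mathbb{E}[X] = e^{2\mu + 2\sigma^2} / e^{\mu + \sigma^2/2}$.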

© 2026 Now Let Us. All rights reserved.

Source: Hacker News
