Why status pages lie

Taha Al-Jody

A status page is supposed to answer one question: is the product working right now? Most status pages answer a different question by accident. They report what an operator last typed, which is not the same as what the system is doing.

The gap between typed and measured

Traditional status pages are manual. Someone notices a problem, opens the dashboard, decides it is bad enough to communicate, and flips a component to "degraded". Every step in that chain adds delay and judgment. The page goes yellow minutes or hours after the metric did, and it goes green again on a hunch. The trade-offs of this model are laid out in automated versus manual status pages.

Uptime checks do not close the gap. A check that pings a URL confirms the URL responds. It does not confirm that checkout latency is inside the contract you signed, or that the replica lag is under the threshold your team treats as healthy. The number your engineers actually watch lives on an internal dashboard the page never reads.

This is how a page ends up showing "all systems operational" in green while customers are failing to check out. Nobody is lying on purpose. The page is just reporting a stale human judgment instead of a live measurement.

What the gap looks like in practice

Picture a checkout service. The team's real definition of healthy is "p95 latency under 800 ms". One afternoon a slow dependency pushes p95 to 2.4 seconds. Requests still return 200, so the uptime check stays green and the public page stays green with it. Customers feel every second of that latency; some abandon the cart. The internal Grafana dashboard turned amber the moment p95 crossed the line, but the status page never looked at that dashboard.
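That afternoon can be sketched in a few lines of Python. This is an illustration under stated assumptions, not anyone's production check: the sample values are made up, the percentile uses the simple nearest-rank method, and the function names are hypothetical.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a list of latency samples (ms)."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n gives the 1-based nearest rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def uptime_verdict(status_codes):
    """What a ping-style check sees: only whether requests return 200."""
    return "operational" if all(code == 200 for code in status_codes) else "down"


def latency_verdict(samples_ms, threshold_ms=800):
    """What the team actually means by healthy: p95 under the threshold."""
    return "operational" if p95(samples_ms) < threshold_ms else "degraded"


# Made-up samples: every request succeeds, but the slowest tenth takes 2.4 s.
status_codes = [200] * 20
latencies_ms = [300] * 18 + [2400, 2400]

print(uptime_verdict(status_codes))   # the ping check is satisfied
print(latency_verdict(latencies_ms))  # the latency definition is not
```

The same traffic produces two different verdicts, and only the second one matches what customers are feeling.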

Eventually someone in an on-call channel says "checkout feels slow". An engineer confirms it, debates whether it is bad enough to post, and twenty minutes later flips the component to degraded. By then the dependency has half recovered, so the page now overstates the problem. Ten minutes after that, someone flips it back to operational on a feeling, not on a measurement. From the outside the page told two lies in one incident: green when it was slow, then degraded when it had mostly recovered. Both were honest mistakes, and both came from the same root cause. The page was driven by a person's memory of a dashboard instead of the dashboard itself.

The cost is trust. A page that is green during a real slowdown teaches customers to ignore it, and a page they ignore is worse than no page at all, because it absorbs effort and produces false confidence.

Read the metric, publish the verdict

Observer takes the metric your engineers already trust, applies the same threshold they already use, and publishes the result. There is no second source of truth to keep aligned and no manual step between the breach and the page.

In the checkout example, you would define healthy as p95 under 800 ms once, in the same terms your team already reasons about. When the metric crosses the line, the page flips within seconds. When the metric recovers and holds, the page recovers on its own. A short dwell window keeps a one-off spike from flapping the page, so the public status reflects a sustained condition rather than a single noisy sample. Nobody is paged to type an update, and nobody has to remember to undo it.
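One way to picture the dwell window is as a small state machine that only changes the public status after the new condition has held for several consecutive readings. The sketch below is an assumption about how such a gate could work, not a description of Observer's internals; the class name, the count-based window, and the default of three readings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DwellGate:
    """Flip the published status only after a sustained change (hypothetical design)."""
    threshold_ms: float = 800.0
    dwell_samples: int = 3          # assumed window: three consecutive readings
    status: str = "operational"
    _streak: int = 0                # consecutive readings that disagree with status

    def observe(self, p95_ms: float) -> str:
        candidate = "operational" if p95_ms < self.threshold_ms else "degraded"
        if candidate == self.status:
            # The reading agrees with what is published; forget any partial streak.
            self._streak = 0
        else:
            self._streak += 1
            if self._streak >= self.dwell_samples:
                self.status = candidate
                self._streak = 0
        return self.status


gate = DwellGate()
readings = [500, 2400, 500, 2400, 2400, 2400, 700, 700, 700]
history = [gate.observe(r) for r in readings]
print(history)
```

With these readings the single 2400 ms spike at the start is ignored, the page turns degraded on the third consecutive breach, and it returns to operational only after three healthy readings in a row. A time-based window would work the same way with timestamps in place of the counter.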

The agent runs inside your network and pushes only the verdict outbound, so the raw metric never leaves. The page reflects measured reality, and it recovers as soon as the metric has recovered and held. That is the whole idea: the metric, not the operator, feeds the status page.
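To make "only the verdict leaves" concrete, here is a rough sketch of an agent-side step that evaluates the threshold locally and serialises nothing but the outcome. The payload shape and the function name are invented for illustration and should not be read as Observer's actual wire format.

```python
import json


def build_outbound_payload(component: str, p95_ms: float, threshold_ms: float) -> str:
    """Evaluate locally, then serialise only the verdict (hypothetical payload shape)."""
    status = "operational" if p95_ms < threshold_ms else "degraded"
    # The raw reading and the threshold are deliberately left out of the payload.
    return json.dumps({"component": component, "status": status}, sort_keys=True)


payload = build_outbound_payload("checkout", p95_ms=2400.0, threshold_ms=800.0)
print(payload)
```

Whatever receives this payload learns that checkout is degraded, but not how slow it is or where the line was drawn.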

The same mechanism makes per-customer truth possible. A 247 ms reading can be healthy for a customer whose 99.9% contract allows a looser threshold and unhealthy for one whose 99.99% contract demands a tighter one. When the verdict comes from a threshold rather than a person, you can run a different threshold per customer from a single probe, which a page driven by manual toggles cannot do.
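A minimal sketch of that idea, assuming each contract tier maps to its own latency limit. The 300 ms and 200 ms limits below are invented for the example, as are the customer names; only the 247 ms reading comes from the text above.

```python
def verdicts_per_customer(reading_ms, thresholds_ms):
    """Apply each customer's own threshold to one shared reading."""
    return {
        customer: "healthy" if reading_ms < limit_ms else "unhealthy"
        for customer, limit_ms in thresholds_ms.items()
    }


# Hypothetical limits: the tighter contract gets the tighter threshold.
thresholds_ms = {
    "customer_on_99_9": 300,
    "customer_on_99_99": 200,
}

verdicts = verdicts_per_customer(247, thresholds_ms)
print(verdicts)
```

One probe, one reading, two verdicts. Adding a customer means adding a threshold, not adding a person to watch a dashboard.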

When a human still belongs in the loop

Reading the metric does not mean removing people. It means putting them where judgment actually helps. Detection is mechanical and should be automatic. Communication is a judgment call and should stay human. Observer auto-drafts the incident the moment a threshold is crossed and asks a person to approve before anything reaches customers. Math decides what happened; a human decides what to say about it. The lie disappears not because a person was taken out of the loop, but because the person is no longer the sensor.
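The split between detection and communication can be sketched as a draft that is created automatically but cannot be published until a person signs off. Everything here is hypothetical (the `IncidentDraft` type, the function names, the default wording); it shows the shape of the gate, not Observer's workflow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentDraft:
    component: str
    status: str
    message: str
    approved_by: Optional[str] = None   # stays None until a human approves


def draft_incident(component: str, status: str) -> IncidentDraft:
    """Created automatically when a threshold is crossed; wording is a placeholder."""
    message = f"{component} is currently {status}. We are investigating."
    return IncidentDraft(component, status, message)


def publish(draft: IncidentDraft) -> str:
    """Refuse to publish anything a person has not approved."""
    if draft.approved_by is None:
        raise PermissionError("draft needs human approval before it reaches customers")
    return f"{draft.message} (approved by {draft.approved_by})"


draft = draft_incident("checkout", "degraded")
draft.message = "Checkout is slower than usual. Payments still go through."  # human edit
draft.approved_by = "on-call engineer"
print(publish(draft))
```

The machine fills in what happened; the person can rewrite the message and must approve it before `publish` will return anything.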

If you are evaluating tools, the compare pages break down how the manual and ping-based incumbents differ from a metric-driven page, and the complete guide to status pages covers the rest of the decision.