Status pages: a complete guide for engineering teams
A status page is the public record of whether your product is working. When something breaks, it is the first place customers look, the link your support team pastes into every ticket, and the artifact your contracts point at when an SLA is questioned. Done well, it cuts inbound tickets and buys trust. Done badly, it quietly tells customers everything is fine while their checkouts fail.
This guide covers what a status page is, the difference between public and internal pages, the two models for deciding status (manual and automated), what belongs on a page, and how to pick a tool. It is written for SRE, platform, and on-call engineers who own reliability and have to communicate it.
What a status page actually is
At its simplest, a status page shows a list of services or components and a verdict for each one: operational, degraded, or down. It usually carries an incident log (what broke, when, and the updates as you fixed it), a history strip showing uptime over the last 30 or 90 days, and a way for customers to subscribe to updates by email, RSS, or a chat integration.
The page answers one question for a customer: can I rely on this product right now? Everything else on the page exists to support that answer or to explain it after the fact.
Public versus internal status pages
There are two audiences, and they need different pages.
A public status page faces your customers. It lives on a domain like status.yourcompany.com, it is intentionally plain, and it errs toward reassurance. It shows the services customers care about, not your internal architecture. Wording is careful because every word is a commitment.
An internal status page faces your own engineers, support, and leadership. It can be denser, show more services, and expose the underlying metrics. It is gated behind SSO so only your own staff can see it. Platform teams use it to see the health of shared services that many product teams depend on. See internal team status pages for how that pattern works in practice.
Most teams need both, fed from the same signals, so the internal view and the public view never disagree. A page that says "operational" to customers while the internal dashboard is red is the fastest way to lose trust.
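One way to keep the two views from disagreeing is to derive both from a single set of component verdicts. The sketch below is a minimal illustration, not any particular tool's implementation. The component names, the service mapping, and the "worst component wins" rollup rule are all assumptions made for the example.

```python
from enum import IntEnum


class Status(IntEnum):
    # Ordered so that max() picks the worst status.
    OPERATIONAL = 0
    DEGRADED = 1
    DOWN = 2


# Hypothetical mapping from customer-facing services to internal components.
SERVICE_COMPONENTS = {
    "Checkout": ["payments-api", "cart-db"],
    "API": ["edge-gateway"],
}


def internal_view(component_status: dict[str, Status]) -> dict[str, str]:
    """Internal page: every component, listed by its own name."""
    return {name: status.name.lower() for name, status in component_status.items()}


def public_view(component_status: dict[str, Status]) -> dict[str, str]:
    """Public page: one verdict per service, taken from its worst component."""
    view = {}
    for service, components in SERVICE_COMPONENTS.items():
        worst = max(component_status[c] for c in components)
        view[service] = worst.name.lower()
    return view
```

Because both functions read the same dictionary, a degraded `cart-db` shows up as a degraded "Checkout" on the public page at the same moment it turns yellow internally.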
The hard part: deciding the verdict
A status page is only as honest as the thing that decides each verdict. There are two models.
The manual model
Someone watches a dashboard, notices a problem, decides it is worth communicating, and flips a component to "degraded" by hand. This is how the incumbent tools work by default.
The manual model has one fatal weakness: it depends on a human being awake, paying attention, and willing to admit something is wrong. The page goes yellow minutes or hours after the metric does, and it goes green again on a hunch. We cover why this drifts from reality in "why status pages lie".
The automated model
Status is decided by a measurement. A check or a metric crosses a threshold, and the verdict updates without anyone touching it. Within the automated model there are two very different signals.
Uptime checks ping a URL from outside and confirm it responds. They are easy to set up and catch hard-down outages, but they cannot see the metric that actually defines healthy for your product. A checkout endpoint can return 200 OK while p99 latency is four seconds and customers are abandoning carts.
Internal metrics are the numbers your engineers already watch: request latency, error rate, queue depth, replica lag. These define healthy in terms of your actual SLA. The challenge has always been getting them onto a public page without exposing your internal monitoring stack. That is the gap Observer is built to close: it reads the metric inside your network and publishes only the verdict.
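To make "publishes only the verdict" concrete, here is a minimal sketch of threshold evaluation. It is not Observer's actual code or API. The `Threshold` type, the field names, and the threshold values in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    degraded_above: float  # readings above this count as "degraded"
    down_above: float      # readings above this count as "down"


def verdict(value: float, threshold: Threshold) -> str:
    """Map one metric reading to a verdict."""
    if value > threshold.down_above:
        return "down"
    if value > threshold.degraded_above:
        return "degraded"
    return "operational"


def publishable(service: str, value: float, threshold: Threshold) -> dict:
    """Build the outbound payload. The raw metric value is deliberately omitted."""
    return {"service": service, "status": verdict(value, threshold)}
```

With an assumed threshold of `Threshold(degraded_above=1.0, down_above=10.0)` on p99 latency in seconds, the four-second checkout from the uptime-check example would be published as "degraded" even though the endpoint still returns 200 OK.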
The deeper comparison between these models is in automated versus manual status pages.
What to include on a status page
A good public page is short. Include:
- Services, not components of your architecture. Customers care about "Checkout" and "API", not "kafka-broker-3". Group by the thing the customer buys.
- A clear current verdict per service, with a non-color cue (a label or icon) so the page is readable by color-blind visitors.
- An incident log with timestamps and plain-language updates. One incident, updated over time, not three disconnected posts.
- A history strip (30 or 90 days) so a customer can judge your track record, not just this moment.
- Subscriptions by email and RSS at minimum, plus the chat tools your customers use.
- Scheduled maintenance announced ahead of time, so planned downtime never reads as an incident.
Leave off internal jargon, raw graphs that need context, and anything you cannot stand behind contractually.
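The scheduled-maintenance item in the list above can be sketched as a small rule: during an announced window, a non-operational reading is shown as maintenance rather than as an incident. This is an illustrative sketch only. The function name, the "maintenance" label, and the half-open window convention are assumptions.

```python
from datetime import datetime


def display_status(
    measured: str,
    now: datetime,
    windows: list[tuple[datetime, datetime]],
) -> str:
    """Return what the page should show for one service at time `now`.

    `windows` holds announced maintenance windows as (start, end) pairs,
    treated as start <= now < end.
    """
    in_window = any(start <= now < end for start, end in windows)
    if in_window and measured != "operational":
        return "maintenance"
    return measured
```

Once the window closes, the measured status is shown again, so maintenance that overruns its announced end surfaces as a real incident.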
What to leave off
Resist the urge to expose every metric. A public page is a verdict, not a dashboard. Raw time-series invite misreading ("why was latency 800ms at 3am?") and turn a reassurance tool into a support burden. Keep the detail on the internal page.
Avoid vanity uptime numbers you cannot defend. If your history strip says 100% but customers remember last month's outage, the page loses credibility, which is the opposite of its job.
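A defensible history number is one computed from the recorded outages rather than typed in. The sketch below shows one way to do that, assuming outages are stored as (start, end) pairs. The function names are hypothetical, and overlapping outages are counted once.

```python
from datetime import datetime, timedelta


def window_ending(now: datetime, days: int) -> tuple[datetime, datetime]:
    """Return the (start, end) of a history window, e.g. 30 or 90 days."""
    return now - timedelta(days=days), now


def uptime_percent(
    window_start: datetime,
    window_end: datetime,
    outages: list[tuple[datetime, datetime]],
) -> float:
    """Percentage of the window not covered by any outage interval."""
    total = (window_end - window_start).total_seconds()
    # Clip each outage to the window and sort by start time.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in outages
        if end > window_start and start < window_end
    )
    down = 0.0
    cursor = window_start  # end of the downtime already counted
    for start, end in clipped:
        start = max(start, cursor)  # skip any part already counted
        if end > start:
            down += (end - start).total_seconds()
            cursor = end
    return 100.0 * (total - down) / total
```

As a sanity check on the arithmetic, a 30-day window is 43,200 minutes, so a single 43.2-minute outage works out to 99.9% uptime.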
Per-customer status
One number can mean two different things. A 247ms response time might be healthy for a customer on a 99% best-effort contract and a breach for one on a 99.99% contract with a tighter latency target. A single global page cannot tell both truths.
Per-customer status pages solve this: the same probe publishes a different verdict per customer, each against their own SLO. This used to be an enterprise-only feature; it does not have to be. See customer-scoped pages for the model.
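The idea can be sketched in a few lines: one measurement, compared against each customer's own target. The customer names and latency limits below are invented for illustration and do not come from any real contract. The "breach" label is likewise an assumption.

```python
# Hypothetical per-customer latency targets, in milliseconds.
CUSTOMER_SLO_MS = {
    "best-effort-co": 500.0,
    "premium-co": 200.0,
}


def verdicts_for(measured_ms: float, slos: dict[str, float]) -> dict[str, str]:
    """Evaluate one probe reading against every customer's own limit."""
    return {
        customer: "operational" if measured_ms <= limit else "breach"
        for customer, limit in slos.items()
    }
```

Calling `verdicts_for(247.0, CUSTOMER_SLO_MS)` yields "operational" for the best-effort customer and "breach" for the premium one, which is the two-truths situation a single global page cannot show.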
Incident communication
The page is only half the job. The other half is how you communicate during an incident. Detection should be automatic so the clock starts the moment a threshold is crossed, but publishing to customers should stay a human decision. Auto-drafting an incident and asking on-call to approve it in one click is the balance most teams want: fast detection, reviewed communication. More on this in incidents.
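The split between automatic detection and human-approved publishing can be modeled as a two-state incident. This is a minimal sketch under assumed names (`Incident`, `detect`, `approve`) and an assumed draft wording, not a description of any specific product's workflow.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Incident:
    service: str
    detected_at: datetime   # the clock starts at detection, not at publication
    draft: str
    state: str = "draft"    # "draft" until a human approves, then "published"
    approved_by: Optional[str] = None


def detect(service: str, value: float, limit: float, now: datetime) -> Optional[Incident]:
    """Open a draft automatically when the reading crosses the limit."""
    if value <= limit:
        return None
    return Incident(
        service=service,
        detected_at=now,
        draft=f"We are investigating elevated latency on {service}.",
    )


def approve(incident: Incident, approver: str) -> Incident:
    """Publish the draft. Nothing reaches customers until this is called."""
    incident.state = "published"
    incident.approved_by = approver
    return incident
```

Keeping `detected_at` separate from the approval step means the incident timeline reflects when the threshold was crossed, even if on-call takes a few minutes to review the wording.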
Route status changes to where your team already works (Slack, PagerDuty, webhooks) so the page and your on-call rotation never fall out of sync.
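Routing usually means notifying on transitions only, so channels are not flooded with repeated "still operational" messages. The sketch below assumes a hypothetical `StatusRouter` with pluggable notifier callbacks. Real Slack, PagerDuty, or webhook calls would live inside those callbacks and are left out here.

```python
from typing import Callable

# A notifier receives (service, old_status, new_status).
Notifier = Callable[[str, str, str], None]


class StatusRouter:
    """Fan a status change out to every registered notifier."""

    def __init__(self, notifiers: list[Notifier]):
        self._notifiers = notifiers
        self._last: dict[str, str] = {}

    def update(self, service: str, status: str) -> bool:
        """Record a status; notify and return True only if it changed."""
        old = self._last.get(service)
        self._last[service] = status
        # The first reading sets the baseline and is not treated as a change.
        if old is None or old == status:
            return False
        for notify in self._notifiers:
            notify(service, old, status)
        return True
```

Because every channel is driven by the same `update` call that feeds the page, the page and the on-call tooling see each transition together.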
How to choose a status page tool
Work through these questions in order:
- What decides status? Manual, uptime checks, or internal metrics? This is the single most important choice. If healthy is defined by a number your engineers watch, a manual or ping-only tool will always lag reality.
- Where does your data live? If the signal that defines healthy is an internal metric, the tool needs a way to read it without your data leaving your network.
- Do you need per-customer truth? If you sell different SLAs, you need per-customer verdicts.
- How does it price? Per-page, per-monitor, per-seat, or flat? Pricing that scales with your team or subscriber count gets expensive fast.
- Is it open and scriptable? An API, an open-source agent, and config-as-code matter if you want to manage status the way you manage everything else.
We maintain head-to-head breakdowns against the common tools on the compare pages, including Statuspage, Better Stack, Instatus, and UptimeRobot.
Where Observer fits
Observer is a status page driven by the metric your engineers already watch. An agent runs inside your network, reads Prometheus, OpenTelemetry, CloudWatch, logs, databases, and network probes, applies the threshold you already use, and pushes only the verdict outbound. Raw metrics never leave. Pricing is flat with no per-seat fees. If your status should follow your metrics rather than an operator's memory, start with a free page.