Uptime monitoring with a status page that checks itself

Taha Al-Jody

A status page answers one question: is the product working right now? Uptime monitoring answers a different question: is a specific endpoint reachable from a given vantage point? The two questions are related but not the same. This post is about connecting them so the answer to the first comes from the answer to the second, automatically.

What uptime monitoring checks

Uptime monitoring means running a probe against a target on a schedule and recording whether the target responded in the expected way. The probe type determines what "expected" means.

HTTP probes make an HTTP or HTTPS request to a URL and evaluate the response. A common check: did the server respond with a 2xx status in under 500 ms? You can also check for a string in the response body, validate the JSON field at a given path, or confirm that the TLS certificate is valid. HTTP probes can follow redirects (up to five), send custom headers, and carry client certificates for mTLS endpoints.
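As a rough illustration of the "2xx in under 500 ms" check, here is a minimal Python sketch using only the standard library. The function names and parameters (evaluate_http, probe_http, must_contain, max_ms) are made up for this example and are not Observer's API. Note that urllib applies its own redirect handling, which is not the five-redirect limit described above.

```python
import time
import urllib.error
import urllib.request
from typing import Optional


def evaluate_http(status: int, elapsed_ms: float, body: str = "",
                  must_contain: Optional[str] = None,
                  max_ms: float = 500.0) -> bool:
    """Return True when the response meets every expectation."""
    if not 200 <= status < 300:       # anything outside 2xx fails
        return False
    if elapsed_ms >= max_ms:          # must be strictly under the limit
        return False
    if must_contain is not None and must_contain not in body:
        return False
    return True


def probe_http(url: str, timeout_s: float = 5.0, **checks) -> bool:
    """Fetch the URL once, time the request, and evaluate the response."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts, and the
        # HTTPError that urllib raises for 4xx and 5xx responses.
        return False
    elapsed_ms = (time.monotonic() - start) * 1000
    return evaluate_http(status, elapsed_ms, body, **checks)
```

Keeping the evaluation separate from the request makes the pass/fail rule easy to test without a network, for example evaluate_http(200, 120.0, "status: ok", must_contain="ok").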

TCP probes open a TCP connection to a host and port, confirm the connection succeeds, and close it. They do not read or write application data. This is the right probe when you care whether the port is listening and reachable, independent of the application layer.
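A TCP probe of this kind can be sketched in a few lines of Python. The function name probe_tcp and its timeout default are assumptions for illustration; the sketch reports the connect time in milliseconds, or None when the connection fails.

```python
import socket
import time
from typing import Optional


def probe_tcp(host: str, port: int, timeout_s: float = 3.0) -> Optional[float]:
    """Open a TCP connection, close it, and return the connect time in ms.

    Returns None if the connection could not be established.
    """
    start = time.monotonic()
    try:
        # No application data is read or written; the connection is
        # closed as soon as the handshake succeeds.
        with socket.create_connection((host, port), timeout=timeout_s):
            pass
    except OSError:
        # Refused, unreachable, timed out, or name resolution failed.
        return None
    return (time.monotonic() - start) * 1000
```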

DNS probes resolve a name against a DNS server and evaluate the answer. You can check the record type (A, AAAA, CNAME, MX, TXT, and others), the resolved value, and the time to resolve. Useful for confirming that a failover record has propagated, or that a CDN is routing to the right origin.

TLS certificate probes connect to a host, read the certificate chain, and report days until expiry. The probe deliberately does not reject expired certificates; it reports the number so you can set thresholds against it, which means you find out thirty days before expiry rather than the moment the cert dies.
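The "days until expiry" number is simple arithmetic on the certificate's notAfter field. The sketch below assumes you already have that field as a string in the format Python's ssl module reports (for example "Jan  5 09:34:43 2018 GMT"); the function name days_until_expiry is hypothetical. Fetching the field from an already expired certificate takes extra work in Python (verification has to be disabled and the DER certificate parsed), which is left out here.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the days remaining before the certificate expires.

    The result is negative once the certificate has expired, so the
    caller can still compare it against thresholds.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since epoch
    current = time.time() if now is None else now
    return (expires_at - current) / 86400             # 86400 seconds per day
```

Returning a number rather than raising on an expired certificate mirrors the behaviour described above: the probe reports, and the thresholds decide.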

ICMP probes send ping packets and report latency, packet loss percentage, or plain reachability as a 0 or 1. They require CAP_NET_RAW on the agent host; most cloud container runtimes do not grant this by default, so they are most useful in on-premises deployments or Kubernetes with explicit privilege grants.

The problem with external uptime checks

Most uptime services run their probes from cloud infrastructure they operate. Your endpoint gets checked from an IP in a data centre somewhere in us-east-1 or eu-west-1. That tells you something: if the probe fails, the endpoint is probably not reachable from the general internet. But it does not tell you whether the service is working from inside your VPC, whether an internal load balancer is healthy, or whether the database replica that a given region depends on is keeping up with replication.

More practically: external checks can only reach endpoints that are public. Internal services, private APIs, databases behind a VPN, and staging environments are invisible to them.

Checks that run where your services run

The Observer agent runs inside your network. Wherever that is (a VM, a Kubernetes cluster, bare metal, or a container on a developer laptop), the agent can check anything reachable from that vantage point. That means private endpoints, internal hostnames, services on RFC 1918 addresses, and anything behind a VPN.

The agent does not expose any inbound ports. It pulls probe definitions from Observer's cloud on a schedule, runs each probe locally, evaluates the result against the thresholds you configured, and pushes the verdict outbound. Raw metric values, query text, and connection strings never leave your network. The cloud receives only the verdict: healthy, degraded, or unhealthy.
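To make the data flow concrete, here is a conceptual Python sketch of one agent cycle. It is not Observer's implementation: ProbeDefinition, run_cycle, and the payload shape are invented for illustration. The point it shows is that the raw reading stays in a local variable and only the verdict goes into the outbound payload.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ProbeDefinition:
    """Hypothetical shape of a probe definition pulled from the cloud."""
    probe_id: str
    run: Callable[[], float]          # executes the probe locally
    classify: Callable[[float], str]  # maps a raw reading to a verdict


def run_cycle(definitions: List[ProbeDefinition]) -> List[Dict[str, str]]:
    """Run every probe once and build the payload to push outbound."""
    outbound = []
    for definition in definitions:
        reading = definition.run()              # raw value, never sent
        verdict = definition.classify(reading)  # healthy/degraded/unhealthy
        outbound.append({"probe_id": definition.probe_id, "verdict": verdict})
    return outbound
```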

From the status page reader's perspective, the check might as well be magic: the page shows green, and somewhere inside your infrastructure an agent confirmed it was green thirty seconds ago by actually connecting to the service.

Connecting the probe to the status page

Each probe maps to a metric definition in Observer. The definition says: run this probe, on this schedule, and compare the result against these thresholds. The thresholds decide the verdict.

For an HTTP probe reporting response time in milliseconds, the thresholds might read: healthy when under 400, unhealthy when over 1500. Values between those come back as degraded. The agent evaluates the comparisons strictly, so a value exactly on a boundary (400 or 1500 here) matches neither rule and comes back as degraded.
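The threshold rule above fits in a few lines of Python. This is a sketch of the logic as described, with a hypothetical function name and the example thresholds as defaults, not the agent's actual code.

```python
def verdict(value_ms: float, healthy_below: float = 400.0,
            unhealthy_above: float = 1500.0) -> str:
    """Map a response time to a verdict using strict comparisons."""
    if value_ms < healthy_below:      # strictly under: healthy
        return "healthy"
    if value_ms > unhealthy_above:    # strictly over: unhealthy
        return "unhealthy"
    return "degraded"                 # everything else, boundaries included
```

With these defaults, verdict(400.0) and verdict(1500.0) both return "degraded", because neither strict comparison matches.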

That verdict feeds a status page. If the status page is public, your customers see it. If it is private, your on-call team sees it. Either way, the displayed status is the output of the last probe run, not a manually typed label.

Probe results and SLO tracking

A probe that consistently reports within threshold contributes to an SLO. An SLO in Observer says: for this service, the metric should be in a healthy state at least X% of the time over the last N days. Observer computes the error budget from those two numbers and tracks burn events as they happen.

This means an HTTP probe on your checkout endpoint, combined with a 99.9% SLO over 30 days, gives you a live view of how much budget that endpoint has consumed and how much remains. When the probe starts failing, the budget moves. When it recovers, the budget stops moving. The SLO drilldown shows each burn event and the exact readings that caused it.
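For a sense of scale, the budget arithmetic can be sketched as follows. This assumes the budget is counted in minutes of non-healthy time over the window; Observer may account for it differently (for example per reading), and the function names are made up for this example.

```python
def error_budget_minutes(target_pct: float, window_days: int) -> float:
    """Minutes the metric may spend outside healthy within the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_pct) / 100.0


def budget_remaining_minutes(target_pct: float, window_days: int,
                             unhealthy_minutes: float) -> float:
    """Budget left after subtracting the time already spent not healthy."""
    return error_budget_minutes(target_pct, window_days) - unhealthy_minutes
```

A 99.9% target over 30 days works out to roughly 43.2 minutes of budget, so a single ten-minute outage on the checkout endpoint would consume a little under a quarter of it.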

What to configure for common cases

Certificate expiry: Use a TLS certificate probe. Set a healthy threshold of "over 30 days remaining" and an unhealthy threshold of "under 7 days remaining". Values between those come back as degraded, which lets you act before the unhealthy state appears on the public page.

Internal API availability: Use an HTTP probe against the internal hostname. Set a healthy response-time threshold that matches your internal SLA. The agent runs inside the VPC; the probe reaches addresses that external checks cannot.

Port listening on a private host: Use a TCP probe. No application-layer knowledge required. If the port is listening and the connection succeeds, the probe returns the connect time in milliseconds.

DNS propagation confirmation: Use a DNS probe against the nameserver you care about. Set the expected record type and optionally an expected value. When the A record propagates to the expected address, the probe flips to healthy.

ICMP reachability for on-premises hosts: Use an ICMP probe with interpretation: reachability. The probe returns 1 when the host responds to ping and 0 when it does not. Set healthy when the value equals 1 and unhealthy when it equals 0.

Keep reading

See Observer's own status page, which reads the metrics: status.use.observer