What 'Degraded' Actually Means — and How to Pick Your Thresholds
Open ten status pages and you'll find ten different uses of the word "degraded." On one, it means the API is up but slow. On another, it means a non-critical feature is offline while the core works. On a third, it means the error rate is 5% instead of 0.1%. None of them explains which definition it's using.
This is fine while the definition is implicit and shared internally, right up until it isn't. The team that posts "API degraded" and means "error rate is 8%" leaves customers guessing whether they should retry, work around it, or wait. The customer making a $40k purchase decision based on whether the API is reliable enough sees "degraded" and can't tell whether it means a five-minute blip or a six-hour brownout.
If "degraded" is a state your status page can be in, it deserves a definition. Here's how to write one.
The dimensions to pick from
"Degraded" usually means one of these, and choosing among them is the first decision:
- Elevated error rate. Some percentage of requests are failing when they normally wouldn't. The requests that do succeed complete normally.
- Elevated latency. Requests succeed, but slower than normal. The 95th-percentile response time has crept up beyond what users tolerate.
- Partial feature unavailability. The core product works, but a specific surface is offline (image uploads, search, AI generation). Affected customers experience a hard outage of that specific feature.
- Reduced capacity. The service is processing requests but can't keep up — backlogs growing, scheduled jobs delayed, eventually-consistent state taking longer than expected.
Most products experience all of these at different times. The question isn't which one is "real" degraded; it's which dimension you're flagging at any given moment.
A useful framing: pick whichever dimension the customer would describe in their own words. If your customer would say "the API is failing," that's elevated error rate. If they'd say "everything is slow," that's elevated latency. If they'd say "image uploads aren't working," that's partial unavailability, with the specific component called out. The status page should match the language the customer would use for that incident.
The threshold question
Once you've picked the dimension, you need a threshold. A "degraded" label without a threshold is just the team posting a gut feeling, and gut feelings drift: what feels degraded at 9am isn't what feels degraded at 5pm after a long day.
A few useful starting points by dimension:
Error rate. A reasonable trigger: error rate above 1% for 5+ minutes. Below 1%, you're probably looking at noise (deploys, transient network blips, third-party flakiness). Above 1% for 5 minutes is sustained enough that customer reports start materializing.
Latency. Pick a percentile and a multiplier from baseline. The 95th percentile exceeding 2x baseline for 5+ minutes is a defensible "degraded" trigger for most APIs. The multiplier matters more than the absolute number — a service with 50ms baseline shouldn't trigger degraded at a blanket 500ms; a service with 800ms baseline shouldn't fail to trigger at 1200ms. The right way to talk about latency is in relation to your normal, not in absolute thresholds copied from someone else's product.
Partial unavailability. No threshold needed in the percentage sense — the feature is either up or down. The threshold is "is this customer-facing enough to flag?" An admin-only feature failing silently might not warrant a status page entry. A customer-facing feature failing always does.
Reduced capacity. Hardest to threshold, usually expressed as "backlog depth above N" or "scheduled job completion delayed by more than M minutes." This one tends to be product-specific; pick numbers that map to customer-noticeable delay, not engineer-noticeable delay.
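To make those four dimensions concrete, here's a minimal sketch of the threshold check in Python. Every name and number is illustrative rather than a recommendation; the duration requirement (the "for 5+ minutes" part) is layered on separately in the next section.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DegradedThresholds:
    # Illustrative numbers only -- tune these against your own baselines.
    max_error_rate: float = 0.01      # elevated error rate: 1% of requests failing
    latency_multiplier: float = 2.0   # elevated latency: p95 relative to baseline
    baseline_p95_ms: float = 200.0    # your measured normal p95, not a guess
    max_backlog_depth: int = 10_000   # reduced capacity: customer-noticeable backlog

def breaches_threshold(error_rate: float, p95_ms: float,
                       customer_feature_down: bool, backlog_depth: int,
                       t: DegradedThresholds = DegradedThresholds()) -> bool:
    """True if any single dimension crosses its line. Partial unavailability
    is a plain boolean because the feature is either up or it isn't."""
    return (
        error_rate > t.max_error_rate
        or p95_ms > t.latency_multiplier * t.baseline_p95_ms
        or customer_feature_down
        or backlog_depth > t.max_backlog_depth
    )
```

Note that the latency check multiplies from baseline instead of hardcoding an absolute number, per the point above about defining latency relative to your normal.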
The duration axis
Every threshold needs a duration. "Error rate above 1%" is a noisy signal; "error rate above 1% for 5+ minutes" is an incident. Without the duration component, you'll either flag every transient blip (and train customers to ignore your status page) or never flag the slow degradations (and lose trust the other way).
The trade-off is real:
- Too short (1–2 minutes): you'll catch real problems quickly but flag a lot of noise. The status page becomes flap-prone.
- Too long (15+ minutes): you'll only catch the obvious problems, and you'll catch them after customers have already complained.
A sensible middle ground for most products is 3–5 minutes for error and latency thresholds. Partial-feature outages can be flagged immediately because there's no noise to filter: the feature is up or it isn't.
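The sustained-duration part is the piece teams most often skip, so here's one way to sketch it: a rolling window that only fires when every sample inside it breached. SustainedTrigger is a hypothetical helper, and it assumes your metrics pipeline delivers one sample per minute.

```python
from collections import deque

class SustainedTrigger:
    """Fires only when a condition has held for N consecutive samples."""

    def __init__(self, window_minutes: int = 5):
        # One slot per minute; a single healthy sample breaks the streak.
        self.samples = deque(maxlen=window_minutes)

    def observe(self, breached: bool) -> bool:
        self.samples.append(breached)
        # Degraded only once the window is full AND every sample breached.
        return len(self.samples) == self.samples.maxlen and all(self.samples)

# Simulated minute-by-minute error rates against a 1% threshold. The healthy
# sample (0.008) resets the streak, so only the final observe() returns True.
trigger = SustainedTrigger(window_minutes=5)
for rate in [0.02, 0.03, 0.008, 0.02, 0.02, 0.02, 0.02, 0.02]:
    print(trigger.observe(rate > 0.01))
```

The same helper works for the latency trigger; only the condition passed to observe() changes.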
The two failure modes
Teams pick degraded thresholds wrong in two predictable ways.
Too tight. "Any error rate above 0.1% triggers degraded." Sounds rigorous, but at this threshold you're degraded most of the time — every deploy, every transient third-party hiccup, every minor regression. Customers see "degraded" so often it stops meaning anything. Eventually they ignore the status page entirely because they've learned that "degraded" doesn't predict anything they'll experience.
Too loose. "Customer must complain three times in an hour before we call it degraded." Sounds humble, but you're missing real degradations because they didn't generate complaints fast enough. Engineers know things are wrong from internal tooling, but the public state stays "operational" because the threshold hasn't tripped. Customers experience the outage and notice the status page is lying.
The right place to land is somewhere customers would trust if they audited it — tight enough to flag real problems, loose enough that "operational" actually means operational.
A worked example
Consider a small SaaS API product with a 200ms baseline response time, a normal error rate around 0.1%, single-region deployment, and customer-facing surfaces that include the API itself, the dashboard, and customer-hosted status pages.
A defensible "degraded" definition for that product:
The status page shows "degraded" when any of the following holds for 5+ continuous minutes:
- API error rate above 1%
- 95th-percentile API response time above 500ms (2.5x baseline)
- Dashboard cannot load for new sessions
- Public status pages serve error responses
"Degraded" does not apply to:
- Single-customer issues (those are support tickets, not incidents)
- Deploy-window blips under 5 minutes
- Admin-only tooling outages
That's specific enough to trigger consistently, plain-language enough that customers and engineers will agree on whether it's tripped, and bounded enough to mean something when it does fire.
The shape of your threshold won't match this exactly. The point isn't the specific numbers — it's having numbers at all, written down, that your team agrees on.
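If you'd rather the definition live next to the monitoring than in a doc that drifts, one option is to encode it declaratively. This sketch mirrors the worked example above; the field names are made up, so map them onto whatever your alerting tooling actually consumes.

```python
# A declarative version of the worked example. All names are illustrative.
DEGRADED_DEFINITION = {
    "sustain_minutes": 5,
    "any_of": [
        {"metric": "api_error_rate", "above": 0.01},  # 1%
        {"metric": "api_p95_ms", "above": 500},       # 2.5x the 200ms baseline
        {"check": "dashboard_new_sessions_failing"},
        {"check": "status_pages_serving_errors"},
    ],
    # Documented non-triggers, so nobody relitigates them mid-incident:
    "exclusions": [
        "single-customer issues (support tickets, not incidents)",
        "deploy-window blips under 5 minutes",
        "admin-only tooling outages",
    ],
}
```

Writing the exclusions into the same artifact as the triggers means the on-call engineer reads a decision instead of making one.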
The operational habit
Pick a threshold once, then revisit it every quarter or after a meaningful change to the product. Things that should trigger a re-pick:
- The product gets faster or slower at baseline (your latency multiplier needs updating).
- A new customer-facing feature ships (does it have its own degraded definition?).
- The customer base changes shape (a free-tier-heavy mix tolerates different things than a paid-only mix).
- A real incident exposes a gap (you wanted to flag something but the threshold didn't trigger, or you flagged something that didn't warrant it).
Write the threshold down somewhere the on-call engineer can look it up at 3am. A runbook entry, a Notion page, a comment in your monitoring config — anywhere except just-in-someone's-head.
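For the "comment in your monitoring config" option, even a few annotated constants do the job. Everything here is a placeholder:

```python
# degraded_thresholds.py -- single source of truth for what "degraded" means.
# Agreed by the team; revisit quarterly or after any incident that exposes a gap.

SUSTAIN_MINUTES = 5     # every trigger must hold this long, continuously
MAX_ERROR_RATE = 0.01   # 1% -- below this is deploy noise and third-party flake
MAX_P95_MS = 500        # 2.5x the measured 200ms baseline; update if baseline moves
```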
The goal isn't a perfect definition. It's a definition stable enough that two engineers, looking at the same incident, would agree on whether the status page should say "degraded" or "operational." When you have that, the word starts to mean something.
Why this matters more than it sounds
"Degraded" is a label your customers see during your worst moments. If the label is meaningful — if "degraded" reliably means "you'll experience some problems but the core works" — customers can make decisions. They can decide to retry, wait, or work around. They can tell their team "the vendor is having a bad hour, but it's not a full outage."
If the label is meaningless — if "degraded" sometimes means a five-minute blip and sometimes means a four-hour brownout — customers can't decide anything from your status page. They have to email support to ask what's actually going on, which defeats the entire point of having a status page.
The threshold work is unglamorous. It pays off every time the label gets used for the rest of the product's life.
PageCalm helps small teams run status pages with AI-powered incident updates that sound human and ship fast. Try it free — no credit card required.