
When to Open an Incident (and When to Wait)

incident communication · status pages · decision-making · best practices

Every incident on your status page starts with a decision you probably made in under thirty seconds: does this go public? That decision is the highest-leverage moment in the entire incident lifecycle, and most small teams don't have a framework for it. They go with gut feel, which is another way of saying they go with whatever their last incident taught them — and the lesson from that one was as often wrong as right.

Here's a framework for the decision, separate from the investigation that follows it.

The two errors

The decision to open a public incident can fail in two directions. You can post and it turns out to be nothing, or you can stay quiet and it turns out to be real.

Small teams tend to worry about the first error more than the second. The reasoning is understandable: posting a "we're investigating" update for a hiccup that resolves on its own in three minutes feels jumpy. It feels like over-communicating. It feels like the status page equivalent of crying wolf.

But the second error is far more expensive. When customers hit a problem and check your status page to find everything green, they don't conclude your product is stable. They conclude either you don't know what's happening or you're not interested in being honest about it. Both conclusions erode trust, and both are hard to recover from. A customer who saw you jump the gun once will barely remember it. A customer who concluded you were asleep through a real outage will remember it for a long time.

The calibration goal isn't "never post unless you're sure." It's to weight the decision by the asymmetry in costs: a missed real outage (the false negative) costs far more than a premature post (the false positive). For most small teams, that means erring toward posting.

The threshold

Not every signal belongs on the status page. The test is straightforward: is a customer, right now, seeing something that looks broken in your product?

That's the whole question. Not "is our monitoring alerting." Not "did something unusual happen in the infrastructure." Not "should customers be seeing something broken." The test is about what's actually happening at the user-facing surface of your product, right now.

Some signals clear the threshold:

  • Three customers emailed support in the last hour about slow dashboard loads. None of your monitors are alerting. The product is slow for at least some customers, and they've noticed. Open an incident to acknowledge it while you investigate.
  • Your payment processor's status page went yellow. Checkout is failing for some customers. The fault isn't yours, but the visible symptom is. Open an incident.
  • Login errors spiked for 60 seconds and stopped on their own. By the time you open your laptop, it's already recovered. But customers who hit it during that minute saw a broken product. Open a brief incident acknowledging it and noting it's already resolved — even if the post is retrospective.

Some signals don't:

  • A database failover completed in four seconds. No customer-visible surface degraded. No customer noticed. This is an internal note, not a public incident.
  • One customer emailed to say they can't upload a file, but the request ID shows they sent a file that's too large. That's a support ticket about your error messaging, not an incident.
  • Your nightly data job spiked CPU for seven minutes at 3am. It does this every night. No customer was using the product at the time. Not an incident.

The pattern: customer-visible is the gate. Infrastructure noise stays internal. Monitoring alerts that don't correspond to a customer-facing symptom stay internal. The question isn't "did something happen in our stack." It's "did a customer see something wrong."
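The gate is simple enough to write down as a predicate. Here's a minimal sketch in Python — the `Signal` record and its fields are hypothetical names for illustration, not anything from a real tool:

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """A triaged operational signal. Field names are hypothetical."""
    source: str             # e.g. "support-email", "monitor", "vendor-status"
    customer_visible: bool  # is a customer, right now, seeing something broken?


def route(signal: Signal) -> str:
    """Apply the gate: public status page incident, or internal note."""
    return "public-incident" if signal.customer_visible else "internal-note"


# Three support emails about slow dashboards: customers noticed.
assert route(Signal("support-email", customer_visible=True)) == "public-incident"
# A four-second database failover nobody saw: internal note only.
assert route(Signal("monitor", customer_visible=False)) == "internal-note"
```

Note what's deliberately missing: there's no branch for "is the fault ours" or "is a monitor alerting." The vendor-outage case and the silent-failover case both fall out of the single flag.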

The escalation decision

Incidents evolve. Something that doesn't clear the threshold now might clear it in twenty minutes. The tricky case is the one where you're watching a signal that hasn't reached customer-visible yet but plausibly could.

The right move is to start with internal monitoring and set a decision timer. Something like: "we're watching the API error rate — it's up but not customer-visible yet. If it hasn't resolved in ten minutes, we post." The timer does two things. It prevents the infinite-deferral trap where you keep saying "let's wait five more minutes" for an hour. And it separates the decision to escalate from the decision to keep watching — which is the same psychological move as separating "should we post" from "is this a real problem."

When the timer fires, you make the call. If the signal is still ambiguous at that point, post anyway. An update that says "we're investigating a possible issue affecting the API — some customers may be experiencing errors" is honest at both levels. It acknowledges the possibility without claiming certainty. It's the right update for the situation.
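The timer logic can be made mechanical with a small decision function your runbook (or a script) evaluates. A sketch under the same assumptions as above — the names are hypothetical, and the ten-minute deadline is just the example from the text:

```python
def escalation_call(elapsed_min: float, resolved: bool,
                    customer_visible: bool, deadline_min: float = 10.0) -> str:
    """Decide what to do with a signal you're watching.

    Returns one of: "post", "keep-watching", "stand-down".
    """
    if customer_visible:
        return "post"          # threshold cleared: go public immediately
    if resolved:
        return "stand-down"    # signal recovered before the deadline
    if elapsed_min >= deadline_min:
        return "post"          # timer fired while still ambiguous: post anyway
    return "keep-watching"     # still inside the window


# API error rate up but not customer-visible, 4 minutes in: keep watching.
assert escalation_call(4, resolved=False, customer_visible=False) == "keep-watching"
# Same signal, still ambiguous when the timer fires: post the hedged update.
assert escalation_call(10, resolved=False, customer_visible=False) == "post"
```

The point of encoding it isn't automation — it's that the function has no branch for "wait five more minutes," which is exactly the branch the infinite-deferral trap needs.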

Over-posting and under-posting: the real reputations

Teams who worry about over-posting are usually imagining a specific reader: the customer who checks the status page three times a day, sees three different "investigating" updates that all resolve to nothing, and concludes the team is alarmist. That reader exists, but there are fewer of them than you think. And their conclusion is less damaging than you think.

The reputation a team earns from over-posting is "they're responsive, maybe a little quick on the trigger." Most customers read that as diligence.

The reputation a team earns from under-posting is "they're either unaware or dishonest." Customers read that as unreliability, which is worse than jumpiness by a wide margin.

There's a secondary effect worth naming: the support inbox. When you open an incident promptly, the customers who would have emailed "is it just me?" check the status page instead and see that you know. When you hesitate, those same customers email support, and now your team is debugging the incident and responding to support tickets simultaneously — a harder version of the same problem. The status page is a support-load reduction tool as much as it's a communication tool. Using it early reduces the total work.

The internal-notes distinction

If something clears the "customer-visible" threshold, it belongs on the status page. If it doesn't clear that threshold but you still need to track it — a database failover, a flapping probe, a vendor issue that might escalate — it belongs in internal notes. That's a separate mechanism from the public status page, and the distinction matters.

Internal notes are for your team. They're where you track things that matter operationally but aren't customer-facing. They don't need the same formatting, the same voice, or the same update cadence. They're notes you write for yourselves.

Keeping this distinction clean prevents two problems. The first is the over-posting problem — status page entries for things customers never noticed and didn't need to know about. The second is the opposite: teams who hesitate to use internal notes because they conflate them with public incidents end up with nothing tracked anywhere — no internal record, no public acknowledgment. When the signal escalates later, there's no note to pick up from, and the team is starting cold.

The pre-decision

Like most incident-communication decisions, this one is better made before the incident than during it. Your team should have a one-sentence rule that covers the threshold question. Something like:

"If a customer, right now, sees something broken in our product, we open a public incident. If a customer doesn't see it, we track it internally."

That's it. The rule doesn't remove judgment — there will always be edge cases where you have to interpret "sees something broken." But it gives the team a starting point that isn't "let's discuss whether this feels important enough," which is a terrible conversation to have while an alert is firing.

The teams that handle this well all do the same thing: they've accepted that some percentage of their incident posts will be for issues that turn out to be nothing, and they've decided that's a price worth paying for never having customers be the last to know. Once you make that tradeoff consciously, the decision gets faster. You see the signal, you check the threshold, you post. The hesitation that produces the worst outcomes — the thirty minutes of "should we say something?" — disappears.


PageCalm helps small teams run status pages with AI-powered incident updates that sound human and ship fast. Try it free — no credit card required.

