PageCalm

What to Do When You're Not Sure If Something's Broken

incident communication, status pages, incident response, decision-making

The easy incidents are the ones where something is obviously, completely broken. The API returns 500s. Login is down. You know what to do — post the update, acknowledge the issue, communicate the fix.

The hard ones are the ambiguous signals. One customer emails to say something is slow. A probe flaps for 90 seconds and recovers. Your dashboard metrics look mostly fine but one endpoint's p99 is creeping up. Is this a thing? Do you post about it?

Most teams default to "wait and see." That's sometimes right, and sometimes exactly wrong. Here's how to tell the difference.

The four signals to weigh

When the picture is fuzzy, you're usually weighing four things against each other.

1. How many sources are reporting it. One customer complaint is a weak signal. Three customer complaints in an hour is a strong one. One monitor flapping is noise. Multiple monitors flagging the same region is real. Cross-source corroboration is the single best filter for "is this actually a thing."

2. Whether it's getting worse, stable, or improving. A metric that spiked and recovered on its own is very different from one that's still climbing. Don't just look at the current value; look at the trajectory. A "slightly degraded" state that has held steady for two hours warrants different action than the same reading reached after twenty minutes of steady decline.

3. How many customers are plausibly affected. A rare endpoint that only three of your power users hit is a different blast radius than something on your login flow. Ambiguous signals on high-traffic paths deserve more attention than clear signals on low-traffic ones.

4. How much reputation cost you eat if you're wrong in either direction. If you post and it turns out to be nothing, you look jumpy. If you don't post and customers were affected, you look oblivious. Small teams typically underestimate the second cost because the first is more visible in the moment.
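
The first two signals are the easiest to make mechanical. Here is a minimal sketch in Python, assuming a metric where higher is worse (p99 latency, say) and a plain list of reporter names. The function names, the 10% tolerance, and the two-source threshold are illustrative assumptions, not recommendations from any particular monitoring tool.

```python
from statistics import mean


def trajectory(samples, tolerance=0.10):
    """Classify a higher-is-worse metric as worsening, improving, or stable.

    Compares the mean of the newer half of the window with the older half.
    `tolerance` is the relative change treated as noise (assumed default).
    """
    if len(samples) < 4:
        raise ValueError("need at least 4 samples to compare two halves")
    mid = len(samples) // 2
    older = mean(samples[:mid])
    newer = mean(samples[mid:])
    if older == 0:
        return "worsening" if newer > 0 else "stable"
    change = (newer - older) / older
    if change > tolerance:
        return "worsening"
    if change < -tolerance:
        return "improving"
    return "stable"


def is_corroborated(reporters, min_sources=2):
    """True when at least `min_sources` distinct reporters flag the issue.

    `reporters` is a list of names (hypothetical examples: "probe-eu-1",
    "customer-42"). Repeats from the same reporter count once.
    """
    return len(set(reporters)) >= min_sources
```

A window of p99 readings like [200, 210, 205, 400, 520, 610] comes back as "worsening" even though no single reading crosses an alert threshold, which is the point of looking at trajectory rather than the current value.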

The default stance: lean toward saying something

When in doubt, post. Specifically, post a low-key status update rather than a high-severity incident. Something like:

"We're looking into a report of slower-than-usual API response times. We haven't been able to reproduce yet, but we're investigating. We'll update when we know more."

That does several things at once:

  • Catches any affected customers before they email support. Support inboxes absorb the cost of a silent investigation.
  • Signals to customers who were about to contact support that you're on it. The reply to "is it just me?" becomes "check our status page — we're aware."
  • Documents the timestamp in case it escalates. If 90 minutes later you realize it's real, your timeline already has a starting point. Retroactively creating a "we knew at 2pm" marker is much worse for trust than just having posted at 2pm.

The downside — looking jumpy — is real but small. Customers forgive teams that err on the side of over-communicating. They do not forgive teams that go silent through problems.

When waiting is actually right

There are situations where "wait and see" is correct:

  • Single-source flapping that's clearly noise. If one probe has flapped forty-seven times this quarter and each time resolved in under two minutes, you don't need to start posting incidents for it. Fix the probe.
  • A customer issue that's almost certainly user-specific. One customer reports an error, their request ID shows a 400 because they sent bad input — that's not an incident. It's a support ticket.
  • Metrics noise that pattern-matches to known-harmless behavior. Your nightly data job spikes CPU for seven minutes at 3am every day. That's not an incident even if it looks like one on the dashboard.

The pattern: "wait and see" is right when you have high confidence the signal isn't real. It's wrong when you're using "wait and see" as a way to avoid making a call.
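
"High confidence" can be written down for the flapping-probe case. The sketch below is one way to do it, assuming you keep a history of how long each past flap lasted. The function name, the ten-flap minimum, and the two-minute ceiling are assumptions chosen to mirror the example above, not settings from a real monitoring product.

```python
from datetime import timedelta


def is_known_noise(past_flaps, current_flap,
                   min_history=10, max_duration=timedelta(minutes=2)):
    """True when a probe's flap matches a long record of short, harmless flaps.

    past_flaps:   durations (timedelta) of earlier flaps from this probe
    current_flap: how long the current flap has lasted so far
    """
    if len(past_flaps) < min_history:
        return False  # too little history to be confident it is noise
    if current_flap >= max_duration:
        return False  # this flap has already outlasted the usual pattern
    return all(d < max_duration for d in past_flaps)
```

If the function returns False, that is not proof of an incident. It only means "wait and see" is no longer backed by evidence, so you are back to making a call.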

The downgrade-friendly approach

One reason teams avoid posting ambiguous incidents is fear of looking wrong later. There's a solution: start small and downgrade publicly.

If you post an "Investigating" update for a suspected slowdown and an hour later you're confident nothing was broken, post a resolution:

"We investigated the earlier reports of slower response times and did not find any actual service degradation. Response times are within normal range and we haven't reproduced the issue. Closing this out."

That's not embarrassing. It's professional. It shows you take reports seriously, investigate them, and report back honestly when the investigation comes up empty. That transparency is worth more than the appearance of never being wrong.

Customers who see a status page with a mix of real incidents, near-misses, and honest "false alarm" closures trust that page far more than one that only ever shows clean, confident statements. A too-tidy history feels curated. A realistic history feels honest.

The decision checklist

When an ambiguous signal lands, run through this quickly:

  1. Is this one source or multiple? Multiple → lean toward posting.
  2. What's the trajectory? Getting worse → lean toward posting.
  3. How many customers are plausibly affected? High-traffic path → lean toward posting.
  4. What's the cost of being wrong in each direction? If the "oh no we missed it" cost outweighs the "jumpy team" cost, post.
  5. If I don't post, will I regret it in 30 minutes? If yes, post now.
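
If it helps to see the checklist as something you could drop into a runbook, here is a rough Python sketch. The Signal fields and function names are hypothetical, and treating any single "yes" as enough to post is this sketch's assumption, based on the "when in doubt, post" default. Your team may prefer to require two.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    multiple_sources: bool           # 1. more than one source reporting?
    getting_worse: bool              # 2. trajectory still heading the wrong way?
    high_traffic_path: bool          # 3. many customers plausibly affected?
    missing_it_costs_more: bool      # 4. "we missed it" worse than "jumpy team"?
    would_regret_in_30_min: bool     # 5. will silence feel wrong in half an hour?


def reasons_to_post(signal):
    """Return the checklist items that lean toward posting an update."""
    checks = [
        (signal.multiple_sources, "multiple sources"),
        (signal.getting_worse, "getting worse"),
        (signal.high_traffic_path, "high-traffic path"),
        (signal.missing_it_costs_more, "missing it costs more"),
        (signal.would_regret_in_30_min, "would regret staying silent"),
    ]
    return [label for flag, label in checks if flag]


def should_post(signal):
    """Assumption: one reason is enough, matching a lean-toward-posting default."""
    return len(reasons_to_post(signal)) > 0
```

Returning the reasons rather than a bare yes or no gives you the first draft of the update itself: the reasons are what you know so far.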

The goal isn't to be right about every ambiguous signal. It's to make sure your customers are never the last people to know something might be wrong.

The habit worth building

The teams that handle ambiguous signals well have one thing in common: they've separated the decision "should we post an update" from the decision "is this a real problem." Those are different questions. You can post an update about an investigation without claiming a problem exists. You can close an incident that turned out to be nothing without damaging credibility.

Once those two decisions are independent in your head, the fear of posting prematurely goes away. The status page becomes a place where you communicate what you know and what you're looking into — not a courtroom where you have to be certain before you speak.


PageCalm helps small teams run status pages with AI-powered incident updates that sound human and ship fast. Try it free — no credit card required.
