When Monitoring Says Green but Customers Say Broken
It's 2pm. Three customer emails in the last fifteen minutes, all saying some version of "is something broken on your end?" You open your monitoring dashboard. Every probe is green. Latency is normal. Error rate is flat. Nothing is firing.
This is the scenario most operators handle worst — not because it's hard to investigate, but because the conflicting signals create a brief moment of paralysis. Your tooling is telling you everything is fine. Your customers are telling you everything isn't. Which one do you act on?
The answer is almost always: act on the customers. But the reasoning matters, because the same instinct that gets this one right gets a related one wrong.
Why the gap is structural, not a bug
Automated monitoring measures what you told it to measure. It probes specific endpoints, from specific places, on specific intervals, with specific expected responses. Whatever isn't on that list isn't being checked.
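To see how explicit that list really is, here's a minimal sketch of what synthetic probe definitions tend to look like — illustrative Python, not any particular vendor's config format; every name and URL is a placeholder:

```python
# A minimal sketch of synthetic probe definitions (illustrative, not any
# real monitoring vendor's format). Every field is something you chose;
# anything not enumerated here is never checked.
PROBES = [
    {
        "name": "api-health",
        "url": "https://api.example.com/health",  # one endpoint, not "the API"
        "regions": ["us-east", "eu-west"],         # two vantage points, not "everywhere"
        "interval_seconds": 60,
        "expect_status": 200,
        "expect_body_contains": '"status":"ok"',
    },
    {
        "name": "homepage",
        "url": "https://example.com/",
        "regions": ["us-east"],
        "interval_seconds": 300,
        "expect_status": 200,
    },
]
# Not on this list: signup email delivery, Pro-tier features, async jobs,
# mobile clients, most of the world's networks. That's the gap.
```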
Customers do everything. They sign up from networks you don't have a probe in. They use features your synthetic monitor doesn't have a paid account for. They follow workflows that span auth, third-party integrations, async jobs, and email delivery — paths that often have no end-to-end check at all. They use browsers and devices your monitoring doesn't simulate.
The gap between "what monitoring covers" and "what customers actually experience" isn't a flaw in your tooling. It's the structural consequence of monitoring being explicit and customer behavior being open-ended. You can shrink the gap, but you can't close it.
That means a "monitoring green, customers broken" report isn't telling you your monitoring is wrong. It's telling you the broken thing is in the gap. That's information, not a contradiction.
The signal hierarchy in your head
When monitoring and customers disagree, the implicit ranking that produces the right action is:
- Multiple customers reporting the same issue, monitoring also red — clearest signal. Act immediately, post within minutes.
- Multiple customers reporting, monitoring green — almost as strong. The structural gap explains the green. Investigate as if it's real.
- One customer reporting, monitoring green — weak but worth investigating, especially if the report is specific. (See What to Do When You're Not Sure If Something's Broken for that decision.)
- No customers reporting, monitoring red — weakest. Could be a probe issue, a network blip, or the start of something real. Look but don't post yet.
The ranking matters because the second case — multiple customers, monitoring silent — is where teams most often delay. The instinct to "let me check if monitoring is really fine" eats fifteen minutes that should have gone to investigating the actual report.
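If it helps to see the ranking as logic rather than instinct, here's a throwaway sketch of the decision table. The function and its wording are ours, not any framework's API:

```python
def triage(customer_reports: int, monitoring_red: bool) -> str:
    """Map the two signals to a first action. Illustrative only; the
    thresholds and phrasing are judgment calls, not a standard."""
    if customer_reports >= 2 and monitoring_red:
        return "act now: post within minutes"
    if customer_reports >= 2:
        # The structural gap explains the green dashboard.
        return "treat as real: investigate the reported area, post soon"
    if customer_reports == 1:
        return "investigate if the report is specific; hold the post"
    if monitoring_red:
        return "look, but don't post yet: could be a probe issue"
    return "steady state"
```

The value of writing it down is the second branch: multiple reports override a green dashboard with no further checking in between.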
The first five minutes
Don't open the monitoring dashboard. You already know what it says. Open the customer emails.
1. Read the reports closely. What exact action? What feature? What time did it start? Are they all using the same plan tier, the same browser, the same workflow? The reports themselves are probes — collectively they describe a slice of the gap that monitoring doesn't cover.
2. Reproduce from the customer's seat. Don't reproduce as your admin/internal account. Use a test account on the same plan tier. Hit the same endpoint with the same payload. If the reports are about a Pro-only feature and your default test account is Free, you've already filtered the bug out of your test.
3. Time-correlate against changes. What deployed in the last hour? Which cron jobs ran? Did any third-party vendor post an incident near that timestamp? "What changed around when the reports started?" is the first question worth answering; more than half of these incidents trace back to a recent deploy.
4. Trust the report's specificity over the dashboard's silence. If three reports in twelve minutes all mention the same feature, that's not a coincidence, whatever the dashboard says. The feature is broken in a way your probes don't see.
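A reproduction sketch for step 2 above, assuming a generic REST API and a Pro-tier test account. The endpoint, token, and payload are stand-ins for whatever the reports actually describe:

```python
import requests

# Hypothetical values: substitute the exact endpoint, tier, and payload
# from the customer reports. The point is to match the customer's context,
# not your internal admin account's.
PRO_TEST_TOKEN = "token-for-pro-tier-test-account"   # same plan tier as the reporters
url = "https://api.example.com/v1/exports"            # the feature they named
payload = {"format": "csv", "range": "last_30_days"}  # the payload they described

resp = requests.post(
    url,
    json=payload,
    headers={"Authorization": f"Bearer {PRO_TEST_TOKEN}"},
    timeout=10,
)
print(resp.status_code, resp.text[:500])
# If this fails where your admin account succeeds, you've found the slice
# of the gap the reports are pointing at.
```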
Where monitoring usually doesn't reach
Some places where the gap reliably opens up — worth checking first when reports come in and dashboards are green:
- Auth flows. Signup confirmation emails, password reset, magic links, SSO. These ride through services (email providers, identity providers) that your synthetic monitoring rarely covers end-to-end. A misconfigured SMTP password silently breaks signup confirmations while every "is the API up" probe stays green.
- Tier-gated paths. Pro-only features, admin tools, billing flows. Your monitoring probably has one Free-tier test account; the broken thing is in code only Pro-tier accounts hit. (A sketch of a tier-aware probe follows this list.)
- Recently shipped paths. Anything deployed in the last two weeks where you haven't yet added a probe. New features outpace monitoring coverage; that's normal.
- Async / scheduled jobs. Crons, queue consumers, webhook deliveries. The work runs minutes or hours after the request that triggered it; failures don't show up in request-path monitoring.
- Third-party dependencies. Your service might be perfectly healthy, but the payment processor, search index, or AI provider you depend on isn't. The customer can't tell the difference; for them, your product is broken.
- Geographic / network paths. Your probes hit from a few regions. Customers hit from everywhere — corporate networks with strange firewall rules, mobile carriers with weird routing, regions where your CDN is having localized trouble.
- Mobile-specific code paths. If your synthetic monitoring is desktop-only, a bug specific to mobile WebKit can affect a third of your traffic with zero monitoring noise.
A useful question when reports come in: which of these gaps is the report closest to? If the answer is obvious, start there.
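On the tier-gated point above, the fix is often as small as a second test account. Here's a minimal sketch of a probe that logs in as a hypothetical Pro-tier account and exercises one Pro-only endpoint; every URL and credential is a placeholder:

```python
import requests

# Placeholder credentials and endpoints. The point is that this probe
# authenticates as a Pro-tier account, so it exercises code paths your
# Free-tier health checks never touch.
session = requests.Session()
login = session.post(
    "https://api.example.com/v1/login",  # hypothetical login endpoint
    json={"email": "probe+pro@example.com", "password": "PROBE_PASSWORD"},
    timeout=10,
)
login.raise_for_status()
session.headers["Authorization"] = f"Bearer {login.json()['token']}"

# A hypothetical Pro-only path: substitute one of yours.
resp = session.get("https://api.example.com/v1/reports/advanced", timeout=10)
assert resp.status_code == 200, f"pro-tier path failing: {resp.status_code}"
```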
What to post during the investigation
Even before you've identified the cause, post on the status page. Specifically, post about the area customers are reporting:
"We're investigating reports of issues with [feature/area]. We haven't identified the cause yet, and our automated monitoring isn't currently flagging it, but we're treating the reports as real. Next update by [time]."
The "automated monitoring isn't flagging it" line isn't required, but if your audience is technical it's a useful signal — it tells customers you're not just trusting your dashboards. It also subtly inoculates against the "but isn't your status page green?" reply that you'll otherwise field three times.
The bigger point: post even when monitoring is silent. Customer-reported incidents that go unannounced because dashboards looked clean produce the worst kind of trust damage. Customers tell each other "I emailed support and they had no idea anything was wrong" — and even after you fix it, that story sticks longer than the actual outage did.
The recovery decision
The harder version of this scenario is the end of the incident. Monitoring was green throughout. So when do you mark it resolved?
The wrong answer: "monitoring went back to baseline." It never left baseline. You have nothing to compare against.
The right answer: verify with the customers who reported it. If you have a support inbox and the affected customers are reachable, the cleanest thing is to reply: "Can you try this again and let me know if it's working for you now?" When two or three confirmations come back, post the resolution.
If reaching out feels heavy, watch the report rate. New reports stopping is a weak signal: it could mean you fixed it, or it could mean affected customers gave up and went to lunch. The longer the quiet stretch, the more it means. Twenty minutes without a new email is enough confidence to post the resolution; five minutes isn't.
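The quiet-window heuristic is easy to encode, if only to stop yourself from posting too early. A sketch; the 20-minute default is this article's rule of thumb, not a law:

```python
from datetime import datetime, timedelta

def quiet_long_enough(report_times: list[datetime],
                      now: datetime,
                      window: timedelta = timedelta(minutes=20)) -> bool:
    """True if no new customer reports have arrived within `window`.
    Weak evidence on its own; direct customer confirmation is stronger."""
    if not report_times:
        return False  # nothing to compare against; go ask a customer
    return now - max(report_times) >= window
```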
Either way, your resolution update should reflect that the incident was real even though monitoring missed it. Don't write "automated monitoring detected normal levels and the incident has been resolved." That phrasing is correct in form and exactly wrong in tone — it suggests the dashboards are still your source of truth. They aren't, for this incident.
Don't write the "monitoring gap" story on the status page
When you eventually write the resolution and the postmortem, separate the two narratives:
- The status page version is the customer impact story. What was broken, who couldn't do what, for how long, what's fixed. In customer-facing language.
- The postmortem version can include the monitoring gap. Why didn't your tooling catch this? What probe is missing? What's the fix to detection? That's all useful context for engineering, but it's not a customer concern, and putting it on the status page reads as deflection ("our tools didn't see it" sounds like an excuse).
The customer doesn't care that your monitoring didn't flag it. They care that something they were trying to do didn't work. The status page acknowledges that. The postmortem (see Postmortems for Small Teams) is where you analyze the gap.
The feedback loop
Every "monitoring green, customers broken" incident is a probe waiting to be written. The investigation already produced everything you need:
- The exact failing path is now known.
- The reproduction steps are written down (in your investigation notes).
- The time-to-detection from customer signal is now measurable.
A useful habit: in the action items of every postmortem from one of these incidents, include one specific monitoring addition. Not "improve monitoring." A named probe: "Add a synthetic check for the signup confirmation email arriving within 60 seconds." Each incident closes one specific blind spot.
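As an example of how concrete that action item can get, here's a sketch of the signup-confirmation probe. It assumes a dedicated inbox the probe can poll over IMAP; the hosts, credentials, signup endpoint, and subject line are all placeholders:

```python
import imaplib
import time
import requests

# Sketch of a synthetic check: trigger a signup, then poll a dedicated
# probe inbox for the confirmation email. Hosts, credentials, and the
# signup endpoint are placeholders for your real ones.
PROBE_ADDR = "probe+signup@example.com"

def confirmation_arrives_within(seconds: int = 60) -> bool:
    # Trigger a signup for the probe address.
    requests.post("https://example.com/api/signup",
                  json={"email": PROBE_ADDR}, timeout=10)
    deadline = time.time() + seconds
    while time.time() < deadline:
        # Poll the probe inbox for an unseen confirmation email.
        imap = imaplib.IMAP4_SSL("imap.example.com")
        imap.login(PROBE_ADDR, "PROBE_PASSWORD")
        imap.select("INBOX")
        _, data = imap.search(None, '(UNSEEN SUBJECT "Confirm your account")')
        imap.logout()
        if data[0].split():
            return True
        time.sleep(5)
    return False

assert confirmation_arrives_within(60), "signup confirmation email missing"
```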
Done a few times, this turns "monitoring catches some things, customers report the rest" from a steady state into something that compounds. You don't need to monitor everything. You just need each incident to teach you the next thing to monitor.
The instinct that matters
When customer reports and monitoring disagree, a tool you set up months ago is overruling a person who is telling you, in real time, that something they're trying to do doesn't work. Trust the person.
That's not "ignore your monitoring." Monitoring is excellent at catching things customers won't notice or won't report — most outages it catches first, and it should. But when monitoring is green and customers aren't, the silence isn't evidence of nothing being wrong. It's evidence of nothing being wrong in the part of the system you decided to watch. The rest of the system is still there, and customers are doing whatever they want in it.
Your job, in the moment, is to investigate the area they're pointing at — not to first reassure yourself that the dashboards really are clean. The dashboards are clean. That's not the question.
PageCalm helps small teams run status pages with AI-powered incident updates that sound human and ship fast. Try it free — no credit card required.