Designing Anomaly Detection That Engineers Actually Trust
The hardest part of anomaly detection isn't statistics. It's getting an exhausted on-call to believe you when you say something is wrong.
We shipped an anomaly detector last year that flagged a real incident, and nobody acted on it for 22 minutes. The model was right. The page was clear. But the on-call had been burned by false positives so many times that they muted the channel before even reading it.
That was a product failure, not a stats failure.
What we changed
The detector got smarter, sure. The bigger change was that every alert now carries three things, every time (a rough sketch of the payload follows the list):
- The metric and its baseline, plotted on the same axis. No alert without a graph.
- A confidence score in plain language: “high / medium / monitoring.” Not p-values. Not z-scores.
- A recommended next action. “Check pipeline latency in region us-east-1.” Even if it’s wrong, it gives the engineer a starting point.
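For concreteness, here is a minimal sketch of what such an alert payload could look like. The field names, the thresholds, and the `label_for_score` helper are illustrative assumptions, not our production schema.

```python
from dataclasses import dataclass

@dataclass
class AnomalyAlert:
    """One alert as the on-call sees it. Field names are illustrative."""
    metric: str               # e.g. "pipeline_latency_p99"
    chart_url: str            # the metric and its baseline plotted on the same axis
    confidence_label: str     # "high" / "medium" / "monitoring", never a raw score
    recommended_action: str   # a concrete starting point, even if it turns out to be wrong

def label_for_score(score: float) -> str:
    """Map an internal model score onto the plain-language buckets.
    These thresholds are placeholders, not the ones we actually ship."""
    if score >= 0.9:
        return "high"
    if score >= 0.6:
        return "medium"
    return "monitoring"
```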
What didn’t work
Auto-suppression of low-confidence alerts. We tried it. Engineers stopped trusting the alerts they did see, because they knew others had been silently filtered out. The right answer was to keep them all visible but rank them.
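A minimal sketch of that ranking, reusing the hypothetical `AnomalyAlert` above: every alert stays in the queue, the confident ones simply sort to the top.

```python
def rank_alerts(scored: list[tuple[float, AnomalyAlert]]) -> list[AnomalyAlert]:
    """Order alerts for display instead of suppressing any of them.
    Low-confidence alerts still appear; they just land at the bottom of the page."""
    return [alert for _, alert in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```

Ranking rather than filtering means the worst case is a noisy bottom of the page, not an alert that silently never arrived.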
Alerts are a UX surface, not a notification firehose. The hardest review we do on every detection feature is “what does this look like at 3am?”