

My Philosophy on Alerting

Based on my observations while I was a Site Reliability Engineer at Google.

Author: Rob Ewaschuk <rob@infinitepigeons.org>
Source: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit#

Summary

When you are auditing or writing alerting rules, consider the points below to keep your oncall rotation happier: page on symptoms your users can actually see, keep cause-based rules to a minimum, make sure every page is actionable, and track your pages so noisy rules get tuned or removed.

Introduction

After seven years of being oncall for a variety of different services, including both massive- and small-scale, fast-moving products, and several parts of core infrastructure, I have developed a philosophy on monitoring and alerting that reflects my fundamental view on pages and pagers.

Overall, it's a bit aspirational, but it guided me whenever I wrote or reviewed a new paging rule in the monitoring systems, prompting questions like: is the problem urgent, is it visible to users, and can the person being paged actually do something about it?

The ideas below are certainly aspirational—no pager rotation of a growing, changing service is ever as clean as it could be—but there are some tricks that get you much closer.

Vernacular

This document uses the following terms: a "page" is an alert urgent enough to interrupt a human immediately, even to wake them up; "sub-critical alerts" (tickets, email, and the like) need attention soon, but not right now.

Monitor for your users

I call this "symptom-based monitoring," in contrast to "cause-based monitoring". Do your users care if your MySQL servers are down? No, they care if their queries are failing. (Perhaps you're cringing already, in love with your Nagios rules for MySQL servers? Your users don't even know your MySQL servers exist!) Do your users care if a support (i.e. non-serving-path) binary is in a restart-loop? No, they care if their features are failing. Do they care if your data push is failing? No, they care about whether their results are fresh.

Users, in general, care about a small number of things: whether their queries succeed and return correct results, how quickly those results come back, and whether their data is fresh and available.

That's pretty much it. There's a subtle but important difference between database servers being unavailable and user data being unavailable. The former is a proximate cause, the latter is a symptom. You can't always cleanly distinguish these things, particularly when you don't have a way to mimic the client's perspective (e.g. a blackbox probe or monitoring their perspective directly). But when you can, you should.
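
To make the distinction concrete, here is a minimal sketch of a symptom-based blackbox probe in Python. The URL, the "results" marker, and the latency threshold are hypothetical stand-ins; the point is that the check asks the question a user would ask rather than asking whether mysqld is running.

  # Minimal sketch of a symptom-based blackbox probe. The endpoint and
  # thresholds are made up; substitute a real user-facing query.
  import time
  import urllib.request

  PROBE_URL = "https://example.com/search?q=canary"  # stands in for a real user query
  LATENCY_SLO_SECONDS = 2.0

  def probe_once():
      """Return (ok, latency_seconds) as seen from the user's side."""
      start = time.monotonic()
      try:
          with urllib.request.urlopen(PROBE_URL, timeout=10) as resp:
              body = resp.read()
              # Correctness, not just liveness: did we get a plausible answer?
              ok = resp.status == 200 and b"results" in body
      except Exception:
          ok = False
      return ok, time.monotonic() - start

  if __name__ == "__main__":
      ok, latency = probe_once()
      # Note what is *not* checked: whether mysqld is running, whether a
      # support binary is restart-looping. If the query succeeds quickly,
      # users are happy regardless.
      if not ok:
          print("PAGE: user-visible query failure")
      elif latency > LATENCY_SLO_SECONDS:
          print(f"PAGE: query slow ({latency:.2f}s > {LATENCY_SLO_SECONDS}s)")
      else:
          print(f"ok ({latency:.2f}s)")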

Cause-based alerts are bad (but sometimes necessary)

"But," you might say, "I know database servers that are unreachable results in user data unavailability." That's fine. Alert on the data unavailability. Alert on the symptom: the 500, the Oops!, the whitebox metric that indicates that not all servers were reached from the database's client. Why?

But sometimes cause-based rules are necessary. There are (often) no symptoms to "almost" running out of quota or memory or disk I/O, etc., so you want rules that tell you when you're walking toward a cliff. Use these sparingly; don't write cause-based paging rules for symptoms you can catch otherwise.
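
As a sketch of the contrast (the metric values and thresholds are invented for illustration): a symptom rule fires when users are seeing errors right now, while a "cliff" rule fires when headroom is disappearing even though nothing is user-visible yet.

  # Sketch contrasting a symptom rule with a "walking toward a cliff" cause
  # rule. The metric readings and thresholds are hypothetical.

  def error_ratio_symptom(responses_5xx: int, responses_total: int) -> bool:
      """Symptom: users are seeing errors right now."""
      if responses_total == 0:
          return False
      return responses_5xx / responses_total > 0.01  # more than 1% failing

  def disk_cliff_cause(used_bytes: int, total_bytes: int,
                       growth_bytes_per_hour: float) -> bool:
      """Cause: nothing user-visible yet, but we'll hit the wall soon."""
      free = total_bytes - used_bytes
      if growth_bytes_per_hour <= 0:
          return False
      hours_left = free / growth_bytes_per_hour
      return hours_left < 12  # less headroom than one oncall shift

  if __name__ == "__main__":
      print(error_ratio_symptom(responses_5xx=214, responses_total=10_000))   # True -> page
      print(disk_cliff_cause(used_bytes=900 * 2**30, total_bytes=1024 * 2**30,
                             growth_bytes_per_hour=20 * 2**30))               # True -> page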

Alerting from the spout (or beyond!)

The best alerts in a layered client/server system come from the client's perspective.

For many services, this means alerting on what your front-most load balancers see in terms of latency, errors, etc. This way you only see the results of broken servers if those results are making it to users. Conversely, you're seeing a bigger class of problems than you can see from your servers: if they're all down, or serving out uncounted 500s, or dropping 10% of connections on the floor, your load balancer knows but your servers might not.

Note that going too far can introduce agents that are beyond your control and responsibility. If you can reliably capture a view of exactly what your users see (e.g. via browser-side instrumentation), that's great! But remember that signal is full of noise—their ISP, browser, client-side load and performance—so it probably shouldn't be the only way you see the world. It may also be lossy, if your external monitoring can't always contact you. Taken to this kind of extreme, it's still a useful signal but maybe not one you want to page on.
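
A minimal sketch of alerting from the spout, assuming you can read (status, latency) pairs out of your front-end load balancer's logs; the log entries and thresholds here are invented:

  # Compute error rate and tail latency from the load balancer's view, which
  # includes requests that never reached a healthy backend. Entries are
  # (status_code, latency_ms) pairs; the data and thresholds are made up.
  lb_log = [(200, 45), (200, 61), (502, 30_000), (200, 52), (500, 10), (200, 48)]

  statuses = [status for status, _ in lb_log]
  latencies_ms = sorted(ms for _, ms in lb_log)

  error_rate = sum(status >= 500 for status in statuses) / len(statuses)
  p99_ms = latencies_ms[min(len(latencies_ms) - 1, int(0.99 * len(latencies_ms)))]

  if error_rate > 0.01:
      print(f"PAGE: {error_rate:.1%} of user requests failing at the load balancer")
  if p99_ms > 1_000:
      print(f"PAGE: p99 latency {p99_ms}ms as seen by users")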

Causes are still useful

Cause-based rules can still be useful. In particular, they can help you jump quickly to a known deficiency in your production system.

If you gain a lot of value in automatically tying symptoms back to causes, perhaps because there are causes that are outside of your control to eliminate, I advocate this technique:

  1. When you write (or discover) a rule that represents a cause, check that the symptom is also caught. If not, make it so.
  2. Print a terse summary of all of your cause-based rules that are firing in every page that you send out. A quick skim by a human can identify whether the symptom they just got paged for has an already-identified cause. This might look like the following (a short code sketch of assembling such a page appears below):

TooMany500StatusCodes
Served 10.7% 5xx results in the last 3 minutes!
  Also firing:
    JanitorProcessNotKeepingUp
    UserDatabaseShardDown
    FreshnessIndexBehind

In this case it's clear that the most likely source of 500s is a database problem; if instead the firing symptom had been that a disk was getting full, or that result pages were coming back empty or stale, the other two causes might have been interesting.

  3. Remove or tune cause-based rules that are noisy or persistent or otherwise low-value.

Using this approach, the mental burden of mistuned, noisy rules drops from a pager beep and ack (and investigation, and followup, and so on) to a single line of text to be skimmed over. Finally, since you need clear debugging dashboards anyway (for problems that don't start with an alert), they are another good place to expose cause-based rules.
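
Here is a minimal sketch of step 2 that reproduces the example above; the rule names are taken from that example, and in a real system the list of firing causes would come from your monitoring system rather than being hard-coded.

  # Sketch: when a symptom rule pages, append a terse, skimmable list of the
  # cause-based rules that are currently firing.
  SYMPTOM_MESSAGES = {
      "TooMany500StatusCodes": "Served 10.7% 5xx results in the last 3 minutes!",
  }

  def compose_page(symptom: str, firing_causes: list[str]) -> str:
      """Build the page body: symptom first, then the firing causes."""
      lines = [symptom, SYMPTOM_MESSAGES[symptom], "  Also firing:"]
      lines += [f"    {cause}" for cause in firing_causes]
      return "\n".join(lines)

  if __name__ == "__main__":
      # Hard-coded here; a real implementation would query the monitoring system.
      print(compose_page("TooMany500StatusCodes",
                         ["JanitorProcessNotKeepingUp",
                          "UserDatabaseShardDown",
                          "FreshnessIndexBehind"]))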

That said, if your debugging dashboards let you move quickly enough from symptom to cause to amelioration, you don't need to spend time on cause-based rules anyway.

Tickets, Reports and Email

One way or another, you have some alerts that need attention soon, but not right now. I call these "sub-critical alerts".

The underlying point is to create a system that still has accountability for responsiveness, but doesn't have the high cost of waking someone up, interrupting their dinner, or preventing snuggling with a significant other.
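
A sketch of what that routing might look like (the severities, alert names, and three-day deadline are arbitrary choices for illustration):

  # Route urgent alerts to the pager and everything else to a ticket queue
  # with a deadline, so sub-critical alerts still get a response without
  # interrupting anyone's evening. Details here are hypothetical.
  from datetime import datetime, timedelta

  def route_alert(name: str, severity: str) -> str:
      if severity == "page":
          return f"PAGE oncall now: {name}"
      # Sub-critical: needs attention soon, not right now. A due date keeps
      # it from being silently ignored.
      due = datetime.now() + timedelta(days=3)
      return f"TICKET for {name}, due {due:%Y-%m-%d}"

  if __name__ == "__main__":
      print(route_alert("TooMany500StatusCodes", "page"))
      print(route_alert("DiskUsageGrowingQuickly", "ticket"))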

Playbooks

Playbooks (or runbooks) are an important part of an alerting system; it's best to have an entry for each alert or family of alerts that catches a symptom, explaining further what the alert means and how it might be addressed.

In general, if your playbook has a long, detailed flow chart, you're potentially spending too much time documenting what could be wrong and too little time fixing it—unless the root causes are completely out of your control or fundamentally require human intervention (like calling a vendor). The best playbooks I've seen have a few notes about exactly what the alert means, and what's currently interesting about an alert ("We've had a spate of power outages with our widgets from VendorX; if you find this, please add it to Bug 12345, where we're tracking things for patterns."). Most such notes should be ephemeral, so a wiki or similar is a great tool.

Tracking & Accountability

Track your pages, and all your other alerts. If a page fires and people just say "I looked, nothing was wrong," that's a pretty strong sign that you need to remove the paging rule, demote it, or collect the data in some other way. Alerts that are less than 50% accurate are broken; even those that are false positives 10% of the time merit more consideration.

Having a system in place (e.g. a weekly review of all pages, and quarterly statistics) can help keep a handle on the big picture of what's going on, and tease out patterns that are lost when the pager is handed from one human to the next.
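
As a sketch of what that review could compute (the page log entries are invented), tally how often each rule's pages were actionable and flag anything under the 50% bar:

  # Weekly/quarterly page review: per-rule accuracy from a log of
  # (rule_name, was_actionable) records. The data is made up.
  from collections import defaultdict

  page_log = [
      ("TooMany500StatusCodes", True),
      ("TooMany500StatusCodes", True),
      ("JanitorProcessNotKeepingUp", False),
      ("JanitorProcessNotKeepingUp", False),
      ("JanitorProcessNotKeepingUp", True),
      ("UserDatabaseShardDown", True),
  ]

  stats = defaultdict(lambda: [0, 0])  # rule -> [actionable, total]
  for rule, actionable in page_log:
      stats[rule][0] += int(actionable)
      stats[rule][1] += 1

  for rule, (actionable, total) in sorted(stats.items()):
      accuracy = actionable / total
      verdict = "broken: remove, demote, or tune" if accuracy < 0.5 else "keep watching"
      print(f"{rule}: {actionable}/{total} actionable ({accuracy:.0%}) -> {verdict}")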

You're being naïve!

Yup, though I prefer the term "aspirational". There are good reasons to break the above guidelines: cliffs with no user-visible symptom (quota, memory, disk I/O), causes that are outside your control to eliminate, and client-side signals too noisy or lossy to page on.


May the queries flow, and your pagers be quiet.
