Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do, because we'll continue to use vocabulary and concepts defined in them.
Previously, we finished discussing Protection, Verification, and Remediation, the building blocks of incident response.
The essays will now move on to some nuances of the individual reliability dimensions. We'll start here with Latency, and discuss how diagnosing latency incidents is extraordinarily hard without the proper tools ready in advance.
Imagine a situation where the site is intolerably slow. The operator tries it for themselves, and they find that it takes over 8 seconds for the front page to load.
The next step is to figure out which subcomponent the slowness is coming from. Is it in the:

- network?
- web server?
- database?
It'd be quite easy if the site were instead returning errors! The operator could look at the logs and see the source of the errors. Unfortunately, that's not the situation.
An operator without tools will be stuck here. They're clear on exactly what information they want, but they don't know how to get it. Perhaps they will consider adding print statements to production servers, but then it dawns on them that deploying print statements to every production machine and reading their output would itself be quite the hassle and detour. There are paths forward, but none of them are simple, so the outage is going to be quite prolonged.
Lacking the right tools, this person isn't set up for success. They need a way to answer those specific questions about the timing breakdown. Ideally, they'd have a dashboard that exposes the timing breakdown of a sample of requests.
- front_page: 8.1s
    - network: 0.1s
    - web server: 7.8s
        - connection handling: 0.1s
        - application logic: 7.6s
            - loop1: 0.1s
            - loop2: 7.3s
            - loop3: 0.2s
        - data serialization: 0.2s
    - database: 0.2s
        - select *: 0.001s
            - lookup: 0.001s
            - retrieval: 0.000s
            - postprocessing: 0.000s
        - select count(*): 0.199s
            - lookup: 0.001s
            - retrieval: 0.002s
            - postprocessing: 0.197s
At a glance, the operator can see that nearly all of the time, 7.3 of the 8.1 seconds, is going to loop2 in the application logic, which is enough information for them to go look at the code and investigate that second loop.
This instrumentation has to exist in advance. It's an essential tool for reducing the time taken to diagnose incidents. Without it, diagnosing latency incidents will be a scramble.
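As a rough illustration, here's a minimal sketch of how such instrumentation might be wired up in a Python service. The `span` helper and the handler below are hypothetical names of my own, and a real system would more likely lean on an established tracing library than hand-rolled timers, but the shape is the same: nested, named timers that are recorded on every sampled request.

```python
import time
from contextlib import contextmanager

# Timings for the current request, keyed by a dotted path such as
# "front_page.application logic.loop2". In a real service this would be
# per-request state, sampled and shipped to a dashboard.
spans = {}
_stack = []

@contextmanager
def span(name):
    """Time a named section of code, nested under whatever span encloses it."""
    _stack.append(name)
    path = ".".join(_stack)
    start = time.monotonic()
    try:
        yield
    finally:
        spans[path] = time.monotonic() - start
        _stack.pop()

# Hypothetical front-page handler instrumented with nested spans.
def handle_front_page():
    with span("front_page"):
        with span("application logic"):
            with span("loop2"):
                time.sleep(0.01)   # stand-in for the suspect loop
        with span("database"):
            time.sleep(0.001)      # stand-in for the queries

handle_front_page()
print(spans)
# e.g. {'front_page.application logic.loop2': 0.01, ..., 'front_page': 0.011}
```

The specific mechanism matters less than the fact that the spans are already in place, and already being recorded, before the incident starts.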
That's a good place to stop for this essay. In the next essay, we will discuss how latencies tend to creep their way upwards without the team noticing. Have a great day!
If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.
This essay was published 2020-05-31.