davidmah.com

SRE 18: Where Infrastructure and SRE Fits In

This is part 18 of Site Reliability Engineering Birds Eye View.

Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do, because we'll continue to use vocabulary and concepts defined in them.

Previously, we finished giving flavor to each of the reliability dimensions.

For this final essay, we'll connect SRE to the big picture. We'll answer these questions:

What is the goal of an "infrastructure team"?
Where does Site Reliability Engineering fit within infrastructure engineering?
How should you pick which projects to staff your infrastructure people on?

To answer these, imagine this narrative. At the initial launch of your company, it was just you and the rest of your founding team.

When you built your product MVP, you were responsible for all outcomes. The features had to work. The product had to be responsive. The site had to be up. You needed to make sure not to lose any data. That is, you were responsible for latency, availability, and durability: the reliability outcomes.

Before long, the technology stack got more complicated, and you separated your concerns by drawing a line between frontend and backend logic. As this progressed, you also began hiring staff to place on either side of this line. Unfortunately, some clarity was lost. Who owns an outcome such as latency, which can't be pinned to just one side of the stack?

I'll suggest a slightly different approach. Instead of separating your staffing between frontend and backend, separate it between product features and redundant responsibilities, meaning work that several features would otherwise each have to do themselves. Your infrastructure team's job is to own redundant responsibilities. We'll plug reliability back in shortly, but let's first discuss redundant responsibilities.

Let's say you have a product feature that needs a new data model that no other features will use. Even though it is "backend", there is no need for the work to go to a different team. The staff in charge of that feature can implement the data model themselves. This lets people move fast by reducing unnecessary collaboration.

On the other hand, let's say you have two different product features that need a similar data model. It would be a lot to ask product staff to build this, as each team already has enough complexity in its own feature. Infrastructure staff can act as the bridge here, asking the product staff what their use cases are in order to coalesce them into a single coherent model.

In this way, the infrastructure team is not a staffing vacuum that takes away from product. It's quite the opposite, in that a successful infrastructure team reduces burdens for product, yielding a product team that feels like they "still move fast like a start-up."

To do this, the infrastructure team needs clarity on how to prioritize projects to serve product. It's critical that when you first build out your infrastructure team, you do not just dump ownership of lots of systems onto them. If you do that, it becomes extremely unclear how their work funnels back to product. Instead, as in the beginning, product developers retain ownership. Nothing changes, for now.

Infrastructure's first task is to take ownership of the single most complex redundancy. Once that is running smoothly, they move on to the second. Over time, the team takes on more and more, with a continued emphasis on whatever brings the most impact for product.
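To make the ordering concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration rather than a prescribed process: the `Responsibility` type, the feature names, and the 1-10 complexity estimates are all made up for the example. It treats a responsibility as redundant when at least two product features need it, then lists the redundant ones from most to least complex.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Responsibility:
    name: str
    complexity: int           # hypothetical 1-10 estimate agreed on by the team
    used_by: FrozenSet[str]   # product features that need this work


def infrastructure_backlog(items: List[Responsibility]) -> List[Responsibility]:
    """Return the redundant responsibilities, most complex first."""
    # Work needed by a single feature stays with that feature's staff.
    shared = [r for r in items if len(r.used_by) >= 2]
    return sorted(shared, key=lambda r: r.complexity, reverse=True)


# Made-up inventory for illustration only.
inventory = [
    Responsibility("checkout data model", 4, frozenset({"checkout"})),
    Responsibility("user profile data model", 6,
                   frozenset({"settings", "messaging"})),
    Responsibility("availability safeguards", 8,
                   frozenset({"checkout", "settings", "messaging"})),
]

backlog = infrastructure_backlog(inventory)
for position, item in enumerate(backlog, start=1):
    print(position, item.name)
```

In this sketch the checkout data model never reaches infrastructure, because only one feature uses it. If you would rather rank by impact for product than by complexity, swap the sort key for whatever estimate of impact your team trusts.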

Let's plug reliability back in. It's quite likely, but not guaranteed, that facets of the reliability outcomes are redundant responsibilities that become projects for infrastructure to take on. As an example, product engineers aren't likely to be experienced with implementing defensive algorithms for availability. If nothing else is as complex, then that might be an early infrastructure project. Or perhaps not. Maybe your first infrastructure project will be to build re-usable database model logic, which has nothing to do with the reliability outcomes.

You have much to gain from starting with what is most legitimately impactful. If you focus on this and go step by step, it won't be long before infrastructure does come to encompass the reliability outcomes. The point is that reliability outcomes aren't necessarily infrastructure's purpose; they just happen to turn out to be an impactful redundant responsibility.

That's all for the whole sequence! I hope you're feeling a lot more equipped now to tackle SRE in your company. If you have questions or want to explore this further, feel free to email me at david@davidmah.com.

This essay was published 2020-05-31.