I'm David Mah, a Site Reliability Engineering coach.

At Dropbox, I was the Tech Lead and Engineering Manager responsible for storage reliability. I was a core member of 'Magic Pocket', a colossal undertaking to move all user data out of AWS and into Dropbox's own data centers.

Now, I coach technical founders of early-traction startups to make their infrastructure reliable and architect it to accelerate product development.

Together, we dive into the details of how their sites are built and what causes reliability issues. To be efficient, we implement fixes that aren't merely band-aids but also aren't overly complex. Once the site is stable, we turn our attention to architecture improvements that reduce outage risk and speed up development in general. Our main vehicle is coaching discussions, but when necessary, I jump in to debug systems and write code.

If you don't take reliability seriously, it might destroy your company. In practice, this manifests as your engineering team spending an ever-increasing amount of time fixing outages while your product stagnates.

Good solutions involve the leanest surgical cuts to sustainably achieve reliability, so that you can then return to working on your product. In the end, your product is what matters, and your infrastructure only exists to serve it. Your infrastructure needs to be architected with that in mind.

NOTE: I've joined Figma to lead efforts in Infrastructure Reliability! I'm consulting now only in a limited capacity. I'm no longer performing hands-on technical audits, but I'm continuing to offer coaching to bring engineers up to speed on reliability.

For my take on engineering and other topics, read my essays.

If you want a deep dive on SRE, check out my essay sequence, Site Reliability Engineering Birds Eye View.

If you're interested in being coached, read my page on coaching.