Holding the line: 11 years on a tier-1 AWS service
What I learned operating a high-throughput AWS service from 2012 to 2023 — the parts that don't show up in postmortems.
Note: this is a public, anonymized retrospective. No internal architecture, customer data, or proprietary metrics. The lessons travel; the specifics stay at Amazon.
Context
I spent eleven years at Amazon, the bulk of it inside AWS, on services that customers expected to behave like a utility. Tier-1, multi-region, no acceptable downtime, the kind of system where “the graph went flat for nine seconds” gets a serious meeting.
I came in as an SDE II, moved into systems engineering, then into tech-ops — a path that’s less common than the canonical SDE-to-Senior-SDE ladder, and one I’d recommend to anyone who actually likes operating production systems. You learn things on the floor that you’d never learn from feature work.
Constraints
The constraints that shaped almost every decision:
- No customer-visible regression, ever. “Better in the average case” is not a thing if the worst case got worse.
- Backwards compatibility is forever. APIs we shipped in 2014 still had clients in 2022. A “small” deprecation was a multi-quarter program.
- Operability beats elegance. A system that’s 30% slower but can be operated by a tired on-call at 3am beats a beautiful system that requires the original author to debug.
- Blast radius is the budget. Anything that could take down a region was treated like a financial control, not an engineering choice.
What I learned that wasn’t obvious going in
1. Most outages are slow-motion
The dramatic ones get the writeups. The real pattern is something drifts — a queue depth, an error rate, a memory curve — for hours or days, and the page fires when the drift crosses a threshold. The work isn’t heroic firefighting. It’s noticing earlier. Almost every postmortem I wrote ended with “the signal was visible at T-90 minutes; we noticed at T-0.”
The fix was rarely a smarter alarm. It was a flatter dashboard, a less noisy on-call, and an explicit norm that drift is a page-worthy signal. Culture, not code.
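To make "drift is a page-worthy signal" concrete, here's a minimal sketch of the idea, not anything we actually ran: fit a slope over a trailing window of a metric and page when the current trend would cross the hard limit within some horizon, well before the threshold alarm fires. The metric, the numbers, and the 90-minute horizon are illustrative.

```python
# Minimal drift detector: page on a sustained trend, not just on the
# hard threshold. Window sizes, limits, and the metric are illustrative,
# not taken from any real service.
from statistics import mean

def slope_per_minute(samples):
    """Least-squares slope of (minute, value) samples."""
    xs = [t for t, _ in samples]
    ys = [v for _, v in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den if den else 0.0

def drift_alarm(samples, hard_limit, horizon_minutes=90):
    """Fire if the current trend would cross hard_limit within the horizon.

    samples: list of (minute, queue_depth) over a trailing window.
    """
    current = samples[-1][1]
    trend = slope_per_minute(samples)
    if trend <= 0:
        return False  # flat or recovering: nothing to page on
    minutes_to_limit = (hard_limit - current) / trend
    return minutes_to_limit <= horizon_minutes

# Example: queue depth creeping up ~30 items/minute toward a limit of 5000.
window = [(m, 1500 + 30 * m) for m in range(60)]
print(drift_alarm(window, hard_limit=5000))  # True: roughly an hour of headroom left
```

The point isn't this particular math. It's that the page arrives while there is still time to act, which was the whole lesson of the T-90 postmortems.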
2. Runbooks are a refactoring target
Every service had runbooks. The good services treated them like code: versioned, tested, owned, deleted when stale. The bad services treated them like documentation, which is to say: written once, never read, technically present in case of audit.
A runbook that hasn’t been executed in six months is a liability. A runbook that an on-call new to the service can run end-to-end is a moat.
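As an illustration of "treated like code", here is a hedged sketch of the kind of check that keeps runbooks honest: each runbook carries a bit of metadata, and a CI job flags any that haven't been exercised recently. The directory layout, field names, and six-month window are assumptions for the example, not a description of any internal tooling.

```python
# Flag stale runbooks from front-matter metadata. The layout, field
# names, and 180-day freshness window are illustrative assumptions.
from datetime import date, timedelta
from pathlib import Path
import sys
import yaml  # pip install pyyaml

MAX_AGE = timedelta(days=180)

def last_exercised(runbook: Path) -> date:
    """Read a 'last_executed: YYYY-MM-DD' field from YAML front matter."""
    text = runbook.read_text()
    front_matter = text.split("---")[1]  # between the first pair of '---'
    meta = yaml.safe_load(front_matter)
    return date.fromisoformat(str(meta["last_executed"]))

def stale_runbooks(root: Path) -> list[Path]:
    cutoff = date.today() - MAX_AGE
    return [rb for rb in root.glob("runbooks/*.md") if last_exercised(rb) < cutoff]

if __name__ == "__main__":
    stale = stale_runbooks(Path("."))
    for rb in stale:
        print(f"STALE: {rb} has not been executed in over six months")
    sys.exit(1 if stale else 0)  # fail the build until someone runs it or deletes it
```

The failing exit code is the whole trick: a stale runbook becomes a broken build, and broken builds get fixed or deleted.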
3. The on-call ratio is a leading indicator
If your team is six engineers and the rotation is 1-in-3, you're two illnesses and a vacation away from someone hating their life. By the time pages-per-week shows up in a metrics review, you're already six months late.
The number I tracked instead was uninterrupted weekend hours per engineer per quarter. When that dropped below a floor, I’d freeze feature work and do reliability investments until it recovered. Every time I did this, the next two quarters of feature throughput went up, not down. Burnout debt compounds faster than tech debt.
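For what it's worth, the metric is cheap to compute if you have page timestamps. A rough sketch under simple assumptions: an hour only counts as uninterrupted if no page landed in it, and the page log is just (engineer, timestamp) pairs. The field names and log format are illustrative, and whatever floor you compare against is yours to pick.

```python
# Uninterrupted weekend hours per engineer per quarter, from a page log.
# A rough sketch: a weekend hour "counts" only if no page landed in it.
# The log format and field names are illustrative assumptions.
from collections import defaultdict
from datetime import datetime, timedelta

def weekend_hours(start: datetime, end: datetime):
    """Yield each hour-start between start and end that falls on a weekend."""
    t = start.replace(minute=0, second=0, microsecond=0)
    while t < end:
        if t.weekday() >= 5:  # Saturday=5, Sunday=6
            yield t
        t += timedelta(hours=1)

def uninterrupted_weekend_hours(engineers, pages, quarter_start, quarter_end):
    """engineers: names on the rotation; pages: (engineer, paged_at) tuples."""
    paged_hours = defaultdict(set)
    for engineer, paged_at in pages:
        paged_hours[engineer].add(paged_at.replace(minute=0, second=0, microsecond=0))
    total = list(weekend_hours(quarter_start, quarter_end))
    return {e: sum(1 for h in total if h not in paged_hours[e]) for e in engineers}

# Compare the result against whatever floor you pick; anyone under it is the
# signal to trade feature work for reliability work.
```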
4. The incident command system isn’t optional
Big companies have it. Small teams skip it. They’re wrong to. Even a two-person incident benefits from one person being explicitly the IC. It’s not bureaucracy — it’s a way to avoid two engineers doing the same thing while a third thing rots.
What I’d do differently
I wish I’d written more. Not for the company — for myself. Eleven years of incidents, and the only durable record of what I learned is in my head. Every time I do a job interview now, I’m reconstructing examples from memory that I should have written down at the time. That’s part of why this site exists.
I’d also have spent less time defending systems that should have been deleted. “It still works” is not a reason for a system to be alive in 2023. Every dollar of operational toll on a deprecated thing is a dollar that wasn’t paid into the thing customers actually use.
Credit
I’m naming nobody, because that’s not mine to do. But: I worked with an absurdly good set of on-calls, especially the SRE/tech-ops crews who held the floor through the bad nights. Most of what I know about operating production I learned from them.
Written from the safe distance of three years out. If you’re an Amazon engineer reading this and I got something wrong, tell me — ask Matt.