We practice blameless postmortems because humans make mistakes and systems enable them, but we also maintain high expectations for ownership. Our approach treats SLOs as commitments, so when they break, we explain how we’ll recover. We build public timelines with facts first and interpretations after. We use five whys to drive toward system changes, ensuring action items modify code, configurations, or processes. We assign each action item a named owner and a date because dates are commitments, not aspirations.
What we avoid is equally important. We don’t conduct witch hunts because we fix causes, not people. We don’t create “action items” that are merely reminders. We don’t close incidents without verifying fixes actually work. Blameless is how we learn, but accountability is how we improve.
I remember an incident that changed our practices fundamentally. We shipped a small configuration change on a Friday that silently disabled retries for a high-traffic queue. Everything appeared normal until traffic peaked; then errors spiked, dashboards lagged, and manual rollback took too long because the on-call engineer had to hunt for runbook documentation.
The postmortem found no villains, just a system that made breaking things easy and fixing them difficult. There were no CI guardrails for risky late-week changes, no unified view of queue health and retry policy changes, and runbooks scattered across three locations with inconsistent steps.
The action items weren’t “be more careful.” Instead, we addressed systemic issues. We implemented a CI policy requiring approval for Thursday and Friday configuration changes with rationale, though we included an emergency override path. We built a unified queue dashboard showing saturation, retries, and back-pressure. We consolidated runbooks in a single repository with standardized format and alert links.
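A minimal sketch of what such a guardrail might look like, assuming the CI system exposes the changed file list and pull-request labels as environment variables; the variable names, paths, and override label here are placeholders, not our actual pipeline:

```python
#!/usr/bin/env python3
"""CI guardrail: block risky late-week configuration changes unless overridden.

Hypothetical sketch: assumes CI passes changed files via CHANGED_FILES
(whitespace-separated) and PR labels via PR_LABELS (comma-separated).
"""
import datetime
import os
import sys

RISKY_PREFIXES = ("config/", "deploy/")   # paths treated as configuration
RISKY_DAYS = {3, 4}                       # Thursday=3, Friday=4 (Monday=0)
OVERRIDE_LABEL = "emergency-override"     # label that unblocks the change

def main() -> int:
    changed = os.environ.get("CHANGED_FILES", "").split()
    labels = {label.strip() for label in os.environ.get("PR_LABELS", "").split(",")}

    risky_files = [f for f in changed if f.startswith(RISKY_PREFIXES)]
    today = datetime.date.today().weekday()

    if risky_files and today in RISKY_DAYS and OVERRIDE_LABEL not in labels:
        print("Late-week configuration change detected:")
        for f in risky_files:
            print(f"  - {f}")
        print(f"Add the '{OVERRIDE_LABEL}' label with a written rationale to proceed.")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```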
The next incident saw MTTR halve, not from trying harder but from better systems. This experience reinforced that incidents are tuition: you should get the value back by changing systems and ensuring someone owns each change.
Our postmortem process starts with a factual timeline built from logs, metrics, alerts, and commits, written without adjectives. We identify impact by noting which SLOs broke, who was affected, and for how long. We map contributing factors across people, process, tooling, and architecture. We write three to seven action items that change systems, each with an owner, a date, and a verification method. Finally, we share widely, because if we’re hesitant to publish internally, we’re probably not being honest enough.
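One way to make “owner, date, and verification method” concrete is to treat postmortems and action items as structured data rather than free text. This is a hypothetical sketch; the field names are illustrative, not our actual tooling:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ActionItem:
    title: str          # the system change, not a reminder
    owner: str          # a named person, not a team alias
    due: date           # a commitment, not an aspiration
    verification: str   # "how we'll know it worked"

@dataclass
class Postmortem:
    timeline: list[str]              # facts first, no adjectives
    impact: str                      # which SLOs broke, who was affected, for how long
    contributing_factors: list[str]  # people, process, tooling, architecture
    action_items: list[ActionItem] = field(default_factory=list)

    def validate(self) -> None:
        """Enforce the shape we expect before publishing."""
        if not 3 <= len(self.action_items) <= 7:
            raise ValueError("expected three to seven action items")
        for item in self.action_items:
            if not (item.owner and item.verification):
                raise ValueError(f"'{item.title}' needs an owner and a verification method")
```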
The action items we create must change code, configuration, or process rather than being reminders. They should be small and testable, taking one to three days of work, or explicitly milestoned with incremental value. Each includes “how we’ll know it worked” criteria, like “Synthetic check fails if retries are disabled.” We track everything in our work management system with links back to the postmortem.
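As an illustration of such a criterion, a hypothetical synthetic check for the retry example might look like the following; the endpoint URL, queue name, and config field names are assumptions:

```python
import json
import sys
import urllib.request

# Hypothetical endpoint exposing the queue's effective retry policy.
QUEUE_CONFIG_URL = "https://queue.internal.example.com/v1/queues/orders/config"

def check_retries_enabled() -> bool:
    """Return True if the queue's retry policy is enabled and non-trivial."""
    with urllib.request.urlopen(QUEUE_CONFIG_URL, timeout=5) as resp:
        config = json.load(resp)
    retry = config.get("retry_policy", {})
    return bool(retry.get("enabled")) and retry.get("max_attempts", 0) > 1

if __name__ == "__main__":
    if not check_retries_enabled():
        print("SYNTHETIC CHECK FAILED: retries appear to be disabled for the orders queue")
        sys.exit(1)  # a non-zero exit is what the alerting hook keys on
    print("ok: retries enabled")
```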
We verify fixes by testing in staging and production with game days when possible. We add synthetic monitoring when applicable so failure classes wake us in minutes rather than hours. We tag incidents by category to identify quarterly patterns.
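Here is a sketch of the kind of quarterly roll-up that tagging enables, assuming each incident is recorded with a date and a category tag; the records below are made up for illustration:

```python
from collections import Counter
from datetime import date

# Hypothetical incident records: (date, category tag)
incidents = [
    (date(2024, 4, 12), "config-change"),
    (date(2024, 5, 3), "dependency-outage"),
    (date(2024, 6, 21), "config-change"),
]

def quarterly_pattern(records):
    """Count incidents per (quarter, category) to surface repeat patterns."""
    counts = Counter()
    for day, category in records:
        quarter = f"{day.year}-Q{(day.month - 1) // 3 + 1}"
        counts[(quarter, category)] += 1
    return counts

for (quarter, category), n in sorted(quarterly_pattern(incidents).items()):
    print(f"{quarter}  {category}: {n}")
```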
The leading indicators we monitor include SLO attainment trends and burn rates, both per service and globally; mean time to detect and mean time to restore; pages per on-call shift and after-hours load; and repeat incident patterns and their trends.
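For the burn-rate indicator specifically, the arithmetic is simple enough to show. This sketch assumes an availability SLO expressed as a target success ratio over a rolling window:

```python
def error_budget_burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Burn rate = observed error rate / allowed error rate.

    A burn rate of 1.0 consumes the error budget exactly as fast as the SLO
    permits; values well above 1.0 are what alerting should key on.
    """
    if total == 0:
        return 0.0
    observed_error_rate = failed / total
    allowed_error_rate = 1.0 - slo_target   # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / allowed_error_rate

# Example: 120 failures out of 100,000 requests against a 99.9% SLO
print(error_budget_burn_rate(120, 100_000, 0.999))  # ≈ 1.2, burning 20% too fast
```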
Our cultural practices support this approach. We recognize clear incident writeups and changes that remove sharp edges. We rotate facilitators so more people practice running calm, factual reviews. We thank messengers because the first person to identify risky changes helps everyone.
For teams getting started, adopt an SLO for one customer-critical path and publish it. Add linked runbooks for your top three alert types. Create an incident template with timeline, impact, contributing factors, and system-changing actions. Implement CI guardrails for risky changes with legitimate override paths.
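To make that first step concrete, a published SLO can start as a single small record; the service name, objective, owner, and paths below are placeholders to adapt:

```python
# Hypothetical starting point: one published SLO for one customer-critical path.
CHECKOUT_SLO = {
    "service": "checkout-api",                       # placeholder service name
    "sli": "successful requests / total requests",
    "objective": 0.999,                              # 99.9% over the window
    "window_days": 30,
    "owner": "payments-oncall",                      # named owner, published with the SLO
    "runbook": "runbooks/checkout-api.md",           # linked runbook for the top alerts
}
```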