Postmortems for Small Teams: What to Actually Write
The incident is over. Someone in Slack says "we should do a postmortem." Everyone agrees. Nobody wants to do it.
Most postmortem templates make this worse. They have eight sections, three of which assume you had an incident commander, two of which assume a formal review meeting, and one labeled "Five Whys" that always feels manufactured. You spend an hour filling in boxes for an audience that doesn't exist. The document gets saved, never read, and the next time something breaks the team is starting from scratch on what they learned last time.
There's a tighter version that actually works for small teams. It has four sections. It takes thirty minutes. The team will look at it again — which is the entire point.
Why postmortems matter more for small teams, not less
The standard reasoning for skipping a postmortem on a small team is "we already know what happened — we just lived through it." That's true today. It won't be true in three months.
Pattern memory in a small team is short. Without a written record, here's what gets lost:
- The actual root cause. Not the surface symptom — the real one. Was it the recent deploy? The third-party API timeout? The race condition that's been latent for six months? Three months later, you'll remember "we had an outage in April" but not which underlying thing caused it.
- What you tried that didn't work. The first three theories. The hypothesis you ran down for an hour. These are valuable for the next incident — but only if they're written down somewhere.
- The action item that almost happened. "We should add monitoring for X" — said in the heat of the postmortem, never followed through, the same incident recurs six months later.
Bigger orgs have institutional memory in the form of senior engineers, oncall handoffs, and incident management tools. Small teams have one or two people, both of whom were debugging the incident at the time and now have to write it down or accept that it will eventually happen again the same way. The postmortem isn't a ritual — it's the only thing standing between you and the same outage at midnight three months from now.
What enterprise templates get wrong
Most postmortem templates on the internet are written by or for ops orgs with a clear separation of roles: an on-call engineer who handled the incident, an incident commander who coordinated, a scribe who tracked the timeline, and a separate review group that meets to discuss it later. The template reflects that workflow.
For a team of two, that workflow doesn't exist. The on-call engineer, the commander, the scribe, and the reviewer are all the same person. The "review meeting" is a Slack message that says "look at this when you have a minute." Trying to fill in a template designed for the other workflow produces a document where half the sections are blank or padded with rituals.
The sections that don't survive contact with a small team:
- "Five Whys" analysis. The structure assumes a deep org and a complex chain of causes. For a small team incident, the cause is usually one or two things: a deploy went bad, an external dependency was unreliable, a query started running slow, an alert was missing. Forcing it into a Five Whys frame produces fake intermediate steps. ("Why did the query run slow?" "Because the table grew." "Why did the table grow?" "Because we have customers." Now what?)
- "Contributing factors." Often becomes speculation about what could have made things worse. Rarely produces actions and inflates the document.
- "Lessons learned." The vague summary section. Almost never says anything specific enough to act on.
- Blameless framing rituals. A formal "this is a blameless postmortem, no fault is being assigned" preamble is theater on a team where you already know who pushed the deploy. The blameless principle is real and worth keeping; the ceremony is unnecessary.
What's left after stripping those is small enough to fit on one screen. That's the right size.
The four sections
A useful postmortem for a small team has exactly four sections.
1. What customers experienced
In customer language, not yours.
"Between 14:42 and 15:18 (36 minutes), customers couldn't sign in. Signed-in users were unaffected. Signups were also blocked during this window."
Write this first because it's the part you'll reference later. When you're triaging a similar incident in the future, the question is "have we seen this before?" and the answer comes from this paragraph. It needs to be specific enough that you can pattern-match against it without re-reading the whole document.
What it should not contain:
- Internal service names ("auth-svc-2", "session-cache-edge")
- Infrastructure terms ("connection pool", "queue depth")
- Raw error codes
If the next person who scans your postmortem archive needs to look up a term to interpret this section, it's written wrong.
2. What actually broke
Now you can be technical. Keep it to one paragraph.
"An out-of-memory kill on the primary auth pod caused a brief restart. Pod recovery took about 45 seconds, but a stale DNS cache in the load balancer kept routing requests to the dead pod for another 30 minutes until the TTL expired. The stale cache plus the missed health check is the actual cause — the OOM by itself would have been a minute-long blip."
The discipline here is naming the real cause, not just the trigger. Most incidents have a trigger (the deploy, the traffic spike, the dependency that timed out) and an underlying cause (the missing health check, the absent timeout, the misconfigured retry). The trigger is what made the incident happen today. The underlying cause is why it could happen at all. The action items in section 4 should target the underlying cause, not the trigger.
3. The timeline
Four to six bullets with timestamps:
- 14:42 — First customer report ("can't sign in")
- 14:48 — Started investigating, initially suspected the recent deploy
- 14:55 — Ruled out the deploy; checked auth service health
- 15:03 — Identified stale DNS cache, manually flushed
- 15:18 — Customer impact resolved
- 15:30 — Confirmed sustained recovery, posted resolution update
The timeline matters mostly because it surfaces gaps. Was detection slow? (First report at 14:42, but the underlying failure started at 14:30 — alerting missed 12 minutes.) Was diagnosis slow? (21 minutes from detection to identification — what slowed it down?) Was the fix slow once you knew?
If the timeline reads "fast detection, fast diagnosis, fast fix," you don't need many action items. If it shows a 20-minute gap between detection and diagnosis, that's where the next action item goes.
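If you want that gap analysis to be mechanical rather than eyeballed, a few lines of Python over the timeline timestamps are enough. This is only a sketch: the timestamps are the ones from the example incident above, and the 15-minute flag threshold is an arbitrary illustrative choice.

```python
from datetime import datetime, timedelta

def t(hhmm: str) -> datetime:
    """Parse an HH:MM timeline entry (same-day incident assumed)."""
    return datetime.strptime(hhmm, "%H:%M")

# Timestamps from the example incident above.
failure_start = t("14:30")  # when the underlying failure actually began
detected      = t("14:42")  # first customer report
diagnosed     = t("15:03")  # root cause identified (stale DNS cache)
resolved      = t("15:18")  # customer impact over

gaps = {
    "detection": detected - failure_start,  # did alerting catch it, or a customer?
    "diagnosis": diagnosed - detected,      # how long until you knew what broke
    "fix":       resolved - diagnosed,      # how long the fix took once you knew
}

for name, gap in gaps.items():
    minutes = int(gap.total_seconds() // 60)
    # 15 minutes is a hypothetical threshold; pick whatever feels slow for your system.
    flag = "  <- likely home of the next action item" if gap > timedelta(minutes=15) else ""
    print(f"{name:>9}: {minutes} min{flag}")
```

For the example incident this prints a 12-minute detection gap (the first signal was a customer report, not an alert) and a 21-minute diagnosis gap, which is exactly the kind of output the questions above are meant to surface.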
4. What you're changing
Maximum three things. Better: one.
- Add a synthetic probe against the sign-in endpoint, paged at the 2-minute error threshold. (Alice, by Friday)
- Set a shorter DNS TTL (60s) on internal load balancer entries. (Bob, by Monday)
Each item has a person and a date. Items without those are wishes, not commitments.
The count matters: small teams that list eight action items will do one of them within a month. Small teams that list one action item will do it within a week. The temptation to list everything you noticed is strong. Resist. A postmortem with one shipped action item is more valuable than a postmortem with eight that languish.
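To make the first action item above concrete, here is a minimal sketch of a synthetic sign-in probe. The URL, the probe credentials, and the page_oncall stub are all placeholders, and most monitoring tools can run an equivalent check without custom code; the point is only that "paged at the 2-minute error threshold" translates into a handful of lines.

```python
import time

import requests  # third-party HTTP client; assumed available

SIGNIN_URL = "https://app.example.com/api/signin"  # hypothetical endpoint
PROBE_INTERVAL_S = 30
FAILURES_BEFORE_PAGE = 4  # 4 consecutive failures at 30s intervals: roughly the 2-minute threshold

def probe_once() -> bool:
    """Attempt a test sign-in; True means the endpoint behaved normally."""
    try:
        resp = requests.post(
            SIGNIN_URL,
            json={"email": "probe@example.com", "password": "not-a-real-secret"},
            timeout=5,
        )
        return resp.status_code == 200
    except requests.RequestException:
        return False

def page_oncall(message: str) -> None:
    """Placeholder: wire this to your paging integration (webhook, PagerDuty, etc.)."""
    print(f"PAGE: {message}")

def run() -> None:
    consecutive_failures = 0
    while True:
        if probe_once():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == FAILURES_BEFORE_PAGE:
                page_oncall("Sign-in probe has been failing for ~2 minutes")
        time.sleep(PROBE_INTERVAL_S)

if __name__ == "__main__":
    run()
```

Paging only on the transition to the fourth consecutive failure (the == check) keeps the probe from re-paging every 30 seconds for the rest of the outage.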
What to leave out
Beyond the enterprise-template artifacts above, a few specific things commonly bloat postmortems and add no value:
- The full Slack transcript. Nobody will read it later. If a specific message is load-bearing, paste that one line into the timeline.
- Speculation about how this could have been worse. Compelling in the moment, useless three months later. If "could have been worse" implies a real risk, the action item lives in section 4.
- Praise for individual heroics. The point of a postmortem isn't to recognize the engineer who debugged at 2am. (Recognize them separately, in a Slack message.) The postmortem is a reference document.
- Vague preventive measures. "We need to think about better monitoring" is not an action item. "Add a synthetic probe for the sign-in endpoint (Alice, by Friday)" is. If the action isn't specific enough that someone can mark it done, it doesn't go in section 4.
Internal vs customer-facing
What you've written so far is the internal version. It's the document you'll reference next time something similar happens.
Whether to publish a customer-facing version depends on the incident. The threshold is roughly: did the incident last long enough that customers noticed and remembered? A 15-minute blip almost never warrants one. A multi-hour degradation, especially one affecting transactions, almost always does.
The customer version is a translation, not a copy. It's shorter, written entirely in customer language (translate every section-2 technical detail), and skips the parts that are operationally interesting but not customer-relevant. The timeline can collapse into a single duration. The action items become reassurances.
Roughly half the length, none of the internal service names, all the parts that affect what a customer would do differently. That's the public version.
When a postmortem isn't needed
You don't need a postmortem for every incident. The 5-minute blip with one customer report and an obvious cause doesn't need a four-section document — it needs a Slack message that says "this happened, this fixed it, here's the timeline." Save those messages in a thread or a running doc; that's the log of small things.
The threshold for graduating from "Slack note" to "postmortem" is: did anything about this incident make you stop and think? An unexpected cause, a slow diagnosis, a near-miss in detection, an action item that came out of it? If yes, it's a postmortem. If no, the Slack note is enough.
The reverse mistake, writing too few postmortems, is worse than writing too many. Most teams underdocument because the work feels disproportionate to the incident size. The fix is to make the postmortem itself smaller, not to skip it.
A template you can copy
# Postmortem: [date] — [short description]
## What customers experienced
[Customer-facing impact: who, how long, what they couldn't do.]
## What actually broke
[Technical cause in one paragraph. Trigger and underlying cause.]
## Timeline
- HH:MM — First signal
- HH:MM — Investigation started
- HH:MM — Root cause identified
- HH:MM — Fix deployed
- HH:MM — Customer impact resolved
## What we're changing
1. [Specific action — owner — date]
2. [Specific action — owner — date]
That's it. No "five whys," no "contributing factors," no "lessons learned." The whole thing fits on one screen, takes thirty minutes to write, and is short enough that the team will look at it again.
Write it before you close the tab
The biggest predictor of whether a postmortem gets written is when you start it. If you wait until the next morning, the details have already softened. If you wait a week, the document is half guesswork. If you start it during the incident, even just stubbing the timeline as you go, it writes itself when the incident closes.
The thirty minutes you spend on the postmortem now buys back hours of debugging six months from now, when the same class of incident recurs and the document tells you everything you tried last time. That's the only ROI calculation that matters. Small teams aren't too busy to write postmortems. They're too busy not to.
PageCalm helps small teams run status pages with AI-powered incident updates that sound human and ship fast. Try it free — no credit card required.