Communicating the Status Of An Outage
- [Link to appropriate slack channel]
- [Instructions on starting a zoom/call for tracking status]
Searching for Causes
- Look for recent deploys [Link to deploys/merged Pull Requests]
- Recently closed GitHub PRs
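If the GitHub CLI is available, recently merged PRs can be pulled up from the terminal. This is a sketch, assuming gh is installed and authenticated; the PR number is a placeholder:

```shell
# Build the gh command for review, then run it by hand.
# Lists the 10 most recently merged PRs for the current repo.
CMD="gh pr list --state merged --limit 10"
echo "$CMD"
# To see which files a suspicious PR touched (1234 is a placeholder):
# gh pr diff 1234 --name-only
```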
- Vercel
- Vercel Activity Log (shows recent deploys/changes in Vercel services)
- Use the Vercel CLI alias command (the vercel and vc commands are interchangeable):
vercel alias set <canonical URL of the deployment> <URL you want to point to it> -S invisible-tech
- Example
vc alias set manticore-f3qqk756r-invisible-tech.vercel.app app.inv.tech -S invisible-tech
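The rollback-by-alias flow above can be sketched as a small script. The deployment URL and domain below are the example values from this runbook, not live ones; find the real canonical URL first (vercel ls -S invisible-tech should list recent deployments):

```shell
# Roll the production domain back to a known-good deployment.
DEPLOY_URL="manticore-f3qqk756r-invisible-tech.vercel.app"  # example value
PROD_DOMAIN="app.inv.tech"                                  # example value
TEAM="invisible-tech"
CMD="vc alias set $DEPLOY_URL $PROD_DOMAIN -S $TEAM"
echo "$CMD"  # review the command, then run it by hand
```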
Mitigating Causes / Post-Mortems
- Immediately after an incident has been mitigated we should identify an engineer who will perform a post mortem analysis
- Post mortems will also have an accompanying meeting where we go over the causes and discuss mitigations. We maintain a blameless post-mortem culture: discussing whose "fault" an outage is isn't productive. The goal is to understand what happened, and you are safe to explain what happened without fear of repercussions; we just want to make the system run better.
- The value of a post mortem isn't just learning what happened; it's also in digging into the overall reasons the system failed. For example, "the database ran out of disk space" is the nominal reason the system went down, but what could we have done differently to catch the issue? Should we add more monitoring? Is there appropriate alerting, with enough context, going to a responsible person who can proactively fix the problem?
- My goal for post mortems is for us to gain additional information about how the system operates. For the time being, the expectation is that post mortems will come with a set of "mitigation tasks" that are either P0 or P1. Priority 0 mitigation tasks have an SLA of 7 days to implement; P1-level mitigations have a 30-day SLA. Keeping the system running is more important than new features, and these SLAs are designed to give engineers the space they need to fix stability and reliability issues.
Resolution Steps
- Option for short-term resolution: Force pushing to production through Vercel
- It is an option to force-push deployments to Vercel using
vc --prod
from the command line. However, this has its own share of issues: 1) it could cause additional breakage, since the code is not checked through the same channels, and 2) it will be overwritten the next time code touching the app is deployed.
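A sketch of that force-push flow, assuming you know a last-good commit (the SHA below is a placeholder; confirm it with the team before deploying):

```shell
# Deploy a known-good commit straight to production, bypassing review.
# Use only as a stopgap; the next normal deploy will overwrite it.
GOOD_SHA="abc1234"  # placeholder: the last commit known to work
CMD="git checkout $GOOD_SHA && vc --prod"
echo "$CMD"  # review, then run the steps by hand
```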
Other Runbooks
- For Analytics failures where data isn’t being copied see Logical Replication
- We use Kafka for processing inbound email (and for other inbound change data capture). Check out
Runbook: Kafka Issues