Communicating the Status Of An Outage
- [Link to appropriate slack channel]
- [Instructions on starting a zoom/call for tracking status]
Searching for Causes
- Look for recent deploys [Link to deploys/merged Pull Requests]
- Recently closed GitHub PRs
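If the GitHub CLI is available, recently merged PRs can be pulled up from the terminal. This is a sketch, assuming gh is installed and authenticated; the PR number is a placeholder:

```shell
# Build the gh command for review, then run it by hand.
# Lists the 10 most recently merged PRs for the current repo.
CMD="gh pr list --state merged --limit 10"
echo "$CMD"
# To see which files a suspicious PR touched (1234 is a placeholder):
# gh pr diff 1234 --name-only
```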
- Vercel
- Vercel Activity Log (shows recent deploys/changes in Vercel services)
- Use the Vercel CLI alias command (the vercel and vc commands are interchangeable):
vercel alias set <canonical URL of the deployment> <URL you want to point to it> -S invisible-tech
- Example
vc alias set manticore-f3qqk756r-invisible-tech.vercel.app app.inv.tech -S invisible-tech
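The rollback-by-alias flow above can be sketched as a small script. The deployment URL and domain below are the example values from this runbook, not live ones; find the real canonical URL first (vercel ls -S invisible-tech should list recent deployments):

```shell
# Roll the production domain back to a known-good deployment.
DEPLOY_URL="manticore-f3qqk756r-invisible-tech.vercel.app"  # example value
PROD_DOMAIN="app.inv.tech"                                  # example value
TEAM="invisible-tech"
CMD="vc alias set $DEPLOY_URL $PROD_DOMAIN -S $TEAM"
echo "$CMD"  # review the command, then run it by hand
```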
Mitigating Causes / Post-Mortems
- Immediately after an incident has been mitigated we should identify an engineer who will perform a post mortem analysis
- Post mortems will also have an accompanying meeting where we go over the causes and discuss mitigations. We maintain a blameless post-mortem culture: discussing whose "fault" an outage is isn't productive. The goal is to understand what happened, and you are safe to explain what happened without fear of repercussions; we just want to make the system run better.
- The value of a post mortem isn't just learning what happened; it's also in digging into the overall reasons the system failed. For example, "the database ran out of disk space" is the nominal reason the system went down, but what could we have done differently to catch the issue? Should we add more monitoring? Is there appropriate alerting, with enough context, going to a responsible person who can proactively fix the problem?
- My goal for post mortems is for us to gain additional information about how the system operates. For the time being, the expectation is that post mortems will come with a set of "mitigation tasks" that are either P0 or P1. Priority 0 mitigation tasks have an SLA of 7 days to implement; P1-level mitigations have a 30-day SLA. Keeping the system running is more important than new features, and these SLAs are designed to give engineers the space they need to fix stability and reliability issues.
Resolution Steps
- Option for short-term resolution: Force pushing to production through Vercel
- It is an option to force-push deployments to Vercel using
vc --prod
from the command line. However, this has its own share of issues: 1) it could cause additional breakage, since the code is not checked through the same channels, and 2) it will be overwritten the next time code touching the app is deployed.
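A sketch of that force-push flow, assuming you know a last-good commit (the SHA below is a placeholder; confirm it with the team before deploying):

```shell
# Deploy a known-good commit straight to production, bypassing review.
# Use only as a stopgap; the next normal deploy will overwrite it.
GOOD_SHA="abc1234"  # placeholder: the last commit known to work
CMD="git checkout $GOOD_SHA && vc --prod"
echo "$CMD"  # review, then run the steps by hand
```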
Other Runbooks
- For Analytics failures where data isn’t being copied see Logical Replication
- We use Kafka for processing inbound email (and for other inbound change data capture). Check out
Runbook: Kafka Issues