- Engineers who are on call should not be allocated to product sprints. The on-call engineer is protecting the rest of the engineering team from distractions and is providing service to the rest of the company. They can't do this well if they're juggling feature development work.
- Engineers are expected to be available to respond to and triage technical issues around the company. Please use our Incident Severity Levels docs to understand the expectations for response time and for escalating to our people for outages.
- The On-Call engineer runs the incident, investigates what's happening, determines the severity level, and is empowered to mobilize other parts of the team at their discretion. It's totally valid to loop @Adam Haney into these decisions if you're unsure.
- If it’s a P3 or more severe, the On-Call communicates (or delegates communication) about the status of the incident to the rest of the company. This includes sending out a company-wide (partners) email and tagging the relevant stakeholders on Slack (leads, operators, PMs).
- After the incident, the On-Call engineer digs in and researches what went wrong, the steps that led us there, and investigates our systems. They then write a Post Mortem doc that goes in the database in Notion. For P0 incidents, they should also email this document to the rest of the company.
- The On-Call engineer will be asked to troubleshoot security issues as they arise. @Drew Sutherland is the RP for security and issues will be discussed in #tool-vanta where the @egt-on-call will be tagged.
- On-Call should triage errors in Sentry.io to determine whether those errors constitute incidents. Using their best judgment, they are encouraged to ship fixes for errors in Sentry during their on-call week.
- When the on-call is not responding to incidents, they should use their time to make our systems better.
  - Check the SYS JIRA project (https://invisible.atlassian.net/jira/software/projects/SYS/boards/50/backlog) if you need ideas for projects to work on
  - Or fix bugs you see coming through Sentry
- On-call should follow #egt-alerts in Slack closely and respond to those alerts
- At the start of your shift, make sure to take over the @egt-oncall group in Slack. Follow this guide to set yourself as the @egt-oncall: How To: Set Yourself as egt-oncall in Slack
- At the end of your on-call shift, you should write up what happened while you were on call, the status of any ongoing incidents, and a description of the better engineering tasks you were able to complete. You can also refer to this template, On Call Report Template, to write the on-call report (please send this as an email and also add it to the database). To make the report easier to write at the end of your shift, we suggest creating it at the beginning of your shift to take notes in, then cleaning up and editing it before the hand-off meeting.
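The communication rules for incidents described above can be sketched as a small lookup. This is only an illustration, not an official tool: it assumes severity labels are written as "P" followed by a number with P0 the most severe, and the function names are hypothetical. The Incident Severity Levels docs remain the authoritative source.

```python
def severity_rank(level: str) -> int:
    """Parse a label such as "P2" into its number (lower means more severe)."""
    label = level.strip().upper()
    if len(label) < 2 or label[0] != "P" or not label[1:].isdecimal():
        raise ValueError(f"unrecognized severity label: {level!r}")
    return int(label[1:])


def required_follow_ups(level: str) -> list[str]:
    """List the communication steps this document asks for at a given severity."""
    rank = severity_rank(level)
    # Every incident gets a Post Mortem doc after it is resolved.
    actions = ["write a Post Mortem doc and add it to the Notion database"]
    if rank <= 3:
        # P3 or more severe: keep the rest of the company informed.
        actions.insert(
            0, "send a company-wide (partners) email and tag stakeholders on Slack"
        )
    if rank == 0:
        # P0 only: the Post Mortem is also emailed out.
        actions.append("email the Post Mortem doc to the rest of the company")
    return actions


if __name__ == "__main__":
    for step in required_follow_ups("P0"):
        print("-", step)
```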
Tools We Use
- Ops Genie is our on-call scheduling and alerting tool. You can find it at: https://invisible.app.opsgenie.com/
- To view the current on-call schedule look here: https://invisible.app.opsgenie.com/settings/schedule/detail/ecab26ad-b22c-42a7-846c-712078d8281b
Setting Up On Call
- Make sure you're set up in the on-call schedule in Ops Genie (https://invisible.app.opsgenie.com/alert/list)
- Download the Ops Genie App and make sure you have alerts set up
- Make sure that you have SMS and Voice notifications set up. You're expected to respond promptly to alerts, so use your best judgment about how you wish to be alerted for a timely response.
- You can set alert preferences here: https://invisible.app.opsgenie.com/settings/user/notification
- @Keenahn Jung sez: TEST the connections by creating a P1 alert and ensuring that you receive the proper notifications. As of 2/2/22, the voice calls did not seem to be working. If this is the case, please manually add your phone number to the #egt-alerts channel and the @egt-oncall group descriptions in Slack.
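One way to create the P1 test alert mentioned above is through the Opsgenie Alert API instead of the web UI. The sketch below is a minimal example under stated assumptions: it assumes you have an API key from an Opsgenie API integration, that the account is on the default (non-EU) API host, and that the key is stored in an environment variable named `OPSGENIE_API_KEY` (a name chosen here for illustration). The message text and the helper function name are also hypothetical.

```python
import json
import os
import urllib.request

# Default Opsgenie Alert API endpoint (EU-hosted accounts use a different host).
OPSGENIE_ALERTS_URL = "https://api.opsgenie.com/v2/alerts"


def build_test_alert_request(api_key: str, who: str) -> urllib.request.Request:
    """Build (but do not send) a request that creates a P1 test alert."""
    payload = {
        "message": f"[TEST] On-call notification check for {who}",
        "description": "Test alert to confirm notifications arrive. Safe to close.",
        "priority": "P1",
    }
    return urllib.request.Request(
        OPSGENIE_ALERTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"GenieKey {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    key = os.environ.get("OPSGENIE_API_KEY")  # assumed variable name
    if not key:
        print("Set OPSGENIE_API_KEY to send the test alert.")
    else:
        request = build_test_alert_request(key, "your name")
        with urllib.request.urlopen(request) as response:
            print(response.status, response.read().decode("utf-8"))
```

This creates a real alert, so whoever the alert is routed to will be notified at P1 priority. Close the alert in Ops Genie once you have confirmed that the app, SMS, and voice notifications arrived.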
General Guidance
- What should I do if I discover a bug (NOT an incident)? Should I fix it or create a bug ticket?
  - This is very case-dependent. Help when you can, but prioritize incidents.