On-Call manual: Onboarding a new person to the on-call rotation

On-Call manual: Onboarding a new person to the on-call rotation

One (selfish) reason to celebrate a new team member is that they will eventually join the on-call rotation. And when they do, the existing shifts will move farther apart. However, adding an unprepared engineer to the on-call rotation can be a disaster. This post describes what on-call onboarding looks like on our team.

The on-call onboarding process is the same for each new team member. It consists of the following steps:

Regular ramp-up
On-call overview
Shadow shift
Reverse shadow shift
First solo shift

Let’s look into each of these steps in more detail.

Regular ramp-up

The regular ramp-up aims to help new team members familiarize themselves with the problems the team is solving and teach them how to work effectively in the team’s codebase. We want new colleagues to work on the code they will be responsible for when they are on call later. This approach allows them to acquire basic context that will be useful for maintaining this code and troubleshooting issues.

On-call overview

Regular ramp-up is rarely sufficient for new people to grasp the entire infra the team is responsible for. And knowing this infra is just the tip of the iceberg. There is much more an effective on-call needs to be familiar with, for instance:

what are the dependencies, and what is the impact of their failures
how to find dashboards and use them for debugging
where to find the documentation (e.g., runbooks)
expectations, e.g., is the on-call responsible for alerts raised outside working hours
how to do deployments and rollbacks
tools used to troubleshoot and fix issues
standard operating procedures
and more

On our team, we organize knowledge-sharing sessions that give new team members an overview of all these areas. We record these sessions to make revisiting unclear topics easy.

Shadow on-call shift

During the shadow on-call shift, the on-call-in-training (a.k.a. secondary on-call) shadows an experienced on-call (a.k.a. primary on-call). Both on-calls are subscribed to all tasks and alerts, but resolving issues is the primary on-call’s responsibility. The primary on-call is expected to show the secondary on-call how to deal with outages. This is usually limited to problems occurring during working hours. Finally, the primary on-call can ask the secondary on-call to handle non-critical tasks, providing guidance as needed.

Reverse shadow on-call shift

After the shadow shift, things get real: the on-call in training becomes the primary on-call. They are now responsible for handling all alerts, tasks, deployments, etc. However, they are not alone—they have an experienced on-call having their back during the entire shift.

We schedule shadow and reverse shadow shifts back-to-back. This way, everything the on-call-in-training learned during the first shift is fresh when they become the primary on-call.

First solo shift

Once shadowing is complete, we add the new team member to the on-call rotation. We add them to the queue’s end, giving them additional time to learn more about our systems and the infrastructure.

In addition to training new on-calls, our team maintains a chat to discuss on-call problems and get help when resolving issues. Both new and experienced on-calls regularly use this chat when they are stuck because they know someone will be there to help them.

💙 If you liked this article…

I publish a weekly newsletter for software engineers who want to grow their careers. I share mistakes I’ve made and lessons I’ve learned over the past 20 years as a software engineer.

Sign up here to get articles like this delivered to your inbox:
https://www.growingdev.net/

Please follow and like us:
Pin Share