How we handle on-call rotations


At Resend, on-call exists to give every production issue a clear owner.

The goal of on-call is not heroics. The goal is to restore service quickly, escalate early, communicate clearly, and improve the system after every incident.

On-call is for production alerts and customer-impacting reliability issues.

Weekly rotation

Our on-call rotation runs weekly. Each week, we assign:

  • One primary on-call engineer
  • One secondary on-call engineer
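A weekly assignment like this can be computed mechanically from a roster. A minimal sketch, where the names, the epoch date, and the pairing scheme (the next person in the roster is secondary) are all illustrative, not our actual schedule:

```python
from datetime import date

ROSTER = ["alice", "bob", "carol", "dan"]  # hypothetical names
EPOCH = date(2024, 1, 1)  # a Monday; rotation weeks are counted from here


def on_call_pair(day: date) -> tuple[str, str]:
    """Return (primary, secondary) for the rotation week containing `day`."""
    week = (day - EPOCH).days // 7
    primary = ROSTER[week % len(ROSTER)]
    secondary = ROSTER[(week + 1) % len(ROSTER)]
    return primary, secondary
```

Counting weeks from a fixed epoch (rather than using calendar week numbers) keeps the rotation continuous across year boundaries, so a swap only ever needs to touch the calendar, not the code.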

On-call coverage

Primary

The primary on-call engineer is the first responder for alerts during the week.

The primary is expected to:

  • Stay reachable throughout the rotation
  • Have a laptop and reliable internet access
  • Acknowledge alerts quickly
  • Assess impact and severity
  • Start mitigation immediately
  • Escalate early when help is needed
  • Declare an incident when there is customer impact or meaningful degradation

Being primary does not mean solving everything alone. Good on-call engineers ask for help early.

Secondary

The secondary on-call engineer supports the primary and steps in when needed.

The secondary is expected to:

  • Be reachable throughout the rotation
  • Cover short planned gaps when coordinated in advance
  • Take over if the primary is unavailable
  • Help with diagnosis, mitigation, and decision-making during incidents

The secondary should be ready to jump in, not catch up from scratch.

Availability

It is fine to have errands or short personal commitments during your week. What matters is that coverage is always clear.

If you expect to be away from your laptop or without internet for a period of time, coordinate with the secondary ahead of time and make sure they explicitly confirm coverage.

If you cannot provide coverage for the week because of travel, illness, PTO, or other commitments, you are responsible for arranging a swap, ideally with at least one week's notice, and updating the on-call calendar. There should never be ambiguity about who is on call.

Responding to alerts

When an alert fires, the on-call engineer should:

  1. Acknowledge it quickly
  2. Assess customer impact
  3. Start with mitigation
  4. Escalate early if the issue is unclear, high risk, or moving too slowly
  5. Communicate status clearly to customers and in the incident channel

We optimize for restoring service first. A full diagnosis can happen after the system is stable.

If the issue has customer impact or degraded performance, follow our incident process.

Resolving issues

We prefer safe, reversible actions that reduce impact quickly.

This usually means:

  • Using runbooks when they exist
  • Rolling back recent changes when appropriate
  • Disabling risky features or flags when needed
  • Avoiding single-threaded debugging for too long
  • Leaving clear notes so others can join quickly
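
Disabling a risky feature behind a flag is the clearest example of a safe, reversible action. A minimal sketch of the idea, with hypothetical flag and function names, using an in-memory dict where a real system would use a flag service or config store:

```python
# Kill switch: risky code paths check a flag that on-call can flip
# without a deploy. Flipping it back is just as cheap, which is what
# makes this mitigation reversible.
FLAGS = {"new_delivery_path": True}


def send_email(message: str) -> str:
    # Guard the risky path; fall back to the known-stable one.
    if FLAGS.get("new_delivery_path", False):
        return f"sent via new path: {message}"
    return f"sent via stable path: {message}"


# During an incident, on-call disables the risky path:
FLAGS["new_delivery_path"] = False
```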

The fastest path to stability is usually better than the most elegant technical fix in the moment.

Post-incident duties

Once the issue is resolved, the on-call engineer should ensure the operational context is captured and any immediate follow-up work is created.

This includes:

  • Documenting what happened and how it was resolved
  • Capturing missing context in the incident tool or channel
  • Identifying weak alerts, missing runbooks, or tooling gaps
  • Creating follow-up tasks to prevent the same class of issue from happening again
  • Including unresolved risks in the next handoff, if needed

If a formal incident was declared, follow our post-incident review process.

Handling stress

On-call can be stressful. That is normal.

A few things to remember:

  • You are not expected to know everything
  • You are not expected to handle incidents alone
  • Asking for help early is good judgment
  • Repeated pager noise is a systems problem
  • Every incident is a chance to improve our tooling and documentation

We want on-call to be sustainable. If the rotation is too noisy or too stressful, we should improve the system, not normalize the pain.

What good on-call looks like

We evaluate on-call quality through a few simple questions:

  • Are we detecting real issues quickly?
  • Are alerts acknowledged fast?
  • Are we restoring service quickly?
  • Are our alerts actionable, or are they noisy?
  • Are incidents turning into better monitoring, runbooks, and product quality?
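
Two of these questions, how fast alerts are acknowledged and how quickly service is restored, can be tracked as simple averages over incident records. A minimal sketch with made-up records; the field names are illustrative, and real timestamps would come from the paging tool:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records.
incidents = [
    {"fired": datetime(2024, 3, 1, 9, 0),
     "acked": datetime(2024, 3, 1, 9, 3),
     "restored": datetime(2024, 3, 1, 9, 40)},
    {"fired": datetime(2024, 3, 4, 22, 15),
     "acked": datetime(2024, 3, 4, 22, 20),
     "restored": datetime(2024, 3, 4, 23, 5)},
]


def mtta_minutes(records) -> float:
    """Mean time from alert firing to acknowledgement, in minutes."""
    return mean((r["acked"] - r["fired"]).total_seconds() / 60 for r in records)


def mttr_minutes(records) -> float:
    """Mean time from alert firing to service restoration, in minutes."""
    return mean((r["restored"] - r["fired"]).total_seconds() / 60 for r in records)
```

Watching these numbers over time, alongside the noise questions above, tells us whether the rotation is getting healthier or whether the system needs work.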