How we handle on-call rotations


At Resend, on-call exists to give every production issue a clear owner.

The goal of on-call is not heroics. The goal is to restore service quickly, escalate early, communicate clearly, and improve the system after every incident.

On-call is for production alerts and customer-impacting reliability issues.

Weekly rotation

Our on-call rotation runs weekly. Each week, we assign:

  • One primary on-call engineer
  • One secondary on-call engineer
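A weekly assignment like this can be computed mechanically from a roster. A minimal sketch, where the names, the epoch date, and the pairing scheme (the next person in the roster is secondary) are all illustrative, not our actual schedule:

```python
from datetime import date

ROSTER = ["alice", "bob", "carol", "dan"]  # hypothetical names
EPOCH = date(2024, 1, 1)  # a Monday; rotation weeks are counted from here


def on_call_pair(day: date) -> tuple[str, str]:
    """Return (primary, secondary) for the rotation week containing `day`."""
    week = (day - EPOCH).days // 7
    primary = ROSTER[week % len(ROSTER)]
    secondary = ROSTER[(week + 1) % len(ROSTER)]
    return primary, secondary
```

Counting weeks from a fixed epoch (rather than using calendar week numbers) keeps the rotation continuous across year boundaries, so a swap only ever needs to touch the calendar, not the code.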

On-call coverage

Primary

The primary on-call engineer is the first responder for alerts during the week.

The primary is expected to:

  • Stay reachable throughout the rotation
  • Have a laptop and reliable internet access
  • Acknowledge alerts quickly
  • Assess impact and severity
  • Start mitigation immediately
  • Escalate early when help is needed
  • Declare an incident when there is customer impact or meaningful degradation

Being primary does not mean solving everything alone. Good on-call engineers ask for help early.

Secondary

The secondary on-call engineer supports the primary and steps in when needed.

The secondary is expected to:

  • Be reachable throughout the rotation
  • Cover short planned gaps when coordinated in advance
  • Take over if the primary is unavailable
  • Help with diagnosis, mitigation, and decision-making during incidents

The secondary should be ready to jump in, not catch up from scratch.

Availability

It is fine to have errands or short personal commitments during your week. What matters is that coverage is always clear.

If you expect to be away from your laptop or without internet for a period of time, coordinate with the secondary ahead of time and make sure they explicitly confirm coverage.

If you cannot provide coverage for the week because of travel, illness, PTO, or other commitments, you are responsible for arranging a swap, ideally with at least one week's notice, and updating the on-call calendar. There should never be ambiguity about who is on call.

Responding to alerts

When an alert fires, the on-call engineer should:

  1. Acknowledge it quickly
  2. Assess customer impact
  3. Start with mitigation
  4. Escalate early if the issue is unclear, high risk, or moving too slowly
  5. Communicate status clearly to customers and in the incident channel

We optimize for restoring service first. A full diagnosis can happen after the system is stable.

If the issue has customer impact or degraded performance, follow our incident process.

Resolving issues

We prefer safe, reversible actions that reduce impact quickly.

This usually means:

  • Using runbooks when they exist
  • Rolling back recent changes when appropriate
  • Disabling risky features or flags when needed
  • Avoiding single-threaded debugging for too long
  • Leaving clear notes so others can join quickly
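
Disabling a risky feature behind a flag is the clearest example of a safe, reversible action. A minimal sketch of the idea, with hypothetical flag and function names, using an in-memory dict where a real system would use a flag service or config store:

```python
# Kill switch: risky code paths check a flag that on-call can flip
# without a deploy. Flipping it back is just as cheap, which is what
# makes this mitigation reversible.
FLAGS = {"new_delivery_path": True}


def send_email(message: str) -> str:
    # Guard the risky path; fall back to the known-stable one.
    if FLAGS.get("new_delivery_path", False):
        return f"sent via new path: {message}"
    return f"sent via stable path: {message}"


# During an incident, on-call disables the risky path:
FLAGS["new_delivery_path"] = False
```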

The fastest path to stability is usually better than the most elegant technical fix in the moment.

Post-incident duties

Once the issue is resolved, the on-call engineer should ensure the operational context is captured and any immediate follow-up work is created.

This includes:

  • Documenting what happened and how it was resolved
  • Capturing missing context in the incident tool or channel
  • Identifying weak alerts, missing runbooks, or tooling gaps
  • Creating follow-up tasks to prevent the same class of issue from happening again
  • Including unresolved risks in the next handoff, if needed

If a formal incident was declared, follow our post-incident review process.

Handling stress

On-call can be stressful. That is normal.

A few things to remember:

  • You are not expected to know everything
  • You are not expected to handle incidents alone
  • Asking for help early is good judgment
  • Repeated pager noise is a systems problem
  • Every incident is a chance to improve our tooling and documentation

We want on-call to be sustainable. If the rotation is too noisy or too stressful, we should improve the system, not normalize the pain.

What good on-call looks like

We evaluate on-call quality through a few simple questions:

  • Are we detecting real issues quickly?
  • Are alerts acknowledged fast?
  • Are we restoring service quickly?
  • Are our alerts actionable, or are they noisy?
  • Are incidents turning into better monitoring, runbooks, and product quality?
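
Two of these questions, how fast alerts are acknowledged and how quickly service is restored, can be tracked as simple averages over incident records. A minimal sketch with made-up records; the field names are illustrative, and real timestamps would come from the paging tool:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records.
incidents = [
    {"fired": datetime(2024, 3, 1, 9, 0),
     "acked": datetime(2024, 3, 1, 9, 3),
     "restored": datetime(2024, 3, 1, 9, 40)},
    {"fired": datetime(2024, 3, 4, 22, 15),
     "acked": datetime(2024, 3, 4, 22, 20),
     "restored": datetime(2024, 3, 4, 23, 5)},
]


def mtta_minutes(records) -> float:
    """Mean time from alert firing to acknowledgement, in minutes."""
    return mean((r["acked"] - r["fired"]).total_seconds() / 60 for r in records)


def mttr_minutes(records) -> float:
    """Mean time from alert firing to service restoration, in minutes."""
    return mean((r["restored"] - r["fired"]).total_seconds() / 60 for r in records)
```

Watching these numbers over time, alongside the noise questions above, tells us whether the rotation is getting healthier or whether the system needs work.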