Who is responsible for late-night emergency response and repair in case of a production system outage?

Christa B.Eng.
Young tech entrepreneur, recently launched an AI-powered SaaS.

There's a specific term for this: "on-call." Whoever is on duty has to get up.

You can think of it like a doctor on call at a hospital. We engineers follow a rotation schedule; for example, it's my turn this week, and next week it's Zhang's. During my on-call shift, I either have special alerting software installed on my phone, or the company gives me a dedicated phone. As soon as a production service has a problem (e.g., the website is down, users can't log in, payments fail, etc.), the monitoring system detects it automatically and immediately calls me or sends an alert. The alarm sound is usually very jarring, guaranteed to wake you from a deep sleep.
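
To make the rotation idea concrete, here is a minimal sketch of how a weekly schedule could decide whose phone rings. The roster names, the start date, and the function name are all made up for illustration; real teams usually rely on a paging service rather than hand-rolled code like this.

```python
from datetime import date

# Hypothetical roster and start date, for illustration only.
ROSTER = ["me", "Zhang", "Li"]
ROTATION_START = date(2024, 1, 1)  # a Monday; assumed first day of the first shift


def on_call_for(day, roster=ROSTER, start=ROTATION_START):
    """Return who is on call on `day`, assuming shifts rotate every 7 days."""
    weeks_elapsed = (day - start).days // 7
    return roster[weeks_elapsed % len(roster)]


if __name__ == "__main__":
    print(on_call_for(date(2024, 1, 3)))  # first week of the rotation
    print(on_call_for(date(2024, 1, 8)))  # second week of the rotation
```

The monitoring system would then route the alert to whatever name this kind of lookup returns for today's date.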

Upon receiving the call, you have to immediately get out of bed, open your computer, and start troubleshooting the problem.

So, who actually fixes it? There are several scenarios:

  1. The developers themselves: This is the most common situation. "Whoever wrote the code is responsible." Because you wrote the program yourself, you know best where it might go wrong. So, in many companies, members of the development team take turns being on call.
  2. Operations/SRE engineers: In some large companies, there are dedicated teams responsible for keeping production systems stable. They are called "Operations Engineers" or "SREs (Site Reliability Engineers)." Their core job is to keep the service running 24/7, so putting out fires in the middle of the night is part of their "job description." Even so, if they find that the cause is a bug in specific business code, they will ultimately call the developer who wrote it so they can resolve it together.
  3. "Senior firepower": If the on-call engineer can't handle it and the problem is particularly tricky, they will need to "call for backup." This might mean pulling more senior engineers, architects, or even the CTO out of bed. Everyone then joins an emergency call to work through the issue together online.
  4. Startup founders/CTOs: In very small startups with limited staff, it might be the founder themselves or the CTO (Chief Technology Officer) who gets up in the middle of the night to fix bugs. After all, it's their company, and a downed service hurts them more than anyone.
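
The "call for backup" step is often automated as an escalation chain: if nobody has the incident under control after a while, the next person gets paged. Below is a rough sketch of that idea. The roles mirror the ones above, but the minute thresholds and the function name are assumptions I picked for the example, not any particular tool's behavior.

```python
# Hypothetical escalation chain: (minutes the incident has been open, who gets paged).
# The thresholds are illustrative assumptions, not a standard.
ESCALATION_CHAIN = [
    (0, "on-call engineer"),
    (15, "senior engineer"),
    (30, "architect"),
    (45, "CTO"),
]


def paged_so_far(minutes_open, chain=ESCALATION_CHAIN):
    """Return everyone who should have been paged after `minutes_open` minutes."""
    return [who for after_minutes, who in chain if minutes_open >= after_minutes]


if __name__ == "__main__":
    for minutes in (0, 20, 60):
        print(minutes, paged_so_far(minutes))
```

In a tiny startup the chain might have a single entry: the founder.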

The whole process generally goes like this:

  • "Beep, beep, beep!" The alarm goes off.
  • Groggily answer the call, listening to a robot voice announce, "Service X is down."
  • Roll out of bed and rush to the computer.
  • Log into various systems, check monitoring dashboards, and review logs, searching like a detective for clues that point to the root cause.
  • Once the problem is identified, begin "emergency treatment." Sometimes it's as simple as restarting a service (like rebooting a computer), sometimes it means reverting a newly deployed version of the code to the previous stable one (a "rollback"), and sometimes it requires urgently writing a piece of code to fix a bug.
  • After resolving the issue, you can't immediately go back to sleep. You need to monitor it for a period to ensure the service is truly stable.
  • Finally, write an "incident report" to explain to the entire team in the morning: what happened last night, how it was resolved, and how to prevent it in the future.
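
The "monitor it for a period" step can be pictured as a simple loop: keep checking the service and only call it stable after several healthy checks in a row. This is a minimal sketch under my own assumptions; the function name, the parameters, and the idea of passing in a `check` function are all illustrative, and in practice you would be watching real dashboards.

```python
import time


def watch_until_stable(check, required_passes=5, max_checks=30,
                       interval_seconds=60.0, sleep=time.sleep):
    """Return True once `check()` succeeds `required_passes` times in a row.

    `check` is any zero-argument function returning True when the service
    looks healthy (hypothetical; e.g. a wrapper around a health endpoint).
    Returns False if stability is not reached within `max_checks` attempts.
    """
    consecutive_passes = 0
    for attempt in range(max_checks):
        if check():
            consecutive_passes += 1
            if consecutive_passes >= required_passes:
                return True
        else:
            # Any failure resets the streak: the fix has not held yet.
            consecutive_passes = 0
        if attempt < max_checks - 1:
            sleep(interval_seconds)
    return False


if __name__ == "__main__":
    # Simulated health results instead of a real service, with no real waiting.
    results = iter([True, False, True, True, True])
    print(watch_until_stable(lambda: next(results), required_passes=3,
                             max_checks=5, sleep=lambda seconds: None))
```

Resetting the streak on any failure is the point of the design: one good check right after a restart doesn't prove much, so you only go back to bed after a clean run.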

So, if you see a website go down at 3 AM and come back up ten minutes later, it's highly likely that an engineer just made a frantic dash from bed to computer. This really is part of an IT engineer's job; it's tough, but it comes with the responsibility.