Is your team’s new engineer ready to take on-call? Use wargames for training
At Qualtrics, our engineering organization is expanding rapidly. For my team, Text Analytics, we’ve gone from four to eight members in less than six months, and more are coming. As team members join, part of the on-boarding process is preparing them to take on-call. The on-call engineer is the first responder for incidents -- responsible for either resolving the issue or escalating it if they need help. Culturally, we’ve decided that teams are responsible for the systems they build, which promotes ownership and ensures that maintainability issues actually get solved.
To be ready for on-call rotation, a team member must:
- be familiar with our processes,
- have certain accounts, permissions, and tools, and
- be able to diagnose and correct problems in our tech stack.
Although some of our new members transferred from other teams and are well versed in the company’s on-call procedures, every team’s tech stack is different, so all engineers go through a learning process. We’ve been using wargames as part of this process to train new employees and to stress test our processes and runbooks.
For the latest wargame, we established two teams: a red team and a blue team. The blue team was made up of three new team members, charged with diagnosing and correcting problems as they arose. The red team consisted of our team lead, who volunteered to act as the gremlin causing the issues. I acted as judge and coordinator. I scheduled a conference room for half an hour, but the games ultimately lasted more than an hour because we were having a lot of fun.
The red team prepared six issues beforehand, selected from common or instructive incidents that had occurred over the past six months. The list was:
Issue and Resolution | Caused By |
---|---|
Service dies. Blue team restarts the service. | `kill service` |
Check-disk alert; disk is low on space. Red team “hid” the file, so the blue team had to use tools (e.g. `ncdu`) to find it and delete it (after verifying it was safe to delete). A sketch of this round follows the table. | `fallocate -l 6g file` |
A downstream service dies. Red team killed a downstream service whose health is reported in a higher-level service’s health check. Restarting the higher-level service alone is insufficient to restore health. Blue team restarts the downstream service and then the upstream service, restoring connectivity. | `kill service` |
High CPU load. Blue team finds the offending process and kills it. (This was difficult to simulate; automated processes killed fork bombs and handled some runaway queries on their own. Go tech ops for hardening our base systems!) | `dd if=/dev/zero of=/dev/null` |
High count of 404 errors. This simulated an inconsistent issue (some servers had the file, some did not) that required understanding how files are served in our architecture. Blue team resyncs the static assets. | `rm file` |
High count of authentication errors, without alarms. This required the blue team to trace authentication errors between services and, once they found the root issue (a dead service), determine why alarms hadn’t fired -- driving understanding of how monitoring and alerts travel among our systems. | `stash alert` |
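To give a concrete flavor of how a round played out, here is a minimal sketch of the check-disk scenario. The `fallocate` and `ncdu` commands come from the table above; the remaining commands and the paths are illustrative, not the locations the red team actually used.

```bash
# Red team: eat ~6 GB of disk with a file tucked into a non-obvious directory
# (this path is hypothetical).
sudo fallocate -l 6g /var/lib/example-service/.cache/padding

# Blue team: the check-disk alert fires; confirm which filesystem is full,
# then drill down to the offending file.
df -h
sudo ncdu -x /                               # interactively walk the largest directories
sudo du -xh / 2>/dev/null | sort -rh | head  # alternative if ncdu is not installed

# After confirming the file is safe to delete:
sudo rm /var/lib/example-service/.cache/padding
```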
During the wargame, the blue team rotated who was on-call but dug into each issue collectively as it arose. To keep some idea of what the current on-call engineer was doing, we shared their desktop on the conference room’s display.
The first few scenarios went smoothly. In these scenarios (e.g. a dead service), the alerts aligned with the corrective action, and the blue team already knew the core commands for restoring health from their regular development activities. They had a more difficult time solving the downstream, or “deep”, health check problem. For one thing, the alert looked very similar to a regular health check alert (the “deep” was buried at the end of a long string). For another, once they checked the logs they knew a downstream service was at fault, but not which one. (Clearer logs became a story for our backlog.)
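The diagnosis ultimately came down to distinguishing the two health checks and then following the dependency one level down. A rough sketch of that flow is below; the endpoint paths, service names, and use of systemd are assumptions for illustration, not our actual setup.

```bash
# Shallow check: only covers the service process itself, so it can still look healthy.
curl -s http://localhost:8080/health

# Deep check: includes downstream dependencies, so this is the one that fails.
curl -s http://localhost:8080/health/deep

# The logs point at a failing downstream dependency; restart it first, then the
# higher-level service (restarting only the higher-level service was not enough).
sudo systemctl restart downstream-service
sudo systemctl restart upstream-service
```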
By the time we got to the complex scenarios, the team had the service documentation, runbooks, and architecture diagrams up, so they could quickly run through a number of diagnostics and explore multiple options in parallel. However, because the scenarios began focusing on second-order effects, or on problems driven more by our infrastructure than by our own code, the blue team was still challenged. The training shifted away from understanding our processes and architecture and toward troubleshooting tools and reasoning about the system.
Lessons Learned
We learned a few things right away: one engineer’s phone was dead, and another had alerts set up incorrectly. Although we annotate most of our alerts with links to a runbook that explains how to correct the issue, we had failed to tell the new engineers that these runbooks existed -- a hole in our training.
Also, sharing desktops did not work well; the activity was too fluid and detailed to follow on a single display. Next time we will probably run a call-leader scenario, with the team adopting leader and scribe roles to better simulate larger incidents.
Within their local development environments, our newly hired engineers were experienced troubleshooters, but it turned out this knowledge did not generalize well to production. For example, restarting services in production is a different process than within docker-machine. Files are served from a local directory in development but are served from linked data containers in production. Their difficulties showed holes in our documentation and training.
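As a rough illustration of the gap, the sketch below contrasts the two workflows. The machine, container, and volume names are made up, and the production commands assume the classic linked-data-container pattern rather than describing our exact deployment.

```bash
# Local development (docker-machine): point the Docker client at the dev VM and
# restart the container; static files come from a bind-mounted local directory.
eval "$(docker-machine env dev)"
docker restart text-analytics

# Production (assumed sketch): static assets live in a separate data container,
# so the service is recreated with --volumes-from instead of a local bind mount.
docker stop text-analytics && docker rm text-analytics
docker run -d --name text-analytics \
  --volumes-from static-assets-data \
  text-analytics:latest
```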
Because all three engineers were diagnosing and correcting each issue at the same time, activity was hectic and communication broke down. Since the on-call engineer can correct most issues alone, this isn’t normally a problem, but it does indicate we may need to spend more time establishing structured processes for communication and coordination during a major incident. In fact, later experience led us to develop a wargame specifically around these incident roles.
Finally, the blue team realized (though they did not take advantage of it) that the red team’s actions could be traced through the command-line history. If we ever get a more nefarious group of hires, we may need to find a way to cheat the auditing systems.
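Even the plain shell history would have given the gremlin away; the snippet below is a trivial example, and the username is hypothetical.

```bash
# The red team's recent commands, visible to anyone on the same account:
history | tail -n 20

# Or, with sufficient privileges, another user's saved history:
sudo cat /home/gremlin/.bash_history
```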
Closing Thoughts
With a few collective hours of work, we were able to provide both training and a stress test for our on-call engineers. It was far more enjoyable and engaging than a traditional training session, and it revealed problems that would only appear under stress. Wargaming can be a powerful technique for on-boarding and for maintaining a team’s ability to respond to issues as they arise. W.O.P.R. was wrong; the winning move is to play.