From time to time, incidents occur that our operations team needs to quickly fix in order to keep all of Second Life working well 24x7 for users around the world.
How does the Linden Lab ops team collaborate to quickly tackle these incidents? Our VP of Operations and Platform Engineering, Landon McDowell (Landon Linden), has written a great description of an early experience he had with our approach as well as some thoughts on why it works so well. This is a bit outside the usual “Tools & Tech” topics for this blog, but we thought Second Life users familiar with how operations teams work would appreciate the inside look at our team’s approach:
Two weeks into my tenure in the Operations group at Linden Lab I was confronted with my first major incident there. It was early afternoon, and I was well into a post-ramen food coma when alarms started popping off in IRC. All of our major charts were taking a header - logins, concurrency, etc.
The call went out in #ops for hands, but I had already jumped in. This wasn’t my first rodeo. I was primed to hop onto a conference call or pile into a room to marshal a response. But that never happened.
Instead, responders started piping up in IRC with, “Hands.” Soon I was completely overwhelmed by a stream of text flying across my screen as engineers were reporting back and discussing findings.
The problem was quickly narrowed down to a particular load balancer. I was barely into the box before an engineer chimed in, “It's running out of ports.” From there the resolution was straightforward: some quick TCP tuning and adding another backend to the pool to stabilize things before proceeding to long-term fixes.
I, though, just sat there staring at the screen wondering what the hell had just happened, wondering what the hell I had gotten myself into. I thought I was a seasoned pro, but I had never ever seen an incident response go that smoothly or quickly. Panic started to set in. I was out of my league.
In the day that followed, I was able to review the incident by reading the chat log, referred to as the scrollback. My confidence slowly began to rebuild. I stepped through the incident response line by line, server by server, action by action. After we completed the postmortem, I felt that with more practice and experience I could do this. I also realized that, for the initiated, chat-centric incident response is far and away the best, most efficient method of handling outages.
Text communication is simply faster. The average adult can read about twice as fast as they can listen. This effect is amplified because chat comms are multiplexed, meaning multiple speakers can talk intelligibly at the same time. With practice, a participant can even quickly understand multiple conversations interleaved in the same channel. The power of this cannot be overstated.
In a room or on a conference call, there can only be one speaker at a time. During an outage when tensions are high, this kind of order can be difficult to maintain. People naturally want to blurt out what they are seeing. There are methods of dealing with this, such as having a leader designate speakers or using “conch shell” type protocols. In practice, though, what often prevails is what one of my vendors calls the “Mountain View Protocol,” where the loudest speaker is the one who’s heard.
In text, responders are able to hop out of a conversation, focus on some investigation or action, hop back in, and quickly catch up thanks to the scrollback. In verbal comms, responders check out to do some work and lose track of the conversation, which results in a lot of repetition.
Responders never all show up simultaneously. Often they have to be pulled in mid-incident. The power of the chat log really comes through here as latecomers get an automatic up-to-the-second sitrep. “Reading scrollback” is our standard entrance, letting everyone know someone new has engaged and needs a minute to catch up. Even in cases when a quick briefing to a newcomer is necessary, one person can break off into a separate channel or a private message without having to disengage from the main conversation.
Other kinds of text sidebars are of course useful in incident responses. For example, emotions run high during outages and occasionally you have to ask someone to cool their jets. This is done quickly and effectively in a private chat message without embarrassing them in front of the rest of the team.
At Linden Lab, we use a designated Incident Commander to orchestrate incident responses. Chat systems give an easy way to flag whoever is running the show by chat handle and/or in the channel topic. Anyone jumping in knows immediately who is in charge without having to distract the response team by asking.
Running an incident response in a chat channel is also an incredibly effective way of passively disseminating information to a wide audience. A large number of people can quietly lurk in a chat channel unlike in a physical space. More formal status updates to various parties, like support, are of course sometimes necessary but enabling those parties to follow along in real time gives them context that would not otherwise be conveyed in a terse status report.
As a final bonus, we are able to respond to a problem at peak efficiency regardless of where anyone is at that moment. Issues don’t wait for office hours to crop up. Because we are a distributed team, this is really our only option, but it rocks that being distributed is an advantage in incident response.
The benefits of chat-based incident responses don’t end with the incident, though. Having a detailed log of events is invaluable in conducting postmortems. People have terrible memories, especially during high-stress events. The log gives a history of events with precise times that could never be achieved by relying on responders' recollections.
Likewise, the chat log for an incident is a potent teaching tool. New hires can use these logs to learn about the particulars and eccentricities of systems in a way that is rarely captured in documentation or direct instruction. But in general, the log gives a remarkably clear picture of what went right and what went wrong in an incident response, letting the team better iterate and improve on their process over time.
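To make the point about precise times concrete, here is a minimal Python sketch of how a scrollback could be turned into a postmortem timeline. It is an illustration only, not a description of Linden Lab's tooling: the log line format (`[YYYY-MM-DD HH:MM:SS] <nick> message`), the function names, the nicknames, and the timestamps in the sample are all assumptions made up for this example.

```python
import re
from datetime import datetime

# Assumed (hypothetical) log line format: "[2000-01-01 13:00:00] <nick> message"
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] <(?P<nick>[^>]+)> (?P<text>.*)$"
)


def parse_scrollback(lines):
    """Return (timestamp, nick, text) tuples for lines matching the assumed format."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip joins, topic changes, and anything else that is not a message
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        events.append((ts, match.group("nick"), match.group("text")))
    return events


def timeline(events):
    """Convert absolute timestamps into seconds elapsed since the first message."""
    if not events:
        return []
    start = events[0][0]
    return [(int((ts - start).total_seconds()), nick, text) for ts, nick, text in events]


# Invented sample scrollback; nicknames and times are placeholders.
sample_log = [
    "[2000-01-01 13:00:00] <oncall> logins and concurrency are dropping, need hands",
    "[2000-01-01 13:00:20] <responder_a> Hands.",
    "*** topic changed ***",
    "[2000-01-01 13:04:05] <responder_b> It's running out of ports.",
]

incident_timeline = timeline(parse_scrollback(sample_log))
for offset, nick, text in incident_timeline:
    print(f"+{offset:>4}s  {nick}: {text}")
```

Offsets from the first message, rather than wall-clock times, make it easy to see how long each stage of the response took without relying on anyone's memory.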
Chat-based incident response isn’t easy. It requires group discipline and commitment as it runs counter to our instincts about communication. It can be nerve-racking for newcomers to the practice. Not everyone can hack it. Extremely smart people can and do wash out from not being able to keep up. But when it works it is a wondrous thing to behold, a ballet in a war zone, beautiful, terrifying, and glorious.