Hi! I’m a member of the Second Life operations team, and I was the primary on-call systems engineer this past weekend. We had a very difficult weekend, so I wanted to take a few minutes to share what happened.
We had a series of independent failures happen that produced the rough waters Residents experienced inworld.
Shortly after midnight Pacific time on January 9th (Saturday) we had the master node of one of the central databases crash. The central database that happened to go down was one the most used databases in Second Life. Without it Residents are unable to log in, or do, well, a lot of important things.
This sort of failure is something my team is good at handling, but it takes time for us to promote a replica up the chain to ultimately become the new master node. While we’re doing this we block logins and close other inworld services to help take the pressure off the newly promoted master node when it starts taking queries. (We reopen the grid slowly, turning on services one at a time, as the database is able to handle it.) The promotion process took about an hour and a half, and the grid returned to normal by 1:30am.
After this promotion took place the grid was stable the rest of the day on Saturday, and that evening.
That brings us to Sunday morning.
Around 8:00am Pacific on January 10th (Sunday), one of our providers start experiencing issues, which resulted in very poor performance in loading assets inworld. I very quickly got on the phone with them as they tracked down the source of the issue. With my team and the remote team working together we were able to spot the problem, and get it resolved by early afternoon. All of our metrics looked good, and I and my colleagues were able to rez assets inworld just fine. It was at this point that we posted the first “All Clear” on the blog, because it appeared that things were back to normal.
It didn’t take us long to realize that things were about to get interesting again, however.
Shortly after we declared all clear, Residents rushed to return to the grid. (Sunday afternoon is a very busy time inworld, even under normal circumstances!) The rush of Residents returning to Second Life (a lot of whom now had empty caches that needed to be re-filled) at a time when our concurrency is the highest put many other subsystems under several times their normal load.
Rezzing assets was now fine, but we had other issues to figure out. It took us a few more hours after the first all clear for us to be able to stabilize our other services. As some folks noticed, the system that was under the highest load was the one that does what we call “baking” - it’s what makes the texture you see on your avatar - thus we had a large number of Residents that either appeared gray, or as clouds. (It was still trying to get caught up from the asset loading outage earlier!) By Sunday evening we were able to re-stabilize the grid, and Second Life returned to normal for real.
One of the things I like about my job is that Second Life is a totally unique and fun environment! (The infrastructure of a virtual world is amazing to me!) This is both good and bad. It’s good because we’re often challenged to come up with a solution to a problem that’s new and unique, but the flip side of this is that sometimes things can break in unexpected ways because we’re doing things that no one else does.
I’m really sorry for how rough things were inworld this weekend. My team takes the stability of the grid extremely seriously, and no one dislikes downtime more than us. Either one of these failures happening independently is bad enough, but having them occur in a series like that is fairly miserable.
See you inworld (after I get some sleep!),