As many Residents saw, we had a pretty rough day on the Grid yesterday. I wanted to take a few minutes and explain what happened. All of the times in this blog post are going to be in Pacific Time, aka SLT.
Shortly after 10:30am, the master node of one of the central databases crashed. This is the same type of crash we’ve experienced before, and we handled it in the same way. We shut down a lot of services (including logins) so we could bring services back up in an orderly manner, and then promptly selected a new master and promoted it up the chain. This took roughly an hour, as it usually does.
A few minutes before 11:30am we started the process of restoring all services to the Grid. When we enabled logins, we did it in our usual method - turning on about half of the servers at once. Normally this works out as a throttle pretty well, but in this case, we were well into a very busy part of the day. Demand to login was very high, and the number of Residents trying to log in at once was more than the new master database node could handle.
Around noon we made the call to close off logins again and allow the system to cool off. While we were waiting for things to settle down we did some digging to try to figure out what was unique about this failure, and what we’ll need to do to prevent it next time.
We tried again at roughly 12:30pm, doing a third of the login hosts at a time, but this too was too much. We had to stop on that attempt and shut down all logins again around 1:00pm.
On our third attempt, which started once the system cooled down again, we took it really slowly, and brought up each login host one at a time. This worked, and everything was back to normal around 2:30pm.
My team is trying to figure out why we had to turn the login servers back on much more slowly than in the past. We’re still not sure. It’s a pretty interesting challenge, and solving hard problems is part of the fun of running Second Life.
Voice services also went down around this time, but for a completely unrelated reason. It was just bad luck and timing.
We did have one bright spot! Our status blog handled the load of thousands of Residents checking it all at once much better. We know it wasn’t perfect, but it showed much improvement over the last central database failure, and we’ll keep getting better
My team takes the stability of Second Life very seriously, and we’re sorry about this outage. We now have a new challenging problem to solve, and we’re on it.