Heya! April Linden here.
We had a pretty rough morning here at the Lab, and I want to tell you what happened.
Early this morning (during the grid roll, but it was just a coincidence) we had a piece of hardware die on our internal network. When this piece of hardware died, it made it very difficult for the servers on the grid to figure out how to convert a human-readable domain name, like www.secondlife.com, into an IP address, like 220.127.116.11.
Everything was still up and running, but none of the computers could actually find each other on our network, so activity on the grid ground to a halt. The Second Life grid is a huge collection of computers, and if they can’t find each other, things like switching regions, teleports, accessing your inventory, changing outfits, and even chatting fail. This caused a lot of Residents to try to relog.
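For the curious, here is a minimal sketch of what "finding each other" looks like from one computer's point of view. This is not our actual code, just an illustration using Python's standard library. The `resolve` helper and the `broken_resolver` stand-in are made-up names for this example, and the simulated outage is an assumption about what a failed lookup looks like to a program.

```python
import socket


def resolve(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses for hostname.

    Returns an empty list when the name lookup fails, which is roughly
    what a server sees when name resolution is unavailable.
    """
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        # The name could not be turned into an address.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address string is the first element of sockaddr.
    return sorted({info[4][0] for info in results})


def broken_resolver(hostname, port):
    """Hypothetical stand-in that simulates a name lookup outage."""
    raise socket.gaierror(socket.EAI_NONAME, "simulated resolver outage")


if __name__ == "__main__":
    # An IP literal needs no lookup, so this works even during an outage.
    print(resolve("127.0.0.1"))
    # With the simulated outage, the name cannot be resolved at all.
    print(resolve("www.secondlife.com", resolver=broken_resolver))
```

The server asking the question is perfectly healthy in both cases. It simply has no address to connect to when the lookup fails, which is why everything looked "up" while nothing could talk to anything else.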
We rushed to get the hardware that died replaced, but replacing hardware takes time, and in this case it was a couple of hours. It was very eerie watching our grid monitors. At one point the “Logins Per Minute” metric was reading “1,” and the “Percentage of Successful Teleports” was reading “2%.” I hope to never see numbers like this again.
Once the failed hardware was replaced, the grid started to come back to life.
Following the hardware failure, the login servers got into a really unusual state. The login server would tell the Resident’s viewer that the login was unsuccessful, but it was telling the grid itself that the Resident had logged in. This mismatch in communication made finding what was going on really difficult, because it looked like Residents were logging in when really they weren’t. We eventually found what wasn’t working right on the login servers following the hardware failure and corrected it, at which point the grid returned to normal.
There is some good news to share! We are currently in the middle of testing our next generation login servers, which have been specifically designed to better withstand this type of failure. We’ve had a few of the next generation login servers in the pool for the last few days just to see how they handle actual Resident traffic, and they held up really well! In fact, we think the only reason Residents were able to log in at all during this outage was that they happened to get really lucky and got randomly assigned to one of the next generation login servers that we’re testing.
The next step for us is to finish up testing the next generation login servers and have them take over for all login requests entirely. (Hopefully soon!)
We’re really sorry about the downtime today. This one was a doozy, and recovering from it was interesting, to say the least. My team takes the health and stability of Second Life really seriously, and we’re all a little worn out this afternoon.
Your friendly long-eared GridBun,