Well, this is not the way we wanted the past week to go -- multiple outages in the same week! Not fun for anyone. Here’s what happened.
Login and Inventory Issues: Monday, 1/24 - Friday, 1/28
On Monday, January 24, residents’ inventories stopped responding to requests, and as a result some residents were unable to log in. We identified the affected accounts, fixed the problem, and restored service to full operation in a little under three hours.
Beginning on Wednesday, January 26, we received reports of intermittent inventory issues -- failures to rez, unpack, and delete inventory items -- and intermittent errors at login.
When the number of reports increased, it became clear that this was an outage. Several Lindens jumped in to diagnose the problem, and after some digging we discovered that our back-end infrastructure was being overloaded. Once this was resolved, the positive impact was almost immediate: the data on the back end now looks good, and we are no longer receiving reports of inventory problems.
We’re taking steps, including a deploy late last week, to prevent these issues in the future -- and we have already seen progress in making the service more robust.
Weekly Rolling Restarts: Tuesday, 1/25
Every week we restart the simulator servers to keep them running smoothly. We restart only a limited number at a time, wait for them to finish, then start the next batch. That’s why we call them “rolling restarts.” But on Tuesday, a recent upgrade to the simulators meant that the usual number of simultaneous restarts was more than the system could handle. The result was load spikes and numerous regions going down. Ultimately more than 12,000 regions, about 40% of the grid, were stuck in restart mode. Not good!
The team came together quickly to bring simulators back up in smaller batches, and then manually fixed blocks of regions. After some trials and monitoring, we found a smaller number of concurrent restarts that worked better. By 1:00 PM all regions were restored and operating normally. We apologize for this lengthy outage!
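For readers curious how batched restarts work in principle, here is a minimal Python sketch. Everything in it -- the `rolling_restart` function, the `restart` callback, the batch size -- is an illustrative assumption, not Linden Lab’s actual tooling; the point is simply that lowering the batch size lowers the number of concurrent restarts.

```python
# Hypothetical sketch of a rolling restart: restart regions in small
# batches, letting each batch finish before the next one starts.
# Names and numbers here are illustrative, not Linden Lab's real tooling.

BATCH_SIZE = 3  # tuning this down is the "smaller number of concurrent restarts"

def rolling_restart(regions, restart, batch_size=BATCH_SIZE):
    """Restart `regions` batch_size at a time; return the batches in order."""
    batches = []
    for i in range(0, len(regions), batch_size):
        batch = regions[i:i + batch_size]
        for region in batch:
            restart(region)  # kick off this region's restart
        # real tooling would wait here until the batch reports healthy
        batches.append(batch)
    return batches

if __name__ == "__main__":
    done = []
    batches = rolling_restart([f"region-{n}" for n in range(7)],
                              restart=done.append)
    print([len(b) for b in batches])  # three batches: 3, 3, then 1
```

The recovery described above amounts to rerunning this loop with a smaller `batch_size`, so the grid never sees more concurrent restarts than it can absorb.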
Advance… Retreat!: Wednesday, 1/26
We had planned a deploy to the production grid that would help us gather information on group chat performance. Unfortunately, the procedure that worked during testing did not work in production. We figured -- OK, we’ll just roll back. But then the rollback itself had problems! Residents experienced issues with login, group chat, and presence information. The team was able to isolate the problem and complete the rollback, getting a few more gray hairs in the process. We’re going to take what we’ve learned and do a better deploy that will give us the information we need to improve group chat in the long run.
Last week was no fun for a lot of people at the Lab. Thank you for your patience. We really don’t like interrupting your enjoyment of Second Life. Here’s hoping we don’t have to come back with another one of these blog posts for quite some time!