Hi everyone! Mazidox here. I’d like to give you an overview of what happened on Wednesday (09/06) that ended with some Residents’ objects being mass-returned.
Two weeks ago, we had several problems crop up all at once, starting with an outage of a DNS server (a server that helps route requests between different parts of Second Life). Unfortunately, when the dust settled, we started seeing a disturbing trend: mass-returns of objects.
We diagnosed an issue we had first encountered several months ago, where a region starts up with incorrect mesh Land Impact calculations, which can lead to a lot of objects getting returned at once. At that time we applied what we call a speculative fix. A speculative fix means that while we can’t recreate the circumstances that led to a problem, we are still fairly confident that we can stop it from happening again. Unfortunately, in this case we were mistaken; because the fix we applied was speculative, the problem wasn’t fixed as completely as it could have been, and we found out how incomplete the fix was in dramatic fashion that Wednesday night.
When a problem like this occurs with Second Life we have three priorities:
Stop the problem from getting worse
Fix the damage that has been done
Keep the problem from happening again
We had the first priority taken care of by the end of the initial outage; we could be certain at that point that our servers could talk to each other and there weren’t going to be any more mass-returns of objects that day. We then started assessing the damage and figuring out how to fix as much as we could. In this case it turned out that restarting affected regions where no objects had been returned fixed the problem of some meshes showing the wrong Land Impact.
For regions where a mass-return had happened, there wasn’t a quick fix. Our Ops team managed to figure out a partial list of which regions were affected by a mass object return, which kept our Support team very busy with cleanup. Once we had helped everyone we knew had experienced mass object returns, our focus shifted once more, this time to keeping the problem from happening again.
To recreate the various factors that caused this object return, we first needed to identify each contributing factor and then put those pieces together in a test environment. Running tests and finding strange problems is the Server QA team’s specialty, so we’ve been at it since the morning after this all happened. I have personally been working to reproduce this, with help from our Engineering and Ops teams. We’re all focused on putting each of the pieces together to ensure that no one has to deal with a mass-return again.
Your local bug-hunting spraycan,