
Tools and Technology

Recent February 1st Outage


Linden Lab


Hello!

This Wednesday, February 1st, was a rocky one for Second Life, especially for regions on our Release Candidate (RC) channels. As with many large-scale services, we experience service interruptions from time to time, but yesterday's problems were unique and noteworthy enough that they deserve a sincere apology and more thorough explanation.

Summary

  • The incident lasted 6 hours
  • 7,029 regions were reverted to a previous stable version
  • The root cause was a bug in object inventory behavior that was introduced by new code. This bug caused some objects' scripts to enter a non-running state.
  • We are making changes to our deploy and testing processes to make events like this less likely to happen in the future.

Timeline

Second Life deploys simulator changes in two major phases: first, to Release Candidate (RC) channels which comprise roughly one-third of all production regions, then to the remainder on a main Second Life Server (SLS) channel. Before changes are deployed to RC, they go through several develop-test cycles, including deployments and testing on staging environments.
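As a rough sketch of this two-phase pattern (the channel names and the one-third RC split come from the description above; the channel-assignment logic is invented for illustration, and real channel assignments are presumably stickier than a random shuffle):

```python
import random

def plan_rollout(regions: list[str], rc_denominator: int = 3) -> tuple[list[str], list[str]]:
    """Split regions into a Release Candidate (RC) group of roughly
    one-third and a main (SLS) group holding the remainder."""
    shuffled = regions[:]
    random.shuffle(shuffled)  # illustrative only; real assignment is not random per roll
    cut = len(shuffled) // rc_denominator
    return shuffled[:cut], shuffled[cut:]

regions = [f"region-{i}" for i in range(9)]
rc, sls = plan_rollout(regions)
# New simulator code goes to `rc` first; only after it proves stable
# does the same version roll to `sls`.
```

The point of the split is blast-radius control: a bad build reaches only the RC third, which is why this incident affected RC regions rather than the whole grid.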

RC rolls are regularly scheduled for 7AM Pacific (SLT) every Wednesday. While conducting one of these deployments, we received initial reports of issues relating to script behavior at 8:30AM. We immediately started investigating, and by 9AM it was clear that we needed to stop the roll, because the issue was pervasive and clearly related to the new changes. As we do in these situations, we declared an “Incident” and an Incident Commander took charge. 

By 9:30 we started a rollback, reverting the affected simulators to their previous version. We also evaluated what additional actions we would need to take, as it was unclear that a rollback alone would restart the scripts which had stopped. It quickly proved that it would not, and we came up with a new plan - one that would ensure that scripts would perform as expected going forward, but would possibly undo the changes our residents had made since we introduced the bug. Although this meant more downtime, it prevented further content loss and proved to be the best way to put the grid back in order. The team came up with a quick, clean, and efficient way to achieve this and get everyone back on track.

By 12PM we had decided to take this approach. The code was complete by 1:15PM, we confirmed it worked by 1:40PM, and by 2:25PM all regions were brought back up.

Of course, after an event like this there’s cleanup to be done, and we had quite a few Lindens reviewing the fallout and taking care of our Residents. Our amazing support team worked with creators whose operations were affected.

Trouble doesn’t come alone: we had to deal with another outage at the same time - this one on the web side of things.  Some of our services were getting overwhelmed. We added a few hamsters to those wheels, and they seem to be turning just fine again. 

Technical Details

What happened? To quote part of our internal discussion, “scripts are weird”:

Quote

Sometimes we refer to scripts in an object by a globally unique identifier, and sometimes we refer to scripts in an object by an object-local identifier. There was an issue in the version we deployed on 2023-02-01 where object-local script identifiers would always refer to the first script in an object, causing only that script to be marked as running and receive events like touch and linked_message.
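The quoted behavior can be illustrated with a small, hypothetical sketch (this is not Linden Lab's actual simulator code; the `Script` and `SimObject` classes and all names are invented for illustration): when the object-local lookup ignores its argument, only the first script in an object ends up marked as running.

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4

@dataclass
class Script:
    global_id: UUID       # globally unique identifier
    running: bool = False

@dataclass
class SimObject:
    scripts: list[Script] = field(default_factory=list)

    def script_by_local_id_buggy(self, local_id: int) -> Script:
        # BUG: the object-local identifier is ignored, so every lookup
        # resolves to the first script in the object.
        return self.scripts[0]

    def script_by_local_id(self, local_id: int) -> Script:
        # Fixed: resolve the object-local identifier to its own slot.
        return self.scripts[local_id]

obj = SimObject(scripts=[Script(uuid4()) for _ in range(3)])

# Try to mark every script running through the buggy lookup:
# each iteration touches scripts[0], so the other scripts stay stopped.
for local_id in range(len(obj.scripts)):
    obj.script_by_local_id_buggy(local_id).running = True
after_buggy = [s.running for s in obj.scripts]   # [True, False, False]

# With the fixed lookup, every script gets marked running.
for local_id in range(len(obj.scripts)):
    obj.script_by_local_id(local_id).running = True
after_fix = [s.running for s in obj.scripts]     # [True, True, True]
```

In a state like `after_buggy`, only the first script would be running and receiving events such as touch and linked_message, matching the symptoms described in the quote.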

Conclusion

What remains after an outage? We write up an internal “postmortem” report: what happened, what lessons we learned, and what future measures we’re going to put in place. We share it widely, so that we can all learn from it. On days like this, we also write a blog post to let you all know what happened as well. 

Of all the reasons to write a blog, a postmortem ranks pretty low for me, because it means something went seriously wrong. But of all the reasons I love working here, watching everyone come together during an outage ranks pretty high. I’m grateful to know that while we may make mistakes, we will not sweep them under the rug, nor look for someone to blame - we will come together and make it right. 

Thank you for your patience and understanding. This is your world, and we hope to take better care of it. 

Grumpity and the Second Life team
