Tools and Technology

The Story Behind Last Week's Unexpected Downtime


Hi! I’m a member of the Second Life Operations team. On Friday afternoon, major parts of Second Life had some unplanned downtime, and I want to take a few minutes to explain what happened.

Shortly before 4:15pm PDT/SLT last Friday (May 6th, 2016), the primary node for one of the central databases that drive Second Life crashed. That node holds some of the data most central to Second Life, and a whole lot of things stop working when it’s inaccessible, as a lot of Residents saw.

When the primary node in this database is offline, we turn off a bunch of services so that we can bring the grid back up in a controlled manner by turning them back on one at a time.
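For readers curious what "turning them back on one at a time" can look like, here is a minimal sketch in Python. It is purely illustrative and is not our actual tooling: the function name `staged_restart`, the `start` and `is_healthy` callbacks, and the service names in the usage note are all hypothetical.

```python
from typing import Callable, Iterable, List


def staged_restart(
    services: Iterable[str],
    start: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Start services in order, one at a time (hypothetical helper).

    Each service is started and then checked. The loop stops at the
    first service that fails its health check, so later services are
    never started on top of a broken one. Returns the names of the
    services that were started and passed their check.
    """
    restored: List[str] = []
    for name in services:
        start(name)                 # bring this one service back up
        if not is_healthy(name):    # verify before moving on
            break
        restored.append(name)
    return restored
```

Calling it with, say, `["login", "inventory", "chat"]` (made-up names) would start each service in turn and stop early if one of them did not come back cleanly, which is the point of bringing things up in a controlled order.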

My team quickly sprang into action, and we were able to promote one of the replica nodes up the chain to replace the primary node that had crashed. All services were fully restored and turned back on in just under an hour.

One additional (and totally unexpected) problem that came up was that for the first part of the outage, our status blog was inaccessible. Our support team uses our status blog to inform Residents of what’s going on when there are problems, and the amount of traffic it receives during an outage is pretty impressive!

A few weeks ago we moved our status blog to new servers. It can be really hard to tune a system for something like a status blog, because the traffic will go from its normal amount to many, many times that very suddenly. We can see that we have some additional tuning to do with the status blog now that it’s in its new home. (Don’t forget that you can also follow us on Twitter at @SLGridStatus. It’s really handy when the status blog is inaccessible!)

As Landon Linden wrote a year ago, being around my team during an outage is like watching “a ballet in a war zone.” We work hard to restore Second Life services the moment they break, and this outage was no exception. It can be pretty crazy at times!

We’re really sorry for the unexpected downtime late last week. There are a lot of fun things that happen inworld on Friday night, and the last thing we want is for technical issues to get in the way.


April Linden


Recommended Comments

Guest

Posted

You all do such amazing work. Thank you, I appreciate you all.

Vivienne Daguerre

Posted

There was a time a few years back when things like this happened frequently. Now it is a rare event, and it was a short lived one. Kudos to Linden Lab for progress made. This event just reminded us of how things once were, and the success of Linden Lab in dealing with technical issues to create a very stable grid.

Caleb Kit

Posted

Thank you for letting us know what happened. Work in tech myself. Your team's efforts are much appreciated.

Fallacy DeCuir

Posted

Thanks for letting us know what happened, and for your prompt fix! <3

Marcus Perry

Posted

I really appreciate these post-mortems on service outages. Well done!

FireyMae Serenity

Posted

Kudos to the team that does a great job at times like this one. Well done guys, thank you for bringing SL back to life ;)

Feliciana Zabaleta

Posted

Thank you guys for all your hard work in the background. We all appreciate your hard work and keeping us all in the loop.

CassieMaia

Posted

Thanks for the follow-up explanation.  Much appreciated!
