Due to continued changes in the Facebook API, as of today the Second Life viewer will no longer be able to support Facebook Connect for sharing your inworld photos and posts. We apologize for this inconvenience and will be removing the UI from the viewer shortly. We will, of course, be happy to see your SL posts on Facebook going forward, and you can always say hello and check out what’s happening on our official page: https://www.facebook.com/secondlife
Many Residents have noted that in the last few weeks we have had an increase in disconnects during a teleport. These occur when an avatar attempts to teleport to a new Region (or cross a Region boundary, which is handled similarly internally) and the teleport or Region crossing takes longer than usual. Instead of arriving at the expected destination, the viewer disconnects with a message like:
Darn. You have been logged out of Second Life.
You have been disconnected from the region you were in.
We do not currently believe that this is specific to any viewer, and it can affect any pair of Regions (it seems to be a timing-sensitive failure in the hand-off between one simulator and the next). There is no known workaround - in the meantime, please just log back in to get where you were going.
We are very much aware of the problem, and have a crack team trying to track it down and correct it. They’re putting in long hours and exploring all the possibilities. Quite unfortunately, this problem dodged our usual monitors of the behavior of simulators in the Release Channels, and as a result we're also enhancing those monitors to prevent similar problems getting past us in the future.
We're sorry about this - we empathize with how disruptive it has been.
Quite some time ago, we introduced viewer changes that moved the fetching of most assets from using UDP through the simulator to HTTP through our content delivery network (CDN) to improve fetch times and save cycles on the simulator for simulation. At the time, we did not disable the UDP fetching path so that all users would have time to upgrade to newer viewer versions that fully support the new HTTP/CDN path. We warned that at some time in the future we would disable the older mechanism, and that time has come.
The simulator rolling to the BlueSteel and LeTigre channels this week removes support for UDP asset fetching. In the normal course of events that version should be grid wide within a couple of weeks. Viewers that have not been updated to the newer (and faster) protocol will no longer be able to fetch many asset types, including
- animations
- gestures
- sounds
- avatar clothing
- avatar body parts
Since some specific body parts are required to render avatars, anyone on viewers that cannot load them will be a cloud or “Ruth” avatar and unable to change from it.
All current official viewer versions are able to load assets normally. As far as we are aware, all actively maintained Third Party Viewers have had support for the HTTP/CDN asset fetching for many releases, so no matter what viewer you prefer there should be an upgrade available for you.
There are a number of scripts available that animate the Group Role title for your avatar. While no doubt entertaining, these take advantage of functionality that was never intended to serve this purpose. Server-side changes that will break these Group Tag Animators will be rolling out to the grid soon. This change is deliberate: it sharply restricts the rate at which these updates are allowed. The new limits are generous enough that a human changing titles through a viewer should not exceed them, but strict enough to break the 'title animators'.
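For the curious, a cap like the one described is commonly implemented as a token bucket. The sketch below is illustrative only (the class name and limits are mine, not Linden Lab's actual server code): a human clicking through a viewer stays comfortably under the limit, while a script firing updates several times a second quickly runs out of tokens.

```python
import time

class TokenBucket:
    """Allow short bursts but cap the sustained rate of updates."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity              # maximum burst size
        self.refill_per_sec = refill_per_sec  # sustained rate allowed
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # update accepted
        return False      # update rejected: rate limit exceeded

# Hypothetical tuning: a burst of 5 changes, then ~6 per minute sustained.
bucket = TokenBucket(capacity=5, refill_per_sec=0.1)
```

The same shape generalizes to any "generous for humans, fatal for scripts" limit: `capacity` controls burst tolerance and `refill_per_sec` the long-run rate.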
The current Marketplace products which do this are being removed and the sellers notified.
Image from xkcd
Residents on our Release Candidate Regions got an extra bonus Region restart today. We apologize if this extra restart disrupted your Second Life fun, and we want to explain what happened.
The Release Candidate channels exist so that we can try new server versions under live conditions to discover problems that our extensive internal testing and trials on the Beta Grid don't uncover. Unfortunately, it simply isn't possible for us to simulate the tremendous variety of content and activities found on the main Grid. We appreciate that Region owners are willing to be a part of that process, and regret those occasions when a bug gets past us and disrupts those Regions.
Normally, Region state is saved periodically many times a day as well as when the Region is being shut down for a restart. The most recent Region state is restored when a Region restarts.
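The post doesn't describe the save mechanism itself, but the standard way to make periodic state saves crash-safe is to write to a temporary file and atomically swap it into place, so a failure mid-save never destroys the previous good snapshot. A minimal sketch under that assumption (the file format and function names here are hypothetical, not the simulator's actual code):

```python
import json
import os
import tempfile

def save_region_state(state, path):
    """Write state atomically: a crash mid-write leaves the
    previous good save untouched."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())       # make sure the bytes hit disk
        os.replace(tmp_path, path)     # atomic swap on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)            # clean up the partial temp file
        raise

def load_region_state(path):
    """Restore the most recent saved state."""
    with open(path) as f:
        return json.load(f)
```

Writing to the same directory as the target matters: `os.replace` is only atomic within a filesystem.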
The extra roll today was needed because we found a problem that could have caused long-running Regions to fail to save that state. Without the roll, there was a significant chance that changes made to those Regions would not survive the regularly scheduled roll, because the saved data would have been out of date. Good news! It’s been fixed, and today's roll applies that fix. The roll took a little longer than usual because we took extra care to ensure that Region saves would work normally.
We apologize for any disruption to your Second Life today, but at least you can rest assured that Second Life saves!
You may have been among those who had problems seeing avatars yesterday; this is a quick note to share with you what went wrong and what comes next.
Early yesterday morning, we deployed an update to the backend service that creates the baked textures and other data that make up your avatar’s appearance. The time was chosen to be at a low point in concurrency because we knew the update would create unusual load (making all-new appearances for every active avatar). It turned out that the load was a little higher than we expected; that probably would have been OK, but we had two simultaneous system failures in the servers that provide some of the key data to the service. By the time we had diagnosed those failures, the backlog of work to be done had grown considerably and the number of active users had also increased. It took a few hours for the backlog of avatar appearances to get caught up.
We realize how disruptive events like these can be (our own inworld meetings were full of mostly-gray Lindens and clouds yesterday), and very much regret the inconvenience.
We're increasing the redundancy of these services and modifying our deploy procedures to avoid a repeat of this kind of failure in the future.
The good news is that this was one of the last backend changes needed for us to be able to roll out the new Bakes-on-Mesh feature on Agni (the main grid); with the updated viewer, it will allow you to apply system clothing and skins to your avatar even when you're using mesh body parts. Look for the viewer release announcement soon.
- Oz Linden
Hello amazing Residents of Second Life!
A few days ago (on Sunday, October 28th, 2018) we had a really rough day on the grid. For a few hours at a stretch it was nearly impossible to stay connected to Second Life at all, and this repeated several times during the day.
The reason this happened is that Second Life was being DDoSed.
Attacks of this type are pretty common. We’re able to handle nearly all of them without any Resident-visible impact to the grid, but the attacks on Sunday were particularly severe. The folks who were on call this weekend did their best to keep the grid stable during this time, and I’m grateful they did.
Sunday is our busiest day in Second Life each week, and we know there are lots of events folks plan for it. We’re sorry those plans got interrupted. Like most of y’all, I too have an active life as a Resident, and my group had to work around the downtime as well. It was super frustrating.
As always, the place to stay informed of what’s going on is the Second Life Grid Status Blog. We do our best to keep it updated during periods of trouble on the grid.
Thanks for listening. I’ll see you inworld!
Second Life Operations Team Lead
The software which we use for keeping track of the bugs found in Second Life is long overdue for an upgrade. If you've never interacted with the system before, you don't need to start now; however if you are one of the dedicated Residents who spends time helping us improve Second Life via https://jira.secondlife.com then this message is for you.
We are planning the upgrade on Wednesday, August 29, 2018, starting at 8:30 pm PDT. We're allowing a 6-hour window to give us time to chase down any problems, though hopefully we will be done much more quickly.
If you are interested in what this upgrade actually means, the good news is that most of the changes are improvements behind the scenes, or cosmetic upgrades. The look and feel of the user interface will be changing, which isn't surprising since we are going from Jira version 5 to version 7. Importantly for some of you, the new login system updates the email address that Jira uses for you every time you log in, instead of only the very first time, as the current system does.
The very first change you will likely see is a new login page:
Mention this blog to me inworld and get a free Linden Bear!
As I’m sure most of y’all have noticed, Second Life has had a rough 24 hours. We’re experiencing outages unlike any in recent history, and I wanted to take a moment and explain what’s going on.
The grid is currently undergoing a large DDoS (Distributed Denial of Service) attack. Second Life being hit with a DDoS attack is pretty routine. It happens quite a bit, and we’re good at handling it without a large number of Residents noticing. However, the current DDoS attacks are at a level that we rarely see, and are impacting the entire grid at once.
My team (the Second Life Operations Team) is working as hard as we can to mitigate these attacks. We’ve had people working round-the-clock since they started, and will continue to do so until they settle down. (I had a very late night, myself!)
Second Life is not the only Internet service that’s been targeted today. My sister and brother opsen at other companies across the country are fighting the same battle we are. It’s been a rough few days on much of the Internet.
We’re really sorry that access to Second Life has been so sporadic over the last day. Trying to combat these attacks has the full attention of my team, and we’re working as hard as we can on it. We’ll keep posting on the Second Life Status Blog as we have new updates.
See you inworld!
Second Life Operations Team Lead
We are pleased to announce that our newest viewer update (AlexIvy) is the first Linden Lab viewer to be built as a 64-bit application on both Windows and Mac. We'd like to send a shout out to the many third party viewer developers who helped with this important improvement! For Windows users whose systems are not running 64-bit yet, there is a 32-bit build available as well; you don't need to figure out which is best for your system, because the viewer will do it for you (see below, especially about upgrading your system).
Building the viewer as a 64-bit application gives it access to much more memory than before, and in most cases improves performance as well. Users who have been running the Release Candidate builds have had significantly fewer crashes.
This viewer also has updates to media handling because we've updated the internal browser technology. This version will display web content better than before, and more improvements in that area are on the way. You may notice that this version runs more processes on your system for some media types; this is expected.
There is one other structural difference that you may notice: the viewer now has one additional executable, the SL_Launcher. This new component manages the viewer update process, and on Windows also ensures that you've got the best build for your system (in the future it may pick up some other responsibilities). For Windows systems, the best build is usually the one that matches your operating system: if you're running 64-bit Windows, you’ll get the 64-bit viewer; if not, you’ll get the 32-bit viewer. However, some older video cards are not supported by Windows 10, so the launcher may switch you to the 32-bit build, which is compatible with those cards. You won’t have to do anything to make this work - it's all automatic. If you get an update immediately the first time you run this new viewer, it's probably switching you to the better build for your system.
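The launcher's actual logic isn't published, but the OS-matching part (setting aside the video-card exception) can be sketched as nothing more than inspecting the machine architecture the operating system reports. Everything here is illustrative - the function and build names are mine, not SL_Launcher's:

```python
import platform

def pick_build(machine=None):
    """Pick the viewer build matching the host OS word size.
    (Illustrative sketch; the real SL_Launcher also weighs things
    like video-card compatibility on Windows 10.)"""
    machine = machine or platform.machine()
    # 64-bit Windows reports AMD64 (or ARM64) even to 32-bit processes,
    # so this reflects the OS, not the process that asks.
    if machine.lower() in ("amd64", "x86_64", "arm64", "aarch64"):
        return "viewer-64bit"
    return "viewer-32bit"
```

Keying off the OS rather than the running process is the important detail: a 32-bit updater on 64-bit Windows should still install the 64-bit viewer.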
Important: If you have created shortcuts to run the viewer, you should update them to run the SL_Launcher executable (if you don't, the viewer will complain when you run it, and updates won't work). On Macs, the SL_Launcher and Second Life Viewer processes both show as icons on the Dock when running (hover over them to see which is which); this is a known bug, and in a future update we'll fix it so they show as a single icon. We apologize for the temporary inconvenience, but think you'll agree that the performance improvement (quite noticeable on most Macs) is worth it.
Having a 64-bit viewer will help to make your SL experience more reliable and performant (and we have quite a few projects in the queue for this year to that end). However, if you're running older versions of Windows, and especially if you're not running a 64-bit version, you won't be able to get most of those benefits. In our Release Candidate testing, users on 32-bit Windows saw crash rates as much as three times higher than those on 64-bit Windows 10. Almost any Windows system sold in the last several years can run 64-bit Windows 10, even if it didn't come with that OS originally. We strongly suggest that upgrading will be worth your while (this is true even if you run a Third Party Viewer, by the way).
About Linux … at this time, we don't have a Linux build for this updated viewer. We do have a project set up to get that back. We're reorganizing the Linux build so that instead of a tarball, it produces a Debian package you can install with the standard tools, and rather than statically linking all the libraries it will just declare what it needs through the standard package requirements mechanism. We'll post separately on the opensource-dev mailing list with information on where that project lives and how to contribute to it.
A fun bit of trivia: the AlexIvy name comes from LXIV, the Roman numeral for 64.
Things were a little bumpy for users who tried to log into Second Life on Monday morning as a result of a scheduled code deploy. I wanted to share with you what happened, and what we're going to do to try to prevent this in the future.
That morning, I attempted to deploy a database change to an internal service. Without going into too much detail, the deploy was to modify an existing database table in order to add an extra column. These changes had been reviewed multiple times, had passed the relevant QA tests in our development and staging environments, and had met all criteria for a production deploy. Although this service isn't directly exposed to end-users, it is used as part of the login process and it is designed to fail open, i.e. if the service is unavailable, users should still be able to log in to Second Life without a problem.
During the database change, the table being altered was locked to prevent changes to it while it was being altered. This table turned out to be almost a billion rows in size and the alteration took significantly longer than expected. Furthermore, the service did not fail open as designed, and caused logins to Second Life to fail, along with a handful of other ancillary services. Our investigation was further complicated by other problems seen on the Internet on Monday due to a configuration issue at one of the big ISPs in North America. Many of us work remotely and while we saw problems early on, it wasn't immediately clear to us that it was internal, rather than one caused by a third party service. After some investigation, the lock on the database was removed, and services slowly began to recover. We did have to do some additional work to restore the login service, as the Next Generation login servers (as described by April here) are not yet fully deployed.
I'm still looking to complete this deploy in the near future, but this time we'll be using another method which doesn't require locking the database tables, and won't cause a similar problem. We're also investigating exactly why the service didn't fail open as it was designed to, and how we can prevent it from happening in the future.
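The post doesn't name the alternate method, but a common non-locking approach (the idea behind tools like Percona's pt-online-schema-change) is copy-and-swap: build the new table alongside the old one, copy rows over in small batches, then rename. Here is a toy sketch of that technique against SQLite, with hypothetical table and column names; production tools also replay writes that arrive mid-copy, which this sketch omits:

```python
import sqlite3

def add_column_without_long_lock(conn, batch_size=1000):
    """Copy-and-swap schema change: each batch holds locks only
    briefly, unlike one ALTER over a billion-row table."""
    conn.execute("CREATE TABLE accounts_new "
                 "(id INTEGER PRIMARY KEY, name TEXT, extra TEXT)")
    last_id = 0
    while True:
        # Walk the old table in primary-key order, one batch at a time.
        rows = conn.execute(
            "SELECT id, name FROM accounts WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size)).fetchall()
        if not rows:
            break
        conn.executemany(
            "INSERT INTO accounts_new (id, name) VALUES (?, ?)", rows)
        last_id = rows[-1][0]
        conn.commit()                  # release locks between batches
    # Swap the tables; the new column rides along.
    conn.execute("DROP TABLE accounts")
    conn.execute("ALTER TABLE accounts_new RENAME TO accounts")
    conn.commit()
```

The trade-off is copy time and disk space for the duration of the migration, in exchange for never holding a table lock longer than one batch.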
Hi everyone! Mazidox here. I’d like to give you an overview of what happened on Wednesday (09/06) that ended up with some Residents’ objects being mass returned.
Two weeks ago, we had several problems crop up all at once - starting with a DNS server outage (a server that helps route requests between different parts of Second Life). Unfortunately, when the dust settled, we started seeing a disturbing trend: mass-returns of objects.
We diagnosed an issue where a Region starts up with incorrect mesh Land Impact calculations, which could lead to a lot of objects getting returned at once, as we had encountered several months ago. At that time we applied what we call a speculative fix. A speculative fix means that while we can’t recreate the circumstances that led to a problem, we are still fairly confident that we can stop it from happening again. Unfortunately, in this case we were mistaken. Because the fix was speculative, it didn't address the problem as completely as it could have, and we found out just how incomplete it was in dramatic fashion that Wednesday night.
When a problem like this occurs with Second Life we have three priorities:
Stop the problem from getting worse
Fix the damage that has been done
Keep the problem from happening again
We had the first priority taken care of by the end of the initial outage; we could be certain at that point that our servers could talk to each other and there weren’t going to be any more mass-returns of objects that day. At that point, we started assessing the damage and figuring out how to fix as much as we could. In this case it turned out that restarting affected Regions where no objects had been returned fixed the problem of some meshes showing the wrong Land Impact.
For Regions where a mass-return had happened, there wasn’t a quick fix. Our Ops team managed to put together a partial list of Regions affected by a mass object return, which kept our Support team very busy with cleanup. Once we had helped everyone we knew of who had experienced mass object returns, our focus shifted once more, this time to keeping the problem from happening again.
In order to recreate all the various factors that caused this object return we needed to first identify each contributing factor, and then put those pieces together in a test environment. Running tests and finding strange problems is the Server QA team’s specialty so we’ve been at it since the morning after this all happened. I have personally been working to reproduce this, along with help from our Engineering and Ops teams. We’re all focused on trying to put each of the pieces together to ensure that no one has to deal with a mass-return again.
Your local bug-hunting spraycan,
Heya! April Linden here.
We had a pretty rough morning here at the Lab, and I want to tell you what happened.
Early this morning (during the grid roll, but it was just a coincidence) we had a piece of hardware die on our internal network. When this piece of hardware died, it made it very difficult for the servers on the grid to figure out how to convert a human-readable domain name, like www.secondlife.com, into IP addresses, like 188.8.131.52.
Everything was still up and running, but none of the computers could actually find each other on our network, so activity on the grid ground to a halt. The Second Life grid is a huge collection of computers, and if they can’t find each other, things like switching regions, teleports, accessing your inventory, changing outfits, and even chatting fail. This caused a lot of Residents to try to relog.
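This post is a good illustration of how hard a grid of services leans on DNS. One common way to soften a resolver outage (a general technique, not necessarily what the Lab does) is to keep serving the last known-good answer when the resolver goes dark. A minimal sketch, with the resolver injected so the failure mode is easy to see:

```python
import time

class CachingResolver:
    """Serve the last known-good DNS answer when the resolver is
    unreachable. (A sketch: real resolvers also honor per-record
    TTLs from the answer itself, and do negative caching.)"""

    def __init__(self, resolve, ttl=60.0):
        self.resolve = resolve   # e.g. lambda host: socket.gethostbyname(host)
        self.ttl = ttl
        self.cache = {}          # host -> (address, expires_at)

    def lookup(self, host):
        entry = self.cache.get(host)
        now = time.monotonic()
        if entry and now < entry[1]:
            return entry[0]          # fresh cached answer
        try:
            addr = self.resolve(host)
        except OSError:
            if entry:
                return entry[0]      # stale but usable: keep services talking
            raise                    # no answer at all - nothing we can do
        self.cache[host] = (addr, now + self.ttl)
        return addr
```

Serving stale answers is a deliberate trade-off: during an outage, an address that was right a minute ago is almost always better than no address.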
We quickly rushed to get the hardware that died replaced, but hardware takes time - and in this case, it was a couple of hours. It was very eerie watching our grid monitors. At one point the “Logins Per Minute” metric was reading “1,” and the “Percentage of Successful Teleports” was reading “2%.” I hope to never see numbers like this again.
Once the failed hardware was replaced, the grid started to come back to life.
Following the hardware failure, the login servers got into a really unusual state. The login server would tell the Resident’s viewer that the login was unsuccessful, but it was telling the grid itself that the Resident had logged in. This mismatch in communication made finding what was going on really difficult, because it looked like Residents were logging in, when really they weren't. We eventually found the thing on the login servers that wasn’t working right following the hardware failure, and corrected it, and at this point the grid returned to normal.
There is some good news to share! We are currently in the middle of testing our next generation login servers, which have been specifically designed to better withstand this type of failure. We’ve had a few of the next generation login servers in the pool for the last few days just to see how they handle actual Resident traffic, and they held up really well! In fact, we think the only reason Residents were able to log in at all during this outage was because they happened to get really lucky and got randomly assigned to one of the next generation login servers that we’re testing.
The next step for us is to finish up testing the next generation login servers and have them take over for all login requests entirely. (Hopefully soon!)
We’re really sorry about the downtime today. This one was a doozy, and recovering from it was interesting, to say the least. My team takes the health and stability of Second Life really seriously, and we’re all a little worn out this afternoon.
Your friendly long eared GridBun,
Recently, our bug reporting system (Jira) was hit with some spam reports and inappropriate comments, including offensive language and attempts at impersonating Lindens. The Jira system can email bug reporters when new comments are added to their reports, and so unfortunately the inappropriate comments also ended up in some Residents' inboxes.
We have cleaned up these messages, and continue to investigate ways to prevent this kind of spam in the future. We appreciate your understanding as we work to manage an open forum and mitigate incidents like this.
In the short term, we have disabled some commenting features to prevent this from recurring. This means that you will not be able to comment on Jiras created by other Residents. We apologize for this inconvenience as we look into long term solutions to help prevent this type of event from occurring.
Heya! April Linden here.
Yesterday afternoon (San Francisco time) all of the Place Pages got reset back to their default state. All customizations have been lost. We know this is really frustrating, and I want to explain what happened.
A script that our developers use to reset their development databases back to a clean state was accidentally run against the production database. It was completely human error. Worse, none of the backups we had of the database actually worked, leaving us unable to restore it. After a few frantic hours yesterday trying to figure out if we had any way to get the data back, we decided the best thing to do was just to leave it alone, but in a totally clean state.
An unfortunate side effect of this accident is that all of the web addresses to existing Place Pages will change. There is a unique identifier in the address that points to the parcel that the Place Page is for, and without the database, we’re unable to link the address to the parcel. (A new one will automatically be generated the first time a Place Page is visited.) If you have bookmarks to any Place Pages in your browser, or on social media sites, they'll have to be updated.
Because of this accident, we’re taking a look at the procedures we already have to make sure this sort of mistake doesn’t happen again. We’re also doing an audit of all of our database backups to make sure they’re working like we expect them to.
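On the backup audit: the only test of a backup that really counts is a restore. A minimal sketch of that idea using SQLite as a stand-in (the Lab's actual databases and tooling are of course different, and a real audit would also verify row counts or checksums, not just table presence):

```python
import sqlite3

def backup_is_restorable(backup_path, expected_tables):
    """Verify a database backup by actually opening and querying it.
    A backup that has never been restored is a hope, not a backup."""
    try:
        conn = sqlite3.connect(backup_path)
        found = {row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")}
        conn.close()
        return set(expected_tables) <= found
    except sqlite3.Error:
        # Corrupt or truncated file: the backup is unusable.
        return False
```

Run on a schedule against every backup, a check like this turns "none of the backups actually worked" from a discovery made during an outage into an alert raised weeks earlier.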
I’d like to stress that we’re really sorry this accident occurred. I personally had a bunch of Place Pages I’d created, so I’m right in there with everyone else in being sad. (But I’m determined to rebuild them!)
Since we’re on the topic of human error, I’d like to share with you a neat piece of the culture we have here at the Lab.
We encourage people to take risks and push the limits of what we think is possible with technology and virtual worlds. It helps keep us flexible and innovative. However… sometimes things don’t work out the way they were planned, and things break. What we do for penance is what makes us unique.
Around the offices (and inworld!) we have sets of oversized green ears. If a Linden breaks the grid, they may, if they choose, wear the Shrek Ears as a way of owning their mistake.
If we see a fellow Linden wearing the Shrek Ears, we all know they’ve fessed up, and they’re owning their mistake. Rather than tease them, we try to be supportive. They’re having a bad day as it is, and it’s a sign that someone could use a little bit of niceness in their life.
At the end of the day, the Linden takes off the Shrek Ears, and we move on. It’s now in the past, and it’s time to learn from our mistakes and focus on the future.
There are people wearing Shrek Ears around the office and inworld today. If you see a Linden wearing them, please know that’s their way of saying sorry, and they’re really having a bad day.
Baloo Linden and April Linden, in the Ops team’s inworld headquarters, the Port of Ops.
The AssetHttp Project Viewer is now available on the alternate viewers page! We expect it to help with speed and reliability in fetching animations, gestures, sounds, avatar clothing, and avatar body parts - but we need your help to make sure everything works!
Historically, loading assets such as textures, animations, clothing items and so on has been an area where problems were common. All such items were requested through the simulator, which would then find the items on our asset hosts and retrieve them. Especially in heavily populated regions it was possible for things to fail to load because the simulators would get overloaded. Having these requests routed through an intermediate host also made them much slower than necessary.
A few years ago we made the change to allow textures to be loaded directly via HTTP. Now instead of asking the simulator for every texture, the Viewer could fetch textures directly from a CDN, the same way web content is normally distributed. Performance improved and people noticed that the world loaded faster, and clouded or gray avatars became much less common.
Today we are taking the next step in enabling HTTP-based fetching. If you download this test Viewer, then several other common types of assets will also be retrieved via HTTP. Supported asset types include animations and gestures, sounds, avatar clothing, and avatar body parts such as the skin and shape.
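The real CDN URL scheme isn't public, so the shape of a direct HTTP asset fetch can only be sketched with a hypothetical layout (the base URL and path scheme below are invented for illustration). The point is simply that the viewer hits the CDN directly, keyed by asset UUID, with no simulator round trip in the path:

```python
from urllib.request import Request, urlopen

# Hypothetical CDN layout -- the actual Second Life asset URL
# scheme is not documented publicly.
CDN_BASE = "https://asset-cdn.example.com"

def asset_url(asset_id, asset_type):
    """Build a direct CDN URL for an asset, keyed by its UUID."""
    return "%s/%s/%s" % (CDN_BASE, asset_type, asset_id)

def fetch_asset(asset_id, asset_type, timeout=10):
    """Fetch the asset bytes straight from the CDN: no simulator
    in the path, so a crowded region can't slow the request down."""
    req = Request(asset_url(asset_id, asset_type),
                  headers={"Accept": "application/octet-stream"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

Because the CDN caches by URL, a popular sound or animation is served from an edge node near the viewer instead of being re-fetched from the asset hosts every time.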
Based on our tests, this change also helps with performance. We need your help to make sure it works for people all over the world, and to identify any remaining issues that we need to fix. Please download the Viewer and give it a spin; we hope it will make your Second Life even faster and more reliable.
Sometime after these changes go into the default Viewer download, we will be phasing out support on the simulators for the old non-HTTP asset fetching process. We will let you know well ahead of that time so you can download a supported Viewer, and so that makers of other viewers will have time to add these changes to their products. Please let us know how it goes!
April Linden here. I’m a member of the Second Life Operations team. Second Life had some unexpected downtime on Monday morning, and I wanted to take a few minutes to explain what happened.
We buy bandwidth to our data centers from several providers. On Monday morning, one of those providers had a hardware failure on the link that connects Second Life to the Internet. This is a fairly normal thing to happen (and is why we have more than one Internet provider). This time was a bit unusual, as the traffic from our Residents on that provider did not automatically spill over to one of the other connections, as it usually does.
Our ops team caught this almost immediately and were able to shift traffic around to the other providers, but not before a whole bunch of Residents had been logged out due to Second Life being unreachable.
Since a bunch of Residents were unexpectedly logged out, they all tried to log back in at once. The resulting login rush was very large, and it took quite a while for everyone to get logged back in. Our ops team brought some additional login servers online to help with the backlog of login attempts, which allowed the login queue to eventually return to its normal size.
Some time after the login rush was completed the failed Internet provider connection was restored, and traffic shifted around normally, without disruption, returning Second Life back to normal.
There was a bright spot in this event! Our new status blog performed very well, allowing our support team to communicate with Residents even while it was under much higher load than normal.
We’re very sorry for the unexpected downtime on Monday morning. We know how important having a fun time Inworld is to our Residents, and we know how unfun events like this can be.
See you Inworld!
As promised, we’re sharing some release note summaries of the fixes, tweaks, and other updates that we’re making to the Marketplace and the Web properties, so that those following along can read through at their leisure.
12/01/16 - Maps: Maps would disappear at peak use times. That’s fixed now.
11/28/16 - We have a new shiny Grid Status blog! You may notice an updated look and feel. If you followed https://community.secondlife.com/t5/Status-Grid/bg-p/status-blog, be sure to update your subscriptions to status.secondlifegrid.net
11/22/16 - No more slurl.com. All http://maps.secondlife.com/ all the time.
11/21/16 - We did a minor deploy to the lindenlab.com web properties.
11/09/16 - Events infrastructure stabilization to fix a few listing bugs.
11/08/16 - Fixes to maps.secondlife.com were released, including:
- Viewing a specific location on maps.secondlife.com no longer throws a 404 error in the console
- Adding a redirect from slurl.com/secondlife/ requests to maps.secondlife.com
11/04/16 - A minor Security fix was released.
11/03/16 - We released a large infrastructure update to secondlife.com along with security fixes and several minor bug fixes.
As always, we appreciate and welcome your bug reports in Jira!
Stay tuned to the blogs for future updates as we complete new releases.
There's been a lot going on with the Marketplace and our Web properties, and in an effort to give you a more granular view into what we're working on, we're going to put out release note summaries on this blog going forward. Of course, some things will have to remain behind the scenes, but here's all the news that's fit to print:
10/31/16 New Premium Landing page
10/28/16 Several bug fixes to the support portal support.secondlife.com
10/24/16 We made an update to the Marketplace with the following changes:
- Fix sorting reviews by rating
- Fix duplicate charging for PLE subscriptions
- Fix some remaining hangers-on from the VMM migration (unassociated items dropdown + “Your store has been migrated” notifications)
- Fix Boolean search giving overly broad results (BUG-37730)
10/18/16 Maps: We deployed a fix for “Create Your Own Map” link, which used to generate an invalid slurl.
10/11/16 Marketplace: We disabled fuzzy matches in search on the Marketplace so that search results will be more precise.
10/10/16 We made an update to the Marketplace with the following changes:
- We will no longer index archived listings
- We will now reindex a store's products when the store is renamed
- We made it so that blocked users can no longer send gifts through the Marketplace
- We added a switch to allow us to enable or disable fuzzy matches in search
9/28/16 We deployed a fix to the Marketplace for an issue where a Firefox update caused browser-specific style sheet settings to be ignored.
9/22/16 We made a change to the Join flow for more consistency in password requirements.
9/22/16 We updated System Requirements to reflect the newest information.
As always, we appreciate and welcome your bug reports in Jira! Please stay tuned to the blogs for updates as we complete new releases.
As many Residents saw, we had a pretty rough day on the Grid yesterday. I wanted to take a few minutes and explain what happened. All of the times in this blog post are going to be in Pacific Time, aka SLT.
Shortly after 10:30am, the master node of one of the central databases crashed. This is the same type of crash we’ve experienced before, and we handled it in the same way. We shut down a lot of services (including logins) so we could bring services back up in an orderly manner, and then promptly selected a new master and promoted it up the chain. This took roughly an hour, as it usually does.
A few minutes before 11:30am we started the process of restoring all services to the Grid. When we enabled logins, we did it in our usual way: turning on about half of the servers at once. Normally this works pretty well as a throttle, but in this case we were well into a very busy part of the day. Demand to log in was very high, and the number of Residents trying to log in at once was more than the new master database node could handle.
Around noon we made the call to close off logins again and allow the system to cool off. While we were waiting for things to settle down we did some digging to try to figure out what was unique about this failure, and what we’ll need to do to prevent it next time.
We tried again at roughly 12:30pm, doing a third of the login hosts at a time, but this too was too much. We had to stop on that attempt and shut down all logins again around 1:00pm.
On our third attempt, which started once the system cooled down again, we took it really slowly, and brought up each login host one at a time. This worked, and everything was back to normal around 2:30pm.
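The staged approach in that third attempt can be sketched as follows. This is a hypothetical illustration; the host names, the load callback, and the thresholds are all assumptions, not our real tooling:

```python
import time

def restore_logins(hosts, batch_size, db_load, max_load=0.8, cooldown=60):
    """Re-enable login hosts a batch at a time, pausing whenever a
    (hypothetical) database load metric says the newly promoted
    master is running too hot. batch_size=1 is the 'one host at a
    time' pace that finally worked."""
    enabled = []
    for i in range(0, len(hosts), batch_size):
        enabled.extend(hosts[i:i + batch_size])
        # db_load is a caller-supplied callback returning 0.0-1.0;
        # in production this would query a real metrics system.
        while db_load(len(enabled)) > max_load:
            time.sleep(cooldown)  # let the master cool off before continuing
    return enabled
```

The key design choice is making the pace adaptive: each batch waits for the database to confirm it can take more load, instead of enabling a fixed fraction of hosts on a timer.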
My team is trying to figure out why we had to turn the login servers back on much more slowly than in the past. We’re still not sure. It’s a pretty interesting challenge, and solving hard problems is part of the fun of running Second Life.
Voice services also went down around this time, but for a completely unrelated reason. It was just bad luck and timing.
We did have one bright spot! Our status blog handled the load of thousands of Residents checking it at once much better this time. We know it wasn’t perfect, but it showed real improvement over the last central database failure, and we’ll keep getting better.
My team takes the stability of Second Life very seriously, and we’re sorry about this outage. We now have a new challenging problem to solve, and we’re on it.
Hi! I’m a member of the Second Life Operations team. On Friday afternoon, major parts of Second Life had some unplanned downtime, and I want to take a few minutes to explain what happened.
Shortly before 4:15pm PDT/SLT last Friday (May 6th, 2016), the primary node for one of the central databases that drive Second Life crashed. The database node that crashed holds some of the most core data to Second Life, and a whole lot of things stop working when it’s inaccessible, as a lot of Residents saw.
When the primary node in this database is offline we turn off a bunch of services, so that we can bring the grid back up in a controlled manner by turning them back on one at a time.
My team quickly sprang into action, and we were able to promote one of the replica nodes up the chain to replace the primary node that had crashed. All services were fully restored and turned back on in just under an hour.
One additional (and totally unexpected) problem that came up is that for the first part of the outage, our status blog was inaccessible. Our support team uses our status blog to inform Residents of what’s going on when there are problems, and the amount of traffic it receives during an outage is pretty impressive!
A few weeks ago we moved our status blog to new servers. It can be really hard to tune a system for something like a status blog, because traffic will jump from its normal level to many, many times that very suddenly. We now see that we have some additional tuning to do now that the blog is in its new home. (Don’t forget that you can also follow us on Twitter at @SLGridStatus. It’s really handy when the status blog is inaccessible!)
As Landon Linden wrote a year ago, being around my team during an outage is like watching “a ballet in a war zone.” We work hard to restore Second Life services the moment they break, and this outage was no exception. It can be pretty crazy at times!
We’re really sorry for the unexpected downtime late last week. There’s a lot of fun things that happen inworld on Friday night, and the last thing we want is for technical issues to get in the way.
Hi! I wanted to take a moment to share why we had to do a full grid roll on a Friday. We know that Friday grid rolls are super disruptive, and we felt it was important to explain why this one was timed the way it was.
Second Life is run on a collection of thousands of Linux servers, which we call the “grid.” This week there was a critical security warning issued for one of the core system libraries (glibc), that we use on our version of Linux. This security vulnerability is known as CVE-2015-7547.
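CVE-2015-7547 is a stack-based buffer overflow in glibc's getaddrinfo(); it affects glibc 2.9 through 2.22 and was fixed upstream in 2.23. A rough first-pass version screen might look like the sketch below. This is illustrative only; distributions backport the fix without bumping the version string, so a version check alone is never definitive:

```python
def glibc_possibly_vulnerable(version: str) -> bool:
    """First-pass screen for CVE-2015-7547: the bug entered glibc in
    2.9 and was fixed upstream in 2.23. Backported distro patches
    don't change the version string, so a False here is reliable but
    a True still needs checking against the distro's own advisory."""
    major, minor = (int(x) for x in version.split(".")[:2])
    return (2, 9) <= (major, minor) < (2, 23)
```

On a running system, `ldd --version` reports the installed glibc version.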
Since then we’ve been working around-the-clock to make sure Second Life is secure.
The issue came to light on Tuesday morning, and the various Linux distributions made patches available shortly afterwards. Our security team quickly assessed the impact it might have on the grid, and shortly after lunchtime on Tuesday they determined that under certain circumstances the vulnerability could affect Second Life, so we sprang into action to get the grid fully patched.
The security team then handed the issue over to the Operations team, who worked to make the updates needed to the machine images we use. They finished in the middle of the night on Tuesday (which was actually early Wednesday morning).
Once the updates were available, the development and release teams sprang into action and pulled them into the Second Life Server machine image. By Wednesday afternoon the Second Life Server code was built and tested, and the security team had confirmed that any potential risk had been taken care of.
After this, the updates were sent to the Quality Assurance (QA) team to make sure that Second Life still functioned as it should, and they finished up in the middle of the night on Wednesday.
At this point we had a decision to make - do we want to roll the code to the full grid at once? We decided that since the updates were to one of the most core libraries, we should be extra careful, and decided to roll the updates to the Release Candidate (RC) channels first. That happened on Thursday morning.
We took Thursday to watch the RC channels and make sure they were still performing well, and then went ahead and rolled the security update to the rest of the grid on Friday.
Just to make it clear, we saw no evidence that there was any attempt to use this security issue against Second Life. It was our mission to make sure it stayed that way!
The reason there was little notice for the roll on Thursday is twofold. First, we were moving very quickly, and second, because the roll was to mitigate a security issue, we didn’t want to tip our hand and show what was going on until after the issue had been fully resolved.
We know how disruptive full grid rolls are, and we know how busy Friday is for Residents inworld. The timing was terrible, but we felt it was important to get the security update on the full grid as quickly as we could.
Thank you for your patience, and we’re sorry for the bumpy ride on a Friday.
Over the past week, a number of Second Life customers may have noticed that they were not being billed promptly for their Premium membership subscriptions, mainland tier fees, and monthly private region fees, with some customers inadvertently receiving delinquent balance notices by email, as we described on our status blog.
This incident has now been corrected, and our nightly billing system has since processed all users that should have been billed over the past week.
I wanted to share with you some of the details of what happened to cause this outage from our internal post mortem and more importantly, what we’re doing to prevent this happening in the future.
Every night, one of our batch processes collects a list of users that should be billed on that day, and processes that list through one of our internal data service subsystems. Internally, we refer to this process as the 'Nightly Biller'.
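In outline, a batch job like that might look like the sketch below. This is a minimal illustration with invented names; the real Nightly Biller's schema and services are internal:

```python
from datetime import date

def nightly_biller(accounts, charge, today=None):
    """Sketch of a nightly billing pass: select the accounts whose
    next billing date has arrived, push each through the billing
    service, and keep going past individual failures so one bad
    record can't stall the whole batch."""
    today = today or date.today()
    due = [a for a in accounts if a["next_bill"] <= today]
    failed = []
    for account in due:
        try:
            charge(account)  # stand-in for the data service subsystem call
        except Exception:
            failed.append(account["id"])
    return len(due), failed
```

A regression in the service behind `charge` is exactly the failure mode described here: the selection step still finds everyone due, but the processing step can no longer run to completion.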
A regularly scheduled deploy to that same data service subsystem for a new internal feature inadvertently contained a regression which prevented this Nightly Biller process from running to completion.
On February 1st, 2016, we began a rolling deploy of code to one of our internal dataservice subsystems. For this particular deploy, we opted to deploy the code to the backend machines over four days, deploying to six hosts each day. The deploy was completed on February 4th.
The first support issue regarding billing was raised on February 8th; however, as only one incident had been reported to our payments team, we decided to wait and see whether that user was billed correctly the next night.
However, we were notified on the morning of February 9th that 546 private regions had not been billed, and an internal investigation began with a team assembled from Second Life Engineering, Payments, QA and Network Operations teams. This team identified the regression by 9am, and had pushed the required code fixes to our build system. By noon, the proposed fix was pushed to our staging environment for testing.
Unfortunately, testing overnight uncovered a further problem with this new code that would have prevented new users from being able to join Second Life. On February 10th, investigation into this failure, and how it was connected with both the Nightly Biller system, and our new internal tool code continued.
By February 11th, we had made the decision to roll back to the previous code version that would have allowed the Nightly Biller to complete successfully, but would have disabled our new internal feature. One final review of the new code uncovered an issue with an outdated version of some Linden-specific web libraries. Once these libraries were updated and deployed to our staging environment, our QA team were able to successfully complete the tests for our Nightly Biller, our new internal tool, and the User Registration flow.
The new code was pushed out to our production dataservice subsystem by 7pm on February 11th, and the Payments team were able to confirm that the Nightly Biller ran successfully later that evening.
As a result of this incident, we’re making some internal process changes:
- Firstly, we’ll be changing our build system to ensure that when new code is built, we’re always using the latest version of our internal libraries.
- Secondly, we are implementing changes to our workflow around code deploys to ensure that such regressions do not occur in the future.
We're always striving for low-risk software deploys at the Lab, and each code deploy request is evaluated for its potential risk level. We also maintain internal documentation describing the release process to further reduce risk. Unfortunately, a key step in that process was missed, which inadvertently led to a high-risk situation and the failure of our Nightly Biller. The changes above are already in progress and will reduce the likelihood of incidents like this recurring.
Chris Linden here. I wanted to briefly share an example of some of the interesting challenges we tackle on the Systems Engineering team here at Linden Lab. We recently noticed strange failures while testing a new API endpoint hosted in Amazon. Out of every 100 HTTPS requests to the endpoint from our datacenter, 1 or 2 would hang and eventually time out. Strange. We understand that networks are unreliable, so we write our applications to handle that and try to make things more reliable when we can.
We began to dig. Did it happen from other hosts in the datacenter? Yes. Did it fail from all hosts? No, which was maddening. Did it happen from outside the datacenter? No. Did it happen from different network segments within our datacenter? Yes.
Sadly, this left our core routers as the only similar piece of hardware between the hosts showing failures and the internet at large. We did a number of traceroutes to get an idea of the various paths being used, but saw nothing out of the ordinary. We took a number of packet captures and noticed something strange on the sending side.
1521 9.753127 216.82.x.x -> 184.108.40.206 TCP 74 53819 > 443 [SYN] Seq=0 Win=29200 Len=0 MSS=1400 SACK_PERM=1 TSval=2885304500 TSecr=0 WS=128
1525 9.773753 220.127.116.11 -> 216.82.x.x TCP 74 443 > 53819 [SYN, ACK] Seq=0 Ack=1 Win=26960 Len=0 MSS=1360 SACK_PERM=1 TSval=75379683 TSecr=2885304500 WS=128
1526 9.774010 216.82.x.x -> 18.104.22.168 TCP 66 53819 > 443 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=2885304505 TSecr=75379683
1527 9.774482 216.82.x.x -> 22.214.171.124 SSL 583 Client Hello
1528 10.008106 216.82.x.x -> 126.96.36.199 SSL 583 [TCP Retransmission] Client Hello
1529 10.292113 216.82.x.x -> 188.8.131.52 SSL 583 [TCP Retransmission] Client Hello
1530 10.860219 216.82.x.x -> 184.108.40.206 SSL 583 [TCP Retransmission] Client Hello
We saw the TCP handshake complete, then at the SSL portion the far side just stopped responding. This happened on each failure. Dropping packets is normal. Dropping them consistently at the Client Hello every time? Very odd. We looked more closely at the datacenter host and the Amazon instance. We poked at MTU settings, Path MTU Discovery, bugs in the Xen Hypervisor, TCP segmentation settings, and NIC offloading. Nothing fixed the problem.
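Spotting this pattern by eye gets tedious fast; counting retransmissions per destination can be automated with a few lines, assuming tshark-style one-line summaries like the excerpt above:

```python
import re

def count_retransmissions(lines):
    """Tally TCP retransmissions per destination address in
    tshark-style one-line packet summaries. A run of retransmissions
    to the same destination right after the Client Hello is the
    signature we kept seeing."""
    counts = {}
    for line in lines:
        m = re.search(r"->\s+(\S+)\s+.*TCP Retransmission", line)
        if m:
            dst = m.group(1)
            counts[dst] = counts.get(dst, 0) + 1
    return counts
```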
We decided to look at our internet service providers in our datacenter. We are multi-homed to the internet for redundancy and, like most of the internet, use Border Gateway Protocol to determine which path our traffic takes to reach a destination. While we can influence the path it takes, we generally don't need to.
We looked up routes to Amazon on our routers and determined that the majority of them prefer going out ISP A. We found a couple of routes to Amazon that preferred to go out ISP B, so we dug through regions in AWS, spinning up Elastic IP addresses until we found one in the route preferring to go out ISP B. It was in Ireland. We spun up an instance in eu-west-1 and hit it with our test and ... no failures. We added static routes on our routers to force traffic to instances in AWS that were previously seeing failures. This allowed us to send requests to these test hosts either via ISP A or ISP B, based on a small configuration change. ISP A always saw failures, ISP B didn't.
We manipulated the routes to send outbound traffic from our datacenter to Amazon networks via the ISP B network. Success. While in place, traffic preferred going out ISP B (the network that didn't show failures), but would fall back to going out ISP A if for any reason ISP B went away.
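The preference-with-fallback behavior mirrors ordinary BGP best-path selection: the most specific (longest) prefix wins, and local preference breaks ties among equal prefixes. A toy version of that selection is sketched below; the route dictionaries and ISP names are invented, and real BGP has many more tie-breakers:

```python
import ipaddress

def best_route(dest, routes):
    """Pick a route BGP-style: longest prefix match first, then the
    higher local preference. Raising local_pref on ISP B's routes is
    how traffic is nudged that way while ISP A stays available as a
    fallback if ISP B's routes disappear."""
    ip = ipaddress.ip_address(dest)
    matches = [r for r in routes if ip in ipaddress.ip_network(r["prefix"])]
    if not matches:
        return None  # no route: traffic would be dropped or hit a default
    return max(matches,
               key=lambda r: (ipaddress.ip_network(r["prefix"]).prefixlen,
                              r["local_pref"]))
```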
After engaging with ISP A, they found an issue with a piece of hardware within their network and replaced it. We have verified that we no longer see any of the same failures and have rolled back the changes that manipulated traffic. We chalk this up as a win, and by resolving the connection issues we've been able to make Second Life that much more reliable.
Hi! I’m a member of the Second Life operations team, and I was the primary on-call systems engineer this past weekend. We had a very difficult weekend, so I wanted to take a few minutes to share what happened.
We had a series of independent failures happen that produced the rough waters Residents experienced inworld.
Shortly after midnight Pacific time on January 9th (Saturday), the master node of one of the central databases crashed. The database that happened to go down is one of the most used in Second Life. Without it, Residents are unable to log in, or do, well, a lot of important things.
This sort of failure is something my team is good at handling, but it takes time for us to promote a replica up the chain to ultimately become the new master node. While we’re doing this we block logins and close other inworld services to help take the pressure off the newly promoted master node when it starts taking queries. (We reopen the grid slowly, turning on services one at a time, as the database is able to handle it.) The promotion process took about an hour and a half, and the grid returned to normal by 1:30am.
After this promotion took place the grid was stable the rest of the day on Saturday, and that evening.
That brings us to Sunday morning.
Around 8:00am Pacific on January 10th (Sunday), one of our providers started experiencing issues, which resulted in very poor performance loading assets inworld. I very quickly got on the phone with them as they tracked down the source of the issue. With my team and the remote team working together, we were able to spot the problem and get it resolved by early afternoon. All of our metrics looked good, and my colleagues and I were able to rez assets inworld just fine. It was at this point that we posted the first “All Clear” on the blog, because it appeared that things were back to normal.
It didn’t take us long to realize that things were about to get interesting again, however.
Shortly after we declared all clear, Residents rushed to return to the grid. (Sunday afternoon is a very busy time inworld, even under normal circumstances!) The rush of Residents returning to Second Life (a lot of whom now had empty caches that needed to be re-filled) at a time when our concurrency is the highest put many other subsystems under several times their normal load.
Rezzing assets was now fine, but we had other issues to figure out. It took us a few more hours after the first all clear for us to be able to stabilize our other services. As some folks noticed, the system that was under the highest load was the one that does what we call “baking” - it’s what makes the texture you see on your avatar - thus we had a large number of Residents that either appeared gray, or as clouds. (It was still trying to get caught up from the asset loading outage earlier!) By Sunday evening we were able to re-stabilize the grid, and Second Life returned to normal for real.
One of the things I like about my job is that Second Life is a totally unique and fun environment! (The infrastructure of a virtual world is amazing to me!) This is both good and bad. It’s good because we’re often challenged to come up with a solution to a problem that’s new and unique, but the flip side of this is that sometimes things can break in unexpected ways because we’re doing things that no one else does.
I’m really sorry for how rough things were inworld this weekend. My team takes the stability of the grid extremely seriously, and no one dislikes downtime more than us. Either one of these failures happening independently is bad enough, but having them occur in a series like that is fairly miserable.
See you inworld (after I get some sleep!),