April Linden here. I’m a member of the Second Life Operations team. Second Life had some unexpected downtime on Monday morning, and I wanted to take a few minutes to explain what happened.
We buy bandwidth to our data centers from several providers. On Monday morning, one of those providers had a hardware failure on the link that connects Second Life to the Internet. This is a fairly normal thing to happen (and is why we have more than one Internet provider). This time was a bit unusual, as the traffic from our Residents on that provider did not automatically spill over to one of the other connections, as it usually does.
Our ops team caught this almost immediately and were able to shift traffic around to the other providers, but not before a whole bunch of Residents had been logged out due to Second Life being unreachable.
Since a bunch of Residents were unexpectedly logged out, they all tried to log back in at once. This rush of logins was very high, and it took quite a while for everyone to get logged back in. Our ops team brought some additional login servers online to help with the backlog of login attempts, and this allowed the login queue to eventually return to its normal size.
Some time after the login rush was completed the failed Internet provider connection was restored, and traffic shifted around normally, without disruption, returning Second Life back to normal.
There was a bright spot in this event! Our new status blog performed very well, allowing our support team to be able to communicate with Residents, even in a state where it was under much higher load than normal.
We’re very sorry for the unexpected downtime on Monday morning. We know how important having a fun time Inworld is to our Residents, and we know how unfun events like this can be.
See you Inworld!
As promised, we’re sharing some release note summaries of the fixes, tweaks., and other updates that we’re making to the Marketplace and the Web properties, so that those following along can read through at their leisure.
12/01/16 - Maps: Maps would disappear at peak use times. That’s fixed now.
11/28/16 - We have a new shiny Grid Status blog! You may notice an updated look and feel. If you followed https://community.secondlife.com/t5/Status-Grid/bg
11/22/16 - No more slurl.com. All http://maps.secondlife.com/ all the time.
11/21/16 - We did a minor deploy to the lindenlab.com web properties.
11/09/16 - Events infrastructure stabilization to fix a few listing bugs.
11/08/16 - Fixes to maps.secondlife.com were released, including:
- Viewing a specific location on maps.secondlife.com no longer throws a 404 error in the console
- Adding a redirect from slurl.com/secondlife/ requests to maps.secondlife.com
11/04/16 - A minor Security fix was released.
11/03/16 - We released a large infrastructure update to secondlife.com along with security fixes and several minor bug fixes.
As always, we appreciate and welcome your bug reports in Jira!
Stay tuned to the blogs for future updates as we complete new releases.
There's been a lot going on with the Marketplace and our Web properties, and in an effort to give you a more granular view into what we're working on, we're going to put out release notes summaries on this blog going forward. Of course, some things will have to remain behind the scenes, but here are all the news that's fit to print:
10/31/16 New Premium Landing page
10/28/16 Several bug fixes to the support portal support.secondlife.com
10/24/16 We made an update to the Marketplace with the following changes:
- Fix sorting reviews by rating
- Fix duplicate charging for PLE subscriptions
- Fix some remaining hangers-on from the VMM migration (unassociated items dropdown + “Your store has been migrated” notifications
Fix to Boolean search giving overly broad results (BUG-37730)
10/18/16 Maps: We deployed a fix for “Create Your Own Map” link, which used to generate an invalid slurl.
10/11/16 Marketplace: We disabled fuzzy matches in search on the Marketplace so that search results will be more precise.
10/10/16 We made an update to the Marketplace with the following changes:
- We will no longer index archived listings
- We will now reindex a store's products when the store is renamed
- We made it so that blocked users can no longer send gifts through the Marketplace
We added a switch to allow us to enable or disable fuzzy matches in search
9/28/16 We deployed a fix to the Marketplace for an issue where a Firefox update was ignoring browser-specific style sheet settings on Marketplace.
9/22/16 We made a change to the Join flow for more consistency in password requirements.
9/22/16 We updated System Requirements to reflect the newest information.
As always, we appreciate and welcome your bug reports in Jira! Please stay tuned to the blogs for updates as we complete new releases.
As many Residents saw, we had a pretty rough day on the Grid yesterday. I wanted to take a few minutes and explain what happened. All of the times in this blog post are going to be in Pacific Time, aka SLT.
Shortly after 10:30am, the master node of one of the central databases crashed. This is the same type of crash we’ve experienced before, and we handled it in the same way. We shut down a lot of services (including logins) so we could bring services back up in an orderly manner, and then promptly selected a new master and promoted it up the chain. This took roughly an hour, as it usually does.
A few minutes before 11:30am we started the process of restoring all services to the Grid. When we enabled logins, we did it in our usual method - turning on about half of the servers at once. Normally this works out as a throttle pretty well, but in this case, we were well into a very busy part of the day. Demand to login was very high, and the number of Residents trying to log in at once was more than the new master database node could handle.
Around noon we made the call to close off logins again and allow the system to cool off. While we were waiting for things to settle down we did some digging to try to figure out what was unique about this failure, and what we’ll need to do to prevent it next time.
We tried again at roughly 12:30pm, doing a third of the login hosts at a time, but this too was too much. We had to stop on that attempt and shut down all logins again around 1:00pm.
On our third attempt, which started once the system cooled down again, we took it really slowly, and brought up each login host one at a time. This worked, and everything was back to normal around 2:30pm.
My team is trying to figure out why we had to turn the login servers back on much more slowly than in the past. We’re still not sure. It’s a pretty interesting challenge, and solving hard problems is part of the fun of running Second Life.
Voice services also went down around this time, but for a completely unrelated reason. It was just bad luck and timing.
We did have one bright spot! Our status blog handled the load of thousands of Residents checking it all at once much better. We know it wasn’t perfect, but it showed much improvement over the last central database failure, and we’ll keep getting better
My team takes the stability of Second Life very seriously, and we’re sorry about this outage. We now have a new challenging problem to solve, and we’re on it.
Hi! I’m a member of the Second Life Operations team. On Friday afternoon, major parts of Second Life had some unplanned downtime, and I want to take a few minutes to explain what happened.
Shortly before 4:15pm PDT/SLT last Friday (May 6th, 2016), the primary node for one of the central databases that drive Second Life crashed. The database node that crashed holds some of the most core data to Second Life, and a whole lot of things stop working when it’s inaccessible, as a lot of Residents saw.
When the primary node in this database is offline we turn off a bunch of services, so that we can bring the grid back up in a controlled manner by turning them back on one at a time.
My team quickly sprung into action, and we were able to promote one of the replica nodes up the chain to replace the primary node that had crashed. All services were fully restored and turned back on in just under an hour.
One additional (and totally unexpected) problem that came up is that for the first part of the outage, our status blog was inaccessible. Our support team uses our status blog to inform Residents of what’s going on when there are problems, and the amount of traffic it receives during an outage is pretty impressive!
A few weeks ago we moved our status blog to new servers. It can be really hard to tune a system for something like a status blog, because the traffic will go from its normal amount to many, many times that very suddenly. We see we now have some additional tuning we need to do with the status blog now that it’s in its new home. (Don’t forget that you can also follow us on Twitter at @SLGridStatus. It’s really handy when the status blog in inaccessible!)
As Landon Linden wrote a year ago, being around my team during an outage is like watching “a ballet in a war zone.” We work hard to restore Second Life services the moment they break, and this outage was no exception. It can be pretty crazy at times!
We’re really sorry for the unexpected downtime late last week. There’s a lot of fun things that happen inworld on Friday night, and the last thing we want is for technical issues to get in the way.
Hi! I wanted to take a moment to share why we had to do a full grid roll on a Friday. We know that Friday grid rolls are super disruptive, and we felt it was important to explain why this one was timed the way it was.
Second Life is run on a collection of thousands of Linux servers, which we call the “grid.” This week there was a critical security warning issued for one of the core system libraries (glibc), that we use on our version of Linux. This security vulnerability is known as CVE-2015-7547.
Since then we’ve been working around-the-clock to make sure Second Life is secure.
The issue came to light on Tuesday morning, and the various Linux distributions made patches for the issue available shortly afterwards. Our security team quickly took a look at it, and assessed the impact it might have on the grid. They were able to determine that under certain situations this might impact Second Life, so we sprang into action to get the grid fully patched. They were able to make this determination shortly after lunch time on Tuesday.
The security team then handed the issue over to the Operations team, who worked to make the updates needed to the machine images we use. They finished in the middle of the night on Tuesday (which was actually early Wednesday morning).
Once the updates were available, the development and release teams sprung into action, and pulled the updates into the Second Life Server machine image. This took until Wednesday afternoon to get the Second Life Server code built, tested, and the security team confirmed that any potential risk had been taken care of.
After this, the updates were sent to the Quality Assurance (QA) team to make sure that Second Life still functioned as it should, and they finished up in the middle of the night on Wednesday.
At this point we had a decision to make - do we want to roll the code to the full grid at once? We decided that since the updates were to one of the most core libraries, we should be extra careful, and decided to roll the updates to the Release Candidate (RC) channels first. That happened on Thursday morning.
We took Thursday to watch the RC channels and make sure they were still performing well, and then went ahead and rolled the security update to the rest of the grid on Friday.
Just to make it clear, we saw no evidence that there was any attempt to use this security issue against Second Life. It was our mission to make sure it stayed that way!
The reason there was little notice for the roll on Thursday is two fold. First, we were moving very quickly, and second because the roll was to mitigate a security issue, we didn’t want to tip our hand and show what was going on until after the issue had been fully resolved.
We know how disruptive full grid rolls are, and we know how busy Friday is for Residents inworld. The timing was terrible, but we felt it was important to get the security update on the full grid as quickly as we could.
Thank you for your patience, and we’re sorry for the bumpy ride on a Friday.
Over the past week, a number of Second Life customers may have noticed that they were not being billed promptly for their Premium membership subscriptions, mainland tier fees, and monthly private region fees, with some customers inadvertently receiving delinquent balance notices by email, as we described on our status blog.
This incident has now been corrected, and our nightly billing system has since processed all users that should have been billed over the past week.
I wanted to share with you some of the details of what happened to cause this outage from our internal post mortem and more importantly, what we’re doing to prevent this happening in the future.
Every night, one of our batch processes collects a list of users that should be billed on that day, and processes that list through one of our internal data service subsystems. Internally, we refer to this process as the 'Nightly Biller'.
A regularly scheduled deploy to that same data service subsystem for a new internal feature inadvertently contained a regression which prevented this Nightly Biller process from running to completion.
On February 1st, 2016, we began a rolling deploy of code to one of our internal dataservice subsystems. For this particular deploy, we opted to deploy the code to the backend machines over four days, deploying to six hosts each day. The deploy was completed on February 4th.
The first support issue regarding billing issues was raised on February 8th, however, as we only had one incident reported to our payments team, we decided to wait and see if they were billed correctly the next night.
However, we were notified on the morning of February 9th that 546 private regions had not been billed, and an internal investigation began with a team assembled from Second Life Engineering, Payments, QA and Network Operations teams. This team identified the regression by 9am, and had pushed the required code fixes to our build system. By noon, the proposed fix was pushed to our staging environment for testing.
Unfortunately, testing overnight uncovered a further problem with this new code that would have prevented new users from being able to join Second Life. On February 10th, investigation into this failure, and how it was connected with both the Nightly Biller system, and our new internal tool code continued.
By February 11th, we had made the decision to roll back to the previous code version that would have allowed the Nightly Biller to complete successfully, but would have disabled our new internal feature. One final review of the new code uncovered an issue with an outdated version of some Linden-specific web libraries. Once these libraries were updated and deployed to our staging environment, our QA team were able to successfully complete the tests for our Nightly Biller, our new internal tool, and the User Registration flow.
The new code was pushed out to our production dataservice subsystem by 7pm on February 11th, and the Payments team were able to confirm that the Nightly Biller ran successfully later that evening.
As a result of this incident, we’re making some internal process changes:
- Firstly, we’ll be changing our build system to ensure that when new code is built, we’re always using the latest version of our internal libraries.
- Secondly, we are implementing changes to our workflow around code deploys to ensure that such regressions do not occur in the future
We're always striving for low risk software deploys at the Lab, and each code deploy request made is evaluated for a potential risk level. Further reducing our risk is internal documentation that describes the release process. Unfortunately, a key step was missed in the process, which inadvertently led to a high risk situation, and the failure of our nightly biller. The above changes are already in progress which will reduce the likelihood of incidents such as this recurring.
Chris Linden here. I wanted to briefly share an example of some of the interesting challenges we tackle on the Systems Engineering team here at Linden Lab. We recently noticed strange failures while testing a new API endpoint hosted in Amazon. Out of 100 https requests to the endpoint from our datacenter, 1 or 2 of the requests would hang and eventually time out. Strange. We understand that networks are unreliable, so we write our applications to handle it and try to make it more reliable when we can.
We began to dig. Did it happen from other hosts in the datacenter? Yes. Did it fail from all hosts? No, and this was maddening to be true. Did it happen from outside the datacenter? No. Did it happen from different network segments within our datacenter? Yes.
Sadly, this left our core routers as the only similar piece of hardware between the hosts showing failures and the internet at large. We did a number of traceroutes to get an idea of the various paths being used, but saw nothing out of the ordinary. We took a number of packet captures and noticed something strange on the sending side.
1521 9.753127 216.82.x.x -> 126.96.36.199 TCP 74 53819 > 443 [SYN] Seq=0 Win=29200 Len=0 MSS=1400 SACK_PERM=1 TSval=2885304500 TSecr=0 WS=128
1525 9.773753 188.8.131.52 -> 216.82.x.x TCP 74 443 > 53819 [SYN, ACK] Seq=0 Ack=1 Win=26960 Len=0 MSS=1360 SACK_PERM=1 TSval=75379683 TSecr=2885304500 WS=128
1526 9.774010 216.82.x.x -> 184.108.40.206 TCP 66 53819 > 443 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=2885304505 TSecr=75379683
1527 9.774482 216.82.x.x -> 220.127.116.11 SSL 583 Client Hello
1528 10.008106 216.82.x.x -> 18.104.22.168 SSL 583 [TCP Retransmission] Client Hello
1529 10.292113 216.82.x.x -> 22.214.171.124 SSL 583 [TCP Retransmission] Client Hello
1530 10.860219 216.82.x.x -> 126.96.36.199 SSL 583 [TCP Retransmission] Client Hello
We saw the tcp handshake happen, then at the SSL portion, the far side just stopped responding. This happened each time there was a failure. Dropping packets is normal. Dropping them consistently at the Client Hello every time? Very odd. We looked more closely at the datacenter host and the Amazon instance. We poked at MTU settings, Path MTU Discovery, bugs in Xen Hypervisor, tcp segmentation settings, and NIC offloading. Nothing fixed the problem.
We decided to look at our internet service providers in our datacenter. We are multi-homed to the internet for redundancy and, like most of the internet, use Border Gateway Protocol to determine which path our traffic takes to reach a destination. While we can influence the path it takes, we generally don't need to.
We looked up routes to Amazon on our routers and determined that the majority of them prefer going out ISP A. We found a couple of routes to Amazon that preferred to go out ISP B, so we dug through regions in AWS, spinning up Elastic IP addresses until we found one in the route preferring to go out ISP B. It was in Ireland. We spun up an instance in eu-west-1 and hit it with our test and ... no failures. We added static routes on our routers to force traffic to instances in AWS that were previously seeing failures. This allowed us to send requests to these test hosts either via ISP A or ISP B, based on a small configuration change. ISP A always saw failures, ISP B didn't.
We manipulated the routes to send outbound traffic from our datacenter to Amazon networks via the ISP B network. Success. While in place, traffic preferred going out ISP B (the network that didn't show failures), but would fall back to going out ISP A if for any reason ISP B went away.
After engaging with ISP A, they found an issue with a piece of hardware within their network and replaced it. We have verified that we no longer see any of the same failures and have rolled back the changes that manipulated traffic. We chalk this up as a win, and by resolving the connection issues we've been able to make Second Life that much more reliable.
Hi! I’m a member of the Second Life operations team, and I was the primary on-call systems engineer this past weekend. We had a very difficult weekend, so I wanted to take a few minutes to share what happened.
We had a series of independent failures happen that produced the rough waters Residents experienced inworld.
Shortly after midnight Pacific time on January 9th (Saturday) we had the master node of one of the central databases crash. The central database that happened to go down was one the most used databases in Second Life. Without it Residents are unable to log in, or do, well, a lot of important things.
This sort of failure is something my team is good at handling, but it takes time for us to promote a replica up the chain to ultimately become the new master node. While we’re doing this we block logins and close other inworld services to help take the pressure off the newly promoted master node when it starts taking queries. (We reopen the grid slowly, turning on services one at a time, as the database is able to handle it.) The promotion process took about an hour and a half, and the grid returned to normal by 1:30am.
After this promotion took place the grid was stable the rest of the day on Saturday, and that evening.
That brings us to Sunday morning.
Around 8:00am Pacific on January 10th (Sunday), one of our providers start experiencing issues, which resulted in very poor performance in loading assets inworld. I very quickly got on the phone with them as they tracked down the source of the issue. With my team and the remote team working together we were able to spot the problem, and get it resolved by early afternoon. All of our metrics looked good, and I and my colleagues were able to rez assets inworld just fine. It was at this point that we posted the first “All Clear” on the blog, because it appeared that things were back to normal.
It didn’t take us long to realize that things were about to get interesting again, however.
Shortly after we declared all clear, Residents rushed to return to the grid. (Sunday afternoon is a very busy time inworld, even under normal circumstances!) The rush of Residents returning to Second Life (a lot of whom now had empty caches that needed to be re-filled) at a time when our concurrency is the highest put many other subsystems under several times their normal load.
Rezzing assets was now fine, but we had other issues to figure out. It took us a few more hours after the first all clear for us to be able to stabilize our other services. As some folks noticed, the system that was under the highest load was the one that does what we call “baking” - it’s what makes the texture you see on your avatar - thus we had a large number of Residents that either appeared gray, or as clouds. (It was still trying to get caught up from the asset loading outage earlier!) By Sunday evening we were able to re-stabilize the grid, and Second Life returned to normal for real.
One of the things I like about my job is that Second Life is a totally unique and fun environment! (The infrastructure of a virtual world is amazing to me!) This is both good and bad. It’s good because we’re often challenged to come up with a solution to a problem that’s new and unique, but the flip side of this is that sometimes things can break in unexpected ways because we’re doing things that no one else does.
I’m really sorry for how rough things were inworld this weekend. My team takes the stability of the grid extremely seriously, and no one dislikes downtime more than us. Either one of these failures happening independently is bad enough, but having them occur in a series like that is fairly miserable.
See you inworld (after I get some sleep!),
Since its introduction, the Linux version of the Second Life Viewer has been considered a Beta status project, meaning that it might have problems that would not have been considered acceptable on the much more widely used Windows or Mac versions. Because "Linux" isn't really one platform - it's a large (and fluid) number of similar but distinct distributions - doing development, builds, and testing for the Linux version has always been a difficult thing to do and a difficult expense to justify. Today, Linux represents under half of one percent of official Viewer users, and just a little over one percent of users on all viewers. We at Linden Lab need to focus our development efforts on the platforms that will improve the experience of more users.
While we hope to be able to continue to distribute a Linux version, from now on we will rely on the open source community for Linux platform support. Linden Lab will integrate open source community contributions to update the Linux platform support, and will build and distribute the resulting viewers, but our development engineering, including bug fixing, will be focused on the platforms more popular among our users. We hope that the community will take up this challenge; anyone interested in ensuring that their fellow Linux users can continue on their preferred platform is encouraged to reach out to us to find out where help is most needed.
Available now is the ability for LSL to return an avatar’s shape type to scripted objects. With this information, scripters and creators of objects can determine the best animation, pose, or position to play when avatars interact with their objects.
Scripts can now read the avatar’s shape type (male or female) and hover height values. For a complete list of object and avatar agent size details please visit the LLGetObjectDetails() and LLGetAgentSize() wiki pages..
Facebook recently announced plans to deprecate an old Open Graph API and required all apps running 1.0 to force update to 2.0. We have completed this update for SLShare, but Facebook anticipates that the process on their end for migrating a given app may take up to a couple of weeks. During this migration period, there may be some service interruptions for some apps.
This means that when using SLShare (updating status, photo uploads, and check-ins from the Viewer) you may experience some temporary problems. Please be assured that we are aware of this and any issues you encounter should be resolved once the migration period is complete.
Thank you for your patience!
Yesterday, with much rejoicing, we promoted Viewer 188.8.131.520918 (Tools Update) to release. While this Viewer doesn’t have a shiny new featureset on the surface (other than reverting to a single-button login), it’s what’s inside that really matters - we’ve updated the numerous tools used to build the Viewer. The immediate expected effect of this is performance stability and a decreased crash rate.
We go to great lengths to maintain backwards compatibility in Second Life, both to never break users’ creations and to support the wide range of systems our Residents use to log in. However, sometimes we have to make the hard decisions: a year ago we announced that we were dropping support for Windows XP and Mac OSX 10.5 & 10.6 (a complete list of current system requirements is available here). Today, with the Tools Update release, the Viewer will no longer run on those systems. You will still be able to log in with an older Viewer until it is aged out based on our deprecation policy, however we strongly recommend updating your system.
It's unfortunate that we have to stop supporting some older systems, but upgrading the tools we use to build the viewer will help us to bring you other improvements to your Second Life experience more quickly and reliably.
Keeping the systems running the Second Life infrastructure operating smoothly is no mean feat. Our monitoring infrastructure keeps an eye on our machines every second, and a team of people work around the clock to ensure that Second Life runs smoothly. We do our best to replace failing systems proactively and invisibly to Residents. Unfortunately, sometimes unexpected problems arise.
In late July, a hardware failure took down four of our latest-generation of simulator hosts. Initially, this was attributed to be a random failure, and the machine was sent off to our vendor for repair. In early October, a second failure took down another four machines. Two weeks later, another failure on another four hosts.
Each host lives inside a chassis along with three other hosts. These four hosts all share a common backplane that provides the hosts with power, networking and storage. The failures were traced to an overheating and subsequent failure of a component on these backplanes.
After exhaustive investigation with our vendor, the root cause of the failures turned out to be a hardware defect in a backplane component. We arranged an on-site visit by our vendor to locate, identify, and replace the affected backplanes. Members of our operations team have been working this week with our vendor in our datacentre to inspect every potentially affected system and replace the defective component to prevent any more failures.
The region restarts that some of you have experienced this week were an unfortunate side-effect of this critical maintenance work. We have done our best to keep these restarts to a minimum as we understand just how disruptive a region restart can be. The affected machines have been repaired, and returned to service and we are confident that no more failures of this type will occur in the future. Thank you all for your patience and understanding as we have proceeded through the extended maintenance window this week.
Last week we deployed the change to serve all texture and mesh data primarily through the CDN, as we've been doing with avatar textures since March. In addition to reviewing feedback from Residents we've been monitoring and measuring the effects of the change, and thought it would be interesting to share some of what we've learned.
First the good news:
- Load on some key systems on the simulator hosts has been reduced considerably. The chart below shows the frequency of high-load conditions in the simulator web services, and you can see the sharp drop as the CDN takes on much of that job. This translates into other things, including region crossings and teleports, being faster and more reliable.
- For most users most of the time there has been a big performance improvement in texture and mesh data loading, resulting in faster rez times in new areas. The improvement has been realized both on the official viewer and on third party viewers.
However, we have also seen that some users have had the opposite experience, and have worked with a number of those users to collect detailed data on the nature of the problems and shared it with our CDN provider. We believe that the problems are the result of a combination of the considerable additional load we added to the CDN, and a coincidental additional large load on the CDN from another source. Exacerbating matters, flaws in both our viewer code and the CDN caused recovery from these load spikes to be much slower than it should have been. We are working with our CDN provider to increase capacity and to configure the CDN so that Second Life data availability will not be as affected by outside load. We are also making changes to our code and in the CDN to make recovery quicker and more robust.
We are confident that using the CDN for this data will make the Second Life experience better. Making any change to a system at the scale of Second Life has some element of unavoidable risk; no matter how carefully we simulate and test in advance, once you deploy at scale in live systems there's always something to be learned. This change has had some problems for a small percentage of users; unfortunately, for those users the problems were quite serious for at least part of the time. We appreciate all the help we've gotten from users in quickly diagnosing those problems. We think that the changes we've begun making will reduce the frequency of failures to below what they were before we adopted the CDN, while keeping the considerable performance gains.
11.07.2014: An update on performance improvements and adjustments is available here.
Second Life was originally designed for nearly all data and Viewer interactions to go through the Simulator server. That is, the Viewer would talk almost exclusively to the specific server hosting the region the Resident was in. This architecture had the advantage of giving a single point of control for any session. It also had the disadvantage of making it difficult to address region resource problems or otherwise scale out busy areas.
Over the years we’ve implemented techniques to get around these problems, but one pain point proved difficult to fix: asset delivery, specifically textures and meshes. Recently we implemented the ability to move texture and mesh traffic off the simulator server onto a Content Delivery Network (CDN), dramatically improving download times for Residents while significantly reducing the load on busy servers.
Download times for textures and meshes have been reduced by more than 50% on average, but outside of North America those the improvements are even more dramatic. That is great news, but the most amazing improvement has been on the simulator servers themselves. The following chart graphs servers on a production release-candidate channel with high HTTP load conditions before and after we rolled the CDN code onto them:
The high load conditions almost completely disappeared! We knew that we would get a major drop in load with the move, but this blew us away. At first we didn’t believe it and spent two days trying to figure out what we did wrong. There was nothing wrong; this was real.
The results of all of this are faster scene loads, quicker object rezzing, far fewer problems with fuzzy or cloudy avatars, fewer teleport failures, and more! The feedback from Residents has been fantastic. We’re loving it, too! Everything is just so much snappier.
We have finished rolling the CDN code out to the grid, and the results have remained extraordinary. This week, we are also fully releasing our HTTP Project Viewer, which will make the CDN change even better by taking advantage of the elimination of server-side rate limiting. We have been extremely happy with the results so far (psst, we’re talking an 80% reduction in content download times).The CDN benefits are available to everyone regardless of which Viewer you choose to use. All users of the official Viewer will also be able to enjoy the results of the HTTP improvements, and third party developers are able to adopt these changes in their Viewers as well.
We are very happy to be finally releasing these improvements to everyone. Give it a try and let us know what you think!
HTTP Project Recap
Earlier this year we blogged about the HTTP project and how, step-by-step, the project is overcoming various limitations. Viewer release 3.4.3 introduced a new HTTP library that made better use of network resources. Texture fetches were the first operations to take advantage of this library, which improved throughput while using fewer connections. But the viewer was still constrained by a one-request-per-connection model.
Changing that model required back-end modifications. Those shipped early in 2013 in the DRTSIM-203 simulator release. For the first time, texture fetches could re-use existing HTTP connections. And for most users, this doubled the theoretical maximum texture request rate.
But all the world is not textures, and mesh fetches were next to receive attention. Meshes required quite a bit more work. Both back-end and viewer engineering was needed, culminating in viewer release 3.7.2. This release brought mesh fetching into behavioral parity with textures.
These releases have brought the viewer up to the request rate limits of region 'C'. Here, the limits are dictated by serialization, distance, and the speed of light. We are preparing to move beyond this region with changes to concurrency and locality. HTTP request concurrency will be vastly increased by the introduction of HTTP pipelining. Locality will be changed by the use of a Content Delivery Network (CDN) to move texture and mesh data nearer to most users.
Common HTTP communication is a simple back-and-forth exchange of requests and responses. A request is issued, a response is returned, and only then is another request issued. As distance increases between endpoints, the time to perform this ping-pong increases, which lowers the effective request rate. Pipelining attacks this distance-induced loss by issuing multiple requests at once without waiting for responses.
The pipelining viewer will use this more aggressive request model for both texture and mesh fetches. This viewer is currently in QA and is expected to go to RC soon after the 3.7.16 release. It has also had trials outside of North America and the results have been very good. The effective request rate has approached the limits imposed by our servers and has exceeded the download rate of UDP texture fetching.
As for those server limits, the operations team at Linden has been making rapid progress on the CDN project (DRTSIM-258). This project will replicate the Linden services that supply meshes and textures to a CDN's PoPs (Points-of-Presence). With PoPs on multiple continents, request service time will be reduced for most users.
Combined with pipelining, CDN experiments are producing results that have only been dreamed about. How fast? Well, the engineer's universal response of "It depends" applies. But several 100's of fetches per second have been seen far from the USA. This takes us into never before encountered performance territory.
And one more thing
The HTTP Project has focused on textures and meshes. But the inventory system, which maintains item ownership, is often described as... sluggish. So as an exercise in expanding the use of the new HTTP library, the pipelining viewer was modified to use it for inventory fetches. As with textures and meshes before, inventory is now fetching in the 'C' region of its specific performance graph. The difference can be surprising.
For several years, HTTP has figured prominently in Linden's plans for Second Life. "HTTP will give you speed and throughput, consistency and robustness." A promised land, but never quite realized. These next steps are payment on that promise. HTTP done well can support an amazing experience. And you'll have no reason to look back to the V1 world of UDP textures and inventory.
Nous sommes embarqués,
As of today, several projects have reached the Project Viewer stage, and we wanted to share a bit about how you can expect to see your Second Life experience improve with these initiatives.
Graphics Settings Benchmarking:
This is a new way of figuring out the best default graphics settings. Maybe this has happened to you: you got an awesome new graphics card, fired up SL… only to discover your graphics settings are set to Low, and can’t be changed? No more! This Viewer does away with the old GPU table and instead uses a quick benchmark measurement to detect your GPU to assign appropriate default graphics settings on startup. The settings on shiny powerful hardware should really let that hardware shine. Get a Project Benchmark Viewer today and help us gather metrics! Please file bugs in JIRA if you find them.
Installation and Login screen changes:
A new look for the login screen is coming in stages. We’re tweaking the login screen and A/B testing the results. We’ve simplified the new user login screen to remove distractions, and we’re adding instructions for new downloads and installs. We’re also streamlining the returning user login screen to help you get where you’re going faster - or find a new place to visit. You’re likely to see some incremental changes continue over the coming weeks.
Two complementary projects are going in neck-and-neck:
A new texture and mesh asset service (a CDN) is in testing on a limited set of regions, and so far it’s showing encouraging results. Particularly for those who login from places far away from our US data centers, this has the potential to significantly improve how quickly textures and meshes load. In the coming weeks, we will expand the number of regions using this new service as the next step in our testing.
We’re also taking the next step with the HTTP project - Pipelining. We will soon put out a viewer that will pipeline HTTP requests for texture and mesh fetches, improve inventory folder and item fetches, and have some general adjustments for using a CDN-enabled grid.
Separately, each of these will improve texture and mesh loading performance, but put together, you should really see some exciting improvements in how long it takes to load new areas and objects - making touring the many fabulous places in Second Life you have not yet visited even better!
From time to time, incidents occur that our operations team needs to quickly fix in order to keep all of Second Life working well 24x7 for users around the world.
How does the Linden Lab ops team collaborate to quickly tackle these incidents? Our VP of Operations and Platform Engineering, Landon McDowell (Landon Linden), has written a great description of an early experience he had with our approach as well as some thoughts on why it works so well. This is a bit outside the usual “Tools & Tech” topics for this blog, but we thought Second Life users familiar with how operations teams work would appreciate the inside look at our team’s approach:
Two weeks into my tenure in the Operations group at Linden Lab I was confronted with my first major incident there. It was early afternoon, and I was well into a post-ramen food coma when alarms started popping off in IRC. All of our major charts were taking a header - logins, concurrency, etc.
The call went out in #ops for hands, but I had already jumped in. This wasn’t my first rodeo. I was primed to hop onto a conference call or pile into a room to marshall a response. But that never happened.
Instead, responders starting piping up in IRC with, “Hands.” Soon I was completely overwhelmed by a stream of text flying across my screen as engineers were reporting back and discussing findings.
The problem was quickly narrowed down to a particular load balancer. I was barely into the box before an engineer chimed in, “It's running out of ports.” From there the resolution was straight-forward: some quick TCP tuning and adding another backend to the pool to quickly stabilize things before proceeding to long-term fixes.
I, though, just sat there staring at the screen wondering what the hell had just happened, wondering what the hell I had gotten myself into. I thought I was a seasoned pro, but I had never ever seen an incident response go that smoothly or quickly. Panic started to set in. I was out of my league.
In the day that followed, I was able to review the incident by reading the chat log, referred to as the scrollback. My confidence slowly began to rebuild. I stepped through the incident response line by line, server by server, action by action. After we completed the postmortem, I felt that with more practice and experience I could do this. I also realized, to the initiated, chat-centric incident response is far and away the best, most efficient method of handling outages.
The speed of text communication is much faster. The average adult can read about twice as fast as they can listen. This effect is amplified with chat comms being multiplexed, meaning multiple speakers can talk intelligibly at the same time. With practice, a participant can even quickly understand multiple conversations interleaved in the same channel. The power of this cannot be overstated.
In a room or on a conference call, there can only be one speaker at a time. During an outage when tensions are high this kind of order can be difficult to maintain. People naturally want to blurt out what they are seeing. There are methods of dealing with this, such as leader-designating speakers or “conch shell” type protocols. In practice though, what often prevails is what one of my vendors calls the “Mountain View Protocol,” where the loudest speaker is the one who’s heard.
In text, responders are able to hop out of a conversation, focus on some investigation or action, hop back in, and quickly catch up due to the presence of scroll back. In verbal comms, responders check-out to do some work and lose track of the conversation resulting in a lot of repeating.
Responders never all show up simultaneously. Often they have to be pulled in mid-incident. The power of the chat log really comes through here as latecomers get an automatic up-to-the-second sitrep. “Reading scrollback” is our standard entrance letting everyone know someone new has engaged and needs a minute to catch up. Even in cases when a quick briefing to a newcomer is necessary, one person can break off into a separate channel or in private message without having to disengage from the main conversation.
Other kinds of text sidebars are of course useful in incident responses. For example, emotions run high during outages and occasionally you have to ask someone to cool their jets. This is done quickly and effectively in private chat message without embarrassing them in front of the rest of the team.
At Linden Lab, we use a designated Incident Commander to orchestrate incident responses. Chat systems give an easy way to flag whoever is running the show by chat handle and/or in the channel topic. Anyone jumping in knows immediately who is in charge without having to distract the response team by asking.
Running an incident response in a chat channel is also an incredibly effective way of passively disseminating information to a wide audience. A large number of people can quietly lurk in a chat channel unlike in a physical space. More formal status updates to various parties, like support, are of course sometimes necessary but enabling those parties to follow along in real time gives them context that would not otherwise be conveyed in a terse status report.
As a final bonus, we are able to respond to a problem at peak efficiency regardless of where anyone is at that moment. Issues don’t wait for office hours to crop up. Being a distributed team, this is really our only option, but it rocks that being distributed is an advantage in incident response.
The benefits of chat-based incident responses don’t end with the incident though. Having a detailed log of events is invaluable in conducting postmortems. People have a terrible memories, especially during high stress events. The log gives a history of events with precise times that could never be achieved by relying on responders' recollections.
Likewise, the chat log for an incident is potent teaching tool. New hires can use them to learn about the particulars and eccentricities of systems in a way that is rarely captured in documentation or direct instruction. But in general, the log gives a remarkably clear picture of what went right and what went wrong in an incident response, letting the team better iterate and improve on their process over time.
Chat-based incident response isn’t easy. It requires group discipline and commitment as it runs counter to our instincts about communication. It can be nerve-racking for newcomers to the practice. Not everyone can hack it. Extremely smart people can and do wash out from not being able to keep up. But when it works it is a wondrous thing to behold, a ballet in a war zone, beautiful, terrifying, and glorious.
At one time or another, many of us have suffered a Viewer crash that interrupted our time in Second Life. At Linden Lab, we collect crash data to help us in our ongoing efforts to identify and fix issues in the Viewer that can cause crashes. Thanks to that information, nearly every release we make includes fixes of this sort.
The data we collect also reveals that many users can greatly reduce their risk of Viewer crashes by taking a few steps to update their software outside of Second Life.
The nature of Second Life as a platform for user creativity means that the Viewer faces different challenges than client software for an online game, for example, which would just need to handle the limited and carefully optimized content created by the game’s developer. This can make Second Life a demanding application for your computer and can mean that if your operating system is out of date, your Viewer is more likely to crash.
The good news is you can take steps today to help this! Here are a couple of tips:
Upgrade your Operating System
There is a very clear pattern in our statistics - the more up to date your operating system is, the less likely your Viewer is to crash. This applies on both Windows and Macintosh (Linux is a little harder to judge, since "up to date" has a more fluid meaning there, and the sample sizes are small). Some examples:
Windows 8.1 reports crashes only half as often as Windows 8.0
Those of you who stuck with Windows 7 (roughly 40% of users of our Viewer right now) rather than upgrade to 8.0 made a good choice at the time; version 7 still has a much better crash rate than 8.0, but not quite as good as 8.1 (now about 15% of users), so waiting is no longer the best approach.
Mac OSX 10.9.3 reports crashes a third less than 10.7.5
OSX rates do not have as much variation as Windows versions do, but newer is still better, and there are other non-crash reasons to be on the up to date version, including rendering improvements.
Upgrading will probably also better protect you from security problems, so it's a good idea even aside from allowing you to spend more time in Second Life.
Use the 64 bit version of Windows if you can
For each version of Windows for the last several years, you have had a choice between 32 bit and 64 bit variants; if your system can run the 64 bit variant, then you will probably crash much less frequently by changing to it. While we don't have a fully 64 bit version of the Viewer yet, you can run it on 64 bit Windows, and statistically you'll be much better off if you do.
Generally speaking the 64 bit Windows versions report crashes half as often as the 32 bit versions.
According to the data we collect, a little more than 20% of users are running 32 bit Windows versions; most of you can probably upgrade and would benefit by it.
If you bought your computer any time in the last 5 years, chances are very good that it can run the 64 bit version of Windows (as will some systems that are even older). Microsoft has a FAQ page on this topic; go there and read the answer to the question "How do I tell if my computer can run a 64-bit version of Windows?". That page also explains how to do the upgrade and other useful information.
We'll of course continue working hard to find and fix things that lead to Viewer crashes. Even as we do that, though, you can decrease your chances of crashing today by taking the steps above.
We're ready to start a limited beta test of an exciting new tool for creators: Experience Keys. These are new LSL functions and calls that make it possible to bypass the multiple permissions dialogs that you encounter with scripted objects today. Experience Keys will make it possible for users to create more immersive experiences inworld, because those interacting with the experience will be able to grant all the permissions necessary to participate just once, instead of having the experience interrupted by multiple permissions requests. To learn more, check out this brief video.
We used this technology when creating the Linden Realms game, and we're now ready to start putting this tool in the talented hands of creators in the Second Life community. Experience Keys is a powerful tool, and we need to be sure we test and roll out the feature carefully, so the first step will be a limited beta, then the viewer and server releases shortly after.
If you’d like to participate, send an email to firstname.lastname@example.org with “Experience Key Beta” as the subject along with:
Your experience name.
What genre does it fit in?
Give us a brief description of your experience.
How would your customers benefit from Experience Keys?
When I came to Linden Lab over five years ago, Second Life had gone through a period of the coveted hockey-stick growth, and we had just not kept up with the technical demands such growth creates. One or more major outages a week were common.
In my first few months at the Lab, we removed more than a hundred major single points of failure in our service, but several major ones still loomed large, the granddaddy of them all being the core MySQL database server. By late Winter 2009 we were suffering from a core database outage a few times each week.
With a lot of hard work and countless long nights we stabilized the service and started making major improvements to the overall stability and performance of Second Life. However, despite our continued improvements, and the relative tranquility they have created, the spectres of technical debt and single points of failure still loom over our operations. In recent weeks some of them have struck and disrupted Second Life. So much so that I want to explain the outages that have occurred, how we addressed them, and what we are doing going forward.
First, that core MySQL database cluster still exists. It is still the core of many of our central functions. When the write server fails it takes a minimum of thirty minutes to promote a new server into position. The promotion itself is actually relatively quick, but its numerous dependent services must all be taken down and brought back up carefully to ensure that they are all functioning properly.
In the last two months the core MySQL write database has hit two different fatal hardware faults, driving us to temporarily halt most Second Life operations. In some sense, two major write database failures close together is bad luck, but we cannot depend on luck to ensure the reliability of Second Life. In the very near future, we are moving the core MySQL write server to a new hardware class, on which production read servers are already running. Moving the write server will further improve overall database performance and make failures less frequent. It does not of course solve the root single point of failure problem so in the coming days, weeks, and months we will be reducing the impact of database failures even more. This includes continued improvement to the rotation process, extracting more functions out of the core database cluster, and further reducing the number of features that depend on the single write server.
The core MySQL database, however, has not been our only problem recently. A few weeks ago there was a massive distributed denial of service attack on one of our upstream service providers that affected most of their customers, including us, and inhibited the ability of some to use our services. We have since mitigated future potential impact from such an attack by adding an additional provider. There have also been hardware failures in the Marketplace search infrastructure that have impacted that site, a problem that we are continuing to work through. Most seriously though was this week’s four and a half hour long login outage.
On Tuesday morning, users stopped being able to get into Second Life. The root cause was created over ten years ago in a system designed to assign a unique identifier to the hand-off of sessions from login to users’ initial regions. At 7:40AM Pacific Time, that system quietly ran out of possible numbers to assign. It took us four hours to isolate the problem, test a fix, and deploy the change. Users could immediately log in at that point, but it took an additional two hours for systems to settle out. When tens of thousands of users rush back into Second Life following an outage, we have to deliberately throttle some services to prevent further breakage.
Having such a hidden fault in a core service is unacceptable, so we are doing a thorough review of the login process to determine if there are any more problems like this lurking. Our intent at this point also is to remove the identifier assignment service altogether. It not only was the ultimate source of this outage, but is also one more single point of failure that should have been dispatched long ago.
We want to apologize for all of the recent problems and the frustration they have caused. We too are frustrated and are intent on making our service better. Few things give me more pleasure than every day helping to make Second Life a happy and fun place. Thank you for your patience and support. We simply could not have a more devoted user base and for that we owe you better.
UPDATE: 04.24.2015 - For an update on how the Tool Viewer Release affects these systems, visit this blog.
We have made some changes to the Second Life System Requirements to bring them more up to date, and are making some related changes to the Viewer:
We have removed Windows XP and Mac OSX 10.6 from the list of supported operating systems. Microsoft has announced the end of support for XP, and it has been some time since Apple has released updates for 10.6. For some time now, the Viewer has been significantly less stable on these older systems, and the lack of security updates to them make them more hazardous to use.
We have no plans to actually block those systems, but problems reported on them that cannot be reproduced on supported systems will not likely be fixed.
The Windows installer has been modified to verify that the system has been updated with the most recent Service Packs from Microsoft. While we will not block installation on Windows 8 at this time, we strongly recommend upgrading to 8.1 for greater stability. Our data shows that the Viewer is significantly less stable on systems that have not been kept up to date, so the installer will now block installation until the updates have been applied. This change will be effective in a Viewer version to be released in the next few weeks, so it would be a good idea to get your system up to date before then. You can find information on how to install the latest updates at the Microsoft Windows Update page.
Last week, we made a new page available as a replacement for the old Transaction History page. Due to your feedback, we rolled back the changes to this page to allow us to gather more feedback, and we are now providing this new page for review, without removing the old Transaction History page.
We have not yet made any changes to the new page, because we would like time to collect your feedback and review it. We have created a wiki page giving background on why changes were made to this page, where the new page is, and how to provide feedback. We will be closing feedback on April 30, 2014, so please take a look before then.
Many of you may have read about the Heartbleed SSL vulnerability that is still affecting many Internet sites.
You do not need to take extra action to secure your Second Life password if you have not used the same password on other websites. Your Second Life password was not visible via Heartbleed server memory exposure. No secondlife.com site that accepts passwords had the vulnerable SSL heartbeat feature enabled.
If you used the same password for Second Life that you used on a third-party site, and if that third-party site may have been affected by the vulnerability, you should change your password.
Supporting sites such as Second Life profiles are hosted on cloud hosting services. Some of these sites were previously vulnerable to Heartbleed, which may have exposed one of these servers' certificates. As an extra precaution, we are in the process of replacing our SSL certificates across the board. This change will be fully automatic in standard web browsers.
Thank you for your interest in keeping Second Life safe!
We’re happy to report that the Photo Upload feature of SL Share is once again working (as we previously blogged, that portion of the service had been temporarily disabled by Facebook). Thank you again for your patience as we worked to resolve this issue.
The restored feature no longer automatically includes SLURLs when you share a picture to Facebook, but it’s still possible to let your Facebook friends know where you are inworld by using the Check-In function in SL Share.
As you may have seen, we’re expanding the functionality of SL Share to include not only Facebook, but also Flickr, and Twitter. You can read about that work here and try it out with the project Viewer now available here.
UPDATE - April 3, 2014: this issue is resolved and the Photo Upload feature is once again working. Thank you for your patience!
Facebook recently contacted us to let us know that the Photo Upload feature of SL Share is not permitted to automatically include location SLURLs in posts made from the application. We’re working with them to get a hotfix out ASAP, but in the meantime the Photo Upload feature in SL Share will not work, as Facebook has temporarily disabled that part of the application. SL Share’s Status Update and Check-In features will continue to work.
When SL Share’s full functionality is restored, SLURLs will no longer be included when you share a picture using Photo Upload, but you will still be able to let your Facebook friends know where to join you in Second Life by using the Check-In feature.
We apologize for the inconvenience this may cause you and are working to get a fix out ASAP. We’ll use this blog to keep everyone posted with any updates and will of course let you know once the issue is resolved as well. Thank you for your patience.
The Oculus Rift offers exciting possibilities for Second Life - the stereoscopic virtual reality headset brings a new level of immersion to our 3D world, making Second Life a more compelling experience than ever before.
Though a consumer version of the headset isn’t available yet, we’ve been working with the development kit to integrate the Oculus Rift with the Second Life Viewer. We now have a Viewer ready for beta testers, and if you have an Oculus Rift headset, we’d love to get your feedback.
If you have the Oculus Rift development hardware and would like to help us with feedback on the Viewer integration, please write to email@example.com to apply for the limited beta.
As we blogged about last week, we’re making some changes to our JIRA implementation to make our bug reporting system a more transparent and productive experience. We just wanted to take a moment to let everyone know that these changes are now live!
One of the questions we’ve seen in the past week is how previously submitted issues would be treated - namely, will those also be viewable by everyone and open for comment prior to being triaged?
While we want to make issues visible for the reasons described in our last post, we’re not going to extend this to old issues, because at the time they were created, users knew that those reports would have limited visibility and they may have included sensitive and/or private information. We don’t want to take information that someone thought would be private and suddenly make that visible to everyone, so the new visibility settings will apply only to new issues.
Today, we’re happy to announce some changes to our JIRA implementation - the system we use to collect, track, and take action on bugs reported by users. You’ll see these changes take effect next week.
Recently, this system was working in a way that wasn’t very transparent, and it frankly wasn’t a good experience for the users who care enough about Second Life to try to help improve it, nor was it the best set-up for the Lindens tasked with addressing these issues. So you can see why we’re happy to be changing it!
Moving forward, we’re going to make our JIRA implementation a more transparent experience. All users will be able to see all BUG issues, all the time. You’ll be able to search to see if there are duplicates before submitting an issue, and if there’s a bug that’s particularly important to you, you can contribute your info to it and see when it’s been Accepted and imported to the Linden team.
You’ll also be able to comment. Before an issue is triaged, everyone can comment to help isolate and describe the issue more clearly. Do remember, there are some basic guidelines for participation that need to be followed. Once an issue is Accepted and imported by Linden Lab’s QA team, the original reporter will still be able to comment, as will Lindens and a small team of community triagers - a group that includes some third party Viewer developers and others selected by Linden Lab for having demonstrated skills in this area. This group has been invaluable in helping to keep the bug database orderly and cross-referenced as well as troubleshooting bugs before they’re triaged, and we’re glad to have their continuing help with this process.
Lastly, “New Feature Request” is back! If you’ve got a great idea for a feature, you don’t need to slip it through the system disguised as a bug report - just select the “New Feature Request” category when you submit. Commenting for this category will work just like for bug reports, and submitting improvements through this category will make things much easier for the Linden team reviewing these. Please remember that JIRA is an engineering tool - it’s not meant for policy discussions and the like nor is it a replacement for the Forums, where you can have all kinds of stimulating discussions.
If you’re one of the many who have taken the time to submit a bug report through the JIRA system - thank you! We really appreciate your work in tracking down the issues, and it’s a significant help to us as we continue to improve Second Life.
We think these changes will make for a better, more transparent and more productive experience for all of us, but if you have additional ideas on ways to improve our implementation, you can share them with us in this Forum thread.
- Someone had a case of the Mondays!
12.05.16 - Recent Updates to Web & Marketplac
11.03.2016 - Recent Updates to Web & Marketplac
ay Retelling of Yesterday’ s Downtime
- The Story Behind Last Week's Unexpected Downtime
- Why the Friday Grid Roll?
- Recent Issues with the Nightly Biller
- Tale of the Missing ACK
- Why Things Were Less Than Optimal This Past Weeken...
An Update on Linux Viewer Developmen