Hi! I’m a member of the Second Life operations team, and I was the primary on-call systems engineer this past weekend. We had a very difficult weekend, so I wanted to take a few minutes to share what happened.
We had a series of independent failures happen that produced the rough waters Residents experienced inworld.
Shortly after midnight Pacific time on January 9th (Saturday) we had the master node of one of the central databases crash. The central database that happened to go down was one the most used databases in Second Life. Without it Residents are unable to log in, or do, well, a lot of important things.
This sort of failure is something my team is good at handling, but it takes time for us to promote a replica up the chain to ultimately become the new master node. While we’re doing this we block logins and close other inworld services to help take the pressure off the newly promoted master node when it starts taking queries. (We reopen the grid slowly, turning on services one at a time, as the database is able to handle it.) The promotion process took about an hour and a half, and the grid returned to normal by 1:30am.
After this promotion took place the grid was stable the rest of the day on Saturday, and that evening.
That brings us to Sunday morning.
Around 8:00am Pacific on January 10th (Sunday), one of our providers start experiencing issues, which resulted in very poor performance in loading assets inworld. I very quickly got on the phone with them as they tracked down the source of the issue. With my team and the remote team working together we were able to spot the problem, and get it resolved by early afternoon. All of our metrics looked good, and I and my colleagues were able to rez assets inworld just fine. It was at this point that we posted the first “All Clear” on the blog, because it appeared that things were back to normal.
It didn’t take us long to realize that things were about to get interesting again, however.
Shortly after we declared all clear, Residents rushed to return to the grid. (Sunday afternoon is a very busy time inworld, even under normal circumstances!) The rush of Residents returning to Second Life (a lot of whom now had empty caches that needed to be re-filled) at a time when our concurrency is the highest put many other subsystems under several times their normal load.
Rezzing assets was now fine, but we had other issues to figure out. It took us a few more hours after the first all clear for us to be able to stabilize our other services. As some folks noticed, the system that was under the highest load was the one that does what we call “baking” - it’s what makes the texture you see on your avatar - thus we had a large number of Residents that either appeared gray, or as clouds. (It was still trying to get caught up from the asset loading outage earlier!) By Sunday evening we were able to re-stabilize the grid, and Second Life returned to normal for real.
One of the things I like about my job is that Second Life is a totally unique and fun environment! (The infrastructure of a virtual world is amazing to me!) This is both good and bad. It’s good because we’re often challenged to come up with a solution to a problem that’s new and unique, but the flip side of this is that sometimes things can break in unexpected ways because we’re doing things that no one else does.
I’m really sorry for how rough things were inworld this weekend. My team takes the stability of the grid extremely seriously, and no one dislikes downtime more than us. Either one of these failures happening independently is bad enough, but having them occur in a series like that is fairly miserable.
See you inworld (after I get some sleep!),
Since its introduction, the Linux version of the Second Life Viewer has been considered a Beta status project, meaning that it might have problems that would not have been considered acceptable on the much more widely used Windows or Mac versions. Because "Linux" isn't really one platform - it's a large (and fluid) number of similar but distinct distributions - doing development, builds, and testing for the Linux version has always been a difficult thing to do and a difficult expense to justify. Today, Linux represents under half of one percent of official Viewer users, and just a little over one percent of users on all viewers. We at Linden Lab need to focus our development efforts on the platforms that will improve the experience of more users.
While we hope to be able to continue to distribute a Linux version, from now on we will rely on the open source community for Linux platform support. Linden Lab will integrate open source community contributions to update the Linux platform support, and will build and distribute the resulting viewers, but our development engineering, including bug fixing, will be focused on the platforms more popular among our users. We hope that the community will take up this challenge; anyone interested in ensuring that their fellow Linux users can continue on their preferred platform is encouraged to reach out to us to find out where help is most needed.
Available now is the ability for LSL to return an avatar’s shape type to scripted objects. With this information, scripters and creators of objects can determine the best animation, pose, or position to play when avatars interact with their objects.
Scripts can now read the avatar’s shape type (male or female) and hover height values. For a complete list of object and avatar agent size details please visit the LLGetObjectDetails() and LLGetAgentSize() wiki pages..
Facebook recently announced plans to deprecate an old Open Graph API and required all apps running 1.0 to force update to 2.0. We have completed this update for SLShare, but Facebook anticipates that the process on their end for migrating a given app may take up to a couple of weeks. During this migration period, there may be some service interruptions for some apps.
This means that when using SLShare (updating status, photo uploads, and check-ins from the Viewer) you may experience some temporary problems. Please be assured that we are aware of this and any issues you encounter should be resolved once the migration period is complete.
Thank you for your patience!
Yesterday, with much rejoicing, we promoted Viewer 188.8.131.520918 (Tools Update) to release. While this Viewer doesn’t have a shiny new featureset on the surface (other than reverting to a single-button login), it’s what’s inside that really matters - we’ve updated the numerous tools used to build the Viewer. The immediate expected effect of this is performance stability and a decreased crash rate.
We go to great lengths to maintain backwards compatibility in Second Life, both to never break users’ creations and to support the wide range of systems our Residents use to log in. However, sometimes we have to make the hard decisions: a year ago we announced that we were dropping support for Windows XP and Mac OSX 10.5 & 10.6 (a complete list of current system requirements is available here). Today, with the Tools Update release, the Viewer will no longer run on those systems. You will still be able to log in with an older Viewer until it is aged out based on our deprecation policy, however we strongly recommend updating your system.
It's unfortunate that we have to stop supporting some older systems, but upgrading the tools we use to build the viewer will help us to bring you other improvements to your Second Life experience more quickly and reliably.
Keeping the systems running the Second Life infrastructure operating smoothly is no mean feat. Our monitoring infrastructure keeps an eye on our machines every second, and a team of people work around the clock to ensure that Second Life runs smoothly. We do our best to replace failing systems proactively and invisibly to Residents. Unfortunately, sometimes unexpected problems arise.
In late July, a hardware failure took down four of our latest-generation of simulator hosts. Initially, this was attributed to be a random failure, and the machine was sent off to our vendor for repair. In early October, a second failure took down another four machines. Two weeks later, another failure on another four hosts.
Each host lives inside a chassis along with three other hosts. These four hosts all share a common backplane that provides the hosts with power, networking and storage. The failures were traced to an overheating and subsequent failure of a component on these backplanes.
After exhaustive investigation with our vendor, the root cause of the failures turned out to be a hardware defect in a backplane component. We arranged an on-site visit by our vendor to locate, identify, and replace the affected backplanes. Members of our operations team have been working this week with our vendor in our datacentre to inspect every potentially affected system and replace the defective component to prevent any more failures.
The region restarts that some of you have experienced this week were an unfortunate side-effect of this critical maintenance work. We have done our best to keep these restarts to a minimum as we understand just how disruptive a region restart can be. The affected machines have been repaired, and returned to service and we are confident that no more failures of this type will occur in the future. Thank you all for your patience and understanding as we have proceeded through the extended maintenance window this week.
Last week we deployed the change to serve all texture and mesh data primarily through the CDN, as we've been doing with avatar textures since March. In addition to reviewing feedback from Residents we've been monitoring and measuring the effects of the change, and thought it would be interesting to share some of what we've learned.
First the good news:
- Load on some key systems on the simulator hosts has been reduced considerably. The chart below shows the frequency of high-load conditions in the simulator web services, and you can see the sharp drop as the CDN takes on much of that job. This translates into other things, including region crossings and teleports, being faster and more reliable.
- For most users most of the time there has been a big performance improvement in texture and mesh data loading, resulting in faster rez times in new areas. The improvement has been realized both on the official viewer and on third party viewers.
However, we have also seen that some users have had the opposite experience, and have worked with a number of those users to collect detailed data on the nature of the problems and shared it with our CDN provider. We believe that the problems are the result of a combination of the considerable additional load we added to the CDN, and a coincidental additional large load on the CDN from another source. Exacerbating matters, flaws in both our viewer code and the CDN caused recovery from these load spikes to be much slower than it should have been. We are working with our CDN provider to increase capacity and to configure the CDN so that Second Life data availability will not be as affected by outside load. We are also making changes to our code and in the CDN to make recovery quicker and more robust.
We are confident that using the CDN for this data will make the Second Life experience better. Making any change to a system at the scale of Second Life has some element of unavoidable risk; no matter how carefully we simulate and test in advance, once you deploy at scale in live systems there's always something to be learned. This change has had some problems for a small percentage of users; unfortunately, for those users the problems were quite serious for at least part of the time. We appreciate all the help we've gotten from users in quickly diagnosing those problems. We think that the changes we've begun making will reduce the frequency of failures to below what they were before we adopted the CDN, while keeping the considerable performance gains.
11.07.2014: An update on performance improvements and adjustments is available here.
Second Life was originally designed for nearly all data and Viewer interactions to go through the Simulator server. That is, the Viewer would talk almost exclusively to the specific server hosting the region the Resident was in. This architecture had the advantage of giving a single point of control for any session. It also had the disadvantage of making it difficult to address region resource problems or otherwise scale out busy areas.
Over the years we’ve implemented techniques to get around these problems, but one pain point proved difficult to fix: asset delivery, specifically textures and meshes. Recently we implemented the ability to move texture and mesh traffic off the simulator server onto a Content Delivery Network (CDN), dramatically improving download times for Residents while significantly reducing the load on busy servers.
Download times for textures and meshes have been reduced by more than 50% on average, but outside of North America those the improvements are even more dramatic. That is great news, but the most amazing improvement has been on the simulator servers themselves. The following chart graphs servers on a production release-candidate channel with high HTTP load conditions before and after we rolled the CDN code onto them:
The high load conditions almost completely disappeared! We knew that we would get a major drop in load with the move, but this blew us away. At first we didn’t believe it and spent two days trying to figure out what we did wrong. There was nothing wrong; this was real.
The results of all of this are faster scene loads, quicker object rezzing, far fewer problems with fuzzy or cloudy avatars, fewer teleport failures, and more! The feedback from Residents has been fantastic. We’re loving it, too! Everything is just so much snappier.
We have finished rolling the CDN code out to the grid, and the results have remained extraordinary. This week, we are also fully releasing our HTTP Project Viewer, which will make the CDN change even better by taking advantage of the elimination of server-side rate limiting. We have been extremely happy with the results so far (psst, we’re talking an 80% reduction in content download times).The CDN benefits are available to everyone regardless of which Viewer you choose to use. All users of the official Viewer will also be able to enjoy the results of the HTTP improvements, and third party developers are able to adopt these changes in their Viewers as well.
We are very happy to be finally releasing these improvements to everyone. Give it a try and let us know what you think!
HTTP Project Recap
Earlier this year we blogged about the HTTP project and how, step-by-step, the project is overcoming various limitations. Viewer release 3.4.3 introduced a new HTTP library that made better use of network resources. Texture fetches were the first operations to take advantage of this library, which improved throughput while using fewer connections. But the viewer was still constrained by a one-request-per-connection model.
Changing that model required back-end modifications. Those shipped early in 2013 in the DRTSIM-203 simulator release. For the first time, texture fetches could re-use existing HTTP connections. And for most users, this doubled the theoretical maximum texture request rate.
But all the world is not textures, and mesh fetches were next to receive attention. Meshes required quite a bit more work. Both back-end and viewer engineering was needed, culminating in viewer release 3.7.2. This release brought mesh fetching into behavioral parity with textures.
These releases have brought the viewer up to the request rate limits of region 'C'. Here, the limits are dictated by serialization, distance, and the speed of light. We are preparing to move beyond this region with changes to concurrency and locality. HTTP request concurrency will be vastly increased by the introduction of HTTP pipelining. Locality will be changed by the use of a Content Delivery Network (CDN) to move texture and mesh data nearer to most users.
Common HTTP communication is a simple back-and-forth exchange of requests and responses. A request is issued, a response is returned, and only then is another request issued. As distance increases between endpoints, the time to perform this ping-pong increases, which lowers the effective request rate. Pipelining attacks this distance-induced loss by issuing multiple requests at once without waiting for responses.
The pipelining viewer will use this more aggressive request model for both texture and mesh fetches. This viewer is currently in QA and is expected to go to RC soon after the 3.7.16 release. It has also had trials outside of North America and the results have been very good. The effective request rate has approached the limits imposed by our servers and has exceeded the download rate of UDP texture fetching.
As for those server limits, the operations team at Linden has been making rapid progress on the CDN project (DRTSIM-258). This project will replicate the Linden services that supply meshes and textures to a CDN's PoPs (Points-of-Presence). With PoPs on multiple continents, request service time will be reduced for most users.
Combined with pipelining, CDN experiments are producing results that have only been dreamed about. How fast? Well, the engineer's universal response of "It depends" applies. But several 100's of fetches per second have been seen far from the USA. This takes us into never before encountered performance territory.
And one more thing
The HTTP Project has focused on textures and meshes. But the inventory system, which maintains item ownership, is often described as... sluggish. So as an exercise in expanding the use of the new HTTP library, the pipelining viewer was modified to use it for inventory fetches. As with textures and meshes before, inventory is now fetching in the 'C' region of its specific performance graph. The difference can be surprising.
For several years, HTTP has figured prominently in Linden's plans for Second Life. "HTTP will give you speed and throughput, consistency and robustness." A promised land, but never quite realized. These next steps are payment on that promise. HTTP done well can support an amazing experience. And you'll have no reason to look back to the V1 world of UDP textures and inventory.
Nous sommes embarqués,
As of today, several projects have reached the Project Viewer stage, and we wanted to share a bit about how you can expect to see your Second Life experience improve with these initiatives.
Graphics Settings Benchmarking:
This is a new way of figuring out the best default graphics settings. Maybe this has happened to you: you got an awesome new graphics card, fired up SL… only to discover your graphics settings are set to Low, and can’t be changed? No more! This Viewer does away with the old GPU table and instead uses a quick benchmark measurement to detect your GPU to assign appropriate default graphics settings on startup. The settings on shiny powerful hardware should really let that hardware shine. Get a Project Benchmark Viewer today and help us gather metrics! Please file bugs in JIRA if you find them.
Installation and Login screen changes:
A new look for the login screen is coming in stages. We’re tweaking the login screen and A/B testing the results. We’ve simplified the new user login screen to remove distractions, and we’re adding instructions for new downloads and installs. We’re also streamlining the returning user login screen to help you get where you’re going faster - or find a new place to visit. You’re likely to see some incremental changes continue over the coming weeks.
Two complementary projects are going in neck-and-neck:
A new texture and mesh asset service (a CDN) is in testing on a limited set of regions, and so far it’s showing encouraging results. Particularly for those who login from places far away from our US data centers, this has the potential to significantly improve how quickly textures and meshes load. In the coming weeks, we will expand the number of regions using this new service as the next step in our testing.
We’re also taking the next step with the HTTP project - Pipelining. We will soon put out a viewer that will pipeline HTTP requests for texture and mesh fetches, improve inventory folder and item fetches, and have some general adjustments for using a CDN-enabled grid.
Separately, each of these will improve texture and mesh loading performance, but put together, you should really see some exciting improvements in how long it takes to load new areas and objects - making touring the many fabulous places in Second Life you have not yet visited even better!
From time to time, incidents occur that our operations team needs to quickly fix in order to keep all of Second Life working well 24x7 for users around the world.
How does the Linden Lab ops team collaborate to quickly tackle these incidents? Our VP of Operations and Platform Engineering, Landon McDowell (Landon Linden), has written a great description of an early experience he had with our approach as well as some thoughts on why it works so well. This is a bit outside the usual “Tools & Tech” topics for this blog, but we thought Second Life users familiar with how operations teams work would appreciate the inside look at our team’s approach:
Two weeks into my tenure in the Operations group at Linden Lab I was confronted with my first major incident there. It was early afternoon, and I was well into a post-ramen food coma when alarms started popping off in IRC. All of our major charts were taking a header - logins, concurrency, etc.
The call went out in #ops for hands, but I had already jumped in. This wasn’t my first rodeo. I was primed to hop onto a conference call or pile into a room to marshall a response. But that never happened.
Instead, responders starting piping up in IRC with, “Hands.” Soon I was completely overwhelmed by a stream of text flying across my screen as engineers were reporting back and discussing findings.
The problem was quickly narrowed down to a particular load balancer. I was barely into the box before an engineer chimed in, “It's running out of ports.” From there the resolution was straight-forward: some quick TCP tuning and adding another backend to the pool to quickly stabilize things before proceeding to long-term fixes.
I, though, just sat there staring at the screen wondering what the hell had just happened, wondering what the hell I had gotten myself into. I thought I was a seasoned pro, but I had never ever seen an incident response go that smoothly or quickly. Panic started to set in. I was out of my league.
In the day that followed, I was able to review the incident by reading the chat log, referred to as the scrollback. My confidence slowly began to rebuild. I stepped through the incident response line by line, server by server, action by action. After we completed the postmortem, I felt that with more practice and experience I could do this. I also realized, to the initiated, chat-centric incident response is far and away the best, most efficient method of handling outages.
The speed of text communication is much faster. The average adult can read about twice as fast as they can listen. This effect is amplified with chat comms being multiplexed, meaning multiple speakers can talk intelligibly at the same time. With practice, a participant can even quickly understand multiple conversations interleaved in the same channel. The power of this cannot be overstated.
In a room or on a conference call, there can only be one speaker at a time. During an outage when tensions are high this kind of order can be difficult to maintain. People naturally want to blurt out what they are seeing. There are methods of dealing with this, such as leader-designating speakers or “conch shell” type protocols. In practice though, what often prevails is what one of my vendors calls the “Mountain View Protocol,” where the loudest speaker is the one who’s heard.
In text, responders are able to hop out of a conversation, focus on some investigation or action, hop back in, and quickly catch up due to the presence of scroll back. In verbal comms, responders check-out to do some work and lose track of the conversation resulting in a lot of repeating.
Responders never all show up simultaneously. Often they have to be pulled in mid-incident. The power of the chat log really comes through here as latecomers get an automatic up-to-the-second sitrep. “Reading scrollback” is our standard entrance letting everyone know someone new has engaged and needs a minute to catch up. Even in cases when a quick briefing to a newcomer is necessary, one person can break off into a separate channel or in private message without having to disengage from the main conversation.
Other kinds of text sidebars are of course useful in incident responses. For example, emotions run high during outages and occasionally you have to ask someone to cool their jets. This is done quickly and effectively in private chat message without embarrassing them in front of the rest of the team.
At Linden Lab, we use a designated Incident Commander to orchestrate incident responses. Chat systems give an easy way to flag whoever is running the show by chat handle and/or in the channel topic. Anyone jumping in knows immediately who is in charge without having to distract the response team by asking.
Running an incident response in a chat channel is also an incredibly effective way of passively disseminating information to a wide audience. A large number of people can quietly lurk in a chat channel unlike in a physical space. More formal status updates to various parties, like support, are of course sometimes necessary but enabling those parties to follow along in real time gives them context that would not otherwise be conveyed in a terse status report.
As a final bonus, we are able to respond to a problem at peak efficiency regardless of where anyone is at that moment. Issues don’t wait for office hours to crop up. Being a distributed team, this is really our only option, but it rocks that being distributed is an advantage in incident response.
The benefits of chat-based incident responses don’t end with the incident though. Having a detailed log of events is invaluable in conducting postmortems. People have a terrible memories, especially during high stress events. The log gives a history of events with precise times that could never be achieved by relying on responders' recollections.
Likewise, the chat log for an incident is potent teaching tool. New hires can use them to learn about the particulars and eccentricities of systems in a way that is rarely captured in documentation or direct instruction. But in general, the log gives a remarkably clear picture of what went right and what went wrong in an incident response, letting the team better iterate and improve on their process over time.
Chat-based incident response isn’t easy. It requires group discipline and commitment as it runs counter to our instincts about communication. It can be nerve-racking for newcomers to the practice. Not everyone can hack it. Extremely smart people can and do wash out from not being able to keep up. But when it works it is a wondrous thing to behold, a ballet in a war zone, beautiful, terrifying, and glorious.
At one time or another, many of us have suffered a Viewer crash that interrupted our time in Second Life. At Linden Lab, we collect crash data to help us in our ongoing efforts to identify and fix issues in the Viewer that can cause crashes. Thanks to that information, nearly every release we make includes fixes of this sort.
The data we collect also reveals that many users can greatly reduce their risk of Viewer crashes by taking a few steps to update their software outside of Second Life.
The nature of Second Life as a platform for user creativity means that the Viewer faces different challenges than client software for an online game, for example, which would just need to handle the limited and carefully optimized content created by the game’s developer. This can make Second Life a demanding application for your computer and can mean that if your operating system is out of date, your Viewer is more likely to crash.
The good news is you can take steps today to help this! Here are a couple of tips:
Upgrade your Operating System
There is a very clear pattern in our statistics - the more up to date your operating system is, the less likely your Viewer is to crash. This applies on both Windows and Macintosh (Linux is a little harder to judge, since "up to date" has a more fluid meaning there, and the sample sizes are small). Some examples:
Windows 8.1 reports crashes only half as often as Windows 8.0
Those of you who stuck with Windows 7 (roughly 40% of users of our Viewer right now) rather than upgrade to 8.0 made a good choice at the time; version 7 still has a much better crash rate than 8.0, but not quite as good as 8.1 (now about 15% of users), so waiting is no longer the best approach.
Mac OSX 10.9.3 reports crashes a third less than 10.7.5
OSX rates do not have as much variation as Windows versions do, but newer is still better, and there are other non-crash reasons to be on the up to date version, including rendering improvements.
Upgrading will probably also better protect you from security problems, so it's a good idea even aside from allowing you to spend more time in Second Life.
Use the 64 bit version of Windows if you can
For each version of Windows for the last several years, you have had a choice between 32 bit and 64 bit variants; if your system can run the 64 bit variant, then you will probably crash much less frequently by changing to it. While we don't have a fully 64 bit version of the Viewer yet, you can run it on 64 bit Windows, and statistically you'll be much better off if you do.
Generally speaking the 64 bit Windows versions report crashes half as often as the 32 bit versions.
According to the data we collect, a little more than 20% of users are running 32 bit Windows versions; most of you can probably upgrade and would benefit by it.
If you bought your computer any time in the last 5 years, chances are very good that it can run the 64 bit version of Windows (as will some systems that are even older). Microsoft has a FAQ page on this topic; go there and read the answer to the question "How do I tell if my computer can run a 64-bit version of Windows?". That page also explains how to do the upgrade and other useful information.
We'll of course continue working hard to find and fix things that lead to Viewer crashes. Even as we do that, though, you can decrease your chances of crashing today by taking the steps above.
We're ready to start a limited beta test of an exciting new tool for creators: Experience Keys. These are new LSL functions and calls that make it possible to bypass the multiple permissions dialogs that you encounter with scripted objects today. Experience Keys will make it possible for users to create more immersive experiences inworld, because those interacting with the experience will be able to grant all the permissions necessary to participate just once, instead of having the experience interrupted by multiple permissions requests. To learn more, check out this brief video.
We used this technology when creating the Linden Realms game, and we're now ready to start putting this tool in the talented hands of creators in the Second Life community. Experience Keys is a powerful tool, and we need to be sure we test and roll out the feature carefully, so the first step will be a limited beta, then the viewer and server releases shortly after.
If you’d like to participate, send an email to email@example.com with “Experience Key Beta” as the subject along with:
Your experience name.
What genre does it fit in?
Give us a brief description of your experience.
How would your customers benefit from Experience Keys?
When I came to Linden Lab over five years ago, Second Life had gone through a period of the coveted hockey-stick growth, and we had just not kept up with the technical demands such growth creates. One or more major outages a week were common.
In my first few months at the Lab, we removed more than a hundred major single points of failure in our service, but several major ones still loomed large, the granddaddy of them all being the core MySQL database server. By late Winter 2009 we were suffering from a core database outage a few times each week.
With a lot of hard work and countless long nights we stabilized the service and started making major improvements to the overall stability and performance of Second Life. However, despite our continued improvements, and the relative tranquility they have created, the spectres of technical debt and single points of failure still loom over our operations. In recent weeks some of them have struck and disrupted Second Life. So much so that I want to explain the outages that have occurred, how we addressed them, and what we are doing going forward.
First, that core MySQL database cluster still exists. It is still the core of many of our central functions. When the write server fails it takes a minimum of thirty minutes to promote a new server into position. The promotion itself is actually relatively quick, but its numerous dependent services must all be taken down and brought back up carefully to ensure that they are all functioning properly.
In the last two months the core MySQL write database has hit two different fatal hardware faults, driving us to temporarily halt most Second Life operations. In some sense, two major write database failures close together is bad luck, but we cannot depend on luck to ensure the reliability of Second Life. In the very near future, we are moving the core MySQL write server to a new hardware class, on which production read servers are already running. Moving the write server will further improve overall database performance and make failures less frequent. It does not of course solve the root single point of failure problem so in the coming days, weeks, and months we will be reducing the impact of database failures even more. This includes continued improvement to the rotation process, extracting more functions out of the core database cluster, and further reducing the number of features that depend on the single write server.
The core MySQL database, however, has not been our only problem recently. A few weeks ago there was a massive distributed denial of service attack on one of our upstream service providers that affected most of their customers, including us, and inhibited the ability of some to use our services. We have since mitigated future potential impact from such an attack by adding an additional provider. There have also been hardware failures in the Marketplace search infrastructure that have impacted that site, a problem that we are continuing to work through. Most seriously though was this week’s four and a half hour long login outage.
On Tuesday morning, users stopped being able to get into Second Life. The root cause was created over ten years ago in a system designed to assign a unique identifier to the hand-off of sessions from login to users’ initial regions. At 7:40AM Pacific Time, that system quietly ran out of possible numbers to assign. It took us four hours to isolate the problem, test a fix, and deploy the change. Users could immediately log in at that point, but it took an additional two hours for systems to settle out. When tens of thousands of users rush back into Second Life following an outage, we have to deliberately throttle some services to prevent further breakage.
Having such a hidden fault in a core service is unacceptable, so we are doing a thorough review of the login process to determine if there are any more problems like this lurking. Our intent at this point also is to remove the identifier assignment service altogether. It not only was the ultimate source of this outage, but is also one more single point of failure that should have been dispatched long ago.
We want to apologize for all of the recent problems and the frustration they have caused. We too are frustrated and are intent on making our service better. Few things give me more pleasure than every day helping to make Second Life a happy and fun place. Thank you for your patience and support. We simply could not have a more devoted user base and for that we owe you better.
UPDATE: 04.24.2015 - For an update on how the Tool Viewer Release affects these systems, visit this blog.
We have made some changes to the Second Life System Requirements to bring them more up to date, and are making some related changes to the Viewer:
We have removed Windows XP and Mac OSX 10.6 from the list of supported operating systems. Microsoft has announced the end of support for XP, and it has been some time since Apple has released updates for 10.6. For some time now, the Viewer has been significantly less stable on these older systems, and the lack of security updates to them make them more hazardous to use.
We have no plans to actually block those systems, but problems reported on them that cannot be reproduced on supported systems will not likely be fixed.
The Windows installer has been modified to verify that the system has been updated with the most recent Service Packs from Microsoft. While we will not block installation on Windows 8 at this time, we strongly recommend upgrading to 8.1 for greater stability. Our data shows that the Viewer is significantly less stable on systems that have not been kept up to date, so the installer will now block installation until the updates have been applied. This change will be effective in a Viewer version to be released in the next few weeks, so it would be a good idea to get your system up to date before then. You can find information on how to install the latest updates at the Microsoft Windows Update page.
Last week, we made a new page available as a replacement for the old Transaction History page. Due to your feedback, we rolled back the changes to this page to allow us to gather more feedback, and we are now providing this new page for review, without removing the old Transaction History page.
We have not yet made any changes to the new page, because we would like time to collect your feedback and review it. We have created a wiki page giving background on why changes were made to this page, where the new page is, and how to provide feedback. We will be closing feedback on April 30, 2014, so please take a look before then.
Many of you may have read about the Heartbleed SSL vulnerability that is still affecting many Internet sites.
You do not need to take extra action to secure your Second Life password if you have not used the same password on other websites. Your Second Life password was not visible via Heartbleed server memory exposure. No secondlife.com site that accepts passwords had the vulnerable SSL heartbeat feature enabled.
If you used the same password for Second Life that you used on a third-party site, and if that third-party site may have been affected by the vulnerability, you should change your password.
Supporting sites such as Second Life profiles are hosted on cloud hosting services. Some of these sites were previously vulnerable to Heartbleed, which may have exposed one of these servers' certificates. As an extra precaution, we are in the process of replacing our SSL certificates across the board. This change will be fully automatic in standard web browsers.
Thank you for your interest in keeping Second Life safe!
We’re happy to report that the Photo Upload feature of SL Share is once again working (as we previously blogged, that portion of the service had been temporarily disabled by Facebook). Thank you again for your patience as we worked to resolve this issue.
The restored feature no longer automatically includes SLURLs when you share a picture to Facebook, but it’s still possible to let your Facebook friends know where you are inworld by using the Check-In function in SL Share.
As you may have seen, we’re expanding the functionality of SL Share to include not only Facebook, but also Flickr, and Twitter. You can read about that work here and try it out with the project Viewer now available here.
UPDATE - April 3, 2014: this issue is resolved and the Photo Upload feature is once again working. Thank you for your patience!
Facebook recently contacted us to let us know that the Photo Upload feature of SL Share is not permitted to automatically include location SLURLs in posts made from the application. We’re working with them to get a hotfix out ASAP, but in the meantime the Photo Upload feature in SL Share will not work, as Facebook has temporarily disabled that part of the application. SL Share’s Status Update and Check-In features will continue to work.
When SL Share’s full functionality is restored, SLURLs will no longer be included when you share a picture using Photo Upload, but you will still be able to let your Facebook friends know where to join you in Second Life by using the Check-In feature.
We apologize for the inconvenience this may cause you and are working to get a fix out ASAP. We’ll use this blog to keep everyone posted with any updates and will of course let you know once the issue is resolved as well. Thank you for your patience.
The Oculus Rift offers exciting possibilities for Second Life - the stereoscopic virtual reality headset brings a new level of immersion to our 3D world, making Second Life a more compelling experience than ever before.
Though a consumer version of the headset isn’t available yet, we’ve been working with the development kit to integrate the Oculus Rift with the Second Life Viewer. We now have a Viewer ready for beta testers, and if you have an Oculus Rift headset, we’d love to get your feedback.
If you have the Oculus Rift development hardware and would like to help us with feedback on the Viewer integration, please write to firstname.lastname@example.org to apply for the limited beta.
As we blogged about last week, we’re making some changes to our JIRA implementation to make our bug reporting system a more transparent and productive experience. We just wanted to take a moment to let everyone know that these changes are now live!
One of the questions we’ve seen in the past week is how previously submitted issues would be treated - namely, will those also be viewable by everyone and open for comment prior to being triaged?
While we want to make issues visible for the reasons described in our last post, we’re not going to extend this to old issues, because at the time they were created, users knew that those reports would have limited visibility and they may have included sensitive and/or private information. We don’t want to take information that someone thought would be private and suddenly make that visible to everyone, so the new visibility settings will apply only to new issues.
Today, we’re happy to announce some changes to our JIRA implementation - the system we use to collect, track, and take action on bugs reported by users. You’ll see these changes take effect next week.
Recently, this system was working in a way that wasn’t very transparent, and it frankly wasn’t a good experience for the users who care enough about Second Life to try to help improve it, nor was it the best set-up for the Lindens tasked with addressing these issues. So you can see why we’re happy to be changing it!
Moving forward, we’re going to make our JIRA implementation a more transparent experience. All users will be able to see all BUG issues, all the time. You’ll be able to search to see if there are duplicates before submitting an issue, and if there’s a bug that’s particularly important to you, you can contribute your info to it and see when it’s been Accepted and imported to the Linden team.
You’ll also be able to comment. Before an issue is triaged, everyone can comment to help isolate and describe the issue more clearly. Do remember, there are some basic guidelines for participation that need to be followed. Once an issue is Accepted and imported by Linden Lab’s QA team, the original reporter will still be able to comment, as will Lindens and a small team of community triagers - a group that includes some third party Viewer developers and others selected by Linden Lab for having demonstrated skills in this area. This group has been invaluable in helping to keep the bug database orderly and cross-referenced as well as troubleshooting bugs before they’re triaged, and we’re glad to have their continuing help with this process.
Lastly, “New Feature Request” is back! If you’ve got a great idea for a feature, you don’t need to slip it through the system disguised as a bug report - just select the “New Feature Request” category when you submit. Commenting for this category will work just like for bug reports, and submitting improvements through this category will make things much easier for the Linden team reviewing these. Please remember that JIRA is an engineering tool - it’s not meant for policy discussions and the like nor is it a replacement for the Forums, where you can have all kinds of stimulating discussions.
If you’re one of the many who have taken the time to submit a bug report through the JIRA system - thank you! We really appreciate your work in tracking down the issues, and it’s a significant help to us as we continue to improve Second Life.
We think these changes will make for a better, more transparent and more productive experience for all of us, but if you have additional ideas on ways to improve our implementation, you can share them with us in this Forum thread.
As we continue to work on improving the Second Life experience, one challenge we’ve been tackling is the length of the Viewer installation process. No one likes waiting, and now with Project Zipper, you don’t have to!
With the project Viewer available today, there’s really only one thing different - the installation is super fast. Rather than waiting for install to complete, you’ll quickly be in Second Life doing what you love.
Try out Project Zipper with the project Viewer here.
This is still a project Viewer, and if you find bugs while testing it out, please let us know by filing them in BUG project in JIRA.
On deck for immediate release are more improvements to Project Sunshine. New updates include better stability and performance enhancements in retry logic, redundant requests, and better detection for appearance conditions. This release also includes lots of bug fixes and quite a bit of cleanup of old code from the old client-side baking framework.
Additionally, we’ve integrated support for the latest AISv3 code (which will improve how back-end processing of inventory items work). We are just awaiting the latest server components to go live before this feature is fully functioning. More on this soon.
Update to the latest Viewer today and take advantage of the new Sunshine refinements!
Since late in 2012, there's been a project to improve HTTP communications between Viewer and grid services. One of the metrics we've tried to improve is maximum request rate: the number of HTTP requests that can be issued and responded to over a time interval. Improving this figure has been challenging. There are and will always be factors beyond our control, such as the characteristics of the home network, ISP policies and capacities, physical distance to services, and transient network problems, to name a few. However, we can do something about other factors that are under our control.
These factors can be fixed limits intended to implement a rate throttle. Or they can also be side-effects of other decisions, such as connection handling. The impact of fixed limits and connection handling is shown in the diagram below taken from texture and mesh fetching. The lowest curve (red) represents the highest possible request rate when using a unique connection for each HTTP request for a maximum of five (5) concurrent connections. The curve above it (blue) shows the doubling of the request rate limit when HTTP keepalives are used. And both of these limits are strongly affected by packet round-trip (ping) times. Finally, the horizontal line at 100 RPS represents an approximate limit on request rate that's independent of ping time.
At the beginning of this project, most HTTP traffic, texture fetching in particular, was constrained to region 'A'. This was the result of a combination of factors. One was the lack of connection keepalives. This forced a TCP connection handshake prior to every HTTP request. Another was the delay in launching new requests on completion of previous requests.
Released in early 2013, Viewer 3.4.3 included a new library, llcorehttp, to begin to address the above issues. It was foundational in that it didn't attack the above problems directly (nor other problems not discussed here), but it established a structure that would allow the problems to be addressed over a series of releases.
While its goals were modest, the efficiency and latency improvements yielded some nice gains. Texture fetching was the first component to use the new library and throughput and robustness increased while tendencies to stall have disappeared. With these changes, texture fetching solidly entered the 'B' region of the diagram.
Deployed in April, 2013, the DRTSIM-203 server release included the first support for HTTP keepalive connections between viewers and simulator hosts. For small transfers, it halved the number of round-trips required to fetch an object. For larger transfers, the impact of TCP slow-start was also reduced. There were several other beneficial side-effects; the greatest of these was creating fewer TCP connections for a given workload. A high rate of connection creation, called connection churn, is one factor in destabilizing many home routers. Symptoms of this include: router reboots, loss of connectivity, timeouts, and DNS failures.
Keepalive connections were enabled for texture fetches and SSL-encrypted HTTP between viewers and simulator hosts. This raised the ceiling into region 'C,' increasing texture download throughput. Mesh downloads continued to use the original connection scheme, and monitoring and enforcement watchdogs were added to protect common services from monopolization.
We continue to evaluate our HTTP protocols with goals of improving our user experiences. Here are some of our most recent areas of interest.
DRTSIM-216/229 + Viewer DRTVWR-329
Mesh fetching, like texture fetching, is a relatively heavy-weight activity in the viewer. It takes time to acquire these assets and display a scene. So mesh fetches will be the next area to receive the above treatment, putting them in region 'C' as well.
The challenge is that mesh fetching has been built around extremely high connection concurrency. A new capability service, called 'GetMesh2', will supersede the existing 'GetMesh' service. It is designed around low connection concurrency. That, in turn, will permit connection keepalive and even HTTP pipelining while promising higher throughput, lower overhead and improved router stability.
As significant as the above changes are, they still leave most users under a ceiling strongly influenced by ping time. Europe and Asia are usually over 100 ms and South Africa often reports 250 ms. There will always be a 'ping tax' on those who are far from grid services. But there is something we can do: HTTP pipelining.
Pipelining reduces the impact of ping time by issuing multiple HTTP requests at once without waiting for responses. This will allow our servers to keep filling the download stream rather than stalling between requests. Mesh and texture fetching will get first crack at pipelining. There will be new problems to discover. An essential library, libcurl, has only just started supporting pipelining. We will need to update some of our services as well. But those will be solved and asset fetches will begin to operate in region 'D'.
Sky's the Limit
Progress doesn't stop with pipelining. There are more opportunities to push upwards into 'E' and 'F' regions. A few possibilities:
If you participated in the DRTSIM-203 beta test, you had an opportunity to try our 'Happy' regions (TextureTest2H, MeshTest2H, etc.). These regions ran on hosts with elevated limits to see what would happen. The results were positive and invite a permanent increase as HTTP behavior improves.
Updating other HTTP-based services, moving them up the throughput curve. This would include services such as inventory operations and display name lookups, which make heavy, bursty demands.
Looking into other transport protocols. SPDY is interesting and tries to solve some of the challenges Second Life presents. How well it solves them is yet to be seen.
These shouldn’t be taken as commitments or written-in-stone plans, as priorities and goals may shift, but we’ve seen good results from our work on the HTTP project and are looking forward to continuing to improve Second Life’s infrastructure.
A new set of Materials work has become a Release Candidate Viewer. This Viewer includes support for Materials underwater, render support for impostors, and fixes that will address issues with transparency and alpha layers. See the Release Notes for full details.
A quick coral study with Normal and Specular maps shown underwater. Land Impact is 14.
As of today, approximately one fifth of the regions in Second Life have at least one object using Materials, and this is increasing at a rate of about 2% per week. Users with high-end graphics cards are seeing a significant increase in frames per second when Advanced Lighting Model (ALM) is turned on -- we highly recommend that those users turn ALM on. In addition, we will be updating the GPU tables to include newer graphics cards, allowing more users to take advantage of advanced graphics features like Materials.
Thanks to those who have already tried out Materials and submitted a bug report to the MATBUG JIRA project! Once this release candidate has completed the release process, we will be moving any active MATBUGs to the BUG project and track any future issues from the BUG project.
If you’d like to experiment with Materials, download the Materials Release Candidate Viewer today. And don’t forget to check the Good Building Practices wiki for more tips and tricks and the Knowledge Base for updated user documentation.
We are making some improvements to how we release new Viewer features. Our goal is to accelerate the pace at which we bring new features and improvements to all Second Life users by removing what had previously been a bottleneck in our release process.
In a couple of weeks, we’re going to begin following a process similar to what we already do with the simulator. We’ll release more than one new version at the same time in parallel to subsets of users for final validation, and then promote the most important of the best of those to the default Release Viewer when that testing shows it to be ready.
In the past, we have used the 'Second Life Beta Viewer' channel for this final user validation, but because we had only one Beta channel, some features and fixes have been significantly delayed waiting for their 'turn' in the channel. We will now be able to put those Viewers in the regular Release channel to a smaller randomly chosen group, allowing us to test and release new features more quickly.
If a development project wants to put out an early version for testing prior to it being ready for the Release channel, a channel specific to that project (either 'Project <project>' for very early versions, or 'Beta <project>' for more mature ones) will be created, just like we do today. These will be shut down when the project is ready to move to final testing in the Release channel, and users in the early project test channels will automatically be upgraded to the corresponding Release candidate version.
Because several different beta Viewers will be tested concurrently in the Release channel, the ‘Second Life Beta Viewer’ channel will no longer be used; there will be a web page where users who want to try early versions can find them. The existing inworld 'Second Life Beta' group is not affected.
Following a similar process has enabled us to accelerate the pace at which we release improvements to the simulator, and we’re looking forward to using this new process to more quickly introduce new features to the Second Life Viewer.
The Snowstorm project is happy to announce that we have a Materials Project Viewer available for alpha testing of support for Normal and Specular maps on Second Life objects! When generally available, this will enable content creators to make much richer looking materials with less geometric complexity.
This is a very early test version - there are still known bugs, and certainly some that we don't know about yet. We do not recommend using it to manipulate content that you care about. Bugs should be reported in the Materials Project Viewer (MATBUG) project on Jira. If you use it, please leave automatic updates enabled, because we expect to release updates relatively frequently.
Information on how to prepare textures in a way that's compatible with the Materials Project Viewer available on the wiki.
This project is a great example of how we can improve Second Life through collaboration: it includes contributions from members of the Exodus, Firestorm, and Catznip development teams, as well as from Linden Lab. We don't yet have a target date for when it will be in the regular Release Viewer, but we're working toward that, and your help testing will make that happen sooner and with higher quality.
Engineering and Operations at Linden Lab are constantly striving to make performance improvements to Second Life and we want to share the results of a recent one.
An optimization on a single query against the read pool of one of the main core database clusters has resulted in nearly a 75% reduction in load on that cluster, improving numerous systems including group functions and teleports. There has been a 7% improvement in teleport performance in peak concurrency hours and an 86% reduction in group query times.
This is only one of the performance changes we will be making over the next few months, so keep an eye out.
User-submitted bug reports help improve the Second Life experience for all Residents, so we greatly appreciate all of you who take the time to provide this invaluable information to us.
Because we want to make it even easier to report bugs, today we are making some changes that will streamline the bug reporting process, allowing us to more quickly collect information and respond to issues.
Following is a summary of the JIRA changes:
- All bugs should now be filed in the new BUG project, using the more streamlined submission form.
- Second Life users will only see their own reported issues. When a Bug reaches the "Been Triaged" status, they will no longer be able to add comments to their issue.
- Once a Bug reaches the “Accepted” or “Closed” status, it will not be updated. You can watch the Release Notes to see when and if a fix has been released for your issue.
- Existing JIRAs will remain publicly visible. We will continue to review and work through these.
To those of you who have taken the time to alert us to bugs and provided the information we need to fix them -- thank you! We hope that you will continue to help us improve Second Life, and this new process should make it easier for all of us. Ideas about how we can continue to improve the bug reporting process can be shared here.
For more information, visit:
How to report a BUG (Knowledge Base Article):
Bug Tracker (wiki page):
Bug Tracker Status/Resolutions (wiki page):
- Why Things Were Less Than Optimal This Past Weeken...
An Update on Linux Viewer Developmen
- New LSL support for Avatar Shape, Height and Hover
- An Update on SLShare Service Issues
- Tools Update Viewer Released
- The Hardware Issues Behind Recent Region Restarts
- An Update on the CDN Project
- CDN Unleashed
- The Sky Over Berlin (and Elsewhere)
An Update on Several Improvemen
ts to Second Life