Tools and Technology


Q Linden

Today we launched Viewer 2.5, now out of Beta. The most significant update is a new, web-based profile system, which allows profiles to be viewed and edited both on the web and in the Viewer. For example, here's mine. Please note this is just a starting point for the web-based profiles; we’ll be doing a lot of work to refine the usability and make them richer over time.

In response to your feedback from the early beta versions of Viewer 2.5, we've added some privacy settings that will allow you to control just how public your profile is. Once you’ve logged in, click on “Privacy Settings” in the upper right corner of your profile. Group visibility settings set in the Viewer will be overridden by these privacy settings.

  • "Everyone" means that the information is available to the whole Internet and can be picked up by search engines.
  • "Second Life" means that the information is available to all Second Life Residents who are logged into the website or inworld. This is the default for all existing Residents using Viewer 2.5.
  • "Friends" means that only your Second Life friends can see the information on the web and inworld.

This is why we have a beta process--to address concerns and improve your user experience. We will continue to iterate as we get more feedback. Thank you for all your help and comments. Please attend the Viewer 2 User Group meetings if you would like to share your thoughts and feedback directly with me and the Snowstorm team.

Viewer 2.5 also has some other new features. The one I like best is that you can now have your Favorite landmarks also appear on the login screen, so that you can log directly into your favorite locations. Torley made a video about this, so check it out! We've also improved texturing performance and fixed another batch of bugs. Watching the internal data, we've already seen a noticeable improvement in stability and performance--on par with Viewer 1.23.

Download Viewer 2.5, try it out, and keep the feedback coming! And, if you Twitter, please use the hashtag #slviewer2.


FJ Linden

As we begin 2011, I want to share the progress that we’re making on several important technology enhancements that I discussed in my last post. As I mentioned, we are focused on improving the overall performance of Second Life while addressing some long-standing limitations by raising group limits, improving the chat system, and reducing lag.

Group Limits Raised to 42 Today

In October, we committed to increase group limits from 25 up to 40 in the first quarter of 2011. As of today, group limits have been raised to 42! To add groups beyond the previous limit of 25, you must be using Viewer 2.4 (or a more recent version). If you’re still using Viewer 1.23, or a third-party viewer based on Viewer 1.23 code, you can add the extra groups in Viewer 2.4 and they will still be accessible when you switch back to Viewer 1.23.

That said, if there is an unexpected load, then we may need to lower the group limit to maintain acceptable performance levels across the grid. If we decide to do that, any Residents who have already joined up to 42 groups will not lose their memberships, but other Residents will not be able to exceed the lowered limit.

Group Chat System Will Launch Gridwide By March 31st

We were set to deploy a prototype of the new group chat system in December, but last-minute licensing issues were found with our chosen open source library. Now that a solution is in place, we expect to have the prototype available by the end of this month and an industry-standard, high-performance group chat system available by the end of this quarter.

Performance Improvements When Teleporting and Crossing Regions

As you teleport or cross regions, all of your avatar data (often a very large amount of data) needs to be processed by both the source and destination regions. To streamline this process, we are now compressing avatar information, making your teleports and region crossings faster and more reliable. In fact, we’ve found that teleport failures due to avatar complexity have dropped 40%. See the graph below for a more detailed view showing how much we’ve shortened region crossing time.

region_crossing_compression_2011_01_11.jpg
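
The post doesn't say exactly how the avatar data is packed; purely as an illustration of the idea, here is a minimal Python sketch (with made-up field names) that serializes an avatar payload and compresses it with zlib before handing it to the destination region:

import json
import zlib

def pack_avatar_data(avatar):
    # Serialize and compress avatar state before a teleport or region crossing.
    # The field names and format here are hypothetical; the real payload is
    # internal to the simulator.
    raw = json.dumps(avatar).encode("utf-8")
    return zlib.compress(raw, 6)

def unpack_avatar_data(blob):
    # Decompress and deserialize avatar state on the destination region.
    return json.loads(zlib.decompress(blob).decode("utf-8"))

avatar = {
    "appearance": ["skin", "shape", "shirt", "pants"] * 50,
    "attachments": [{"name": "attachment_%d" % i, "params": list(range(20))} for i in range(30)],
}
blob = pack_avatar_data(avatar)
print("raw: %d bytes, compressed: %d bytes" % (len(json.dumps(avatar)), len(blob)))
assert unpack_avatar_data(blob) == avatar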

Viewer 2.5 Beta Releasing Soon

We will soon launch the latest version of the Second Life Viewer -- Viewer 2.5 Beta. In addition to more performance and stability improvements, we’ve added enhanced web-based profiles, accessible both on the web and in the Viewer. And, if you wish, you can even connect your Second Life profile to other social identities including Twitter, Facebook, and LinkedIn! You will also have the option in Preferences to choose your first inworld destination from the saved Landmarks in your Favorites Bar. This is very handy when you need to get to a specific destination quickly.

Also, one of the most important new features added to Viewer 2.4 was the auto-updater capability. If you're using Viewer 2.4 or higher, then you won’t be inconvenienced by the notification and download process when we release a new Viewer.

Planning to Implement Significant Grid Infrastructure Enhancements in 2011

We’re planning significant grid infrastructure enhancements throughout the year, including technologies to speed server-side rendering (SSR) and server virtualization (web and simulator services). We are also exploring new storage and asset delivery systems. Some of the benefits will not be immediately noticeable, but they are foundational platform changes that set the stage for rapid performance and scalability improvements. We will continue to keep you updated as we roll out these systems.

I’m pleased with the progress that we made across the platform last year and I'm looking ahead to newer technologies that we will deploy in 2011 to enhance your Second Life experience. As always, I'll be watching for your feedback and thank you for making Second Life such an amazing place.

Q Linden

In the Snowstorm Product Backlog Office Hour Wednesday, I commented that "I think options are bad for users and bad for code quality". If you read that whole transcript, you can probably see that it was interpreted badly. The most extreme variant, reported by someone who was watching in-world chat afterward, held that Linden Lab wanted to remove all options from the Viewer. Let me start by saying that is not the case and never would be, nor is it something that I or anyone at Linden Lab has ever seriously contemplated.

However, I still stand by my original comment -- options are problematic for lots of reasons.

Let's see why:

First, every option has to have a way to control it. In many cases, you have to have multiple ways to control it. From a user interface design point of view, that means creating option interfaces. For the SL Viewer, those are a) the preferences dialog, b) the debug settings, c) checkable menu items, and d) options within dialogs that control other features.

You'd normally like to put options with the things they affect, but screen space is always at a premium and many options are only changed infrequently. So instead, we group options together in a preferences dialog. But there are enough of them that it becomes necessary to create some means of organizing prefs into a hierarchical structure, such as tabs.

But as soon as you do that, you find that you have trouble because not everyone agrees on what the hierarchy should be. What tabs should you have? Where does each option go? When you get too many options for one tab, how should you split them up?

There's no one answer and there's rarely a right answer.

And then, once you have a place to put them, you have to decide what to call each option and what the default is. And if those decisions were easy there wouldn't be a need for an option!

Second, options add complexity to the interface. Every time you add an option, you add a decision for the user to make. In many cases, someone might not even know what the option controls or whether it's important. Too many options might leave someone feeling that the product is too complex to use.

Third, options add complexity to the code. Every option requires code to support all of the branches of the decision tree. If there are multiple options affecting the same feature, all of the combinations must be supported, and tested. Option code is often one of the biggest sources of bugs in a product. The number of options in the Second Life Viewer renderer, which interact not only with each other but with device drivers and different computers, makes it literally impossible for us to test the renderer exhaustively. We have to do a probability-based sampling test.
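
To make the combinatorics concrete, here is a small Python sketch (the option names are made up) showing how quickly the number of configurations grows, and what a probability-based sample looks like:

import itertools
import random

# Hypothetical on/off renderer options; the real Viewer has far more, and
# many are not simple booleans.
options = ["basic_shaders", "atmospheric_shaders", "shadows",
           "ambient_occlusion", "anisotropic_filtering", "vertex_buffers",
           "impostors", "local_lights", "hardware_skinning", "antialiasing"]

all_combos = list(itertools.product([False, True], repeat=len(options)))
print("%d options -> %d combinations to test" % (len(options), len(all_combos)))
# 10 options -> 1,024 combinations; 20 -> 1,048,576; and that is before
# multiplying by GPUs, drivers, and operating systems.

# Probability-based sampling: test a random subset of configurations instead.
for combo in random.sample(all_combos, 3):
    print(dict(zip(options, combo)))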

You could say that it's our problem to deal with that complexity, and you'd be right, but every additional bit of complexity slows down development and testing and makes it harder for us to deliver meaningful functionality.

Fourth, options whose usage splits roughly 50-50 probably do need to exist. Options that split 90-10 are addressing an advanced (and possibly important) use case, and having them in the preferences interface promotes them to a primacy they probably don't deserve.

Finally, adding options has a snowball effect. Having a small number of options is good, but having too many options is definitely bad for the product and for the customers trying to use it. Sure, advanced uses need advanced features, but we don't have to make everyone confront all of the complexity.

Add all of this up, and I think it becomes clearer why I said I didn't like options and would prefer to find alternatives.

So why have options at all, then? Because different people legitimately have different needs. Advanced users vs novices, or landowners vs shoppers. We get it. But it's also often an indication of a design that needs work.

There are alternatives to putting more checkboxes on the preferences screen:

a) Allow entire user interfaces to be "plugged in". This requires a major architectural change to the software. Although we've talked about it, it's going to be a while yet before we get there.
b) Allow options to be controlled close to the point of use. As I said above, this can clutter the interface but can be effective.
c) Make an interface that covers all use cases. This is the hardest of all, requiring real understanding and design, but is usually the right answer.

In short, I often consider adding a preference to the prefs panel to be the wrong answer to a real question. It's not that we don't consider different use cases, it's that we're trying to cover them in a better way.

So this has been my attempt to explain the thinking behind a statement like "options bad". I hope it's helped -- has it? Tell me in the comments.

Linden Lab

Mesh Update

Linden Lab is excited to announce the latest updates on one of our coolest new features — mesh. With mesh, you’ll be able to create, build and beautify even more incredible objects inworld! You’ll find a new Mesh Enablement functionality tab in the My Account section of your Account screen. It’s just another part of the innovation and imagination that makes Second Life the magical place it is.

Uploading Mesh

Now that mesh is in rollout stage and many users will soon be able to utilize its content-creation capabilities, there are a few things creators should know before giving it a whirl. Potential creators will need to satisfy a couple of requirements in order to gain access to mesh-upload capabilities. First, you’ll need to review a short tutorial that is intended to help Residents understand some of Linden Lab’s intellectual property policies — they can be kind of tricky, and we at the Lab want all content creators to be well-informed. This tutorial outlines some of the key points relating to intellectual property.

Second, if you have never given us billing information, you'll need to enter it into your account. Why, you ask? Because having payment information on file is an important step in establishing direct relationships with content creators who will be working in mesh. Note: We do realize that with the current configuration, some Residents will need to make a purchase in order to enter their billing information.

The first round of rollouts of mesh will be happening over the next few weeks! We look forward to seeing your continued imagination, skill and creativity in this exciting new format. We hope you’ll be pleased with our innovations — and can’t wait to see what you come up with!

Now go create something amazing!

Linden Lab

You may have noticed that we’ve added a Mesh Enablement tab to your Account. Get ready to explore the latest innovation from the Lab!

Mesh? What the heck is that?

We’ve had quite a few questions from users about mesh — mostly along the lines of: What is it? How can I use it? When will it become available? Well, we’ve got some answers for you — and some good news!

First of all, what is it, exactly? The term “mesh” refers to an object that consists of polygonal geometry data. Essentially, it’s what you see in modern video games, special effects and 3D animation. It's extremely flexible — and when it’s well-made, it can be way more efficient than the existing prims and sculpties you’ll find inworld currently. Mesh objects are first created in external programs, such as Blender or Maya, and then imported onto the Second Life grid. Once there, mesh objects can be manipulated in pretty much the same ways you would manipulate a regular old prim. And, here’s the best part — the wait is almost over! We’ve been rolling out mesh-upload capability to selected regions over the past week, and the rollout pace will increase in the coming weeks.

Mesh on the Main Grid

Because we’re rolling out mesh-upload capability across the Main Grid over the next few weeks, mesh is not going to be available everywhere all at once. But, you can get a preview. Ready to try mesh? Here are a few steps you'll need to complete to get started:

1. Get enabled. To do this, you'll need to first complete a short tutorial about mesh and intellectual property rights.

2. Download the Mesh Project Viewer.

3. Once you’re inworld and using the Mesh Project Viewer, teleport to the selected Project Regions where mesh capabilities are enabled. If you're interested in uploading mesh objects, you will want to visit public areas to test. View the list of current mesh enabled sandbox regions here.



If you have a region and want to get upgraded early, check here.

Before you visit a Project Region, you should check out the Mesh Project Wiki Page and read about uploading mesh.

It's also a good idea to read through the Mesh Forums to get more info and to ask questions of your fellow creators.

During the rollout period (until we enable upload in the next version of the Release Viewer), Linden Lab will charge a discounted fee for uploads. Also, during this initial period, no mesh items should be posted to the Second Life Marketplace due to the limited availability of places to rez mesh objects.

We’re excited to share our latest project with you — and can’t wait to see the awesome things you’ll build with mesh!

Go out and create something amazing – have fun!

Linden Lab

There's been a lot going on with the Marketplace and our Web properties, and in an effort to give you a more granular view into what we're working on, we're going to put out release notes summaries on this blog going forward. Of course, some things will have to remain behind the scenes, but here's all the news that's fit to print:

10/31/16 New Premium Landing page

10/28/16 Several bug fixes to the support portal support.secondlife.com

10/24/16 We made an update to the Marketplace with the following changes:

  • Fix sorting reviews by rating
  • Fix duplicate charging for PLE subscriptions
  • Fix some remaining hangers-on from the VMM migration (unassociated items dropdown + “Your store has been migrated” notifications)

  • Fix to Boolean search giving overly broad results (BUG-37730)

10/18/16 Maps: We deployed a fix for “Create Your Own Map” link, which used to generate an invalid slurl.

10/11/16 Marketplace: We disabled fuzzy matches in search on the Marketplace so that search results will be more precise.

10/10/16 We made an update to the Marketplace with the following changes:

  • We will no longer index archived listings
  • We will now reindex a store's products when the store is renamed
  • We made it so that blocked users can no longer send gifts through the Marketplace

  • We added a switch to allow us to enable or disable fuzzy matches in search

9/28/16 We deployed a fix to the Marketplace for an issue where a Firefox update was ignoring browser-specific style sheet settings on Marketplace.

9/22/16 We made a change to the Join flow for more consistency in password requirements.

9/22/16 We updated System Requirements to reflect the newest information.

As always, we appreciate and welcome your bug reports in Jira. Please stay tuned to the blogs for updates as we complete new releases.

Linden Lab

As promised, we’re sharing some release note summaries of the fixes, tweaks, and other updates that we’re making to the Marketplace and the Web properties, so that those following along can read through at their leisure.

12/01/16 - Maps: Maps would disappear at peak use times. That’s fixed now.

11/28/16 - We have a new shiny Grid Status blog! You may notice an updated look and feel. If you  followed https://community.secondlife.com/t5/Status-Grid/bg-p/status-blog, be sure to update your subscriptions to status.secondlifegrid.net

11/22/16 - No more slurl.com. All http://maps.secondlife.com/ all the time.

11/21/16 - We did a minor deploy to the lindenlab.com web properties.

11/09/16 - Events infrastructure stabilization to fix a few listing bugs.

11/08/16 - Fixes to maps.secondlife.com were released, including:

  • Viewing a specific location on maps.secondlife.com no longer throws a 404 error in the console
  • Adding a redirect from slurl.com/secondlife/ requests to maps.secondlife.com


11/04/16 - A minor Security fix was released.

11/03/16 - We released a large infrastructure update to secondlife.com along with security fixes and several minor bug fixes.

As always, we appreciate and welcome your bug reports in Jira!

Stay tuned to the blogs for future updates as we complete new releases.

Linden Lab

Yesterday, with much rejoicing, we promoted Viewer 3.7.28.300918 (Tools Update) to release. While this Viewer doesn’t have a shiny new featureset on the surface (other than reverting to a single-button login), it’s what’s inside that really matters - we’ve updated the many tools used to build the Viewer. The immediate expected effect is better performance, improved stability, and a decreased crash rate.

We go to great lengths to maintain backwards compatibility in Second Life, both to never break users’ creations and to support the wide range of systems our Residents use to log in. However, sometimes we have to make hard decisions: a year ago we announced that we were dropping support for Windows XP and Mac OS X 10.5 & 10.6 (a complete list of current system requirements is available here). Today, with the Tools Update release, the Viewer will no longer run on those systems. You will still be able to log in with an older Viewer until it is aged out under our deprecation policy; however, we strongly recommend updating your system.

It's unfortunate that we have to stop supporting some older systems, but upgrading the tools we use to build the viewer will help us to bring you other improvements to your Second Life experience more quickly and reliably.

Linden Lab

Facebook recently announced plans to deprecate an old Open Graph API, requiring all apps running version 1.0 to update to 2.0. We have completed this update for SLShare, but Facebook anticipates that the process on their end for migrating a given app may take up to a couple of weeks. During this migration period, there may be some service interruptions for some apps.

This means that when using SLShare (updating status, photo uploads, and check-ins from the Viewer) you may experience some temporary problems. Please be assured that we are aware of this and any issues you encounter should be resolved once the migration period is complete.

Thank you for your patience!

 

Linden Lab

Available now is the ability for LSL to return an avatar’s shape type to scripted objects. With this information, scripters and creators of objects can determine the best animation, pose, or position to play when avatars interact with their objects.

Scripts can now read the avatar’s shape type (male or female) and hover height values. For a complete list of object and avatar agent size details, please visit the llGetObjectDetails() and llGetAgentSize() wiki pages.

Linden Lab

Heya!

April Linden here. I’m a member of the Second Life Operations team. Second Life had some unexpected downtime on Monday morning, and I wanted to take a few minutes to explain what happened.

We buy bandwidth to our data centers from several providers. On Monday morning, one of those providers had a hardware failure on the link that connects Second Life to the Internet. This is a fairly normal thing to happen (and is why we have more than one Internet provider). This time was a bit unusual, as the traffic from our Residents on that provider did not automatically spill over to one of the other connections, as it usually does.

Our ops team caught this almost immediately and were able to shift traffic around to the other providers, but not before a whole bunch of Residents had been logged out due to Second Life being unreachable.

Since a bunch of Residents were unexpectedly logged out, they all tried to log back in at once. The volume of login attempts was very high, and it took quite a while for everyone to get logged back in. Our ops team brought some additional login servers online to help with the backlog of login attempts, and this allowed the login queue to eventually return to its normal size.

Some time after the login rush had subsided, the failed Internet provider connection was restored and traffic shifted back without disruption, returning Second Life to normal.

There was a bright spot in this event! Our new status blog performed very well, allowing our support team to communicate with Residents even while it was under much higher load than normal.

We’re very sorry for the unexpected downtime on Monday morning. We know how important having a fun time Inworld is to our Residents, and we know how unfun events like this can be.

See you Inworld!

April Linden

Linden Lab

Since its introduction, the Linux version of the Second Life Viewer has been considered a Beta status project, meaning that it might have problems that would not have been considered acceptable on the much more widely used Windows or Mac versions. Because "Linux" isn't really one platform - it's a large (and fluid) number of similar but distinct distributions - doing development, builds, and testing for the Linux version has always been a difficult thing to do and a difficult expense to justify. Today, Linux represents under half of one percent of official Viewer users, and just a little over one percent of users on all viewers. We at Linden Lab need to focus our development efforts on the platforms that will improve the experience of more users.

While we hope to be able to continue to distribute a Linux version, from now on we will rely on the open source community for Linux platform support. Linden Lab will integrate open source community contributions to update the Linux platform support, and will build and distribute the resulting viewers, but our development engineering, including bug fixing, will be focused on the platforms more popular among our users. We hope that the community will take up this challenge; anyone interested in ensuring that their fellow Linux users can continue on their preferred platform is encouraged to reach out to us to find out where help is most needed.

 

Linden Lab

Hi! I’m a member of the Second Life operations team, and I was the primary on-call systems engineer this past weekend. We had a very difficult weekend, so I wanted to take a few minutes to share what happened.

We had a series of independent failures happen that produced the rough waters Residents experienced inworld.

Shortly after midnight Pacific time on January 9th (Saturday) we had the master node of one of the central databases crash. The central database that happened to go down was one of the most used databases in Second Life. Without it, Residents are unable to log in, or do, well, a lot of important things.

This sort of failure is something my team is good at handling, but it takes time for us to promote a replica up the chain to ultimately become the new master node. While we’re doing this we block logins and close other inworld services to help take the pressure off the newly promoted master node when it starts taking queries. (We reopen the grid slowly, turning on services one at a time, as the database is able to handle it.) The promotion process took about an hour and a half, and the grid returned to normal by 1:30am.

After this promotion took place, the grid was stable for the rest of the day on Saturday and into the evening.

That brings us to Sunday morning.

Around 8:00am Pacific on January 10th (Sunday), one of our providers started experiencing issues, which resulted in very poor performance in loading assets inworld. I very quickly got on the phone with them as they tracked down the source of the issue. With my team and the remote team working together we were able to spot the problem and get it resolved by early afternoon. All of our metrics looked good, and my colleagues and I were able to rez assets inworld just fine. It was at this point that we posted the first “All Clear” on the blog, because it appeared that things were back to normal.

It didn’t take us long to realize that things were about to get interesting again, however.

Shortly after we declared all clear, Residents rushed to return to the grid. (Sunday afternoon is a very busy time inworld, even under normal circumstances!) The rush of Residents returning to Second Life (a lot of whom now had empty caches that needed to be re-filled) at a time when our concurrency is the highest put many other subsystems under several times their normal load.

Rezzing assets was now fine, but we had other issues to figure out. It took us a few more hours after the first all clear for us to be able to stabilize our other services. As some folks noticed, the system that was under the highest load was the one that does what we call “baking” - it’s what makes the texture you see on your avatar - thus we had a large number of Residents that either appeared gray, or as clouds. (It was still trying to get caught up from the asset loading outage earlier!) By Sunday evening we were able to re-stabilize the grid, and Second Life returned to normal for real.

One of the things I like about my job is that Second Life is a totally unique and fun environment! (The infrastructure of a virtual world is amazing to me!) This is both good and bad. It’s good because we’re often challenged to come up with a solution to a problem that’s new and unique, but the flip side of this is that sometimes things can break in unexpected ways because we’re doing things that no one else does.

I’m really sorry for how rough things were inworld this weekend. My team takes the stability of the grid extremely seriously, and no one dislikes downtime more than us. Either one of these failures happening independently is bad enough, but having them occur in a series like that is fairly miserable.

See you inworld (after I get some sleep!),

April Linden

Xiola Linden

The AssetHttp Project Viewer is now available on the alternate viewers page!  We expect it to help with speed and reliability in fetching animations, gestures, sounds, avatar clothing, and avatar body parts - but we need your help to make sure everything works!

Historically, loading assets such as textures, animations, clothing items and so on has been an area where problems were common. All such items were requested through the simulator, which would then find the items on our asset hosts and retrieve them. Especially in heavily populated regions it was possible for things to fail to load because the simulators would get overloaded. Having these requests routed through an intermediate host also made them much slower than necessary. 

A few years ago we made the change to allow textures to be loaded directly via HTTP. Now instead of asking the simulator for every texture, the Viewer could fetch textures directly from a CDN, the same way web content is normally distributed. Performance improved and people noticed that the world loaded faster, and clouded or gray avatars became much less common. 

Today we are taking the next step in enabling HTTP-based fetching. If you download this test Viewer, then several other common types of assets will also be retrieved via HTTP. Supported asset types include animations and gestures, sounds, avatar clothing, and avatar body parts such as the skin and shape.
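
At its core, HTTP-based fetching just means the Viewer issues an ordinary web request for each asset. The sketch below is only an illustration; the URL format is made up, since the real capability URLs are handed to the Viewer by the simulator and typically point at a CDN edge node.

import urllib.request

# Hypothetical asset URL; the real one is a capability issued by the simulator.
ASSET_URL = "https://asset-cdn.example.com/assets/{asset_id}/{asset_type}"

def fetch_asset(asset_id, asset_type, timeout=10.0):
    # Fetch a single asset (animation, gesture, sound, clothing, body part)
    # directly over HTTP instead of asking the simulator to relay it.
    url = ASSET_URL.format(asset_id=asset_id, asset_type=asset_type)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

# Example call (only meaningful against a real asset host):
# data = fetch_asset("0f6d3a2e-0000-0000-0000-000000000000", "sound")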

Based on our tests, this change also helps with performance. We need your help to make sure it works for people all over the world, and to identify any remaining issues that we need to fix. Please download the Viewer and give it a spin; we hope it will make your Second Life even faster and more reliable.

Sometime after these changes go into the default Viewer download, we will be phasing out support on the simulators for the old non-HTTP asset fetching process. We will let you know well ahead of that time so you can download a supported Viewer, and so that makers of other viewers will have time to add these changes to their products. Please let us know how it goes! 

Linden Lab

Chris Linden here. I wanted to briefly share an example of some of the interesting challenges we tackle on the Systems Engineering team here at Linden Lab. We recently noticed strange failures while testing a new API endpoint hosted in Amazon. Out of 100 https requests to the endpoint from our datacenter, 1 or 2 of the requests would hang and eventually time out. Strange. We understand that networks are unreliable, so we write our applications to handle it and try to make it more reliable when we can.
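
A test along those lines is easy to reproduce. Here is a rough Python sketch of the kind of check we ran (the endpoint URL is a placeholder): issue 100 HTTPS requests and count how many hang past a timeout.

import socket
import urllib.error
import urllib.request

ENDPOINT = "https://api.example.com/health"  # placeholder, not the real endpoint
ATTEMPTS = 100
TIMEOUT = 10  # seconds; long enough that only a genuine hang trips it

failures = 0
for i in range(ATTEMPTS):
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=TIMEOUT) as resp:
            resp.read()
    except (urllib.error.URLError, socket.timeout) as err:
        failures += 1
        print("request %d failed: %s" % (i, err))

print("%d of %d requests failed" % (failures, ATTEMPTS))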

We began to dig. Did it happen from other hosts in the datacenter? Yes. Did it fail from all hosts? No, which was maddening. Did it happen from outside the datacenter? No. Did it happen from different network segments within our datacenter? Yes.

Sadly, this left our core routers as the only common piece of hardware between the hosts showing failures and the internet at large. We did a number of traceroutes to get an idea of the various paths being used, but saw nothing out of the ordinary. We took a number of packet captures and noticed something strange on the sending side.

1521   9.753127 216.82.x.x -> 1.2.3.4 TCP 74 53819 > 443 [SYN] Seq=0 Win=29200 Len=0 MSS=1400 SACK_PERM=1 TSval=2885304500 TSecr=0 WS=128
1525   9.773753 1.2.3.4 -> 216.82.x.x TCP 74 443 > 53819 [SYN, ACK] Seq=0 Ack=1 Win=26960 Len=0 MSS=1360 SACK_PERM=1 TSval=75379683 TSecr=2885304500 WS=128
1526   9.774010 216.82.x.x -> 1.2.3.4 TCP 66 53819 > 443 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=2885304505 TSecr=75379683
1527   9.774482 216.82.x.x -> 1.2.3.4 SSL 583 Client Hello
1528  10.008106 216.82.x.x -> 1.2.3.4 SSL 583 [TCP Retransmission] Client Hello
1529  10.292113 216.82.x.x -> 1.2.3.4 SSL 583 [TCP Retransmission] Client Hello
1530  10.860219 216.82.x.x -> 1.2.3.4 SSL 583 [TCP Retransmission] Client Hello

We saw the tcp handshake happen, then at the SSL portion, the far side just stopped responding. This happened each time there was a failure. Dropping packets is normal. Dropping them consistently at the Client Hello every time? Very odd. We looked more closely at the datacenter host and the Amazon instance. We poked at MTU settings, Path MTU Discovery, bugs in Xen Hypervisor, tcp segmentation settings, and NIC offloading. Nothing fixed the problem.

We decided to look at our internet service providers in our datacenter. We are multi-homed to the internet for redundancy and, like most of the internet, use Border Gateway Protocol to determine which path our traffic takes to reach a destination. While we can influence the path it takes, we generally don't need to.

We looked up routes to Amazon on our routers and determined that the majority of them prefer going out ISP A. We found a couple of routes to Amazon that preferred to go out ISP B, so we dug through regions in AWS, spinning up Elastic IP addresses until we found one in the route preferring to go out ISP B. It was in Ireland. We spun up an instance in eu-west-1 and hit it with our test and ... no failures. We added static routes on our routers to force traffic to instances in AWS that were previously seeing failures. This allowed us to send requests to these test hosts either via ISP A or ISP B, based on a small configuration change. ISP A always saw failures, ISP B didn't.

We manipulated the routes to send outbound traffic from our datacenter to Amazon networks via the ISP B network. Success. While in place, traffic preferred going out ISP B (the network that didn't show failures), but would fall back to going out ISP A if for any reason ISP B went away.   

After engaging with ISP A, they found an issue with a piece of hardware within their network and replaced it. We have verified that we no longer see any of the same failures and have rolled back the changes that manipulated traffic. We chalk this up as a win, and by resolving the connection issues we've been able to make Second Life that much more reliable.

 

Linden Lab

Over the past week, a number of Second Life customers may have noticed that they were not being billed promptly for their Premium membership subscriptions, mainland tier fees, and monthly private region fees, with some customers inadvertently receiving delinquent balance notices by email, as we described on our status blog.

This incident has now been corrected, and our nightly billing system has since processed all users that should have been billed over the past week.

I wanted to share with you some of the details of what happened to cause this outage from our internal post mortem and more importantly, what we’re doing to prevent this happening in the future.

Summary

Every night, one of our batch processes collects a list of users that should be billed on that day, and processes that list through one of our internal data service subsystems. Internally, we refer to this process as the 'Nightly Biller'.

A regularly scheduled deploy to that same data service subsystem for a new internal feature inadvertently contained a regression which prevented this Nightly Biller process from running to completion.

Event Timeline

On February 1st, 2016, we began a rolling deploy of code to one of our internal dataservice subsystems. For this particular deploy, we opted to deploy the code to the backend machines over four days, deploying to six hosts each day. The deploy was completed on February 4th.

The first support ticket regarding billing was raised on February 8th; however, as we had only one incident reported to our payments team, we decided to wait and see if that customer was billed correctly the next night.

However, we were notified on the morning of February 9th that 546 private regions had not been billed, and an internal investigation began with a team assembled from Second Life Engineering, Payments, QA and Network Operations teams. This team identified the regression by 9am, and had pushed the required code fixes to our build system. By noon, the proposed fix was pushed to our staging environment for testing.

Unfortunately, testing overnight uncovered a further problem with this new code that would have prevented new users from being able to join Second Life. On February 10th, investigation into this failure, and how it was connected with both the Nightly Biller system, and our new internal tool code continued.

By February 11th, we had made the decision to roll back to the previous code version that would have allowed the Nightly Biller to complete successfully, but would have disabled our new internal feature. One final review of the new code uncovered an issue with an outdated version of some Linden-specific web libraries. Once these libraries were updated and deployed to our staging environment, our QA team were able to successfully complete the tests for our Nightly Biller, our new internal tool, and the User Registration flow.

The new code was pushed out to our production dataservice subsystem by 7pm on February 11th, and the Payments team were able to confirm that the Nightly Biller ran successfully later that evening.

Follow Up

As a result of this incident, we’re making some internal process changes:

  • Firstly, we’ll be changing our build system to ensure that when new code is built, we’re always using the latest version of our internal libraries.
  • Secondly, we are implementing changes to our workflow around code deploys to ensure that such regressions do not occur in the future.

Conclusions

We're always striving for low-risk software deploys at the Lab, and each code deploy request made is evaluated for a potential risk level. Further reducing our risk is internal documentation that describes the release process. Unfortunately, a key step was missed in the process, which inadvertently led to a high-risk situation and the failure of our Nightly Biller. The changes above are already in progress and will reduce the likelihood of incidents such as this recurring.

 

Steven Linden

Linden Lab

Hi! I wanted to take a moment to share why we had to do a full grid roll on a Friday. We know that Friday grid rolls are super disruptive, and we felt it was important to explain why this one was timed the way it was.

Second Life is run on a collection of thousands of Linux servers, which we call the “grid.” This week there was a critical security warning issued for one of the core system libraries (glibc) that we use in our version of Linux. This security vulnerability is known as CVE-2015-7547.

Since then we’ve been working around-the-clock to make sure Second Life is secure.

The issue came to light on Tuesday morning, and the various Linux distributions made patches for the issue available shortly afterwards. Our security team quickly took a look at it, and assessed the impact it might have on the grid. They were able to determine that under certain situations this might impact Second Life, so we sprang into action to get the grid fully patched. They were able to make this determination shortly after lunch time on Tuesday.

The security team then handed the issue over to the Operations team, who worked to make the updates needed to the machine images we use. They finished in the middle of the night on Tuesday (which was actually early Wednesday morning).

Once the updates were available, the development and release teams sprang into action and pulled the updates into the Second Life Server machine image. It took until Wednesday afternoon to get the Second Life Server code built and tested, and for the security team to confirm that any potential risk had been taken care of.

After this, the updates were sent to the Quality Assurance (QA) team to make sure that Second Life still functioned as it should, and they finished up in the middle of the night on Wednesday.

At this point we had a decision to make - do we want to roll the code to the full grid at once? We decided that since the updates were to one of the most core libraries, we should be extra careful, and decided to roll the updates to the Release Candidate (RC) channels first. That happened on Thursday morning.

We took Thursday to watch the RC channels and make sure they were still performing well, and then went ahead and rolled the security update to the rest of the grid on Friday.

Just to make it clear, we saw no evidence that there was any attempt to use this security issue against Second Life. It was our mission to make sure it stayed that way!

The reason there was little notice for the roll on Thursday is twofold. First, we were moving very quickly, and second, because the roll was to mitigate a security issue, we didn’t want to tip our hand and show what was going on until after the issue had been fully resolved.

We know how disruptive full grid rolls are, and we know how busy Friday is for Residents inworld. The timing was terrible, but we felt it was important to get the security update on the full grid as quickly as we could.

Thank you for your patience, and we’re sorry for the bumpy ride on a Friday.

April Linden

Linden Lab

Last week we deployed the change to serve all texture and mesh data primarily through the CDN, as we've been doing with avatar textures since March. In addition to reviewing feedback from Residents we've been monitoring and measuring the effects of the change, and thought it would be interesting to share some of what we've learned.

First the good news:

  • Load on some key systems on the simulator hosts has been reduced considerably. The chart below shows the frequency of high-load conditions in the simulator web services, and you can see the sharp drop as the CDN takes on much of that job. This translates into other things, including region crossings and teleports, being faster and more reliable.
  • For most users most of the time there has been a big performance improvement in texture and mesh data loading, resulting in faster rez times in new areas. The improvement has been realized both on the official viewer and on third party viewers.

cdn2.png

However, we have also seen that some users have had the opposite experience, and have worked with a number of those users to collect detailed data on the nature of the problems and shared it with our CDN provider. We believe that the problems are the result of a combination of the considerable additional load we added to the CDN, and a coincidental additional large load on the CDN from another source. Exacerbating matters, flaws in both our viewer code and the CDN caused recovery from these load spikes to be much slower than it should have been. We are working with our CDN provider to increase capacity and to configure the CDN so that Second Life data availability will not be as affected by outside load. We are also making changes to our code and in the CDN to make recovery quicker and more robust.

We are confident that using the CDN for this data will make the Second Life experience better. Making any change to a system at the scale of Second Life has some element of unavoidable risk; no matter how carefully we simulate and test in advance, once you deploy at scale in live systems there's always something to be learned. This change has had some problems for a small percentage of users; unfortunately, for those users the problems were quite serious for at least part of the time. We appreciate all the help we've gotten from users in quickly diagnosing those problems. We think that the changes we've begun making will reduce the frequency of failures to below what they were before we adopted the CDN, while keeping the considerable performance gains.

 

Linden Lab

Heya! April Linden here.

Yesterday afternoon (San Francisco time) all of the Place Pages got reset back to their default state. All customizations have been lost. We know this is really frustrating, and I want to explain what happened.

A script that our developers use to reset their development databases back to a clean state was accidentally run against the production database. It was completely human error. Worse, none of the backups we had of the database actually worked, leaving us unable to restore it. After a few frantic hours yesterday trying to figure out if we had any way to get the data back, we decided the best thing to do was just to leave it alone, but in a totally clean state.

An unfortunate side effect of this accident is that all of the web addresses to existing Place Pages will change. There is a unique identifier in the address that points to the parcel that the Place Page is for, and without the database, we’re unable to link the address to the parcel. (A new one will automatically be generated the first time a Place Page is visited.) If you have bookmarks to any Place Pages in your browser, or on social media sites, they'll have to be updated.

Because of this accident, we’re taking a look at the procedures we already have to make sure this sort of mistake doesn’t happen again. We’re also doing an audit of all of our database backups to make sure they’re working like we expect them to.

I’d like to stress that we’re really sorry this accident occurred. I personally had a bunch of Place Pages I’d created, so I’m right in there with everyone else in being sad. (But I’m determined to rebuild them!)

Since we’re on the topic of human error, I’d like to share with you a neat piece of the culture we have here at the Lab.

We encourage people to take risks and push the limits of what we think is possible with technology and virtual worlds. It helps keep us flexible and innovative. However… sometimes things don’t work out the way they were planned, and things break. What we do for penance is what makes us unique.

Around the offices (and inworld!) we have sets of oversized green ears. If a Linden breaks the grid, they may, if they choose to, wear the Shrek Ears as a way of owning their mistake.

If we see a fellow Linden wearing the Shrek Ears, we all know they’ve fessed up, and they’re owning their mistake. Rather than tease them, we try to be supportive. They’re having a bad day as it is, and it’s a sign that someone could use a little bit of niceness in their life.

At the end of the day, the Linden takes off the Shrek Ears, and we move on. It’s now in the past, and it’s time to learn from our mistakes and focus on the future.

There are people wearing Shrek Ears around the office and inworld today. If you see a Linden wearing them, please know that’s their way of saying sorry, and they’re really having a bad day.

april-baloo-bear.jpg

Baloo Linden and April Linden, in the ops team’s inworld headquarters, the Port of Ops.

Linden Lab

Keeping the systems running the Second Life infrastructure operating smoothly is no mean feat. Our monitoring infrastructure keeps an eye on our machines every second, and a team of people work around the clock to ensure that Second Life runs smoothly. We do our best to replace failing systems proactively and invisibly to Residents. Unfortunately, sometimes unexpected problems arise.

In late July, a hardware failure took down four of our latest-generation simulator hosts. Initially, this was attributed to a random failure, and the machine was sent off to our vendor for repair. In early October, a second failure took down another four machines. Two weeks later, another failure took down another four hosts.

Each host lives inside a chassis along with three other hosts. These four hosts all share a common backplane that provides the hosts with power, networking and storage. The failures were traced to an overheating and subsequent failure of a component on these backplanes.

After exhaustive investigation with our vendor, the root cause of the failures turned out to be a hardware defect in a backplane component. We arranged an on-site visit by our vendor to locate, identify, and replace the affected backplanes. Members of our operations team have been working this week with our vendor in our datacentre to inspect every potentially affected system and replace the defective component to prevent any more failures.

The region restarts that some of you have experienced this week were an unfortunate side effect of this critical maintenance work. We have done our best to keep these restarts to a minimum, as we understand just how disruptive a region restart can be. The affected machines have been repaired and returned to service, and we are confident that no more failures of this type will occur in the future. Thank you all for your patience and understanding as we have proceeded through the extended maintenance window this week.

 

Linden Lab

Recently, our bug reporting system (Jira) was hit with some spam reports and inappropriate comments, including offensive language and attempts at impersonating Lindens. The Jira system can email bug reporters when new comments are added to their reports, and so unfortunately the inappropriate comments also ended up in some Residents' inboxes.

We have cleaned up these messages, and continue to investigate ways to prevent this kind of spam in the future. We appreciate your understanding as we work to manage an open forum and mitigate incidents like this. 

In the short term, we have disabled some commenting features to prevent this from recurring. This means that you will not be able to comment on Jiras created by other Residents. We apologize for this inconvenience as we look into long term solutions to help prevent this type of event from occurring.





 

Linden Lab

Hi! I’m a member of the Second Life Operations team. On Friday afternoon, major parts of Second Life had some unplanned downtime, and I want to take a few minutes to explain what happened.

Shortly before 4:15pm PDT/SLT last Friday (May 6th, 2016), the primary node for one of the central databases that drive Second Life crashed. The database node that crashed holds some of the most core data to Second Life, and a whole lot of things stop working when it’s inaccessible, as a lot of Residents saw.

When the primary node in this database is offline we turn off a bunch of services, so that we can bring the grid back up in a controlled manner by turning them back on one at a time.

My team quickly sprang into action, and we were able to promote one of the replica nodes up the chain to replace the primary node that had crashed. All services were fully restored and turned back on in just under an hour.

One additional (and totally unexpected) problem that came up is that for the first part of the outage, our status blog was inaccessible. Our support team uses our status blog to inform Residents of what’s going on when there are problems, and the amount of traffic it receives during an outage is pretty impressive!

A few weeks ago we moved our status blog to new servers. It can be really hard to tune a system for something like a status blog, because the traffic will go from its normal amount to many, many times that very suddenly. We now see that we have some additional tuning to do on the status blog in its new home. (Don’t forget that you can also follow us on Twitter at @SLGridStatus. It’s really handy when the status blog is inaccessible!)

As Landon Linden wrote a year ago, being around my team during an outage is like watching “a ballet in a war zone.” We work hard to restore Second Life services the moment they break, and this outage was no exception. It can be pretty crazy at times!

We’re really sorry for the unexpected downtime late last week. There’s a lot of fun things that happen inworld on Friday night, and the last thing we want is for technical issues to get in the way.


April Linden

Linden Lab

Missed Connections

 

Heya! April Linden here.

We had a pretty rough morning here at the Lab, and I want to tell you what happened.

Early this morning (during the grid roll, but it was just a coincidence) we had a piece of hardware die on our internal network. When this piece of hardware died, it made it very difficult for the servers on the grid to figure out how to convert a human-readable domain name, like www.secondlife.com, into IP addresses, like 216.82.8.56.

Everything was still up and running, but none of the computers could actually find each other on our network, so activity on the grid ground to a halt. The Second Life grid is a huge collection of computers, and if they can’t find each other, things like switching regions, teleports, accessing your inventory, changing outfits, and even chatting fail. This caused a lot of Residents to try to relog.
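
That name-to-address lookup is just DNS resolution. As a simple illustration of the step that was failing grid-wide, here is how the same lookup looks in Python:

import socket

# Resolve a human-readable name to an IP address -- the lookup that the
# grid's servers suddenly could not perform while the DNS hardware was down.
print(socket.gethostbyname("www.secondlife.com"))

# A fuller lookup that returns every address record (IPv4 and IPv6):
for family, _, _, _, sockaddr in socket.getaddrinfo("www.secondlife.com", 443):
    print(family, sockaddr[0])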

We quickly rushed to get the hardware that died replaced, but hardware takes time - and in this case, it was a couple of hours. It was very eerie watching our grid monitors. At one point the “Logins Per Minute” metric was reading “1,” and the “Percentage of Successful Teleports” was reading “2%.” I hope to never see numbers like this again.

Once the failed hardware was replaced, the grid started to come back to life.

Following the hardware failure, the login servers got into a really unusual state. The login server would tell the Resident’s viewer that the login was unsuccessful, but it was telling the grid itself that the Resident had logged in. This mismatch in communication made finding what was going on really difficult, because it looked like Residents were logging in, when really they weren't. We eventually found the thing on the login servers that wasn’t working right following the hardware failure, and corrected it, and at this point the grid returned to normal.

There is some good news to share! We are currently in the middle of testing our next generation login servers, which have been specifically designed to better withstand this type of failure. We’ve had a few of the next generation login servers in the pool for the last few days just to see how they handle actual Resident traffic, and they held up really well! In fact, we think the only reason Residents were able to log in at all during this outage was because they happened to get really lucky and got randomly assigned to one of the next generation login servers that we’re testing.

The next step for us is to finish up testing the next generation login servers and have them take over for all login requests entirely. (Hopefully soon!)

We’re really sorry about the downtime today. This one was a doozy, and recovering from it was interesting, to say the least. My team takes the health and stability of Second Life really seriously, and we’re all a little worn out this afternoon.

Your friendly long eared GridBun,

April Linden

Linden Lab

 

Hi everyone! Mazidox here. I’d like to give you an overview of what happened on Wednesday (09/06) that ended up with some Residents’ objects being mass returned.

Two weeks ago, we had several problems crop up all at once - starting with a DNS server outage (a server that helps route requests between different parts of Second Life). Unfortunately, when the dust settled, we started seeing a disturbing trend: mass-returns of objects.

We diagnosed an issue where a region starts up with incorrect mesh Land Impact calculations, which could lead to a lot of objects getting returned at once, as we had encountered several months ago. At that time we applied what we call a speculative fix. A speculative fix means that while we can’t recreate the circumstances that led to a problem, we are still fairly confident that we can stop it from happening again. Unfortunately, in this case we were mistaken; because the fix we applied was speculative, the problem wasn’t fixed as completely as it could have been, and we found out how incomplete the fix was in a dramatic fashion that Wednesday night.

When a problem like this occurs with Second Life we have three priorities:

  1. Stop the problem from getting worse

  2. Fix the damage that has been done

  3. Keep the problem from happening again

We had the first priority taken care of by the end of the initial outage; we could be certain at that point that our servers could talk to each other and there weren’t going to be any more mass-returns of objects that day. At that point, we started assessing the damage and figuring out how to fix as much as we could. In this case it turned out that restarting affected regions where no objects had been returned fixed the problem of some meshes showing the wrong Land Impact.

For regions where a mass-return had happened, there wasn’t a quick fix. Our Ops team managed to figure out a partial list of which regions were affected by a mass object return, which kept our Support team very busy with clean up. Once we had helped everyone we knew of who had experienced mass object returns, our focus shifted once more, this time to keeping the problem from happening again.

In order to recreate all the various factors that caused this object return we needed to first identify each contributing factor, and then put those pieces together in a test environment. Running tests and finding strange problems is the Server QA team’s specialty so we’ve been at it since the morning after this all happened. I have personally been working to reproduce this, along with help from our Engineering and Ops teams. We’re all focused on trying to put each of the pieces together to ensure that no one has to deal with a mass-return again.

Your local bug-hunting spraycan,

Mazidox Linden

Linden Lab

Hi everyone! :)

As many Residents saw, we had a pretty rough day on the Grid yesterday. I wanted to take a few minutes and explain what happened. All of the times in this blog post are going to be in Pacific Time, aka SLT.

Shortly after 10:30am, the master node of one of the central databases crashed. This is the same type of crash we’ve experienced before, and we handled it in the same way. We shut down a lot of services (including logins) so we could bring services back up in an orderly manner, and then promptly selected a new master and promoted it up the chain. This took roughly an hour, as it usually does.

A few minutes before 11:30am we started the process of restoring all services to the Grid. When we enabled logins, we did it in our usual method - turning on about half of the servers at once. Normally this works pretty well as a throttle, but in this case, we were well into a very busy part of the day. Demand to log in was very high, and the number of Residents trying to log in at once was more than the new master database node could handle.

Around noon we made the call to close off logins again and allow the system to cool off. While we were waiting for things to settle down we did some digging to try to figure out what was unique about this failure, and what we’ll need to do to prevent it next time.

We tried again at roughly 12:30pm, doing a third of the login hosts at a time, but this too was too much. We had to stop on that attempt and shut down all logins again around 1:00pm.

On our third attempt, which started once the system cooled down again, we took it really slowly, and brought up each login host one at a time. This worked, and everything was back to normal around 2:30pm.

My team is trying to figure out why we had to turn the login servers back on much more slowly than in the past. We’re still not sure. It’s a pretty interesting challenge, and solving hard problems is part of the fun of running Second Life. :)

Voice services also went down around this time, but for a completely unrelated reason. It was just bad luck and timing.

We did have one bright spot! Our status blog handled the load of thousands of Residents checking it all at once much better than before. We know it wasn’t perfect, but it showed much improvement over the last central database failure, and we’ll keep getting better.

My team takes the stability of Second Life very seriously, and we’re sorry about this outage. We now have a new challenging problem to solve, and we’re on it.

April Linden
