We rolled out some changes to how region crossings work this week, and I want to explain a little about why and what we changed.
Please note that this blog post is gonna be a bunch more technical than my normal posts, because this is a tricky technical thing. Here’s a really quick summary, however: As part of moving regions to the cloud, we discovered that region crossings between the cloud and our data center were terrible. Since we now had an easy way to reproduce the issue, we dug into it, and were able to find and fix some really old bugs. Hooray!
And now for a fuller description!
The process of crossing from one region to another when you’re riding in a vehicle is pretty involved. The region you’re leaving needs to be able to tell the region where you’re arriving everything it knows about the vehicle, and has to do it really quickly. That includes all of the scripts in the vehicle, everything that’s attached to it, the direction and how fast it’s going, and lots of other stuff.
To make this happen quickly, early on in Second Life’s history we made some assumptions about our network, including things like how big a packet can be. Those assumptions generally worked okay on our own network, but not outside it.
When you crossed from one region to another, the regions were putting a lot of information into large packets and sending them across our network. This was usually okay because our network was purposefully built to run Second Life. Then, as soon as we tried to do this on someone else’s network (in the cloud), things didn’t work quite right. The problem was most noticeable when crossing from a region in our data center to one in the cloud.
The first thing our engineering teams tried was breaking those large packets up into smaller ones, but that actually made the problem worse. Rather than sending one big packet and waiting once for the other side to say it received the data, the region now had to repeat that cycle for every small packet. (Send, get an acknowledgement, send another piece, get an acknowledgement, etc.) It was still mostly okay across our network, but way worse when a region in our network was talking to one in the cloud. We now knew this code would never work well, so we needed a different approach.
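To give a rough feel for why the smaller packets hurt, here is a back-of-the-envelope sketch in Python. It is not the simulator code. The function name, the payload size, the packet size, and both round-trip times are made-up numbers for illustration only. It also assumes the only cost is one full round trip per packet, and ignores transmission time and lost packets.

```python
import math


def stop_and_wait_seconds(payload_bytes: int, packet_bytes: int, rtt_seconds: float) -> float:
    """Estimate delivery time when every packet must be acknowledged
    before the next one is sent, i.e. one round trip per packet."""
    packets = math.ceil(payload_bytes / packet_bytes)
    return packets * rtt_seconds


# Hypothetical values, chosen only to illustrate the effect.
PAYLOAD_BYTES = 60_000       # assumed size of the vehicle's state
SMALL_PACKET_BYTES = 1_200   # assumed size of each smaller packet
LOCAL_RTT = 0.0005           # assumed round trip inside one network (0.5 ms)
CLOUD_RTT = 0.020            # assumed round trip from data center to cloud (20 ms)

for label, rtt in (("same network", LOCAL_RTT), ("data center to cloud", CLOUD_RTT)):
    one_big = stop_and_wait_seconds(PAYLOAD_BYTES, PAYLOAD_BYTES, rtt)
    many_small = stop_and_wait_seconds(PAYLOAD_BYTES, SMALL_PACKET_BYTES, rtt)
    print(f"{label}: one big packet ~{one_big * 1000:.1f} ms, "
          f"small packets ~{many_small * 1000:.1f} ms")
```

With these invented numbers the payload splits into 50 small packets, so the sender waits for 50 round trips instead of one. That extra waiting is barely noticeable when each round trip is tiny, and it adds up fast when each round trip is longer.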
Next, our engineering team decided to use another way to send the data across the network, using the same protocol and method we use for other types of data. Most importantly, it is faster and more reliable. That did the trick! We’re still collecting statistics on the impact this change has, but things are looking very positive.
Once this new code was written, the performance when going from region to region got a lot better, and it worked between our data center and the cloud! The improvement was so dramatic that we decided to not make our Residents wait for uplifted simulators, and rolled the changes out right away. That code is what rolled out to the grid this week.
It’s really exciting that the cloud migration is helping us find really old bugs and make Second Life better as we go.
A gridbun that’s really ready to hop among the clouds,