Since late in 2012, there's been a project to improve HTTP communications between Viewer and grid services. One of the metrics we've tried to improve is maximum request rate: the number of HTTP requests that can be issued and responded to over a time interval. Improving this figure has been challenging. There are and will always be factors beyond our control, such as the characteristics of the home network, ISP policies and capacities, physical distance to services, and transient network problems, to name a few. However, we can do something about other factors that are under our control.
These factors can be fixed limits intended to implement a rate throttle. Or they can also be side-effects of other decisions, such as connection handling. The impact of fixed limits and connection handling is shown in the diagram below taken from texture and mesh fetching. The lowest curve (red) represents the highest possible request rate when using a unique connection for each HTTP request for a maximum of five (5) concurrent connections. The curve above it (blue) shows the doubling of the request rate limit when HTTP keepalives are used. And both of these limits are strongly affected by packet round-trip (ping) times. Finally, the horizontal line at 100 RPS represents an approximate limit on request rate that's independent of ping time.
At the beginning of this project, most HTTP traffic, texture fetching in particular, was constrained to region 'A'. This was the result of a combination of factors. One was the lack of connection keepalives. This forced a TCP connection handshake prior to every HTTP request. Another was the delay in launching new requests on completion of previous requests.
Released in early 2013, Viewer 3.4.3 included a new library, llcorehttp, to begin to address the above issues. It was foundational in that it didn't attack the above problems directly (nor other problems not discussed here), but it established a structure that would allow the problems to be addressed over a series of releases.
While its goals were modest, the efficiency and latency improvements yielded some nice gains. Texture fetching was the first component to use the new library and throughput and robustness increased while tendencies to stall have disappeared. With these changes, texture fetching solidly entered the 'B' region of the diagram.
Deployed in April, 2013, the DRTSIM-203 server release included the first support for HTTP keepalive connections between viewers and simulator hosts. For small transfers, it halved the number of round-trips required to fetch an object. For larger transfers, the impact of TCP slow-start was also reduced. There were several other beneficial side-effects; the greatest of these was creating fewer TCP connections for a given workload. A high rate of connection creation, called connection churn, is one factor in destabilizing many home routers. Symptoms of this include: router reboots, loss of connectivity, timeouts, and DNS failures.
Keepalive connections were enabled for texture fetches and SSL-encrypted HTTP between viewers and simulator hosts. This raised the ceiling into region 'C,' increasing texture download throughput. Mesh downloads continued to use the original connection scheme, and monitoring and enforcement watchdogs were added to protect common services from monopolization.
We continue to evaluate our HTTP protocols with goals of improving our user experiences. Here are some of our most recent areas of interest.
DRTSIM-216/229 + Viewer DRTVWR-329
Mesh fetching, like texture fetching, is a relatively heavy-weight activity in the viewer. It takes time to acquire these assets and display a scene. So mesh fetches will be the next area to receive the above treatment, putting them in region 'C' as well.
The challenge is that mesh fetching has been built around extremely high connection concurrency. A new capability service, called 'GetMesh2', will supersede the existing 'GetMesh' service. It is designed around low connection concurrency. That, in turn, will permit connection keepalive and even HTTP pipelining while promising higher throughput, lower overhead and improved router stability.
As significant as the above changes are, they still leave most users under a ceiling strongly influenced by ping time. Europe and Asia are usually over 100 ms and South Africa often reports 250 ms. There will always be a 'ping tax' on those who are far from grid services. But there is something we can do: HTTP pipelining.
Pipelining reduces the impact of ping time by issuing multiple HTTP requests at once without waiting for responses. This will allow our servers to keep filling the download stream rather than stalling between requests. Mesh and texture fetching will get first crack at pipelining. There will be new problems to discover. An essential library, libcurl, has only just started supporting pipelining. We will need to update some of our services as well. But those will be solved and asset fetches will begin to operate in region 'D'.
Sky's the Limit
Progress doesn't stop with pipelining. There are more opportunities to push upwards into 'E' and 'F' regions. A few possibilities:
If you participated in the DRTSIM-203 beta test, you had an opportunity to try our 'Happy' regions (TextureTest2H, MeshTest2H, etc.). These regions ran on hosts with elevated limits to see what would happen. The results were positive and invite a permanent increase as HTTP behavior improves.
Updating other HTTP-based services, moving them up the throughput curve. This would include services such as inventory operations and display name lookups, which make heavy, bursty demands.
Looking into other transport protocols. SPDY is interesting and tries to solve some of the challenges Second Life presents. How well it solves them is yet to be seen.
These shouldn’t be taken as commitments or written-in-stone plans, as priorities and goals may shift, but we’ve seen good results from our work on the HTTP project and are looking forward to continuing to improve Second Life’s infrastructure.