Improving performance through sim multithreading


animats

We all know about SL's sim-side performance problems. The usual solution today in other applications is to put more CPU cores on the problem. When SL was designed, the typical server had one core and ran at about 3 GHz, so SL's sim programs are single-threaded and can't use a multi-core CPU effectively. Today, a typical server has 8 or more cores and runs at about 4 GHz per core. You can get more cores today, but you can't get much faster ones. With the move to AWS coming, LL will have much more flexibility in how many CPUs are available to a sim. If sims were multi-threaded, enough compute power could be applied when needed to make even Fashion Week sims work without choking.

So we had some discussion today at Server User Group about what it would take to add multiple core usage to script processing. Script processing is the big variable load sim-side. If a sim is overloaded, it's usually heavy script load. (Yes, there are exceptions.)

So what would it take to parallelize LSL? These are the concurrency rules of LSL:

  • Each script is processed sequentially. You never interrupt one event to run another event in the same script. So you don't need locks in LSL code.
  • There is no order guarantee across multiple scripts. You have no control over which scripts run first.
  • Each script is guaranteed that, within an event that doesn't have a delay (llSleep or a forced delay), the values you get from get-type operations like llGetPos will not change. (It's not clear there's a formal guarantee of this, but scripts seem to expect it.)

OK. So let's suppose multiple scripts could run at once, on different CPUs. This is quite feasible. The problems come from two scripts changing the same world state information. In code that was never written for concurrency, adding locks after the fact is risky and difficult. A very clean approach is needed to make this work.

So, suppose it works like this (a rough code sketch follows the list).

  • All state changes to the world (all llSet... functions) queue up a change item but don't change the world immediately. llSetPrimitiveParamsFast already works this way. It's asynchronous. You request a change, and it happens some time later.
  • Synchronous llSet... functions, like llSetPos, would also queue up a change item, instead of being applied immediately as they are now. But they'd also set a "changed" flag for the script, indicating that the world info obtained with llGet... functions is out of date.
  • Calling an llGet... function when the "changed" flag is set causes the script to be paused for one frame before the value is fetched. This way, if you do llSet... followed by llGet..., you get the new value, one frame later, and you're back in sync. This has the same effect as the current "forced delay" mechanism, but it's applied later. Note that you can do many llSet... operations before being stalled for one frame time. It's only when you do a "get" after a "set" that you have to wait for the world to catch up.
  • At the end of a frame cycle, after all scripts have finished their event or were interrupted, all change items get applied to the world. If two scripts queued up change items, the last one wins. All script "changed" flags are cleared.
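
To make that concrete, here is a minimal sketch of the queued-update idea in C++-flavored server code. All the names (WorldState, ScriptContext, ChangeItem, and so on) are invented for illustration; this is not LL's simulator code, just one shape the mechanism could take:

    #include <functional>
    #include <utility>
    #include <vector>

    struct WorldState { /* the region's authoritative state: prim positions, etc. */ };

    // One deferred llSet... side effect, applied at end of frame.
    struct ChangeItem {
        std::function<void(WorldState&)> apply;
    };

    struct ScriptContext {
        bool changed = false;             // set by any synchronous llSet... call
        bool stalled = false;             // paused until the next frame
        std::vector<ChangeItem> pending;  // this script's queued changes
    };

    // Any llSet...-style call: queue the change and mark llGet... results stale.
    void queue_set(ScriptContext& script, std::function<void(WorldState&)> effect) {
        script.pending.push_back({std::move(effect)});
        script.changed = true;
    }

    // Any llGet...-style call: if this script changed the world this frame,
    // stall it for one frame so it reads post-update values.
    bool can_get_now(ScriptContext& script) {
        if (script.changed) {
            script.stalled = true;  // scheduler resumes the script next frame
            return false;
        }
        return true;  // world state is stable for this script; read directly
    }

    // End of frame: after all scripts ran (possibly on many cores), apply the
    // queued changes single-threaded. Last writer wins; all flags clear.
    void end_of_frame(std::vector<ScriptContext>& scripts, WorldState& world) {
        for (ScriptContext& s : scripts) {
            for (ChangeItem& c : s.pending) c.apply(world);
            s.pending.clear();
            s.changed = false;
            s.stalled = false;
        }
    }

The point of the sketch is that scripts never see a half-updated world: during the frame they read a frozen snapshot, and all writes land in one deterministic batch between frames. That batching is what makes it safe to run the scripts themselves on multiple cores.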

OK, this is confusing. It's a lot like the way a database transaction or a superscalar CPU fence works, though. So this isn't new, it's just a widely used concept translated to the SL world.

Now here's where it gets interesting. It would be a problem if an LSL function set something immediately without a forced delay.

To make those functions work asynchronously, a one-frame delay would have to be inserted. Classic LSL functions that change things almost all have a forced delay, or they are explicitly asynchronous. It looks like the original designers of SL were preparing for concurrency in the future. Someone was thinking ahead. When you think about it, it's unusual that so many LSL llSet... functions are "blind" - you don't get a completion status back. That's exactly what you want in an asynchronous system.

Some newer functions do break those rules.

  • llSetRegionPos - returns a status and has no forced delay. There's no reason this shouldn't have a delay, like llSetPos does.
  • llScaleByFactor - returns a status and has no forced delay.

This suggests the original design rationale of LSL calls was forgotten. Those few new calls could be given a forced delay, to make them like the others.

So, looking ahead, this is a way that SL can move into the multi-core era and overcome its performance problems while not breaking existing scripts. I suspect the original designers of SL had this in mind years ago, but multicore CPUs were rare and expensive back then. Now, you usually have more cores than you know what to do with.

 

 

 

 


I do not think LL is working toward a multi-threaded sim solution at all, and I don't see anything like that happening in the near future. Let's say they have 20,000 full regions running, not counting homesteads and other regions, as those seem to share cores. Right now they use 20,000 cores for those 20,000 regions. Suppose, for the sake of argument, the sims could start using 2 cores each: they would have to add 20,000 cores across the grid. My best guess is that with AWS they're looking to reduce the number of cores used rather than increase it.

It's not like they run a single sim on a 16-core server, they run 16 sims on a 16-core box, 1 core for each region.

I do see your point though, it would be an increase in computational power for each region, but the benefits would not outweigh the costs.


4 minutes ago, Kardargo Adamczyk said:

It's not like they run a single sim on a 16-core server, they run 16 sims on a 16-core box, 1 core for each region.

As I understand it, the sims are not actually pinned to a core, so it's possible they could use some of their host-mates' idle processor capacity. Of course, that assumes some smart load balancing such that no sim can be denied a full core's worth of CPU time (which I kinda think is already part of their host virtualization). Some similar limiting would also be needed in the cloud, to make sure they can't gobble up all the processing in a data center if things go haywire.

You're right, though: until they're on the cloud, the payoff for sim multi-threading would be pretty limited.


20 hours ago, animats said:

You can get more cores today, but you can't get much faster ones.

Well, yes and no. The work done by a single core of a modern CPU at the same speed rating is *significantly* higher than it used to be - this is where innovations in processor design come into play. When clock frequencies get much higher, all sorts of bad things happen (power usage, heat, etc.), which is why there is such a push for more cores at the same clock speed - it's cheaper to put more of the same core on a die than to increase efficiency (work per clock cycle). In fact, most of the performance improvements chip designers can come up with have already been done. Looking at a per-core CPU benchmark graph of common processors over the past 15 years, there was a big upwards trend that has leveled off significantly in the past 5 years.

This is bad news for SL: per-core performance has been pretty flat, with normal ups and downs, yet despite pulling functionality off the sim server and pushing it to other servers / content servers, things got worse. I still suspect LL went on a money-saving spree, going with much higher core count CPUs, possibly with lower speed ratings, allowing more regions per server. Unfortunately the rest of the server won't be significantly faster. This money saving would explain the new price decrease on full regions. (Personally, I'd forgo the price decrease if it meant additional CPU could be made available, even if it was just a percentage - like an additional 25% of a core.)

Reading through server release notes over the years, you will see comments like "Moved BLAH to its own thread because...", which sure makes it sound like the server code is already quite threaded (not sure how you do something like SL without a threaded architecture). More threads don't help, of course, unless you have more than one physical core available to execute them (context switches are expensive!). The process scheduler for the simulator must limit region instances to running a single thread at a time in order to maintain that 1 CPU core per region limit.

Bottom line - SL is already multi-threaded. Looking in caves and under rocks, trying to find more opportunities for a threaded task when there are no plans to increase available cores per region is pointless. LL can no longer rely on core performance increases to dig them out of their performance holes due to the more recent flat-lining of those generational CPU core performance increases. The only viable solution to the performance hit we've all seen is to increase core work units per region.

Regarding making more SL calls async: if you need a return value, async calls would basically require the internal equivalent of a dataserver event to get that output. There is a lot of overhead in a call that needs a dataserver to get its results, and it can greatly complicate a script. If you want to REALLY make SL scripting more efficient, add a lot more utility functions that do things that currently require a lot of code (regexes would be awesome, as would an LSL equivalent of sprintf, a modern menu system to cut down on laggy HUDs, etc., etc. - hundreds of opportunities). And most awesome of all would be the ability to call a function directly in another script synchronously, or access that other script's global variables, without having to resort to slow and messy link messages. A script blocking on a sync call is not necessarily a bad thing - in many cases it's preferred. For tasks where you never care about the return value (it can happen, but not caring is why so many scripts break), moving that function from sync to async makes sense.

One option that REALLY opens up the possibility of reducing script load on a simulator is to open up an API, relax the HTTP throttle, and move a good portion of script processing to outside compute resources. I would certainly move as much as I could out of SL if the throttle didn't make that impossible.

 


6 hours ago, Sharie Criss said:

Reading through server release notes over the years, you will see comments like "Moved BLAH to its own thread because...", which sure makes it sound like the server code is already quite threaded.

I looked through the server release notes for mentions of "thread". At least two things seem to have been moved to a separate thread - object rezzing and parts of region crossings.

Both of those are outside the main loop. The main loop in SL is a game loop:

    do_forever {
        get_viewer_inputs();
        do_physics();
        do_scripts();
        tell_viewers();
    }

There are also lots of bookkeeping tasks in the server - profiles, search, inventory, money, chat, voice, delivery, etc. - which aren't tied to the frame cycle. Those are probably in other threads. (High Fidelity did their bookkeeping on separate machines, to simplify the sim server. In a new design, that makes sense.)

It's the stuff in the game loop that matters for performance. (Mostly. Slow rezzing probably means "the game loop is using all the cycles and the lower priority rezzing thread is being starved.")

The argument for multithreading script execution is that script execution is so variable and unthrottled. Land impact limits keep the number of objects under control. Avatar limits keep the number of avis per sim under control. Nothing keeps the number of scripts per sim under control. The combination of idle scripts using time, no limit on scripts, and single thread script execution is making the sim servers choke.

Quote

Looking at a per-core CPU benchmark graph of common processors over the past 15 years, there was a big upwards trend that has leveled off significantly in the past 5 years.

Oh, yes. Clock speed leveled off about a decade ago, after doubling every few years for several decades. I'm in Silicon Valley, where people are very aware of this. Fancier superscalar CPU designs have helped some, but that's a diminishing returns thing. Single CPU speed is pretty much maxed out now.

But look at cores per chip and their cost.

[Chart: Intel and AMD CPU core prices and market share over time]

Cores just started getting much cheaper. 

7 hours ago, Sharie Criss said:

Bottom line - SL is already multi-threaded. Looking in caves and under rocks, trying to find more opportunities for a threaded task when there are no plans to increase available cores per region is pointless. LL can no longer rely on core performance increases to dig them out of their performance holes due to the more recent flat-lining of those generational CPU core performance increases. The only viable solution to the performance hit we've all seen is to increase core work units per region. 

Yes. SL needs to put more engine behind each sim. But that won't help sims that are in script overload, which now seems to be half of mainland. Those need to be able to use more cores.

I think the biggest near-term win would be to try really hard to get the CPU time usage of idle scripts down to 0. If a script doesn't have an event to process, it shouldn't be looked at at all during the frame. Each idle script uses about 0.004ms per frame (plus or minus about 0.001ms), so somewhere between 4000 and 6000 scripts chokes the sim: at the sim's 45 frames per second there's only about 22ms per frame, and 5000 idle scripts × 0.004ms is 20ms, nearly the whole budget. That's got to stop. Oz Linden admitted at Server User Group that a lot of CPU time is going down the drain that way. I suggested that half of SL's CPU load was idle scripts; Simon Linden thought it was less than that. Everyone who's looked at this agrees it's way too big.

You can see this using "Asset Search" in Firestorm and asking for "Script Info" with a right mouse click.
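
The fix is conceptually simple: run scripts off an event queue instead of polling every script every frame. Here is a rough sketch in C++ - the types and the scheduler are hypothetical, not LL's actual code - in which a script costs nothing per frame unless an event actually arrives for it:

    #include <deque>
    #include <queue>

    struct Event { int type; /* payload... */ };

    struct Script {
        std::deque<Event> mailbox;  // pending events for this script
        bool queued = false;        // already in the run queue?
    };

    // Stand-in for the interpreter: execute one event handler in the script.
    void run_event(Script& /*s*/, const Event& /*e*/) { /* interpreter goes here */ }

    struct ScriptScheduler {
        std::queue<Script*> run_queue;  // only scripts with work to do

        void requeue(Script& s) {
            if (!s.queued) {
                s.queued = true;
                run_queue.push(&s);
            }
        }

        // Delivering an event is the ONLY thing that schedules a script.
        // Idle scripts are never visited, so they cost zero CPU per frame.
        void deliver(Script& s, Event e) {
            s.mailbox.push_back(e);
            requeue(s);
        }

        // Per frame: run one event per scheduled script. A region with 6000
        // idle scripts and 10 busy ones does 10 units of work, not 6010.
        void run_frame() {
            auto n = run_queue.size();
            while (n-- > 0) {
                Script* s = run_queue.front();
                run_queue.pop();
                s->queued = false;
                Event e = s->mailbox.front();
                s->mailbox.pop_front();
                run_event(*s, e);
                if (!s->mailbox.empty()) requeue(*s);  // still has events waiting
            }
        }
    };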

 


 

 


5 hours ago, animats said:

Oh, yes. Clock speed leveled off about a decade ago, after doubling every few years for several decades. 

Oh - it wasn't about clock speed, it's about the number of clock cycles required to perform certain instructions. For example, most modern CPUs have cryptographic instructions that operate on a block of data. If the old CPU design took 50 clock cycles to execute the instruction, chip designers may implement additional parallel pipelines in hardware that allow the operation to execute in only 25 clock cycles. That's the type of improvement that has tapered off. The graphs don't show overall per-core performance changes over time (that's benchmark data, not clock speed), or cost per unit of work over time, which is the relevant data when you look at SL and the work each sim server is being asked to do - what cost savings or performance improvements should have been realized.

FULLY agree regarding sleeping scripts doing nothing but burning CPU - that really does need to be fixed. It reminds me of a discussion I had with the maker of a certain product: when I expressed concern that this simple product could easily exist with only a single script (like all the competition), the creator insisted that the 5 extra scripts weren't costing anything because they were "sleeping."

Script count actually is mostly LL's fault. Because of various limits - memory, throttling, functions that cause the script to sleep, etc. - creators had to get Creative to get around them. We've all seen it: edit some complex item and look at the scripts. Things like permission requests, sit targets, llInstantMessage or llEmail sleeping, and ll functions that only operate on the prim where the script resides all REQUIRE multiple scripts to work around. A prime example is the old Hippo vendor / rental systems: INSANE script counts that were massively duplicated due to the nature of how they were used. In more recent years, VERY little has been done outside of a couple functions like llSetLinkPrimitiveParamsFast, which resolved the bazillion-script resizer issues. The severe memory limits cause projects to be divided up into smaller scripts with a lot of duplicated code, tons of link messages to send state info back and forth, etc., ALL causing needless load on the sim server. Scripts were created over the years that far, far exceed LL's wildest imagination of what people would do with the platform. Unfortunately, LL never kept up with the need, and now the aged script system is biting us all in the butt.

I'm also frustrated about the JIRA issue I created ages ago asking for estate-level script impact limits for avatars, which would have mitigated some ding dong with old attachments containing hundreds of resizer scripts choking the sim for 30 seconds or so when they TP in and out. What's wrong with limiting each avatar to, say, 1ms max of script time? I created that JIRA issue when I was regularly seeing spikes in the estate Top Scripts tool showing avis with 90+ms of script impact (which should be impossible, but - there it was).

 


3 hours ago, Sharie Criss said:

... old attachments containing hundreds of resizer scripts choking the sim for 30 seconds or so when they TP in and out. What's wrong with limiting each avatar to, say, 1ms max of script time?

Reminds me of a tangent. As they rework the script scheduler, one special consideration for LSL is the cost of loading a new script into the simulation, which is very often the largest performance impact a script will ever have, by far. So it's not only the steady-state overhead that's a concern, but also the resources demanded when an avatar-full of attached scripts needs rezzing-in all at once.


7 hours ago, Qie Niangao said:

Reminds me of a tangent. As they rework the script scheduler, one special consideration for LSL is the cost of loading a new script into the simulation, which is very often the largest performance impact a script will ever have, by far. So it's not only the steady-state overhead that's a concern, but also the resources demanded when an avatar-full of attached scripts needs rezzing-in all at once.

Yes yes yes, THIS is exactly the issue, and it's both TP in and TP out, although less so on TP out, as that process is basically: suspend scripts, transfer script state to the new region, remove scripts from the current region. Some venues with script monitors that eject avis with script counts higher than some threshold are actually doing themselves more harm than good. By the time the script monitor even knows the avi is there with excessive scripts, it's too late: the impact of the avi's scripts has already settled down to "minimal" - that mostly-sleeping state.

Again, this is all well-intentioned, it just doesn't work well in practice. If avi script impact was throttled / limited, those sim freezes would all but disappear. People running excessive scripts would just hurt themselves (as in, their scripts would run slow), not everyone else around them.

If we had both avi script impact limits and fixed the idle scripts still burning CPU, server side lag would be a fraction of what it is today and everything would run smoother. Of course, this is easy to say, not so easy to actually accomplish.... But it's needed if SL is to continue to be viable for the foreseeable future. The current situation is so bad that we can't take advantage of all the cool things we can do with animesh, pathfinding, experience keys, etc. A short term solution would be to allow us to buy additional CPU for a region (much like you can buy additional prims.)
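
To illustrate the throttling idea, here is a rough sketch of a per-avatar script-time budget. Everything here is hypothetical (the names, the 1ms figure, the scheduling details); it just shows the shape of it:

    #include <chrono>
    #include <unordered_map>
    #include <vector>

    using Clock  = std::chrono::steady_clock;
    using Micros = std::chrono::microseconds;

    struct AvatarBudget {
        Micros limit{1000};  // e.g. 1ms of script time per frame
        Micros used{0};
    };

    struct Script { int avatar_id = 0; /* script state... */ };

    // Stand-in for the interpreter running the script's next event.
    void run_one_event(Script& /*s*/) { /* interpreter goes here */ }

    // Run ready scripts, charging each one's runtime to its avatar. Once an
    // avatar is over budget, its remaining scripts wait for the next frame -
    // a heavily scripted avi slows itself down, not everyone around it.
    void run_scripts(std::vector<Script*>& ready,
                     std::unordered_map<int, AvatarBudget>& budgets) {
        for (Script* s : ready) {
            AvatarBudget& b = budgets[s->avatar_id];
            if (b.used >= b.limit) continue;  // over budget: deferred
            auto t0 = Clock::now();
            run_one_event(*s);
            b.used += std::chrono::duration_cast<Micros>(Clock::now() - t0);
        }
        for (auto& entry : budgets) entry.second.used = Micros{0};  // per-frame reset
    }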


 

Going from a single-threaded application to a fully scalable multi-threaded one would basically require tossing the existing architecture in the trash and starting from scratch, designing for N cores from the outset.

This is not possible with SL, as you have a linear dependency between the steps in the main "region frame" loop (simplified; no accommodation is made for script run time being limited to the remainder of the total frame time, or for rollover).

To butcher @animats' example:

    do_forever {
        get_viewer_inputs();  // from all N avatars for this frame
        do_physics();         // for all N avatars, depends on viewer inputs
        do_scripts();         // for all objects & N avatars, depends on viewer inputs AND physics
        tell_viewers();       // for N avatars, depends on all of the above
    }

That's not to say each dependency couldn't be split up across cores, but that only goes so far, and the overhead of creating threads and then getting the data back from each before the next dependency could well be slower than just running the whole mess linearly on one core.
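
A sketch of what that intra-stage splitting looks like (hypothetical names, using std::async purely for illustration):

    #include <functional>
    #include <future>
    #include <vector>

    struct Object { /* prim / avatar physics state */ };

    void step_physics(std::vector<Object*>& batch) { /* simulate one batch */ }

    // Fork-join inside a single stage of the loop above. The stage-to-stage
    // dependency chain stays serial; only the work within do_physics() is
    // split. If the batches are small, the spawn/join overhead can exceed
    // the work itself and the "parallel" version comes out slower.
    void do_physics_parallel(std::vector<std::vector<Object*>>& batches) {
        std::vector<std::future<void>> jobs;
        for (auto& b : batches)  // fork: one task per batch
            jobs.push_back(std::async(std::launch::async, step_physics, std::ref(b)));
        for (auto& j : jobs)     // join: all must finish before do_scripts()
            j.get();
    }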

We hit this when experimenting with ways to increase texture decode speeds. It seems ideal for all the cores... in reality, the CPU gets very busy, and the overhead eats the perf boost and then some. End result: a 5-20% lower frame rate / longer total time doing decodes for a frame. The jobs were simply too small.

--

You could redesign a region architecture around individual avatars rather than a set frame or time slice. It would scale to all the cores you could throw at it, but would consume a lot more resources overall (as each avatar would potentially have its own instance of Havok & Mono). The bottleneck would be at all the points where avatar A affects avatar B, causing B to have to wait for A to complete, or A's actions not taking effect until B's following frame. The very possibility of A affecting B adds major synchronization overheads.

Multi-threading is great ... for everything that doesn't depend on or interfere with the operation of the other threads.

Like rendering an image, one core can do one bit while a different one does another. Easy and perfectly suited.

Simulating physics with balls in a box, putting each ball on its own core would be insanely slow.

Edited by CoffeeDujour

4 hours ago, CoffeeDujour said:

Simulating physics with balls in a box, putting each ball on its own core would be insanely slow.

Interestingly enough, multithreaded physics is a standard feature of the Havok physics engine. Objects are divided into groups close enough to be interacting, and the groups are worked on in parallel, up to the limit of the number of scheduled threads.
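
Havok's internals are proprietary, but the grouping idea can be sketched with a union-find over contact pairs - hypothetical structures, not Havok's actual API:

    #include <algorithm>
    #include <numeric>
    #include <utility>
    #include <vector>

    // Union-find: groups bodies connected (directly or transitively) by contacts.
    struct DisjointSet {
        std::vector<int> parent;
        explicit DisjointSet(int n) : parent(n) {
            std::iota(parent.begin(), parent.end(), 0);
        }
        int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
        void unite(int a, int b) { parent[find(a)] = find(b); }
    };

    // contacts: pairs of body indices currently touching or near each other.
    // Returns "islands": no contact crosses an island boundary, so a thread
    // pool can step each island concurrently without locks.
    std::vector<std::vector<int>> build_islands(
            int num_bodies, const std::vector<std::pair<int, int>>& contacts) {
        DisjointSet ds(num_bodies);
        for (const auto& [a, b] : contacts) ds.unite(a, b);

        std::vector<std::vector<int>> islands(num_bodies);
        for (int i = 0; i < num_bodies; ++i) islands[ds.find(i)].push_back(i);
        islands.erase(std::remove_if(islands.begin(), islands.end(),
                          [](const std::vector<int>& v) { return v.empty(); }),
                      islands.end());
        return islands;
    }

A box full of touching balls collapses into a single island and stays serial, which matches the intuition above; spread the objects around the region and the work fans out across cores.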

Quote

We hit this when experimenting with ways to increase texture decode speeds. It seems ideal for all the cores... in reality, the CPU gets very busy, and the overhead eats the perf boost and then some. End result: a 5-20% lower frame rate / longer total time doing decodes for a frame. The jobs were simply too small.

Firestorm does do texture decode in a separate thread. Several separate threads, in fact. I had to find a bug in there once and looked at the code. It's rather clever.

Right now, I think the big performance problem in the script area is the overhead of idle scripts. That, all by itself, seems to be much of the routine overload. See this thread. The really bad thing about overhead from idle scripts is that it never goes away. It's always sucking CPU time. It's not a transient problem.

