animats

It really is the number of idle scripts that drags down a sim

[Image: scriptoverhead.png]

Sim overload, the spreadsheet.

Yes, more stuff that we as users shouldn't have to worry about.

I've been suspecting for some time that the overhead of idle scripts drags down sims more than what the scripts are doing. Here's some evidence. Went around Robin Loop in Heterocera, which has a nice variety of land uses, and took a screenshot of the statistics bar in each sim. (The statistics bar is under Advanced->Performance Tools->Statistics Bar in Firestorm. There's a lot of stuff there. Scroll it down for sim stats.) Not much was going on in those sims. Mostly one or two avatars per sim. The most active one had 4. So we're seeing the overhead for a sim just sitting there.

We have Scripts Run, percent. When that's below 100%, the sim gets sluggish for scripted objects. And we have Spare Time, the amount of CPU time unused in this frame. When that's zero, Scripts Run drops below 100%. 

The left graph shows Scripts Run vs. Active Scripts. Below 3000 scripts, everything is just fine. The sim starts to overload somewhere between 4000 and 5000 scripts. You can see Scripts Run tapering off as the number of scripts rises.

The right graph shows, for comparison, Scripts Run vs. Script Events Per Second. That's an indication of how many scripts are doing something, not just idle. That's more scattered. Events per second alone can get quite high without dragging down the sim. It does matter, but seemingly not as much as the number of scripts just sitting there.

These are all "full region" mainland sims, with full compute resources. Homestead and Open Space sims get less CPU, so they need to be counted separately.

Executive summary:

  • Idle script overhead really matters. It's dragging down SL. Needs to be fixed.
  • Below 2000 scripts in a full region, you're good.
  • Above 4000 scripts in a full region, trouble. "Distressed real estate".

 


@animats

Yes I see the sense in this analysis.  However that does not address the issue I am seeing on my homestead. 

Before this phenomenon began last year, we had within 5% of the same number of scripts running on the region, and avatars were running more or less the same number as they do today, and yet script-run was >99%.  We had some 16.00ms spare time, and while I understand that number is shared between however many regions are on that core/server, that suggests that there is some spare capacity.

Now, with little or no script increase, either on-region or on-avatar, we see the same spare time yet routinely <50% script-run.  I see no other significant changes in the stats figures, and no change to allocated memory, so what has happened?  This cannot be the process you describe (though I can see it on some other regions I visit where script numbers are very large).

There must have been some sort of script-run throttle applied, presumably during a change of server version, last year, though nothing was mentioned.

So while your proposition may hold in certain circumstances it is not the phenomenon I see on Woods of Heaven.  That is borne out by Rider Linden's analysis that nothing else seemed amiss. :(

Edited by Aishagain

3 minutes ago, Aishagain said:

@animats

Yes I see the sense in this analysis.  However that does not address the issue I am seeing on my homestead.

Homesteads and open space sims have more sims per CPU and different throttling. Not sure how that works. I'm focusing on mainland sims for now. If someone wants to collect data for some homestead sims, go for it.

LL presumably collects the server statistics we see in the statistics bar. They should be able to run the analysis above grid wide.

This isn't the only performance problem. But it's a big one, and one that ought to be fixable, because no real work is being done when checking idle scripts: "Is there an event for this one? No, skip it".

(It's a data structure design issue for a scaling problem. The kind of question asked when people go for a programming interview at Google.)
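
To make the data structure point concrete, here is a minimal sketch of a scheduler that keeps a queue of scripts with pending events, so idle scripts are never visited at all. This is not how the SL simulator works internally (nobody in this thread knows that); `Script`, `ReadyQueueScheduler`, `post_event` and `run_frame` are hypothetical names made up for illustration.

```python
from collections import deque


class Script:
    """Hypothetical stand-in for one script; not an SL API."""

    def __init__(self, name):
        self.name = name
        self.events = deque()  # events waiting for this script
        self.handled = 0       # how many events have been run

    def run_one_event(self):
        self.events.popleft()
        self.handled += 1


class ReadyQueueScheduler:
    """Visits only scripts that have at least one pending event."""

    def __init__(self):
        self.ready = deque()  # scripts with pending events
        self.queued = set()   # names of scripts already in self.ready

    def post_event(self, script, event):
        script.events.append(event)
        if script.name not in self.queued:
            self.queued.add(script.name)
            self.ready.append(script)

    def run_frame(self):
        """Run one event per ready script; return the number of scripts visited."""
        visited = 0
        for _ in range(len(self.ready)):
            script = self.ready.popleft()
            visited += 1
            script.run_one_event()
            if script.events:
                self.ready.append(script)  # still has work, stays queued
            else:
                self.queued.discard(script.name)
        return visited
```

With 4000 scripts created and events posted to only two of them, `run_frame()` touches two scripts, not 4000. The per-frame cost then depends on how many scripts have work, not on how many exist.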

9 hours ago, animats said:

Homesteads and open space sims have more sims per CPU and different throttling. Not sure how that works.

I'm not sure exactly how it works either, but what is clear is that the total frame time displayed in the stats isn't for the region you are looking at but for all four homesteads sharing a core combined. The detailed list shows the actual frame time used by the single region. That means the frame time used by the other three regions has to show up somewhere for the numbers to add up, and the only place it can be is in the spare time figure.


It would be nice to know what's responsible for the big difference in Scripts Run between Vine and Kama Center, despite having similar script and event counts. Vine is the outlier here; a casual tour of the sim shows an unusual number of critters (both breedables and decorative prim-animated stuff), but no idea if events associated with those scripts are somehow more fraught than other script events.


This is a fascinating topic, with ramifications for SL that, if solvable, would go a long LONG way towards addressing over a decade of server side lag issues.

It's a given that I don't have inside working knowledge of the existing system, so this is theoretical, but I suspect that a fair amount of time is spent for each script looping through all its registered event queues (listens, touch, http, timer, etc.). If this is done in the script VM context, it of course means a context switch, with all that overhead. The fix would probably have to be something like a dirty bit - if any of those queues were touched, THEN execute that loop scan, otherwise skip all further processing - no context switch.
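
The dirty-bit idea can be sketched roughly like this. It is purely illustrative: `DirtyBitScript`, `post` and `poll` are invented names, and the four queue kinds are just the ones listed above, not a claim about what the real script engine tracks.

```python
class DirtyBitScript:
    """Hypothetical script record with one dirty flag covering all its queues."""

    QUEUE_KINDS = ("listen", "touch", "http", "timer")

    def __init__(self):
        self.queues = {kind: [] for kind in self.QUEUE_KINDS}
        self.dirty = False   # set whenever any queue receives an event
        self.full_scans = 0  # counts how often the expensive scan ran

    def post(self, kind, event):
        self.queues[kind].append(event)
        self.dirty = True

    def poll(self):
        """Return pending events; scan the queues only if the flag is set."""
        if not self.dirty:
            return []  # cheap exit: one flag test, no queue scan
        self.full_scans += 1
        pending = []
        for kind in self.QUEUE_KINDS:
            pending.extend(self.queues[kind])
            self.queues[kind].clear()
        self.dirty = False
        return pending
```

Note that the scheduler would still visit every script each frame to test the flag. The saving is only in skipping the per-queue scan, so this is a smaller win than not visiting idle scripts at all.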

It can't be that simple though, or it would have already been done. Hopefully?

4 hours ago, Sharie Criss said:

This is a fascinating topic, with ramifications for SL that, if solvable, would go a long LONG way towards addressing over a decade of server side lag issues.

It's a given that I don't have inside working knowledge of the existing system, so this is theoretical, but I suspect that a fair amount of time is spent for each script looping through all its registered event queues (listens, touch, http, timer, etc.). If this is done in the script VM context, it of course means a context switch, with all that overhead. The fix would probably have to be something like a dirty bit - if any of those queues were touched, THEN execute that loop scan, otherwise skip all further processing - no context switch.

It can't be that simple though, or it would have already been done. Hopefully?

I asked Simon Linden that. He said that the test for "does script need to run" is not done inside the script VM.

For now, perhaps the best we can do is follow a rule of "one script per 10 LI parcel capacity". 2250 scripts per full region. That's conservative, but sims below that limit usually have 100% scripts running. Useful info for landlords. That rule should apply for homestead and open space sims, too; they have less prim capacity and less available CPU time in proportion.
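
As a worked version of that rule of thumb, here is a minimal sketch. The function name `script_budget` is made up, and the 22,500 LI full-region capacity is simply what the 2250 figure above implies at one script per 10 LI.

```python
def script_budget(land_impact_capacity, li_per_script=10):
    """Suggested maximum script count under the one-script-per-10-LI rule."""
    return land_impact_capacity // li_per_script


full_region = script_budget(22500)  # 2250 scripts, the figure quoted above
```

The same function works for a parcel or for a smaller region type: pass in its LI capacity and it scales in proportion.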

That's only a temporary bug workaround. LL needs to fix this so that idle scripts don't drag the whole sim down, and we as users need to keep the pressure on LL to do that.

Edited by animats

If it's not the script vm itself running at every frame (theoretically) there still may not be a dirty bit for the whole script (all registered events) as a quick check, it may just be that the event check loop runs in the main loop rather than inside the vm... IDK....

With the impact of this issue, why is it not THE top priority project for the dev team? What else are they doing that someone thinks is more important than this? Solving your number one performance problem will go a long way towards increasing your user-base.  Give users a good experience for a change.

That sort of script count limit would destroy my venue - which has been around for over 11 years. It's not a realistic option at all, not even as a temporary. It's funny, I'm paying for 30K prims but in reality I can't even use 15K. I'd rather pay for more CPU than prims.


I think this is like the TP-issues, it's not everybody affected, it's not predictable, it can't be demonstrated to put in a Jira, and frankly some of it defies explanation: in my parcel we removed 100+ prims, some of which were scripted, and instead of getting an improvement in script time we got the opposite.

I'm still trying to understand what Animats was showing in the opening post. It's almost as if an idle script, by not pulling events off the stack and dealing with them, goes into a waste-time loop?

7 hours ago, Profaitchikenz Haiku said:

I think this is like the TP-issues, it's not everybody affected, it's not predictable, it can't be demonstrated to put in a Jira, and frankly some of it defies explanation: in my parcel we removed 100+ prims, some of which were scripted, and instead of getting an improvement in script time we got the opposite.

I'm still trying to understand what Animats was showing in the opening post. It's almost as if an idle script, by not pulling events off the stack and dealing with them, goes into a waste-time loop?

The process of checking a script to see if it needs time takes about 0.004ms/frame. You can see this if you look in "Area Search" in Firestorm, select something, and right-click for "Script Info". It's not much, but multiply it by 4000 or so scripts and almost all your script time is used up, doing nothing.
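
A quick sanity check of that multiplication, as a minimal sketch. The function name `idle_overhead_ms` is made up for illustration; the 0.004 ms per script per frame default is the figure quoted above.

```python
def idle_overhead_ms(script_count, per_script_ms=0.004):
    """Script time per frame spent just checking idle scripts, in milliseconds."""
    return script_count * per_script_ms


overhead_4000 = idle_overhead_ms(4000)  # about 16 ms per frame
overhead_2000 = idle_overhead_ms(2000)  # about 8 ms per frame
```

So at around 4000 scripts, roughly 16 ms of every frame goes on checking scripts that have nothing to do.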

We've been able to verify this by creating objects with large numbers of idle scripts in a completely script-free sandbox region. The script time goes up in proportion to the number of scripts. This is totally repeatable and easy to test. That's what got the Lindens' attention.

There are other issues, but this one alone is eating up a lot of script time for no good reason.


@Animats Ok, I see what you've noticed. That's a nifty way of checking, but I don't use Firestorm much, so I'll have to see if Singularity can do the same.

In the parcel yesterday I noticed the script eps figure was lowish, 900 or so, and I wasn't sure whether this meant the scripts that were running were not generating a lot of events, or whether the server was only able to handle this low a number?

What we are doing is really black-box testing. I think I'll come along to the next server user group meeting and ask if we can be given a better explanation of what the stats available to us mean. It's next Tuesday, I believe?

ETA Singularity doesn't offer script info in the area search. Overnight (my night, not SL's) things have changed: scripts run is back up at 90% (it is usually 102%), script events is now 367 eps. It makes me wonder if there is a long delay between scripted objects being deleted inworld and them actually being removed from the server?

Edited by Profaitchikenz Haiku

Quote

The process of checking a script to see if it needs time takes about 0.004ms/frame. You can see this if you look in "Area Search" in Firestorm, select something, and right-click for "Script Info". It's not much, but multiply it by 4000 or so scripts and almost all your script time is used up, doing nothing.

How do you know this 0.004 is true?

In the management console most scripted objects in the sim are shown with 0.001 and even that is probably just a placeholder for a time below that value.

The script time under ctrl-shift-1 is a totally different number than the summary shown in the management console. Like 20ms to 7ms.

So I will not speculate about numbers when I don't have more information.

What is true is that script execution is back to 80-90% in my home sim (an over-scripted environment).

I don't see the problem in the low-impact (0.001/0.002) scripts, but in the many hundreds of beds, furniture, dance balls, plants and garbage that idle 24/7 at 0.005 to 0.050. All that stuff belongs in the 0.001 to 0.004 group when idling.

PS: what I really wonder is why an idling script has a CPU time at all - what's the problem for the script scheduler to jump over scripts that have nothing to do and no events? Must be a weird engine here.

Edited by Nova Convair

5 hours ago, Nova Convair said:

How do you know this 0.004 is true?

From this testing in a sandbox. Controlled tests have been made.

[Image: scripts1000a.png]

Each box has 10 scripts. Each is the default "New Script" script. Each column has 100 scripts. 10 columns, 1000 scripts here. 4.566ms script time / 1000 scripts = 0.0045 ms/script. We've tried with other numbers of scripts, too.

 


Ah interesting, but that was an empty sandbox.

For the 7300 scripts in my home sim that means 32.85 ms - 80 to 90% running - let's say 80% - that's 26.28 ms. There are quite some scripts that use more resources though. 1300 events per second.

That does not fit in the 20ms script time and shows a huge difference to your research. On top of that the management console still sums it up to 7 ms script time - don't see that matching either - I wonder how that is meant. What does that thing add or is it just buggy?

Maybe I add 1000 or 2000 scripts on top of that pile and see how the numbers change.

Looks like the script time an idle script consumes has a range depending on the circumstances and no fixed value.
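
The projection in this post can be reproduced with a short sketch. The function and constant names are made up; the 0.0045 ms figure is the sandbox result quoted earlier in the thread, and applying the Scripts Run percentage as a plain multiplier follows this post's own arithmetic and is an assumption about how the two stats relate.

```python
SANDBOX_MS_PER_IDLE_SCRIPT = 0.0045  # sandbox figure quoted in the thread


def projected_script_time_ms(script_count, fraction_run=1.0,
                             per_script_ms=SANDBOX_MS_PER_IDLE_SCRIPT):
    """Idle-check time per frame if every script cost the sandbox figure."""
    return script_count * per_script_ms * fraction_run


all_running = projected_script_time_ms(7300)        # about 32.85 ms
at_80_percent = projected_script_time_ms(7300, 0.8)  # about 26.28 ms
```

Both results are well above the 20 ms script time and the 7 ms management console total reported here, which is the mismatch this post is pointing at.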

3 hours ago, Nova Convair said:

Ah interesting, but that was an empty sandbox.

For the 7300 scripts in my home sim that means 32.85 ms - 80 to 90% running - let's say 80% - that's 26.28 ms. There are quite some scripts that use more resources though. 1300 events per second.

That does not fit in the 20ms script time and shows a huge difference to your research. On top of that the management console still sums it up to 7 ms script time - don't see that matching either - I wonder how that is meant. What does that thing add or is it just buggy?

Maybe I add 1000 or 2000 scripts on top of that pile and see how the numbers change.

Looks like the script time an idle script consumes has a range depending on the circumstances and no fixed value.

Region name?

And is this a full, homestead, or open space sim?


@animats

I sincerely hope that Nova is not on a homestead or even worse an openspace region.  Neither could support 7300 scripts!  It might be germane to know whether it is a private estate or a Mainland full region though.  If there are differences there it opens another can of worms regarding the sort of server that such full regions inhabit.

Edited by Aishagain

1 hour ago, Aishagain said:

@animats

I sincerely hope that Nova is not on a homestead or even worse an openspace region.  Neither could support 7300 scripts!  It might be germane to know whether it is a private estate or a Mainland full region though.  If there are differences there it opens another can of worms regarding the sort of server that such full regions inhabit.

FWIW, "management console" sounds Estate to me, presumably with the hard-to-decipher Top Scripts.

I am very interested in the scheduler overhead issue. Old timer story: We've known there was more than expected scheduler overhead since long before Mono (so it's kinda fresh, albeit delayed, quantifying it in Mono too), since Lex Neva ran some experiments on an empty sim and found that utterly idle scripts used some script time despite having no eligible event handler at all, only a state_entry. (This was all in the forums back then, several generations of archives ago.)

The thing is, I'm not convinced that there's been such a step-function in script count to account for the fairly recent fairly sudden rise in script lag across the grid. At first I was thinking: maybe... mesh avatars, heads, their HUDs, those are new additions. But it's not always avatar-heavy regions that seem to be suffering more than a year or so ago. Breedables, furniture with bloating pose engines, etc... yeah, but are script counts really going up compared to our past of script-per-prim resizers? Or are scripts getting slower? Is it possible the script scheduler itself actually got slower? Maybe more tests, internal reporting, whatever?

56 minutes ago, Qie Niangao said:

I am very interested in the scheduler overhead issue. ... Maybe more tests, internal reporting, whatever?

One can hope LL has a handle on this. I'm focusing on this right now because 1) it's a big effect, 2) it's easy to measure, and 3) it happens even in sims where nothing scripted is happening.

Personally, I'm annoyed because my animesh NPCs can no longer get anything done in my home sim, Vallone. Pathfinding breaks badly under script overload, because pathfinding has less priority than scripts. Only 5-7% of pathfinding steps are running. I've had my characters get stuck in walls, go off the parcel, go off the sim, and fall down the stairs. With enough recovery code, I can have them slow to a crawl, get out of the problems created by broken LL pathfinding, and eventually get where they are going. But at 1/20 of full speed. It takes them 10 minutes to cross my parcel. This is pathetic.

A week ago, I could have 10 NPCs in the sim, all working at full speed. Then someone added about 1500 scripts in skyboxes. Just prefab houses with lots of stuff in them. Performance dropped way down.

[Image: valerieoverload.png]

Trying very hard and getting there, by taking little steps very slowly. Looks awful. The hover text turns red and the "overload" message appears when pathfinding performance is below 25%, so people don't blame me for LL's problems. This recovery stuff was put there to handle temporary overloads, like many visiting avis with elaborate attachments. Not a permanently slow sim due to idle script overhead. This sucks.

My animesh characters in Bellisseria and Animesh1 are doing fine. It's the sim that's broken.

I just tried flying my helicopter. 2 second delays between pressing a key and the controls responding. Managed to land on the helipad OK, but it took quite a while. Worst it's ever been in over a year.

This probably isn't the only bug in script scheduling, but it's one that's big, identifiable and should be fixable.

3 hours ago, animats said:

One can hope LL has a handle on this.

At the moment it's just getting worse. After the script run on my homestead parcel dropped from 99% to around 50% weeks ago, it has fallen to 25% now.
But surprisingly, everything still seems to work without noticeable lag or other issues.


I made some tests - full estate sim - I'm not the owner but a manager - 2 low scripted avatars on the sim (including me)

1st column - momentary sim status
2nd column - after adding 1000 idle scripts
3rd column - after adding 3000 idle scripts

I linked groups of 100 prims = 1000 scripts and clicked one block - that didn't make any noticeable impact
clicked all 3 blocks quickly - I noticed a small disturbance of the force for a few seconds but barely noticeable. 😎
So 3000 empty touch events are obviously nothing - I didn't count the 3000 "touched" messages I got though, they are cut anyways.

4th column - sim after I removed the idle scripts

[Image: image.png]

"Script time idle load" is the script time of the block of idle scripts (1000 and 3000 scripts) taken from the management console. According to that, an idle script uses 0.0005 script time in this scenario.

Edited by Nova Convair


When we talk of an "idle script" here, are we talking of a script with only one state, maybe a touch event, no sensor, no timer, no listen, no moving start or end? Or does it not even have a touch event?

32 minutes ago, Profaitchikenz Haiku said:

When we talk of an "idle script" here, are we talking of a script with only one state, maybe a touch event, no sensor, no timer, no listen, no moving start or end? Or does it not even have a touch event?

My test idle script is the one generated by the "New Script" button. Someone else tried the even simpler script with no "touch" event, with about the same results.


Ok, I can understand a script with an event requiring processing time because it obviously *could* get a touch. It makes me think that scripts when compiled are simply put in the server as bytecode with no other classification as to their requirements, so every script will then be checked to see if it has had any type of event, even ones it doesn't have a handler for? This would make for simpler setup on the server obviously, at the expense of more run time.

An idea has begun forming: scripts require memory, assume the server does not have limitless RAM, therefore script memory will be paged in and out, so are we seeing the effect of disk-loading? As more scripts are put on a server, more paging will occur, and so the time load you are seeing is more to do with the memory management rather than the event handling? Paging will take up time that would otherwise be available to the scripts for running, so the more paging that has to take place, the less script run time is available?

48 minutes ago, Profaitchikenz Haiku said:

paging

It's the first thing that came to my mind too* -- but Simon Linden says that exhausting script memory and seeing any paging at all is known to be very rare in modern sims. 

This sort of thing would be indicated if there were some very super-linear effect of script count on sim performance: if, after crossing a threshold, suddenly every added script has much worse effect than those before, the difference between all-RAM computation and some paging. At that point, sim performance overall -- indeed, host performance for all sims sharing the I/O bus -- should fall off a cliff. I don't think we're seeing evidence of this, exactly.
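
One way to look for that kind of threshold in measurements is to compare the marginal cost per added script between successive readings. This is only a sketch: `marginal_costs` and `looks_superlinear` are made-up names, and the `jump_factor` of 2.0 is an arbitrary assumption, not a known property of the simulator.

```python
def marginal_costs(samples):
    """samples: list of (script_count, script_time_ms), sorted by script_count.

    Returns the extra milliseconds per added script between each pair of
    neighbouring samples.
    """
    costs = []
    for (n0, t0), (n1, t1) in zip(samples, samples[1:]):
        costs.append((t1 - t0) / (n1 - n0))
    return costs


def looks_superlinear(samples, jump_factor=2.0):
    """True if some step costs more than jump_factor times the step before it."""
    costs = marginal_costs(samples)
    for earlier, later in zip(costs, costs[1:]):
        if earlier > 0 and later > jump_factor * earlier:
            return True
    return False
```

If the per-script cost stays flat as scripts are added, the readings are consistent with plain linear scheduler overhead. A sudden jump in the marginal cost would be the sort of cliff that paging would produce.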

To avoid possible confusion: there is clearly a different threshold on perceived script performance that's crossed when spare time goes to zero and some scripts need to wait their turn; until then, adding scripts has no effect at all on perceived performance. That's all downstream of our hypothetical paging threshold that would affect the time impact of each individual script whether or not it waits to run.

_________________
*This does reveal we are of a certain age to remember times when RAM was dear -- or at least back to when Mono was new and leaked memory like a sieve.

24 minutes ago, Qie Niangao said:

This sort of thing would be indicated if there were some very super-linear effect of script count on sim performance: if, after crossing a threshold, suddenly every added script has much worse effect than those before,

That phrase made me think of the cusp behaviour that Thom used as the basis of his catastrophe theory, you reach a point where not only any linearity has vanished, but you can't even fit a curve using any *usual* candidates, you have to accept that the function breaks at that point, straight-lines to a new point and resumes later. It's also similar to the problems with on/off decisions in a batch of simultaneous equations, you just can't do it smoothly.

Another thing has occurred to me, linking this thread to the one hypothesising fewer servers: if the server has, say, four regions, and something in Region A is causing the choking of scripts, regions B, C and D can request restarts until they are blue in the face, but the problem will only be cleared when something happens to Region A.

I am seeing this problem in my parcel more and more now, and persisting for longer, so my (subjective) experience is that the problem is getting worse. I would hope it doesn't have to get to the point of the teleports and sim-crossings before sufficient attention is given it, but looking at all the other things such as bakes-on-mesh that are being demanded, I am worried that scripters are not at the head of the queue.
