Toysoldier Thor

Increase in Instant SIM LAG & Crashes During Larger Events - Network Source?



Cincia Singh wrote:

If you think there is a server glitch file a support ticket. Otherwise this is just venting and arguing and not really productive.

And how is collating the information and making sure all our ducks are in a row before filing a ticket or a bug report 'counterproductive'?

Last week a friend and I both filed tickets for a SIM.  I filed a ticket for a problem I ran into and they filed a ticket for the problem they ran into.

Support responded to my ticket by restarting the SIM.  A Linden responded to my friend's ticket.

Neither of the problems we reported was fixed, and we both had to re-open our cases.  I collected more information from the forum and wiki and added it to the ticket, so hopefully it will be fixed correctly this time.  All I can figure is that what was obvious to us wasn't obvious to them.

It appears to me that, in a very predictable way, you are the one coming here and provoking arguments.

 

ETA:  As of today, Oct 2, the problems I had reported are now fixed.

 



Cincia Singh wrote:

If you think there is a server glitch file a support ticket. Otherwise this is just venting and arguing and not really productive.

A couple points:

 

  1. As was already mentioned to you: since Rodvik mandated that Resident JIRAs would no longer be publicly viewable by the JIRA community, there is no place other than the forums for SL residents to share potentially larger problems with the SL grid, viewers, or Marketplace.  Sharing commonly experienced problems is a critical way to quickly determine whether there is a more major issue going on or just an isolated event.
  2. This forum thread has already been openly discussed with the LL Server User Group team. The Lindens are fully aware of this thread and are using it to help them determine whether recent code deployments are at fault OR a loophole in the code is allowing serious griefing.

So... I am not "VENTING" - I am trying to take the lead in coordinating the gathering of info from the community so that we can resolve whatever has been hitting the popular venues in SL.   Whatever the cause, through this thread LL has a lot of information / leads to help them isolate and find the root cause - something that cannot easily be done through independent, isolated problem tickets to the LL Support Desk, which only looks at each incident as an isolated problem.

 


Thank you Toy, you have conducted yourself very well in this matter; now if only the Lindens would follow your example and show an equal amount of concern and responsibility.  There are, however, two factors you seem to have overlooked.  First, this problem occurs much less often on the RC servers, which allegedly have the same code at the moment as the main servers, so I am starting to doubt this is a software issue. Second, it is often accompanied by huge amounts of sim ping, which I take to signify an overworked server.  Indeed, in some cases I have seen ping go above 30k on some main channel servers, and again this problem goes away when there is a reset. So if software is involved, maybe we should look at the core servers and not the regional servers, since they saw some serious attention at the start of this "improvement cycle" and some changes made to them then may now be bearing poisoned fruit.  Whatever it is, it is certainly a major problem and seems to be a carry-over from the spam problem corrected with such fanfare earlier.  My guess is the servers have been and are overworked and overloaded, are beginning to show signs of failing, and need to be dealt with at once or things will only get worse.


@Lydia

I won't argue your theory, as you could be right.  I do not know LL's SL systems architecture / model.  As such, I do not know exactly which servers in the environment have had a lot of changes (SW or HW).  I also do not know if the root cause of this new rash of unusual symptoms (aka... the server the sim is on goes "stale", as a Linden support person put it) is a naturally reached trigger or a griefer who discovered a loophole.

I am just trying to bring together enough evidence / facts to help those in LL that can use it to isolate the problem and close the loophole or fix the bug.


I have noticed poor performance lately: lag has increased and random crashes occur more frequently. Most of these problems began when pathfinding was implemented - just another assumed benefit released before the bugs had been killed. But then we are a beta game, if I am not mistaken ))

The crash-on-edit bug seems to have survived earlier pest control measures.


So tonight at a 1 hour concert at a popular artist's gig there were about 40 at the concert and the lag was real heavy (Oct 5th, 9-10pm SLT).

But no one was crashing... then at about 9:35ish several others and I crashed - not all avatars did.  I logged back in at my home, TPed into the sim again, crashed again, logged back in again, and was then able to stay on to record the statistics on the sim that Qie wanted me to capture.

The lag was still HORRID but, as my captures show, it does not seem to be script related.  Below are the images of the stats during the heavy lag  (PS - when these captures were finally taken, the avatar count was only at 28, which is quite low for a live concert in SL)....

 

whiskeyGoGo-SimLoad-9-57PM Oct5 2012-30PPL.JPG

 whiskeyGoGo-SimLoad-9-56PM Oct5 2012-30PPL.JPG


Thanks for this update, Toysoldier.  You're right: this isn't scripts, and it's dramatically different from the performance characteristics of the earlier Mainland griefer attacks. In that small sense, the sim is behaving properly under lag conditions: the scheduler is restricting Script time to a bare minimum.

It's clearly a network-related problem, based on the massively inflated Net and Images times. It's interesting that if one adds those two component times, the sum is larger than the Total Frame time -- whatever that means.

I can't guess what the cause is, really. If some networking gear outside the sim's host server were at fault, one would expect many other sims to be affected exactly simultaneously (unless there were some address-specific routing problem, I suppose). It seems more likely that there's something within the server itself--kernel? NIC-drivers?--that's driving network performance into the ground when under load.

It would not seem to be flaky hardware on one host, because this has been happening on a number of sims.

Because this is presumably not happening to every moderately heavily-loaded sim (right?), one wonders why it happens on those specific ones where it does. For this reason, I still wouldn't completely rule out some kind of griefing, either something originating within the sim (some improperly throttled flood of network demand, either scripted or viewer-driven) or externally (DDOS). I still suspect a bug somewhere in the network stack, but we'd need some reason that it triggers where it does and not everywhere else, too.

This update (well, the top image especially) should be brought to the attention of the Lindens investigating this problem.

It's certainly nothing like normal behavior for a sim under heavy load. Something network related is degrading very ungracefully -- falling off a cliff, more like.


Qie,

I will say that the event I recorded last night was extremely laggy even after previous crashes reduced the avatar count to only 28-30 (the count of avatars when these stats were captured), BUT this event was not identical in nature to the ones we were witnessing last week.

The differences between this event and the others were:

 

  • Last night the sim had ~ 40 avatars on it at the beginning of the event and seemed very stressed for the first 30 minutes - but it was the kind of stress one would feel with a very large avatar count (more like 60 or 70).  i.e. avatars could almost not move but no one was crashing.

    With the other events last week - there were about the same number of avatars on the sim but the lag was very low and reasonable.  AND the lag hit the sim like a shotgun - everyone noticed the moment the sim went into this extreme lag.
  • Last night the sim started crashing many of us (like me) after the event was more than 1/2 over - BUT fewer than 1/2 of us crashed, as the rest were able to survive within the intolerable lag during the entire one hour event.  Also, I was able to re-login quickly and return to the laggy sim (where I ended up crashing again).

    With the other events last week - the lag hit instantly and almost no one on the sim survived; nearly everyone was booted off.  I was lucky to be one of the few witnesses, as only 2 - 7 of us remained logged in on the sim.  Those who did get booted out of SL took a long time to return to the grid, and none of them were allowed to return to the sim for about 10-15 minutes.  They / we were pretty much blocked from returning even though the sim was up and running.
  • Last night - even after the crashes seemed to stop and about 1/2 the avatars remained on the sim - the lag was still extremely heavy.  The images you see are from the sim's stats AFTER all the crashes.  The sim had moments where it was less laggy, but mostly it was what you saw in the stats until the event ended and everyone was leaving.  Even at about 10 avatars the lag was still present but reducing.

    The events last week - after the major event happened and avatars started returning to the sim, the lag was completely gone.  Even though the sim never crashed, it seemed to cleanse itself when it booted everyone off.

I just wanted to point out that I thought the lag was unusually extreme even for a sim with only 28 ppl on it, but it may not be completely related to what was occurring last week.

You understand the metrics of the stats better than I do, but when I was looking at the stats, no metric seemed to be screaming out as a sore spot.  It seemed like a deeper internal resource was crippled.  I can show these metrics to the server sim LL team (aka Simon Linden) next week unless he is still monitoring this thread and sees them here.


If I vaguely understand LL's server model, they use some form of server virtualization.  I don't know if they use something like VMware, where a physical server virtualizes several OSes, or if LL has developed a virtualization model at the application level (i.e. a single instance of the OS on a physical server that then runs 6 or 8 instances of a "SIM" application).

Can you enlighten me how exactly LL virtualizes their sims?

I do recall that LL used to execute 8 sims on a physical server.  Regardless of their virtualization model, if there was a "network resource" related issue on the server, then this extreme lag would only be noticed on those 8 respective sims... and there is a good chance many of the others are lagged but have no one on them to witness it.  (i.e. if a lag falls on a sim and no one was there to notice it - did it really happen ;)  ).

If it is an issue like the one you were speculating about, then it could be triggered by new code on the sim that is causing a new threshold in the kernel to be hit, or maybe it is happening on the new hardware LL said they are migrating to (and a corrupted or incompatible driver that came with the new hardware)..... or or or

Only LL would know


Only LL would know

Yeah. I don't know the details of their virtualization, but yes, there are some common resources across all the sims hosted on the same server. The count of sims per server varies by sim type (Homestead, etc.) and server generation, so basically it's complicated enough, with enough changes since they've said anything public about the architecture, that I have no clear picture of how it works. (I do think you can count on having a CPU core all to yourself on any full-primmed sim.)

You know, speaking of prims... well, objects, really... there was another thing that struck me as odd in the statistics: an object count of 15,329. Now, a full-primmed sim supposedly caps at 15,000 prims (not objects), not counting attachments, temp-rezzed stuff, and things sat-upon and selected in the editor. An avatar can have some fixed number of objects attached (maybe 40 or fewer), so 30 avatars' attachments just aren't going to exceed 1000 objects. So, I mean, it's theoretically possible to get to that 15,329 number, but not by any normal use of sim resources. Something is strange there, but I have no idea what it means.


I've read up again and learned that the "Objects" line in the sim statistics actually refers to prims, not linksets, and does not include those used in attachments, but I'm still not sure what's going on here. The statistics console for that sim shows over 15,000 objects (prims) even now. The sim is using most of the 15K prims available for rezzing on the land: 9702 on one parcel, 4646 on the other, so I'm really not sure whence those other few hundred prims arise. There is some temp-rezzing going on, although it's very small primcount. (Still, I'd dump those old-school temp-rezzed physical waves; it's not going to fix any problems here, but it's just good sim "hygiene". Also, for the benefit of event patrons, personally I'd lose the firetruck; it's generating a lot of object updates, and a room full of avatars generate quite enough updates without other stuff getting pushed out the network to them. But again, this is not even a drop in the bucket compared to the network volumes during the lag incident.) 
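For anyone who wants to check the gap, here is a rough arithmetic sketch in Python using only the figures quoted in this thread. The variable names are illustrative, and it mixes two snapshots (the 15,329 "Objects" figure is from the lag incident, while the parcel counts are from the later normal state), so treat the result as an approximation rather than an exact accounting.

```python
# Figures quoted in the thread; variable names are illustrative only.
PRIM_CAP = 15_000              # nominal prim limit for a full-primmed sim
objects_reported = 15_329      # "Objects" line captured during the lag incident
parcel_prims = [9_702, 4_646]  # prims in use on the sim's two parcels (later reading)

land_prims = sum(parcel_prims)
unexplained = objects_reported - land_prims
over_cap = objects_reported - PRIM_CAP

print(f"Prims accounted for by the parcels: {land_prims}")
print(f"Objects not explained by the parcels: {unexplained}")
print(f"Objects above the nominal cap: {over_cap}")
```

Temp-rezzed items and things being sat upon or selected would have to make up the unexplained remainder, which is why the number looks odd.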

The number of "Active" scripted objects is about the same in this normal state (dilation 0.99, over 4ms Spare time) as in the super-lagged state, but the number of scripts is quite a bit fewer: about 5350, vs 8373. That's a pretty big difference, but I suppose 30 club-goers, some with resize-scripted attachments, might possibly account for 3000 more scripts.
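A quick sketch of what that difference implies per avatar, assuming roughly 30 avatars were present (the captures were taken at 28-30). The names are illustrative and the normal-state count is the approximate figure quoted above.

```python
# Script counts quoted above; names are illustrative.
scripts_lagged = 8_373   # total scripts during the super-lagged state
scripts_normal = 5_350   # "about 5350" in the normal state
avatars_present = 30     # approximate head count (assumption: 28-30 were present)

extra_scripts = scripts_lagged - scripts_normal
per_avatar = extra_scripts / avatars_present

print(f"Extra scripts during the event: {extra_scripts}")
print(f"Implied scripts per avatar: {per_avatar:.0f}")
```

That works out to about a hundred scripts per avatar, which is high but not out of the question for resize-scripted attachments.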

Anyway, again, it's not script execution itself that's causing the problem here, but one could imagine "special" scripts that trigger some massive amount of network activity, without themselves running very hard (nor very often, in a highly lagged sim). Of course, there are other ways the sim could be network-bound, as I mentioned in my earlier post.


I will tell the owner next time I see him about the firetruck.

So that is interesting... if the root cause of the extreme lag was network load within the server or the lab environment, your one suggestion might be a good candidate.... a simple script that either on purpose or inadvertently generates an enormous amount of network traffic without having to work too hard itself.  For example... a script that makes countless no-delay queries for a large amount of data from some internet source... "please send me that 2Meg file.... please send me that 2Meg file... please..."  This would generate very little script load but a lot of network load.  But I am not an LSL script expert, so I don't know if a script can bring that data to the server.  You are the expert on scripts.

If it is a network sourced lag then LL should be able to find this quite easily.

 

 


So just as a comparison of when a very busy sim does well during an event.... tonight I was at two large music events.... one sim had about 65 and the other 77.... at both of them the lag was very acceptable and you could even still move around.  Here is a snapshot of the stats from both these events tonight....

oct6-846pm-77Avis-modLag.JPG

oct6-737pm-66Avis-LowLag.JPG

As you can see, at one of the events they were even having a huge light show - which won't hurt the sim as much as the viewers - but that was not bad either.


This is great information. I meant to check in at a busy club to see what the statistics actually look like when things are working, but I kinda forgot. There's a huge contrast in network-related numbers. As we already observed, the "bad" event had vastly inflated Net Time: 727.326 ms, compared to 14.196 for the "good" event. Drilling down to network-related processing, however, check this out:

Packets In:

  • Bad: 190 pps (packets per second)
  • Good: 1328 pps

Packets Out:

  • Bad: 376 pps
  • Good: 7106 pps

Pending Downloads:

  • Bad: 531
  • Good: 1

Total Unacked Bytes:

  • Bad: 581.2 kb
  • Good: 3182.3 kb

What this shows is that a healthy sim can handle a lot of network traffic.  The "Bad" event was handling much less (like 1/15 as much) network traffic as the "Good" event -- and taking way, way longer to do it. This kind of draws me away from a DDOS or in-sim network flood, and towards either a real problem in the datacenter network (hardware, routing, etc), or a bug in the sim host's network stack (triggered by god knows what) -- at least for that particular "Bad" event.
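The comparison above can be reproduced with a few lines of Python. This is only a sketch: the dictionary layout and the "Net Time per thousand packets per second" index are my own illustrative choices, not anything the statistics bar reports. Net Time is a per-frame figure and the packet counts are per-second rates, so the index is a crude way to compare the two events, not a real per-packet latency.

```python
# Values transcribed from the "Bad" and "Good" snapshots discussed above.
# The dictionary layout and the derived index are illustrative assumptions.
bad = {"net_time_ms": 727.326, "packets_in": 190, "packets_out": 376}
good = {"net_time_ms": 14.196, "packets_in": 1328, "packets_out": 7106}

def total_pps(stats):
    """Combined inbound and outbound packets per second."""
    return stats["packets_in"] + stats["packets_out"]

def net_ms_per_kpps(stats):
    """Net Time (ms per frame) per thousand packets/s: a crude cost index."""
    return stats["net_time_ms"] / (total_pps(stats) / 1000)

traffic_ratio = total_pps(good) / total_pps(bad)
cost_ratio = net_ms_per_kpps(bad) / net_ms_per_kpps(good)

print(f"Good event carried about {traffic_ratio:.1f}x the packet rate")
print(f"Bad event spent about {cost_ratio:.0f}x more Net Time per unit of traffic")
```

The traffic ratio comes out close to the "1/15" estimate, and the cost index makes the same point from the other direction: the lagged sim was spending hundreds of times more Net Time for each unit of traffic it actually moved.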


Of the numbers you pointed out, the last two stats would make a network guy like me take notice...

Pending Downloads

  • Bad: 531
  • Good: 1

Total Unacked Bytes:

  • Bad: 581.2 kb
  • Good: 3182.3 kb

Again, I don't know the exact source of these metrics (i.e. exactly what resource is reporting these network metrics - the sim app instance?  the OS of the physical server?), but if we assume the metrics are from the OS/kernel or whatever their VM is on the physical server, these two metrics are saying that the server is not able to keep up with servicing network requests and as a result the sim is being forced to hold back communications with its clients.

The BAD sim has much less actual traffic to service yet it is majorly overwhelmed at a network level - the two good healthy sims are processing far greater network traffic loads and not being overwhelmed.  Something here is WRONG.

If this is true, a network stat that we cannot see, but that would be very interesting to see, is the server's "Session Count".  If a max has been reached and/or if communications are UDP, another symptom would be that clients get dropped.  That could explain the sudden drop of all the avatars on the sim.  With a lot of the avatars dropped, maybe whatever is overwhelming the network layer of the sim gets enough relief that it eventually catches up and lets ppl back on the sim.

LL Engineers need to focus on the root cause of what is making the network layer of a busy sim go stale or become easily overwhelmed.  This could be the root cause of many sim lag issues.

 


Qie...

Notice posting #13 of this thread - added info to the theory of something specific in the LL DC ... the hardware or different kernel....

One of the GOOD sims (the one with 65 avis on it) was one of those that was REAL BAD with crashing until a couple of weeks ago, when the sim owner opened a support ticket and LL resolved the problem by switching something - the poster suspects they moved the sim to different hardware.

Ever since they moved the sim, the massive lag issues have gone away, and we can all see how well it takes a large load.  Specifically, the network stats are very healthy.

LL staff can find this ticket to check the details of what they left behind and check the Network layer / driver differences between the two.

 


Thanks Toy, and to the rest who confirm that others are experiencing this unusual increase in SIM lag during large events. As a venue owner, I find it has become almost impossible to hold an event.

 

We have been experiencing the exact same issue during our shows over the past several weeks... for no apparent reason.  During a performance I keep watch on my sim stats and have noticed too that at some point... even with as few as 20 to 45 avatars... with script counts and running times well within the range that would not normally cause a problem... the sim suddenly drops in performance to where the Sim FPS & Physics FPS drop to almost nothing (as the one graphic above of the statistic bars shows).  Not always, but sometimes this can happen when guests are tp'ing on top of one another... but unlike in the past... once all have arrived and rezzed... the sim just doesn't recover... everyone starts crashing... then the sim goes down... and I end up having to submit a ticket, as I can't get into that sim to do a restart on my own.

All sims in my island estate are on the Second Life Main Servers.  Before giving up on doing what I can to support live music in SL, I keep hoping that the 'next' server rollout will at least take care of that problem... instead of creating more problems :/ I've done what I can with submitting tickets... just waiting.

'Normal' SIM FPS & PHYSICS FPS run between 44.5 & 45 on each of the sims.

Calas Galadhon Park sims/owner

OZ Nightclub, Glass Pavilion, The Dolphin Cafe owner

 


I have officially created BUG-355 with LL and pointed them to this publicly viewable thread:

https://jira.secondlife.com/browse/BUG-355

(Due to recent LL policy changes, SL Residents cannot see any JIRAs created by another resident.  So I have filed the formal bug, which basically tells LL to come to this thread to get all the details collected by the resident community.)

I have asked that LL look into a possible network-related bug in their SIM code, in the old/new physical hardware or its kernel (network drivers / transport), or an LL DC routing issue.

I am hoping Simon or Andrew Linden will pick this up and focus investigation on this bug.


Thanks for the reports, Toy ... there's some good information there that seems to show the simulator getting into trouble when there are network problems.


Forgive my non-techie status.

 

I wanted to contribute to this because in my 13-sim RP estate (private land), we have been having the same issue. The difference is that many of you seem to have problems with many avatars, in clubs (with possibly more scripty stuff), and sometimes with stream changes. This has happened on many of my sims with just a few avatars, very low script time, and no stream changes. The sim stats (dilation and fps) either fluctuate or tank completely. Sometimes there are associated crashes - sometimes not.

 

Also, I wanted to ask all of you if this seems a lot like the kernel bug referred to as "time warp" and if not, could you explain any differences? To me, it looks exactly the same.


@sinshine,

Next time one of your sims encounters one of these major lag events.... take a snapshot of the Advanced Performance screen and post it here.  The important stats to look at are related to the NETWORK stats. 

What has been initially noticed (and we would like more evidence of it from stale sims) is that some event triggers the sim, the server kernel's network driver, or something else related to the network to stop processing network packets at a normal flow.  Even at a low avi count and with other metrics looking normal, the network stats show unusually high pending downloads and a lot of un-acked data for the small amount of traffic actually being handled.

But we need more examples (snapshots of the perf stats) of a sim when it has fallen into this state.  Also make sure you mention the date/time and the name of the sim.
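If it helps to keep reports consistent, the details asked for above could be written down in a fixed shape alongside the snapshot. The following is only a sketch: the class and field names are hypothetical, not any LL format, and the example values are the "Bad" event numbers quoted earlier in this thread with a placeholder sim name and timestamp.

```python
from dataclasses import dataclass

@dataclass
class LagSnapshot:
    """One observation of a sim's statistics; all field names are hypothetical."""
    sim_name: str
    captured_at: str        # date/time exactly as the reporter noted it
    avatar_count: int
    net_time_ms: float
    pending_downloads: int
    unacked_kb: float

    def summary(self) -> str:
        """Render the snapshot as one line suitable for pasting into a post."""
        return (f"{self.sim_name} @ {self.captured_at}: "
                f"{self.avatar_count} avatars, Net Time {self.net_time_ms} ms, "
                f"{self.pending_downloads} pending downloads, "
                f"{self.unacked_kb} kb unacked")

# Placeholder name and timestamp; numeric values are from the "Bad" event above.
example = LagSnapshot("Example Sim", "YYYY-MM-DD HH:MM SLT", 28, 727.326, 531, 581.2)
print(example.summary())
```

A one-line summary like this next to each screenshot would make it easier to line up incidents from different sims.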


So, just to mention that there were two more large events on two sims where - as we are calling it - the sim went stale during the event and several avatars were booted out of SL while the sim did not crash.  One event went badly on Saturday, but the sim owner didn't capture the stats during the crash.  I informed her what to capture if it happens again.

The other was last night at a busy event at a sim.  The sim owner doesn't want me to mention his sim since he feels that mentioning it might scare away his visitors (as if music loving SL residents are not fully aware of laggy crashy sims and would not show up to a live event simply because of what was posted in an obscure SL forum).  shrugs...  regardless... the events are still happening.


help island public 11:07 12-10-18.gif

This was Help Island Public on 2012-10-18 at 11:07AM. It's the second time I have seen such enormous lag. The last time was less than a week ago, and it was the same kind of lag with the huge net time. I hope this is of some help.

