Toysoldier Thor

Increase in Instant SIM LAG & Crashes During Larger Events - Network Source?

106 posts in this topic

Don't know if that's the same problem or not, but it's definitely network-related: over 40,000 pending downloads. And most of the extended frame is spent in Network time. I think it's useful to see these reports, whether they turn out to be the same problem or not.

(Incidentally, the Network detail that you have expanded at the top is actually what the viewer is experiencing, not the conditions at the sim. Viewers have a lot of network traffic from sources other than the sim, and can have quite normal network performance while the sim is going belly-up.)


I was at another event where, once it got to about 40 avatars, the sim tanked and I couldn't move.  People were not crashing because the host of the sim emergency-moved many of his guests to his home sim.  When the count got down to about 20, the sim stabilized again.

I captured the stats while the sim was staling out and, once again, look at the PENDING DOWNLOADS.  I watched the stats as people left, and the Pending Downloads dropped to under 5.

 

oct22-915p-40avatars.JPG

Then at the sim that everyone jumped to there were about 30 avatars.  It was laggy, but the normal kind of laggy.  Look at the Packets in and out and the pending downloads on a sim with about 12 fewer avatars: it was processing over 4 times more packets with only 1 pending download.

oct22-952p-28avatars.JPG


I tried at the Server user group yesterday:


[2012/10/23 12:49] Qie Niangao: Do we know any more about Toysoldier Thor's bug... BUG-355

[2012/10/23 12:50] Qie Niangao: (aka MAINT-1682, about sim-crippling network performance, with moderate-to-high avatar counts)

[2012/10/23 12:51] Simon Linden: I don't have any news on BUG-355

[2012/10/23 12:51] Qie Niangao: okay. reports are that it's still happening, so... probably will keep getting asked about it.

[2012/10/23 12:51] Nal (nalates.urriah): Talk to Nyx, I think we saw that in Barrowdale yesterday.

[2012/10/23 12:52] Nal (nalates.urriah): The region had 0.03 TD and 45 FPS... w/free script time and no PF...

Any progress is behind the MAINT jira wall, so it's hard to judge whether anybody is actively looking at this or not.



Qie Niangao wrote:

I tried at the Server user group yesterday:

[2012/10/23 12:49] Qie Niangao: Do we know any more about Toysoldier Thor's bug... BUG-355

[2012/10/23 12:50] Qie Niangao: (aka MAINT-1682, about sim-crippling network performance, with moderate-to-high avatar counts)

[2012/10/23 12:51] Simon Linden: I don't have any news on BUG-355

[2012/10/23 12:51] Qie Niangao: okay. reports are that it's still happening, so... probably will keep getting asked about it.

[2012/10/23 12:51] Nal (nalates.urriah): Talk to Nyx, I think we saw that in Barrowdale yesterday.

[2012/10/23 12:52] Nal (nalates.urriah): The region had 0.03 TD and 45 FPS... w/free script time and no PF...

Any progress is behind the MAINT jira wall, so it's hard to judge whether anybody is actively looking at this or not.

There has been no progress on this JIRA and, as you can see, nothing mentioned in this thread and nothing said in the user groups about any progress on this bug.  The JIRA's last activity was that it was ACCEPTED by LL shortly after I created it a few weeks ago.  Since the JIRAs are now gagged and hidden from public view, it's just that much easier to let them go stale without any Public Eyes able to watch them.

The evidence is mounting that there is something going wrong with the "NETWORKING" aspect of the sims, or the Debian OS underneath the sims, or their connectivity within the LL DCs.  And in fact this has been a really bad week.  Not sure if it's because of recent sim code upgrades that aggravate this problem or because of unrelated problems with similar symptoms, but the sims have been really unstable this week.

 

Qie, I really wish that you, I, and some others with interest in and knowledge of I.T. could be allowed under the covers at LL's DC.  As a Network Architect, I find it fun and challenging to fix network related problems.  They may seem tough, but compared to many other I.T. problems, network problems are generally easier to isolate and trace to a root cause.

The problem is that LL seems focused on all their recent sim code upgrades and not really interested in fixing old problems.  It's ironic, because the Sim team's stated objective is IMPROVING PERFORMANCE & STABILITY, and if they could find the root cause of why the sim's network traffic goes stale / stupid, they could potentially resolve a lot of the most painful and frustrating lag / crash problems on the grid, which in turn means HAPPIER RESIDENTS!

All we can do is keep slapping the evidence of a network problem in the LL DC in LL's face on this forum until they place some focused effort into diagnosing these symptoms.


Inara Pey reports from the Friday Server user group that this problem is planned to be (partially?) addressed in next week's Magnum RC rollout. Fingers crossed.


Yeah Qie, that is what Simon said at the server meeting on Friday, but the impression I got from the way they were discussing this topic was that they only "hoped" that this bug might be addressed, more as if it will take some luck for it to be fixed.  I really don't think they are focusing on diagnosing the network stale-out problem.

As such, I am writing this post while I am at yet another staled-out sim during a major event.  Notice the network stats again: 3 Meg of unacked data and lots of downloads and uploads pending, and yet the PPS are ridiculously low even with all the data that is pending.

 

CrashedSim-Lexa-Oct27-920p.JPG


I am personally nervous that this event-disrupting bug within the sim code or LL DC network infrastructure (which all but cancelled last night's music event because the sim would not recover in time for the concert to continue) will rear its ugly head today at 12 noon at the event where I am the featured artist.

The Paris Metro Art Gala will have a music artist streaming and is expected to host well over 70 avatars.  A ton of planning and effort has gone into this event by several people.  If what has been happening at several other major inworld events with these NETWORK LAG OUTS happens this afternoon, I am going to be ROYALLY TICKED OFF, LL!

I am crossing my fingers the sim will hold up under the stress and not trigger this bug in the sim code.  I am going to capture the perf stats during the event if I have time.


UPDATE from the major event today and how it went.

LL... you need to start FOCUSING ON THIS BUG!!  

I am Royally Ticked Off that this thread has been going on for over a month.  We have provided LL very clear evidence that there is a bug somewhere in the Network Layer of the LL Sim, or the underlying Server OS, or the LL Data Center network infrastructure.  I have personally brought this issue up at many Friday LL Server/Sim user group meetings.  I have created a formal JIRA on this issue.  Yet there is still no serious effort by LL Staff to focus on diagnosing this SL BUG.

Maybe you Lindens do not understand the magnitude of the impact this bug is having on YOUR CUSTOMERS inworld, since you rarely come inworld to see it.  But let me help you out.  When there is a major event planned at a sim (i.e. a music, club, fashion, or art venue), not only has a lot of planning, time, and effort been put into the event, BUT there is also advertising / promotion revenue at stake and MONEY LOST because of this LL BUG ON THE SIMS.

Today's Paris Metro Art Gala was promoted through countless media avenues.  One promotion of today's event went out to an audience of 80,000 members!  There were big costs put out to hire a top quality singer.  Gifts were made for the guests.

And so what happened today at the event that started at 12 noon SLT?  Well, I got there right at 12 and with not even 20 people on the sim, the lag was intolerable!  I could barely move.  This should not be the case with only 20 avatars!!  By the time the sim got to between 50 and 60 avatars, people started crashing, and on the PERFORMANCE STATS I could see that the LL SIM BUG had started showing its ugly head.  I knew we were in trouble.  The Singer crashed.  I captured the stats before I crashed.  Then the sim itself didn't crash but massively staled out, so that almost everyone crashed out and could not get onto the sim for about 10 minutes.

Then all of a sudden the sim was letting people TP back to it.  The lag was substantially less.  We only got about 30 avatars back since a lot of people gave up on the event and never returned.  WE LOST A LOT OF POTENTIAL SALES of art and the PR for the Paris Metro sim.

Here is the screen capture of the sim just before I crashed.  Notice the arrows.  Notice the common symptom of the network PPS being far lower than it should be for a sim with 55 avatars on it.  Notice the Pending Uploads and the Unacked data.

ParisMetro-1222-oct28-lag.JPG

 

To the LL Team at the Friday Server Sim User Group meeting: I will be doing a lot of talking about this topic, so I sure hope that by Friday your team has done some REAL DIAGNOSTICS on this issue and will have answers.  There is no excuse that you are not aware of the problem, that we haven't pointed fingers at where the issue is, or that there is no JIRA.


A question if I may?

Could some avatars be more susceptible to this than others?  I myself have been having a lot of trouble at one venue with a lot of crashes.  It is the only place I have had any trouble, so last night I was trying to watch my statistics bar while enjoying the event.  I have had as many as 4 crashes in the period of an hour. Last night only one.

As far as I could discern, I was the only one who crashed last night, but I am pretty sure others have been crashing too.  Watching the statistics bar, I did keep seeing the unacked bytes climbing over 600, but the pending downloads stayed very low.

This stuff is a bit over my head.  And I just loved the hostess's response when I asked if anyone else had crashed. "You might want to clear your cache." 


There are certainly many reasons for one user's sessions to crash more than others', and to diagnose that, we'd need to delve into all manner of details of your viewer, machine, and network configuration--which you may have already supplied ad nauseam somewhere else. I suspect, however, that your crashes are unrelated to the sim network problem(s) in this thread. A few hundred unacked bytes aren't a big deal; the problematic conditions are indicated when it's a few hundred thousand or more unacked bytes (among other apparent symptoms with which this thread has been wrestling).

One thing I'm noticing is that Image Time sometimes does and sometimes doesn't climb along with Net Time. I'm not sure this actually means anything; Image Time is all about the network, and I suppose it's just a function of what the sim is trying to do on the (sickened) network that determines how the time is distributed.

I wish I knew the Ops guys' take on this problem.  We've never been able to see inside MAINT jiras, so it's not anything new that we're in the dark on this... but it sure is frustrating.


I completely agree with Qie that there are always situations where an individual avatar's viewer/session could be impacted for specific reasons, but these situations are not related to what this thread is focusing on.

The problem that has become very noticeable and extremely frustrating since Aug/Sep'ish is how a sim goes "STALE" when a large crowd of avatars gathers and streaming also seems to be happening.  Not sure if the streaming is related or if it's just that many of the largest gatherings of avatars on sims are for live music events.  So the streaming might be just a red herring.

But what has been noticed and even recorded in this thread is that this is not your "normal" busy sim lag.  It's not the lag you see and would rightly expect when the Server, Debian OS, and/or sim instance are healthy but simply cannot keep up with the loads being placed upon the sim. 

The stats shown here in this thread over and over prove that when these sims go STALE, there are no visible resources on the performance stats that are overwhelmed (i.e. you are not seeing script time skyrocketing, etc.).

What is noticed every time is that the network throughput seems to be governed or locked down or pinned far below its normal capability to transmit and receive data to and from the viewers.  When this happens, the Pending Downloads and Uploads begin to climb and the unacked data starts climbing to large numbers (into the Megs even).

I have shown stats here where extremely busy sims (60 and 70+ avatars on the sim while streaming) have very little trouble keeping up with the loads on the sim instance, and you can see the network traffic flowing at some very significant PPS. 

Where the sims go STALE you can see that the sims are not able to process this network traffic at even a fraction of what a healthy sim can do.
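The signature described above (low packet rate for the crowd size while pending transfers pile up) could be written down as a rough check. This is only a sketch: the field names and both thresholds are made up for illustration and are not Linden Lab figures. Unacked data is recorded but deliberately left out of the rule, since a healthy sim reported in this thread showed over 2 Meg unacked with nothing pending.

```python
from dataclasses import dataclass


@dataclass
class StatsSnapshot:
    """One reading of the Statistics bar (field names are hypothetical)."""
    avatars: int
    packets_per_sec: float   # packets in + packets out
    pending_downloads: int
    pending_uploads: int
    unacked_bytes: int       # kept for reference, not used in the rule


def looks_stale(snap, min_pps_per_avatar=20.0, max_pending=100):
    """Return True when traffic is low for the crowd size while transfers pile up.

    Both thresholds are guesses for illustration only.
    """
    if snap.avatars == 0:
        return False
    pps_per_avatar = snap.packets_per_sec / snap.avatars
    pending = snap.pending_downloads + snap.pending_uploads
    return pps_per_avatar < min_pps_per_avatar and pending > max_pending


# Roughly the healthy 84-avatar reading described in this thread.
healthy = StatsSnapshot(avatars=84, packets_per_sec=7000, pending_downloads=0,
                        pending_uploads=0, unacked_bytes=2_000_000)
# Made-up values shaped like a stale-out; not taken from any screenshot.
stalled = StatsSnapshot(avatars=55, packets_per_sec=600, pending_downloads=300,
                        pending_uploads=40, unacked_bytes=3_000_000)

print(looks_stale(healthy))  # False
print(looks_stale(stalled))  # True
```

Requiring both conditions together is the point: ordinary busy-sim lag should show high traffic, and a quiet sim should show nothing pending.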

 

I just wish this problem would get some SERIOUS LL attention, as it is impacting very large coordinated inworld events and ruining the fragile revenue/profits that many of these inworld venues, clubs, galleries, etc. try to make simply to survive in SL's declining economy.

 


Some of the disparity between sims with 70 avatars running fine and another sim with 40 avatars experiencing network issues could be related to how busy the other regions sharing the server are. When I check Tyche's Grid Survey I see anywhere from 4 to 7 other regions sharing a server. If a region is unlucky enough to restart onto a server with a couple of busy locations in SL I could see how the server could be stressed and experience issues. A region with 70 avatars sharing a server with only 4 other regions which are mostly vacant during events would probably seem to run very smoothly while a region with 40 avatars at an event sharing a server with 7 other regions four of which also have 30-40 avatars at an  event would probably have a less satisfactory experience. There are many variables involved in the performance of SL for a given region.


Thank you for your reply as well as Toy's.

It is only at one club, only at one place that I am having this problem which is why I asked.  Other than that I remain relatively crash free.

My machine has been delved into rather deeply both here and in a now unviewable JIRA I started way back when.  Mesh enabled viewers run at a crawl for me so 90% of the time I am logged in with the Firestorm Beta.

Perhaps next time I will get to snag some screen shots of the statistics window before I crash for you to see.   It might be self serving in one sense....I get tired of the crashes when I am at my favorite club, but I do want to help also.



Cincia Singh wrote:

Some of the disparity between sims with 70 avatars running fine and another sim with 40 avatars experiencing network issues could be related to how busy the other regions sharing the server are. When I check Tyche's Grid Survey I see anywhere from 4 to 7 other regions sharing a server. If a region is unlucky enough to restart onto a server with a couple of busy locations in SL I could see how the server could be stressed and experience issues. A region with 70 avatars sharing a server with only 4 other regions which are mostly vacant during events would probably seem to run very smoothly while a region with 40 avatars at an event sharing a server with 7 other regions four of which also have 30-40 avatars at an  event would probably have a less satisfactory experience. There are many variables involved in the performance of SL for a given region.

Yes I know exactly what you are saying and that would make sense as a possible theory, but this was discussed as a theory with a possible flaw in it.  We asked the Lindens at a previous Friday's Sim User Group meeting to explain exactly the source of some of the metrics reported on the PERFORMANCE STATISTIC screen that we all can see on our viewers.  The metrics in particular where knowledge of the source is critical would be the PACKETS IN / PACKETS OUT / PENDING DOWN & UPLOADS / UNACKED DATA values being reported.

Since SIMS are nothing more than application instances running on a Debian OS on a physical platform, and assuming that each of these servers has only one production Ethernet interface onto its network, it can be assumed that these metrics are the raw statistics coming off the OS's driver for that interface.  In other words, the stats are not being reported from a MIB or statistic kept by the SIM application about its own use of the OS's network driver.

IF this is true then the theory you suggest might not be a valid factor to explain why the network's Packets IN / OUT appear to stale out even though there is significant data to send and receive (as is shown by the pending down/ups and the substantial increase in unack'ed data).

If another (or other) Application Sims are generating a huge amount of user load on the Debian OS, then we should be seeing a substantial packet in/out traffic that they would be generating from their load on this shared metric.  We are not seeing that.

The only way your theory would be valid is if these other sims on the OS are crippling the OS on other resources so badly that it is indirectly causing the OS's network driver to stale out in some manner.

That being said, only the Linden Engineers know whether our understanding / guesses regarding these metrics and the interplay of the sims on the OS are right.

Is there any Linden reading these threads that could explain very clearly where the actual source of the Performance metrics come from (i.e. SIM or OS)?


Perrie,

How many times a week do you attend live music events where there are often 40, 50, or 70 avatars at the event?

I attend about 2 to 3 a night and I have been doing this for well over a year.  I have a mesh enabled viewer (FS).  I have been to music venue sims that run clean and healthy for some events; then for some reason another event will start and the same sim that was previously running ok stales out, and real quickly.

I do agree that there are some sims that are notorious for being laggy even when things are relatively quiet.  I wish that the owners of these sims would ask LL to move their sim to another physical server, but so far they don't.  I think this would help at least one of the sims I am thinking of; this sim constantly hosts large gatherings and is laggy even with smaller crowds.  BUT I suspect this is more an issue of what was just mentioned: what other sims are sharing the same OS.



Toysoldier Thor wrote:

Perrie,

How many times do you attend live music events a week where there are often 40 50 70 avatars at the event?

I attend about 2 to 3 a night and I have been doing this for well over a year.  I have a mesh enabled viewer (FS).  I have been to music venue sims where they are running clean and healthy for some events then for some reason another event will start and the same sim that was previously running ok - stales out and real quickly.

I do agree that there are some sims that are notorious for being laggy even when things are relatively quiet.  I wish that the owner of these sims would ask LL to move their sim to another physical server but so far they dont.  I think this would help at least one of the sims I am thinking of and this sim constantly host large gatherings and is laggy even at smaller crowds.  BUT I suspect this is more of an issue of what was just mentioned - what other sims are sharing the same OS.

Nowhere near as frequently as you anymore.  So I may hit only one SIM that crowded in a night.  But for over two weeks now, every time I am at the one particular SIM I will crash.  No exceptions to that.  I have not crashed anywhere else.  Somehow I am sensitive to something that is happening at that SIM.  Your thread has been the only plausible cause I have been able to discern.

Like you, I used to hit several live events a night.  I don't anymore.

Thanks.

 


Not sure the Lindens are keeping an eye on the thread, or if all the evidence is even being used by them to help isolate / diagnose the network bug in the SL sims, Debian OS, or within the LL Data Center, but here is yet another example of just how well a sim in a "HEALTHY" state can handle extreme loads placed on it by very large avatar populations.

The Performance Stats snapshot was taken last night on a sim that was hosting a live music event for Halloween.  I won't mention the sim (as I didn't the last one) but if LL staff want to know them, just IM me and I will tell you.

Take note of the stats I highlighted.  84 avatars on the sim.  Network traffic flows at over 7000 packets per second.  And even though the unacked data was over 2 Meg, there were no PENDING DOWN or UP LOADS.  The sim/server is able to sustain network traffic flows high enough to ensure there are no pending loads.

(Compare this with the previous sim, which was unable to generate a fraction of this traffic flow and where the pending down/uploads built up until the sim went STALE.)

 

BoomPony-Oct31-730pm-healthy.JPG


We're only able to attend the server user group as a sort of tag team, and today is Friday, so I'll hand off the relevant snippet from Tuesday's:

 

  • [2012/10/30 12:13] Qie Niangao: Speaking of sim lag, I wonder if we can learn any more about the cause of BUG-355, Toysoldier Thor's bug about anomalous network performance. I gather that Friday it was mentioned as something that may be fixed in an RC this week; not sure how fervently to hope for that.
  • [2012/10/30 12:14] Simon Linden: I would hang onto the RC as a thread of hope, but not feverently
  • [2012/10/30 12:14] Theresa Tennyson: Have you figured out what's happening with that yet?
  • [2012/10/30 12:15] Qie Niangao: hehehe... okay. It would be some relief if we had a bit more handle on what's happening to cause the problem.
  • [2012/10/30 12:15] Simon Linden: I think those reports cover a couple of different issues, Theresa

I'm sure the Linden developers have been encouraged to be circumspect, and it's always risky to read too much between the lines. Nonetheless, they're being extraordinarily tight-lipped about this one. That, plus the very abrupt onset of the network-stall condition, renews my suspicion that there may be a component of this that's intentionally caused by somebody in the sim. Lindens are always wisely silent about exploits, lest they spread across the grid before they can get countermeasures in place.

Or, of course, that may have nothing to do with it.

(I'll just mention again that the top "Network" section of the Statistics bar is about viewer-side traffic. The glimpse we get into the region's health is limited to the categories under the Simulator heading.)

(Also parenthetically, I don't really subscribe to the "shared host" hypothesis as an explanation for this. I mean, it's possible that something happens in the affected sim to push the whole shared host over the edge right around the time a new stream is offered, and that load on other sims sharing the host makes the whole server more vulnerable, but unless this timing is coincidental, the "sharedness" of the host isn't really explaining much. There's widespread suspicion about increased sims-per-server architectures, but personally I'm really skeptical about it causing any observable problem, as I am about Pathfinding-phobia and the fear of idling-when-vacant.) (Or watch as I eat my hat when it turns out that some glitch in the sim-idling code also puts the network interfaces to sleep.)



Toysoldier Thor wrote:

Not sure the Lindens are keeping an eye on the thread and if all the evidence is even being used by them to help isolate / diagnose the network bug in the SL sims, Debian OS or within the LL Data Center.... but here yet another example of just how well a sim in a "HEALTHY" state can handle extreme loads placed on it from very large avatar populations.

The Performance Stats snapshop was taken last night on a sim that was hosting a live music event for Halloween.  I wont mention the sim (as I didnt the last one) but if LL staff want to know them... just IM me and I will tell you.

Take a note of the stats I highlighted.  84 avatars on the sim.  Network traffic flows at over 7000 packets per second.  And even though the unacked data was over 2meg, there was no PENDING DOWN or UP LOADS.  The sim/server is able execute network traffic flows to ensure there are no pending loads.

(and if you look at the previous sim where it was unable to generate a fraction the traffic flow and the pending down/up loads build until the sim goes STALE.

 

BoomPony-Oct31-730pm-healthy.JPG


Thanks.  I haven't been back to the club for a couple of days but next time I have a problem there or I think I may be seeing a problem I'll just try to grab a screenshot of the statistics bar and let you "take it apart."  As I said, I'm just trying to help but I also know I am way out of my league on this stuff.  And sadly, very few people are contributing input here. 

 


There is no "out of league" issue as we all seem to know parts of the whole.  I am doing a lot of educated guessing myself since there is no reference book on the internal workings of LL's SL grid architecture to read up on, and the LL Engineers only provide a VERY VERY shallow exposure of how it works.  So the trick is to gather the small bread crumbs they drop and use whatever tools are available to us to develop a theory.

So the more info we get from sources - the better we can develop, validate, discard theories.  Now you have a good idea of what screen shots we are using.


Hey Qie,

Thanks for the info you gathered at your Server meeting.  I will attend the Friday meeting in a few hours (Tag Team style).

Your theory that LL is staying pretty quiet on this BUG-355 and the instant-lag STALE-OUTS (i.e. they might quietly suspect this is not so much a bug in pathfinding or the general sim/OS code but possibly a new exploit) could be valid.  If this is an exploit, then it's very interesting how the exploit could be degrading the OS to the point that the network throughput stales out. 

You were mentioning the "idling" idea.  That tech is outside my scope of understanding, and I can't say whether it would be a bug issue or an exploit issue.

A couple of people have mentioned to me that this is just a problem that has been around for a long time and is related to LL massively over-subscribing the sims / server ratio, far beyond the ratios they publicly tell everyone.  LL often says it's like 8 sims / server but varies.  One person said he knows for a fact that it's many times greater than that in order for LL to reduce costs (i.e. like 40 sims per server or greater). 

This would be a factor for the common lag we all know and love on SL but based on the stats of the network lag when the sim crashes, this does not appear to be an issue of an over-subscribed OS/Server HW.  If that were the case I would think we would see extremely high PPS in/out and the pending up/downloads would still be increasing.


I don't think the Lindens will be too forthcoming with any further diagnosis of this bug today, but I do want to ask the Lindens to provide more details on the Performance Stats, i.e. are the Sim stats coming from a SIM's perspective or the Debian OS's perspective?

If there is anything else you want to have me ask... post it.



Toysoldier Thor wrote:

A couple people have mentioned to me that this is just a problem that has been around for a long time and related to LL massively over-subscribing the sims / server ratio - far beyond the ratios they publicly tell everyone it is.  LL often says its like 8 sims / server but varies.  One person said he knows for a fact that its many times greater than that in order for LL to reduce costs (i.e. like 40 sims per server or greater). 


I'd be highly skeptical of this higher SIM per server # because LL is still selling a service and they state specific numbers in their official documentation.  While they can change their TOS at any time, until such a change is stated, they are bound to their own terms.

 


I don't have anything specific to ask them, but I do think it's important to keep bringing it up.

If we had a way of reproducing the problem on demand, I'd ask which release channel would be most informative for us to test, but since it seems to happen only at certain venues with certain crowds during certain phases of the moon, I don't know that there's any point in asking that.

Incidentally, I really do not think this has anything to do with sim-idling -- hence the "eat my hat" thing.  Just as background, earlier this year they changed the behavior of sims that were vacant (no agents in the region) to slow down the frame time of the sim, with the objective of letting other sims on the same host get a chance to share the capacity that freed-up. To many, this sounded risky, but it seems to have been a pretty much "no hitches" success, at least as far as I can tell.

As I hinted above, I also don't buy into the whole thing about LL "oversubscribing" sims-per-server. That theory gets trotted out whenever a sim throws a sprocket. ("Gee, why doesn't my sex-bed run smoothly anymore? Must be oversubscribed servers. Nothing to do with all these temp-rezzed physical waves I just set out -- and you'll never convince me otherwise.") It all smacks of the same paranoid superstition that had people obsessing over server class for years after it was irrelevant to all but the most esoteric metrics.


Well Qie... I hope the transcript is posted because 75% of the hour was about the Jira-355 topic. I started by expressing how big an impact these outages are to SL customers that put a ton of time and effort and $ planning and setting up events.

A few good things came out of it that I extracted from the meeting and were important:

 

  • The Packets In / Out (pps measured) are for the SIM, not the OS (from Andrew).
  • I brought up over-subscription or the network staling out because of other sims on the OS, especially in light of the stat being a SIM stat and not an OS stat.  Andrew said that although it is possible, it is not a theory he would strongly believe.
  • LL's sim logs only go back about 24 hours. It used to be 7 days but not anymore. So if we wanted LL to see the logs on the sim that went stale, Andrew would have to hear about it and look within 24 hours.
  • He said that if I run into one this weekend I should IM him and maybe he can catch it.
  • He hoped they could see if there is a network storm of some kind that could be detected.  To me that is a possibility as long as the storm packets do not show up in the stats, which, if they are sim stats, they wouldn't.
  • There was talk about the STREAM being a factor; most of us believe it's not a contributing factor.  I believe it's more that the stream is used by live artists who attract many fans.  It was weird how it was initially happening as soon as the stream switched.
  • I brought up whether it was an exploit or a bug in the OS or the pathfinder.  No strong beliefs were expressed, but I got the impression the Lindens believe it's more likely a bug than an exploit. Just my impression.
  • Others were mentioning other factors for lag (textures, scripts, etc.) but I showed them the stats from the thread indicating that scripts and other factors did not seem to be the cause.  If it was resource loads then the network traffic would have been high and overwhelmed; it clearly is not.

So it was a good discussion on the topic with the Lindens.  IF you see the same sim symptoms, I would suggest you IM Andrew Linden ASAP, BUT have a snapshot of the stats of the sim at the time.  He wants the sim/region, date & time as well.  He can then try to see the logs before they roll off.
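As a small aid for anyone collecting these reports, the pieces Andrew asked for and the roughly 24-hour log window could be kept together like this. It is only a sketch: the class, its field names, the region name, and the screenshot filename are all hypothetical, and times are assumed to all be in the same zone (SLT).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# "About 24 hours" per the meeting notes; the exact cutoff is an assumption.
LOG_RETENTION = timedelta(hours=24)


@dataclass
class StallReport:
    """What to send along with the IM (field names are hypothetical)."""
    region: str
    observed_at: datetime   # when the stats snapshot was taken
    screenshot: str         # filename of the Statistics bar capture

    def time_left(self, now):
        """Time remaining before the sim logs are expected to roll off."""
        remaining = self.observed_at + LOG_RETENTION - now
        return max(remaining, timedelta(0))

    def logs_likely_available(self, now):
        return self.time_left(now) > timedelta(0)


# Placeholder values, not a real incident.
report = StallReport("Example Region", datetime(2012, 11, 3, 12, 22), "stats.jpg")
print(report.logs_likely_available(datetime(2012, 11, 3, 20, 0)))  # True
print(report.logs_likely_available(datetime(2012, 11, 4, 13, 0)))  # False
```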

 

 


No transcript yet, but glad to know that the issue is getting attention.

I'm also glad that the Lindens seem more inclined to think it a bug than an exploit. The fact that they were willing to discuss it at some length is reassuring that way.

The thing is, there's that business of the apparent onset of the problems often coinciding with a stream change. The sim does next to nothing when that happens; it cannot in itself generate enough network load to matter. I'm happier not suspecting somebody triggering the problem to grief the upcoming performer.

So, what else happens when a new performer takes over? Well, it may be that the new performer brings in a lot of new listeners all at once, but I'm thinking now that it may be as much a function of the number of old listeners leaving -- that is, it may be triggered by the turnover of sessions with network traffic to/from the sim, more than the absolute number of agents in the sim.

Now, I don't know much about how sims and viewers communicate, and even less about what the sims do when sessions come and go. I'm wondering if there could be any effect of people logging out while in the sim, and I'm vaguely remembering that some viewer version or other (maybe just the Beta?) has been terminating ungracefully. I also dimly recall that they're migrating more of the viewer traffic from UDP to TCP (but not sure how much of that is sim vs central servers); there'd be more low-level connection clean-up involved with TCP, which might involve some kernel tuning or something. (For that matter, especially for UDP, there's no doubt some application layer comms reliability, but presumably that hasn't changed in years, whatever it may be.)

(Someday I should spend some quality time with netstat while my viewer is running, to see what ports are running what protocols with which servers.)

This is obviously grasping at straws.  I mean, busy stores have a lot of turnover all the time, and don't seem to be complaining about this network problem. For this theory to be plausible, it would have to be all about the burstiness of turnover in venues during "changing of the guard" for performers.
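If anyone wanted to test the turnover idea, one rough way would be to sample the list of agents in the region at regular intervals and count arrivals plus departures between samples, rather than looking only at the head count. The sketch below assumes such samples exist as sets of names; the function names and the sample data are made up.

```python
def turnover(before, after):
    """Return (arrivals, departures) between two sets of agent names."""
    return len(after - before), len(before - after)


def turnover_series(samples):
    """Total arrivals plus departures for each pair of consecutive samples."""
    return [sum(turnover(a, b)) for a, b in zip(samples, samples[1:])]


# Hypothetical samples: the head count is 4 throughout, only the churn differs.
store = [{"a", "b", "c", "d"}, {"b", "c", "d", "e"}, {"c", "d", "e", "f"}]
venue = [{"a", "b", "c", "d"}, {"a", "b", "c", "d"}, {"e", "f", "g", "h"}]

print(turnover_series(store))  # [2, 2]
print(turnover_series(venue))  # [0, 8]
```

A steady store and a venue at "changing of the guard" can have the same agent count while showing very different peaks in this series, which is the burstiness the theory would depend on.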

So... just random musings.
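For the netstat idea mentioned above, a first pass could simply count TCP versus UDP lines in `netstat -an` output. This is a sketch under assumptions: it counts every socket on the machine, not just the viewer's, it assumes a `netstat` binary is on the PATH, and the sample lines are hand-written with documentation-range addresses rather than captured from a real viewer session.

```python
import subprocess
from collections import Counter


def count_protocols(netstat_text):
    """Count socket lines per protocol in `netstat -an` style output."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        proto = fields[0].lower()   # e.g. "tcp", "tcp6", "UDP"
        if proto.startswith("tcp"):
            counts["tcp"] += 1
        elif proto.startswith("udp"):
            counts["udp"] += 1
    return counts


def snapshot():
    """Run netstat locally; assumes a `netstat` binary is on the PATH."""
    result = subprocess.run(["netstat", "-an"], capture_output=True, text=True)
    return count_protocols(result.stdout)


# Hand-written sample lines, not real viewer output.
sample = """Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 192.0.2.10:51000 203.0.113.5:443 ESTABLISHED
udp 0 0 192.0.2.10:52000 203.0.113.6:13000
tcp6 0 0 ::1:631 :::* LISTEN
"""
print(count_protocols(sample))
```

Comparing two calls to `snapshot()`, one taken before starting the viewer and one while it is connected, would give a crude idea of which protocol carries the viewer's connections.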

