What's up with sim crossings?


MBeatrix

3 hours ago, Qie Niangao said:

As I was typing, I was reminded of another challenge in diagnosing race conditions: a few lines of debug code can completely change the triggering circumstances, sometimes completely masking the whole event, or shifting it unrecognizably. They really are the worst bugs.

Yes, I remember spending days on some Fortran90 code with pointers that would not replicate the fault when debugging was enabled. Things like that make you very very superstitious until the reason finally comes to light.


3 hours ago, Solar Legion said:

The difference between clients as described by both Fluf and Alyona is a difference that has several variables which include but are not limited to: Network speeds, internal timeout margins, hand off channels used, some system hardware ....

Precisely. This is why what I described clearly related to ME (common deductive reasoning should dictate that others' mileage will vary). But it never hurts; in fact, I always strongly recommend giving other viewers a try, at the very least.


33 minutes ago, Alyona Su said:

Precisely. This is why what I described clearly related to ME (common deductive reasoning should dictate that others' mileage will vary). But it never hurts; in fact, I always strongly recommend giving other viewers a try, at the very least.

Heh, I know you're aware of the difference - it's quite clear in how you've responded thus far, even before being mentioned.

ETA: I do agree with you, just wanted the above stated.

Edited by Solar Legion

1 hour ago, JoyofRLC Acker said:

RC channels ... to the extent that they are seamlessly attached to the main grid and scattered, as it were, they are 'production' sims for all practical purposes. In most places you can't sail on "just" RC channel or "just" main channel (Blake Sea aside). Code released to production should be tested.

Unfortunately, the RC channel servers are the test bed.

Well, secondary test bed. The Second Life Beta Grid - Aditi - is a fair bit too small for the sort of testing that is needed. Most server code is rolled there first before it even hits any of the RC servers on Agni (the main grid).

Because of the much smaller scale, testing on Aditi only shows the absolute worst of the bugs. Testing on the Agni RC servers catches more of them but rarely catches them all.

How we view these different channels really does amount to nothing: They are marked and treated as test beds and at one time you could find each of these channels running variations on their server code before full release.

Fun tidbit: These channels exist and function partly thanks to the heterogeneous nature of the current Second Life Grid and partly due to not having enough information/a large enough sample size during testing of new server side code.

Put another way, they didn't have enough users or regions on Aditi to even come close to properly testing these roll outs.

Nor do they have the resources to make a full scale duplicate of Agni for internal or beta testing.

There really isn't an easy answer or simple way for them to do it.


44 minutes ago, Solar Legion said:

Heh, I know you're aware of the difference - it's quite clear in how you've responded thus far, even before being mentioned.

ETA: I do agree with you, just wanted the above stated.

Yes, yes - SO many moving parts: how fast is the CPU (and not just clock speed), RAM, bus, hard disk, network card, router - and that's before we even get out of the house onto the Internet LOL. BUT, it's all LL's fault. Never forget that. ;)

Edited by Alyona Su

For the sake of quieting the naysayers out there, I'll put down my findings on different client experiences during the current disconnection problems in more detail. If it helps people who are interested in sim-crossing activities in SL, that's great. Ideally we want the problem fixed though.

Around the beginning of March I decided to complete all the Yavascript pod routes again for reasons not worth explaining here. Using Firestorm (my usual viewer), this was going fine.

The 18th March was the last time to date that I have managed to complete any of the Yavascript pod routes using Firestorm. I didn't try any pod routes on the 19th but I was aware from my own experiences and friends that teleports were a problem that evening (UTC). On the 20th I went back to the pod runs.

All hail the sim crossing disconnection demon. For a while I suspected recent changes in the Firestorm code (I compile from source), but that doesn't seem to be the case. However when I tried CoolVL for a comparison a few days after the 20th, once again I was completing pod routes.

So here's the list again of Pod Routes for the doubters.

Completed routes:
C1, C2/N2, C3, C4-Tea2
H1, H2, H3, H4, H6, H7, H8
G1-North, G1-South, G2, G5, G6
J1, J2, J3, J4, J5, J6, J7
M1, M2, M3
N1
S0, S1, S2, S3, S4, S5, S6, S7, S8
T1, T2
Z1, Z2, Z3, Z4

Incomplete routes:
C4-Tea1
G7
H5
N3

46 routes. Of that list, 8 were completed before March 18th using Firestorm. That leaves 38 routes. 4 still incomplete. For those of you that have never tried to complete a Yavascript pod route, some of those routes cross hundreds of sims and take many hours to complete. The information I'm giving you here is based on literally thousands of sim crossings.

33 of those routes have been completed since the start of the current server disconnection problems (March 20th) using CoolVL. One of them was completed using Alchemy. None of them have been completed using Firestorm. Zero. Nada. Zilch. And not for lack of trying.
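For anyone who wants to cross-check those numbers, here is a quick Python tally. The route names and counts are copied from the lists above; the variable names and the `parse` helper are just mine for illustration.

```python
# Route lists copied from the post; everything else here is illustrative.
completed = """
C1, C2/N2, C3, C4-Tea2
H1, H2, H3, H4, H6, H7, H8
G1-North, G1-South, G2, G5, G6
J1, J2, J3, J4, J5, J6, J7
M1, M2, M3
N1
S0, S1, S2, S3, S4, S5, S6, S7, S8
T1, T2
Z1, Z2, Z3, Z4
"""
incomplete = "C4-Tea1, G7, H5, N3"


def parse(text):
    """Split a comma/newline separated route list into clean names."""
    return [name.strip() for name in text.replace("\n", ",").split(",") if name.strip()]


completed_routes = parse(completed)
incomplete_routes = parse(incomplete)

total = len(completed_routes) + len(incomplete_routes)
before_march_18 = 8  # completed with Firestorm before the problems began
since_problems = len(completed_routes) - before_march_18
by_viewer = {"CoolVL": 33, "Alchemy": 1, "Firestorm": 0}

# The per-viewer counts should add up to everything completed since March 20th.
assert sum(by_viewer.values()) == since_problems

print(f"total routes: {total}")
print(f"completed: {len(completed_routes)}, incomplete: {len(incomplete_routes)}")
print(f"completed since the problems began: {since_problems}")
```

The per-viewer split (33 + 1 + 0) accounts for every route completed since the problems began, and those plus the 4 incomplete routes make up the 38 left over after the first 8.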

During the disconnection problems, I have tried various tricks. I've tried Firestorm with sky rendering turned off and/or Windlight shaders disabled. No big difference. After starting a "Pod run", Firestorm will normally give me a disconnection error message after 15-30mins of sim crossings. For a long time 45mins was the longest it managed. On one evening it managed 1h45mins, sadly still not long enough to complete the pod run I was on at the time. That gives us an idea of the degree of randomness in the disconnection problems. (Yes, I time the pod runs.)

In comparison, CoolVL has managed far in excess of 3 hours of constant sim crossings, completing routes and still allowing a TP to start the next route. It's not immune to the current problems, but they happen much more rarely. Most pod failures are the "old style" problems, where you find yourself standing on a road after a while with no pod under your ass.

I've tried Kokua a few times, and so far its disconnection performance appears roughly equal to Firestorm's.

The current Singularity for Linux build has problems with the CEF (HTML) plugin for me, so I haven't tested it.

I recently tried Alchemy out of interest. It's an old build, but compared to CoolVL it's still a modern v2 viewer, so I just wanted to see if changes since November 2017 (its last release) made any difference. Results are patchy. It can achieve a very long run sometimes, although that may just be freak occurrences. It may be failing at other times simply because it's so out of date (it doesn't understand Animesh, for example). Since the source code seems to be no longer available, and the variance in time before disconnection is so huge, I've given up testing it.

Any other viewers that I haven't mentioned have not been tested.

So if someone on here is telling you that all viewers will suffer the same disconnection rates, that any user analysis of what's going on is worthless, and that we should all just read a book until the Lindens fix it, they really have no idea what they are talking about. It probably is worth trying other viewers, but beware the random run of good luck. If you get good results, check them a few times to make sure it's not just a lucky run, then share them. Cheers.


2 hours ago, Fluf Fredriksson said:

[...]

So if someone on here is telling you that all viewers will suffer the same disconnection rates, that any user analysis of what's going on is worthless, and that we should all just read a book until the Lindens fix it, they really have no idea what they are talking about. It probably is worth trying other viewers, but beware the random run of good luck. If you get good results, check them a few times to make sure it's not just a lucky run, then share them. Cheers.

You sure convinced me.

Also, Linden QA needs an automated Fluf to ride the pods for every deployed version. (I mean after being deployed on Agni, first to RCs and then grid-wide. Of course they'd be using a totally non-standard client, so some issues might still sneak through.) If only they could do it retroactively.

It tells us something, that there's a mutant viewer strain that's somehow surviving the challenge. How do biologists figure out what changed to make one microbe resistant to antibiotics? Maybe this isn't that hard, but it's sure beyond me. We do have the source code, I guess, if somebody's equipped to compare how they handle region change.

(I'm still only assuming that the sim-crossing and teleport problems are caused by the same thing. But I also assumed the choice of viewer had nothing to do with it, and that was wrong, so now I'm second-guessing everything.)


9 hours ago, Fluf Fredriksson said:

Completed Yavascript routes during testing sim disconnection issues:

The Yava pods record their problems on an external server. 

[Image: closelywatchedpods.png]

There's a display board at the Calleta pod terminal which shows the pod system status and the problems it is having. Yavanna herself often shows up at Server User Group on Tuesdays. The Yava pods are closely watched. Their info is especially useful when a sim is so broken that nothing is getting in. That shows up on the big board in Calleta.

Right now, Satellite sim seems to be having a very bad day.

Edited by animats

I can teleport around, without failures, fast enough to "hit the fence."  I teleported 28 times in 11 or 12 minutes, at somewhat erratic intervals due to numerous stale landmarks.  I might assume "the teleport issue" has been rectified.

 


9 hours ago, Ardy Lay said:

I can teleport around, without failures, fast enough to "hit the fence."  I teleported 28 times in 11 or 12 minutes, at somewhat erratic intervals due to numerous stale landmarks.  I might assume "the teleport issue" has been rectified.

I have no idea whether it means anything, but at one user group we were advised that teleporting quickly after arriving would likely not trigger the problem. Still, any hope is better than none.


19 hours ago, MBeatrix said:

It would be possible to test it more or less properly if the new code had been rolled to RC Magnum, at least for me, as there are quite some contiguous Magnum regions I cross daily.

Indeed.  My point really was that LL uses what I call "testing in production", as opposed to a separate test system/environment.

One of the many things about this situation that appal me is that LL was apparently unaware of the problem until some JIRAs were posted - and job 1 had to be to update the system to actually get log data on the problem.


11 hours ago, Ardy Lay said:

I can teleport around, without failures, fast enough to "hit the fence."  I teleported 28 times in 11 or 12 minutes, at somewhat erratic intervals due to numerous stale landmarks.  I might assume "the teleport issue" has been rectified.

 

As of yesterday, in Blake Sea (all Main Channel) there were still significant issues.  Not clear if the situation was any better or not, but clearly not solved.


Since I am a Linux user, there is no supported viewer I can use to confirm bugs before reporting them to the Lindens, but it's interesting that Cool VL Viewer works better than Firestorm.

I usually use Firestorm. They're always asking me if I can replicate the bug using the Linden Lab viewer.

I don't know what the rollout today did. Sim crossing was OK at the weekend, started getting bad yesterday, and is horrible right now.

Most of what I have seen suggests that the teleport and sim-crossing problems are happening on the network connecting Linden Lab servers. It could be my connection that's part of it, but today is the last day I shall be using it. Just when the change takes effect tomorrow, I don't know, but it will be interesting to see what does change, and what doesn't.

I don't expect the internet ping time to change. It was a little bit slower when I used dial-up to Demon. The USA has always been over 150ms away. I tried an early VR demo, and that trans-Atlantic wet-string is a problem.

The lag from scenery loading should drop, but I doubt it will help on the Jeogeot-Sansar channel: all you can see on that is water.

Oh, and Debian Jessie? The support on that ends next summer. I get the feeling that it's getting a little too old to be switching to it.


Well. Only time for 1 test so far since the server roll out, but it's promising. "Cautiously optimistic".

Entire H4 run completed, just over 2 hours online with no disconnect problem using Firestorm (Linux). Previous average disconnect time 20-45mins, with one 1h45m run. So that does look like a very significant improvement. But... Only one test run.

It's possible that's random good luck, but given the results of previous test runs, statistically unlikely. Finally I can go back to using Firestorm. Phew.
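To put a rough number on "statistically unlikely", here is a back-of-envelope sketch in Python. It assumes (and this is purely my assumption, not anything measured) that time-to-disconnect under the old behaviour followed an exponential distribution with a mean somewhere in the 20-45 minute range quoted above. The function name is made up for illustration.

```python
import math


def survival_probability(run_minutes, mean_minutes_to_disconnect):
    """Chance of lasting run_minutes with no disconnect, assuming an
    exponential time-to-disconnect (a modelling assumption, not a measurement)."""
    return math.exp(-run_minutes / mean_minutes_to_disconnect)


run = 120  # "just over 2 hours", rounded down to be generous to the old behaviour

for mean in (20, 45):
    p = survival_probability(run, mean)
    print(f"mean {mean} min -> P(no disconnect in {run} min) = {p:.2%}")
```

Under that assumption a clean 2 hour run would have had somewhere between well under 1% and about 7% chance of happening by luck. That is suggestive, but it is still only one run, so more test runs are needed.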

Sadly some of you still seem to be in disconnect hell, and as usual we have no idea what LL changed. I guess it's possible it has only improved things for some people? Without a bit more information from LL it's impossible to know.

Edit: Oops. Typo!

Edited by Fluf Fredriksson

I just went through the same route as yesterday (Nuggy->Von Luckner->all sims North of Nautilus City->Blake Sea - Crows Nest->Honah Lee Trudeau and back) on a motor boat, again at very high speed, and had no issues at all. The speed was so high that the next sim was only showing on mini map right when I was about to cross into it.

Yesterday, I was silly, and only checked which server channel the regions were on after finishing the testing. So, it happened that only two of those sims had gotten the new server version, but still it all went fine. 😄

We'll see how region crossings and teleporting work in the next few days.

I'm running Firestorm, always.

[EDIT] Some more testing, this time from Nuggy to Ahab's Haunt via Venrigalli, then a bit to NE through the passage at Dooknock, all the way South to Blake Sea - Travertine, East to Blake Sea - Crows Nest, and South again to Honah Lee Trudeau; then back to Nuggy, via Blake Sea - Crows Nest->Blake Sea - Azimuth->>>von Luckner, etc.
Again, no issues at all, and I had someone racing with me from Blake Sea - Jones Locker till Nuggy on my way back home.

Edited by MBeatrix
adding more info

Trying to diagnose this particular bug from the user side seems to be futile. There's a timing-dependent bug somewhere, but it's not visible to the viewer connection. I've done too much work on this, and that can be seen elsewhere in the forums and on the JIRA.

About a year ago, I identified six major bugs with region crossings. Many of those I was able to deal with in vehicle scripts and with viewer mods. The result is that for better modern vehicles using Firestorm, a normal region crossing, even at high speed, is just a short pause. No sudden change in direction, no sinking, no going in the air or underground, no spinning sideways, no avatar being animated out of position, no script errors. My bikes do this, and so do vehicles from at least three other builders. (Most of the tricks I rediscovered. I've found old vehicles which do some of these things, from creators long gone from SL. Some of the techniques were kept secret, and lost. So few vehicles had all of them.)

But none of us have been able to make any solid progress on region crossing failures. Even modifying Firestorm to log object update messages, which I've tried, or looking at the message traffic with Wireshark, which Beq Janus has tried, has not helped. What a failure usually looks like at the message level is that the gaining sim just never starts sending updates for avatar or vehicle, leaving it stuck. I've seen this with no errors of any kind visible to the viewer - no network errors, no out of sequence updates, no HTTP errors retrieving a capability. 

Simon Linden, looking at this from the sim side, has said at Server User Group that sometimes the gaining sim just doesn't start updating the object. That matches what I've seen viewer side. Just silence from the gaining sim. Sometimes we do see things going wrong on the viewer connection, like bogus "kill" messages from the losing sim, but that may just be a symptom, not the underlying problem. Failures can occur with no "wrong" messages transmitted.

All this indicates some sim-to-sim problem. We can't see any of that from the viewer connection, so it's up to Linden Lab.

Messing with timing just moves the problem around. This has to be fixed so that it's not a timing-dependent bug. That's the crucial point. Some previous fixes to network retransmit seem to have moved the problem around without fixing it. From what I'm hearing at Server User Group, the developers finally accept that they have to really fix this to make it timing-independent, not just tweak the timing and change when it occurs.
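To illustrate the difference between tweaking the timing and removing the timing dependence, here is a toy Python model. Everything in it is an assumption for illustration: the exponentially distributed handoff latency, the 200 ms mean, the timeout values, and the function names. It says nothing about how the actual simulator code works.

```python
import random

MEAN_LATENCY_MS = 200.0  # assumed average delay of a handoff message (made up)


def single_shot_failures(timeout_ms, trials, rng):
    """Timing-dependent handoff: one late message means a stuck avatar."""
    return sum(
        rng.expovariate(1 / MEAN_LATENCY_MS) > timeout_ms for _ in range(trials)
    )


def acked_failures(timeout_ms, trials, rng, max_attempts=50):
    """Handoff that is resent until acknowledged: no single deadline decides it."""
    failures = 0
    for _ in range(trials):
        delivered = any(
            rng.expovariate(1 / MEAN_LATENCY_MS) <= timeout_ms
            for _ in range(max_attempts)
        )
        failures += not delivered
    return failures


rng = random.Random(42)  # fixed seed so repeated runs give the same counts
trials = 10_000

short_timeout = single_shot_failures(500, trials, rng)
long_timeout = single_shot_failures(1000, trials, rng)
with_ack = acked_failures(500, trials, rng)

print(f"single shot, 500 ms timeout:  {short_timeout} failures")
print(f"single shot, 1000 ms timeout: {long_timeout} failures")
print(f"resend until acked, 500 ms:   {with_ack} failures")
```

In this toy, doubling the timeout makes failures rarer but never makes them go away, which is the "moving the problem around" effect. Only the variant that keeps resending until it gets an acknowledgement stops depending on any one message arriving in time.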

This is not easy. It's going to take a lot of LL resources. It's not just "fixing a bug". It takes a multi-step approach. First detecting the problem when it happens, then logging information about the problem, then achieving understanding of the problem and documenting it, and finally coming up with a solution.  It seems that LL is, at long last, applying enough resources to do this.

We will see. As users, we need to keep the pressure on, so this gets fixed, not just tweaked again.

As users, we can help by keeping the pressure on so that LL management continues to devote resources to fixing this.


@animats Thanks! Nice post. That's more information in one forum post than we've had out of LL for over a month!

I hope your optimism for a genuine fix is well founded. I have a feeling though that after today's sim roll-outs and tomorrow's (seemingly related) server maintenance, it might be declared "good enough" and back to the EEP roll out.


Well it's pretty much official from me. Whatever fix LL rolled out on the 24th April has fixed sim crossing random disconnects for me 100%. The entire Yavascript S1 route completed in one run using Firestorm on Linux. If you asked me before March 18th if that was possible, I'd have said highly unlikely.

To whichever LL codemonkey(s) that figured out what to roll back / reconfigure, my sincere thank you. I hope they get you another monitor, a nicer keyboard and a more comfy chair. You deserve it.

To LL in general. That was one monumental ***** up. I'm sure it has cost you both revenue and users. You can't afford to keep doing that.

Edited by Fluf Fredriksson

I was a bit grumpy when the latest fix, rolled out grid-wide on Wednesday, seemed to make things worse. Things are looking better now, and my guess is that a restarted region needs time to settle down, and that there's a lot of data still flying around the network. There are other things to it, but it suggests to me that the dream of using the Cloud for Second Life is going to stay a dream.

 

The possibility that a few older creators had fixes for some of the problems, and took them with them when they left the Grid, sounds all too likely. I don't do scripting, but I have seen some pretty secretive behaviour from some creators. It doesn't help when Linden Lab uses so many different names for the same thing.

 

Something to think about: when I was making a very fast run between Jeogeot and Sansara, with an old sculpt-era plane at about 60 m/sec, there were occasional messages on sim crossings, warnings that the HUD was using a lot of texture memory. I'm not sure if an alternative HUD is worthwhile for that plane, so much is controlled from the HUD, but I have seen the same warning on quite ordinary HUD/Vehicle combinations.

Mesh, done right on LODs, can be a much lower load, Display and Download, than Sculpts, and I have seen good updates of some things, but scripts and HUDs seem to be the hidden load on vehicles.

 


7 hours ago, Fluf Fredriksson said:

Well it's pretty much official from me. Whatever fix LL rolled out on the 24th April has fixed sim crossing random disconnects for me 100%. The entire Yavascript S1 route completed in one run using Firestorm on Linux. If you asked me before March 18th if that was possible, I'd have said highly unlikely.

To whichever LL codemonkey(s) that figured out what to roll back / reconfigure, my sincere thank you. I hope they get you another monitor, a nicer keyboard and a more comfy chair. You deserve it.

To LL in general. That was one monumental ***** up. I'm sure it has cost you both revenue and users. You can't afford to keep doing that.

I haven't had any teleport failures since yesterday's rolls, either. Also crossing regions had no issues at all. Still, I remain cautious and will wait a week before I finally shout "yay!" in my mind.

I don't think LL had any significant losses. I'd bet that most of those staying away from SL because of this latest issue are freebie users.

Edited by MBeatrix
corrections

14 hours ago, animats said:

As users, we can help by keeping the pressure on so that LL management continues to devote resources to fixing this.

I've been responsible for fixing some big systems. It's a whole lot easier when everybody on the team has been working on the relevant code for a while. Now would be a very good time to really fix sim handoffs. In fact, the best time in SL's entire history.


4 hours ago, MBeatrix said:

I haven't had any teleport failures since yesterday's rolls

Unfortunately, I have had a couple, but there's been a huge difference.

In the past, when it's been obvious that the TP had frozen, using the Cancel button has returned me to the normal view but the disconnect had not been prevented. Avatar movement forwards or backwards would not work (although, strangely, rotation did) and many seconds later the dreaded "You have been logged out....." or similar message would appear.

This time the Cancel has worked as before but the disconnect has not occurred, and it's been possible to simply retry the TP. It seems that the time-out tolerance has been increased and that's definitely helpful.


50 minutes ago, Odaks said:

Good idea! A bit more than a week might be required though?

 

We'll see. I hope that LL has done more than tweaking the time-out tolerance, as it would be just like applying a band-aid hoping to cure a serious disease with it.
