
Obscure question: when does the simulator send EstablishAgentCommunication to the viewer?


animats

2 hours ago, Monty Linden said:

This keeps getting worse the more I look.  So *both* viewer and simulator implement a 30-second timeout on these requests.  A guaranteed race condition.  Simulator's spans a different part of the request lifecycle than does the viewer's curl timeout.  More variability.  Simulator can send at least three different responses related to its timeout actions:

  • A 200 status with empty body and no content-type header
  • A 200 status with llsd content-type and a body of:  '<llsd><undef /></llsd>'
  • A 499 status with html content-type and a body I haven't looked at but it's about 400 bytes.

Sharpview uses a 90-second timeout viewer side, so the simulator's server always times out first. Henri has chosen to time out first, viewer side. Both will probably work. The same timeout at both ends, though, seems iffy. I wonder if the long timeout I'm using is why I see skipped ID numbers from the Other Simulator, but not SL. I'll take that up with their devs.

Sharpview processes the 200 status if LLSD is present, stops on a 404 status per documentation, and retries on any other status. I use the Rust crate "ureq" and make routine HTTP requests. I still have a bypass to deal with that 20-year-old SL self-signed SSL root cert, which uses an obsolete encryption system for which there is no Rust support. (SL's ancient root cert expires in 2025; the clock is ticking.) All this seems to work.
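For illustration, a minimal sketch of that timeout configuration, assuming ureq 2.x (the function name is mine, not Sharpview's actual code):

use std::time::Duration;

// Build a ureq agent whose 90-second timeout outlasts the simulator's
// 30-second server-side timeout, so the server side always times out first.
fn make_poll_agent() -> ureq::Agent {
    ureq::AgentBuilder::new()
        .timeout(Duration::from_secs(90)) // overall per-request timeout
        .build()
}

Every poll request then goes through this agent, so there is no libcurl-style implicit retry layer underneath.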

(How is the SL mobile dev team coming along with this? They're re-implementing, not re-using the C++ code, correct? So they get to face all the legacy issues.)

(For less technical readers: SL does well at not crashing, as in "region down". Most of the remaining immersion-breaking problems, from teleport fails to clothing not rezzing, involve the simulator or viewer stalled, waiting for something the other side is never going to send, or that was previously sent and lost. Thus the need for this fussy attention to detail down at the communications protocol level.)


1 hour ago, Monty Linden said:

Well....  you don't really know that unless what you've received matches what was intended to be sent.  And it turns out, they don't necessarily match and retries are part of the problem.  I already know event-get is one mode of TP/RC failure. 

All the TP failure modes I get happen before event polls are even started: the UDP message from the arrival sim just never gets to the viewer, so the latter is "left in the blue" and ends up timing out on the departure sim as well.


  • Lindens
27 minutes ago, Henri Beauchamp said:

All the TP failure modes I get happen before event polls are even started: the UDP message from the arrival sim just never gets to the viewer, so the latter is "left in the blue" and ends up timing out on the departure sim as well.

There are several failure scenarios including one where the TeleportFinish message is queued but the simulator refuses to send it for reasons unknown yet.  A more elaborate scenario is this:

  • [Viewer]  Request timeout fires.  Begins to commit to connection tear down.
  • [Sim]  Outside of Viewer's 'light cone', queues TeleportFinish for delivery, commits to writing events to outstanding request.
  • [Sim]  Having prepared content for outstanding request, declares events delivered and clears queue.
  • [Apache]  Adds more delay and reordering between Sim and Viewer because that's what apache does.
  • [Viewer]  Having committed to tear down, abandons request dropping connection and any TeleportFinish event, re-issues request.
  • [Viewer]  Never sees TeleportFinish, stalls, then disconnects from Sim.

@Monty Linden taking a simple-minded approach to this (the only sort I am capable of): would it not be a fairly straightforward task to insert some sort of abort command in the viewer code that returns the avatar to the departure sim and announces a teleport failure if the destination sim never sends TeleportFinish?

Clearly that is not so, or it would have been done long ago... wouldn't it?


  • Lindens
6 minutes ago, Aishagain said:

Clearly that is not so, or it would have been done long ago... wouldn't it?

Even that requires negotiation after going that far down the TP path.  For myself, I'm looking to turn failure into success rather than into disappointment or a trip to the Bardo.


1 hour ago, Monty Linden said:

There are several failure scenarios including one where the TeleportFinish message is queued but the simulator refuses to send it for reasons unknown yet.  A more elaborate scenario is this:

  • [Viewer]  Request timeout fires.  Begins to commit to connection tear down.
  • [Sim]  Outside of Viewer's 'light cone', queues TeleportFinish for delivery, commits to writing events to outstanding request.
  • [Sim]  Having prepared content for outstanding request, declares events delivered and clears queue.
  • [Apache]  Adds more delay and reordering between Sim and Viewer because that's what apache does.
  • [Viewer]  Having committed to tear down, abandons request dropping connection and any TeleportFinish event, re-issues request.
  • [Viewer]  Never sees TeleportFinish, stalls, then disconnects from Sim.

Yes, this might indeed happen... I will have to try to log one such scenario (I've got nice DEBUG-level messages for event polls and, now, server/viewer messaging)...

The problem here is that we do not have a way for the viewer to acknowledge the server's TeleportFinish message... The latter used to be a "reliable" UDP message (with its own private handler), but got UDPDeprecated, then UDPBlacklisted, in favour of the event poll queue/processing... It was not the wisest move...

A possible workaround would be to allow the viewer to (re)request TeleportFinish; in this case, a simple, short (5 seconds or so) timeout could be implemented viewer side after a TP has started, and if TeleportFinish has not been received when it expires, the viewer would re-request it...

EDIT: I'm also seeing a possible viewer-side workaround for such cases via the implementation of a "teleport window" timer. That timer would be reset each time the viewer starts an event poll request for the agent region. When the user asks for a TP, the timer would be checked, and if less than, say, 2 seconds were left before the timeout fires (since the timeout is set viewer side, at least in my code, for SL, its value is known), the TP request would be delayed until the next poll is started...

Edited by Henri Beauchamp

9 hours ago, Monty Linden said:

Even that requires negotiation after going that far down the TP path.

That's a good argument for making the event polling system reliable. Making it reliable won't break existing behavior. Trying to wrap recovery around lost events gets complicated very fast.

Quote

The event queue was very deliberately designed to implement event compaction/overwrite in place and, at some point in time, it was neither ordered (between events) nor reliable.  Someone disabled that UDP-like behavior at some point and I'm not certain why and if that has contract status at this point.  In fact, it is still unreliable now as we'll drop queue overruns on the floor.  They do not get regenerated unless a higher-level application protocol times out and causes a regeneration from a new query or other re-synchronizing action.

Oh, joy. Most of the events the event poller deals with are not in the "unreliable/OK to miss" category.


23 hours ago, Monty Linden said:

[Viewer]  Request timeout fires.  Begins to commit to connection tear down.

In fact, I could verify today that this scenario cannot happen at all in SL. I instrumented my viewer code with better DEBUG messages and a timer for event poll requests.

Under normal conditions (no network issue, sim server running normally), event polls never time out in the agent region before an event comes in. Even in an empty sim without any neighbour regions, the ParcelProperties message is always transmitted every 60 seconds (and for an agent region with neighbours within draw distance, you also get an EnableSimulator for each neighbour every minute).

Timeouts only occur for neighbour regions, when nothing happens in them, and only after 293.8 seconds.

So, when a user requests a TP, the agent region does not risk seeing the poll request time out just as TeleportFinish arrives, causing the race condition in the HTTP connection tear-down sequence you described.

However, what would happen if, say, a ParcelProperties message (or any other event in the agent region) arrived milliseconds before the user triggered a TP request?... Poll request N finishes with ParcelProperties, the TP request fires, and what if TeleportFinish is sent by the server just before the viewer can initiate poll request N+1 (reminder: llcorehttp uses a thread for HTTP requests)?... Maybe a race condition could happen here (depending on how events are queued server side, and on how Apache delays in connection building and tear-down could lag/race things; this might explain why TeleportFinish is sometimes queued but never sent, maybe?)...

In any case, I would suggest reconsidering the way TeleportFinish is sent to viewers: what about restoring the old reliable UDP path for it?... Or implementing a message for viewers to re-request it when they did not get it "in time"...

Edited by Henri Beauchamp

  • Lindens
1 hour ago, Henri Beauchamp said:

In fact, I could verify today that this scenario cannot happen at all in SL. I instrumented my viewer code with better DEBUG messages and a timer for event poll requests.

You might be monitoring at the wrong level.  Try llcorehttp or wireshark/tcpdump as a second opinion.  The simulator will not keep an event-get request alive longer than 30s.  If you find it is, it is either retries masking the timeout or yet more simulator bugs.  Or both.  There are always simulator bugs.

(Side note:  once 'Agent Communication' is established, events queue regardless of whether or not an event-get is active.  Up to a high-water level.  So it is not required that an event-get be kept active constantly.  It is required that no more than one be active per region - this is not documented well.)


40 minutes ago, Monty Linden said:

You might be monitoring at the wrong level.  Try llcorehttp or wireshark/tcpdump as a second opinion.  The simulator will not keep an event-get request alive longer than 30s.  If you find it is, it is either retries masking the timeout

Yup, you are right... I can see this with Wireshark. The retry is likely done at the libcurl level... More race condition opportunities! 😢

Which only advocates for returning reliable events such as TeleportFinish to the "reliable UDP" path provided by the viewer...


31 minutes ago, Monty Linden said:

Probably my fault.  The policy group for long-poll should probably not attempt retries.  It hides things that should have been brought up to viewer awareness.

The LLAppCoreHttp::AP_LONG_POLL policy group does not define the retry attempts, at least not in my viewer... But explicitly setting mHttpOptions->setRetries(0) causes "502" errors in disguise (502 body, 500 or 499 header) to happen...

However, setting mHttpOptions->setTransferTimeout(25) (25 s timeouts, i.e. below the server timeout) with mHttpOptions->setRetries(0) seems to work just fine: libcurl then times out after 25 s and the viewer fires a new poll, as expected (and no trace of retries in Wireshark)... This would eliminate a possible cause for a race condition.

And I got an idea to avoid TP failures that could result from a race between the processing of a received event, the user triggering a TP at just that moment, the firing of a new poll, and the TeleportFinish transmission. I'll try to set an "in flight" flag on starting the poll request, reset it on request return, and test that flag on TP: when not "in flight", yield to coroutines until the event poll coroutine can fire a new request (setting the flag); the TP would then be fired while the poll request is "stable" and waiting for a server transmission.
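A minimal sketch of that "in flight" gate (the names and the atomic-flag form are illustrative, not the actual viewer code, which would live in the llcorehttp coroutine machinery):

use std::sync::atomic::{AtomicBool, Ordering};

// Set just before the event poll request is posted, cleared when the
// request returns: a TP is only dispatched while a poll is in flight.
static POLL_IN_FLIGHT: AtomicBool = AtomicBool::new(false);

fn on_poll_start() { POLL_IN_FLIGHT.store(true, Ordering::Release); }
fn on_poll_return() { POLL_IN_FLIGHT.store(false, Ordering::Release); }

// TP code path: yield until this returns true, i.e. the poll is stable.
fn poll_is_stable() -> bool {
    POLL_IN_FLIGHT.load(Ordering::Acquire)
}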

Edited by Henri Beauchamp

  • Lindens

My plan for a plan is roughly:

  • Phase 1.  Simulator changes, compatible with viewers, to make it more robust.  Might include temporary change in timeout to 25 or 20 seconds.
  • Phase 2.  Robust event transfer.  Might require viewer changes.

There's a change in here from "at most once" logic (i.e. viewers see an event at most once but possibly never) to "at least once" (an event may be sent several times under the same 'id'/'ack' and the viewer needs to expect that).  Don't know where this best fits in.  I'm hoping P1, but that might break something.
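For illustration, "at least once" delivery would require viewer-side deduplication along these lines (a sketch assuming ids remain monotonically increasing):

// Drop events whose 'id' has already been processed, so that events
// resent under the same id are harmless.
struct EventDeduper {
    last_id: i64,
}

impl EventDeduper {
    fn accept(&mut self, id: i64) -> bool {
        if id <= self.last_id {
            return false; // duplicate resend: already processed, ignore
        }
        self.last_id = id;
        true
    }
}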

Don't know if we can re-enable the old message.  Or do resend asks.  I really want to just make forward fixes and not add patches to the patches.


Plan sounds good. Useful technique from robotics: if you have to put something on a queue and the queue is full, put in a message that says "some messages were dropped". Then, at the receiving end, you can tell the difference between a missing message because of overload, and a missing message because it wasn't sent at all. Those errors have completely different causes, and this tells you where to look.
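A sketch of that technique, with a bounded queue and a sentinel entry (all types and names here are illustrative):

use std::collections::VecDeque;

enum Event {
    Payload(String),
    EventsDropped, // sentinel: at least one event was lost to overflow
}

struct BoundedQueue {
    items: VecDeque<Event>,
    capacity: usize,
}

impl BoundedQueue {
    fn push(&mut self, ev: Event) {
        if self.items.len() < self.capacity {
            self.items.push_back(ev);
        } else if !matches!(self.items.back(), Some(Event::EventsDropped)) {
            // Queue full: overwrite the newest entry with a single marker,
            // so the receiver can tell overload loss from "never sent".
            *self.items.back_mut().unwrap() = Event::EventsDropped;
        }
        // If the marker is already at the tail, further drops are implied
        // by it, and `ev` is simply discarded.
    }
}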


11 hours ago, Monty Linden said:

Might include temporary change in timeout to 25 or 20 seconds.

EEEK!

Don't do that: viewers would see those ugly "502 in disguise" errors, which would be considered poll request failures by the current viewers' code, and only retried a limited number of times!

With the current viewer code and in SL (*), the poll request timeout must occur on the viewer side (yes, even though it is "transparently" retried at the libcurl level: the important point is that the fake 502 error is not seen by the viewer code).

If anything, increasing the server-side timeout from 30 s to 65 s or so (so that a ParcelProperties message would make it through before each request times out) would reduce the opportunities for race conditions.

(*) For OpenSim-compatible viewers, a (true) 502 error test is added, which is treated as a timeout and retried like a viewer-side libcurl timeout; but this test is only performed while connected to OpenSim servers, which do not lie about 502 errors by disguising them as 499 or 500 ones in their header.

Quote

Robust event transfer.  Might require viewer changes.

Pretty please, make it so that these changes remain backward-compatible...

One possible such change would be as follows:

Currently, viewers acknowledge the previous poll event "id" on restarting a request, by setting the "ack" request field equal to the previous result's "id". It means that, for TeleportFinish, the server would normally see the "id" it used to transmit that event coming back immediately in the "ack" field of the request following its receipt by the viewer. If the server does not get it (because it does not get a new request posted by the viewer), then TeleportFinish was not received and should be resent.

To be 100% sure that the request is not just in flight or delayed, the server could send two different commands in a row on TP: TeleportFinish first, then, for example, ParcelProperties, in a different message (different "id"); then, if no "ack" for TeleportFinish has been received, re-issue it.
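For illustration, after receiving an event batch carrying id 28, the viewer's next poll request would acknowledge it with something like this (field names per the existing event-get protocol; the id value is arbitrary):

<llsd><map><key>ack</key><integer>28</integer><key>done</key><boolean>false</boolean></map></llsd>

If that ack never comes back, the server knows the batch (and any TeleportFinish in it) may have been lost.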

Edited by Henri Beauchamp

Success!

I managed to:

  1. Reliably reproduce TP failure modes related to event poll request expiration and restart delays (a race condition with the servers).
  2. Find and implement a robust workaround for those.

The problem seen is indeed due to how a TP request by the user can be sent to the server while the poll request is about to time out, or was just closed and is being restarted as the result of the receipt of an event poll message. If the server queues the TeleportFinish message (or any message, but this one is unique and supposed to be 100% reliable, unlike ParcelProperties & Co) while the viewer is in the process of restarting a poll request, somehow that message will never be received.

To confirm this, I use an LLTimer which is reset just before I post (and yield on) the request in LLEventPoll. I also use a 25 s timeout and no libcurl-level retries for those requests, so that they always time out on the viewer side and that timeout is always seen to happen by the LLEventPoll code. I also implemented a debug display for that timer in the viewer window, so that I can easily trigger a TP manually just before or just after the event poll request has expired or started; doing so, I can reliably reproduce the TP failures that so far seemed to happen "randomly".

As for the workaround, it is implemented in the form of TP queuing governed by a check of the event poll timer window: whenever a TP request is issued 500 ms or less before the agent region poll request would time out, or just after it has been restarted, the TP is queued (via a new LLAgent::TELEPORT_QUEUED state, which allows me to use the existing state machine implemented in llagent.cpp and llviewerdisplay.cpp), and the corresponding UDP message requesting the TP from the server (TeleportLocationRequest, TeleportLandmarkRequest or TeleportLureRequest) is put on hold until the event poll request timer is again within the stable/established connection window, at which point the TP request message is sent. So far (stress-testing still in progress), it works wonders and I no longer experience failed TPs.
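A minimal sketch of that window check (the 500 ms margin and 25 s timeout come from the description above; everything else is illustrative, not the actual Cool VL Viewer code):

use std::time::{Duration, Instant};

enum TeleportState {
    Queued,    // counterpart of the new LLAgent::TELEPORT_QUEUED state
    Requested, // the UDP TP request message can be sent now
}

struct TeleportGate {
    poll_started: Instant,  // reset each time a poll request is posted
    poll_timeout: Duration, // 25 s viewer-side poll timeout
    margin: Duration,       // 500 ms danger window at both ends
}

impl TeleportGate {
    // Queue the TP when the poll was just (re)started or is about to
    // expire; otherwise send the UDP TP request immediately.
    fn request_teleport(&self) -> TeleportState {
        let age = self.poll_started.elapsed();
        if age < self.margin || age + self.margin >= self.poll_timeout {
            TeleportState::Queued
        } else {
            TeleportState::Requested
        }
    }
}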

If everything runs as expected and I am satisfied with the stress-testing, this code will be implemented in the next releases of the Cool VL Viewer.

Edited by Henri Beauchamp

10 hours ago, Henri Beauchamp said:

Don't do that: viewers would see those ugly "502 in disguise" errors, which would be considered poll request failures by the current viewers' code, and only retried a limited number of times!

With the current viewer code and in SL (*), the poll request timeout must occur on the viewer side (yes, even though it is "transparently" retried at the libcurl level: the important point is that the fake 502 error is not seen by the viewer code).

If anything, increasing the server-side timeout from 30 s to 65 s or so (so that a ParcelProperties message would make it through before each request times out) would reduce the opportunities for race conditions.

(*) For OpenSim-compatible viewers, a (true) 502 error test is added, which is treated as a timeout and retried like a viewer-side libcurl timeout; but this test is only performed while connected to OpenSim servers, which do not lie about 502 errors by disguising them as 499 or 500 ones in their header.

That's the reverse of what I'm doing in Sharpview. I let the server side time out.

Timeout server side is well defined - you get a complete HTTP transaction with some status. Timeout viewer side is a TCP connection abort. If that happens while the server is starting to send something, the server probably won't notice. Certainly not if it goes through Apache. So I wanted to time out server side, and use a 90 second timeout. As usual, I start out by doing what the documentation says to do, then look for problems.

Current Sharpview error status handling for the event poller:

  • 200 - normal. A zero-length "events" array in the LLSD is OK. LLSD parsing errors or missing expected fields are logged. IDs must be in sequence; sequence errors are logged, but no action is taken.
  • 404 - end of file. Quit polling.
  • 502 - normal server-side timeout per documentation. Poll again. (The Other Simulator sends this.)
  • Other 5xx errors - log the error; after 3 errors, wait 2 s before polling again, to avoid excessive retries on server failures. Never give up as long as the region is connected.
  • Connection close - log the error, poll again. This seems to be the normal SL behavior on server-side timeout.

The viewer side HTTPS client is the Rust crate "ureq". "Curl" is not used.
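A sketch of that dispatch, assuming ureq 2.x error types (the helper and enum names are mine):

enum PollOutcome {
    Process(String), // 200: parse the LLSD body
    Stop,            // 404: the queue is gone, quit polling
    Retry,           // 502, other 5xx, timeouts, connection close
}

fn classify(result: Result<ureq::Response, ureq::Error>) -> PollOutcome {
    match result {
        Ok(resp) => PollOutcome::Process(resp.into_string().unwrap_or_default()),
        Err(ureq::Error::Status(404, _)) => PollOutcome::Stop,
        // 502 and other 5xx statuses, transport errors, and dropped
        // connections: log (omitted here) and poll again.
        Err(_) => PollOutcome::Retry,
    }
}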

I'd suggest that the server should time out first, and that it should send an empty events array with a normal 200 status; that is,

<llsd><map><key>events</key><array></array><key>id</key><integer>28</integer></map></llsd>

That empty LLSD array

<array></array>

unambiguously says "everything is fine, no new events right now." So no troublesome HTTP error paths are taken, and normal operation stays on the happy path. Existing viewers should accept such a no-events message. This is a once-or-twice-a-minute event, if that, so it's low overhead.


28 minutes ago, animats said:

502 - normal server-side timeout per documentation. Poll again. (The Other Simulator sends this.)

The problem is that you do not get that when logged in to SL: you get a 499 or 500 error header (and "502 error" printed in the body). Meaning, somehow, the 502 error gets mutated into another one, and is then not recognized as such by viewers. That is why you cannot let the server time out when connected to SL (everything works as expected when connected to OpenSim, where I do let the server time out in my code).


27 minutes ago, Henri Beauchamp said:

The problem is that you do not get that when logged in to SL: you get a 499 or 500 error header (and "502 error" printed in the body). Meaning, somehow, the 502 error gets mutated into another one, and is then not recognized as such by viewers.

Right. I never see a legit 502 error from Second Life. I handle timeout as described above, which does seem to work. But I have more control of the protocol stack in Sharpview, since "curl" is not involved.

So, if this is being overhauled, I'd suggest that the sim server side send a 200 (OK) status with the no-events LLSD body given above. That avoids problems with error-case handling in Apache, curl, AWS, VPNs, firewalls, and other not-entirely-transparent middleboxes.


13 hours ago, animats said:

Right. I never see a legit 502 error from Second Life. I handle timeout as described above, which does seem to work. But I have more control of the protocol stack in Sharpview, since "curl" is not involved.

Yes, I tried to set up a larger timeout than the SL servers', and to take into account the bogus "502 in disguise" errors I get, considering them simple poll timeouts. I then observed a strange thing, which can only be explained by weird libcurl internals: the timeouts occur after 61.25 seconds or so (instead of 30.25 seconds or so, which would correspond to the server timeout plus the ping time), and I do see in Wireshark libcurl retrying the connection once on the first server timeout (i.e. after ~30 s), instead of passing the latter to the application as instructed (setRetries(0))!... Maybe it is due to that "502 in disguise" issue (libcurl won't recognize it as a "genuine" timeout and retries once?)...

So in the end, the only way for me to see a genuine timeout occurring is to set the viewer-side timeout below the server one...

 

Also, everyone, you can all stop holding your breath: after stress-testing it (and despite more refinements brought to its code), my workaround does not prove robust enough, and I can still see TP failures happening sometimes (rarer than without it, but happening nonetheless)... I will publish it in the next Cool VL Viewer releases (with debug settings for a kill switch and several knobs to play with, plus that handy poll request age debug display for easy repros of TP failures), but it is not a solution to TP failures, sadly. 😢

So, we will have to wait for Monty to fix the server side of things... 😛

Edited by Henri Beauchamp

  • Lindens
5 hours ago, Henri Beauchamp said:

So, we will have to wait for Monty to fix the server side of things... 😛

This is going to be fun.  One of the recent discoveries:  simulator only uses the supplied 'id' value for logging.  It has no functional effect in the current event queue handling scheme.

I have some more behavioral probing to do, particularly around the bizarre handling of dropped connections by apache, but a proposal is coming.  This is going to require TPV testing against test regions given that things like 'id' will become semantically meaningful again.


20 minutes ago, Monty Linden said:

This is going to be fun.  One of the recent discoveries:  simulator only uses the supplied 'id' value for logging.  It has no functional effect in the current event queue handling scheme.

I'd say I am not surprised, given the absence of retries on the server side when TeleportFinish is not received by viewers...

22 minutes ago, Monty Linden said:

This is going to require TPV testing against test regions given that things like 'id' will become semantically meaningful again.

Count me in among the "guinea pigs". 😜

I'll gladly help testing a fix (or several) for this years-long TP bug that is plaguing SL.


  • Lindens
29 minutes ago, Henri Beauchamp said:

I'll gladly help testing a fix (or several) for this years-long TP bug that is plaguing SL.

Just to keep expectations set correctly:  this isn't the only TP bug.  It's just one of the most glaring.  We won't have 100% TP/RC success after this but it should be better.

