
Obscure question: when does the simulator send EstablishAgentCommunication to the viewer?


animats


1 hour ago, Monty Linden said:

This is going to be fun.  One of the recent discoveries:  simulator only uses the supplied 'id' value for logging.  It has no functional effect in the current event queue handling scheme.

I kind of suspected that, since my first implementation simply sent the string "TEST" instead of the proper LLSD poll request. It still worked.

1 hour ago, Monty Linden said:

This is going to require TPV testing against test regions given that things like 'id' will become semantically meaningful again.

Willing to help there.

36 minutes ago, Monty Linden said:

Just to keep expectations set correctly:  this isn't the only TP bug.  It's just one of the most glaring.  We won't have 100% TP/RC success after this but it should be better.

That's OK. Having a sound communications protocol underneath will reduce the noise level, and make it easier to find the remaining intermittent bugs.

I'd like to plug again for the idea that, in the unusual event the protocol has to drop something due to overload, the server puts something in the event stream telling the viewer that data was lost. Then we can distinguish between overload (as in, we need more capacity somewhere for busy event regions, or something has to be done about the disco with all the material-changing floor tiles) and missing messages (as in, the viewer has been waiting for EstablishAgentCommunication for 10 seconds now). Something similar in the UDP message protocol, should a reliable message be lost, would also be helpful. Silently losing important messages has made debugging difficult.
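
For illustration, something like this would be enough viewer side. This is a Python sketch of the idea only, not viewer or simulator code; the events_dropped and seq fields are made-up names, nothing in the current protocol:

```python
# Hypothetical sketch: how a viewer could tell "overloaded, events dropped"
# apart from "message silently lost". Field names (events_dropped, seq) are
# assumptions, not part of the current event-queue protocol.

def check_poll_reply(reply: dict, last_seq: int | None) -> tuple[int | None, bool]:
    """Return (new_seq, data_was_lost)."""
    lost = False

    # Explicit indicator from the server: it had to shed events under load.
    if reply.get("events_dropped", 0) > 0:
        print(f"server dropped {reply['events_dropped']} events (overload)")
        lost = True

    # Alternative: a per-reply sequence number; a gap means something vanished.
    seq = reply.get("seq")
    if seq is not None and last_seq is not None and seq != last_seq + 1:
        print(f"sequence gap: expected {last_seq + 1}, got {seq}")
        lost = True

    return seq, lost


# Example: the second reply skips seq 2 and reports drops, so both checks fire.
replies = [
    {"seq": 1, "events_dropped": 0, "events": [{"message": "ParcelProperties"}]},
    {"seq": 3, "events_dropped": 2, "events": []},
]
last = None
for r in replies:
    last, lost = check_poll_reply(r, last)
```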

This is real progress.


  • Lindens
37 minutes ago, animats said:

I'd like to plug again for the idea that, in the unusual event the protocol has to drop something due to overload, put something in the event stream that tells the viewer that data was lost.

I want to do enough eventually to detect and log that your peer is doing something unexpected.  Might just be an 'event lost' flag or a sequence number scheme.  Not certain it will go as far as a digest of failure.  Open to suggestions.

Simulator is subject to amnesia where it necessarily tosses connection information away when an agent departs.  When the agent comes back, simulator starts over.  Viewer, on the other hand, may keep simulator information around for the life of the session.  The resetting needs to be treated correctly.


3 hours ago, Monty Linden said:

Viewer, on the other hand, may keep simulator information around for the life of the session.

Not for the full session, no, but typically for up to two minutes or so after the "departure" (actually after the region is out of draw distance): the event poll is terminated when it times out after the departure, then the region itself is removed from memory within one minute, the last thing to go being the UDP data circuit, about two minutes after departure.

Of course, should the avatar go back to the region, things still around may be reused...

Edited by Henri Beauchamp

3 hours ago, Monty Linden said:

I want to do enough eventually to detect and log that your peer is doing something unexpected.  Might just be an 'event lost' flag or a sequence number scheme.  Not certain it will go as far as a digest of failure.  Open to suggestions.

An "event lost" indicator, for both event poller events and UDP messages, would be a good start. That would distinguish overload-related situations (busy shopping event with many avatars entering and leaving) from other kinds of problems. If there are no missing events, and the fixes mentioned above have been made, but things still reach a "stuck" situation, then there's a well defined problem to look for.


  • Lindens

Agreed.  The overload scenario isn't just theoretical, it is happening to the tune of 3 million per day.  I also have 6K per day where viewers send 'ack' values the simulator didn't supply.  No idea what that is (yet).

(I really need to stop peeling this onion.  It's full of brain spiders and they're making me miserable.)


5 hours ago, Monty Linden said:

Agreed.  The overload scenario isn't just theoretical, it is happening to the tune of 3 million per day.  I also have 6K per day where viewers send 'ack' values the simulator didn't supply.  No idea what that is (yet).

(I really need to stop peeling this onion.  It's full of brain spiders and they're making me miserable.)

It's too bad they never open-sourced an old version of the simulator server side; it would be neat to tinker with and see what all happens on a terminal. Ugh, pipe dreams, hehe.


On 9/15/2023 at 3:01 PM, Monty Linden said:

The overload scenario isn't just theoretical, it is happening to the tune of 3 million per day. 

That's great news! Those are clear problems to work on. When solved, the remaining problems should be more repeatable, and a big step closer to being fixed.

Quote

I also have 6K per day where viewers send 'ack' values the simulator didn't supply.  No idea what that is (yet).

Hm. If you need more checking viewer side for that, just ask. Is this UDP side or event poller? I see sequential numbers from the SL simulators for both. The Other Simulator does sometimes skip sequence numbers.

I get a few duplicate UDP messages; not many. I just noticed that I log any ACK that doesn't match an un-ACKed message as a "duplicate ACK", so totally spurious ACKs will be logged as "duplicate". I'll take a look for spurious message numbers coming from the simulator.
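
The classification I should be doing is roughly this. A standalone Python sketch of the bookkeeping only, not my actual UDP code:

```python
# Sketch only: classify incoming ACKs against locally tracked reliable messages.
# "unacked" holds packet IDs still awaiting an ACK; "recently_acked" remembers
# IDs we already saw ACKed, so a repeat can be told apart from a spurious one.

from collections import deque

unacked = {1001, 1002, 1005}          # reliable packets we sent, not yet ACKed
recently_acked = deque(maxlen=256)    # ring of IDs whose ACK we already saw

def classify_ack(packet_id: int) -> str:
    if packet_id in unacked:
        unacked.discard(packet_id)
        recently_acked.append(packet_id)
        return "valid"
    if packet_id in recently_acked:
        return "duplicate"        # repeated ACK for something already ACKed
    return "spurious"             # we never sent (or no longer know) this ID

for ack in (1001, 1001, 9999):
    print(ack, classify_ack(ack))   # valid, duplicate, spurious
```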


  • Lindens
3 hours ago, animats said:

Hm. If you need more checking viewer side for that, just ask. Is this UDP side or event poller? I see sequential numbers from the SL simulators for both. The Other Simulator does sometimes skip sequence numbers.

Event poller.  I suspect it's a case where the simulator has tossed the LLAgentCommunication state after a region crossing and avatar drift.  But the viewer keeps its LLEventPoll object alive so viewer and sim are now desynchronized.  Haven't dug into it yet - future project.  Even the 3e6 dropped events per day are just 4 / hour / region when amortized.
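
For illustration, one way the viewer side could notice that desynchronization, assuming the 'id' values become a per-connection counter that restarts when the LLAgentCommunication state is recreated (an assumption about the planned semantics, not current behavior), sketched in Python:

```python
# Sketch: detect that the simulator restarted its event-queue state while the
# viewer's poll object stayed alive. Assumes 'id' is a per-connection counter
# that restarts low after a reset -- an assumed future semantic, not current SL.

def next_ack(last_id_seen: int | None, reply_id: int) -> int:
    if last_id_seen is not None and reply_id < last_id_seen:
        # The simulator forgot us (agent departed and came back): start over
        # instead of acknowledging ids the new queue never issued.
        print(f"sim reset detected: id went {last_id_seen} -> {reply_id}")
    return reply_id   # always ack what the server actually sent

print(next_ack(None, 1), next_ack(41, 42), next_ack(42, 1))
```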


4 hours ago, Monty Linden said:

Event poller.  I suspect it's a case where the simulator has tossed the LLAgentCommunication state after a region crossing and avatar drift.  But the viewer keeps its LLEventPoll object alive so viewer and sim are now desynchronized.  Haven't dug into it yet - future project.  Even the 3e6 dropped events per day are just 4 / hour / region when amortized.

Well, the viewer indeed keeps the event poll alive after the agent has left the region, which is needed to keep region communications alive when the region border was simply crossed, or in the case of a "medium range" TP to a neighbouring region still within draw distance.

Of course, in the "far TP" case, the viewer will keep polling until it finds out the region is to be disconnected, so it might restart a poll after a far TP, acknowledging (again) the last received message id (same id as the previous poll)...

Double acks will also happen whenever a poll request "fails" (or simply times out) for a live region and the viewer restarts a second poll: here again, the "id" of the last received message is repeated in the "ack" field of the poll request.
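
In pseudo-code form, a Python sketch of the viewer-side loop, with the request body reduced to its "ack" and "done" fields:

```python
# Sketch of why the same 'ack' id gets sent twice: when a poll request times
# out (or fails) with no reply, the viewer has nothing newer to acknowledge,
# so the retry necessarily repeats the last id it actually received.

def viewer_poll_loop(responses):
    """responses: iterable of poll outcomes, None meaning timeout/failure."""
    last_id = None
    for outcome in responses:
        request = {"ack": last_id, "done": False}   # body of the next poll
        print("POST", request)
        if outcome is None:
            continue            # timed out: re-poll, 'ack' unchanged (double ack)
        last_id = outcome       # got a reply: remember its 'id' for the next ack

# Reply with id=7, then a timeout, then id=8: the second and third requests
# both carry ack=7, which is the "double ack" the server sees.
viewer_poll_loop([7, None, 8])
```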

Edited by Henri Beauchamp

9 hours ago, Henri Beauchamp said:

Double acks will also happen whenever a poll request "fails" (or simply times out) for a live region and the viewer restarts a second poll: here again, the "id" of the last received message is repeated in the "ack" field of the poll request.

Right. If you time out, you have to re-poll with the last ID you received.

That's why I suggested that event poller timeout should be server side and should consist of sending a normal LLSD reply with an empty array of events. That unambiguously says everything is fine, no traffic right now. The ID then advances by 1 and both ends stay in sync.
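
Roughly, the reply I have in mind looks like this. A Python sketch; the field names follow the existing event-queue responses ("id" and "events"), the values are illustrative:

```python
# Sketch of the suggested "nothing happened" reply: same shape as a normal
# event-queue response, just with an empty event list, so the 'id' keeps
# advancing and the viewer's next 'ack' stays in step with the server.

def make_poll_reply(next_id: int, pending_events: list) -> dict:
    return {"id": next_id, "events": pending_events}   # empty list when idle

print(make_poll_reply(42, []))                       # idle: {'id': 42, 'events': []}
print(make_poll_reply(43, [{"message": "TeleportFinish", "body": {}}]))
```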


50 minutes ago, animats said:

That's why I suggested that event poller timeout should be server side and should consist of sending a normal LLSD reply with an empty array of events.

That won't be a timeout, but a periodic empty message sent by the server before the request would actually time out at the HTTP stack level. It means that, if the server has not sent any message to a given viewer with an active request in the past 28 seconds or so (to avoid the 30 s HTTP timeout, allowing for the "ping" time, the frame time, and possible server-side lag at the next frame), it must send a reply with an empty events array.

But yes, it would work with the current code used by viewers, and would definitely prevent some race conditions on TP (the race happening when the TP request is sent just as the server times out and libcurl silently retries the request, with TeleportFinish sent by the server too soon, before libcurl could reconnect).

Edited by Henri Beauchamp

2 hours ago, Monty Linden said:

This initial phase is going to be constrained by the http server we have in the simulator.  It's based on that pump/pipe model.  Without rework, the only timeout response it can formulate is a closed socket.  :(

Which would be an argument in favour of @animats' suggestion to send an empty array of events instead of letting the request time out... Of course, it means more work for the sim server (monitoring the request timing for each connected viewer and sending an empty reply when a request gets close to an HTTP timeout, to avoid the latter), but it should not prove too overwhelming either...
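
Something along these lines server side. A Python sketch of the bookkeeping only; the names are illustrative and the 28 s figure is the one mentioned above:

```python
# Sketch of the server-side bookkeeping: once per frame, look at every held
# poll request and answer it with an empty 'events' array before it gets close
# to the 30 s HTTP timeout. Names here are illustrative only.

import time

KEEPALIVE_AFTER = 28.0   # reply no later than ~28 s, leaving margin before 30 s

class HeldPoll:
    def __init__(self, agent_id, ack):
        self.agent_id = agent_id
        self.ack = ack
        self.started = time.monotonic()

def per_frame_check(held_polls, send_reply):
    now = time.monotonic()
    for poll in list(held_polls):
        if now - poll.started >= KEEPALIVE_AFTER:
            send_reply(poll, {"id": poll.ack + 1, "events": []})  # keepalive
            held_polls.remove(poll)

# Demo: pretend a held request is already 29 s old, so it gets the empty reply.
held = [HeldPoll("agent-1", 41)]
held[0].started -= 29
per_frame_check(held, lambda p, r: print(p.agent_id, "->", r))
```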


This is great!

Some notes:

"At least once" delivery

The most important change is the move to "at least once" delivery instead of "no more than once". That alone will solve some major problems. There may be some viewer-side problems with duplicate events, but those can be detected viewer side. The events

  • EstablishAgentCommunication
  • CrossedRegion
  • TeleportFinish

are probably the ones that need viewer-side duplicate detection, because they all imply major viewer side state changes. No big problem detecting that viewer side. Duplicates are far better than losing any of those crucial messages. Losing one of those means a viewer stalled at login, region crossing, teleport, or neighbor region appearance.

Are there any others not in that list for which processing a duplicate message causes serious trouble?
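
A minimal sketch of that viewer-side guard, in Python and purely illustrative; the dedup key is an assumption, anything stable in the event body (such as the region handle) would do:

```python
# Sketch: drop repeats of the few events whose handlers must not run twice.
# The dedup key is an assumption -- anything stable in the event body (region
# handle, sim IP/port, session id) would serve.

CRITICAL = {"EstablishAgentCommunication", "CrossedRegion", "TeleportFinish"}
seen = set()

def should_process(event: dict) -> bool:
    name = event["message"]
    if name not in CRITICAL:
        return True                          # ordinary events: process as-is
    key = (name, str(event.get("body", {}).get("region_handle", "")))
    if key in seen:
        return False                         # duplicate of an already-applied event
    seen.add(key)
    return True

evt = {"message": "TeleportFinish", "body": {"region_handle": "1234.5678"}}
print(should_process(evt), should_process(evt))   # True, False
```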

Event poll timeout semantics

Henri and I were arguing above over how to handle this. It's even worse than either of us thought.

I'll plug again for the server sending an empty array of events when it has no events to send. Stay on the happy path of HTTP status 200 if at all possible. With Apache on one end and Curl on the other, plus a possible Squid proxy, plus possible VPNs, firewalls, and middle boxes, the error code cases just aren't going to reliably make it through the chain. Existing viewers should be able to handle this without modification, since it's their happy path too.  As Monty Linden said above, "Simplify rather than bolt on more special case code. We're dying from state-space explosions".

Event order

Once we have at-least-once delivery, is it worth guaranteeing in-order delivery of event poller events? Out of order delivery is a feature on the UDP message side, but it doesn't seem to add value on the HTTP event poller side. Out of order delivery makes this much harder to think about and adds code complexity. What is the server-side benefit of out of order event queuing? Something to think about. A discussion topic at this point, not a proposal.

Metadata

A useful debugging tool. It will help correlate visible problems with the transient server overloads that caused event drops. SL has more crowds now, which is good. Look at "What's Hot Now" in the viewer and you'll see many regions with over 50 avatars. The problems that appear under big-crowd load are now the ones that matter.

Retry

"And none have magical retry systems that make errors go away without effort."

Right. For login failure, the viewer could conceivably try the login again automatically, instead of the user having to do it manually. There are situations where that would help, where a simulator has gone down but the login server is not yet aware of that fact. But those are rare.

The viewer is somewhat lost if a TeleportFinish doesn't arrive, because the viewer has no idea its avatar has left for parts unknown.

Failed region crossings are potentially recoverable. The usual failed state is: vehicle in the new region, avatar in the old region, avatar state "sitting" but on nothing, and the vehicle seat in the new region marked "sat on" with nothing on the seat. It's possible to move the vehicle back to the old region and get everything reconnected, although it takes a sit/unsit/sit cycle to get things cleaned up, and the cleanup is incomplete regarding sit-target state. It's far better to avoid getting into that situation than to try to get out of it. Once event lossage is fixed, we'll be able to see what problems remain in that area.

Conclusion

This is a huge step forward.


Thank you for a really useful paper!

It indeed explains a lot of things I could observe here with my improved LLEventPoll logging and my new debug settings for playing with poll request timeouts... As well as some so far "unexplained" TP failure modes that resist my new TP queuing code (queuing until the next poll has started, when too close to a timeout).

Tomorrow's releases of the Cool VL Viewer (both the stable and experimental branches) will have all the code changes I made and will allow you to experiment with them. I will post details about the debug settings and log tags here after release.

Looking forward to the server changes. I'll have a look at what duplicated messages could entail viewer side (especially CrossedRegion and TeleportFinish, which could be problematic if received twice) and whether they would mandate viewer code changes or not.

Edited by Henri Beauchamp

Cool VL Viewer releases (v1.30.2.28 and v1.31.0.6) published, with my new LLEventPoll code and experimental race condition (partial) workaround for TP failures.

The new goodies work as follows:

  • LLEventPoll was made robust against the 499 and 500 errors often seen in SL when letting the server time out on its side (which is not the case with LL's current code, since libcurl retries long enough and times out by itself). 502 errors (which were already accepted for OpenSim) are now also treated as "normal" timeouts for SL. It will also retry 404 errors (instead of committing suicide) when they happen for the agent's sim (the agent's sim should never be disconnected spuriously, or at least not before many retries).
  • LLEventPoll now sets HTTP retries to 0 and a viewer-side timeout of 25 seconds by default for SL. This can be changed via the "EventPollTimeoutForSL" debug setting, whose new value is taken into account on the next start of an event poll.
  • LLEventPoll got its debug messages made much more explicit (with human-readable sim names, detailed HTTP error dumps, etc.). You can toggle the "EventPoll" debug tag (from "Advanced" -> "Consoles" -> "Debug tags") at any time to see them logged.
  • LLEventPoll now uses an LLTimer to measure the poll request's age. The timer is started/reset just before a new request is posted. Two methods have been added: one to get the event poll age (getPollAge(), in seconds) and a boolean (isPollInFlight()) which is true when a poll request is waiting for server events and its age is within the "safe" window, i.e. when it is believed to be old enough for the server to have received it and not too close to the timeout. The "safe window" is determined by the viewer-side timeout and a new "EventPollAgeWindowMargin" debug setting: when the poll request age is larger than that margin and smaller than the timeout minus that margin, the poll is considered "safe enough" for a TP request to be sent to the server without risking a race condition (see the sketch after this list). Note that, for the "minimum" age side of the safe window, EventPollAgeWindowMargin is automatically adjusted down if needed for each LLEventPoll instance (by measuring the minimum time taken for the server to reply to a request), and the frame time is also taken into account (else you could end up never being able to TP, when the event rate equals the frame rate or is smaller than EventPollAgeWindowMargin).
  • The age of the agent region event poll can be displayed in the bottom right corner of the viewer window via the "Advanced" -> "HUD info" -> "Show poll request age" toggle: the time (in seconds) gets a "*" appended whenever the poll request age is outside the "safe window".
  • An experimental TP race workaround has been implemented (off by default), which can be toggled via the new "TPRaceWorkAround" debug setting. It works by checking isPollInFlight() whenever a TP request is made and, if not in the safe window, "queuing" the request until isPollInFlight() returns true, at which point the corresponding TP request UDP message is sent to the server. To debug TPs and log their progress, use the "Teleport" debug tag.
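
Putting the "safe window" and the TP queuing together, a rough Python sketch of the logic; the constants and names are illustrative only, the real implementation lives in LLEventPoll and the TP code:

```python
# Sketch of the "safe window" idea: a TP request is only sent while the current
# poll request is believed to be (a) old enough for the server to have received
# it and (b) not about to time out. Numbers and names are illustrative.

import time

POLL_TIMEOUT = 25.0   # viewer-side poll timeout (EventPollTimeoutForSL default)
MARGIN = 1.5          # EventPollAgeWindowMargin-like guard band, in seconds

class PollAge:
    def __init__(self):
        self.start = time.monotonic()   # reset just before a new poll is posted
    def age(self):
        return time.monotonic() - self.start
    def in_safe_window(self):
        a = self.age()
        return MARGIN < a < (POLL_TIMEOUT - MARGIN)

pending_tp = None

def request_teleport(poll: PollAge, tp_message):
    """Send the TP now, or queue it until the safe window reopens."""
    global pending_tp
    if poll.in_safe_window():
        print("send TP UDP message now:", tp_message)
    else:
        pending_tp = tp_message          # sent later, once the window reopens
        print("TP queued; poll age", round(poll.age(), 2), "s is outside the window")

p = PollAge()
request_teleport(p, {"msg": "TeleportLocationRequest"})   # age ~0 s: gets queued
```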

 

Edited by Henri Beauchamp

  • Lindens
On 9/22/2023 at 8:14 PM, animats said:

 

"At least once" delivery

The most important change is the move to "at least once" delivery instead of "no more than once". That alone will solve some major problems. There may be some viewer-side problems with duplicate events, but those can be detected viewer side. The events

  • EstablishAgentCommunication
  • CrossedRegion
  • TeleportFinish

are probably the ones that need viewer-side duplicate detection, because they all imply major viewer side state changes. No big problem detecting that viewer side. Duplicates are far better than losing any of those crucial messages. Losing one of those means a viewer stalled at login, region crossing, teleport, or neighbor region appearance.

Are there any others not in that list for which processing a duplicate message causes serious trouble?

Duplicates are easy to remove for the simple case (viewer and sim in agreement).  The case where LLEventPoll gets recreated losing its memory of previous 'ack'/'id' progress still needs work.  Hope is to remove duplication at that level where it happens in one place.  If it has to happen at the level of individual events/messages, that's duplicated work now and complexity in the future when more are added.

Not certain about completeness of that list yet.  The survey is pretty complete but I haven't checked yet.

On 9/22/2023 at 8:14 PM, animats said:

Event poll timeout semantics

Henri and I were arguing above over how to handle this. It's even worse than either of us thought.

I'll plug again for the server sending an empty array of events when it has no events to send. Stay on the happy path of HTTP status 200 if at all possible.

For various reasons, changing the server side is going to be painful (look at lliohttpserver.cpp in the viewer) and the viewer needs to retry on 499/5xx receipt anyway.  I think if this were changed, I'd just flip the timeout response to be the same as for a duplicate connection:  200 with naked 'undefined' as LLSD value.  But this current behavior is shared with all simulator and dataserver http endpoints as it is based on the common HTTP server 'framework'.  There needs to be more awareness of this and our docs don't help.

Viewer timeout was still the real problem as that's what introduced the race condition for the 'at most once' case.  Without it, loss was still possible but would have happened more rarely and no one would ever have found it.  So this is a good thing.  (Ha)

On 9/22/2023 at 8:14 PM, animats said:

Event order

Once we have at-least-once delivery, is it worth guaranteeing in-order delivery of event poller events? Out of order delivery is a feature on the UDP message side, but it doesn't seem to add value on the HTTP event poller side. Out of order delivery makes this much harder to think about and adds code complexity. What is the server-side benefit of out of order event queuing? Something to think about. A discussion topic at this point, not a proposal.

It's very deliberately ordered right now but it was doing event combining in the past, overwriting certain existing events when new ones of the same type were queued.  The way it was done, it also resulted in out-of-order delivery.  But that was removed long ago.  Combined with disallowing multiple connections at once, order is now preserved.  I don't know that we'd ever want it back.  Except for the fact that a lot of data can flow through this pathway and it does overflow.  Combining might reduce that.  Not really proposing it, just making it clear that there were decisions here.  I'm trying to leave enough breadcrumbs for future debates.


  • Lindens
On 9/23/2023 at 4:32 PM, Henri Beauchamp said:

Cool VL Viewer releases (v1.30.2.28 and v1.31.0.6) published, with my new LLEventPoll code and experimental race condition (partial) workaround for TP failures.

The new goodies work as follows:

  • LLEventPoll was made robust against the 499 and 500 errors often seen in SL when letting the server time out on its side (which is not the case with LL's current code, since libcurl retries long enough and times out by itself). 502 errors (which were already accepted for OpenSim) are now also treated as "normal" timeouts for SL. It will also retry 404 errors (instead of committing suicide) when they happen for the agent's sim (the agent's sim should never be disconnected spuriously, or at least not before many retries).
  • LLEventPoll now sets HTTP retries to 0 and a viewer-side timeout of 25 seconds by default for SL. This can be changed via the "EventPollTimeoutForSL" debug setting, whose new value is taken into account on the next start of an event poll.

With these two done, a different experimental mode would be to set the viewer timeout to something that should almost never trigger:  45s or more.  This hands connection lifecycle mostly to the simulator, allowing it to know when it can truly send and closing the race.  This should give a 'mostly working' implementation of 'at most once' event delivery.  It will still break on other failures on the network as well as at TPs/RCs where the LLEventPoll/LLAgentCommunication endpoints may get cycled.  Not a fix but an experiment.


All this looks very good. "At least once" is fine. If we have problems with duplicates, that's detectable and fixable. Order preservation means some pathological out of order cases should not come up viewer side.

Consolidating events is useful for object updates. But that's UDP side. I don't see anything in the Appendix A list that looks like a good prospect for consolidation for performance improvement.

Looking at the list of events in Appendix A, what can generate an event overload? PBR material updates could, but those were just moved to UDP; that was the only time I ever saw much event traffic. The "chatterbox" stuff? Beq Janus may have something to say about that; she's been struggling with group chat synchronization. The "bridge" mechanism? Should we be prepared for any UDP message to appear on the event channel in the future?

Viewer event poll timeout bad, simulator timeout normal. Sounds good.

Looking good.


5 hours ago, Monty Linden said:

With these two done, a different experimental mode would be to set the viewer timeout to something that should almost never trigger:  45s or more.

3 hours ago, animats said:

Viewer event poll timeout bad, simulator timeout normal. Sounds good.

You can increase the timeout to 45s with the Cool VL Viewer now, but sadly, in some regions (*) this translates into a libcurl-level "spurious" retry after 30s or so (i.e. a first server-side timeout gets silently retried by libcurl) before you do get a viewer-side timeout after the configured 45s delay. Why this happens is unclear (*), but it does happen, meaning there is no way, for now, to always get a genuine server-side timeout in the agent region (the one that matters), nor to prevent a race during the first "silent retry" by libcurl...

(*) I would need to delve into libcurl's code and/or instrument it, but I saw cases (thus "in some regions") where there were no silent retries by libcurl and I did get a proper server-side timeout after 30s, meaning there might be a way to fix this issue server-side, since it looks like it depends on some server(s) configuration (at the Apache level, perhaps... Are all your Apache servers configured the same?)...

3 hours ago, animats said:

If we have problems with duplicates, that's detectable and fixable

I already determined that a duplicate TeleportFinish message could possibly cause a failed TP with the existing viewer code, because there is no guard in process_teleport_finish() against a TeleportFinish received after the TP state machine has moved to a state other than TELEPORT_MOVING, and process_teleport_finish() is the function responsible for setting that state machine to TELEPORT_MOVING... So, if the second TeleportFinish message (sent by the departure sim) is received after the AgentMovementComplete message (sent by the arrival sim, which was itself connected on the first TeleportFinish occurrence and sets TELEPORT_START_ARRIVAL), you get a "roll back" of the TP state machine from TELEPORT_START_ARRIVAL to TELEPORT_MOVING, which will cause the viewer to fail to finish the TP process properly.
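
The missing guard would look roughly like this. A Python sketch of the state-machine check only, not the actual process_teleport_finish() C++ code:

```python
# Sketch of the missing guard: ignore a TeleportFinish when the TP state
# machine has already advanced past TELEPORT_MOVING, so a late duplicate from
# the departure sim cannot roll the state back.

TELEPORT_NONE, TELEPORT_MOVING, TELEPORT_START_ARRIVAL = range(3)
tp_state = TELEPORT_NONE

def process_teleport_finish(msg):
    global tp_state
    if tp_state == TELEPORT_START_ARRIVAL:
        print("duplicate TeleportFinish ignored (already arriving)")
        return
    tp_state = TELEPORT_MOVING           # normal path: start moving to the new sim
    print("handling TeleportFinish for", msg["region"])

process_teleport_finish({"region": "DestSim"})   # first copy: handled
tp_state = TELEPORT_START_ARRIVAL                # AgentMovementComplete arrived
process_teleport_finish({"region": "DestSim"})   # second copy: ignored
```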

 

So, basically, a procedure must be put in place so that viewers without future hardened/modified code will not get those duplicate event poll messages. My proposal is as follows (a sketch of the logic follows the list):

  • The server sends a first (normal) event poll reply with the message of interest (TeleportFinish in our example) and registers the "id" of that poll reply for that message.
  • The viewer should receive it and immediately restart a poll request with that "id" in the "ack" field; if it does not, or if the "ack" field contains an older "id", the viewer probably missed the message, but the server cannot know for sure, because the poll request it receives might be one started just as it was sending TeleportFinish to the viewer (the request timeout race condition case).
  • To make sure, when the "ack" field of the new poll does not match the "id" of the TeleportFinish reply it sent, the server can answer the viewer's new poll with an empty "events" array, registering the "id" of that empty reply.
  • If the viewer's next poll still does not carry the "id" of the TeleportFinish reply but does carry the "id" of the empty reply, it obviously did not get the first TeleportFinish message, and it is safe for the server to resend it...
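
A compact Python sketch of that server-side decision; the per-agent bookkeeping fields are assumptions, and the real simulator code would obviously differ:

```python
# Sketch of the proposal: before resending a critical event such as
# TeleportFinish, the server first answers the suspect poll with an empty
# reply and only resends if the following poll acks that empty reply but
# still not the critical one. Bookkeeping fields here are assumptions.

class AgentQueue:
    def __init__(self):
        self.next_id = 1
        self.critical_id = None     # id of the reply that carried TeleportFinish
        self.probe_id = None        # id of the empty "are you in sync?" reply

    def reply(self, events):
        rid, self.next_id = self.next_id, self.next_id + 1
        return {"id": rid, "events": events}

    def on_poll(self, ack, critical_event):
        if self.critical_id is None:                 # step 1: first delivery
            r = self.reply([critical_event])
            self.critical_id = r["id"]
            return r
        if ack == self.critical_id:                  # viewer got it: all good
            return self.reply([])
        if self.probe_id is None:                    # steps 2-3: can't tell yet, probe
            r = self.reply([])
            self.probe_id = r["id"]
            return r
        if ack == self.probe_id:                     # step 4: probe acked, event not
            return self.reply([critical_event])      # -> safe to resend it
        return self.reply([])

q = AgentQueue()
tf = {"message": "TeleportFinish", "body": {}}
print(q.on_poll(None, tf))   # first send, id 1
print(q.on_poll(None, tf))   # stale ack: probe with empty reply, id 2
print(q.on_poll(2, tf))      # probe acked but not id 1: resend TeleportFinish
```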

 

EDIT: but the more I think about it, the more I am persuaded that the definitive solution to prevent race conditions is to entirely suppress the risk of poll request timeouts anywhere in the chain (sim server, Apache, libcurl, viewer). This would "simply" entail implementing the proposal made above. By ensuring a dummy/empty message is sent before any timeout would occur, we ensure there is no race at all, since the initiative to close the HTTP connection then rests exclusively with the sim server (via the reply to the current event poll request, be it a "normal" message or a "dummy" one sent only to prevent a timeout), and the initiation of the poll request HTTP connection only happens at the viewer code level.

Edited by Henri Beauchamp

5 hours ago, Monty Linden said:

and the viewer needs to retry on 499/5xx receipt anyway

LL's current viewer code considers these cases errors, which are retried only a limited number of times before the viewer gives up on the event polls for that sim server; these should therefore not happen under "normal" conditions, and they currently do not happen only because the code lets libcurl retry and time out by itself, at which point the viewer gets a libcurl-level timeout, which is considered normal (not an error) and is retried indefinitely.

Edited by Henri Beauchamp
