
Obscure question: when does the simulator send EstablishAgentCommunication to the viewer?


animats

Recommended Posts

  • Lindens
5 hours ago, animats said:

Looking at the list of events in Appendix A, what can generate an event overload?

I haven't dug into the root cause yet, but we have 3 million per day.  I suspect most of these are associated with logout or deteriorating network conditions.  But it is pretty common.

Link to comment
Share on other sites

8 hours ago, Henri Beauchamp said:

I already determined that a duplicate TeleportFinish message could possibly cause a failed TP in existing viewer code, because there is no guard in process_teleport_finish() for a TeleportFinish received after the TP state machine has moved to a state other than TELEPORT_MOVING, and process_teleport_finish() is the function responsible for setting that state machine to TELEPORT_MOVING...

Interesting. That's very similar to the problem of a double region crossing. We know that if a second region crossing starts before the first has completed (a common case at region corners) a stuck state results. The problems may be related.

Link to comment
Share on other sites

36 minutes ago, animats said:

Interesting. That's very similar to the problem of a double region crossing. We know that if a second region crossing starts before the first has completed (a common case at region corners) a stuck state results. The problems may be related.

I wonder if this also applies to the case of being offered a teleport to a destination region that I am already in the process of teleporting to, because this always gets me "Darn!" and requires that I close the viewer and try again.  While I am stuck waiting for "Darn!" I can hear the sounds of my destination and am told I am there, but all the viewer shows me is the teleport blinder with a progress bar.

Link to comment
Share on other sites

  • Lindens
20 hours ago, Henri Beauchamp said:

You can increase the timeout to 45s with the Cool VL Viewer now, but sadly, in some regions (*) this will translate into a libcurl-level ”spurious” retry after 30s or so (i.e. a first server-side timeout gets silently retried by libcurl) before you do get a viewer-side timeout after the configured 45s delay; why this happens is unclear (*), but sadly it does happen, meaning there is no possibility, for now, to always get a genuine server-side timeout in the agent region (the one that matters), nor to prevent a race during the first ”silent retry” by libcurl...

There are other players in here.  I've deliberately avoided mention of Apache details.  However, it will contribute its own timeout-like behaviors as well as classic Linden behaviors (returning 499, for example).  The modules in here aren't candidates for rework for a number of reasons, so general principles must apply:  expect broken sockets (non-HTTP failures), 499s, and flavors of 5xx, and don't treat these as errors.

That the LL viewer does treat them as errors is really a bug.  The viewer-side timeouts enable hang and lost-peer detection on a faster timescale than, say, TCP keepalives would (not that that's the right tool here).  Slowed retry cycles might be better, with the UDP circuit timeout then driving the decision to declare the peer gone.  But that didn't happen.

Whatever happens, I really want it to work regardless of how many entities think they should implement timeouts.  Timeouts are magic numbers and the data transport needs to be robust regardless of the liveness checking.

(I do want to document this better.  This is something present on all capabilities.  For most caps it won't matter but that depends on the case.)

 

20 hours ago, Henri Beauchamp said:

So, basically, a procedure must be put in place so that viewers without future hardened/modified code will not get those duplicate event poll messages.

 

The situation isn't entirely grim with the existing viewer code but there is still a duplication opportunity.  First, the optimistic piece.  The 'ack'/'id' value is supposed to have meaning but the current LL viewer and server do absolutely nothing with it except attempt a safe passthru from server to viewer and back to server.

With the passthru and the enforcement of, at most, a single live request to the region, there is good agreement between viewer and server on what the viewer has processed, as long as the 'ack' value held by the viewer isn't damaged in some way (reset to 0, made an 'undef', etc.).

There is one window:  during successful completion of a request with 'id = N + 1' and before a new request can be issued, the server's belief is that viewer is at 'ack = N'.  If the viewer walks away from the simulator (TP, RC) without issuing a request with 'ack = N + 1', viewer and server lose sync.
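To make the passthru concrete, here is a minimal sketch of the bookkeeping described above, with hypothetical names (this is not the actual LLEventPoll/LLAgentCommunication code): the viewer copies the last 'id' it saw into the 'ack' of its next request, and the server tracks the last 'id' it sent and the last 'ack' it received.

#include <cassert>
#include <cstdint>
#include <iostream>

// Hypothetical sketch, not LL code: the viewer copies the last 'id' it saw
// into the 'ack' of its next request; the server tracks what it last sent.
struct Viewer {
    int32_t last_id_seen = 0;                        // next request's 'ack'
};
struct Simulator {
    int32_t last_id_sent = 0;
    int32_t last_ack_received = 0;
    int32_t stage_batch() { return ++last_id_sent; } // new 'id' for a batch
};

int main() {
    Viewer v;
    Simulator s;

    // Normal exchange: viewer acks 0, server responds with id = 1.
    s.last_ack_received = v.last_id_seen;            // request with ack = 0
    v.last_id_seen = s.stage_batch();                // response with id = 1

    // The window: the viewer has processed id = 1 but has not yet issued the
    // request carrying ack = 1, so the server still believes ack = 0.  If the
    // viewer leaves the region (TP, RC) here, the two ends disagree about
    // what was delivered, which is what the four cases below enumerate.
    assert(v.last_id_seen == 1 && s.last_ack_received == 0);
    std::cout << "viewer at id " << v.last_id_seen
              << ", server believes ack " << s.last_ack_received << "\n";
    return 0;
}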

So, the four cases:

  1. Viewer (LLEventPoll) at 'ack = 0/undef' and server (LLAgentCommunication) at 'id = 0'.  Reset/startup/first-visit condition, everything is correct.  No events have been sent or processed.
  2. Viewer at 'ack = N' and server at 'id = N or N + 1'.  Normal operation.  If viewer issues request, server knows viewer has processed 'id = N' case and does not resend.  Any previous send of 'id = N + 1' case has been dropped on the wire due to single request constraint on the service.
  3. Viewer at 'ack = N' and server at 'id = 0'.  Half-reset.  Viewer walked away from region then later returned and server re-constructed its endpoint but the viewer didn't.  Safe from duplication as the event queue in the server has been destroyed and cannot be resent.  Viewer does have to resync with server by taking on the 'id = 1' payload that will eventually come to it.  (Could even introduce a special event for endpoint restart to signal viewer should resync but I don't think this is needed.)
  4. Viewer at 'ack = 0' and server at 'id = N'.  Half-reset.  Opposite case.  Viewer walked away from the region then reconstructed its endpoint (LLEventPoll) but the server didn't.  This is the bad case.  Viewer has no knowledge of what it has processed and the server has a set of events, 'id = N', which the viewer may or may not have processed in that aforementioned window.

I believe that last case is the only real exposure to duplicated events we have, given a) the various request constraints and b) no misbehavior by viewer or server during this protocol.  So what are the options?  Here are some:

  • Case 4. doesn't happen in practice.  There's some hidden set of constraints that doesn't allow this to happen.  System transitions to cases 1. or 3. and 4. never arises.  Hooray!
  • There is the 'done' mechanism in the caps request body to terminate the server's endpoint.  It is intended that the viewer use this in a reliable way when it knows it's going to give up its endpoint.  I.e. it forces a transition to case 1. if 4. is a possibility.  I think this control is buggy and has races so there are concerns.  Requires viewer change.
  • Viewer keeps 'last ack' information for each region forever.  When the viewer walks away from a region and comes back, it reconstructs its endpoint with this 'last ack' info.  So instead of coming back into case 4., it enters into 2. or 3. and we're fine again.  Requires viewer change.
  • Use an 'ack = 0' request as an indicator that viewer has reset its knowledge of the region and assume that any current payload ('id = N') has either been processed or can be dropped as uninteresting because the viewer managed to return here by some means.  Next response from the server will be for 'id = N + 1'.  No viewer change but viewers might be startled.  TPV testing important.
  • Magical system I don't know about.

I would really like for that first bullet point to be true but I need to do some digging.

 

Link to comment
Share on other sites

3 hours ago, Monty Linden said:

There is one window:  during successful completion of a request with 'id = N + 1' and before a new request can be issued, the server's belief is that viewer is at 'ack = N'.  If the viewer walks away from the simulator (TP, RC) without issuing a request with 'ack = N + 1', viewer and server lose sync.

This is not the issue at hand, nor what I am observing, nor would it cause the race condition I do observe and am now able (thanks to the new ”request poll age” debug display in the Cool VL Viewer) to reproduce at will; this is really easy with the configured defaults (25s viewer-side timeout, and experimental TP race workaround disabled): wait until the poll age display gets a ”*” appended, which will occur at around 24.5s of age, and immediately trigger a TP: bang, the TP fails (with timeout quit)!

The issue I am seeing in ”normal viewers” (viewers with LL's unchanged code, which my changes only allow me to reproduce artificially and ”reliably”) is a race at the request timeout boundary: the agent sim server (or Apache behind it) is about to time out (30s after the poll request was started viewer-side, which will cause a ”silent retry” by libcurl), and the user requests a TP just before the timeout occurs, but the TeleportFinish message is sent by the server just after the silent retry occurred or while it is occurring. The TeleportFinish is then lost, so what would happen in this case is:

  1. The sim server sent a previous message (e.g. ParcelProperties) with id=N, and the viewer replied with ack=N in the following request (with that new request not yet used, and N+1 being the next ”id” the server will send).
  2. The user triggers a TP just as the ”server-side” (be it at the sim server or Apache server level, this I do not know) is about to time out on us, which happens 30s after it received the poll request from the viewer. At this point a Teleport*Request UDP message is sent to the sim server.
  3. The poll request started after receipt of ”ParcelProperties” by the viewer times out server-side, and Teleport*Request (which took the faster UDP route) is also received by the sim server. What exactly happens at this point server-side is unknown to me: is there a race between Apache and the sim server, a race between the Teleport*Request and the HTTP timeout causing a failure to queue TeleportFinish, or is TeleportFinish queued in the wrong request queue (the N+1 one, which the viewer did not even start, because the sim server would consider the N one dead)?... You'll have to find out.
  4. Viewer-side, libcurl gets the server timeout and silently retries the request (unknown to the viewer code in LLEventPoll), and a ”new” request (actually the same request, retried ”as is” by libcurl) with the same ack=N is sent to the server (this is likely why you get 3 million ”repeated acks”: each libcurl retry reuses the same request body).
  5. The viewer never receives TeleportFinish, and never started a new poll request (seen from LLEventPoll), so is still at ack=N, with the request started after ParcelProperties still live/active/valid/waiting for server reply, from its perspective (since successfully retried by libcurl).

 

With my new code and its default settings (25s viewer-side timeout, TP race workaround OFF), the same thing as above occurs, but the request times out at the LLEventPoll level (meaning the race only reproduces after 24.5s or so of request age) instead of server-side (and then being retried at the libcurl level); the only difference you will see server-side is that a ”new” request (still with ack=N) from the viewer arrives before the former has timed out server-side (which might not be much ”safer” either, race-condition-wise, server-side).

This at least gives a more deterministic ”danger window”, hence the ease of reproducing the race, and enabled my attempt at the TP race workaround (in which the sending of the UDP message corresponding to the user's TP request is delayed until outside the ”danger window”), which is sadly insufficient to prevent all TP failures.

 

As for ack=0 issues, they too are irrelevant to the cases where TPs and region crossings fail: in these two cases, the poll request with the agent region is live, and so is the one with the neighbour region involved in a region crossing. There will be no reset to ack=0 from the viewer in these cases, since the viewer never kills the poll request coroutines (on whose stack the ack is stored) for the agent region and the close (= within draw distance) neighbour regions.

 

But I want to reiterate: all these timeout issues/races would vanish altogether if only the server could send a dummy message when nothing else needs to be sent, before the dreaded 30s HTTP timeout barrier (say, one message every 20s, to be safe).
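The idea is simple enough to sketch; the following is only an illustration with made-up names (the actual simulator code is not public): if no real event is ready before the HTTP timeout barrier, answer the outstanding poll with a harmless placeholder so the connection never sits idle long enough for the 30s timeout to fire.

#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// Illustration only, with made-up names: never let an open poll request sit
// unanswered long enough for the Apache/libcurl timeouts to fire.
class EventQueue {
public:
    // Blocks until an event arrives or 'keepalive_after' elapses; on timeout
    // it returns a dummy reply instead of letting the HTTP request expire.
    std::string next_response(std::chrono::seconds keepalive_after) {
        std::unique_lock<std::mutex> lock(mutex_);
        if (cv_.wait_for(lock, keepalive_after,
                         [this] { return !events_.empty(); })) {
            std::string ev = events_.front();
            events_.pop();
            return ev;                                // real event, normal path
        }
        return "{\"events\":[],\"keepalive\":true}";  // dummy reply, sent well
                                                      // before the 30s barrier
    }
    void push(std::string ev) {
        { std::lock_guard<std::mutex> lock(mutex_); events_.push(std::move(ev)); }
        cv_.notify_one();
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::string> events_;
};

int main() {
    EventQueue q;
    // Shortened to 1s for this demo; the proposal above suggests ~20s.
    std::string reply = q.next_response(std::chrono::seconds(1));
    return reply.empty() ? 1 : 0;
}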

Edited by Henri Beauchamp
  • Like 1
Link to comment
Share on other sites

  • Lindens
On 9/26/2023 at 5:17 PM, animats said:

Interesting. That's very similar to the problem of a double region crossing. We know that if a second region crossing starts before the first has completed (a common case at region corners) a stuck state results. The problems may be related.

There's a lot going on on our side both before *and* after the TeleportFinish.  Before the message there's a lot of communication between simhosts and other services.  It is very much a non-local operation (region crossings have some optimizations in this area).  After the message, some deferred work continues for an indefinite number of frames.  Yeah, there are more races in there - I started seeing their fingerprint when looking at London City issues.

This is a bit of a warning...  the teleport problems aren't going to be a single-fix matter.  The problem in this thread (message loss) is very significant.  But there's more lurking ahead.

  • Thanks 1
Link to comment
Share on other sites

3 minutes ago, Monty Linden said:

There's a lot going on on our side both before *and* after the TeleportFinish.

I'm aware of that. We don't see it viewer side, but know it's happening. After all, this is TeleportFinish. Double region crossings will probably be the hardest problem. Each time you fix an intermittent bug in messaging, though, the remaining bugs acquire greater clarity, and become closer to being fixed.

I'm encouraged at progress in this area.

Tuesday's demo at Server User Group of faster avatar rezzing in test regions was impressive. Pink/white clouds disappeared within 2-3 secs, and avatars had no parts missing. That was great to see. SL is getting more crowds, and crowds of avatars half-rezzed in event spaces have become a serious problem. Unwanted nude avatars, avatars with heads on backwards, stuff like that. Looks awful. That may be an unrelated problem technically, but from a user perspective, seeing old immersion-breaking bugs being fixed is a clear indicator of progress. From a marketing perspective, half the avatars in the WelcomeHub are stuck in pink/white cloud mode, and some of those won't come back. That alone justifies to management the effort expended in fixing that problem.

Keep plugging, and thanks.

  • Like 2
Link to comment
Share on other sites

3 hours ago, Monty Linden said:

This is a bit of a warning...  the teleport problems aren't going to be a single-fix matter.  The problem in this thread (message loss) is very significant.  But there's more lurking ahead

I am fully aware of this; however, we (animats & I) proposed a ”free lunch”: implementing those dummy poll reply messages server-side (a piece of cake to implement server-side, and something which won't break anything, not even in old viewers) to fully get rid of HTTP-timeout-related race conditions. Then we will see how things fare with TeleportFinish, i.e. will it always be received by viewers?...

There is nothing to lose in trying this, and it could possibly solve a good proportion of failed TPs... If anything, even should it fail, it would eliminate a race condition candidate (or several), and reverting the code server-side would be easy and without any consequence.

Edited by Henri Beauchamp
  • Like 1
Link to comment
Share on other sites

  • Lindens

[ Okay, sorry for the delay.  I've been working through the cases and reformulating the problem and I'm starting to like what I have.  Thanks to everyone who's been whiteboarding with me, especially @Henri Beauchamp and @animats.  It's been extremely valuable and enjoyable.  Now, on to the next refinement with responses to earlier comments... ]

Reframing the Problem

Two race conditions exist in the system:

1.  Inner race.  Involves the lifecycle of a single request.  Race starts as soon as the viewer commits to timing out an EventQueueGet operation.  Even if the entire response, less one byte, is in viewer buffers, that response will be lost as will any response written later until the race closes.  The race closes when:

  • Simulator times out the request and will not send any event data until a new request comes in.  This produces a closed socket between simulator and apache stack.
  • Simulator starts processing a new request from the viewer which causes early completion of outstanding request.  This is sent as a 200/LLSD::undef response.

(Note:  the race is also started by other entities.  There are actions within the Apache stack and simply on the network generally that can end that request early without the simulator being aware.)

This race is due to application protocol design and the use of unreliable HTTP modes (timeouts, broken sockets).

2.  Outer race.  Involves the lifecycle of the HTTP endpoint classes (LLEventPoll and LLAgentCommunication), which retain memory of prior progress.  Instances of these classes can disappear and be recreated without past memory and without informing the peer.  This causes a jump in the ack/id number and the possibility of duplicate events being sent.

This race is due to implementation decisions.  I.e. endpoint classes that forget and that don't have a synchronization protocol between them.

Before and After

Under "At Most Once" event delivery, 1. is a source of lost events and 2. doesn't matter.  Events are being forgotten on every send so the outer race adds no additional failures (loss or duplication).

Under "At Least Once" event delivery, 1. still exists but retaining events on the simulator side until a positive acknowledgement of receipt is received corrects the lost response.  2. then becomes a path for duplicated event processing in the viewer.  A simulator will have sent events for an 'ack = N/id = N+1' exchange when the viewer moves out of the region and eventually releasing its endpoint.  The viewer may or may not have received that response.  When the viewer comes back, it will do so with a new LLEventPoll and 'ack = 0' as its memory and the simulator is faced with a dilemma:  should the viewer be sent a duplicate 'id = N + 1' response or should it assume that that payload has been processed and wait until the 'id = N + 2' payload is ready?

Simulator sending a 200 response before timeout.  This doesn't really close the race.  The race is started outside the simulator, usually by the viewer, but Apache or devices on the network path can also start it.  The simulator can only close the race by electing not to send event data until a new request comes in.  Trying to end the race before it starts isn't really robust:  libcurl timers can be buggy, someone changes a value and doesn't know why, Apache does its thing, sea cable ingress decides to shut a circuit down, network variability and packet loss, and everyone in the southern hemisphere.  And it locks the various timeout values into the API contract.  Solution:  embrace the race, make it harmless via the application protocol.

Updated Phased Proposal

Phase 1

As before with one change to try to address the outer race scenarios.  Simulator will retain the last 'ack' received from the viewer.  If it receives a request with 'ack = 0' and 'last_ack != 0', this will be a signal that the viewer has lost synchronization with the simulator for whatever reason.  Simulator will drop any already-sent events, advance any unsent events, and increment its 'id' value.  'last_ack' now becomes '0' and normal processing (send current or wait until events arise or timeout) continues.  This potentially drops unprocessed events but that isn't any worse than the current situation.
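A minimal sketch of that Phase 1 rule, with illustrative names only (the real LLAgentCommunication implementation is not public):

#include <cstdint>
#include <deque>
#include <string>

// Hypothetical sketch of the Phase 1 rule; not the actual simulator code.
struct SimEndpoint {
    std::deque<std::string> sent;     // events already sent under 'id'
    std::deque<std::string> unsent;   // events queued but not yet sent
    int32_t id = 0;                   // last batch id sent
    int32_t last_ack = 0;             // last ack received from the viewer

    void on_request(int32_t ack) {
        if (ack == 0 && last_ack != 0) {
            // Viewer signals it lost synchronization: drop already-sent
            // events, advance anything unsent, and bump 'id'.
            sent.clear();
            sent.swap(unsent);
            ++id;
            last_ack = 0;
            return;
        }
        last_ack = ack;               // normal processing continues
    }
};

int main() {
    SimEndpoint s;
    s.id = 7; s.last_ack = 7;
    s.sent = {"TeleportFinish"};      // possibly lost on the wire
    s.unsent = {"ParcelProperties"};
    s.on_request(/*ack=*/0);          // viewer came back with no memory
    return (s.id == 8 && s.last_ack == 0) ? 0 : 1;
}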

With this, the inner race is corrected by the application protocol.  There's no sensitivity to timeouts anywhere in the chain:  viewer, simulator, apache, internet.  Performance and delivery latency may be affected but correctness won't be.

Phase 2

This becomes viewer-only.  Once the race conditions are embraced, the fix for the outer race is for the viewer to keep memory of the region conversation forever.  A session-scoped map of 'region->last_ack' values is maintained by LLEventPoll static data and so any conversation can be resumed at the correct point.  If the simulator resets, all events are wiped anyway so duplicated delivery isn't possible.  Viewer just takes up the new 'id' sequence.  This should have been done from the first release.
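A sketch of what that viewer-side memory could look like, assuming a region handle as the key (illustrative only; the real change would hang off LLEventPoll static data):

#include <cstdint>
#include <map>

// Sketch of session-scoped 'region -> last_ack' memory; hypothetical names,
// the real change would live in LLEventPoll static data.
class RegionAckMemory {
public:
    // Returns 0 for a region never visited this session.
    int32_t last_ack(uint64_t region_handle) const {
        auto it = acks_.find(region_handle);
        return it == acks_.end() ? 0 : it->second;
    }
    void remember(uint64_t region_handle, int32_t ack) {
        acks_[region_handle] = ack;
    }
private:
    std::map<uint64_t, int32_t> acks_;   // lives for the whole session
};

int main() {
    RegionAckMemory memory;
    memory.remember(/*region_handle=*/0x0123456789abcdefULL, /*ack=*/42);
    // On return to the region, a new poll coroutine resumes at ack = 42
    // instead of 0, avoiding the outer-race duplication window.
    return memory.last_ack(0x0123456789abcdefULL) == 42 ? 0 : 1;
}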

Additionally, fixing the error handling to allow the full set of 499/5xx codes (as well as curl-level disconnects and timeouts) as just normal behavior.  Maybe issue requests slowly while networks recover or other parts of the viewer decide to surrender and disconnect from the grid.

Logging and defensive coding to avoid processing of duplicated payloads (which should never happen).  There's one special case where that won't be handled correctly:  when the viewer gets to 'ack = 1', moves away from the region, then returns, resetting the simulator to 'id = 0'.  In this case, the first response upon return will carry 'id = 1', which *should* be processed.  May just let this go.  Also, diagnostic logging for the new metadata fields being returned.

Forum Thread Points and Miscellany Addressed

3 million repeated acks.  Just want to correct this.  The 3e6 is for overflows:  when the event queue in the simulator reaches its high-water mark, additional events are just dropped.  Any request from the viewer that makes it in is going to get the full queue in one response.

Starting and closing the race in the viewer.  First, a warning...  I'm not actually proposing this as a solution but it is something someone can experiment with.  Closing the race as outlined above only happens in the simulator.  BUT the viewer can initiate this by launching a second connection into the cap, say 20s after the first.  Simulator will cancel an unfinished, un-timed-out request with 200/LLSD::undef and then wait on the new request.  Viewer can then launch another request after 20s.  Managing this oscillator in the viewer would be ugly and still not deal with some cases.

Fuzzing Technique.  You can try some service abuse by capturing the Cap URL and bringing it out to a curl command. Launch POST requests with well-formed LLSD bodies (map with 'ack = 0' element) while a viewer is connected.  This will retire an outstanding request in the simulator.  That, in turn, comes back to the viewer which will launch another request which retires the curl command request.

TLS 1.3 and EDH.  There is a Jira to enable keylogging in libcurl/openssl to aid in wire debugging of https: traffic.  This will require some other work first to unpin the viewer from the current libcurl.  But it is on the roadmap.

If we do open up this protocol to change in the future (likely under a capability change), I would likely add a restart message to the event suite in both directions.  It declares to the peer that this endpoint has re-initialized, has no memory of what has transpired, and that the peer should reset itself as appropriate.  Likely with some sort of generation number on each side.

 

  • Like 3
  • Thanks 1
Link to comment
Share on other sites

15 hours ago, Monty Linden said:

2.  Outer race.  Involves the lifecycle of the HTTP endpoint classes (LLEventPoll and LLAgentCommunication), which retain memory of prior progress.  Instances of these classes can disappear and be recreated without past memory and without informing the peer.

Currently, such a race will pretty much never happen viewer-side in the Agent's region...

The viewer always keeps the LLEventPoll instance it starts for a region (LLViewerRegion instance) on receipt of the EventQueueGet capability URL, until said region gets farther than the draw distance, at which point the simulator is disconnected, the LLViewerRegion instance is destroyed, and the LLEventPoll instance for that region with it; as long as the LLEventPoll instance is live, it will keep the last received message ”id” on its coroutine stack (in the 'acknowledge' LLSD). However, should EventQueueGet be received a second time during the connection with the region, the existing LLEventPoll instance would be destroyed and a new one created with the new (or identical: no check is done) capability URL.

For the agent's region, I have so far never, ever observed a second EventQueueGet receipt, and so the risk of seeing the LLEventPoll destroyed and replaced with a new one (with a reset ”ack” field on the first request of the new instance) is pretty much nonexistent.

This could, however, possibly happen for neighbour regions (sim capabilities are often ”updated” or received in several ”bundles” for neighbour sims; not too sure why LL made it that way), but I am not even sure it happens for EventQueueGet.

I of course do not know what the LLAgentCommunication lifespan is server-side, but if a race happens, it can currently only be because that lifespan does not match the lifespan of the connection between the sim server and the viewer.

 

15 hours ago, Monty Linden said:

As before with one change to try to address the outer race scenarios.  Simulator will retain the last 'ack' received from the viewer.  If it receives a request with 'ack = 0' and 'last_ack != 0', this will be a signal that the viewer has lost synchronization with the simulator for whatever reason.  Simulator will drop any already-sent events, advance any unsent events, and increment its 'id' value.  'last_ack' now becomes '0' and normal processing (send current or wait until events arise or timeout) continues. 

In fact, ”ack” is a very badly chosen key name. It is not so much an ”ack” as a ”last received message id” field: unless the viewer receives a new message, the ”ack” value stays the same for each new poll request it fires that does not result in the server sending any new message before the poll times out (this is very common for poll requests to neighbour regions).

Note also that, as I already pointed out in my previous posts, several requests with the same ”ack” will appear server-side because these requests have simply been retried ”silently” by libcurl on the client side: the viewer code does not see these retries. For LLEventPoll, a request will not be seen timing out before libcurl has retried it several times and given up with a curl timeout: with neighbour sims, the timeout may only occur after 300s or so in LLEventPoll, while libcurl will have retried the request every 30s with the server (easily seen with Wireshark), and the latter will have seen 10 requests with the same ”ack” as a result.

Also, be aware that with the current code, the first ”ack” sent by the viewer (on first connection to the sim server, i.e. when the LLEventPoll coroutine is created for that region, which happens when the viewer receives the EventQueueGet capability URL) will be an undefined/empty LLSD, and not a ”0” LLSD::Integer!

Afterwards, the viewer simply repeats the ”id” field it gets in an event poll reply into the next ”ack” field of the next request.

To summarize: viewer-side, ”ack” means nothing at all (its value is not used in any way, and the type of its value is not even checked), and can be used as the server sees fit.

15 hours ago, Monty Linden said:

This becomes viewer-only.  Once the race conditions are embraced, the fix for the outer race is for the viewer to keep memory of the region conversation forever.  A session-scoped map of 'region->last_ack' values is maintained by LLEventPoll static data and so any conversation can be resumed at the correct point.  If the simulator resets, all events are wiped anyway so duplicated delivery isn't possible.  Viewer just takes up the new 'id' sequence.  This should have been done from the first release.

Easy to implement, but it will not be how the old viewers work, so... Plus, it would only be of use should the viewer restart an LLEventPoll with the sim server during a viewer-sim (not viewer-grid) connection/session, which pretty much never happens (see my explanations above).

15 hours ago, Monty Linden said:

Additionally, fixing the error handling to allow the full set of 499/5xx codes (as well as curl-level disconnects and timeouts) as just normal behavior.

That hardening part is already in the Cool VL Viewer for 499, 500 and 502 HTTP errors, which are considered simple timeouts (just like the libcurl timeout) and trigger an immediate relaunch of a request. All other HTTP errors are retried several times (and that retry count is doubled for the agent region: it was of invaluable help a couple of years ago, when poll requests were failing left and right with spurious HTTP errors for no reason, including in the agent region).

15 hours ago, Monty Linden said:

Maybe issue requests slowly while networks recover or other parts of the viewer decide to surrender and disconnect from the grid.

This is already the case in the current viewer code: there's a llcoro::suspendUntilTimeout(waitToRetry) call for each HTTP error, with waitToRetry increasing with the number of consecutive errors.

15 hours ago, Monty Linden said:

Logging and defensive coding to avoid processing of duplicated payloads (which should never happen).

Already done in the latest Cool VL Viewer releases, for duplicate TeleportFinish and duplicate/out-of-order AgentMovementComplete messages (for the latter, based on its Timestamp field).

15 hours ago, Monty Linden said:

There's one special case where that won't be handled correctly:  when the viewer gets to 'ack = 1', moves away from the region, then returns, resetting the simulator to 'id = 0'.  In this case, the first response upon return will carry 'id = 1', which *should* be processed.  May just let this go.  Also, diagnostic logging for the new metadata fields being returned.

Frankly, this should never be a problem... Messages received via poll requests from a neighbour region that reconnects, or from a region the agent left a while ago (e.g. via TP) and comes back to, are not ”critical” messages, unlike messages received from the current agent region the agent is leaving (e.g. TeleportFinish)...

15 hours ago, Monty Linden said:

3 million repeated acks.  Just want to correct this.  The 3e6 is for overflows.  When the event queue in the simulator reaches its high-water mark and additional events are just dropped.  Any request from the viewer that makes it in is going to get the full queue in one response.

I do not even know why you bother counting those... As I already explained, you'll get repeated ”ack” fields at each timed-out poll request retry. These repeats should simply be fully ignored; the only thing that matters is that one ”ack” does not suddenly become different from the previous ones for no reason.

15 hours ago, Monty Linden said:

Starting and closing the race in the viewer.  First, a warning...  I'm not actually proposing this as a solution but it is something someone can experiment with.  Closing the race as outlined above only happens in the simulator.  BUT the viewer can initiate this by launching a second connection into the cap, say 20s after the first.  Simulator will cancel an unfinished, un-timed-out request with 200/LLSD::undef and then wait on the new request.  Viewer can then launch another request after 20s.  Managing this oscillator in the viewer would be ugly and still not deal with some cases.

Fuzzing Technique.  You can try some service abuse by capturing the Cap URL and bringing it out to a curl command. Launch POST requests with well-formed LLSD bodies (map with 'ack = 0' element) while a viewer is connected.  This will retire an outstanding request in the simulator.  That, in turn, comes back to the viewer which will launch another request which retires the curl command request.

That's a very interesting piece of info, and I used it to improve my experimental TP race workaround, albeit not with an added POST like you suggest: now, instead of just delaying the TP request until outside the ”danger window” (during which a race may happen), I also fake an EventQueueGet capability receipt for the agent's sim (reusing the same capability URL, of course), which causes LLViewerRegion to destroy the old LLEventPoll instance and recreate one immediately (the server then receives a second request while the first is in the process of closing (*), and I do get the ”cancel” from the server in the old coroutine). I will refine it (e.g. adding ”ack” field preservation between LLEventPoll instances), but it seems to work very well... 😜

(*) yup, I'm using a race condition to fight another race condition !  Yup, I'm totally perverted ! 🤣

Edited by Henri Beauchamp
  • Like 1
Link to comment
Share on other sites

  • Lindens

Okay, I made a state machine diagram for the simulator end of things based on my current work.  Looking for comments, bugs, suggestions.  (Also interested in better tools for doing this.)  But, briefly, it's an event-driven design with:

  • Four driving events
  • Six states
  • Thirty-four transitions
  • A two-way reset election

[Image: EventQueueGet state machine diagram]

Data Model

  • Event Sequence.  The simulator attempts to deliver a stream of events to each attached viewer.  As an event is presented, it gets a (virtual) sequential number in the range [1..S32_MAX].  This number isn't attached to the event and it isn't part of the API contract.  But it does appear in metadata for consistency checking and logging.
  • Pending Queue.  Events are first queued to a pending queue where they are allowed to gather.  There is a maximum count of events allowed for each viewer.  Once reached, events are dropped without retry.  This dropping is counted in the metadata.
  • Pending Metadata.  First and last event sequence numbers of events currently in the queue as well as a count of events dropped due to quota or other reasons.
  • Staged Queue.  When the viewer has indicated it is ready for a new batch of events, the pending queue and pending metadata are copied to the staged queue and staged metadata.  This collection of events is given an incremented 'id'/'ack' value based on the 'Sent ID' data.  These events are frozen as is the binding of the ID value and the metadata.  They won't change and the events won't appear under any other ID value.
  • Staged Metadata.  Snapshot of Pending Metadata when events were copied.
  • Client Advanced.  Internal state which records that the viewer has made forward progress and acknowledged at least one event delivery.  Used to gate conditional reset operations.
  • Sent ID.  Sequential counter of bundles of events sent to the viewer.  Range of [1..S32_MAX].
  • Received Ack.  Last positive acknowledgement received from viewer.  Range of [0..S32_MAX].
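As a reading aid, the data model above condenses into a small struct; the field names here are illustrative, not the simulator's actual ones:

#include <cstdint>
#include <string>
#include <vector>

// Condensed sketch of the data model above; names are illustrative only.
struct QueueMetadata {
    int32_t first_seq = 0;    // first event sequence number in the queue
    int32_t last_seq = 0;     // last event sequence number in the queue
    int32_t dropped = 0;      // events dropped (quota or other reasons)
};

struct AgentEventState {
    std::vector<std::string> pending;   // Pending Queue (gathering events)
    QueueMetadata pending_meta;         // Pending Metadata
    std::vector<std::string> staged;    // Staged Queue (frozen, bound to an id)
    QueueMetadata staged_meta;          // Staged Metadata (snapshot)
    bool client_advanced = false;       // viewer has acked at least one batch
    int32_t sent_id = 0;                // Sent ID, [1..S32_MAX] once in use
    int32_t received_ack = 0;           // Received Ack, [0..S32_MAX]
};

int main() { AgentEventState s; return s.sent_id; }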

States

The States are simply derived from the combination of four items in the Data Model:

  • Staged Queue empty/full (indicated by '0'/'T' above).
  • Sent ID == Received Ack.  Indicates whether the viewer is fully caught up or whether the simulator is waiting for an updated ack from the viewer.
  • Pending Queue empty/full ('0'/'T').

A combination of three booleans gives eight possible states, but two are unreachable/meaningless in this design.  Of the six valid states:

  • 0. and 1. represent reset states.  The simulator quickly departs from these, mostly staying in 2-5.
  • 2. and 3. represent waiting for the viewer.  Sent ID has advanced and we're waiting for the viewer to acknowledge receipt and processing.
  • 4. and 5. represent waiting for new events to deliver.  Viewer has caught up and we need to get new events moving.

Events

Raw inputs for the system come from two sources:  requests to queue and deliver events (by the simulator) and EventQueueGet queries (by the viewers).  The former map directly to the 'Send' event in the diagram.

The other events are more complicated:

  • Get_Reset.  If viewer-requested reset is enabled in the simulator, requests with 'ack = 0' (or simply missing acks) are treated as conditional requests to reset before fetching new events.  One special characteristic of this event is that after making its state transition, it injects a follow-up Get_Nack event which is processed immediately (the reason for all the transitions pointing into the next event column).  If the reset is not enabled, this becomes a Get_Nack event, instead.
  • Get_Ack.  If the viewer sends an 'ack' value matching the Sent ID data, this event is generated.  This represents a positive ack and progress can be made.
  • Get_Nack.  All other combinations are considered Get_Nack events with the current Staged Queue events not being acknowledged and likely being resent.
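A sketch of how an incoming EventQueueGet request might map onto those three events (illustrative only; the real dispatch lives in the simulator and is not public):

#include <cstdint>
#include <optional>

// Illustrative classification of an incoming EventQueueGet request into the
// three request-driven events of the diagram; not the actual simulator code.
enum class GetEvent { Get_Reset, Get_Ack, Get_Nack };

GetEvent classify(std::optional<int32_t> ack,   // missing 'ack' == no value
                  int32_t sent_id,
                  bool reset_enabled) {
    if (!ack.has_value() || *ack == 0)
        return reset_enabled ? GetEvent::Get_Reset   // conditional reset;
                                                     // a Get_Nack follows
                             : GetEvent::Get_Nack;   // reset disabled
    if (*ack == sent_id)
        return GetEvent::Get_Ack;     // viewer confirms the staged batch
    return GetEvent::Get_Nack;        // anything else: staged batch resent
}

int main() {
    bool ok = classify(std::nullopt, 5, true) == GetEvent::Get_Reset
           && classify(5, 5, true)            == GetEvent::Get_Ack
           && classify(3, 5, true)            == GetEvent::Get_Nack;
    return ok ? 0 : 1;
}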

Operations

Each event-driven transition leads to a box containing a sequence of one or more micro operations which then drive the machine on to a new state (or one of two possible states).  Operations with the 'c_' prefix are special.  Their execution produces a boolean result (0/1/false/true).  That result determines the following state.

The operations:

  • Idle.  Do nothing.
  • Push.  Add event to Pending Q, if possible.
  • Move.  Clear Staged Queue, increment Sent ID, move Pending Queue and Metadata to Staged Queue and Metadata.
  • Send.  Send Staged Queue as response data to active request.
  • Ack.  Capture request's 'ack' value as Received Ack data.
  • Advance.  Set Client Advanced to true, viewer has made forward progress.
  • c_Send.  Conditional send.  If there's a usable outstanding request idling, send Staged Queue data out and status is '1'.  Otherwise, status is '0'.
  • c_Reset.  Conditional reset.  If Client Advanced not true, status is '0'.  Otherwise, reset state around viewer:
    • Release Staged Queue without sending.
    • Sent ID set to 0
    • Received Ack set to 0
    • Set Client Advanced flag to false
    • Set status to '1'.
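And a sketch of the 'c_Reset' micro-operation, again with illustrative names, returning the 0/1 status that selects the following state (the struct is repeated so the sketch stands alone):

#include <cstdint>
#include <string>
#include <vector>

// Sketch of 'c_Reset' against an illustrative per-agent state struct;
// returns the boolean status that selects the following state.
struct AgentEventState {
    std::vector<std::string> staged;
    bool client_advanced = false;
    int32_t sent_id = 0;
    int32_t received_ack = 0;
};

bool c_Reset(AgentEventState& s) {
    if (!s.client_advanced)
        return false;                 // status '0': viewer never advanced
    s.staged.clear();                 // release Staged Queue without sending
    s.sent_id = 0;
    s.received_ack = 0;
    s.client_advanced = false;
    return true;                      // status '1'
}

int main() {
    AgentEventState s;
    s.client_advanced = true;
    s.sent_id = 4;
    s.received_ack = 4;
    return (c_Reset(s) && s.sent_id == 0) ? 0 : 1;
}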

 

  • Like 1
  • Thanks 2
Link to comment
Share on other sites

I wish Linden Lab would spend $18 

Quote

.. are you tired of those useless topics that tell you nothing and have dozens of posts in one day?

This plugin will allow users (from selected groups) to ignore topics (on selected forums). Ignored topics won't appear in Activity Streams or search results.

Members can manage their ignored topics on Accounts Settings.

https://www.sosinvision.com.br/index.php?/file/60-ignore-topics/

  • Like 1
  • Haha 1
  • Confused 1
Link to comment
Share on other sites

This is complicated. Can it be simplified?

The intent here is simply to create a reliable pipe with backpressure, similar to a basic TCP connection. If that were possible, a raw TCP connection would be much simpler, but it wouldn't be encrypted. The system used for browser push messages isn't a good match, either. So long-poll HTTPS, treated as unreliable delivery of big packets, it is.

The happy path is fine. All the trouble comes from conditions that force things off the happy path.

Some lost events are more serious to users than others. Loss of a TeleportFinish means a failed teleport. Loss of a CrossedRegion means a failed region crossing. Loss of an EnableSimulator means a missing region. Many SL users have seen those immersion-breaking occurrences, and they are seen as serious bugs. For the last two, the viewer is stuck in an out-of-sync condition that is hard to recover from.

On the other hand, losing a ParcelProperties or an AgentGroupDataUpdate might cause a ban line not to appear in the viewer (it's still enforced; that's server side) or the groups list for a user to be temporarily out of date. Those are not immersion-breaking problems. This argues for classifying events as Essential or Non-Essential, and not losing the Essential ones.

The Essential messages tend to be infrequent. So, if there were separate Pending Queues for Essential and Non-Essential events, with Essential events having priority and having their own small dedicated queue, loss of Essential events could be eliminated. If the viewer is so out of communication with the simulator that Essential events have to be dropped, things are probably headed for a disconnect and logout. (Possible bad case: user sets draw distance to 1024m, and the viewer has to be told about sixty regions, each generating an EnableSimulator message. Is there anything worse than that?) (Related note: the main agent region might have a larger Essential queue than child agent regions. Most of the Essential messages will come from the main agent region. There's an upper memory usage bound of max Essential queue length x max number of main agents, so this can't grow without bound.)
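A sketch of that split-queue idea, with hypothetical names and caps (which events count as Essential would be a simulator-side policy decision):

#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Sketch of the proposed Essential / Non-Essential split; hypothetical names.
class SplitEventQueue {
public:
    SplitEventQueue(size_t essential_cap, size_t bulk_cap)
        : essential_cap_(essential_cap), bulk_cap_(bulk_cap) {}

    // Returns false if the event had to be dropped.
    bool push(const std::string& event, bool essential) {
        if (essential) {
            if (essential_.size() >= essential_cap_) return false;  // should
                                                                    // be rare
            essential_.push_back(event);
            return true;
        }
        if (bulk_.size() >= bulk_cap_) return false;   // drop, as today
        bulk_.push_back(event);
        return true;
    }

    // Essential events are always delivered ahead of bulk events.
    std::vector<std::string> drain() {
        std::vector<std::string> out(essential_.begin(), essential_.end());
        out.insert(out.end(), bulk_.begin(), bulk_.end());
        essential_.clear();
        bulk_.clear();
        return out;
    }
private:
    size_t essential_cap_, bulk_cap_;
    std::deque<std::string> essential_;   // TeleportFinish, CrossedRegion, ...
    std::deque<std::string> bulk_;        // ParcelProperties, ...
};

int main() {
    SplitEventQueue q(/*essential_cap=*/8, /*bulk_cap=*/64);
    q.push("ParcelProperties", /*essential=*/false);
    q.push("TeleportFinish", /*essential=*/true);
    return q.drain().front() == "TeleportFinish" ? 0 : 1;
}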

The lost event metadata item, if and when generated, can be put on the Essential queue, but only one such event should be on that queue at a time, to prevent clogging that queue with debugging data.

As to numbering, it looks like ID numbers are simply sequential starting at 1, which is reasonable. All the server has to be able to do is resend the last poll reply, or advance to the next one. There's no going backwards into the past. That seems to be implicit in the state machine above. It's conceivable that the viewer might get a duplicate of an old poll reply, perhaps due to something at the Apache level, and that needs to be ignored by the viewer. It must not set the ID number back. If the viewer gets ID 101, ID 102, ID 103, ID 101, and then ACKs ID 101 again, the protocol is stuck, because the viewer is asking for a value the server cannot provide. That's probably a NACK condition, for which the previous ID # is sent.

The "reset" mechanism seems to be a backwards compatibility feature. Viewers currently just send back whatever ID they saw last. So a duplicate old event breaks the sequencing. The server can get things un-stuck by going back to 0 and starting over. Maybe. Do existing viewers understand that? Not sure about this.

Link to comment
Share on other sites

A bit more on my original question of when the simulator sends EstablishAgentCommunication. I've implemented what's been described above, following Beq's discussion. I'm testing by logging into the beta grid at the center of Morris. This is the Morris/Ahern/Dore/Bonifacio quadrant, the traditional place for testing region boundary issues. What happens is interesting. At login, I'm facing northeast, and I see Morris and Ahern. Until the avatar moves a few meters in some direction, Dore and Bonifacio don't appear. This is with a long draw distance, so they should all come up.

What's happening down at the message level is slightly strange. Here's  a summary, from my own logging.

02:51:21 [WARN] Region (255232,256256) state change from Idle to Discovered: New
02:51:21 [WARN] Region (255232,256256) state change from Discovered to Connected: Region handshake
02:51:21 [WARN] Region (255232,256256) state change from Connected to SeedCapabilityReceived: Have seed capability from login/teleport
02:51:21 [WARN] Region (255232,256256) state change from SeedCapabilityReceived to CapabilitiesReceived: Capabilities received.
02:51:21 [WARN] Region (255232,256256) state change from CapabilitiesReceived to Live: Live
02:51:22 [WARN] Region (255232,256512) state change from Idle to Discovered: New
02:51:22 [WARN] Region (254976,256256) state change from Idle to Discovered: New
02:51:22 [WARN] Region (254976,256512) state change from Idle to Discovered: New
02:51:23 [WARN] Region (254976,256256) state change from Discovered to Connected: Region handshake
02:51:24 [WARN] Region (255232,256512) state change from Discovered to Connected: Region handshake
02:51:24 [WARN] Region (254976,256512) state change from Discovered to Connected: Region handshake
02:51:26 [WARN] Region (255232,256512) state change from Connected to SeedCapabilityReceived: Seed capability from establish agent communication event
02:51:26 [WARN] Region (255232,256512) state change from SeedCapabilityReceived to CapabilitiesReceived: Capabilities received.
02:51:26 [WARN] Region (255232,256512) state change from CapabilitiesReceived to Live: Live
02:52:25 [WARN] Region (254976,256512) state change from Connected to SeedCapabilityReceived: Seed capability from establish agent communication event
02:52:25 [WARN] Region (254976,256256) state change from Connected to SeedCapabilityReceived: Seed capability from establish agent communication event
02:52:26 [WARN] Region (254976,256512) state change from SeedCapabilityReceived to CapabilitiesReceived: Capabilities received.
02:52:26 [WARN] Region (254976,256512) state change from CapabilitiesReceived to Live: Live
02:52:26 [WARN] Region (254976,256256) state change from SeedCapabilityReceived to CapabilitiesReceived: Capabilities received.
02:52:26 [WARN] Region (254976,256256) state change from CapabilitiesReceived to Live: Live

That's the sequence of attaching to a new region. EnableSimulator starts the process and moves the viewer's region to Discovered state in Sharpview's internal state machine. The UDP connection to the new region is set up. Receiving RegionHandshake causes the viewer to generate RegionHandshakeReply, and advances things to Connected state. Happy path so far.
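For readers following along, the states in that log boil down to a simple progression; Sharpview itself is written in Rust, so the following is only an illustrative sketch of the states and the messages that drive each transition:

// Illustrative sketch of the region states seen in the log above; Sharpview's
// actual implementation is in Rust and is not quoted here.
enum class RegionState {
    Idle,                    // not yet known
    Discovered,              // EnableSimulator received, UDP circuit opened
    Connected,               // RegionHandshake received, reply sent
    SeedCapabilityReceived,  // from login/teleport or EstablishAgentCommunication
    CapabilitiesReceived,    // seed capability queried for the full set
    Live                     // object updates flowing
};

int main() { return static_cast<int>(RegionState::Idle); }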

Next is getting the seed capability. For the login region, that came with the login, so we get to go directly to fetching capabilities. For neighbor regions, we have to wait for an EstablishAgentCommunication event, sent via the event poller to the avatar's region. Ahern responds in about 2 seconds, at 02:51:25. Dore and Bonifacio, though, don't respond yet.

A minute later, at 02:52:26 I move the avatar a few meters. Doesn't matter which direction the avatar moves or which direction they are looking. Then the EstablishAgentCommunication events for Dore and Bonifacio come in, the region goes live in the viewer, object updates start pouring in, and those regions get drawn.

This is a strange place in the protocol to get stuck. This doesn't happen with Firestorm. So I suspect there's something I have to do to kickstart the new region. Maybe send it an agent update?

Also, once in a while testing this, I get a 403 Forbidden status from the login server. A second try works. This leads me to suspect I didn't shut something down properly on logout.

(On the OtherSimulator, EstablishAgentCommunication comes in before RegionHandshake, incidentally.)

[Image: the Morris hub, beta grid, rendered in Sharpview]

The Morris hub, beta grid, all four quadrants, in Sharpview. So, approximately the right stuff is happening. The devil is in the details.

  • Like 1
  • Thanks 1
Link to comment
Share on other sites

The diagram is very nice, and while it brings some understanding of how things work, especially sim-server side, it does not give any clue about the various timings and potential races encountered server-side (sim server, Apache server, perhaps even the Squid proxy?)...

You can have the best-designed protocol at the sim server level, but if it suffers from races due to communications with other servers and/or because of weird network routing issues (two successive TCP packets might not take the same route) between viewer and servers, you will still see bugs in the end.

What we need is a race-resilient protocol; this will likely involve redoing the server and viewer code to implement a new ”reliable” event transmission (*), especially for essential messages such as the ones involved in sim crossings, TPs, and sim connections. I like animats' suggestion to split the message queues; we could keep the current event poll queue (for backward compatibility's sake and to transmit non-essential messages such as ParcelProperties & co), and design/implement a new queue for viewers with the necessary support code, where the essential messages would be exchanged with the server (the new viewer code would simply ignore such messages transmitted over the old, unreliable queue).

(*) One with a proper handshake, and no timeout, meaning a way to send ”keep-alive” messages to ensure the HTTP channel is never closed on timeout. Or perhaps... resuscitating the UDP messages that got blacklisted, because the viewer/server ”reliable UDP” protocol is pretty resilient and indeed reliable !

 

44 minutes ago, Monty Linden said:

It TPs successfully.

Try the latest Cool VL Viewer releases (v1.30.2.32 & v1.31.0.10): they implement your idea of restarting a poll before the current poll would time out, and use my ”danger window” and TP request delaying/queuing to ensure the request is only issued after the poll has indeed been restarted anew. It works beautifully (I did not experience a single TP failure in the past week, even when trying to race it and TPing just as the poll times out). The toggle for the TP workaround is in the Advanced -> Network menu. 😉

Edited by Henri Beauchamp
Link to comment
Share on other sites

3 hours ago, Monty Linden said:

It's going to be a bit before I can dig through it all and find out what was actually implemented.  But there's likely a "good enough" answer in libremetaverse.  It TPs successfully.

 

2 hours ago, Henri Beauchamp said:

Try the latest Cool VL Viewer releases (v1.30.2.32 & v1.31.0.10): they implement your idea of restarting a poll before the current poll would time out, and use my ”danger window” and TP request delaying/queuing to ensure the request is only issued after the poll has indeed been restarted anew. It works beautifully (I did not experience a single TP failure

OK, so there is a workaround. Documentation would be appreciated. Does this "restarting a poll" refer to restarting it before the Curl library retries, or restarting it before the simulator finishes the HTTP transaction with an error or empty reply? I don't have Curl in the middle, and have full control of timing. Currently I have the poll connection timeout set to 90 seconds, and the server side always times out first.

2 hours ago, Henri Beauchamp said:

What we need is a race-resilient protocol...

Definitely for events. The polled event channel should come with a firm guarantee that all "Essential" events are delivered exactly once, in order, absent a simulator, viewer, or network outage. This eliminates a large number of cases which probably don't consistently work right and are really hard to think about.

Notes:

In-order delivery, reliable delivery, no head-of-line blocking - pick two. You cannot have all three.

  • TCP picks the first two, as should the event poller.
  • SL's reliable UDP system picks the second two.
  • Unreliable UDP messages (TerseImprovedObjectUpdate and AgentUpdate, mostly)  have only the last property. Those are loss-tolerant - if you lose one, there will be another one along soon, with later info about where the object or avatar is.

Anything that doesn't have in-order delivery has to be commutative - it must not matter whether A arrives before B. This is mostly the case. It adds some special handling for some operations. Viewers have to handle object updates for child prims arriving before parent prims, for example. (Which they do, frequently. I track "orphans" in Sharpview and log any whose parents never show up. Out of order is common, but the parents do always show up and claim the child prims.) SL's protocols mostly seem to have this right, but I think there can be minor out-of-order problems that affect single prims.

If it turns out that some UDP message really does require in-order delivery, it's a good candidate to be moved over to the polled event channel. A reliable event channel provides a path to a reasonably easy fix when some order-dependent problem is found. The old messages on the "UDP blacklist" already made that move.

 

Edited by animats
  • Thanks 1
Link to comment
Share on other sites

10 hours ago, animats said:

OK, so there is a workaround. Documentation would be appreciated. Does this ”restarting a poll” refer to restarting it before the Curl library retries, or restarting it before the simulator finishes the HTTP transaction with an error or empty reply? I don't have Curl in the middle, and have full control of timing. Currently I have the poll connection timeout set to 90 seconds, and the server side always times out first.

The documentation is in the code... 😛

OK, not so easy to get a grasp on it all, so here is how it works (do hold on to your hat ! 🤣 ) :

  1. I added a timer for event poll age measurement; this timer (one timer per LLEventPoll instance, i.e. per region) is started as soon as the viewer launches a new request, and is then free-running until a new request is started (at which point it is reset). You can visualize the agent region event poll age via the ”Advanced” -> ”HUD info” -> ”Show poll request age” toggle.
  2. For SL (OpenSim is another story), I reduced the event poll timeout to 25 seconds (configurable via the ”EventPollTimeoutForSL” debug setting), and set HTTP retries to 0 (it used to be left unconfigured, meaning the poll was previously retried ”transparently” by libcurl until it decided to time out by itself). This allows the viewer to time out on poll requests viewer-side, before the server would itself time out (as it would after 30s). Ideally, we should let the server time out on us and never retry (this is what is done and works just fine for OpenSim), but sadly, even when setting HTTP retries to 0, libcurl ”disobeys” us and sometimes ”transparently” retries the request once (probably because it gets a 502 error from SL's Apache server, while this should be 499 or 500, and does not understand it as a timeout, thus retrying instead), masking the server-side timeout from our viewer-side code. This also involved adding ”support” for HTTP 499/500/502 errors in the code, so that these won't be considered actual errors but just timeouts.
  3. In order to avoid sending TP requests (the only kind of event the viewer originates and may therefore decide to send as it sees fit, unlike sim crossing events, for example) just as the poll request is about to time out (causing the race condition, which prevents receiving the TeleportFinish message), I defined a ”danger window” during which a TP request by the user is delayed until the next poll request for the agent region is fully/stably established. This involves a delay (adjustable via the ”EventPollAgeWindowMargin” debug setting, defaulting to 600ms), which is subtracted from the configured timeout (”EventPollTimeoutForSL”) to set the expiry of the free-running event poll timer (note: expiring an LLTimer does not stop it, it just flags it as expired), and which is also used after the request has been restarted as a minimum delay before which we should not send the TP request either (i.e. we account for the time it takes for the sim server to receive the new request, which depends on the ”ping” time and the delay in the Apache server). Note that since the configured ”EventPollAgeWindowMargin” may be too large for a delay after a poll restart (I have seen events arriving continuously at 200ms intervals or so, e.g. when facing a ban wall), the minimum delay before we can fire a TP request is also adjusted to be less than the minimum observed poll age for this sim, and I also take into account the current frame rendering time of the viewer (otherwise, should the viewer render slower than events come in, we would not be able to TP at all). Once everything is properly accounted for, this translates into a simple boolean value returned by a new LLEventPoll::isPollInFlight() method (true meaning ready to send requests to the server; false meaning not ready, must delay the request; a simplified sketch of this check follows the list). In the agent poll age display, an asterisk ”*” is added to the poll age whenever the poll ”is not in flight”, i.e. we are within the danger window for the race condition.
  4. I added a new TELEPORT_QUEUED state to the TP state machine, as well as code to queue a TP request triggered by the user whenever isPollInFlight() returns false, and to send it just after it returns true again.
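Here is the simplified sketch of that ”danger window” test promised above, with approximated names and defaults; the real LLEventPoll::isPollInFlight() in the Cool VL Viewer accounts for more factors than this:

#include <algorithm>

// Much-simplified sketch of the "danger window" test described in step 3;
// the real LLEventPoll::isPollInFlight() accounts for more factors.
struct PollState {
    double poll_age_sec = 0.0;          // time since the current request started
    double timeout_sec = 25.0;          // EventPollTimeoutForSL
    double margin_sec = 0.6;            // EventPollAgeWindowMargin
    double min_observed_age_sec = 5.0;  // smallest poll age seen for this sim
    double frame_time_sec = 0.02;       // current viewer frame render time
};

// true  -> safe to fire a TP request now
// false -> inside the danger window, queue the TP until the poll restarts
bool is_poll_in_flight(const PollState& s) {
    // Too close to the viewer-side timeout: the reply could race the restart.
    if (s.poll_age_sec >= s.timeout_sec - s.margin_sec)
        return false;
    // Too soon after a restart: give the server time to see the new request,
    // but never demand more than this sim has ever needed (minus frame time).
    double settle = std::min(s.margin_sec,
                             s.min_observed_age_sec - s.frame_time_sec);
    return s.poll_age_sec >= settle;
}

int main() {
    PollState s;
    s.poll_age_sec = 24.7;              // inside the window: TP would be queued
    return is_poll_in_flight(s) ? 1 : 0;
}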

With the above workaround, I could avoid around 50% of the race conditions and improve the TP success rate, but it was not bullet-proof... Then @Monty Linden suggested starting a (second) poll request before the current one expires, in order to ”kick” the server into a resync. This is what I did, as follows:

  1. When the TP request needs to be queued because we are within the ”danger window”, the viewer now destroys the LLEventPoll instance for the agent region and recreates one immediately.
  2. When an LLEventPoll instance is deleted, it nevertheless keeps its underlying ”LLEventPollImpl” instance alive until the coroutine running inside that LLEventPollImpl finishes, and it sends an abort message to the llcorehttp stack for that coroutine (which is suspended, waiting for the HTTP reply to the poll request). As implemented, the abort only takes effect on the next frame, because it goes through the ”mainloop” event pump, which is checked at the start of each new render frame. So the server will not see the current poll request closed by the viewer until the next viewer render frame and, as far as it is concerned, that request is still ”live”.
  3. Since a new LLEventPoll instance is created as soon as the old one is destroyed, the viewer immediately launches a new coroutine with a new HTTP request to the server: this coroutine immediately establishes a new HTTP connection with the server, then suspends itself and yields back to the viewer's main coroutine. Seen from the server side, this indeed results in a new event poll request arriving while the previous one is still ”live”, and this triggers the resync we need (see the sketch below).
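
A minimal sketch of this restart trick, assuming a hypothetical owner object for the agent region's poll (the class, member and constructor names below are placeholders, not the actual LLEventPoll API):

#include <memory>
#include <string>

// Placeholder for the poll wrapper: its constructor starts the poll
// coroutine and its HTTP request; its destructor queues an abort for the
// suspended coroutine via the ”mainloop” event pump, which only takes
// effect on the next render frame.
struct EventPollSketch
{
    explicit EventPollSketch(const std::string& cap_url) { /* start coroutine */ }
    ~EventPollSketch() { /* abort queued, effective next frame */ }
};

class AgentRegionSketch
{
public:
    // Called when a TP request lands inside the ”danger window”.
    void restartEventPoll()
    {
        // The old request will only be aborted on the next render frame,
        // so the server still sees it as live for now...
        mEventPoll.reset();
        // ...while this new instance immediately launches a new coroutine
        // and a new HTTP poll request: server side, a second poll arrives
        // while the first is still open, which triggers the resync.
        mEventPoll = std::make_unique<EventPollSketch>(mPollURL);
    }

private:
    std::string mPollURL;                        // event poll capability URL
    std::unique_ptr<EventPollSketch> mEventPoll; // agent region's event poll
};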

With this modification done, my workaround is now working beautifully... 😜

 

Edited by Henri Beauchamp
  • Thanks 1
Link to comment
Share on other sites

7 hours ago, Henri Beauchamp said:

Ideally, we should let the server time out on us and never retry (this is what I do for OpenSim, and it works just fine there), but sadly, even with HTTP retries set to 0, libcurl ”disobeys” us and sometimes ”transparently” retries the request once (probably because it gets a 502 error from SL's Apache server, where a 499 or 500 would be expected, and does not interpret it as a timeout, so it retries instead), masking the server-side timeout from our viewer-side code. This also involved adding ”support” for HTTP 499/500/502 errors in the code, so that these are treated as timeouts rather than actual errors.

Hm. Is there an actual need to ever time out viewer side, or is that just a workaround for libcurl's internal problems?

In Sharpview, I'm just doing a long 90 second poll, and the HTTP/HTTPS library (Rust's "ureq" crate) obeys whatever timeout I give it and does not do retries on its own. So I don't have that problem. Do I need all this "danger timer" stuff?

Edited by animats
Link to comment
Share on other sites

32 minutes ago, animats said:

I'm just doing a long 90 second poll, and the HTTP/HTTPS library (Rust's ”ureq” crate) obeys whatever timeout I give it and does not do retries on its own. So I don't have that problem.

The timeout happens server-side after 30s without an event. If you do not observe this at your viewer code level with a 90s configured timeout, then you are also the victim of ”silent retries” by your HTTP stack.

Fire up Wireshark (with a filter such as ”tcp and ip.addr == <sim_ip_here>”), launch the viewer and observe: when nothing happens in the sim (no event message) for 30s after the last poll request was launched, you will see the connection closed (FIN) by the server, and there, the Rust HTTP stack is likely doing just what libcurl does: retrying the request ”silently” with SL's Apache server...

Note that you won't observe this in OpenSim; I think this weird behaviour is due to the 499 or 500 errors ”in disguise” (a 499/500 reported in the body, but a 502 in the header) that we often get from SL's Apache server (you can easily observe those by enabling the ”EventPoll” debug tag in the Cool VL Viewer: errors are then logged with both the header error number and the body)...

Edited by Henri Beauchamp
Link to comment
Share on other sites

5 minutes ago, Henri Beauchamp said:

The timeout happens server-side after 30s without an event. If you do not observe this at your viewer code level with a 90s configured timeout, then you are also the victim of ”silent retries” by your HTTP stack.

I get only server-side timeouts, expressed as an HTTP completion with some status. The viewer side never times out with a 90-second viewer-side timeout. If you can do a long poll, you're good. Remember, there's keep-alive down at the TCP level.

There are way too many different timeout statuses. HTTP 500, 502, and just sending an empty reply all show up from SL. The Other Simulator follows the spec and sends 502 when it has nothing to say.

Link to comment
Share on other sites

22 minutes ago, animats said:

I get only server-side timeouts, expressed as an HTTP completion with some status.

This is good for the timeout part, then (and proof that libcurl is the culprit for the silent retries we get in C++ viewers).

 

22 minutes ago, animats said:

Remember, there's keep-alive down at the TCP level

Nope, not for poll requests... IIRC, only a few capabilities were configured with HTTP Keep-Alive (e.g. GetMesh2).

 

However, even though you get the proper server-side timeouts at your Rust code level (which is indeed a good thing), you still have the issue of the race condition occurring during the teardown of the timed-out HTTP poll request (as explained by Monty in the first posts of this very thread): you are still vulnerable to this race condition unless you use the same kind of trick I implemented... or Monty fixes that race server-side... or we get a new ”reliable events” transmission channel implemented (I still think that reviving the old, for now blacklisted, UDP messages would be the simplest way to do it, and would be plenty reliable enough).

 

Edited by Henri Beauchamp
Link to comment
Share on other sites

  • Lindens

On the original post...  I just spent too long reading more depression-inducing code and I still can't give you the absolutely correct answer.  Too many interacting little pieces, exceptions, diversions.  But I think I can get you close to reliable.

  • UseCircuitCode must be sent to the neighbor before anything can work.  This should permit a RegionHandshake from the neighbor.  You may receive multiple RH packets with different details.  For reasons.
  • At least one RegionHandshakeReply must go to the neighbor.  Can multiple replies be sent safely?  Unknown.  This enables interest list activity and other things.
  • Back on the main region, interest list activity must then occur.  IL actions include child camera updates to the neighbors.  These, along with the previous two gates, drive the Seed cap generation process that is part of HTTP setup.
  • The Seed cap generation is an async process involving things outside of the simulator.  Rather than being driven locally in the neighbor, it is driven partially by camera updates from the main region.  No comment.
  • When enough of this is done, the neighbor can (mostly) complete its part of the region crossing in two parallel actions:
    • Respond to the main region's crossing request (HTTP). This includes the Seed capability URL, which will eventually be sent to the viewer, over HTTP, as a CrossedRegion message via main's EventQueue, and
    • Enqueue an EstablishAgentCommunication message to be sent to the event forwarder on the main region to be forwarded up to the viewer via main's EventQueue.

Note that there seems to be a race between the CrossedRegion and EstablishAgentCommunication messages.  I could see these arriving in either order.  But if you see one message, the other should be available as well.
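
To restate those gates as a viewer-side checklist, here's a rough sketch; the tracking structure and names are illustrative, not actual simulator or viewer code:

struct NeighborHandshakeSketch
{
    // Gates from the list above, in rough order.
    bool sent_use_circuit_code    = false; // UseCircuitCode sent to the neighbor
    bool got_region_handshake     = false; // at least one RegionHandshake received
    bool sent_handshake_reply     = false; // RegionHandshakeReply sent back
    bool got_crossed_region       = false; // CrossedRegion via main's EventQueue
    bool got_establish_agent_comm = false; // EstablishAgentCommunication via main's EventQueue

    // The two EventQueue messages may arrive in either order, but if one
    // shows up the other should be available as well: accept whichever
    // comes first and keep reading the queue for the other.
    bool neighborReady() const
    {
        return sent_use_circuit_code && got_region_handshake &&
               sent_handshake_reply &&
               (got_crossed_region || got_establish_agent_comm);
    }
};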

I don't know if this is enough for you to make progress.  If not, send me some details of a failed test session (regions involved, time, agent name, etc.) to the obvious address (monty @) and I'll dig in.  Ideally from an Aditi or Thursday/Friday/Saturday session avoiding bounces.

  • Like 1
Link to comment
Share on other sites
