Monty Linden (Lindens)

Posts posted by Monty Linden

  1. On 9/23/2023 at 4:32 PM, Henri Beauchamp said:

    Cool VL Viewer releases (v1.30.2.28 and v1.31.0.6) published, with my new LLEventPoll code and experimental race condition (partial) workaround for TP failures.

    The new goodies work as follows:

    • LLEventPoll was made robust against the 499 and 500 errors often seen in SL when letting the server time out on its side (which is not the case with LL's current code, since libcurl retries long enough and times out by itself). 502 errors (already accepted for OpenSim) are now also treated as "normal" timeouts for SL. It will also retry 404 errors (instead of committing suicide) when they happen for the Agent's sim (the Agent's sim should never be disconnected spuriously, or at least not after many retries).
    • LLEventPoll now sets HTTP retries to 0 and a viewer-side timeout of 25 seconds by default for SL. This can be changed via the "EventPollTimeoutForSL" debug setting, whose new value is taken into account on the next start of an event poll.

    With these two done, a different experimental mode would be to set the viewer timeout to something that should almost never trigger: 45s or more.  This hands the connection lifecycle mostly to the simulator, allowing it to know when it can truly send, and closes the race.  This should give a 'mostly working' implementation of 'at most once' event delivery.  It will still break on other network failures, as well as at TPs/RCs where the LLEventPoll/LLAgentCommunication endpoints may get cycled.  Not a fix but an experiment.
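    The retry policy described above can be sketched roughly as a decision function. This is purely illustrative, not the actual LLEventPoll code: the status handling follows the bullets, while the function name, the agent-region flag, and the 404 retry cap are invented for the sketch.

```python
# Hypothetical sketch of the retry policy described above -- NOT the
# actual LLEventPoll code. Status handling follows the bullets; the
# names and the 404 retry cap are invented.
def next_action(status, is_agent_region, retries, max_404_retries=10):
    """Decide what the poll loop should do after a response."""
    if status == 200:
        return "deliver"            # events (or an empty timeout body) arrived
    if status in (499, 500, 502):   # server-side timeout variants: just re-poll
        return "repoll"
    if status == 404:
        # Only the Agent's own region warrants persistent retries; a
        # neighbor region answering 404 is presumed gone for good.
        if is_agent_region and retries < max_404_retries:
            return "repoll"
        return "disconnect"
    return "error"                  # anything else: surface it
```

    The key design point is that timeout-shaped errors never terminate the poll; only a persistent 404 on a non-agent region does.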

  2. On 9/22/2023 at 8:14 PM, animats said:

     

    "At least once" delivery

    The most important change is the move to "at least once" delivery instead of "no more than once". That alone will solve some major problems. There may be some viewer-side problems with duplicate events, but those can be detected viewer side. The events

    • EstablishAgentCommunication
    • CrossedRegion
    • TeleportFinish

    are probably the ones that need viewer-side duplicate detection, because they all imply major viewer side state changes. No big problem detecting that viewer side. Duplicates are far better than losing any of those crucial messages. Losing one of those means a viewer stalled at login, region crossing, teleport, or neighbor region appearance.

    Are there any others not in that list for which processing a duplicate message causes serious trouble?

    Duplicates are easy to remove for the simple case (viewer and sim in agreement).  The case where LLEventPoll gets recreated, losing its memory of previous 'ack'/'id' progress, still needs work.  The hope is to remove duplication at that level, where it happens in one place.  If it has to happen at the level of individual events/messages, that's duplicated work now and complexity in the future when more are added.

    Not certain about completeness of that list yet.  The survey is pretty complete but I haven't checked yet.
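    For the simple in-agreement case, poll-level dedup can be as small as remembering the last seen id. A sketch, assuming (as discussed later in the thread) that ids ascend within a region; the recreated-LLEventPoll and server-reset cases would defeat exactly this kind of memory:

```python
# Illustrative poll-level dedup, assuming ids ascend within a region.
# "At least once" delivery means the same id can arrive twice; a
# recreated LLEventPoll (or a server-side reset) would defeat this.
class EventDeduper:
    def __init__(self):
        self.last_id = None

    def accept(self, poll_id, events):
        """Return the events if this id is new, else drop the duplicate."""
        if self.last_id is not None and poll_id <= self.last_id:
            return []
        self.last_id = poll_id
        return list(events)
```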

    On 9/22/2023 at 8:14 PM, animats said:

    Event poll timeout semantics

    Henri and I were arguing above over how to handle this. It's even worse than either of us thought.

    I'll plug again for the server sending an empty array of events when it has no events to send. Stay on the happy path of HTTP status 200 if at all possible.

    For various reasons, changing the server side is going to be painful (look at lliohttpserver.cpp in the viewer) and the viewer needs to retry on 499/5xx receipt anyway.  I think if this were changed, I'd just flip the timeout response to be the same as for a duplicate connection:  200 with naked 'undefined' as LLSD value.  But this current behavior is shared with all simulator and dataserver http endpoints as it is based on the common HTTP server 'framework'.  There needs to be more awareness of this and our docs don't help.

    Viewer timeout was still the real problem as that's what introduced the race condition for the 'at most once' case.  Without it, loss was still possible but would have happened more rarely and no one would ever have found it.  So this is a good thing.  (Ha)

    On 9/22/2023 at 8:14 PM, animats said:

    Event order

    Once we have at-least-once delivery, is it worth guaranteeing in-order delivery of event poller events? Out of order delivery is a feature on the UDP message side, but it doesn't seem to add value on the HTTP event poller side. Out of order delivery makes this much harder to think about and adds code complexity. What is the server-side benefit of out of order event queuing? Something to think about. A discussion topic at this point, not a proposal.

    It's very deliberately ordered right now but it was doing event combining in the past, overwriting certain existing events when new ones of the same type were queued.  The way it was done, it also resulted in out-of-order delivery.  But that was removed long ago.  Combined with disallowing multiple connections at once, order is now preserved.  I don't know that we'd ever want it back.  Except for the fact that a lot of data can flow through this pathway and it does overflow.  Combining might reduce that.  Not really proposing it, just making it clear that there were decisions here.  I'm trying to leave enough breadcrumbs for future debates.

  3. Information is still coming in (*really* looking forward to AWS' explanation).  HTTP-Out was running at elevated levels (including higher error rates) from 21:30 SLT yesterday until 2:45 SLT today.  That's now running as expected.  Teleports remained unreliable (~80% successful) until around 6:30 SLT today.  They've now recovered.  Lingering issues are likely and we do want to hear about them.  Please contact support.

  4. 3 hours ago, animats said:

    Hm. If you need more checking viewer side for that, just ask. Is this UDP side or event poller? I see sequential numbers from the SL simulators for both. The Other Simulator does sometimes skip sequence numbers.

    Event poller.  I suspect it's a case where the simulator has tossed the LLAgentCommunication state after a region crossing and avatar drift.  But the viewer keeps its LLEventPoll object alive so viewer and sim are now desynchronized.  Haven't dug into it yet - future project.  Even the 3e6 dropped events per day are just 4 / hour / region when amortized.

  5. 37 minutes ago, animats said:

    I'd like to plug again for the idea that, in the unusual event the protocol has to drop something due to overload, put something in the event stream that tells the viewer that data was lost.

    I want to do enough eventually to detect and log that your peer is doing something unexpected.  Might just be an 'event lost' flag or a sequence number scheme.  Not certain it will go as far as a digest of failure.  Open to suggestions.

    Simulator is subject to amnesia where it necessarily tosses connection information away when an agent departs.  When the agent comes back, simulator starts over.  Viewer, on the other hand, may keep simulator information around for the life of the session.  The resetting needs to be treated correctly.
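    One way to treat that resetting correctly, sketched under the assumption (made elsewhere in this thread) that ids normally ascend: an id that goes backwards signals simulator-side amnesia rather than a duplicate, so the viewer should resynchronize instead of dedup-dropping.

```python
# Hypothetical reset detection: ids normally ascend, so an id that goes
# backwards suggests the simulator rebuilt its LLAgentCommunication
# state and the viewer should resynchronize, not drop the response.
def on_poll_id(last_id, new_id):
    if last_id is not None and new_id < last_id:
        return "server-reset"   # clear local progress, adopt new_id as baseline
    return "in-sequence"
```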

  6. 5 hours ago, Henri Beauchamp said:

    So, we will have to wait for Monty to fix the server side of things... 😛

    This is going to be fun.  One of the recent discoveries:  simulator only uses the supplied 'id' value for logging.  It has no functional effect in the current event queue handling scheme.

    I have some more behavioral probing to do, particularly around the bizarre handling of dropped connections by apache, but a proposal is coming.  This is going to require TPV testing against test regions given that things like 'id' will become semantically meaningful again.

  7. My plan for a plan is roughly:

    • Phase 1.  Simulator changes, compatible with viewers, to make it more robust.  Might include temporary change in timeout to 25 or 20 seconds.
    • Phase 2.  Robust event transfer.  Might require viewer changes.

    There's a change in here from "at most once" logic (i.e. viewers see an event at most once but possibly never) to "at least once" (event may be sent several times under the same 'id'/'ack' and viewer needs to expect that).  Don't know where this best fits in.  I'm hoping P1 but that might break something.

    Don't know if we can re-enable the old message.  Or do resend asks.  I really want to just make forward fixes and not add patches to the patches.

  8. 1 hour ago, Henri Beauchamp said:

    In fact, I could verify today that this scenario cannot happen at all in SL. I instrumented my viewer code with better DEBUG messages and a timer for event poll requests.

    You might be monitoring at the wrong level.  Try llcorehttp or wireshark/tcpdump as a second opinion.  The simulator will not keep an event-get request alive longer than 30s.  If you find it is, it is either retries masking the timeout or yet more simulator bugs.  Or both.  There are always simulator bugs.

    (Side note:  once 'Agent Communication' is established, events queue regardless of whether or not an event-get is active.  Up to a high-water level.  So it is not required that an event-get be kept active constantly.  It is required that no more than one be active per region - this is not documented well.)
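    The queue-regardless behavior in that side note might look something like this toy model. The real high-water level and drop policy are simulator-internal and not documented; the numbers here are invented.

```python
# Toy model of the side note: events accumulate whether or not an
# event-get is outstanding, up to a high-water mark; overruns are
# dropped silently. The real limit and policy are simulator-internal.
class AgentEventQueue:
    def __init__(self, high_water=100):
        self.high_water = high_water
        self.events = []
        self.dropped = 0

    def push(self, ev):
        if len(self.events) >= self.high_water:
            self.dropped += 1    # lost on the floor, never regenerated
            return False
        self.events.append(ev)
        return True

    def drain(self):
        """Called when an event-get is serviced."""
        out, self.events = self.events, []
        return out
```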

  9. 27 minutes ago, Henri Beauchamp said:

    All the TP failure modes I get happen before event polls are even started: the UDP message from the arrival sim just never gets to the viewer, so the latter is "left in the blue" and ends up timing out on the departure sim as well...

    There are several failure scenarios including one where the TeleportFinish message is queued but the simulator refuses to send it for reasons unknown yet.  A more elaborate scenario is this:

    • [Viewer]  Request timeout fires.  Begins to commit to connection tear down.
    • [Sim]  Outside of Viewer's 'light cone', queues TeleportFinish for delivery, commits to writing events to outstanding request.
    • [Sim]  Having prepared content for outstanding request, declares events delivered and clears queue.
    • [Apache]  Adds more delay and reordering between Sim and Viewer because that's what apache does.
    • [Viewer]  Having committed to tear down, abandons request dropping connection and any TeleportFinish event, re-issues request.
    • [Viewer]  Never sees TeleportFinish, stalls, then disconnects from Sim.
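    The window in that timeline comes down to the two timers. A toy calculation (values illustrative, not from the protocol): if the simulator can commit a write any time up to its own timeout while the viewer stops reading at its own, everything the sim commits after the viewer gives up is lost.

```python
# Toy model of the race above: the sim may commit events any time up to
# sim_timeout; the viewer abandons the request at viewer_timeout. Any
# commit landing in between is lost. Values are illustrative.
def race_window(viewer_timeout, sim_timeout):
    return max(0, sim_timeout - viewer_timeout)
```

    With both ends at 30s (as observed later in the thread) the window is nominally zero, but jitter and the differing lifecycle spans make it real; the 45s-or-more viewer timeout from the earlier experiment pushes the window to zero with margin, which is the point of that experiment.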
  10. 53 minutes ago, Henri Beauchamp said:

    Yes, it is indeed as bad as it looks... This said, my modified code performs just fine in both SL and OpenSim now, and the failed TP issues still seen are not related with event polls anyway (event polls are simply retried on timeouts).

    Well... you don't really know that unless what you've received matches what was intended to be sent.  And it turns out, they don't necessarily match, and retries are part of the problem.  I already know event-get is one mode of TP/RC failure.

  11. On 9/10/2023 at 12:43 AM, animats said:

    As in RSS? Right, there it's interpreted as "I've seen and processed everything through #117, send me anything from #118 and later."

    No, simpler, as in HTTP.  To support very basic 'If-Match'/'If-None-Match' tests.  The one special case is where the simulator side resets and this may need to be acted upon. 

    On 9/10/2023 at 12:43 AM, animats said:

    Incidentally, on a timed out event poll, SL does not send the documented 502 status; it just closes the connection. (The Other Simulator does send a 502). Timeout occurs around 58 seconds on both SL and the Other Simulator.

    This keeps getting worse the more I look.  So *both* viewer and simulator implement a 30-second timeout on these requests.  A guaranteed race condition.  The simulator's timeout spans a different part of the request lifecycle than the viewer's curl timeout does.  More variability.  The simulator can send at least three different responses related to its timeout actions:

    • A 200 status with empty body and no content-type header
    • A 200 status with llsd content-type and a body of:  '<llsd><undef /></llsd>'
    • A 499 status with html content-type and a body I haven't looked at but it's about 400 bytes.

    Between viewer and simulator is an apache stack that adds a layer of interpretation, time dependencies, and other factors to something that's already a hodge-podge.

    If a 502 is returned anywhere, it's probably an accident.
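    Those three response shapes can be told apart mechanically. A hypothetical viewer-side classifier; note the content-type string is an assumption (application/llsd+xml is the usual LLSD XML mime type, but the post only says "llsd content-type"), and the 499 body remains unexamined per the list above:

```python
# Hypothetical classifier for the three timeout responses listed above.
# The llsd content-type string is an assumption (application/llsd+xml
# is the usual LLSD XML mime type); verify against the actual header.
def classify_timeout(status, content_type, body):
    if status == 200 and not body:
        return "empty-timeout"          # 200, no body, no content-type
    if status == 200 and content_type == "application/llsd+xml" \
            and body.strip() == "<llsd><undef /></llsd>":
        return "undef-timeout"          # 200 with a naked LLSD undef
    if status == 499:
        return "499-timeout"            # html body, ~400 bytes, unexamined
    return "other"
```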

  12. 46 minutes ago, animats said:

    Those are used for something? For a while, I was sending the string "TEST" instead of the proper LLSD message with ID number, and it still worked. Sequentially numbered events showed up. (SL has consistent event ascending numbers from 1; the Other Simulator will sometimes skip a number, and starts from some arbitrary large integer.)

    The event queue has ordered reliable delivery, which is useful where order matters.

    Well, LLSD is meant to allow slop (the C++ and Python implementations particularly).  But there are arguments other than id/ack in the body.  They're not currently used in the SL viewer, so documentation is diddly squat.

    The whole id/ack thing is badly done.  It should have been treated more like an ETag.  As it is, old events can get a new id, among other sins.  I think the incremental nature is an accident, not a contract.  And it will reset to 0 for a region/agent pair in certain circumstances (including that viewer-initiated reset).  So, beware.  I hope to document this.

    And it's not necessarily ordered-reliable.  The event queue was very deliberately designed to implement event compaction/overwrite in place and, at some point in time, it was neither ordered (between events) nor reliable.  Someone disabled that UDP-like behavior at some point and I'm not certain why and if that has contract status at this point.  In fact, it is still unreliable now as we'll drop queue overruns on the floor.  They do not get regenerated unless a higher-level application protocol times out and causes a regeneration from a new query or other re-synchronizing action.

    Future hope is to homologate this in-place and as-is with minimal optional extensions that would also allow each end to monitor how badly the other side is messing up.
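    What "more like ETag" could mean in practice, as a sketch. This is entirely hypothetical and NOT the current protocol: the server holds a batch of events under a token and clears it only when the client echoes that token back, which gives clean at-least-once delivery with an unambiguous resend rule.

```python
# Entirely hypothetical sketch of an ETag-style ack -- NOT the current
# protocol. The server holds a batch under a token and clears it only
# when the client echoes that token back.
def handle_poll(state, client_ack):
    """state: {'token': int, 'batch': list}. Returns (token, events)."""
    if client_ack == state["token"]:
        state["batch"] = []        # confirmed delivered; await fresh events
        state["token"] += 1        # opaque in spirit; a counter for the sketch
    return state["token"], list(state["batch"])
```

    A stale or missing ack simply resends the same batch under the same token, so duplicates are detectable by token alone rather than per-event inspection.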

     

  13. Pulling out the true story is going to require reverse engineering and documentation on our part.  Henri's logging is something that should probably have been available in the SL viewer from day 1.  But here we are.  And as always, never mistake anecdotes for a contract.  You might see the same pattern ten times in a row but that doesn't necessarily mean there's a temporal constraint.  Especially in our code.

    You may not need quite as much deferred logic as hinted at.  Once you send the UCC into the neighbor, there will be a RegionHandshake in your future.  You may be able to delay the RHR (I mean, it's async anyway so pretend you're on Mars).  What I'd meant before was that after sending UCC (and probably better after RH receipt), send AgentUpdates into the Host region.  This is one of several paths that drives a private protocol between simulators that causes a neighbor to emit the EAC message, which is routed back via the host region, that you need to begin HTTP operations with the neighbor in anger.

    As always, you shouldn't entirely trust me on this.  I haven't walked through all of this to verify what's accidental and what's a true constraint or dependency.  HTTP was bolted onto the old UDP message system.  It's a Frankenstein monster.  'AgentCommunication' isn't just an abstract concept in the simulators, it's a physical class LLAgentCommunication and they are tied together (also hints that the people who did this are not the ones who should have been allowed to do it).  This class 'encapsulates' all the capabilities as well as the low-level plumbing for HTTP bridging of LLMessage comms.  This thing isn't actually tied to an agent strongly.  Nor to circuits (it doesn't really know about them).  It can live on after an agent has departed.  And yet the viewer has the ability to tear down this entity via arguments to 'EventQueueGet' requests.  (This resets the 'id'/'ack' value on the query among other things.)  It's quite the thing.

    • Thanks 2