Monty Linden

Posts posted by Monty Linden

  1. On 10/30/2023 at 11:27 PM, animats said:

    >> But Dore's seed capability invocation by the viewer is delayed.  I'd normally blame the viewer for the delay.

    > No, as soon as the viewer gets EstablishAgentCommunication, it makes the HTTPS request for the initial region capabilities. I have logging of that; I just didn't send an entire log file.

    > How does the viewer "invoke the seed capability"? RegionHandshakeReply is done, and we know the simulator got it, because object updates started coming in.

    This is where the delay shows up.  For Dore from the server side, at 06:06:51, the seed capability (cap-granting cap) is constructed and should be available in an EAC message.  At 06:07:55, the server handles an invocation of the seed capability (POST to the seed cap) which generates the full capability set.

    The delay sits between those two points.  I'll take it as a given from the viewer log that the viewer receives *an* EAC message by 06:07:55, though it may not be the original message.  Was the delay caused by:

    • Additional gate on sending first EAC message
    • Sim-to-sim loss of EAC message requiring regeneration
    • Sim-to-viewer loss of EAC message on EventQueue or elsewhere requiring regeneration
    • Delay in receiving or processing response somewhere
    On 10/30/2023 at 11:27 PM, animats said:

    >> but it must be driven by IL activity forwarded from main (Morris).  That is where I'd look.

    > "IL activity"? What is that?

    > There are definitely plenty of AgentUpdate messages being generated. The avatar can be moved around within Morris, and another viewer in Firestorm looking at the same spot sees it moving. Moving the avatar around, or keeping it still, does not seem to affect the 1 minute delay. Is there something else that the viewer must send to the main agent simulator?

    AgentUpdate messages to main region are required to drive updates into child regions.  Resulting Interest List activity there drives EAC message generation (and re-generation).  The only obvious artificial delay is the mentioned 5-second retry throttle.  The code is a tangle but there is no obvious 1min throttle involved.

     

    On 10/30/2023 at 11:27 PM, animats said:

    > The 1 minute delay (± 2 secs) before EstablishAgentCommunication shows up is very consistent in all these scenarios.

    > More excerpts from the same log summarized previously:

    > 06:06:53 [WARN] EventEstablishAgentCommunication, unparsed: Map({"seed-capability": String("CENSORED"), "agent-id": UUID(CENSORED), "sim-ip-and-port": String("54.202.5.63:13003")})
    > 06:06:53 [WARN] Establish agent communication event: [MsgServerEvent { region_handle: (255232,256256), event: EstablishAgentCommunication(EventEstablishAgentCommunication { socket_addr: 54.202.5.63:13003, agent_id: CENSORED, seed_capability: "CENSORED" }) }]
    > 06:06:53 [WARN] Establishing agent communication to (255232,256512)
    > 06:06:53 [WARN] Region (255232,256512) state change from Connected to SeedCapabilityReceived: Seed capability from establish agent communication event
    > 06:06:53 [WARN] Fetching capabilities for region (255232,256512)
    > ...
    > 06:06:53 [WARN] Fetched capabilities: RegionCapabilities(RegionCapabilities { start_time: Instant { tv_sec: 2291337, tv_nsec: 167534157 }, region_handle: (255232,256512), region_capabilities: {"GetMesh2": "CENSORED", "GetMesh": "CENSORED", "RenderMaterials": "CENSORED", "GetTexture": "http://asset-cdn.glb.aditi.lindenlab.com", "EventQueueGet": "CENSORED"} })
    > 06:06:53 [WARN] Region (255232,256512) state change from SeedCapabilityReceived to CapabilitiesReceived: Capabilities received.
    > 06:06:53 [WARN] Region [Ahern] (255232,256512) state change from CapabilitiesReceived to Live: Live

    > This is from the same Sharpview log as previously posted. There's been a 1 minute delay between RegionHandshake/RegionHandshakeReply for Ahern before EstablishAgentCommunication was received at the viewer end. That message was rerouted from the viewer's main agent region (Morris) to the neighbor region in the viewer (Ahern). That causes the viewer to request region capabilities "GetMesh2", etc, and those come back within the same second. Ahern then goes to Live state and appears on screen. So the delay is before EstablishAgentCommunication is received, not in fetching capabilities.

    > It's quite possible that Sharpview needs to send something more to wake up the simulator, but I don't know what.

    The fragment above is from Ahern, which is a child region that *doesn't* have the delay.  The capability setup ran through at the desired rate.  So there is a difference between children: one appears to operate normally, two appear delayed.  We don't have enough logging enabled by default for me to tell from this run what the source of the delay is.

    • Thanks 1
  2. I only have default logging active so can only compare some phases.  Ahern:

    Quote

    Oct 26 06:06:51 simhost-0ebe97adfa0d247d0.aditi.secondlife.io simulator[2447]: INFO: init_agent: init_agent - Creating new child agent <animats>
    Oct 26 06:06:51 simhost-0ebe97adfa0d247d0.aditi.secondlife.io simulator[2447]: INFO:#AgentUsher LLAgentCommunication::grantSeedCapability: LLAgentCommunication::grantSeedCapability service URL: http://localhost:12040/service-proxy/cap/grant, private url http://localhost:13003/agent/<animats>/base-capabilities
    Oct 26 06:06:53 simhost-0ebe97adfa0d247d0.aditi.secondlife.io simulator[2447]: INFO: LLInterestList::copyCacheState: global-interest-list provided 1331 cacheable subscriptions to for agent <animats>
    Oct 26 06:06:53 simhost-0ebe97adfa0d247d0.aditi.secondlife.io simulator[2447]: INFO:#CapsInfo LLAgentCommunication::grantCapabilities: Starting grantCapabilities for agent <animats>

    The last line indicates where you've invoked the seed capability and things proceed as expected.

    Dore:

    Quote

    Oct 26 06:06:51 simhost-0ed8234b30a9d6270.aditi.secondlife.io simulator[8982]: INFO: init_agent: init_agent - Creating new child agent <animats>
    Oct 26 06:06:51 simhost-0ed8234b30a9d6270.aditi.secondlife.io simulator[8982]: INFO:#AgentUsher LLAgentCommunication::grantSeedCapability: LLAgentCommunication::grantSeedCapability service URL: http://localhost:12040/service-proxy/cap/grant, private url http://localhost:13003/agent/<animats>/base-capabilities
    Oct 26 06:06:53 simhost-0ed8234b30a9d6270.aditi.secondlife.io simulator[8982]: INFO: LLInterestList::copyCacheState: global-interest-list provided 5749 cacheable subscriptions to for agent <animats>
    Oct 26 06:07:55 simhost-0ed8234b30a9d6270.aditi.secondlife.io simulator[8982]: INFO:#CapsInfo LLAgentCommunication::grantCapabilities: Starting grantCapabilities for agent <animats>

    The seed caps are generated at the same time.  Interest list activity should indicate it is enabled (though I didn't check) so I think the RegionHandshakeReply is complete.  EAC should be clear to send.  But Dore's seed capability invocation by the viewer is delayed.  I'd normally blame the viewer for the delay.  :)  Lost EACs will be resent at a maximum rate of once per 5s but it must be driven by IL activity forwarded from main (Morris).  That is where I'd look.

    (BTW, it may be beneficial to get a key logging setup going with your TLS support.  Viewer will get this in the future but being able to wireshark the https: stream and decode it now and in the future may be very useful.)

    • Thanks 1
  3. On the original post...  I just spent too long reading more depression-inducing code and I still can't give you the absolutely correct answer.  Too many interacting little pieces, exceptions, diversions.  But I think I can get you close to reliable.

    • UseCircuitCode must be sent to the neighbor before anything can work.  This should permit a RegionHandshake from the neighbor.  You may receive multiple RH packets with different details.  For reasons.
    • At least one RegionHandshakeReply must go to the neighbor.  Can multiple replies be sent safely?  Unknown.  This enables interest list activity and other things.
    • Back on the main region, interest list activity must then occur.  IL actions include child camera updates to the neighbors.  These, along with the previous two gates, drive the Seed cap generation process that is part of HTTP setup.
    • The Seed cap generation is an async process involving things outside of the simulator.  Rather than being driven locally in the neighbor, it is driven partially by camera updates from the main region.  No comment.
    • When enough of this is done, the neighbor can (mostly) complete its part of the region crossing in two parallel actions:
      • Respond to the main region's crossing request (HTTP).  This response includes the Seed capability URL, which will eventually be sent to the viewer as a CrossedRegion message via main's EventQueue, and
      • Enqueue an EstablishAgentCommunication message to be sent to the event forwarder on the main region, to be forwarded up to the viewer via main's EventQueue.

    Note that there seems to be a race between the CrossedRegion and EstablishAgentCommunication messages.  I could see these arriving in either order.  But if you see one message, the other should be available as well.
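    To make that race harmless on the viewer side, the handler for either message can be written idempotently so arrival order doesn't matter.  A minimal sketch (not Linden code; the NeighborConnector/onSeedCap names and the URLs are made up for illustration):

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>

    // Region handle as (grid x, grid y), matching the (255232,256256)-style handles in the logs.
    using RegionHandle = std::pair<uint32_t, uint32_t>;

    class NeighborConnector {
    public:
        // Called for either EstablishAgentCommunication or CrossedRegion.  Whichever arrives
        // first triggers the seed-cap POST; the later one becomes a no-op, so the race is harmless.
        void onSeedCap(const RegionHandle& region, const std::string& seedCapUrl) {
            if (!seen_.emplace(region, seedCapUrl).second) {
                std::cout << "seed cap already handled for this region, ignoring\n";
                return;
            }
            // UseCircuitCode / RegionHandshakeReply are assumed to have gone out already;
            // here the viewer would POST to the seed cap to fetch the full capability set.
            std::cout << "POST " << seedCapUrl << " (request region capabilities)\n";
        }
    private:
        std::map<RegionHandle, std::string> seen_;
    };

    int main() {
        NeighborConnector nc;
        RegionHandle dore{255232, 256256};
        nc.onSeedCap(dore, "https://sim.example/cap/seed");  // e.g. from EstablishAgentCommunication
        nc.onSeedCap(dore, "https://sim.example/cap/seed");  // e.g. from CrossedRegion, deduplicated
        return 0;
    }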

    I don't know if this is enough for you to make progress.  If not, send me some details of a failed test session (regions involved, time, agent name, etc.) to the obvious address (monty @) and I'll dig in.  Ideally from an Aditi or Thursday/Friday/Saturday session avoiding bounces.

    • Like 1
  4. Okay, I made a state machine diagram for the simulator end of things based on my current work.  Looking for comments, bugs, suggestions.  (Also interested in better tools for doing this.)  But, briefly, it's an event-driven design with:

    • Four driving events
    • Six states
    • Thirty-four transitions
    • A two-way reset election

    [Image: EventQueueGetStateMachine.png (EventQueueGet state machine diagram)]

    Data Model

    • Event Sequence.  The simulator attempts to deliver a stream of events to each attached viewer.  As an event is presented, it gets a (virtual) sequential number in the range [1..S32_MAX].  This number isn't attached to the event and it isn't part of the API contract.  But it does appear in metadata for consistency checking and logging.
    • Pending Queue.  Events are first queued to a pending queue where they are allowed to gather.  There is a maximum count of events allowed for each viewer.  Once reached, events are dropped without retry.  This dropping is counted in the metadata.
    • Pending Metadata.  First and last event sequence numbers of events currently in the queue as well as a count of events dropped due to quota or other reasons.
    • Staged Queue.  When the viewer has indicated it is ready for a new batch of events, the pending queue and pending metadata are copied to the staged queue and staged metadata.  This collection of events is given an incremented 'id'/'ack' value based on the 'Sent ID' data.  These events are frozen as is the binding of the ID value and the metadata.  They won't change and the events won't appear under any other ID value.
    • Staged Metadata.  Snapshot of Pending Metadata when events were copied.
    • Client Advanced.  Internal state which records that the viewer has made forward progress and acknowledged at least one event delivery.  Used to gate conditional reset operations.
    • Sent ID.  Sequential counter of bundles of events sent to the viewer.  Range of [1..S32_MAX].
    • Received Ack.  Last positive acknowledgement received from viewer.  Range of [0..S32_MAX].
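    As a compact restatement of the above, a hypothetical sketch of the per-viewer data model; the field names mirror the bullet list rather than any actual simulator code:

    #include <cstdint>
    #include <deque>
    #include <string>

    struct Event { std::string name; std::string body; };

    struct QueueMetadata {
        int32_t first_seq = 0;   // first event sequence number currently held
        int32_t last_seq  = 0;   // last event sequence number currently held
        int32_t dropped   = 0;   // events dropped due to quota or other reasons
    };

    struct ViewerEventState {
        std::deque<Event> pending;        // Pending Queue (bounded; overflow drops)
        QueueMetadata     pending_meta;   // Pending Metadata
        std::deque<Event> staged;         // Staged Queue (frozen once bound to an id)
        QueueMetadata     staged_meta;    // Staged Metadata
        bool    client_advanced = false;  // Client Advanced: viewer has acked at least once
        int32_t sent_id         = 0;      // Sent ID, range [1..S32_MAX] once events flow
        int32_t received_ack    = 0;      // Received Ack, range [0..S32_MAX]
    };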

    States

    The States are simply derived from the combination of three items in the Data Model:

    • Staged Queue empty/full (indicated by '0'/'T' above).
    • Sent ID == Received Ack.  Indicates whether the viewer is fully caught up or whether the simulator is still waiting for an updated ack from the viewer.
    • Pending Queue empty/full ('0'/'T').

    A combination of three booleans gives eight possible states but two are unreachable/meaningless in this design.  Of the valid six states:

    • 0. and 1. represent reset states.  The simulator quickly departs from these, mostly staying in states 2-5.
    • 2. and 3. represent waiting for the viewer.  Sent ID has advanced and we're waiting for the viewer to acknowledge receipt and processing.
    • 4. and 5. represent waiting for new events to deliver.  Viewer has caught up and we need to get new events moving.
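    The derivation itself is just a three-bit index over those booleans.  A small sketch (the mapping of raw combinations onto the diagram's 0-5 labels is not spelled out above, so this only shows the idea):

    struct Flags {
        bool staged_nonempty;   // Staged Queue '0'/'T'
        bool caught_up;         // Sent ID == Received Ack
        bool pending_nonempty;  // Pending Queue '0'/'T'
    };

    // Eight raw combinations; per the description, two of them are unreachable in this design.
    int stateIndex(const Flags& f) {
        return (f.staged_nonempty  ? 4 : 0)
             + (f.caught_up        ? 2 : 0)
             + (f.pending_nonempty ? 1 : 0);
    }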

    Events

    Raw inputs for the system come from two sources:  requests to queue and deliver events (by the simulator) and EventQueueGet queries (by the viewers).  The former map directly to the 'Send' event in the diagram.

    The other events are more complicated:

    • Get_Reset.  If viewer-requested reset is enabled in the simulator, requests with 'ack = 0' (or simply missing acks) are treated as conditional requests to reset before fetching new events.  One special characteristic of this event is that after making its state transition, it injects a follow-up Get_Nack event which is processed immediately (the reason for all the transitions pointing into the next event column).  If the reset is not enabled, this becomes a Get_Nack event, instead.
    • Get_Ack.  If the viewer sends an 'ack' value matching the Sent ID data, this event is generated.  This represents a positive ack and progress can be made.
    • Get_Nack.  All other combinations are considered Get_Nack events with the current Staged Queue events not being acknowledged and likely being resent.
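    A sketch of that classification, assuming the request carries an optional 'ack' and the simulator has a viewer-reset feature flag (both are assumptions, named here only for illustration):

    #include <cstdint>
    #include <optional>

    enum class GetEvent { Get_Reset, Get_Ack, Get_Nack };

    GetEvent classify(std::optional<int32_t> ack,    // 'ack' from the request body, if any
                      int32_t sent_id,               // simulator's Sent ID
                      bool viewer_reset_enabled) {   // simulator-side feature flag
        if (!ack.has_value() || *ack == 0) {
            // Conditional reset; when disabled this degrades to a Nack.  When enabled,
            // Get_Reset makes its transition and then injects a follow-up Get_Nack.
            return viewer_reset_enabled ? GetEvent::Get_Reset : GetEvent::Get_Nack;
        }
        return (*ack == sent_id) ? GetEvent::Get_Ack : GetEvent::Get_Nack;
    }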

    Operations

    Each event-driven transition leads to a box containing a sequence of one or more micro operations which then drive the machine on to a new state (or one of two possible states).  Operations with the 'c_' prefix are special.  Their execution produces a boolean result (0/1/false/true).  That result determines the following state.

    The operations:

    • Idle.  Do nothing.
    • Push.  Add event to Pending Q, if possible.
    • Move.  Clear Staged Queue, increment Sent ID, move Pending Queue and Metadata to Staged Queue and Metadata.
    • Send.  Send Staged Queue as response data to active request.
    • Ack.  Capture request's 'ack' value as Received Ack data.
    • Advance.  Set Client Advanced to true, viewer has made forward progress.
    • c_Send.  Conditional send.  If there's a usable outstanding request idling, send Staged Queue data out and status is '1'.  Otherwise, status is '0'.
    • c_Reset.  Conditional reset.  If Client Advanced not true, status is '0'.  Otherwise, reset state around viewer:
      • Release Staged Queue without sending.
      • Sent ID set to 0
      • Received Ack set to 0
      • Set Client Advanced flag to false
      • Set status to '1'.
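    For concreteness, a sketch of the 'Move' and 'c_Reset' micro-operations against a trimmed version of the data model above (metadata handling omitted; not simulator code):

    #include <cstdint>
    #include <deque>
    #include <string>
    #include <utility>

    struct Event { std::string name, body; };

    struct State {
        std::deque<Event> pending, staged;
        bool    client_advanced = false;
        int32_t sent_id = 0, received_ack = 0;
    };

    // Move: clear the Staged Queue, bump Sent ID, and freeze the Pending Queue under that id.
    void move_op(State& s) {
        s.staged.clear();
        ++s.sent_id;
        s.staged = std::move(s.pending);
        s.pending.clear();
    }

    // c_Reset: conditional reset, only allowed once the viewer has made forward progress.
    bool c_reset(State& s) {
        if (!s.client_advanced) return false;  // status '0'
        s.staged.clear();                      // release Staged Queue without sending
        s.sent_id = 0;
        s.received_ack = 0;
        s.client_advanced = false;
        return true;                           // status '1'
    }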

     

    • Like 1
    • Thanks 2
  5. Aditi always has something going on these days.  GLTF is the big one but others as well.  Try a different region on a different simhost.  And the Linden viewer.  FS local mesh problems kind of points to local machine so check the viewer logs.  If all of that fails, Jira!

    • Thanks 1
  6. [ Okay, sorry for the delay.  I've been working through the cases and reformulating the problem and I'm starting to like what I have.  Thanks to everyone who's been whiteboarding with me, especially @Henri Beauchamp and @animats.  It's been extremely valuable and enjoyable.  Now, on to the next refinement with responses to earlier comments... ]

    Reframing the Problem

    Two race conditions exist in the system:

    1.  Inner race.  Involves the lifecycle of a single request.  Race starts as soon as the viewer commits to timing out an EventQueueGet operation.  Even if the entire response, less one byte, is in viewer buffers, that response will be lost as will any response written later until the race closes.  The race closes when:

    • Simulator times out the request and will not send any event data until a new request comes in.  This produces a closed socket between simulator and apache stack.
    • Simulator starts processing a new request from the viewer which causes early completion of outstanding request.  This is sent as a 200/LLSD::undef response.

    (Note:  the race is also started by other entities.  There are actions within the Apache stack and simply on the network generally that can end that request early without the simulator being aware.)

    This race is due to application protocol design and the use of unreliable HTTP modes (timeouts, broken sockets).

    2.  Outer race.  Involves the lifecycle of the HTTP endpoint classes (LLEventPoll and LLAgentCommunication) which retain memory of prior progress.  Instances of these classes can disappear and be recreated without past memory and without informing the peer.  This causes a jump in ack/id number and the possibility of duplicate events being sent.

    This race is due to implementation decisions.  I.e. endpoint classes that forget and that don't have a synchronization protocol between them.

    Before and After

    Under "At Most Once" event delivery, 1. is a source of lost events and 2. doesn't matter.  Events are being forgotten on every send so the outer race adds no additional failures (loss or duplication).

    Under "At Least Once" event delivery, 1. still exists but retaining events on the simulator side until a positive acknowledgement of receipt is received corrects the lost response.  2. then becomes a path for duplicated event processing in the viewer.  A simulator will have sent events for an 'ack = N/id = N+1' exchange when the viewer moves out of the region and eventually releasing its endpoint.  The viewer may or may not have received that response.  When the viewer comes back, it will do so with a new LLEventPoll and 'ack = 0' as its memory and the simulator is faced with a dilemma:  should the viewer be sent a duplicate 'id = N + 1' response or should it assume that that payload has been processed and wait until the 'id = N + 2' payload is ready?

    Simulator sending 200 response before timeout.  This doesn't really close the race.  The race is started outside the simulator, usually by the viewer but apache or devices on the network path can also start it.  The simulator can only close the race by electing not to send event data until a new request comes in.  Trying to end the race before it starts isn't really robust:  libcurl timers can be buggy, someone changes a value and doesn't know why, apache does its thing, sea cable ingress decides to shut a circuit down, network variability and packet loss, and everyone in the southern hemisphere.  And it locks the various timeout values into the API contract.  Solution:  embrace the race, make it harmless via application protocol.

    Updated Phased Proposal

    Phase 1

    As before with one change to try to address the outer race scenarios.  Simulator will retain the last 'ack' received from the viewer.  If it receives a request with 'ack = 0' and 'last_ack != 0', this will be a signal that the viewer has lost synchronization with the simulator for whatever reason.  Simulator will drop any already-sent events, advance any unsent events, and increment its 'id' value.  'last_ack' now becomes '0' and normal processing (send current or wait until events arise or timeout) continues.  This potentially drops unprocessed events but that isn't any worse than the current situation.

    With this, the inner race is corrected by the application protocol.  There's no sensitivity to timeouts anywhere in the chain:  viewer, simulator, apache, internet.  Performance and delivery latency may be affected but correctness won't be.
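    A sketch of that Phase 1 rule as I read it (names and structure here are mine, not the simulator's):

    #include <cstdint>
    #include <deque>
    #include <string>

    struct Event { std::string name, body; };

    struct EndpointState {
        std::deque<Event> sent;    // events tied to the last 'id' already sent
        std::deque<Event> unsent;  // events queued since then
        int32_t id = 0;            // last 'id' handed to the viewer
        int32_t last_ack = 0;      // last 'ack' received from the viewer
    };

    void onRequestAck(EndpointState& s, int32_t ack) {
        if (ack == 0 && s.last_ack != 0) {
            // Viewer has lost synchronization: drop already-sent events, let the unsent
            // ones advance, and move to a fresh 'id' for the next response.
            s.sent.clear();
            ++s.id;
            s.last_ack = 0;
            return;
        }
        s.last_ack = ack;  // ordinary ack bookkeeping
    }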

    Phase 2

    This becomes viewer-only.  Once the race conditions are embraced, the fix for the outer race is for the viewer to keep memory of the region conversation forever.  A session-scoped map of 'region->last_ack' values is maintained by LLEventPoll static data and so any conversation can be resumed at the correct point.  If the simulator resets, all events are wiped anyway so duplicated delivery isn't possible.  Viewer just takes up the new 'id' sequence.  This should have been done from the first release.
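    A sketch of that session-scoped memory idea; the static-map framing follows the description above, but the class and names are illustrative, not LLEventPoll itself:

    #include <cstdint>
    #include <map>
    #include <utility>

    using RegionHandle = std::pair<uint32_t, uint32_t>;

    class EventPollSketch {
    public:
        explicit EventPollSketch(RegionHandle region)
            : region_(region), ack_(lastAcks()[region])  // resume from remembered ack (0 if new)
        {}
        void onResponse(int32_t id) {                    // payload 'id' successfully processed
            ack_ = id;
            lastAcks()[region_] = id;                    // survives this object's destruction
        }
        int32_t currentAck() const { return ack_; }
    private:
        static std::map<RegionHandle, int32_t>& lastAcks() {
            static std::map<RegionHandle, int32_t> acks;  // session-scoped memory
            return acks;
        }
        RegionHandle region_;
        int32_t ack_;
    };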

    Additionally, fixing the error handling to allow the full set of 499/5xx codes (as well as curl-level disconnects and timeouts) as just normal behavior.  Maybe issue requests slowly while networks recover or other parts of the viewer decide to surrender and disconnect from the grid.

    Logging and defensive coding to avoid processing of duplicated payloads (which should never happen).  There's one special case where that won't be handled correctly:  when the viewer gets to 'ack = 1', moves away from the region, then returns, resetting the simulator with 'id = 0'.  In this case, the first response upon return will be 'ack = 1' and *should* be processed.  May just let this go.  Also, diagnostic logging for the new metadata fields being returned.

    Forum Thread Points and Miscellany Addressed

    3 million repeated acks.  Just want to correct this.  The 3e6 is for overflows:  cases where the event queue in the simulator reaches its high-water mark and additional events are just dropped.  Any request from the viewer that makes it in is going to get the full queue in one response.
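    A sketch of that overflow behavior: a bounded pending queue where events past the high-water mark are dropped and counted, and a successful request drains the whole queue in one response (limits and names are illustrative):

    #include <cstddef>
    #include <deque>
    #include <string>
    #include <utility>
    #include <vector>

    struct Event { std::string name, body; };

    class PendingQueue {
    public:
        explicit PendingQueue(std::size_t high_water) : high_water_(high_water) {}
        void push(Event e) {
            if (events_.size() >= high_water_) { ++dropped_; return; }  // overflow: drop and count
            events_.push_back(std::move(e));
        }
        // A viewer request that gets through takes the full queue in one response.
        std::vector<Event> drain() {
            std::vector<Event> out(events_.begin(), events_.end());
            events_.clear();
            return out;
        }
        std::size_t dropped() const { return dropped_; }
    private:
        std::deque<Event> events_;
        std::size_t high_water_;
        std::size_t dropped_ = 0;
    };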

    Starting and closing the race in the viewer.  First, a warning...  I'm not actually proposing this as a solution but it is something someone can experiment with.  Closing the race as outlined above only happens in the simulator.  BUT the viewer can initiate this by launching a second connection into the cap, say 20s after the first.  Simulator will cancel an unfinished, un-timed-out request with 200/LLSD::undef and then wait on the new request.  Viewer can then launch another request after 20s.  Managing this oscillator in the viewer would be ugly and still not deal with some cases.

    Fuzzing Technique.  You can try some service abuse by capturing the Cap URL and bringing it out to a curl command. Launch POST requests with well-formed LLSD bodies (map with 'ack = 0' element) while a viewer is connected.  This will retire an outstanding request in the simulator.  That, in turn, comes back to the viewer which will launch another request which retires the curl command request.

    TLS 1.3 and EDH.  There is a Jira to enable keylogging in libcurl/openssl to aid in wire debugging of https: traffic.  This will require some other work first to unpin the viewer from the current libcurl.  But it is on the roadmap.

    If we do open up this protocol to change in the future (likely under a capability change), I would likely add a restart message to the event suite in both directions.  It declares to the peer that this endpoint has re-initialized, has no memory of what has transpired, and that the peer should reset itself as appropriate.  Likely with some sort of generation number on each side.

     

    • Like 3
    • Thanks 1
  7. On 9/26/2023 at 5:17 PM, animats said:

    > Interesting. That's very similar to the problem of a double region crossing. We know that if a second region crossing starts before the first has completed (a common case at region corners) a stuck state results. The problems may be related.

    There's a lot going on on our side both before *and* after the TeleportFinish.  Before the message there's a lot of communication between simhosts and other services.  It is very much a non-local operation (region crossings have some optimizations in this area).  After the message, some deferred work continues for an indefinite number of frames.  Yeah, there are more races in there - I started seeing their fingerprint when looking at London City issues.

    This is a bit of a warning...  the teleport problems aren't going to be a single-fix matter.  The problem in this thread (message loss) is very significant.  But there's more lurking ahead.

    • Thanks 1
  8. 20 hours ago, Henri Beauchamp said:

    > You can increase the timeout to 45s with the Cool VL Viewer now, but sadly, in some regions (*) this will translate into a libcurl-level ”spurious” retry after 30s or so (i.e. a first server-side timeout gets silently retried by libcurl), before you do get a viewer-side timeout after the configured 45s delay; why this happens is unclear (*), but sadly, it does happen, meaning there is no possibility, for now, to always get a genuine server-side timeout in the agent region (the one that matters), nor to prevent a race during the first ”silent retry” by libcurl...

    There are other players in here.  I've deliberately avoided mention of Apache details.  However, it will contribute its own timeout-like behaviors as well as classic Linden behaviors (returning 499, for example).  The modules in here aren't candidates for rework for a number of reasons and so general principles must apply:  expect broken sockets (non-http failures), 499s, and flavors of 5xx and don't treat these as errors.

    That the LL viewer does treat these as errors is really a bug.  The viewer-side timeouts enable hang and lost-peer detection on a faster timescale than, say, TCP keepalives would (not that that's the right tool here).  Slowed retry cycles might be better, then allowing UDP circuit timeout to drive the decision to declare the peer gone.  But that didn't happen.

    Whatever happens, I really want it to work regardless of how many entities think they should implement timeouts.  Timeouts are magic numbers and the data transport needs to be robust regardless of the liveness checking.

    (I do want to document this better.  This is something present on all capabilities.  For most caps it won't matter but that depends on the case.)

     

    20 hours ago, Henri Beauchamp said:

    > So, basically, a procedure must be put in place so that viewers without future hardened/modified code will not get those duplicate event poll messages.

     

    The situation isn't entirely grim with the existing viewer code but there is still a duplication opportunity.  First, the optimistic piece.  The 'ack'/'id' value is supposed to have meaning but the current LL viewer and server do absolutely nothing with it except attempt a safe passthru from server to viewer and back to server.

    With the passthru and the enforcement of, at most, a single live request to the region, there is good agreement between viewer and server on what the viewer has processed.  As long as the 'ack' value held by the viewer isn't damaged in some way (reset to 0, made an 'undef', etc.)

    There is one window:  during successful completion of a request with 'id = N + 1' and before a new request can be issued, the server's belief is that viewer is at 'ack = N'.  If the viewer walks away from the simulator (TP, RC) without issuing a request with 'ack = N + 1', viewer and server lose sync.

    So, the four cases:

    1. Viewer (LLEventPoll) at 'ack = 0/undef' and server (LLAgentCommunication) at 'id = 0'.  Reset/startup/first visit condition, everything is correct.  No events have been sent or processed.
    2. Viewer at 'ack = N' and server at 'id = N or N + 1'.  Normal operation.  If viewer issues request, server knows viewer has processed 'id = N' case and does not resend.  Any previous send of 'id = N + 1' case has been dropped on the wire due to single request constraint on the service.
    3. Viewer at 'ack = N' and server at 'id = 0'.  Half-reset.  Viewer walked away from region then later returned and server re-constructed its endpoint but the viewer didn't.  Safe from duplication as the event queue in the server has been destroyed and cannot be resent.  Viewer does have to resync with server by taking on the 'id = 1' payload that will eventually come to it.  (Could even introduce a special event for endpoint restart to signal viewer should resync but I don't think this is needed.)
    4. Viewer at 'ack = 0' and server at 'id = N'.  Half-reset.  Opposite case.  Viewer walked away from region then re-constructed its endpoint (LLEventPoll) but the server didn't.  This is the bad case.  Viewer has no knowledge of what it has processed and server has a set of events, 'id = N', which the viewer may or may not have processed in that aforementioned window.

    I believe that last case is the only real exposure to duplicated events we have, given a) the various request constraints and b) no misbehavior by viewer or server during this protocol.  So what are the options?  Here are some:

    • Case 4. doesn't happen in practice.  There's some hidden set of constraints that doesn't allow this to happen.  System transitions to cases 1. or 3. and 4. never arises.  Hooray!
    • There is the 'done' mechanism in the caps request body to terminate the server's endpoint.  It is intended that the viewer use this in a reliable way when it knows it's going to give up its endpoint.  I.e. it forces a transition to case 1. if 4. is a possibility.  I think this control is buggy and has races so there are concerns.  Requires viewer change.
    • Viewer keeps 'last ack' information for each region forever.  When the viewer walks away from a region and comes back, it reconstructs its endpoint with this 'last ack' info.  So instead of coming back into case 4., it enters into 2. or 3. and we're fine again.  Requires viewer change.
    • Use an 'ack = 0' request as an indicator that viewer has reset its knowledge of the region and assume that any current payload ('id = N') has either been processed or can be dropped as uninteresting because the viewer managed to return here by some means.  Next response from the server will be for 'id = N + 1'.  No viewer change but viewers might be startled.  TPV testing important.
    • Magical system I don't know about.

    I would really like for that first bullet point to be true but I need to do some digging.
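    For reference, those four combinations reduce to a simple classification over the viewer's remembered 'ack' and the server's current 'id'.  Purely illustrative of the case analysis, not code from either side:

    #include <cstdint>

    enum class SyncCase {
        FreshStart,   // 1. ack = 0, id = 0: reset/startup, nothing sent or processed
        Normal,       // 2. ack = N, id = N or N+1: normal operation
        ServerReset,  // 3. ack = N, id = 0: server endpoint rebuilt; safe, viewer resyncs
        ViewerReset   // 4. ack = 0, id = N: viewer endpoint rebuilt; the duplication exposure
    };

    SyncCase classify(int32_t viewer_ack, int32_t server_id) {
        if (viewer_ack == 0 && server_id == 0) return SyncCase::FreshStart;
        if (viewer_ack == 0)                   return SyncCase::ViewerReset;
        if (server_id == 0)                    return SyncCase::ServerReset;
        return SyncCase::Normal;
    }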

     

  9. On 9/23/2023 at 4:32 PM, Henri Beauchamp said:

    > Cool VL Viewer releases (v1.30.2.28 and v1.31.0.6) published, with my new LLEventPoll code and experimental race condition (partial) workaround for TP failures.

    > The new goodies work as follows:

    > • LLEventPoll was made robust against 499 and 500 errors often seen in SL when letting the server time out on its side (which is not the case with LL's current code since libcurl retries long enough and times out by itself). 502 errors (that were already accepted for Open Sim) are now also treated as ”normal” timeouts for SL. It will also retry 404 errors (instead of committing suicide) when they happen for the Agent's sim (the Agent sim should never be disconnected spuriously, or at least not after many retries).
    > • LLEventPoll now sets HTTP retries to 0 and a viewer-side timeout of 25 seconds by default for SL. This can be changed via the ”EventPollTimeoutForSL” debug setting, which new value would be taken into account on next start of an event poll.

    With these two done, a different experimental mode would be to set the viewer timeout to something that should almost never trigger:  45s or more.  This hands connection lifecycle mostly to the simulator, allowing it to know when it can truly send and closing the race.  This should give a 'mostly working' implementation of 'at most once' event delivery.  It will still break on other failures on the network as well as at TPs/RCs where the LLEventPoll/LLAgentCommunication endpoints may get cycled.  Not a fix but an experiment.

  10. On 9/22/2023 at 8:14 PM, animats said:

     

    "At least once" delivery

    The most important change is the move to "at least once" delivery instead of "no more than once". That alone will solve some major problems. There may be some viewer-side problems with duplicate events, but those can be detected viewer side. The events

    • EstablishAgentCommunication
    • CrossedRegion
    • TeleportFinish

    are probably the ones that need viewer-side duplicate detection, because they all imply major viewer side state changes. No big problem detecting that viewer side. Duplicates are far better than losing any of those crucial messages. Losing one of those means a viewer stalled at login, region crossing, teleport, or neighbor region appearance.

    Are there any others not in that list for which processing a duplicate message causes serious trouble?

    Duplicates are easy to remove for the simple case (viewer and sim in agreement).  The case where LLEventPoll gets recreated, losing its memory of previous 'ack'/'id' progress, still needs work.  The hope is to remove duplication at that level, where it happens in one place.  If it has to happen at the level of individual events/messages, that's duplicated work now and complexity in the future when more are added.

    Not certain about completeness of that list yet.  The survey is pretty complete but I haven't checked yet.

    On 9/22/2023 at 8:14 PM, animats said:

    > Event poll timeout semantics

    > Henri and I were arguing above over how to handle this. It's even worse than either of us thought.

    > I'll plug again for the server sending an empty array of events when it has no events to send. Stay on the happy path of HTTP status 200 if at all possible.

    For various reasons, changing the server side is going to be painful (look at lliohttpserver.cpp in the viewer) and the viewer needs to retry on 499/5xx receipt anyway.  I think if this were changed, I'd just flip the timeout response to be the same as for a duplicate connection:  200 with naked 'undefined' as LLSD value.  But this current behavior is shared with all simulator and dataserver http endpoints as it is based on the common HTTP server 'framework'.  There needs to be more awareness of this and our docs don't help.

    Viewer timeout was still the real problem as that's what introduced the race condition for the 'at most once' case.  Without it, loss was still possible but would have happened more rarely and no one would ever have found it.  So this is a good thing.  (Ha)

    On 9/22/2023 at 8:14 PM, animats said:

    > Event order

    > Once we have at-least-once delivery, is it worth guaranteeing in-order delivery of event poller events? Out of order delivery is a feature on the UDP message side, but it doesn't seem to add value on the HTTP event poller side. Out of order delivery makes this much harder to think about and adds code complexity. What is the server-side benefit of out of order event queuing? Something to think about. A discussion topic at this point, not a proposal.

    It's very deliberately ordered right now but it was doing event combining in the past, overwriting certain existing events when new ones of the same type were queued.  The way it was done, it also resulted in out-of-order delivery.  But that was removed long ago.  Combined with disallowing multiple connections at once, order is now preserved.  I don't know that we'd ever want it back.  Except for the fact that a lot of data can flow through this pathway and it does overflow.  Combining might reduce that.  Not really proposing it, just making it clear that there were decisions here.  I'm trying to leave enough breadcrumbs for future debates.

  11. Information is still coming in (*really* looking forward to AWS' explanation).  HTTP-Out was running at elevated levels (including higher error rates) from 21:30slt yesterday until 2:45slt today.  That's now running as expected.  Teleports remained unreliable (~80% successful) until around 6:30slt today.  They've now recovered.  Lingering issues are likely and we do want to hear about them.  Please contact support.

    • Like 2
    • Thanks 1
  12. 3 hours ago, animats said:

    > Hm. If you need more checking viewer side for that, just ask. Is this UDP side or event poller? I see sequential numbers from the SL simulators for both. The Other Simulator does sometimes skip sequence numbers.

    Event poller.  I suspect it's a case where the simulator has tossed the LLAgentCommunication state after a region crossing and avatar drift.  But the viewer keeps its LLEventPoll object alive so viewer and sim are now desynchronized.  Haven't dug into it yet - future project.  Even the 3e6 dropped events per day are just 4 / hour / region when amortized.

  13. 37 minutes ago, animats said:

    > I'd like to plug again for the idea that, in the unusual event the protocol has to drop something due to overload, put something in the event stream that tells the viewer that data was lost.

    I want to do enough eventually to detect and log that your peer is doing something unexpected.  Might just be an 'event lost' flag or a sequence number scheme.  Not certain it will go as far as a digest of failure.  Open to suggestions.

    Simulator is subject to amnesia where it necessarily tosses connection information away when an agent departs.  When the agent comes back, simulator starts over.  Viewer, on the other hand, may keep simulator information around for the life of the session.  The resetting needs to be treated correctly.
