Monty Linden

Lindens
  • Posts

    542
Posts posted by Monty Linden

  1. 1 hour ago, Henri Beauchamp said:

    In fact, I could verify today that this scenario cannot happen at all in SL. I instrumented my viewer code with better DEBUG messages and a timer for event poll requests.

    You might be monitoring at the wrong level.  Try llcorehttp or wireshark/tcpdump as a second opinion.  The simulator will not keep an event-get request alive longer than 30s.  If you find it is, it is either retries masking the timeout or yet more simulator bugs.  Or both.  There are always simulator bugs.

    (Side note:  once 'Agent Communication' is established, events queue regardless of whether or not an event-get is active.  Up to a high-water level.  So it is not required that an event-get be kept active constantly.  It is required that no more than one be active per region - this is not documented well.)
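
A minimal sketch of the "no more than one event-get active per region" rule from the side note. The class name `RegionPollGuard` and its methods are hypothetical and not part of any viewer; the idea is simply to refuse a second poll for a region until the first has ended.

```python
import threading


class RegionPollGuard:
    """Hypothetical helper: allow at most one in-flight event-get per region."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = set()

    def try_begin(self, region_id):
        """Return True if the caller may start a poll for this region."""
        with self._lock:
            if region_id in self._active:
                return False
            self._active.add(region_id)
            return True

    def end(self, region_id):
        """Mark the poll as finished (response, error, or timeout)."""
        with self._lock:
            self._active.discard(region_id)
```

A poller would call `try_begin()` before issuing the request and `end()` in a `finally` block, so that a timeout or retry cannot leave two requests outstanding for the same region.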

  2. 27 minutes ago, Henri Beauchamp said:

    All the TP failure modes I get happen before event polls are even started: the UDP message from the arrival sim just never gets to the viewer, so the latter is "left in the blue" and ends up timing out on the departure sim as well.

    There are several failure scenarios, including one where the TeleportFinish message is queued but the simulator refuses to send it for reasons not yet known.  A more elaborate scenario is this:

    • [Viewer]  Request timeout fires.  Begins to commit to connection tear down.
    • [Sim]  Outside of Viewer's 'light cone', queues TeleportFinish for delivery, commits to writing events to outstanding request.
    • [Sim]  Having prepared content for outstanding request, declares events delivered and clears queue.
    • [Apache]  Adds more delay and reordering between Sim and Viewer because that's what apache does.
    • [Viewer]  Having committed to tear down, abandons request, dropping connection and any TeleportFinish event, then re-issues request.
    • [Viewer]  Never sees TeleportFinish, stalls, then disconnects from Sim.
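
The sequence above can be replayed as a small deterministic model. This is only a sketch: the function `replay` and its two flags are invented for illustration, and the "keep until re-requested" branch is a hypothetical alternative, not a description of what the simulator does.

```python
def replay(clear_on_write, viewer_abandons):
    """Replay the scenario; return the events the viewer ends up seeing."""
    sim_queue = ["TeleportFinish"]

    # Sim commits the queued events to the outstanding request.
    response = list(sim_queue)
    if clear_on_write:
        # Sim declares the events delivered as soon as they are written.
        sim_queue.clear()

    seen = []
    if viewer_abandons:
        # Viewer's timeout already fired: the connection is dropped and
        # the response body is never read.
        response = []
    seen.extend(response)

    if not seen:
        # Viewer re-issues the request; only still-queued events can be sent.
        seen.extend(sim_queue)
    return seen
```

With `clear_on_write=True` and `viewer_abandons=True` the viewer sees nothing, which is the stall described in the last step. If the queue is not cleared at write time, the re-issued request still finds the event.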
  3. 53 minutes ago, Henri Beauchamp said:

    Yes, it is indeed as bad as it looks... This said, my modified code performs just fine in both SL and OpenSim now, and the failed TP issues still seen are not related to event polls anyway (event polls are simply retried on timeouts).

    Well....  you don't really know that unless what you've received matches what was intended to be sent.  And it turns out, they don't necessarily match and retries are part of the problem.  I already know event-get is one mode of TP/RC failure. 

    • Thanks 1
  4. On 9/10/2023 at 12:43 AM, animats said:

    As in RSS? Right, there it's interpreted as "I've seen and processed everything through #117, send me anything from #118 and later."

    No, simpler, as in HTTP.  To support very basic 'If-Match'/'If-None-Match' tests.  The one special case is where the simulator side resets and this may need to be acted upon. 

    On 9/10/2023 at 12:43 AM, animats said:

    Incidentally, on a timed out event poll, SL does not send the documented 502 status; it just closes the connection. (The Other Simulator does send a 502). Timeout occurs around 58 seconds on both SL and the Other Simulator.

    This keeps getting worse the more I look.  So *both* viewer and simulator implement a 30-second timeout on these requests.  A guaranteed race condition.  Simulator's spans a different part of the request lifecycle than does the viewer's curl timeout.  More variability.  Simulator can send at least three different responses related to its timeout actions:

    • A 200 status with empty body and no content-type header
    • A 200 status with llsd content-type and a body of:  '<llsd><undef /></llsd>'
    • A 499 status with html content-type and a body I haven't looked at but it's about 400 bytes.

    Between viewer and simulator is an apache stack that adds a layer of interpretation, time dependencies, and other factors to something that's already a hodge-podge.

    If a 502 is returned anywhere, it's probably an accident.
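
A hedged sketch of how a client might fold the three timeout shapes listed above into one check. The function name is hypothetical, the content-type match is a loose substring test rather than an exact header value, and any 499 is treated as a timeout without inspecting the body, since the post does not describe that body.

```python
def is_timeout_response(status, content_type, body):
    """Return True if a response matches one of the three timeout shapes."""
    if isinstance(body, bytes):
        text = body.decode("utf-8", "replace").strip()
    else:
        text = (body or "").strip()
    ctype = (content_type or "").lower()

    # Shape 1: 200, empty body, no content-type header.
    if status == 200 and not text and not ctype:
        return True
    # Shape 2: 200, llsd content-type, body of <llsd><undef /></llsd>
    # (compared with all whitespace removed).
    if status == 200 and "llsd" in ctype:
        if "".join(text.split()) == "<llsd><undef/></llsd>":
            return True
    # Shape 3: 499 with an html body that is not examined here.
    return status == 499
```

Anything that falls through, including a 502, would go down the ordinary error path rather than being treated as a normal "no events" result.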

    • Thanks 1
  5. 46 minutes ago, animats said:

    Those are used for something? For a while, I was sending the string "TEST" instead of the proper LLSD message with ID number, and it still worked. Sequentially numbered events showed up. (SL has consistent event ascending numbers from 1; the Other Simulator will sometimes skip a number, and starts from some arbitrary large integer.)

    The event queue has ordered reliable delivery, which is useful where order matters.

    Well, LLSD is meant to allow slop (c++ and python implementations particularly).  But there are arguments other than id/ack in the body.  They're not currently used in the SL viewer so documentation is diddly squat.

    The whole id/ack thing is badly done.  It should have been treated more like ETag.  As it is, old events can get a new id and other sins.  I think the incremental nature is accident, not contract.  And it will reset to 0 in a region/agent pair in certain circumstances (including that viewer-initiated reset).  So, beware.  Hope to document this.
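
One way to act on the "beware" is to treat the id as an opaque token, ETag-style: echo back whatever was last received and never compute the next value. The sketch below does that and also notes when the id goes backwards, which may indicate a reset. `AckTracker` is a hypothetical name, and treating "lower than last time" as a reset is an assumption, not a documented rule.

```python
class AckTracker:
    """Hypothetical viewer-side tracker for the event-queue id/ack value."""

    def __init__(self):
        self.last_id = None
        self.resets = 0

    def observe(self, event_id):
        """Record the id from a response; return the value to send as 'ack'."""
        if self.last_id is not None and event_id < self.last_id:
            # The id went backwards: assume a region/agent reset happened.
            self.resets += 1
        self.last_id = event_id
        # Echo exactly what was received; gaps and jumps are tolerated.
        return event_id
```

A skipped number or a large starting value passes through untouched, while a drop back toward 0 increments `resets` so the caller can decide whether to re-synchronize.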

    And it's not necessarily ordered-reliable.  The event queue was very deliberately designed to implement event compaction/overwrite in place and, at some point in time, it was neither ordered (between events) nor reliable.  Someone disabled that UDP-like behavior at some point and I'm not certain why, or whether that has contract status at this point.  In fact, it is still unreliable now as we'll drop queue overruns on the floor.  They do not get regenerated unless a higher-level application protocol times out and causes a regeneration from a new query or other re-synchronizing action.
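
To make the two behaviors concrete, here is a toy queue that overwrites a pending event in place when a newer one with the same key arrives, and drops new events once a high-water mark is reached. It is a sketch only: the key scheme, the class name `CompactingQueue`, and the exact drop policy are assumptions and say nothing about the simulator's actual data structures.

```python
from collections import OrderedDict


class CompactingQueue:
    """Toy event queue with overwrite-in-place and overrun drops."""

    def __init__(self, high_water):
        self.high_water = high_water
        self._events = OrderedDict()
        self.dropped = 0

    def push(self, key, payload):
        """Queue an event; return False if it was dropped as an overrun."""
        if key in self._events:
            # Compaction: replace the pending payload, keep its position.
            self._events[key] = payload
            return True
        if len(self._events) >= self.high_water:
            # Overrun: dropped on the floor, never regenerated here.
            self.dropped += 1
            return False
        self._events[key] = payload
        return True

    def drain(self):
        """Return all pending (key, payload) pairs in order and clear."""
        items = list(self._events.items())
        self._events.clear()
        return items
```

Nothing in the queue itself recovers a dropped event; as the post says, that only happens if a higher-level protocol times out and asks again.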

    Future hope is to homologate this in-place and as-is with minimal optional extensions that would also allow each end to monitor how badly the other side is messing up.

  6. Pulling out the true story is going to require reverse engineering and documentation on our part.  Henri's logging is something that should probably have been available in the SL viewer from day 1.  But here we are.  And as always, never mistake anecdotes for a contract.  You might see the same pattern ten times in a row but that doesn't necessarily mean there's a temporal constraint.  Especially in our code.

    You may not need quite as much deferred logic as hinted at.  Once you send the UseCircuitCode (UCC) into the neighbor, there will be a RegionHandshake (RH) in your future.  You may be able to delay the RegionHandshakeReply, or RHR (I mean, it's async anyway so pretend you're on Mars).  What I'd meant before was that after sending UCC (and probably better after RH receipt), send AgentUpdates into the Host region.  This is one of several paths that drives a private protocol between simulators that causes a neighbor to emit the EstablishAgentCommunication (EAC) message, which is routed back via the host region, that you need to begin HTTP operations with the neighbor in anger.

    As always, you shouldn't entirely trust me on this.  I haven't walked through all of this to verify what's accidental and what's a true constraint or dependency.  HTTP was bolted onto the old UDP message system.  It's a Frankenstein monster.  'AgentCommunication' isn't just an abstract concept in the simulators, it's a physical class LLAgentCommunication and they are tied together (also hints that the people who did this are not the ones who should have been allowed to do it).  This class 'encapsulates' all the capabilities as well as the low-level plumbing for HTTP bridging of LLMessage comms.  This thing isn't actually tied to an agent strongly.  Nor to circuits (it doesn't really know about them).  It can live on after an agent has departed.  And yet the viewer has the ability to tear down this entity via arguments to 'EventQueueGet' requests.  (This resets the 'id'/'ack' value on the query among other things.)  It's quite the thing.

    • Thanks 2
  7. I'm all for better docs.  (I'm also a proponent of *any* docs.)  Sequence diagrams would be nice but they also have weaknesses.  They tend to show happy paths and miss recovery transitions.  Not all ordering shown in sequence diagrams is necessarily a constraint.  Some are, some aren't, and real docs are needed for the distinction.  But having the sequence diagram to help frame that doc would be a win.  Just takes work to extract that from the frankencode.  *cough*  I'm also looking for some better options for API exchanges (particularly for http transport).  Not certain llidl is the tool.

    • Thanks 1
  8. 5 hours ago, animats said:

    I think it's UseCircuitCode or RegionHandshake to the neighbor that's needed, but I'm not sure yet.

    They're needed but they may not be enough.  If you find the EAC message delayed or still missing, try driving it with a few AgentUpdates after UCC and RH.  The Opensim version of this has got to be better than what I'm looking at.

    • Like 1
    • Thanks 1
  9. 5 minutes ago, animats said:

    A state machine diagram, or diagrams, would be a big help here. Then the code could check for event A coming in while in state B, where that's not an allowed event in that state. This is the kind of problem where you need formalism to get it right.

    Hahaha.  To try to fix the event-get issue I had to lay out a solution as a state machine diagram.  I hope to share these as part of the wiki documentation at a future point.  (More later...)
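
The check animats describes (event A arriving in state B where it is not allowed) can be sketched with a transition table. The states and the two transitions below are invented for illustration from message names mentioned elsewhere in these posts; they are not the diagrams referred to above, and whether these orderings are real constraints is unverified.

```python
# (current state, incoming event) -> next state. Illustrative only.
ALLOWED = {
    ("circuit_sent", "RegionHandshake"): "handshake_done",
    ("handshake_done", "EstablishAgentCommunication"): "http_ready",
}


class ProtocolChecker:
    """Hypothetical checker that records events arriving in the wrong state."""

    def __init__(self, state="circuit_sent"):
        self.state = state
        self.violations = []

    def on_event(self, event):
        """Advance on an allowed event; otherwise log a violation."""
        next_state = ALLOWED.get((self.state, event))
        if next_state is None:
            self.violations.append((self.state, event))
            return False
        self.state = next_state
        return True
```

Logging violations instead of raising keeps the checker usable as instrumentation while it is still unclear which orderings are contract and which are accident.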

    • Like 1
    • Thanks 5
  10. I don't monitor the 'General' forum closely but noticed this while replying in another thread.  Speaking personally, I've enjoyed and agree with the complaints here.  I, too, find teleporting/region crossing appalling and it needs to change such that any failure would be surprising and not part of the 'SL Experience.'

    But this is where we are.  Teleport is a confluence of many buggy systems swirling into a whirlpool of frustration and failure.  It's going to be very hard to get this asynchronous pile of state sorted but that's the task and we want to see this done.

    I'll share a recent finding to illustrate what this looks like.  If you've looked into the code you may be familiar with the TeleportFinish message.  This is sent to the viewer, which triggers it to move into the destination region.  The discovery is that this message is often queued but never delivered to the viewer, leading the viewer to disconnect (yes, it's the viewer that disconnects) and display the quit/IM message.  The cause centers around the EventQueueGet cap and if you're a TPV developer who has sniffed around that and found a bad smell, your nose was not wrong.

    This will not be the *one* cause of TP failures nor are TP failures the only result of it.  Making this work as everyone would like will require just that:  many fixes over time, incrementally improving the result while avoiding damage elsewhere.

    • Like 3
    • Thanks 10
  11. There's a lot in this topic so I'm just going to say a bunch of things in no particular order.

    • We're currently blocking ICMP on simhosts so you can't do a true ping to them.  (A subject of internal debate.)  Other tools need to be brought to the task.
    • One of the better ones is a true 'traceroute' program.  You will find un-cooperative hosts along the way including simhosts.  However, the numbers it reports are decent and a good version of traceroute is very configurable.  For example, you could target the simulator UDP port or the simhost TCP capability port (12043) for the probe.  How to do this is an exercise for the reader.
    • The internet does seem to be generally ill.  I'm on the US east coast and I'm getting 90ms roundtrips to us-west-2 where I usually see 55-60ms.  400ms from the UK is worse than what I'd expect from .au or .za or .aq.
    • us-east-1 is the greatest region - but everything is in us-west-2 (Oregon)
    • True 'ping' requires ICMP which should not be available to in-browser javascript.  (If it is, I'd like someone to point me at that.)  So the AWS latency test raises some questions.  Its numbers for me are high compared to traceroute and I get the frequent peak others report so I don't trust it except as a very noisy upper limit.
    • Application ping (Shift-Ctrl-1) is as Henri describes.  It should look something like actual ping plus frame time.  Unless we're under real load requiring multiple frames to service requests or a retransmission regime.
    • The 'Simulator' portion of the 'Statistics' window is most interesting.  If there is a simulator explanation for latency, it can show up there first.  It doesn't cover everything but it does cover most main loop activity.  (Would really like historical metrics charts on the viewer so trends and glitches can be watched a la viewer fps.)
    • Services like 'login.agni.lindenlab.com' and the asset CDN are not at AWS.  Don't test against those unless you specifically care about their behavior.
    • Simulators share resources on simhosts and perfect isolation isn't possible.  Sometimes the dice don't roll your way.
    • Like 2
    • Thanks 6
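
Since ICMP is blocked, one rough stand-in for the probes discussed in the list above is to time a TCP connect to the capability port. This is not traceroute and is a swapped-in technique: it measures one handshake plus local overhead, so treat it as a noisy estimate. The function name and the default timeout are assumptions; the port 12043 comes from the post.

```python
import socket
import time


def tcp_connect_ms(host, port=12043, timeout=5.0):
    """Return milliseconds taken to complete a TCP connect, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Refused, unreachable, timed out, or name resolution failed.
        return None
```

Calling it several times against the same simhost and taking the minimum gives a number closer to the network round trip than any single sample.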
  12. That screen capture doesn't tell me much other than you're running two TPV instances.  Aditi's up and running (I was just in Dore) and is getting used quite a bit given the testing work we badly need to do right now.  :)

  13. 4 hours ago, bigmoe Whitfield said:

    I'm just wondering what all's changed and what can you share with us now?

    Wow, those are practically archaeological artifacts.  12Gbps aggregate network throughput.  Open sourcing the servers.  4GB memory per server!  Yeah, it's very different now.  Some of that has survived, all of it is very different now.  Not certain when we'll publish a wiki update but, man, does it need one.

    • Like 2