Everything posted by Monty Linden

  1. I haven't dug into the root cause yet but we see 3 million of these dropped events per day. I suspect most of them are associated with logout or deteriorating network conditions. But it is pretty common.
  2. With these two done, a different experimental mode would be to set the viewer timeout to something that should almost never trigger: 45s or more. This hands connection lifecycle mostly to the simulator, allowing it to know when it can truly send and closing the race. This should give a 'mostly working' implementation of 'at most once' event delivery. It will still break on other failures on the network as well as at TPs/RCs where the LLEventPoll/LLAgentCommunication endpoints may get cycled. Not a fix but an experiment.
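     A minimal sketch of that experimental mode (single region, blocking loop). The URL, the LLSD payload, and the helper names are placeholders, not the viewer's actual code; the only point is the relationship between the two timeouts:

         import requests  # stand-in for the viewer's llcorehttp/libcurl stack

         EVENT_GET_URL = "https://example.invalid/cap/EventQueueGet"  # placeholder capability URL
         VIEWER_TIMEOUT = 45.0  # deliberately longer than the simulator's ~30s timeout
         REQUEST_BODY = ("<llsd><map><key>ack</key><undef />"
                         "<key>done</key><boolean>false</boolean></map></llsd>")

         def poll_once():
             """One long-poll cycle where the simulator, not the viewer, decides when to give up."""
             try:
                 resp = requests.post(
                     EVENT_GET_URL,
                     data=REQUEST_BODY,
                     headers={"Content-Type": "application/llsd+xml"},
                     timeout=VIEWER_TIMEOUT,  # should almost never fire now
                 )
             except requests.Timeout:
                 # Rare by construction, which is what (mostly) closes the race where the
                 # viewer abandons a request the simulator has already committed to.
                 return None
             if resp.status_code == 200 and resp.content:
                 return resp.content  # LLSD body with events, handled elsewhere
             return None  # 499/5xx/empty: just re-poll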
  3. Duplicates are easy to remove for the simple case (viewer and sim in agreement). The case where LLEventPoll gets recreated losing its memory of previous 'ack'/'id' progress still needs work. Hope is to remove duplication at that level where it happens in one place. If it has to happen at the level of individual events/messages, that's duplicated work now and complexity in the future when more are added.
     Not certain about completeness of that list yet. The survey is pretty complete but I haven't checked yet.
     For various reasons, changing the server side is going to be painful (look at lliohttpserver.cpp in the viewer) and the viewer needs to retry on 499/5xx receipt anyway. I think if this were changed, I'd just flip the timeout response to be the same as for a duplicate connection: 200 with naked 'undefined' as LLSD value. But this current behavior is shared with all simulator and dataserver http endpoints as it is based on the common HTTP server 'framework'. There needs to be more awareness of this and our docs don't help.
     Viewer timeout was still the real problem as that's what introduced the race condition for the 'at most once' case. Without it, loss was still possible but would have happened more rarely and no one would ever have found it. So this is a good thing. (Ha)
     It's very deliberately ordered right now but it was doing event combining in the past, overwriting certain existing events when new ones of the same type were queued. The way it was done, it also resulted in out-of-order delivery. But that was removed long ago. Combined with disallowing multiple connections at once, order is now preserved. I don't know that we'd ever want it back. Except for the fact that a lot of data can flow through this pathway and it does overflow. Combining might reduce that. Not really proposing it, just making it clear that there were decisions here. I'm trying to leave enough breadcrumbs for future debates.
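     A sketch of what "removing duplication in one place" could look like on the viewer side, assuming the response 'id' behaves as a per-connection watermark (which, per a later post in this list, the current contract does not actually promise):

         class EventPollDedup:
             """Drop whole responses already seen, keyed on the response 'id' watermark.

             Sketch only: assumes 'id' increases per connection; the hard case noted
             above (LLEventPoll recreated, losing its memory) is exactly when this
             watermark disappears and has to be reset.
             """
             def __init__(self):
                 self.last_id = None

             def accept(self, response_id, events):
                 if self.last_id is not None and response_id <= self.last_id:
                     return []          # duplicate or stale response: dispatch nothing
                 self.last_id = response_id
                 return events          # fresh batch, dispatch to handlers exactly once

             def reset(self):
                 self.last_id = None    # e.g. after a TP/RC cycles the endpoint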
  4. This initial phase is going to be constrained by the http server we have in the simulator. It's based on that pump/pipe model. Without rework, the only timeout response it can formulate is a closed socket.
  5. Information is still coming in (*really* looking forward to AWS' explanation). HTTP-Out was running at elevated levels (including higher error rates) from 21:30slt yesterday until 2:45slt today. That's now running as expected. Teleports remained unreliable (~80% successful) until around 6:30slt today. They've now recovered. Lingering issues are likely and we do want to hear about them. Please contact support.
  6. Just to confirm... AWS got a bit wobbly starting at around 11:00slt. AWS is still working on it. HTTP-Out is heavily impacted. Update: Numbers looking better from 21:30slt.
  7. Event poller. I suspect it's a case where the simulator has tossed the LLAgentCommunication state after a region crossing and avatar drift. But the viewer keeps its LLEventPoll object alive so viewer and sim are now desynchronized. Haven't dug into it yet - future project. Even the 3e6 dropped events per day are just 4 / hour / region when amortized.
  8. Agreed. The overload scenario isn't just theoretical, it is happening to the tune of 3 million per day. I also have 6K per day where viewers send 'ack' values the simulator didn't supply. No idea what that is (yet). (I really need to stop peeling this onion. It's full of brain spiders and they're making me miserable.)
  9. I want to do enough eventually to detect and log that your peer is doing something unexpected. Might just be an 'event lost' flag or a sequence number scheme. Not certain it will go as far as a digest of failure. Open to suggestions. Simulator is subject to amnesia where it necessarily tosses connection information away when an agent departs. When the agent comes back, simulator starts over. Viewer, on the other hand, may keep simulator information around for the life of the session. The resetting needs to be treated correctly.
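     One possible shape for the "sequence number scheme" floated above, purely illustrative; the reset rule and the idea that a simulator restart shows up as sequence 0 are assumptions, not protocol:

         class GapDetector:
             """Track a per-region event sequence number and log when the peer skips or resets."""

             def __init__(self, log=print):
                 self.expected = None
                 self.log = log

             def observe(self, seq):
                 if self.expected is None:
                     self.log(f"first event batch, sequence {seq}")
                 elif seq == 0:
                     # Treat 0 as the simulator starting over after the agent departed.
                     self.log(f"sequence reset by peer, starting over at {seq}")
                 elif seq > self.expected:
                     self.log(f"events lost: expected {self.expected}, got {seq}")
                 elif seq < self.expected:
                     self.log(f"duplicate or replay: expected {self.expected}, got {seq}")
                 self.expected = seq + 1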
  10. Just to keep expectations set correctly: this isn't the only TP bug. It's just one of the most glaring. We won't have 100% TP/RC success after this but it should be better.
  11. This is going to be fun. One of the recent discoveries: simulator only uses the supplied 'id' value for logging. It has no functional effect in the current event queue handling scheme. I have some more behavioral probing to do, particularly around the bizarre handling of dropped connections by apache, but a proposal is coming. This is going to require TPV testing against test regions given that things like 'id' will become semantically meaningful again.
  12. The goal will be no changes to viewers unless they choose to make them. I just haven't had time to dig out how bad the story is on that side of the wire. Man, I hate overconstrained problems...
  13. My plan for a plan is roughly:
      Phase 1. Simulator changes, compatible with viewers, to make it more robust. Might include temporary change in timeout to 25 or 20 seconds.
      Phase 2. Robust event transfer. Might require viewer changes.
      There's a change in here from "at most once" logic (i.e. viewers see an event at most once but possibly never) to "at least once" (event may be sent several times under the same 'id'/'ack' and viewer needs to expect that). Don't know where this best fits in. I'm hoping P1 but that might break something. Don't know if we can re-enable the old message. Or do resend asks. I really want to just make forward fixes and not add patches to the patches.
  14. Probably my fault. The policy group for long-poll should probably not attempt retries. It hides things that should have been brought up to viewer awareness.
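      Not llcorehttp's actual API, just a sketch of the idea: give the long-poll policy class zero automatic retries so failures surface to the caller instead of being quietly absorbed.

          from dataclasses import dataclass

          @dataclass
          class PolicyClass:
              name: str
              max_retries: int        # automatic retries inside the HTTP layer
              connect_timeout: float  # seconds

          # Hypothetical policy table: ordinary capability traffic may retry quietly,
          # but the long-poll class should fail loudly so the viewer can react.
          POLICIES = {
              "default":   PolicyClass("default", max_retries=5, connect_timeout=30.0),
              "long_poll": PolicyClass("long_poll", max_retries=0, connect_timeout=30.0),
          }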
  15. You might be monitoring at the wrong level. Try llcorehttp or wireshark/tcpdump as a second opinion. The simulator will not keep an event-get request alive longer than 30s. If you find it is, it is either retries masking the timeout or yet more simulator bugs. Or both. There are always simulator bugs. (Side note: once 'Agent Communication' is established, events queue regardless of whether or not an event-get is active. Up to a high-water level. So it is not required that an event-get be kept active constantly. It is required that no more than one be active per region - this is not documented well.)
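      The "no more than one event-get active per region" rule could be enforced with something like the sketch below; issue_event_get() is a hypothetical callable doing the actual request, not an existing viewer function.

          import threading

          class RegionPollers:
              """Guarantee at most one in-flight event-get per region handle."""

              def __init__(self, issue_event_get):
                  self.issue_event_get = issue_event_get  # callable(region_handle) -> events
                  self.in_flight = set()
                  self.lock = threading.Lock()

              def poll(self, region_handle):
                  with self.lock:
                      if region_handle in self.in_flight:
                          return None                     # one already outstanding; do nothing
                      self.in_flight.add(region_handle)
                  try:
                      return self.issue_event_get(region_handle)
                  finally:
                      with self.lock:
                          self.in_flight.discard(region_handle)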
  16. Even that requires negotiation after going that far down the TP path. For myself, I'm looking to turn failure into success rather than into disappointment or a trip to the Bardo.
  17. There are several failure scenarios including one where the TeleportFinish message is queued but the simulator refuses to send it for reasons unknown yet. A more elaborate scenario is this:
      [Viewer] Request timeout fires. Begins to commit to connection tear down.
      [Sim] Outside of Viewer's 'light cone', queues TeleportFinish for delivery, commits to writing events to outstanding request.
      [Sim] Having prepared content for outstanding request, declares events delivered and clears queue.
      [Apache] Adds more delay and reordering between Sim and Viewer because that's what apache does.
      [Viewer] Having committed to tear down, abandons request dropping connection and any TeleportFinish event, re-issues request.
      [Viewer] Never sees TeleportFinish, stalls, then disconnects from Sim.
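      The same race compressed into a sketch: the simulator's "delivered" bookkeeping and the viewer's abandonment are independent decisions with no acknowledgement between them. Names and structure here are illustrative only.

          # Simplified model of the failure above: no network, just the two decisions.
          sim_queue = ["TeleportFinish"]
          viewer_received = []

          # [Viewer] timeout fires; it commits to tearing the connection down.
          viewer_abandoned = True

          # [Sim] unaware of that, writes the queued events to the outstanding request
          # and immediately marks them delivered.
          response_in_flight = list(sim_queue)
          sim_queue.clear()                    # sim now believes delivery happened

          # [Viewer] drops the connection, so the in-flight response is never read.
          if not viewer_abandoned:
              viewer_received.extend(response_in_flight)

          # Neither side holds the event any more; the viewer stalls and disconnects.
          assert "TeleportFinish" not in viewer_received and not sim_queue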
  18. Well.... you don't really know that unless what you've received matches what was intended to be sent. And it turns out, they don't necessarily match and retries are part of the problem. I already know event-get is one mode of TP/RC failure.
  19. No, simpler, as in HTTP. To support very basic 'If-Match'/'If-None-Match' tests. The one special case is where the simulator side resets and this may need to be acted upon.
      This keeps getting worse the more I look. So *both* viewer and simulator implement a 30-second timeout on these requests. A guaranteed race condition. Simulator's spans a different part of the request lifecycle than does the viewer's curl timeout. More variability. Simulator can send at least three different responses related to its timeout actions:
      • A 200 status with empty body and no content-type header
      • A 200 status with llsd content-type and a body of: '<llsd><undef /></llsd>'
      • A 499 status with html content-type and a body I haven't looked at but it's about 400 bytes.
      Between viewer and simulator is an apache stack that adds a layer of interpretation, time dependencies, and other factors to something that's already a hodge-podge. If a 502 is returned anywhere, it's probably an accident.
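      A viewer-side classifier for those observed timeout responses might look like the sketch below. The safe reaction to every variant is "no events, issue a new request"; the status and body details are as described above, not a published contract.

          def classify_timeout_response(status, content_type, body):
              """Map the simulator/apache timeout variants onto one viewer action."""
              if status == 200 and not body:
                  return "no_events"   # empty body, no content-type header
              if status == 200 and content_type and content_type.startswith("application/llsd") and b"<undef" in body:
                  return "no_events"   # '<llsd><undef /></llsd>' placeholder body
              if status == 499:
                  return "no_events"   # ~400-byte html error page; ignore the body
              if status == 502:
                  return "no_events"   # apache speaking for a silent upstream
              return "events_or_error" # anything else gets real handling

          # Every "no_events" outcome should simply lead to re-issuing the long poll.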
  20. Well, LLSD is meant to allow slop (the C++ and Python implementations particularly). But there are arguments other than id/ack in the body. They're not currently used in the SL viewer so documentation is diddly squat.
      The whole id/ack thing is badly done. It should have been treated more like ETag. As it is, old events can get a new id, and other sins. I think the incremental nature is accident, not contract. And it will reset to 0 for a region/agent pair in certain circumstances (including that viewer-initiated reset). So, beware. Hope to document this.
      And it's not necessarily ordered-reliable. The event queue was very deliberately designed to implement event compaction/overwrite in place and, at some point in time, it was neither ordered (between events) nor reliable. Someone disabled that UDP-like behavior at some point and I'm not certain why, or whether that has contract status at this point. In fact, it is still unreliable now as we'll drop queue overruns on the floor. They do not get regenerated unless a higher-level application protocol times out and causes a regeneration from a new query or other re-synchronizing action.
      Future hope is to homologate this in-place and as-is with minimal optional extensions that would also allow each end to monitor how badly the other side is messing up.
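      For concreteness, a sketch of roughly what an EventQueueGet exchange looks like on the wire. The field names ('ack', 'done', 'events', 'id') match what is commonly observed, but per the post above the exact shape should be treated as undocumented.

          import xml.etree.ElementTree as ET

          # Request: the viewer acknowledges the last 'id' it saw; 'done' reportedly
          # tears down the server-side state (see the next post).
          REQUEST = ("<llsd><map>"
                     "<key>ack</key><integer>12</integer>"
                     "<key>done</key><boolean>false</boolean>"
                     "</map></llsd>")

          # Response: a batch of events plus a new 'id' watermark (which can reset to 0).
          RESPONSE = ("<llsd><map>"
                      "<key>events</key><array>"
                      "<map><key>message</key><string>TeleportFinish</string>"
                      "<key>body</key><map /></map>"
                      "</array>"
                      "<key>id</key><integer>13</integer>"
                      "</map></llsd>")

          def response_id(llsd_xml):
              """Pull the 'id' integer out of a response body (no validation, sketch only)."""
              children = list(ET.fromstring(llsd_xml).find("map"))
              for i, node in enumerate(children):
                  if node.tag == "key" and node.text == "id":
                      return int(children[i + 1].text)
              return None

          print(response_id(RESPONSE))   # -> 13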
  21. Pulling out the true story is going to require reverse engineering and documentation on our part. Henri's logging is something that should probably have been available in the SL viewer from day 1. But here we are. And as always, never mistake anecdotes for a contract. You might see the same pattern ten times in a row but that doesn't necessarily mean there's a temporal constraint. Especially in our code.
      You may not need quite as much deferred logic as hinted at. Once you send the UCC into the neighbor, there will be a RegionHandshake in your future. You may be able to delay the RHR (I mean, it's async anyway so pretend you're on Mars). What I'd meant before was that after sending UCC (and probably better, after RH receipt), send AgentUpdates into the host region. This is one of several paths that drives a private protocol between simulators causing a neighbor to emit the EAC message, which is routed back via the host region and which you need in order to begin HTTP operations with the neighbor in anger. As always, you shouldn't entirely trust me on this. I haven't walked through all of this to verify what's accidental and what's a true constraint or dependency.
      HTTP was bolted onto the old UDP message system. It's a Frankenstein monster. 'AgentCommunication' isn't just an abstract concept in the simulators; it's a physical class, LLAgentCommunication, and they are tied together (also hints that the people who did this are not the ones who should have been allowed to do it). This class 'encapsulates' all the capabilities as well as the low-level plumbing for HTTP bridging of LLMessage comms. This thing isn't actually tied to an agent strongly. Nor to circuits (it doesn't really know about them). It can live on after an agent has departed. And yet the viewer has the ability to tear down this entity via arguments to 'EventQueueGet' requests. (This resets the 'id'/'ack' value on the query among other things.) It's quite the thing.
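      Read literally, the ordering sketched above comes out something like the list below. Step names are shorthand for the messages mentioned in the post (UseCircuitCode, RegionHandshake, RegionHandshakeReply, AgentUpdate, EstablishAgentCommunication); which of them are true dependencies versus accidents is, as stated, unverified.

          # One plausible reading of the neighbor-connection path, not a verified contract.
          NEIGHBOR_CONNECT_STEPS = [
              "send UseCircuitCode (UCC) to the neighbor over UDP",
              "receive RegionHandshake (RH) from the neighbor",
              "send RegionHandshakeReply (RHR) - can likely be deferred, it is async anyway",
              "send AgentUpdate into the host region",
              "host<->neighbor private protocol runs (opaque to the viewer)",
              "receive EstablishAgentCommunication (EAC), routed back via the host region",
              "begin HTTP capability traffic (event-get, etc.) with the neighbor",
          ]

          def next_step(completed_count):
              """Treat the list order as a dependency chain and report what comes next."""
              if completed_count < len(NEIGHBOR_CONNECT_STEPS):
                  return NEIGHBOR_CONNECT_STEPS[completed_count]
              return None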
  22. That's our very special 'Class 10' channel with 4x script memory, SDRAM hand-matched for timing, and all the wiring using defect-free, single-crystal copper.