
Monty Linden

Everything posted by Monty Linden

  1. Aditi always has something going on these days. GLTF is the big one, but there are others as well. Try a different region on a different simhost, and try the Linden viewer. FS local mesh problems kind of point to the local machine, so check the viewer logs. If all of that fails, Jira!
  2. What @Henri Beauchamp said: report those bugs. The service endpoint that provides the offline IMs to the viewer on login was designed to be destructive on read. So no viewer can give you the option not to delete. (I don't like that design - part of our tradition of assuming all http requests succeed if the service can queue a response.)
  3. If you mean unix/linux/posix pipes, those are necessarily local to a host. For socket-based networking, we use a bit of everything.
  4. [Okay, sorry for the delay. I've been working through the cases and reformulating the problem, and I'm starting to like what I have. Thanks to everyone who's been whiteboarding with me, especially @Henri Beauchamp and @animats. It's been extremely valuable and enjoyable. Now, on to the next refinement, with responses to earlier comments...]

Reframing the Problem

Two race conditions exist in the system:

1. Inner race. Involves the lifecycle of a single request. The race starts as soon as the viewer commits to timing out an EventQueueGet operation. Even if the entire response, less one byte, is in viewer buffers, that response will be lost, as will any response written later, until the race closes. The race closes when either:

- the simulator times out the request and will not send any event data until a new request comes in (this produces a closed socket between the simulator and the Apache stack), or
- the simulator starts processing a new request from the viewer, which causes early completion of the outstanding request; this is sent as a 200/LLSD::undef response.

(Note: the race is also started by other entities. There are actions within the Apache stack, and on the network generally, that can end the request early without the simulator being aware.) This race is due to application protocol design and the use of unreliable HTTP modes (timeouts, broken sockets).

2. Outer race. Involves the lifecycle of the HTTP endpoint classes (LLEventPoll and LLAgentCommunication) which retain memory of prior progress. Instances of these classes can disappear and be recreated, without past memory, and without informing the peer. This causes a jump in ack/id numbers and the possibility of duplicate events being sent. This race is due to implementation decisions, i.e. endpoint classes that forget and no synchronization protocol between them.

Before and After

Under "at most once" event delivery, 1. is a source of lost events and 2. doesn't matter: events are forgotten on every send, so the outer race adds no additional failures (loss or duplication).

Under "at least once" event delivery, 1. still exists, but retaining events on the simulator side until a positive acknowledgement of receipt arrives corrects the lost response. 2. then becomes a path for duplicated event processing in the viewer. A simulator will have sent events for an 'ack = N / id = N+1' exchange when the viewer moves out of the region and eventually releases its endpoint. The viewer may or may not have received that response. When the viewer comes back, it does so with a new LLEventPoll and 'ack = 0' as its memory, and the simulator faces a dilemma: should the viewer be sent a duplicate 'id = N+1' response, or should the simulator assume that payload has been processed and wait until the 'id = N+2' payload is ready?

Simulator sending a 200 response before timeout. This doesn't really close the race. The race is started outside the simulator, usually by the viewer, but Apache or devices on the network path can also start it. The simulator can only close the race by electing not to send event data until a new request comes in. Trying to end the race before it starts isn't really robust: libcurl timers can be buggy, someone changes a value without knowing why, Apache does its thing, a sea-cable ingress decides to shut a circuit down, plus network variability and packet loss, and everyone in the southern hemisphere. And it locks the various timeout values into the API contract. Solution: embrace the race and make it harmless via the application protocol.
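For readers less familiar with the exchange under discussion, here is a minimal sketch of the request/response shapes implied above, assuming an LLSD map with 'ack'/'done' going up and 'id'/'events' coming back; it is illustrative only, not actual LLEventPoll code.

```cpp
// Illustrative sketch of the EventQueueGet exchange described above; the
// field names follow this thread's 'ack'/'id' discussion, not any specific
// viewer or simulator revision.
#include "llsd.h"   // LLSD and S32 come from the viewer's llcommon

LLSD buildEventPollRequest(S32 last_id_processed, bool leaving_region)
{
    LLSD body = LLSD::emptyMap();
    // 'ack' echoes the last 'id' the viewer processed; 0/undef on a fresh
    // LLEventPoll, which is exactly what opens the outer race.
    body["ack"] = (last_id_processed > 0) ? LLSD(last_id_processed) : LLSD();
    // 'done' asks the simulator to tear down its endpoint (the 'done'
    // mechanism discussed later in this thread).
    body["done"] = leaving_region;
    return body;
}

// The simulator's reply is a map along the lines of { "id": N+1, "events":
// [ ... ] }.  The inner race is the window in which that reply is on the wire
// or in viewer buffers after the viewer has already decided to time the
// request out: under at-most-once delivery the 'id = N+1' payload is lost.
```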
Updated Phased Proposal

Phase 1. As before, with one change to try to address the outer race scenarios. The simulator will retain the last 'ack' received from the viewer. If it receives a request with 'ack = 0' and 'last_ack != 0', this is a signal that the viewer has lost synchronization with the simulator for whatever reason. The simulator will drop any already-sent events, advance any unsent events, and increment its 'id' value. 'last_ack' then becomes '0' and normal processing (send current events, or wait until events arise or the timeout fires) continues. This potentially drops unprocessed events, but that isn't any worse than the current situation. With this, the inner race is corrected by the application protocol. There's no sensitivity to timeouts anywhere in the chain: viewer, simulator, Apache, internet. Performance and delivery latency may be affected, but correctness won't be.

Phase 2. This becomes viewer-only. Once the race conditions are embraced, the fix for the outer race is for the viewer to keep memory of the region conversation forever. A session-scoped map of 'region -> last_ack' values is maintained in LLEventPoll static data, so any conversation can be resumed at the correct point (a sketch of this map appears at the end of this post). If the simulator resets, all events are wiped anyway, so duplicated delivery isn't possible; the viewer just takes up the new 'id' sequence. This should have been done from the first release. Additionally: fix the error handling to treat the full set of 499/5xx codes (as well as curl-level disconnects and timeouts) as normal behavior. Maybe issue requests slowly while networks recover or other parts of the viewer decide to surrender and disconnect from the grid. Add logging and defensive coding to avoid processing of duplicated payloads (which should never happen). There's one special case that won't be handled correctly: when the viewer gets to 'ack = 1', moves away from the region, then returns, resetting the simulator to 'id = 0'. In this case, the first response upon return will be 'id = 1' (matching the stored 'ack = 1'), which *should* be processed. May just let this go. Also, add diagnostic logging for the new metadata fields being returned.

Forum Thread Points and Miscellany Addressed

3 million repeated acks. Just want to correct this: the 3e6 figure is for overflows, when the event queue in the simulator reaches its high-water mark and additional events are simply dropped. Any request from the viewer that makes it in is going to get the full queue in one response.

Starting and closing the race in the viewer. First, a warning... I'm not actually proposing this as a solution, but it is something someone can experiment with. Closing the race as outlined above only happens in the simulator. BUT the viewer can initiate it by launching a second connection into the cap, say 20s after the first. The simulator will cancel an unfinished, un-timed-out request with 200/LLSD::undef and then wait on the new request. The viewer can then launch another request after 20s. Managing this oscillator in the viewer would be ugly and still wouldn't deal with some cases.

Fuzzing Technique. You can try some service abuse by capturing the Cap URL and bringing it out to a curl command. Launch POST requests with well-formed LLSD bodies (a map with an 'ack = 0' element) while a viewer is connected. This will retire an outstanding request in the simulator. That, in turn, comes back to the viewer, which will launch another request, which retires the curl command's request.
TLS 1.3 and EDH. There is a Jira to enable key logging in libcurl/openssl to aid in wire debugging of https: traffic. This will require some other work first to unpin the viewer from the current libcurl, but it is on the roadmap.

If we do open up this protocol to change in the future (likely under a capability change), I would likely add a restart message to the event suite, in both directions. It declares to the peer that this endpoint has re-initialized, has no memory of what has transpired, and that the peer should reset itself as appropriate. Likely with some sort of generation number on each side.
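As promised above, a minimal sketch of the Phase 2 session-scoped map, assuming a region handle as the key; the names here (sLastAckByRegion, resumeAck, rememberAck) are hypothetical, not existing viewer symbols.

```cpp
// Sketch only: session-scoped memory of per-region 'ack' progress so a
// rebuilt LLEventPoll resumes where the old one left off instead of
// restarting at 'ack = 0'.  All names here are hypothetical.
#include <cstdint>
#include <map>

using RegionHandle = uint64_t;                        // assumed region key
static std::map<RegionHandle, int> sLastAckByRegion;  // lives for the session

// Called when constructing a poller for a region: returns the resume point.
int resumeAck(RegionHandle region)
{
    auto it = sLastAckByRegion.find(region);
    return (it != sLastAckByRegion.end()) ? it->second : 0;  // 0 == no memory
}

// Called after each successfully processed response carrying 'id'.
void rememberAck(RegionHandle region, int id)
{
    sLastAckByRegion[region] = id;
}
```

With something like this in place, a rebuilt LLEventPoll re-enters the conversation at its true position rather than at 'ack = 0'.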
  5. There's a lot going on on our side both before *and* after the TeleportFinish. Before the message there's a lot of communication between simhosts and other services. It is very much a non-local operation (region crossings have some optimizations in this area). After the message, some deferred work continues for an indefinite number of frames. Yeah, there are more races in there - I started seeing their fingerprint when looking at London City issues. This is a bit of a warning... the teleport problems aren't going to be a single-fix matter. The problem in this thread (message loss) is very significant. But there's more lurking ahead.
  6. There are other players in here. I've deliberately avoided mention of Apache details; however, it will contribute its own timeout-like behaviors as well as classic Linden behaviors (returning 499, for example). The modules in here aren't candidates for rework for a number of reasons, so general principles must apply: expect broken sockets (non-HTTP failures), 499s, and flavors of 5xx, and don't treat these as errors. That the LL viewer does is really a bug.

The viewer-side timeouts enable hang and lost-peer detection on a faster timescale than, say, TCP keepalives would (not that that's the right tool here). Slowed retry cycles might be better, then allowing the UDP circuit timeout to drive the decision to declare the peer gone. But that didn't happen. Whatever happens, I really want it to work regardless of how many entities think they should implement timeouts. Timeouts are magic numbers, and the data transport needs to be robust regardless of the liveness checking. (I do want to document this better. This is something present on all capabilities. For most caps it won't matter, but that depends on the case.)

The situation isn't entirely grim with the existing viewer code, but there is still a duplication opportunity. First, the optimistic piece. The 'ack'/'id' value is supposed to have meaning, but the current LL viewer and server do absolutely nothing with it except attempt a safe passthru from server to viewer and back to server. With the passthru and the enforcement of, at most, a single live request to the region, there is good agreement between viewer and server on what the viewer has processed, as long as the 'ack' value held by the viewer isn't damaged in some way (reset to 0, made an 'undef', etc.). There is one window: during successful completion of a request with 'id = N + 1' and before a new request can be issued, the server's belief is that the viewer is at 'ack = N'. If the viewer walks away from the simulator (TP, RC) without issuing a request with 'ack = N + 1', viewer and server lose sync.

So, the four cases:

1. Viewer (LLEventPoll) at 'ack = 0/undef' and server (LLAgentCommunication) at 'id = 0'. Reset/startup/first-visit condition; everything is correct. No events have been sent or processed.

2. Viewer at 'ack = N' and server at 'id = N or N + 1'. Normal operation. If the viewer issues a request, the server knows the viewer has processed the 'id = N' case and does not resend. Any previous send of the 'id = N + 1' case has been dropped on the wire due to the single-request constraint on the service.

3. Viewer at 'ack = N' and server at 'id = 0'. Half-reset. The viewer walked away from the region, then later returned, and the server re-constructed its endpoint but the viewer didn't. Safe from duplication, as the event queue in the server has been destroyed and cannot be resent. The viewer does have to resync with the server by taking on the 'id = 1' payload that will eventually come to it. (We could even introduce a special event for endpoint restart to signal that the viewer should resync, but I don't think this is needed.)

4. Viewer at 'ack = 0' and server at 'id = N'. Half-reset, opposite case. The viewer walked away from the region, then re-constructed its endpoint (LLEventPoll), but the server didn't. This is the bad case. The viewer has no knowledge of what it has processed, and the server has a set of events, 'id = N', which the viewer may or may not have processed in that aforementioned window.
I believe that last case is the only real exposure to duplicated events we have, given a) the various request constraints and b) no misbehavior by viewer or server during this protocol. So what are the options? Here are some (a sketch of the case handling follows this post):

- Case 4. doesn't happen in practice. There's some hidden set of constraints that doesn't allow it; the system transitions to cases 1. or 3. and 4. never arises. Hooray!
- There is the 'done' mechanism in the caps request body to terminate the server's endpoint. It is intended that the viewer use this in a reliable way when it knows it's going to give up its endpoint, i.e. it forces a transition to case 1. if 4. is a possibility. I think this control is buggy and has races, so there are concerns. Requires a viewer change.
- The viewer keeps 'last ack' information for each region forever. When the viewer walks away from a region and comes back, it reconstructs its endpoint with this 'last ack' info. So instead of coming back into case 4., it enters 2. or 3. and we're fine again. Requires a viewer change.
- Use an 'ack = 0' request as an indicator that the viewer has reset its knowledge of the region, and assume that any current payload ('id = N') has either been processed or can be dropped as uninteresting because the viewer managed to return here by some means. The next response from the server will be for 'id = N + 1'. No viewer change, but viewers might be startled. TPV testing important.
- Magical system I don't know about.

I would really like for that first bullet point to be true, but I need to do some digging.
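As mentioned above, here is a sketch of how the simulator side might classify the four states and apply the "'ack = 0' means the viewer reset" option; the enum and function names are illustrative, not LLAgentCommunication code.

```cpp
// Sketch of the four viewer/server states above and the 'treat ack = 0 as a
// viewer reset' option.  Illustrative names only.
enum class SyncState { FreshStart, Normal, ServerReset, ViewerReset };

SyncState classify(int viewer_ack, int server_id)
{
    if (viewer_ack == 0 && server_id == 0) return SyncState::FreshStart;  // case 1
    if (viewer_ack != 0 && server_id != 0) return SyncState::Normal;      // case 2
    if (viewer_ack != 0 && server_id == 0) return SyncState::ServerReset; // case 3
    return SyncState::ViewerReset;                                        // case 4
}

// For case 4, the option described above: assume the pending 'id = N' payload
// was either processed or is uninteresting, drop it, and continue at N + 1.
void onViewerReset(int& server_id /*, pending event queue ... */)
{
    // drop already-sent events here, advance anything unsent
    ++server_id;
}
```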
  7. I haven't dug into root cause yet but we have 3 million per day. I suspect most of these are associated with logout or deteriorating network conditions. But it is pretty common.
  8. With these two done, a different experimental mode would be to set the viewer timeout to something that should almost never trigger: 45s or more. This hands the connection lifecycle mostly to the simulator, allowing it to know when it can truly send, and closes the race. This should give a 'mostly working' implementation of 'at most once' event delivery. It will still break on other failures on the network, as well as at TPs/RCs where the LLEventPoll/LLAgentCommunication endpoints may get cycled. Not a fix, but an experiment.
  9. Duplicates are easy to remove for the simple case (viewer and sim in agreement). The case where LLEventPoll gets recreated, losing its memory of previous 'ack'/'id' progress, still needs work. The hope is to remove duplication at that level so it happens in one place; if it has to happen at the level of individual events/messages, that's duplicated work now and complexity in the future when more are added (a sketch of that single-point check follows this post).

Not certain about the completeness of that list yet. The survey is pretty complete, but I haven't checked yet.

For various reasons, changing the server side is going to be painful (look at lliohttpserver.cpp in the viewer), and the viewer needs to retry on 499/5xx receipt anyway. I think if this were changed, I'd just flip the timeout response to be the same as for a duplicate connection: 200 with a naked 'undefined' as the LLSD value. But this current behavior is shared with all simulator and dataserver HTTP endpoints, as it is based on the common HTTP server 'framework'. There needs to be more awareness of this, and our docs don't help.

The viewer timeout was still the real problem, as that's what introduced the race condition for the 'at most once' case. Without it, loss was still possible but would have happened more rarely, and no one would ever have found it. So this is a good thing. (Ha)

It's very deliberately ordered right now, but it was doing event combining in the past, overwriting certain existing events when new ones of the same type were queued. The way it was done, it also resulted in out-of-order delivery. But that was removed long ago. Combined with disallowing multiple connections at once, order is now preserved. I don't know that we'd ever want combining back, except for the fact that a lot of data can flow through this pathway and it does overflow; combining might reduce that. Not really proposing it, just making it clear that there were decisions here. I'm trying to leave enough breadcrumbs for future debates.
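As mentioned above, a minimal sketch of what "duplication removed in one place" could look like on the viewer side, assuming responses carry the 'id' discussed in this thread; names are illustrative.

```cpp
// Sketch of single-point duplicate suppression in the poller, so individual
// event handlers never need to de-duplicate.  Illustrative only.
bool shouldProcessResponse(int response_id, int& last_processed_id)
{
    if (response_id != 0 && response_id <= last_processed_id)
    {
        // Already seen under at-least-once delivery: drop (and log) here.
        return false;
    }
    last_processed_id = response_id;
    return true;
}
```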
  10. This initial phase is going to be constrained by the http server we have in the simulator. It's based on that pump/pipe model. Without rework, the only timeout response it can formulate is a closed socket.
  11. Information is still coming in (*really* looking forward to AWS' explanation). HTTP-Out was running at elevated levels (including higher error rates) from 21:30slt yesterday until 2:45slt today. That's now running as expected. Teleports remained unreliable (~80% successful) until around 6:30slt today. They've now recovered. Lingering issues are likely and we do want to hear about them. Please contact support.
  12. Just to confirm... AWS got a bit wobbly starting at around 11:00slt. AWS is still working on it. HTTP-Out is heavily impacted. Update: Numbers looking better from 21:30slt.
  13. Event poller. I suspect it's a case where the simulator has tossed the LLAgentCommunication state after a region crossing and avatar drift. But the viewer keeps its LLEventPoll object alive so viewer and sim are now desynchronized. Haven't dug into it yet - future project. Even the 3e6 dropped events per day are just 4 / hour / region when amortized.
  14. Agreed. The overload scenario isn't just theoretical, it is happening to the tune of 3 million per day. I also have 6K per day where viewers send 'ack' values the simulator didn't supply. No idea what that is (yet). (I really need to stop peeling this onion. It's full of brain spiders and they're making me miserable.)
  15. I want to do enough eventually to detect and log that your peer is doing something unexpected. Might just be an 'event lost' flag or a sequence number scheme. Not certain it will go as far as a digest of failure. Open to suggestions. Simulator is subject to amnesia where it necessarily tosses connection information away when an agent departs. When the agent comes back, simulator starts over. Viewer, on the other hand, may keep simulator information around for the life of the session. The resetting needs to be treated correctly.
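A possible shape for the detection and logging mentioned above, assuming a sequence or 'id' value to compare; the logging macros follow the viewer's style, but the function itself is hypothetical and no loss digest exists in the current protocol.

```cpp
// Sketch of peer-misbehavior detection via sequence comparison.  Hypothetical.
#include "llerror.h"   // LL_WARNS / LL_ENDL in the viewer codebase

void checkPeerSequence(int expected_id, int received_id)
{
    if (received_id > expected_id)
    {
        LL_WARNS("EventPoll") << "Peer skipped " << (received_id - expected_id)
                              << " id value(s); events may have been lost" << LL_ENDL;
    }
    else if (received_id < expected_id)
    {
        LL_WARNS("EventPoll") << "Peer id went backwards; peer probably reset" << LL_ENDL;
    }
}
```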
  16. Just to keep expectations set correctly: this isn't the only TP bug. It's just one of the most glaring. We won't have 100% TP/RC success after this but it should be better.
  17. This is going to be fun. One of the recent discoveries: simulator only uses the supplied 'id' value for logging. It has no functional effect in the current event queue handling scheme. I have some more behavioral probing to do, particularly around the bizarre handling of dropped connections by apache, but a proposal is coming. This is going to require TPV testing against test regions given that things like 'id' will become semantically meaningful again.
  18. The goal will be no changes to viewers unless they choose to. I just haven't had time to dig out how bad the story is on that side of the wire. Man, I hate overconstrained problems...
  19. My plan for a plan is roughly:

Phase 1. Simulator changes, compatible with viewers, to make it more robust. Might include a temporary change in timeout to 25 or 20 seconds.

Phase 2. Robust event transfer. Might require viewer changes. There's a change in here from "at most once" logic (i.e. viewers see an event at most once but possibly never) to "at least once" (an event may be sent several times under the same 'id'/'ack' and the viewer needs to expect that). Don't know where this best fits in; I'm hoping Phase 1, but that might break something.

Don't know if we can re-enable the old message, or do resend asks. I really want to just make forward fixes and not add patches to the patches.
  20. Probably my fault. The policy group for long-poll should probably not attempt retries. It hides things that should have been brought up to viewer awareness.
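A sketch of the fix implied here, assuming the long-poll request is issued through llcorehttp and that its per-request options expose a retry-count setter; treat the exact header and API as assumptions, not verified signatures.

```cpp
// Sketch only: configure the long-poll request so the transport layer does
// not retry on its own, letting failures surface to LLEventPoll where they
// belong.  The HttpOptions setter is an assumption about llcorehttp.
#include "httpoptions.h"   // LLCore::HttpOptions (llcorehttp)

LLCore::HttpOptions::ptr_t makeEventPollOptions()
{
    LLCore::HttpOptions::ptr_t options(new LLCore::HttpOptions());
    options->setRetries(0);   // surface 499/5xx and disconnects to the poller
    return options;
}
```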
  21. You might be monitoring at the wrong level. Try llcorehttp or wireshark/tcpdump as a second opinion. The simulator will not keep an event-get request alive longer than 30s. If you find it is, it is either retries masking the timeout or yet more simulator bugs. Or both. There are always simulator bugs. (Side note: once 'Agent Communication' is established, events queue regardless of whether or not an event-get is active. Up to a high-water level. So it is not required that an event-get be kept active constantly. It is required that no more than one be active per region - this is not documented well.)
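To make the queuing behavior in that side note concrete, a rough sketch of a high-water-limited event queue; the class, member names, and limit value are made up for illustration and are not simulator code.

```cpp
// Sketch of queue-until-high-water behavior: events accumulate whether or not
// a poll request is outstanding, and anything past the limit is dropped (the
// overflow counted elsewhere in this thread).  Names and limit are illustrative.
#include <deque>
#include "llsd.h"

class EventQueueSketch
{
public:
    static constexpr size_t HIGH_WATER = 100;   // made-up limit

    void enqueue(const LLSD& event)
    {
        if (mPending.size() >= HIGH_WATER)
        {
            ++mDropped;                          // overflow: event is lost
            return;
        }
        mPending.push_back(event);
    }

private:
    std::deque<LLSD> mPending;
    size_t mDropped = 0;
};
```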