Everything posted by Monty Linden

  1. Haven't had time to run down protocol details so that they can be documented. Still a thing I want to see. I've been doing dev test of the new EventQueue logic, among other things. I can confirm that the outer race condition I talked about before (LLEventPoll destroyed in viewer, LLAgentCommunication retained in simulator) does, in fact, occur with unfortunate results. So I have some work ahead of me...
  2. Speed of light is the absolute minimum real time. Cable round trip to Oregon and back is about 20,000km. At 300km/ms that's an absolute minimum of 67ms with practical minimum double that (sea cables, routers, etc.). The 600ms is bad... jungle in the southern hemisphere bad. But the region is looking healthy with excellent frame rates in the past 48 hours. So I'd look to the network for the cause.
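For anyone who wants to check the arithmetic, here's a minimal sketch (plain Python, nothing Linden-specific). The 20,000 km round trip and 300 km/ms vacuum figure come from the post above; the ~200 km/ms fiber figure is my own assumption.

```python
# Lower bound on round-trip time from propagation delay alone.
def min_rtt_ms(round_trip_km: float, km_per_ms: float) -> float:
    return round_trip_km / km_per_ms

print(min_rtt_ms(20_000, 300))  # ~66.7 ms: the vacuum-speed absolute minimum
print(min_rtt_ms(20_000, 200))  # ~100 ms: closer to a fiber-only floor;
                                # routers and sea-cable detours add more
```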
  3. Not too surprised. But for the record, I believe the rule is to send it only to the main region every time you arrive after TP or RC. Main then portions out a chunk of the bandwidth to children which cannot be set directly by the viewer (will just be ignored).
  4. BTW, looking to get something up on Aditi for testing. Be on the lookout for references to SRV-607 or DRTSIM-577.
  5. This is actually how the simulator is made aware of the viewer's bandwidth setting. It is a terrible mechanism and the weights used on both sides are nonsense these days. I have outstanding Jiras to update this. I don't think this is related to the EAC message problem but it is still desired. (Defaults and limits keep things running regardless.)
  6. The path in the simulator isn't direct so not surprised it's touchy. I haven't had a chance to identify everything that might get in the way of poking the neighbors. Why on earth anyone would design a stitching protocol this way I can't imagine....
  7. Yep, the sooner the better. If you can, include the 'cef_log.txt' logging file from the usual place. This file isn't rotated or truncated, which is another bug, so please make a copy and edit it deleting all the unneeded prologue (that's a 'MMDD' prefix on the lines). You may find the answer in the difference between that prologue and current sessions if you want to dig around.
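For anyone who wants to automate that trimming, a hedged sketch follows. It assumes only what the post states (an 'MMDD' date prefix on each line); the file names and cutoff date are placeholders, not official tooling, and it works on a copy of the log as requested.

```python
# Sketch: trim the prologue from a *copy* of cef_log.txt before attaching it.
# Assumes each dated line starts with an 'MMDD' prefix, per the post above.
def trim_prologue(src: str, dst: str, keep_from_mmdd: str) -> None:
    """Copy src to dst, dropping dated lines older than keep_from_mmdd."""
    with open(src, "r", errors="replace") as fin, open(dst, "w") as fout:
        for line in fin:
            prefix = line[:4]
            # Keep undated lines plus anything at or after the cutoff date.
            if not prefix.isdigit() or prefix >= keep_from_mmdd:
                fout.write(line)

# Example: keep entries from Nov 03 onward. (String comparison is fine for
# MMDD within one year; it misbehaves across a year boundary.)
# trim_prologue("cef_log_copy.txt", "cef_log_trimmed.txt", "1103")
```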
  8. A service connected to the login process had to be rolled back yesterday. Login is restored but if anyone has ongoing failures, they should file a support ticket.
  9. Try using the uninstaller from the Start Menu then reinstalling the above package.
  10. I'm not certain about the planning and rationale involved (if any). But that's the technical source of the problem. Speak up if this is truly a problem (Jira, here, the user groups). Win7 is obsolescent at this point (Steam will stop functioning on it in 54 days).
  11. By any chance, are you running Windows 7? https://releasenotes.secondlife.com/viewer/6.6.16.6566955269.html
  12. This is where the delay shows up. For Dore, from the server side: at 06:06:51, the seed capability (cap-granting cap) is constructed and should be available in an EAC message. At 06:07:55, the server handles an invocation of the seed capability (POST to the seed cap) which generates the full capability set. The delay sits between those two points. I'll take as a given from the viewer log that the viewer receives *an* EAC message by 06:07:55, but it may not be the original message. Was the delay caused by:
      • An additional gate on sending the first EAC message
      • Sim-to-sim loss of the EAC message requiring regeneration
      • Sim-to-viewer loss of the EAC message on the EventQueue or elsewhere requiring regeneration
      • A delay in receiving or processing a response somewhere
      AgentUpdate messages to the main region are required to drive updates into child regions. The resulting Interest List activity there drives EAC message generation (and re-generation). The only obvious artificial delay is the mentioned 5-second retry throttle. The code is a tangle but there is no obvious 1-minute throttle involved. The above fragment is from Ahern, which is a child region that *doesn't* have the delay. The capability setup there ran through at the desired rate. So there is a difference between children: one appears to operate normally, two appear delayed. We don't have enough logging enabled by default for me to tell from this run what the source of the delay is.
  13. I only have default logging active so can only compare some phases.
      • Ahern: The last line indicates where you've invoked the seed capability and things proceed as expected.
      • Dore: The seed caps are generated at the same time. Interest list activity should indicate it is enabled (though I didn't check) so I think the RegionHandshakeReply is complete. EAC should be clear to send. But Dore's seed capability invocation by the viewer is delayed.
      I'd normally blame the viewer for the delay. Lost EACs will be resent at a maximum rate of once per 5s but it must be driven by IL activity forwarded from main (Morris). That is where I'd look. (BTW, it may be beneficial to get a key logging setup going with your TLS support. Viewer will get this in the future but being able to wireshark the https: stream and decode it now and in the future may be very useful.)
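The once-per-5s resend mentioned above is essentially a rate-limited retry. For readers unfamiliar with the pattern, here's a generic illustration; it is not the simulator's code, and only the 5-second figure comes from the post.

```python
import time

# Generic rate-limited resend, illustrating an "at most once per 5s" policy.
class ResendThrottle:
    def __init__(self, min_interval_s: float = 5.0):
        self.min_interval_s = min_interval_s
        self.last_sent = float("-inf")  # monotonic time of the last send

    def try_send(self, send_fn) -> bool:
        """Call send_fn only if the minimum interval has elapsed."""
        now = time.monotonic()
        if now - self.last_sent < self.min_interval_s:
            return False  # throttled; a later trigger (e.g. IL activity) retries
        self.last_sent = now
        send_fn()
        return True
```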
  14. These are going to require some digging. Unexpected simulator restarts (crashes, hangs, host disappears) are mostly in the realm of undefined behavior. The UDP circuit will timeout eventually and that's pretty authoritative - session is done at that point. Clean restarts are mostly going to involve kicks.
  15. On the original post... I just spent too long reading more depression-inducing code and I still can't give you the absolutely correct answer. Too many interacting little pieces, exceptions, diversions. But I think I can get you close to reliable (a rough sketch of the sequence follows this post):
      • UseCircuitCode must be sent to the neighbor before anything can work.
      • This should permit a RegionHandshake from the neighbor. You may receive multiple RH packets with different details. For reasons.
      • At least one RegionHandshakeReply must go to the neighbor. Can multiple replies be sent safely? Unknown. This enables interest list activity and other things.
      • Back on the main region, interest list activity must then occur. IL actions include child camera updates to the neighbors. These, along with the previous two gates, drive the Seed cap generation process that is part of HTTP setup. The Seed cap generation is an async process involving things outside of the simulator. Rather than being driven locally in the neighbor, it is driven partially by camera updates from the main region. No comment.
      • When enough of this is done, the neighbor can (mostly) complete its part of the region crossing in two parallel actions:
          • Respond to the main region's crossing request (HTTP). This includes the Seed capability URL which will eventually be sent as a CrossedRegion message via HTTP to the viewer via main's EventQueue, and
          • Enqueue an EstablishAgentCommunication message to be sent to the event forwarder on the main region to be forwarded up to the viewer via main's EventQueue.
      Note that there seems to be a race between the CrossedRegion and EstablishAgentCommunication messages. I could see these arriving in either order. But if you see one message, the other should be available as well. I don't know if this is enough for you to make progress. If not, send me some details of a failed test session (regions involved, time, agent name, etc.) to the obvious address (monty @) and I'll dig in. Ideally from an Aditi or Thursday/Friday/Saturday session, avoiding bounces.
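To keep that sequence straight, here's a compact summary in Python form. It's a reading aid only: the step names mirror the prose above, the real flow is asynchronous and event-driven, and none of this is viewer or simulator code.

```python
# Linearized view of the neighbor-region setup described above (reading aid).
NEIGHBOR_SETUP_STEPS = [
    "viewer  -> neighbor : UseCircuitCode (nothing works before this)",
    "neighbor-> viewer   : RegionHandshake (possibly several, differing in detail)",
    "viewer  -> neighbor : RegionHandshakeReply (enables interest list activity)",
    "viewer  -> main     : AgentUpdate / IL activity, incl. child camera updates",
    "main    -> neighbor : child camera updates help drive async Seed cap generation",
    "neighbor-> main     : crossing response (HTTP) carrying the Seed cap URL",
    "main    -> viewer   : CrossedRegion via main's EventQueue",
    "neighbor-> main -> viewer : EstablishAgentCommunication via main's EventQueue",
]

if __name__ == "__main__":
    for i, step in enumerate(NEIGHBOR_SETUP_STEPS, 1):
        print(f"{i}. {step}")
    print("Note: the last two messages can arrive in either order.")
```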
  16. It's going to be a bit before I can dig through it all and find out what was actually implemented. But there's likely a "good enough" answer in libremetaverse. It TPs successfully.
  17. Okay, I made a state machine diagram for the simulator end of things based on my current work. Looking for comments, bugs, suggestions. (Also interested in better tools for doing this.) But, briefly, it's an event-driven design with:
      • Four driving events
      • Six states
      • Thirty-four transitions
      • A two-way reset election

      Data Model
      • Event Sequence. The simulator attempts to deliver a stream of events to each attached viewer. As an event is presented, it gets a (virtual) sequential number in the range [1..S32_MAX]. This number isn't attached to the event and it isn't part of the API contract, but it does appear in metadata for consistency checking and logging.
      • Pending Queue. Events are first queued to a pending queue where they are allowed to gather. There is a maximum count of events allowed for each viewer. Once reached, events are dropped without retry. This dropping is counted in the metadata.
      • Pending Metadata. First and last sequence numbers of events currently in the queue, as well as a count of events dropped due to quota or other reasons.
      • Staged Queue. When the viewer has indicated it is ready for a new batch of events, the pending queue and pending metadata are copied to the staged queue and staged metadata. This collection of events is given an incremented 'id'/'ack' value based on the 'Sent ID' data. These events are frozen, as is the binding of the ID value and the metadata. They won't change and the events won't appear under any other ID value.
      • Staged Metadata. Snapshot of Pending Metadata when events were copied.
      • Client Advanced. Internal state which records that the viewer has made forward progress and acknowledged at least one event delivery. Used to gate conditional reset operations.
      • Sent ID. Sequential counter of bundles of events sent to the viewer. Range of [1..S32_MAX].
      • Received Ack. Last positive acknowledgement received from the viewer. Range of [0..S32_MAX].

      States
      The states are derived from the combination of four items in the Data Model, reduced to three booleans:
      • Staged Queue empty/full (indicated by '0'/'T' above).
      • Sent ID == Received Ack. Indicates whether the viewer is fully caught up or the simulator is waiting for an updated ack from the viewer.
      • Pending Queue empty/full ('0'/'T').
      Three booleans give eight possible states but two are unreachable/meaningless in this design. Of the valid six states:
      • 0. and 1. represent reset states. The simulator quickly departs from these, mostly staying in 2-5.
      • 2. and 3. represent waiting for the viewer. Sent ID has advanced and we're waiting for the viewer to acknowledge receipt and processing.
      • 4. and 5. represent waiting for new events to deliver. The viewer has caught up and we need to get new events moving.

      Events
      Raw inputs for the system come from two sources: requests to queue and deliver events (by the simulator) and EventQueueGet queries (by the viewers). The former map directly to the 'Send' event in the diagram. The other events are more complicated:
      • Get_Reset. If viewer-requested reset is enabled in the simulator, requests with 'ack = 0' (or simply missing acks) are treated as conditional requests to reset before fetching new events. One special characteristic of this event is that after making its state transition, it injects a follow-up Get_Nack event which is processed immediately (the reason for all the transitions pointing into the next event column). If the reset is not enabled, this becomes a Get_Nack event instead.
      • Get_Ack. If the viewer sends an 'ack' value matching the Sent ID data, this event is generated. This represents a positive ack and progress can be made.
      • Get_Nack. All other combinations are considered Get_Nack events, with the current Staged Queue events not being acknowledged and likely being resent.

      Operations
      Each event-driven transition leads to a box containing a sequence of one or more micro operations which then drive the machine on to a new state (or one of two possible states). Operations with the 'c_' prefix are special: their execution produces a boolean result (0/1/false/true) and that result determines the following state. The operations (a rough sketch follows this post):
      • Idle. Do nothing.
      • Push. Add event to the Pending Queue, if possible.
      • Move. Clear the Staged Queue, increment Sent ID, move Pending Queue and Metadata to Staged Queue and Metadata.
      • Send. Send the Staged Queue as response data to the active request.
      • Ack. Capture the request's 'ack' value as the Received Ack data.
      • Advance. Set Client Advanced to true; the viewer has made forward progress.
      • c_Send. Conditional send. If there's a usable outstanding request idling, send the Staged Queue data out and status is '1'. Otherwise, status is '0'.
      • c_Reset. Conditional reset. If Client Advanced is not true, status is '0'. Otherwise, reset state around the viewer: release the Staged Queue without sending, set Sent ID to 0, set Received Ack to 0, set the Client Advanced flag to false, and set status to '1'.
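For readers who prefer code to diagrams, here's a minimal Python sketch of the data model and two of the operations above. The field and operation names follow the post, but S32_MAX wraparound, the metadata contents, and the full 34-transition table are omitted; treat it as a reading aid, not the simulator's implementation.

```python
from dataclasses import dataclass, field

S32_MAX = 2**31 - 1  # upper bound for the sequence/id ranges mentioned above


@dataclass
class EventQueueEndpoint:
    pending: list = field(default_factory=list)       # Pending Queue
    pending_meta: dict = field(default_factory=dict)  # Pending Metadata
    staged: list = field(default_factory=list)        # Staged Queue (frozen batch)
    staged_meta: dict = field(default_factory=dict)   # Staged Metadata
    client_advanced: bool = False                     # viewer made forward progress
    sent_id: int = 0                                  # Sent ID
    received_ack: int = 0                             # Received Ack

    def derive_state(self) -> tuple:
        """The three booleans that (minus two unreachable combinations) give the six states."""
        return (bool(self.staged),                    # Staged Queue empty/full
                self.sent_id == self.received_ack,    # viewer fully caught up?
                bool(self.pending))                   # Pending Queue empty/full

    def op_move(self) -> None:
        """'Move': clear Staged, bump Sent ID, move Pending (and metadata) to Staged."""
        self.sent_id += 1
        self.staged, self.pending = self.pending, []
        self.staged_meta, self.pending_meta = self.pending_meta, {}

    def op_c_reset(self) -> bool:
        """'c_Reset': conditional reset; only acts if the viewer ever advanced."""
        if not self.client_advanced:
            return False
        self.staged, self.staged_meta = [], {}   # release staged events without sending
        self.sent_id = 0
        self.received_ack = 0
        self.client_advanced = False
        return True
```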
  18. Aditi always has something going on these days. GLTF is the big one but others as well. Try a different region on a different simhost. And the Linden viewer. FS local mesh problems kind of points to local machine so check the viewer logs. If all of that fails, Jira!
  19. What @Henri Beauchamp said: report those bugs. The service endpoint that provides the offline IMs to the viewer on login was designed to be destructive on read. So no viewer can give you the option not to delete. (I don't like that design - part of our tradition of assuming all http requests succeed if the service can queue a response.)
  20. If you mean unix/linux/posix pipes, those are necessarily local to a host. For socket-based networking, we use a bit of everything.
  21. [ Okay, sorry for the delay. I've been working through the cases and reformulating the problem and I'm starting to like what I have. Thanks to everyone who's been whiteboarding with me, especially @Henri Beauchamp and @animats. It's been extremely valuable and enjoyable. Now, on to the next refinement with responses to earlier comments... ]

      Reframing the Problem
      Two race conditions exist in the system:
      1. Inner race. Involves the lifecycle of a single request. The race starts as soon as the viewer commits to timing out an EventQueueGet operation. Even if the entire response, less one byte, is in viewer buffers, that response will be lost, as will any response written later, until the race closes. The race closes when:
          • The simulator times out the request and will not send any event data until a new request comes in. This produces a closed socket between simulator and Apache stack.
          • The simulator starts processing a new request from the viewer, which causes early completion of the outstanding request. This is sent as a 200/LLSD::undef response.
      (Note: the race is also started by other entities. There are actions within the Apache stack and simply on the network generally that can end that request early without the simulator being aware.) This race is due to application protocol design and the use of unreliable HTTP modes (timeouts, broken sockets).
      2. Outer race. Involves the lifecycle of the HTTP endpoint classes (LLEventPoll and LLAgentCommunication) which retain memory of prior progress. Instances of these classes can disappear and be recreated without past memory and without informing the peer. This causes a jump in ack/id number and the possibility of duplicate events being sent. This race is due to implementation decisions, i.e. endpoint classes that forget and that don't have a synchronization protocol between them.

      Before and After
      Under "At Most Once" event delivery, 1. is a source of lost events and 2. doesn't matter. Events are being forgotten on every send so the outer race adds no additional failures (loss or duplication). Under "At Least Once" event delivery, 1. still exists but retaining events on the simulator side until a positive acknowledgement of receipt is received corrects the lost response. 2. then becomes a path for duplicated event processing in the viewer. A simulator will have sent events for an 'ack = N/id = N+1' exchange when the viewer moves out of the region and eventually releases its endpoint. The viewer may or may not have received that response. When the viewer comes back, it will do so with a new LLEventPoll and 'ack = 0' as its memory, and the simulator is faced with a dilemma: should the viewer be sent a duplicate 'id = N + 1' response, or should it assume that that payload has been processed and wait until the 'id = N + 2' payload is ready?

      Simulator sending a 200 response before timeout. This doesn't really close the race. The race is started outside the simulator, usually by the viewer, but Apache or devices on the network path can also start it. The simulator can only close the race by electing not to send event data until a new request comes in. Trying to end the race before it starts isn't really robust: libcurl timers can be buggy, someone changes a value and doesn't know why, Apache does its thing, sea cable ingress decides to shut a circuit down, network variability and packet loss, and everyone in the southern hemisphere. And it locks the various timeout values into the API contract. Solution: embrace the race, make it harmless via the application protocol.
      Updated Phased Proposal

      Phase 1
      As before, with one change to try to address the outer race scenarios. The simulator will retain the last 'ack' received from the viewer. If it receives a request with 'ack = 0' and 'last_ack != 0', this will be a signal that the viewer has lost synchronization with the simulator for whatever reason. The simulator will drop any already-sent events, advance any unsent events, and increment its 'id' value. 'last_ack' now becomes '0' and normal processing (send current or wait until events arise or timeout) continues. This potentially drops unprocessed events but that isn't any worse than the current situation. With this, the inner race is corrected by the application protocol. There's no sensitivity to timeouts anywhere in the chain: viewer, simulator, Apache, internet. Performance and delivery latency may be affected but correctness won't be. (A rough sketch of this rule follows this post.)

      Phase 2
      This becomes viewer-only. Once the race conditions are embraced, the fix for the outer race is for the viewer to keep memory of the region conversation forever:
      • A session-scoped map of 'region -> last_ack' values is maintained by LLEventPoll static data so any conversation can be resumed at the correct point. If the simulator resets, all events are wiped anyway so duplicated delivery isn't possible; the viewer just takes up the new 'id' sequence. This should have been done from the first release.
      • Additionally, fix the error handling to allow the full set of 499/5xx codes (as well as curl-level disconnects and timeouts) as just normal behavior. Maybe issue requests slowly while networks recover or other parts of the viewer decide to surrender and disconnect from the grid.
      • Logging and defensive coding to avoid processing of duplicated payloads (which should never happen). There's one special case that won't be handled correctly: when the viewer gets to 'ack = 1', moves away from the region, then returns, resetting the simulator with 'id = 0'. In this case the first response upon return will be 'ack = 1', which *should* be processed. May just let this go.
      • Also, diagnostic logging for the new metadata fields being returned.

      Forum Thread Points and Miscellany Addressed
      • 3 million repeated acks. Just want to correct this. The 3e6 is for overflows: when the event queue in the simulator reaches its high-water mark, additional events are just dropped. Any request from the viewer that makes it in is going to get the full queue in one response.
      • Starting and closing the race in the viewer. First, a warning... I'm not actually proposing this as a solution but it is something someone can experiment with. Closing the race as outlined above only happens in the simulator. BUT the viewer can initiate this by launching a second connection into the cap, say 20s after the first. The simulator will cancel an unfinished, un-timed-out request with 200/LLSD::undef and then wait on the new request. The viewer can then launch another request after 20s. Managing this oscillator in the viewer would be ugly and still not deal with some cases.
      • Fuzzing technique. You can try some service abuse by capturing the Cap URL and bringing it out to a curl command. Launch POST requests with well-formed LLSD bodies (a map with an 'ack = 0' element) while a viewer is connected. This will retire an outstanding request in the simulator. That, in turn, comes back to the viewer, which will launch another request which retires the curl command's request.
      • TLS 1.3 and EDH. There is a Jira to enable keylogging in libcurl/openssl to aid in wire debugging of https: traffic. This will require some other work first to unpin the viewer from the current libcurl. But it is on the roadmap.

      If we do open up this protocol to change in the future (likely under a capability change), I would likely add a restart message to the event suite in both directions. It declares to the peer that this endpoint has re-initialized, has no memory of what has transpired, and that the peer should reset itself as appropriate. Likely with some sort of generation number on each side.
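As a companion to Phase 1 above, here's a hedged sketch of the "ack = 0 after progress" rule in Python. The class and field names are mine, the staging is heavily simplified, and the non-reset branch only gestures at the real machinery from the state machine post; it is not simulator code.

```python
# Phase 1 sketch: keep the last 'ack' seen; 'ack = 0' after real progress means
# the viewer lost sync, so drop the sent batch, advance unsent events, bump 'id'.
class SimEventEndpoint:
    def __init__(self):
        self.last_ack = 0
        self.current_id = 0
        self.sent_batch = []      # events staged/sent under current_id
        self.unsent_events = []   # events queued but not yet staged

    def queue_event(self, event) -> None:
        self.unsent_events.append(event)

    def on_request(self, ack: int) -> dict:
        if ack == 0 and self.last_ack != 0:
            # Viewer lost synchronization: drop already-sent events, advance
            # unsent events, increment 'id', and treat last_ack as 0 again.
            self.sent_batch, self.unsent_events = self.unsent_events, []
            self.current_id += 1
            self.last_ack = 0
        else:
            self.last_ack = ack
            if ack == self.current_id and self.unsent_events:
                # Positive ack with new events waiting: stage the next batch.
                self.sent_batch, self.unsent_events = self.unsent_events, []
                self.current_id += 1
            # Otherwise the current batch is simply (re)sent below.
        # "Normal processing": send current batch (real code may instead wait
        # for events to arise or time out the request).
        return {"id": self.current_id, "events": list(self.sent_batch)}
```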
  22. There's a lot going on on our side both before *and* after the TeleportFinish. Before the message there's a lot of communication between simhosts and other services. It is very much a non-local operation (region crossings have some optimizations in this area). After the message, some deferred work continues for an indefinite number of frames. Yeah, there are more races in there - I started seeing their fingerprint when looking at London City issues. This is a bit of a warning... the teleport problems aren't going to be a single-fix matter. The problem in this thread (message loss) is very significant. But there's more lurking ahead.
  23. There are other players in here. I've deliberately avoided mention of Apache details. However, it will contribute its own timeout-like behaviors as well as classic Linden behaviors (returning 499, for example). The modules in here aren't candidates for rework for a number of reasons, so general principles must apply: expect broken sockets (non-http failures), 499s, and flavors of 5xx, and don't treat these as errors. That the LL viewer does is really a bug. The viewer-side timeouts enable hang and lost-peer detection on a faster timescale than, say, TCP keepalives would (not that that's the right tool here). Slowed retry cycles might be better, then allowing UDP circuit timeout to drive the decision to declare the peer gone. But that didn't happen. Whatever happens, I really want it to work regardless of how many entities think they should implement timeouts. Timeouts are magic numbers and the data transport needs to be robust regardless of the liveness checking. (I do want to document this better. This is something present on all capabilities. For most caps it won't matter but that depends on the case.)

      The situation isn't entirely grim with the existing viewer code but there is still a duplication opportunity. First, the optimistic piece. The 'ack'/'id' value is supposed to have meaning but the current LL viewer and server do absolutely nothing with it except attempt a safe passthru from server to viewer and back to server. With the passthru and the enforcement of, at most, a single live request to the region, there is good agreement between viewer and server on what the viewer has processed, as long as the 'ack' value held by the viewer isn't damaged in some way (reset to 0, made an 'undef', etc.). There is one window: during successful completion of a request with 'id = N + 1' and before a new request can be issued, the server's belief is that the viewer is at 'ack = N'. If the viewer walks away from the simulator (TP, RC) without issuing a request with 'ack = N + 1', viewer and server lose sync.

      So, the four cases (a rough sketch follows this post):
      1. Viewer (LLEventPoll) at 'ack = 0/undef' and server (LLAgentCommunication) at 'id = 0'. Reset/startup/first-visit condition; everything is correct. No events have been sent or processed.
      2. Viewer at 'ack = N' and server at 'id = N or N + 1'. Normal operation. If the viewer issues a request, the server knows the viewer has processed the 'id = N' case and does not resend. Any previous send of the 'id = N + 1' case has been dropped on the wire due to the single request constraint on the service.
      3. Viewer at 'ack = N' and server at 'id = 0'. Half-reset. The viewer walked away from the region then later returned, and the server re-constructed its endpoint but the viewer didn't. Safe from duplication as the event queue in the server has been destroyed and cannot be resent. The viewer does have to resync with the server by taking on the 'id = 1' payload that will eventually come to it. (Could even introduce a special event for endpoint restart to signal the viewer should resync but I don't think this is needed.)
      4. Viewer at 'ack = 0' and server at 'id = N'. Half-reset, opposite case. The viewer walked away from the region then re-constructed its endpoint (LLEventPoll) but the server didn't. This is the bad case. The viewer has no knowledge of what it has processed and the server has a set of events, 'id = N', which the viewer may or may not have processed in that aforementioned window.

      I believe that last case is the only real exposure to duplicated events we have, given a) the various request constraints and b) no misbehavior by viewer or server during this protocol. So what are the options? Here are some:
      • Case 4. doesn't happen in practice. There's some hidden set of constraints that doesn't allow this to happen. The system transitions to cases 1. or 3. and 4. never arises. Hooray!
      • There is the 'done' mechanism in the caps request body to terminate the server's endpoint. It is intended that the viewer use this in a reliable way when it knows it's going to give up its endpoint. I.e. it forces a transition to case 1. if 4. is a possibility. I think this control is buggy and has races so there are concerns. Requires viewer change.
      • The viewer keeps 'last ack' information for each region forever. When the viewer walks away from a region and comes back, it reconstructs its endpoint with this 'last ack' info. So instead of coming back into case 4., it enters into 2. or 3. and we're fine again. Requires viewer change.
      • Use an 'ack = 0' request as an indicator that the viewer has reset its knowledge of the region and assume that any current payload ('id = N') has either been processed or can be dropped as uninteresting because the viewer managed to return here by some means. The next response from the server will be for 'id = N + 1'. No viewer change but viewers might be startled. TPV testing important.
      • Magical system I don't know about.
      I would really like for that first bullet point to be true but I need to do some digging.
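To make those four cases easy to eyeball, here's a small hedged sketch; the function and wording are mine, but the case descriptions come straight from the list above.

```python
# Classify the viewer/server combinations from the post above (reading aid).
def classify(viewer_ack: int, server_id: int) -> str:
    if viewer_ack == 0 and server_id == 0:
        return "case 1: fresh start, nothing sent or processed"
    if viewer_ack > 0 and server_id in (viewer_ack, viewer_ack + 1):
        return "case 2: normal operation, no duplicate risk"
    if viewer_ack > 0 and server_id == 0:
        return "case 3: server reset; viewer resyncs on the eventual 'id = 1' payload"
    if viewer_ack == 0 and server_id > 0:
        return "case 4: viewer reset; 'id = N' may or may not have been processed (duplicate risk)"
    return "outside the enumerated cases"

if __name__ == "__main__":
    for v, s in [(0, 0), (7, 8), (7, 0), (0, 7)]:
        print((v, s), "->", classify(v, s))
```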