Everything posted by Monty Linden

  1. Possibly. The message is somewhat misleading, as are many of our messages. There isn't a "done" message; the viewer just continues on into unrelated areas and possibly fails in something having nothing to do with capabilities. Someone needs to dive into the viewer log file to find out what's really going on. 12046 is the other capability port (http:). It took 32 hops from Boston:

        ...
        18  be-3212-pe12.910fifteenth.co.ibone.comcast.net (96.110.33.134)  71.089 ms  be-3211-pe11.910fifteenth.co.ibone.comcast.net (96.110.33.118)  63.584 ms  be-3412-pe12.910fifteenth.co.ibone.comcast.net (96.110.33.142)  63.681 ms
        19  * * *
        20  * * *
        21  * * *
        22  * * *
        23  * * *
        24  * * *
        25  * * *
        26  * * *
        27  * * *
        28  * 108.166.240.9 (108.166.240.9)  99.394 ms  108.166.232.47 (108.166.232.47)  99.456 ms
        29  * * 108.166.240.16 (108.166.240.16)  91.945 ms
        30  108.166.232.32 (108.166.232.32)  97.143 ms  108.166.232.34 (108.166.232.34)  91.746 ms  *
        31  * * *
        32  ec2-54-184-44-5.us-west-2.compute.amazonaws.com (54.184.44.5) <syn,ack>  93.140 ms  99.358 ms  *
        monty@Monty-DellXPS:~$

     Might take more than 50 hops in .eu and beyond.
  2. Some more hints:
       • A better tracert. If you have WSL2, you can install 'traceroute' and get a better trace on Windows. 'sudo tcptraceroute -m 50 <simhost hostname or IP> 12046' can trace all the way to a simhost. (So what if WSL2 is just a 2GB traceroute installation?)
       • Test with a VPN. If you can access SL with a VPN but not without, that points to the ISP as the source of the fault. It should get more of their attention.
       • Read the viewer log file. There are more clues in there. (Do NOT post it here - personal information is in there.)
       • File a support ticket.
  3. If you mean the 'login.agni...' link, then you are through one potential blocker. Given that this happens with only one account, the answer is the same as the other recent failure: likely an inventory issue. Start with a support ticket.
  4. This and the rest point in the direction of an inventory issue. Keep working with support and supply the requested information. (E.g. The SecondLife.log from a bad session.)
  5. One thing to keep in mind: content like meshes and textures doesn't come directly from Linden. These are supplied by a CDN with PoPs (Points-of-Presence) around the world. Not all of these perform as well as they should. And we have seen cases where an ISP attempts to hijack the CDN using DNS. They then point the CDN's DNS names at their own 'optimized' caching system, which then turns out to be both buggy and slow. This hijacking problem can be avoided by using a more trusted DNS server (8.8.8.8, 1.1.1.1, etc.).
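A quick way to check for this kind of DNS interference is to compare what your default resolver returns for a CDN hostname against a trusted public resolver. Below is a minimal sketch in Python, assuming the third-party dnspython package (2.x) and a hypothetical hostname standing in for the real CDN name from your viewer log:

    # dns_compare.py - compare the system/ISP resolver against a trusted public resolver.
    # Requires: pip install dnspython  (2.x API)
    import socket
    import dns.resolver

    # Hypothetical placeholder; substitute the CDN hostname from your viewer log.
    HOSTNAME = "cdn.example.com"

    def default_resolver_addrs(name):
        """Addresses from the system/ISP resolver."""
        return sorted({info[4][0] for info in socket.getaddrinfo(name, 443, proto=socket.IPPROTO_TCP)})

    def trusted_resolver_addrs(name, server="8.8.8.8"):
        """Addresses from a trusted public resolver, bypassing the ISP's DNS."""
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [server]
        return sorted(rr.to_text() for rr in resolver.resolve(name, "A"))

    if __name__ == "__main__":
        print("ISP/default resolver :", default_resolver_addrs(HOSTNAME))
        print("Trusted resolver     :", trusted_resolver_addrs(HOSTNAME))
        # Wildly different answers (e.g. ISP-owned addresses) can indicate hijacking,
        # though CDNs legitimately return different PoPs for different resolvers.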
  6. Correct. This is sim->viewer only and none of the UDP activity. I hadn't enabled all the viewer logging so details of what is going on there are not always clear. In this test, this is right after login so the first EAC is implicit in the login payload. Most of this test is movement between two regions with frequent movement to the far end of 13000 beyond drawing distance. There was a third region involved but it wasn't local so I don't have packet data from it. I expected to see additional EAC messages for the two test regions but never did. So this message has more conditionals on it than I thought. *sigh*

     This is the first Region Crossing from 12035 to 13000. The destination region actually gets set up at or before packet 1138 (see note). The LLAgentCommunication object gets constructed when the viewer wants a child agent. The EAC heads to the viewer at 1532 which *is* a bit delayed. The region crossing then happens at 1975 using the same seed cap sent in the EAC message. The same seed cap will be used for 13000 throughout - 12035 will flop onto new seed caps more frequently than I expected. Part of this is because the test path takes the viewer far from 12035, allowing it to fall out of the draw distance, whereas the viewer is kept in or near 13000 for the duration of the test.

     In this case, something interesting is suspected. Viewer navigated to the far edge of 13000 and hung out there until 12035 was removed from view. Then approached 12035 and crossed into it at 3698. Viewer appears to make a valid request to 12035 at 3729 but here it gets interesting. At 3943, viewer makes the next request to 12035 with an 'undef' ack value. This only happens if the LLEventPoll object was torn down and recreated in the viewer. The 12035 region does *not* take down its end of the connection so it resends the id=16 payload with event 19 (AgentStatusUpdate). This is an example of that outer race condition being hit. 12035 seems okay and continues talking to the viewer.

     At 4491, viewer crosses back into 13000 and the original seed cap is reused. We're going to cross back into 12035 but more interesting things happen. Packet 4635, I believe, is part of an abandoned request. Note that its end comes after the beginning of 5331. At 1:50:34 or so I think the viewer kicked off a coordinated teardown of 12035: both the viewer's LLEventPoll and the simulator's LLAgentCommunication get taken down then rebuilt, forgetting history. This is seen at pkt 5331 where ack=0 and id=1 (both sides new). As part of that teardown, a new seed cap is generated for 12035 which is supplied in the CrossedRegion message in 4713. This request was on the wire for over 28 seconds, getting very near both the viewer and simulator timeouts, but seems to have made it given 5343's exchange.

     That teardown and rebuild happens again between 6576 and 7217. Then another asymmetric and anomalous teardown happens at 8831. For no obvious reason, viewer has torn down its LLEventPoll, causing an 'ack=undef' request while the simulator retains state. I see two preceding timed-out requests (7550, 8103), followed by a possible success (8447), followed by the anomaly, and I think maybe the HTTP handling in the viewer has a hand in this. For the viewer, a new LLEventPoll coro needs to be created to handle it. I believe it binds the URL only at creation time (may not be correct on this). On the simulator side, it's complicated. There's an aggressive attempt to cache and reuse Cap sets as they're somewhat expensive to set up.
But we've thwarted it here, probably with active revocations. Addr and port remain the same, just new caps. And the old EventQueue is dropped on the floor. It should only happen as a side-effect of a viewer request (TP, RC) but the viewer doesn't have direct control. More API contract details to recover later.
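For orientation, the 'ack'/'id' behavior described above can be pictured as the viewer-side poller simply echoing the last 'id' it received; a freshly constructed poller has nothing to echo, which is where the 'undef' ack comes from. A minimal Python sketch of that bookkeeping follows (not the real LLEventPoll, and ignoring the actual LLSD/HTTP transport):

    # Sketch of the viewer-side ack bookkeeping described above.
    # This models only the id/ack hand-off, not the real LLSD-over-HTTP transport.

    class EventPollSketch:
        """Stands in for LLEventPoll: remembers the last batch id it saw."""

        def __init__(self):
            # A brand-new poller has no history, so its first request
            # carries no ack -- the 'undef' ack seen at packet 3943.
            self.last_id = None

        def next_request_ack(self):
            """Value the next long-poll request would carry as 'ack'."""
            return self.last_id  # None models LLSD 'undef'

        def handle_response(self, response):
            """Record the batch id from a simulator response so the
            next request acknowledges it."""
            self.last_id = response["id"]
            return response["events"]

    if __name__ == "__main__":
        poll = EventPollSketch()
        print("first ack     :", poll.next_request_ack())   # None ('undef')

        poll.handle_response({"id": 16, "events": ["AgentStatusUpdate"]})
        print("next ack      :", poll.next_request_ack())   # 16

        # If the viewer tears the poller down and recreates it, the history is
        # lost and the simulator sees another 'undef' ack while it still holds
        # the unacknowledged id=16 batch -- the outer race condition above.
        poll = EventPollSketch()
        print("after recreate:", poll.next_request_ack())   # None again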
  7. I've been doing some manual flow analysis and can share some of the data. These involve two regions running the new state machine-based code. Login is to a region on port 12035 and then a series of region crossings with a region on 13000 occur with various delays. The packet capture is near the simulator so isn't necessarily identical to a capture on the viewer end. But there are already tons of oddities, such as the viewer either deliberately recreating LLEventPoll instances or allowing stale requests to run, and the simulator taking down the viewer's seed cap and LLAgentCommunication endpoint for reasons not captured here.
  8. Haven't had time to run down protocol details so that they can be documented. Still a thing I want to see. I've been doing dev test of the new EventQueue logic, among other things. I can confirm that the outer race condition I talked about before (LLEventPoll destroyed in viewer, LLAgentCommunication retained in simulator) does, in fact, occur with unfortunate results. So I have some work ahead of me...
  9. Speed of light sets the absolute minimum real time. Cable round trip to Oregon and back is about 20,000 km. At 300 km/ms that's an absolute minimum of 67 ms, with the practical minimum double that (sea cables, routers, etc.). The 600 ms is bad... jungle-in-the-southern-hemisphere bad. But the region is looking healthy with excellent frame rates in the past 48 hours. So I'd look to the network for the cause.
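The arithmetic behind those numbers, as a small sketch (the 20,000 km round trip and the 2x practical factor are the rough figures from the post, not measured values):

    # Back-of-the-envelope latency floor for a Boston <-> Oregon style round trip.
    ROUND_TRIP_KM = 20_000          # rough cable round-trip distance from the post
    SPEED_OF_LIGHT_KM_PER_MS = 300  # vacuum speed of light, ~300 km per millisecond
    PRACTICAL_FACTOR = 2            # sea cable routing, routers, fiber speed, etc.

    absolute_floor_ms = ROUND_TRIP_KM / SPEED_OF_LIGHT_KM_PER_MS
    practical_floor_ms = absolute_floor_ms * PRACTICAL_FACTOR

    print(f"absolute minimum : {absolute_floor_ms:.0f} ms")   # ~67 ms
    print(f"practical minimum: {practical_floor_ms:.0f} ms")  # ~133 ms
    # A measured 600 ms round trip is far above either floor, which is why
    # the post points at the network rather than the region.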
  10. Not too surprised. But for the record, I believe the rule is to send it only to the main region every time you arrive after TP or RC. Main then portions out a chunk of the bandwidth to the children; the children's share cannot be set directly by the viewer (it will just be ignored).
  11. BTW, looking to get something up on Aditi for testing. Be on the lookout for references to SRV-607 or DRTSIM-577.
  12. This is actually how the simulator is made aware of the viewer's bandwidth setting. It is a terrible mechanism and the weights used on both sides are nonsense these days. I have outstanding Jiras to update this. I don't think this is related to the EAC message problem but it is still desired. (Defaults and limits keep things running regardless.)
  13. The path in the simulator isn't direct so not surprised it's touchy. I haven't had a chance to identify everything that might get in the way of poking the neighbors. Why on earth anyone would design a stitching protocol this way I can't imagine....
  14. Yep, the sooner the better. If you can, include the 'cef_log.txt' logging file from the usual place. This file isn't rotated or truncated, which is another bug, so please make a copy and edit it, deleting all the unneeded prologue (the lines carry an 'MMDD' date prefix). You may find the answer in the difference between that prologue and current sessions if you want to dig around.
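If trimming the copy by hand is tedious, something like the following works, assuming the 'MMDD' date prefix mentioned above is how each line starts (file names and the date value here are hypothetical; adjust to the session you care about):

    # trim_cef_log.py - keep only lines from a chosen day of a copied cef_log.txt.
    # Assumes each line begins with the 'MMDD' date prefix mentioned above.
    KEEP_PREFIX = "0115"  # hypothetical: keep only Jan 15 entries

    with open("cef_log_copy.txt", "r", errors="replace") as src, \
         open("cef_log_trimmed.txt", "w") as dst:
        for line in src:
            if line.startswith(KEEP_PREFIX):
                dst.write(line)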
  15. A service connected to the login process had to be rolled back yesterday. Login is restored but if anyone has ongoing failures, they should file a support ticket.
  16. Try using the uninstaller from the Start Menu then reinstalling the above package.
  17. I'm not certain about the planning and rationale involved (if any). But that's the technical source of the problem. Speak up if this is truly a problem (Jira, here, the user groups). Win7 is obsolescent at this point (Steam will stop functioning on it in 54 days).
  18. By any chance, are you running Windows 7? https://releasenotes.secondlife.com/viewer/6.6.16.6566955269.html
  19. This is where the delay shows up. For Dore from the server side, at 06:06:51, the seed capability (cap-granting cap) is constructed and should be available in an EAC message. At 06:07:55, the server handles an invocation of the seed capability (POST to the seed cap) which generates the full capability set. The delay sits between those two points. Will take as a given from the viewer log that the viewer receives *an* EAC message by 06:07:55 but it may not be the original message. Was the delay caused by:
       • An additional gate on sending the first EAC message
       • Sim-to-sim loss of the EAC message requiring regeneration
       • Sim-to-viewer loss of the EAC message on the EventQueue or elsewhere requiring regeneration
       • Delay in receiving or processing a response somewhere
     AgentUpdate messages to the main region are required to drive updates into child regions. Resulting Interest List activity there drives EAC message generation (and re-generation). The only obvious artificial delay is the mentioned 5-second retry throttle. The code is a tangle but there is no obvious 1min throttle involved.
     The above fragment is from Ahern, which is a child region that *doesn't* have the delay. The capability setup ran through at the desired rate. So there is a difference between children. One appears to operate normally, two appear delayed. We don't have enough logging enabled by default for me to tell from this run what the source of the delay is.
  20. I only have default logging active so can only compare some phases.

     Ahern: The last line indicates where you've invoked the seed capability and things proceed as expected.

     Dore: The seed caps are generated at the same time. Interest list activity should indicate it is enabled (though I didn't check) so I think the RegionHandshakeReply is complete. EAC should be clear to send. But Dore's seed capability invocation by the viewer is delayed. I'd normally blame the viewer for the delay. Lost EACs will be resent at a maximum rate of once per 5s but resending must be driven by IL activity forwarded from main (Morris). That is where I'd look.

     (BTW, it may be beneficial to get a key logging setup going with your TLS support. The viewer will get this in the future but being able to wireshark the https: stream and decode it now and in the future may be very useful.)
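On the key-logging suggestion: one common approach is to write TLS session secrets to a file in the SSLKEYLOGFILE format, which Wireshark can use to decode the https: stream. A minimal sketch, assuming a Python-based test client (Python 3.8+) and a placeholder URL; the official viewer doesn't expose this yet, as noted above:

    # Capture TLS session keys so Wireshark can decrypt a test client's https: traffic.
    # Python 3.8+ exposes keylog_filename on SSLContext (same format as SSLKEYLOGFILE).
    import ssl
    import urllib.request

    ctx = ssl.create_default_context()
    ctx.keylog_filename = "tls_keys.log"   # point Wireshark's TLS settings at this file

    # Hypothetical endpoint; substitute the capability URL being debugged.
    with urllib.request.urlopen("https://example.com/", context=ctx) as resp:
        resp.read()

    # In Wireshark: Preferences -> Protocols -> TLS -> "(Pre)-Master-Secret log filename".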
  21. These are going to require some digging. Unexpected simulator restarts (crashes, hangs, host disappears) are mostly in the realm of undefined behavior. The UDP circuit will timeout eventually and that's pretty authoritative - session is done at that point. Clean restarts are mostly going to involve kicks.
  22. On the original post... I just spent too long reading more depression-inducing code and I still can't give you the absolutely correct answer. Too many interacting little pieces, exceptions, diversions. But I think I can get you close to reliable:
       • UseCircuitCode must be sent to the neighbor before anything can work.
       • This should permit a RegionHandshake from the neighbor. You may receive multiple RH packets with different details. For reasons.
       • At least one RegionHandshakeReply must go to the neighbor. Can multiple replies be sent safely? Unknown. This enables interest list activity and other things.
       • Back on the main region, interest list activity must then occur. IL actions include child camera updates to the neighbors. These, along with the previous two gates, drive the Seed cap generation process that is part of HTTP setup.
       • The Seed cap generation is an async process involving things outside of the simulator. Rather than being driven locally in the neighbor, it is driven partially by camera updates from the main region. No comment.
       • When enough of this is done, the neighbor can (mostly) complete its part of the region crossing in two parallel actions:
           • Respond to the main region's crossing request (HTTP). This includes the Seed capability URL which will eventually be sent as a CrossedRegion message via HTTP to the viewer via main's EventQueue, and
           • Enqueue an EstablishAgentCommunication message to be sent to the event forwarder on the main region to be forwarded up to the viewer via main's EventQueue.
     Note that there seems to be a race between the CrossedRegion and EstablishAgentCommunication messages. I could see these arriving in either order. But if you see one message, the other should be available as well. (A rough sketch of this gating follows this post.) I don't know if this is enough for you to make progress. If not, send me some details of a failed test session (regions involved, time, agent name, etc.) to the obvious address (monty @) and I'll dig in. Ideally from an Aditi or Thursday/Friday/Saturday session avoiding bounces.
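As a rough sketch of the gating above, here is a tiny model of which preconditions have to be satisfied before the neighbor's seed cap work and the CrossedRegion/EstablishAgentCommunication pair can go out. It only illustrates the ordering described in the post, under invented names; it is not the actual simulator code:

    # Illustration of the neighbor-connection gating described above.
    # Names are invented for the sketch; the real simulator code differs.

    class NeighborCrossingSketch:
        def __init__(self):
            self.use_circuit_code_received = False   # gate 1: UseCircuitCode from viewer
            self.region_handshake_sent = False       # neighbor -> viewer (maybe several)
            self.handshake_reply_received = False    # gate 2: RegionHandshakeReply from viewer
            self.child_camera_update_seen = False    # gate 3: IL-driven updates from main region
            self.seed_cap_url = None

        def ready_for_seed_cap(self):
            """All three gates must be open before seed cap generation makes sense."""
            return (self.use_circuit_code_received
                    and self.handshake_reply_received
                    and self.child_camera_update_seen)

        def generate_seed_cap(self):
            # Stands in for the async, externally-assisted cap setup.
            if self.ready_for_seed_cap() and self.seed_cap_url is None:
                self.seed_cap_url = "https://sim.example.invalid/cap/seed"  # placeholder
            return self.seed_cap_url

        def crossing_messages(self):
            """Once the seed cap exists, two things go out in parallel and may
            reach the viewer in either order via main's EventQueue."""
            if self.seed_cap_url is None:
                return []
            return [("CrossedRegion", self.seed_cap_url),
                    ("EstablishAgentCommunication", self.seed_cap_url)]

    if __name__ == "__main__":
        n = NeighborCrossingSketch()
        n.use_circuit_code_received = True
        n.region_handshake_sent = True
        n.handshake_reply_received = True
        print(n.generate_seed_cap())    # None: still waiting on main-region IL activity
        n.child_camera_update_seen = True
        print(n.generate_seed_cap())    # placeholder URL once all gates are open
        print(n.crossing_messages())    # both messages now available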
  23. It's going to be a bit before I can dig through it all and find out what was actually implemented. But there's likely a "good enough" answer in libremetaverse. It TPs successfully.
  24. Okay, I made a state machine diagram for the simulator end of things based on my current work. Looking for comments, bugs, suggestions. (Also interested in better tools for doing this.) But, briefly, it's an event-driven design with:
       • Four driving events
       • Six states
       • Thirty-four transitions
       • A two-way reset election

     Data Model
       • Event Sequence. The simulator attempts to deliver a stream of events to each attached viewer. As an event is presented, it gets a (virtual) sequential number in the range [1..S32_MAX]. This number isn't attached to the event and it isn't part of the API contract. But it does appear in metadata for consistency checking and logging.
       • Pending Queue. Events are first queued to a pending queue where they are allowed to gather. There is a maximum count of events allowed for each viewer. Once reached, events are dropped without retry. This dropping is counted in the metadata.
       • Pending Metadata. First and last event sequence numbers of events currently in the queue as well as a count of events dropped due to quota or other reasons.
       • Staged Queue. When the viewer has indicated it is ready for a new batch of events, the pending queue and pending metadata are copied to the staged queue and staged metadata. This collection of events is given an incremented 'id'/'ack' value based on the 'Sent ID' data. These events are frozen, as is the binding of the ID value and the metadata. They won't change and the events won't appear under any other ID value.
       • Staged Metadata. Snapshot of Pending Metadata when events were copied.
       • Client Advanced. Internal state which records that the viewer has made forward progress and acknowledged at least one event delivery. Used to gate conditional reset operations.
       • Sent ID. Sequential counter of bundles of events sent to the viewer. Range of [1..S32_MAX].
       • Received Ack. Last positive acknowledgement received from the viewer. Range of [0..S32_MAX].

     States
     The states are simply derived from the combination of four items in the Data Model:
       • Staged Queue empty/full (indicated by '0'/'T' above).
       • Sent ID == Received Ack. Indicates whether the viewer is fully caught up or the simulator is waiting for an updated ack from the viewer.
       • Pending Queue empty/full ('0'/'T').
     A combination of three booleans gives eight possible states but two are unreachable/meaningless in this design. Of the valid six states:
       • 0. and 1. represent reset states. The simulator quickly departs from these, mostly staying in 2-5.
       • 2. and 3. represent waiting for the viewer. Sent ID has advanced and we're waiting for the viewer to acknowledge receipt and processing.
       • 4. and 5. represent waiting for new events to deliver. Viewer has caught up and we need to get new events moving.

     Events
     Raw inputs for the system come from two sources: requests to queue and deliver events (by the simulator) and EventQueueGet queries (by the viewers). The former map directly to the 'Send' event in the diagram. The other events are more complicated:
       • Get_Reset. If viewer-requested reset is enabled in the simulator, requests with 'ack = 0' (or simply missing acks) are treated as conditional requests to reset before fetching new events. One special characteristic of this event is that after making its state transition, it injects a follow-up Get_Nack event which is processed immediately (the reason for all the transitions pointing into the next event column). If the reset is not enabled, this becomes a Get_Nack event, instead.
       • Get_Ack. If the viewer sends an 'ack' value matching the Sent ID data, this event is generated. This represents a positive ack and progress can be made.
       • Get_Nack. All other combinations are considered Get_Nack events, with the current Staged Queue events not being acknowledged and likely being resent.

     Operations
     Each event-driven transition leads to a box containing a sequence of one or more micro operations which then drive the machine on to a new state (or one of two possible states). Operations with the 'c_' prefix are special. Their execution produces a boolean result (0/1/false/true). That result determines the following state. The operations:
       • Idle. Do nothing.
       • Push. Add event to Pending Queue, if possible.
       • Move. Clear Staged Queue, increment Sent ID, move Pending Queue and Metadata to Staged Queue and Metadata.
       • Send. Send Staged Queue as response data to the active request.
       • Ack. Capture the request's 'ack' value as Received Ack data.
       • Advance. Set Client Advanced to true; the viewer has made forward progress.
       • c_Send. Conditional send. If there's a usable outstanding request idling, send Staged Queue data out and status is '1'. Otherwise, status is '0'.
       • c_Reset. Conditional reset. If Client Advanced is not true, status is '0'. Otherwise, reset state around the viewer: release Staged Queue without sending, set Sent ID to 0, set Received Ack to 0, set the Client Advanced flag to false, and set status to '1'.
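For readers who find code easier to follow than a transition table, here is a compact sketch of the data model and the main operations, in Python. It mirrors the description above but invents its own names, omits the metadata bookkeeping and the full 34-transition table, and assumes viewer-requested reset is enabled, so treat it as an illustration rather than the simulator's actual implementation:

    # Sketch of the EventQueue data model and operations described above.
    # Illustration only: names are invented, metadata and the full transition
    # table are omitted, and viewer-requested reset is assumed to be enabled.

    S32_MAX = 2**31 - 1
    MAX_PENDING = 100          # per-viewer quota; the real limit is not specified here

    class EventQueueSketch:
        def __init__(self):
            self.pending = []          # Pending Queue
            self.staged = []           # Staged Queue (frozen once moved)
            self.sent_id = 0           # Sent ID, [1..S32_MAX] once events have been sent
            self.received_ack = 0      # Received Ack, [0..S32_MAX]
            self.client_advanced = False
            self.dropped = 0           # stand-in for the dropped-event count in metadata

        # --- operations from the post -------------------------------------
        def push(self, event):
            """Push: add event to the Pending Queue if quota allows; otherwise drop."""
            if len(self.pending) < MAX_PENDING:
                self.pending.append(event)
            else:
                self.dropped += 1

        def move(self):
            """Move: clear Staged Queue, increment Sent ID, move pending events over."""
            self.staged = self.pending
            self.pending = []
            self.sent_id = min(self.sent_id + 1, S32_MAX)

        def ack(self, value):
            """Ack: capture the request's 'ack' value as Received Ack."""
            self.received_ack = value

        def advance(self):
            """Advance: record that the viewer has made forward progress."""
            self.client_advanced = True

        def c_reset(self):
            """c_Reset: conditional reset, gated on Client Advanced."""
            if not self.client_advanced:
                return False
            self.staged = []           # release Staged Queue without sending
            self.sent_id = 0
            self.received_ack = 0
            self.client_advanced = False
            return True

        # --- simplified handling of one EventQueueGet request --------------
        def handle_get(self, ack_value):
            """Return (id, events) for one viewer request. None ack models 'undef'."""
            if ack_value in (None, 0):              # Get_Reset, then falls through as Get_Nack
                self.c_reset()
            elif ack_value == self.sent_id:         # Get_Ack: viewer caught up
                self.ack(ack_value)
                self.advance()
                self.staged = []
            # Get_Nack (any other value): leave the staged batch to be resent as-is.
            if not self.staged and self.pending:
                self.move()
            return self.sent_id, list(self.staged)

    if __name__ == "__main__":
        q = EventQueueSketch()
        q.push("EstablishAgentCommunication")
        print(q.handle_get(None))    # (1, [...]) first batch after an 'undef' ack
        print(q.handle_get(1))       # (1, []) positive ack, nothing new pending
        q.push("AgentStatusUpdate")
        print(q.handle_get(1))       # (2, [...]) next batch under a new id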