Jump to content

Obscure question: when does the simulator send EstablishAgentCommunication to the viewer?


animats
 Share

You are about to reply to a thread that has been inactive for 67 days.

Please take a moment to consider if this thread is worth bumping.

Recommended Posts

  • Lindens
On 1/24/2024 at 5:03 AM, animats said:
  • Exactly 30 seconds later, the event poller returns a transport error, "Connection aborted / Unexpected EOF". The viewer side polls again.
  • Exactly 30 seconds after that, the event poller returns a 500 Internal Server Error status. The viewer side polls again.

I'd need accurate time and location information to track down a truthier story.  But here goes...

First bullet.  Simulator had started sending a response to the client when the 30-second simulator timeout fired.  This simply closed the apache<>simulator socket.  Apache had managed to start proxying data back to viewer (in response phase) when this happened so all it could do was close the viewer<>apache socket and give up.  This would have lost event data.

Second bullet.  New connection hits the 30-second timeout in the simulator and the apache<>simulator socket is closed.  A response could have been in flight, but not necessarily.  Apache hadn't yet started to return a response to viewer (in turnaround wait phase) so had some options:  synthesize an HTTP status and response body or just close the client<>apache socket.  It selected the former in this case.  This might have lost event data or might simply be one of the dataless glitch modes around responses.  Response headers and bodies might give insight into what really happened.

As always, either apache or simulator could elect to 5xx some query because of design/implementation choices (*cough*).

Added:  It's also possible that the 500-vs-broken socket election is entirely in the Rust environment.  In-progress response interruption (and data loss) always reports as broken socket while interruption during turnaround wait possibly becomes a synthetic 500.  I don't know the details of the Rust runtime...

Link to comment
Share on other sites

Thanks.

24 minutes ago, Monty Linden said:

I'd need accurate time and location information to track down a truthier story. 

Just ran a test. "animatsalt" in Bonifacio on the beta grid, near Bonifacio/240/5, looking towards Morris. Close to the edge, far from the center of the 4-corner area. Draw distance 120m. So only two regions should come up.

Bonifacio comes up fine, Morris goes live one minute later. Avatar never moves; just stands there. (Moving the avatar within the region doesn't

Login succeeded at 07:57:32 UTC, 11:57:32 SLT. 2024-01-27

Logout started at 07:58:46 UTC.

Detailed log from Sharpview is available. Will send link in IM.

 

Link to comment
Share on other sites

  • Lindens

Okay, grabbed some Bonifacio logs and I'll dig into those later.  One thing I didn't expect to see in the Apache log was a 200 with a 0-length body to event-get.  Not certain I believe that, so...

A recommendation since you're chin-deep in protocol...  look into getting the key log information out of the TLS support in the Rust crate doing the crypto.  You'll be able to decode the https: stream and persist the decode and this may be necessary for more digging.  And it's the only way to do inspection in the modern TLS world.

Quick question:  it looks like you're asking for very few caps in the initial see request.  Far fewer than the SL viewer.  Is this correct?

  • Thanks 1
Link to comment
Share on other sites

36 minutes ago, Monty Linden said:

One thing I didn't expect to see in the Apache log was a 200 with a 0-length body to event-get.

I see SL traffic event poller replies with no traffic in two forms - zero length body with status 200, and HTTP error 500. The spec says a no-traffic reply should be a 502 status, which I never see from SL sim servers. (The Other Simulator does send 502 items)

Something that might be worth a try: send an empty event poll reply every 10 seconds or so. Like this.

<llsd><map><key>events</key><array></array><key>id</key><integer>28</integer></map></llsd>

That's a valid zero-length array of events, something all viewers should be able to process and ignore. That would keep the event poller on the happy path, without HTTPS level timeouts. Apache and Curl timeout and retry logic would no longer be involved. That may be easier than struggling with the error cases in those libraries.

47 minutes ago, Monty Linden said:

A recommendation since you're chin-deep in protocol...  look into getting the key log information out of the TLS support in the Rust crate doing the crypto.  You'll be able to decode the https: stream and persist the decode and this may be necessary for more digging.  And it's the only way to do inspection in the modern TLS world.

Hm. Possible, although difficult. What are you looking for.

At the moment, the event poller should be using non-persistent connections, I think. The connection is dropped after each request. I've tried persistent mode, and that does exactly the same thing.

49 minutes ago, Monty Linden said:

Quick question:  it looks like you're asking for very few caps in the initial seed request.  Far fewer than the SL viewer.  Is this correct?

Yes. Haven't implemented avatar appearance, editing, voice, etc. yet. So those caps are the ones that the Sharpview viewer is using. I could fetch more, of course. Shouldn't affect this issues, since cap fetching comes after receiving the seed cap.

  • Like 1
Link to comment
Share on other sites

  • Lindens
1 hour ago, animats said:

I see SL traffic event poller replies with no traffic in two forms - zero length body with status 200, and HTTP error 500. The spec says a no-traffic reply should be a 502 status, which I never see from SL sim servers. (The Other Simulator does send 502 items)

The wiki is a liar.  :(  Fixing that is going to take some time.

The <200, empty, text/html> responses bother me.  We have a closed socket path, a zero-length event path <200, [], app/llsd+xml>, and some 499 paths.  Not certain where this is coming from but it may be what the server timeout when a response is in-flight looks like.  Ugly.

 

1 hour ago, animats said:

Hm. Possible, although difficult. What are you looking for.

Nothing for myself.  HTTP client libraries tend to 'interpret' edge cases before giving callers a chance to inspect.  So you sometimes need to capture wire protocol for naked truth analysis.  And key logging at the TLS level is required for this now.  I have a Jira to add this to the viewer at some point but that implies unpinning libcurl.  So not today but you can do it in other languages at any time.

 

1 hour ago, animats said:

Yes. Haven't implemented avatar appearance, editing, voice, etc. yet. So those caps are the ones that the Sharpview viewer is using. I could fetch more, of course. Shouldn't affect this issues, since cap fetching comes after receiving the seed cap.

Agreed, it shouldn't but have you seen our wiki?  Simulator has a lot of conditional code and some of that is conditioned on what you would normally think is an unrelated sub-system.  There's a tendency for the simulator to use SL viewer behavior as specification (which is both wrong and backwards).  If you ask for a full set of caps regardless of whether you need them or now, you might see different behavior.

  • Like 1
Link to comment
Share on other sites

39 minutes ago, Monty Linden said:

I have a Jira to add this to the viewer at some point but that implies unpinning libcurl.

We could patch libcurl (it's already a patched version that is statically linked against the viewer binary, anyway) to insert a hook in its code and recover the decyphered data...

Link to comment
Share on other sites

  • Lindens
10 minutes ago, Henri Beauchamp said:

We could patch libcurl (it's already a patched version that is statically linked against the viewer binary, anyway) to insert a hook in its code and recover the decyphered data...

Yep, another possibility.  For development work, I cheat.  I just disable encryption everywhere and work in cleartext.  That doesn't work on Aditi.

Link to comment
Share on other sites

1 hour ago, Monty Linden said:

If you ask for a full set of caps regardless of whether you need them or now, you might see different behavior.

Asked for a full set of caps, and it did change simulator behavior slightly. I started getting GenericStreamingMessage messages with PBR material override info. That makes sense. Those are new, and are only in the message template on the "DRTVWR-600-maint-A" branch. Some viewers can't handle them yet, so the simulator needs to know (guess) whether it should send them.

But the one minute delay remains.

Looking into getting more detail on what's happening down at the HTTPS level.

  • Thanks 1
Link to comment
Share on other sites

Basically would need support for SSLKEYLOGFILE - Everything curl

That should be a fairly minimal patch, if using a recent enough openssl library using https://www.openssl.org/docs/manmaster/man3/SSL_CTX_set_keylog_callback.html

There are even some hacks to enable it via runtime patching, e.g. https://github.com/wpbrown/openssl-keylog

Rust should be able to do it out of the box with:

KeyLogFile in rustls - Rust (docs.rs)

 

Btw. did someone ever create a proper Wireshark Dissector for the SecondLife protocol? 

(https://wiki.wireshark.org/Lua/Dissectors

Edited by Kathrine Jansma
Link to comment
Share on other sites

21 hours ago, Kathrine Jansma said:

Rust should be able to do it out of the box with:

Turns out I can set an environment variable and the Rust crate "ureq" will log the headers it sends. Looks like this:

[2024-01-28T19:47:58Z DEBUG ureq::unit] writing prelude: POST /cap/129b91db-ada6-194a-6852-94007e56de4c HTTP/1.1
    Host: simhost-0a383acea59d4be55.aditi.secondlife.io:12043
    User-Agent: Animats.com experimental viewer. Contact info@animats.com for problems.
    Accept: */*
    Content-Type: application/llsd+xml
    accept-encoding: gzip
    Content-Length: 143

Looks standard, but maybe SL wants something else. Sent the whole log to Monty Linden.

21 hours ago, Kathrine Jansma said:

Btw. did someone ever create a proper Wireshark Dissector for the SecondLife protocol? 

That's how I started out on Sharpview. The earliest versions just captured Ethernet traffic from Firestorm into a .pcap file, then parsed and dumped the packets. That was later hooked up to graphics so one static region could be replayed. That's how I debugged meshes, prims, and sculpts. Spent lots of time looking very closely at the Ivory Tower of Prims. Only after replay was working did Sharpview start logging in and talking to the simulators directly. In replay mode it looked like this. (Sadly, the two places in that video, the Polish Embassy and Ying Industries, are gone now.)

Replay mode was strictly UDP - didn't trace anything that went over HTTP. No capabilities, materials, events, etc. So that tool won't help here.

Edited by animats
  • Thanks 1
Link to comment
Share on other sites

  • 2 weeks later...
On 1/27/2024 at 12:56 PM, Monty Linden said:

Simulator has a lot of conditional code and some of that is conditioned on what you would normally think is an unrelated sub-system.

Yes, there are things connected to other things that should not be.

I just found and fixed a bug in Sharpview that applied only to the Other Simulator. The third region crossing always failed. The first two were fine.

It took about a week to find the cause. An admin of a grid for the Other Simulator set up some test regions for me, and gave me access to the sim-side logs. With my own viewer-side logs plus the sim-side logs, it was possible to diagnose the problem.

The viewer sends some info about how fast it is able to accept data from the simulators. That's what the "bandwidth" slider controls. I used larger values than usual for Sharpview, because, with all the concurrency, it's able to accept updates faster. Networks go much faster today than they did when that code was written, so larger bandwidth numbers are appropriate.

The throttle values sent to the simulator are floating point numbers. In the Other Simulator, they are converted into an integer without an overflow check or clamping. This can result in an integer overflow in 32 bits. The result is a negative number. Once we saw that negative number in a log, the problem was clear. The throttling system sees that negative number, throttles the message traffic down to zero bandwidth, and the viewer seems to freeze, since updates have stopped.

SL sims do not have that problem, by the way. Cranking up the numbers results in faster region loading, if the viewer can handle it. Black Dragon uses larger values than most other viewers.

This was easy to fix once understood. It's already fixed in both Sharpview and the dev version of the Other Simulator. The devs there have been very cooperative. 

I suspect that when we finally find the EstablishAgentCommunication bug, it will turn out to be something irrelevant like that. All the relevant stuff has already been tried.

Edited by animats
More info
  • Like 2
Link to comment
Share on other sites

I think I've figured this out. And it's very funny.

The short version: The sim side is broken, but viewers have a workaround for the breakage, so you don't see this. Mostly.

Details to follow.

  • Like 1
  • Thanks 1
Link to comment
Share on other sites

As I've reported before, when Sharpview is logged in at the Morris/Ahern/Dore/Bonifacio corner on the beta grid, only Morris and Ahern appear. Dore and Bonifacio appear a minute later. This is because the EstablishAgentCommunication event hasn't arrived to tell the viewer the seed capability for the other two regions. After a minute, it arrives.

The same thing happens with Firestorm. But there's a workaround, around line 493 of llviewerassetstorage.cpp, to get past that. If the needed cap isn't known yet, it just uses the one from the region where the agent is. This works for SL because all the regions use the same asset server. This allows Firestorm, and the SL viewers and TPVs based on the same code, to show a region that's not fully live yet.

Not everything will work for the other region, which is probably the cause of some other subtle bugs. The not-fully-live regions are not yet getting events; their event pollers have not yet started. This may interact badly with PBR, but I'm not sure. Anything in the half-live region that needs a capability won't work.

For the Other Simulator, where different regions can use different asset servers, EstablishAgentCommunication comes in earlier, right after the region handshake reply. So the Other Simulator works properly.

Check me on this. I may have misread something.

 

  • Thanks 1
Link to comment
Share on other sites

4 hours ago, animats said:

The same thing happens with Firestorm. But there's a workaround, around line 493 of llviewerassetstorage.cpp, to get past that. If the needed cap isn't known yet, it just uses the one from the region where the agent is.

This is indeed a trick ”we” use a lot, and not only for this capability...

And yes, it is utterly fragile and hacky: at some point, I reused this capability caching for server side baking (to work around another server bug which causes attachments to be unparented (derezzed) as a consequence of the departure region sending such a bogus unparenting message for the agent's own attachments), and while it first worked fine, it finally broke later on (details explained here)...

Edited by Henri Beauchamp
Link to comment
Share on other sites

  • Lindens
9 hours ago, animats said:

Check me on this. I may have misread something.

Oh, it's almost certainly true.  It's what you get with evolution through a series of errors rather than design.  The simulator didn't spec or implement the right thing.  The viewer made an adaptation keeping old caps around.  Simulator then had to follow the viewer keeping old caps alive per viewer spec so implements weak persistence and caching.    That cache doesn't always work as expected with undocumented lifetimes, etc., and the viewer adapts freezing the evolved spec at a new behavior.  Repeat.

The texture and mesh caps are 'local' resources that will supply grid-wide data, even before CDN, so likely showed up here first.  Inventory and other APIs have similar looks-local-acts-global scope.  But some really are local and you need those caps.  Or you end up using another simulator as proxy.

I really do tip my cap to the OS people for trying to chase this moving target with very little help from us...

  • Like 3
  • Thanks 1
Link to comment
Share on other sites

As Henri points out, this bug breaks other things. I think it will intermittently break, at least:

  • Bakes On Mesh
  • PBR material updates. Those don't turn on until the viewer has requested a new capability, so that old viewers don't get UDP messages they don't understand. And some PBR info comes in over the event channel.
  • VIVOX voice
  • Region crossings (maybe). Entering a visible but no-caps region forces it to go live, but that takes a few seconds. Possible race conditions.

This is probably observed by Second Life users as one of those features intermittently breaking.

A good test would be to go to one of the four-region sandboxes with some PBR objects and four avatars. Put one avatar and a PBR object in each region, near the corner. Then log in with another avatar, near the corner, while recording video for 90 seconds or so. See what happens. Try logging in starting at different regions. Post the video. Would someone with enough alts please try this?

So it's time to fix the problem in the simulators. Workarounds are not good enough. Thank you.

  • Like 2
  • Thanks 1
Link to comment
Share on other sites

You are about to reply to a thread that has been inactive for 67 days.

Please take a moment to consider if this thread is worth bumping.

Please sign in to comment

You will be able to leave a comment after signing in



Sign In Now
 Share

×
×
  • Create New...