Posts posted by Henri Beauchamp

  1. 22 minutes ago, animats said:

    I get only server side timeouts, expressed as an HTTP completion with some status.

    This is good for the timeout part, then (and proof that libcurl is the culprit for the silent retries we get in C++ viewers).

     

    22 minutes ago, animats said:

    Remember, there's keep-alive down at the TCP level

    Nope, not for poll requests... IIRC, only a few capabilities were configured with HTTP Keep-Alive (e.g. GetMesh2).

     

    However, even though you get the proper server-side timeouts at your Rust code level (which is indeed a good thing), you are still exposed to the race condition occurring during the tear-down of a timed-out HTTP poll request (as explained by Monty in the first posts of this very thread), unless you use the same kind of trick I implemented... or Monty fixes that race server-side... or we get a new ”reliable events” transmission channel implemented (I still think that reviving the old, for now blacklisted, UDP messages would be the simplest way to do it and would be plenty reliable enough).

     

  2. 32 minutes ago, animats said:

    I'm just doing a long 90 second poll, and the HTTP/HTTPS library (Rust's ”ureq” crate) obeys whatever timeout I give it and does not do retries on its own. So I don't have that problem.

    The timeout happens server-side after 30s without any event. If you do not observe this at your viewer code level with a 90s configured timeout, then you are also a victim of ”silent retries” by your HTTP stack.

    Fire up Wireshark (with a filter such as ”tcp and ip.addr == <sim_ip_here>”), launch the viewer and observe: when nothing happens in the sim (no event message) for 30s after the last poll request is launched, you will see the connection closed (FIN) by the server; at that point, the Rust HTTP stack is likely doing just what libcurl does, ”silently” retrying the request with SL's Apache server...

    Note that you won't observe this in OpenSim; I think this weird behaviour is due to the 499 or 500 errors ”in disguise” (you get a 499/500 reported in the body, but a 502 in the header) we often get from SL's Apache server (you can easily observe those by enabling the ”EventPoll” debug tag in the Cool VL Viewer: errors are then logged with both the header error number and the body)...

  3. 10 hours ago, animats said:

    OK, so there is a workaround. Documentation would be appreciated. Does this ”restarting a poll” refer to restarting it before the Curl library retries, or restarting it before the simulator finishes the HTTP transaction with an error or empty reply? I don't have Curl in the middle, and have full control of timing. Currently I have the poll connection timeout set to 90 seconds, and the server side always times out first.

    The documentation is in the code... 😛

    OK, not so easy to get a grasp on it all, so here is how it works (do hold on to your hat ! 🤣 ) :

    1. I added a timer to measure the event poll age; this timer (one per LLEventPoll instance, i.e. per region) is started as soon as the viewer launches a new request, and is then free-running until a new request is started (at which point it is reset). You can visualize the agent region event poll age via the ”Advanced” -> ”HUD info” -> ”Show poll request age” toggle.
    2. For SL (OpenSim is another story), I reduced the event poll timeout to 25 seconds (configurable via the ”EventPollTimeoutForSL” debug setting), and set HTTP retries to 0 (it used to be left unconfigured, meaning the poll was previously retried ”transparently” by libcurl until it decided to time out by itself). This allows the viewer to time out on poll requests on its side, before the server itself would time out (as it does after 30s). Ideally, we should let the server time out on us and never retry (this is what is done, and works just fine, for OpenSim), but sadly, even when setting HTTP retries to 0, libcurl ”disobeys” us and sometimes retries the request ”transparently” once (probably because it gets a 502 error from SL's Apache server, while it should be a 499 or 500, and does not understand it as a timeout, thus retrying instead), masking the server-side timeout from our viewer-side code. This also involved adding ”support” for HTTP 499/500/502 errors in the code, so that these are not considered actual errors but just timeouts.
    3. In order to avoid sending TP requests (the only kind of event the viewer originates and may therefore decide to send as it sees fit, unlike what happens with sim crossing events, for example) just as the poll request is about to time out (causing the race condition, which prevents the TeleportFinish message from being received), I defined a ”danger window” during which a TP request by the user shall be delayed until the next poll request for the agent region is fully/stably established. This involves a delay (adjustable via the ”EventPollAgeWindowMargin” debug setting, defaulting to 600ms), which is subtracted from the configured timeout (”EventPollTimeoutForSL”) to set the expiry of the free-running event poll timer (note: expiring an LLTimer does not stop it, it just flags it as expired). The same delay is also used after the request has been restarted, as a minimum delay before which we should not send the TP request either (i.e. we account for the time it takes for the sim server to receive the new request, which depends on the ”ping” time and the delay in the Apache server). Since the configured ”EventPollAgeWindowMargin” may be too large for a delay after a poll restart (I have seen events arriving continuously with 200ms intervals or so, e.g. when facing a ban wall), the minimum delay before we can fire a TP request is also adjusted down to be less than the minimum observed poll age for this sim, and I also take into account the current frame rendering time of the viewer (else, should the viewer render slower than events come in, we would not be able to TP at all). Once everything is properly accounted for, this translates into a simple boolean value returned by a new LLEventPoll::isPollInFlight() method (true meaning ready to send requests to the server; false meaning not ready, must delay the request; see the sketch after this list). In the agent poll age display, an asterisk ”*” is appended to the poll age whenever the poll ”is not in flight”, i.e. when we are within the danger window for the race condition.
    4. I added a new TELEPORT_QUEUED state to the TP state machine, as well as code to queue a TP request triggered by the user whenever isPollInFlight() returns false, and to send it just after isPollInFlight() returns true again.
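
    Below is a minimal, self-contained sketch of the ”danger window” test described in point 3 above. It is not the actual Cool VL Viewer code: the class and member names are hypothetical, and std::chrono stands in for the viewer's LLTimer; only the logic (lower bound derived from the margin, the minimum observed poll age and the frame time, upper bound derived from the timeout minus the margin) mirrors the description.

      #include <algorithm>
      #include <chrono>

      // Hypothetical sketch of the "danger window" test; not the viewer's
      // actual LLEventPoll/LLTimer code.
      class EventPollSketch
      {
      public:
          EventPollSketch(double timeout_sec, double margin_sec)
          :   mTimeout(timeout_sec), mMargin(margin_sec),
              mMinObservedAge(margin_sec)
          {
          }

          // Called just before a new poll request is posted to the server.
          void onPollStarted()
          {
              mStart = std::chrono::steady_clock::now();
              mStarted = true;
          }

          // Called when a poll reply arrives; tracks the fastest reply seen.
          void onPollReply()
          {
              mMinObservedAge = std::min(mMinObservedAge, age());
          }

          // True when it is deemed safe to send a TP request without racing
          // the poll request timeout.
          bool isPollInFlight(double frame_time_sec) const
          {
              if (!mStarted)
              {
                  return false;   // No request currently posted to the server.
              }
              // Lower bound: give the server time to receive the freshly
              // (re)started request; account for the frame time so that a
              // slow-rendering viewer can still teleport.
              double min_age = std::min(mMargin, mMinObservedAge) -
                               frame_time_sec;
              // Upper bound: stay clear of the viewer-side timeout by the
              // configured margin.
              double max_age = mTimeout - mMargin;
              double poll_age = age();
              return poll_age > min_age && poll_age < max_age;
          }

      private:
          double age() const
          {
              return std::chrono::duration<double>(
                  std::chrono::steady_clock::now() - mStart).count();
          }

          std::chrono::steady_clock::time_point mStart;
          double mTimeout, mMargin, mMinObservedAge;
          bool mStarted = false;
      };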

    With the above workaround, I could avoid around 50% of the race conditions and improve the TP success rate, but it was not bullet-proof... Then @Monty Linden suggested starting a (second) poll request before the current one expires, in order to ”kick” the server into a resync. This is what I did, this way (a rough code sketch follows the list):

    1. When the TP request needs to be queued because we are within the ”danger window”, the viewer now destroys the LLEventPoll instance for the agent region and recreates one immediately.
    2. When an LLEventPoll instance is deleted, it keeps its underlying ”LLEventPollImpl” instance alive until the coroutine running within this LLEventPollImpl finishes, and it sends an abort message to the llcorehttp stack for that coroutine (which is suspended, since it is waiting for the HTTP reply to the poll request). As implemented, the abort will actually only occur on the next frame, because it goes through the ”mainloop” event pump, which is checked at the start of each new render frame. So, the server will not see the current poll request closed by the viewer until the next viewer render frame and, as far as it is concerned, that request is still ”live”.
    3. Since a new LLEventPoll instance is created as soon as the old one is destroyed, the viewer immediately launches a new coroutine with a new HTTP request to the server: this coroutine immediately establishes a new HTTP connection with the server, then suspends itself and yields back to the viewer's main coroutine. Seen from the server side, this indeed results in a new event poll request arriving while the previous one is still ”live”, and this triggers the resync we need.
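
    To illustrate the idea (and only the idea: the real code goes through LLViewerRegion, LLEventPollImpl and the ”mainloop” pump), here is a hedged, stripped-down sketch; the EventPoll type below is a hypothetical stand-in:

      #include <memory>
      #include <string>

      // Hypothetical stand-in for the viewer's event poll object.
      struct EventPoll
      {
          explicit EventPoll(const std::string& cap_url)
          {
              // Launches a coroutine which immediately POSTs a new poll
              // request to cap_url, then suspends awaiting the reply.
          }
          ~EventPoll()
          {
              // Posts an abort for the suspended coroutine; the abort is only
              // processed on the next render frame, so the server still sees
              // the old request as "live" until then.
          }
      };

      // Force a new poll request while the previous one is still "live" from
      // the server's point of view, triggering the server-side resync.
      void restartEventPoll(std::unique_ptr<EventPoll>& poll,
                            const std::string& cap_url)
      {
          poll.reset();                                // Old request: aborted next frame.
          poll = std::make_unique<EventPoll>(cap_url); // New request: started right away.
      }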

    With this modification done, my workaround is now working beautifully... 😜

     

    • Thanks 1
  4. The diagram is very nice, and while it brings some understanding of how things work, especially on the sim-server side, it does not give any clue about the various timings and potential races encountered server-side (sim server, Apache server, perhaps even the Squid proxy ?)...

    You can have the best designed protocol at the sim server level, but if, in the end, it suffers from races due to communications with other servers and/or weird network routing issues (two successive TCP packets might not take the same route) between the viewer and the servers, you will still see bugs.

    What we need is a race-resilient protocol; this will likely involve redoing the server and viewer code to implement a new ”reliable” event transmission (*), especially for essential messages such as the ones involved in sim crossings, TPs, and sim connections. I like animats' suggestion to split the message queues; we could keep the current event poll queue (for backward compatibility's sake and to transmit non-essential messages such as ParcelProperties & co), and design/implement a new queue for viewers with the necessary support code, where the essential messages would be exchanged with the server (the new viewer code would simply ignore such messages transmitted over the old, unreliable queue).

    (*) One with a proper handshake, and no timeout, meaning a way to send ”keep-alive” messages to ensure the HTTP channel is never closed on timeout. Or perhaps... resuscitating the UDP messages that got blacklisted, because the viewer/server ”reliable UDP” protocol is pretty resilient and indeed reliable !

     

    44 minutes ago, Monty Linden said:

    It TPs successfully.

    Try the latest Cool VL Viewer releases (v1.30.2.32 & v1.31.0.10): they implement your idea of restarting a poll before the current poll times out, and use my ”danger window” and TP request delaying/queuing to ensure the request is only issued after the poll has indeed been restarted anew. It works beautifully (I did not experience a single TP failure in the past week, even when trying to race it and TPing just as the poll times out). The toggle for the TP workaround is in the Advanced -> Network menu. 😉

  5. 38 minutes ago, Chic Aeon said:

    I am not even looking at PBR until it gets to the main grid and in Firestorm

    Well, PBR is already ”live” on the main grid (in a few test regions), and Firestorm already got an alpha viewer with PBR support... So it's time for you to look at it ! 😛

     

    38 minutes ago, Chic Aeon said:

    I thought that I read that there may (will - could) be a way to turn off PBR even after it comes out. 

    I'm afraid not... LL opted to do away entirely with the old renderer (EE ALM and forward modes alike), and there will be no way to ”turn it off”. The only settings you will be able to play with are the ones for the reflections (reflection probes are extremely costly in terms of FPS, and won't allow ”weak” PCs to run PBR decently when turned on).

    Of course, you will be able to use the Cool VL Viewer, which already has (in its experimental branch) a dual renderer (legacy ALM+forward, and PBR, switchable on the fly with just a check box), but it will not stay like this forever (at some point in the future, everyone will have to bite the bullet and go 100% PBR, especially if LL finally implements a Vulkan renderer, which is very desirable on its own)...

    • Thanks 1
  6. Quoting the blog:

    Quote

    Today we are happy to announce a significant update for both the Second Life RC (release candidate) PBR Viewer and GLTF project server. For PBR materials to render correctly in Second Life, residents must use the latest versions listed below. 

    • Second Life RC viewer version (7.0.0.581886) or newer.

    And be on supporting GLTF server regions named ”Rumpus Room 1-4” 

    • Second Life server version (2023-09-28.6340659568) or newer.

    GLTF PBR Materials efforts were slowed recently by a potentially costly issue. PBR Materials contain a great deal of new information which is stored in the objects that use these materials. When such objects are updated, and updates happen surprisingly often, the entire set of information about the object is transmitted.  The amount of data transmitted multiplied by the frequency of transmission adds up quickly. Reducing the amount of data transmitted required a change to the protocol used for sending this information.

    Many thanks to one of our resident beta testers, animats (Joe Magarac), for spotting this issue and reporting it swiftly.

    These changes will allow creators to make richer, more realistic objects in world, so we believe that the efforts involved will pay off, and that all residents will enjoy the results of these changes.

    More GLTF PBR updates are on the way, please stay tuned!

    Well, it would all be nice and dandy (with indeed a tone mapping that is at last ”viewable” on non-HDR monitors), if there were not a ”slight” issue with the new shaders: they ate up the shadows !

    Demonstration (taken on Aditi in Morris, with Midday settings):

    First the current release viewer v6.6.15.581961:

    [Screenshot: Shadows-ee-midday.jpg]

     

    Second, the newest PBR RC viewer v7.0.0.581886, Midday with HDR adjustments:

    [Screenshot: Shadows-pbr-midday.jpg]

     

    And even worse for the shadows (but better for the glass roof transparency), with the same RC viewer and the legacy Midday (no HDR adjustment):

    [Screenshot: Shadows-pbr-midday-legacy.jpg]

     

    Notice all the missing (or almost wiped out) shadows (trees and avatar, in particular), as well as how bad the few remaining shadows now look when compared to the ”standard”...

    I raised this concern as soon as I backported the commit responsible for this fiasco to the Cool VL Viewer (and immediately reverted it), but I was met with chirping crickets... Let's see if crickets do chirp here too, and if residents at all care about shadows.

  7. On 10/5/2023 at 10:42 AM, mobiusonemasterchief Infinity said:

    if my viewer hangs or crashes after login to SL

    It should not crash in the first place: report that crash either via the JIRA for SL's official viewers, or to the developer(s) for TPVs (the support channel varies from one TPV to another, but all TPVs should have one), providing the required info (crash dump/log, viewer log, etc.) and repro steps where possible.

    • Thanks 1
  8. 15 hours ago, Monty Linden said:

    2.  Outer race.  Involves the lifecycle of the HTTP endpoint classes (LLEventPoll and LLAgentCommunication which retain memory of prior progress.  Instances of these classes can disappear and be recreated without past memory without informing the peer.

    Currently, such a race will pretty much never happen viewer-side in the Agent's region...

    The viewer always keeps the LLEventPoll instance it starts for a region (LLViewerRegion instance) upon receipt of the EventQueueGet capability URL, until the said region gets farther than the draw distance, at which point the simulator is disconnected, the LLViewerRegion instance is destroyed, and the LLEventPoll instance for that region with it; as long as the LLEventPoll instance is alive, it keeps the last received message ”id” on its coroutine stack (in the 'acknowledge' LLSD). However, should EventQueueGet be received a second time during the connection with the region, the existing LLEventPoll instance would be destroyed and a new one created with the new (or identical: no check is done) capability URL.

    For the agent's region, I have so far never observed a second EventQueueGet receipt, so the risk of seeing the LLEventPoll destroyed and replaced with a new one (with a reset ”ack” field on the first request of the new instance) is pretty much non-existent.

    This could however possibly happen for neighbour regions (sim capabilities are often ”updated” or received in several ”bundles” for neighbour sims; not too sure why LL made it that way), but I am not even sure it does happen for EventQueueGet.

    I of course do not know what the LLAgentCommunication lifespan is server-side, but if a race happens, it can currently only be because that lifespan does not match the lifespan of the connection between the sim server and the viewer.

     

    15 hours ago, Monty Linden said:

    As before with one change to try to address the outer race scenarios.  Simulator will retain the last 'ack' received from the viewer.  If it receives a request with 'ack = 0' and 'last_ack != 0', this will be a signal that the viewer has lost synchronization with the simulator for whatever reason.  Simulator will drop any already-sent events, advance any unsent events, and increment its 'id' value.  'last_ack' now becomes '0' and normal processing (send current or wait until events arise or timeout) continues. 

    In fact, ”ack” is a very badly chosen key name. It is not so much an ”ack” as a ”last received message id” field: unless the viewer receives a new message, the ”ack” value stays the same for each new poll request it fires that does not result in the server sending any new message before the poll times out (this is very common for poll requests to neighbour regions).

    Note also that, as I already pointed out in my previous posts, several requests with the same ”ack” will appear server-side simply because these requests have been retried ”silently” by libcurl on the client side: the viewer code does not see these retries. For LLEventPoll, a request will not be seen timing out before libcurl has retried it several times and given up with a curl timeout: with neighbour sims, the timeout may only occur after 300s or so in LLEventPoll, while libcurl will have retried the request every 30s with the server (easily seen with Wireshark), and the latter will have seen 10 requests with the same ”ack” as a result.

    Also, be aware that with the current code, the first ”ack” sent by the viewer (on first connection to the sim server, i.e. when the LLEventPoll coroutine is created for that region, which happens when the viewer receives the EventQueueGet capability URL) will be an undefined/empty LLSD, and not a ”0” LLSD::Integer !

    Afterwards, the viewer simply repeats the ”id” field it gets in an event poll reply into the ”ack” field of the next request.

    To summarize: viewer-side, ”ack” means nothing at all (its value is not used in any way, and the type of its value is not even checked), and can be used as the server sees fit.
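
    For reference, here is a hedged sketch of the request/reply shapes as discussed above (”ack”, ”done”, ”id”, ”events”). It assumes the viewer's LLSD type from llsd.h; the helper functions are illustrative, not the actual LLEventPoll code:

      #include "llsd.h"   // Viewer LLSD type (assumed available).

      // Build the body of the next event poll POST: the viewer simply echoes
      // back the "id" it got in the previous reply as "ack" (an undefined
      // LLSD on the very first request), plus a "done" flag.
      LLSD build_poll_request(const LLSD& last_id, bool quitting)
      {
          LLSD body;
          body["ack"] = last_id;    // Undefined/empty on the first request.
          body["done"] = quitting;  // True only when tearing the poll down.
          return body;
      }

      // A reply carries an "id" (to echo back next time) and an "events"
      // array; each event holds a "message" name and a "body".
      void handle_poll_reply(const LLSD& reply, LLSD& last_id)
      {
          last_id = reply["id"];
          const LLSD& events = reply["events"];
          for (LLSD::array_const_iterator it = events.beginArray(),
                                          end = events.endArray();
               it != end; ++it)
          {
              // Dispatch (*it)["message"] with (*it)["body"]...
          }
      }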

    15 hours ago, Monty Linden said:

    This becomes viewer-only.  Once the race conditions are embraced, the fix for the outer race is for the viewer to keep memory of the region conversation forever.  A session-scoped map of 'region->last_ack' values is maintained by LLEventPoll static data and so any conversation can be resumed at the correct point.  If the simulator resets, all events are wiped anyway so duplicated delivery isn't possible.  Viewer just takes up the new 'id' sequence.  This should have been done from the first release.

    Easy to implement, but it will not be how the old viewers work, so... Plus, it would only be of use should the viewer restart an LLEventPoll with the sim server during a viewer-sim (not viewer-grid) connection/session, which pretty much never happens (see my explanations above).

    15 hours ago, Monty Linden said:

    Additionally, fixing the error handling to allow the full set of 499/5xx codes (as well as curl-level disconnects and timeouts) as just normal behavior.

    That hardening part is already in the Cool VL Viewer for the 499, 500 and 502 HTTP errors, which are considered simple timeouts (just like the libcurl timeout) and trigger an immediate relaunch of the request. All other HTTP errors are retried several times (and that retry count is doubled for the agent region: this was of invaluable help a couple of years ago, when poll requests were failing left and right with spurious HTTP errors for no reason, including in the agent region).
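
    For illustration, the classification described above can be summarized like this (a hedged sketch with hypothetical names and illustrative retry limits, not the actual LLEventPoll error handling):

      // Hypothetical classification of an event poll result, mirroring the
      // behaviour described above: 499/500/502 and curl-level timeouts are
      // treated as plain timeouts; everything else is retried a limited
      // number of times (doubled for the agent region) before giving up.
      enum class PollOutcome { Timeout, RetryWithBackoff, GiveUp };

      PollOutcome classify_poll_result(int http_status, bool curl_timeout,
                                       int consecutive_errors,
                                       bool agent_region)
      {
          if (curl_timeout || http_status == 499 || http_status == 500 ||
              http_status == 502)
          {
              return PollOutcome::Timeout;    // Relaunch a request immediately.
          }
          int max_retries = agent_region ? 2 * 10 : 10;   // Illustrative limits.
          return consecutive_errors < max_retries ? PollOutcome::RetryWithBackoff
                                                  : PollOutcome::GiveUp;
      }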

    15 hours ago, Monty Linden said:

    Maybe issue requests slowly while networks recover or other parts of the view decide to surrender and disconnect from the grid.

    This is already the case in the current viewer code: there is a llcoro::suspendUntilTimeout(waitToRetry) call for each HTTP error, with waitToRetry increased with the number of consecutive errors.
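
    A minimal sketch of that pattern (the constants are illustrative, not the actual viewer values; llcoro::suspendUntilTimeout() is the viewer's coroutine sleep, whose declaration is assumed here):

      #include <algorithm>

      // Declared in the viewer's llcoros.h (signature assumed here).
      namespace llcoro { void suspendUntilTimeout(float seconds); }

      // Wait longer after each consecutive event poll HTTP error before
      // re-posting the request; purely illustrative constants.
      void wait_before_retry(int consecutive_errors)
      {
          const float base_retry = 15.f;    // seconds
          const float max_retry  = 300.f;   // seconds
          float waitToRetry = std::min(base_retry * consecutive_errors,
                                       max_retry);
          llcoro::suspendUntilTimeout(waitToRetry);
      }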

    15 hours ago, Monty Linden said:

    Logging and defensive coding to avoid processing of duplicated payloads (which should never happen).

    Already done in the latest Cool VL Viewer releases, for duplicate TeleportFinish and duplicate/out-of-order AgentMovementComplete messages (for the latter, based on its Timestamp field).
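
    As an illustration of the second guard (filtering out-of-order AgentMovementComplete messages based on their Timestamp field), here is a hedged sketch with hypothetical names:

      #include <cstdint>

      // Hypothetical guard: drop an AgentMovementComplete whose Timestamp is
      // not newer than the last one already processed (duplicate or
      // out-of-order delivery).
      bool accept_agent_movement_complete(uint32_t msg_timestamp,
                                          uint32_t& last_processed_timestamp)
      {
          if (msg_timestamp <= last_processed_timestamp)
          {
              return false;     // Duplicate or older message: ignore it.
          }
          last_processed_timestamp = msg_timestamp;
          return true;
      }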

    15 hours ago, Monty Linden said:

    There's one special case where that won't be handled correctly.  When the viewer gets to 'ack = 1', moves away from the region, then returns resetting the simulator with 'id = 0'.  In this case, the first response upon return will be 'ack = 1' which *should* be processed in this case.  May Just let this go.  Also, diagnostic logging for the new metadata fields being returned.

    Frankly, this should never be a problem... Messages received via poll requests from a neighbour region that reconnects, or from a region the agent left a while ago (e.g. via TP) and comes back to, are not ”critical” messages, unlike messages received from the current agent region the agent is leaving (e.g. TeleportFinish)...

    15 hours ago, Monty Linden said:

    3 million repeated acks.  Just want to correct this.  The 3e6 is for overflows.  When the event queue in the simulator reaches its high-water mark and additional events are just dropped.  Any request from the viewer that makes it in is going to get the full queue in one response.

    I do not even know why you bother counting those... As I already explained, you will get repeated ”ack” fields at each timed-out poll request retry. These repeats should simply be ignored; the only thing that matters is that one ”ack” does not suddenly become different from the previous ones for no reason.

    15 hours ago, Monty Linden said:

    Starting and closing the race in the viewer.  First, a warning...  I'm not actually proposing this as a solution but it is something someone can experiment with.  Closing the race as outlined above only happens in the simulator.  BUT the viewer can initiate this by launching a second connection into the cap, say 20s after the first.  Simulator will cancel an unfinished, un-timed-out request with 200/LLSD::undef and then wait on the new request.  Viewer can then launch another request after 20s.  Managing this oscillator in the viewer would be ugly and still not deal with some cases.

    Fuzzing Technique.  You can try some service abuse by capturing the Cap URL and bringing it out to a curl command. Launch POST requests with well-formed LLSD bodies (map with 'ack = 0' element) while a viewer is connected.  This will retire an outstanding request in the simulator.  That, in turn, comes back to the viewer which will launch another request which retires the curl command request.

    That's a very interesting piece of info, and I used it to improve my experimental TP race workaround, albeit not with an added POST like you suggest: now, instead of just delaying the TP request until outside the ”danger window” (during which a race risks happening), I also fake an EventQueueGet capability receipt for the agent's sim (reusing the same capability URL, of course), which causes LLViewerRegion to destroy the old LLEventPoll instance and recreate one immediately (the server then receives a second request while the first is in the process of closing (*), and I do get the ”cancel” from the server in the old coroutine). I will refine it (adding ”ack” field preservation between LLEventPoll instances, for example), but it seems to work very well... 😜

    (*) yup, I'm using a race condition to fight another race condition !  Yup, I'm totally perverted ! 🤣

    • Like 1
  9. 20 hours ago, Feuerblau said:

    So yes, the VRAM ran full entirely here with the viewer. No matter if dynamic memory is chosen or the vanilla 2GB option.

    See this old post of mine: AMD drivers may not be the only culprit (though running a viewer ”fixed” with regard to VRAM leaks won't fix leaks happening at the OpenGL driver level, when the said driver has bugs).

  10. 3 hours ago, Monty Linden said:

    This is a bit of a warning...  the teleport problems aren't going to be a single-fix matter.  The problem in this thread (message loss) is very significant.  But there's more lurking ahead

    I am totally conscious of this; however we (animats & I) proposed you a ”free lunch”: implementing those dummy poll reply messages server-side (a piece of cake to implement, and which won't break anything, not even in old viewers) to get fully rid of the HTTP-timeout-related race conditions. Then we will already see how things fare with TeleportFinish, i.e. will it always be received by viewers ?...

    There is nothing to lose in trying this, and it could possibly solve a good proportion of failed TPs... If anything, even should it fail, it would allow eliminating one race condition candidate (or several), and reverting the code server-side would be easy and without any consequence.

    • Like 1
  11. 3 hours ago, Monty Linden said:

    There is one window:  during successful completion of a request with 'id = N + 1' and before a new request can be issued, the server's belief is that viewer is at 'ack = N'.  If the viewer walks away from the simulator (TP, RC) without issuing a request with 'ack = N + 1', viewer and server lose sync.

    This is not the issue at hand, and not what I am observing or what would cause the race condition I do observe and am now able to reproduce at will (thanks to the new ”poll request age” debug display in the Cool VL Viewer); this is really easy with the configured defaults (25s viewer-side timeout, and the experimental TP race workaround disabled): wait until the poll age display gets a ”*” appended, which will occur at around 24.5s of age, and immediately trigger a TP: bang, the TP fails (with a timeout quit) !

    The issue I am seeing in ”normal viewers” (viewers with LL's unchanged code, which my changes only allow to artificially reproduce ”reliably”) is a race at the request timeout boundary: the agent sim server (or Apache behind it) is about to time out (30s after the poll request has been started viewer-side, which will cause a ”silent retry” by libcurl), and the user requests a TP just before the timeout occurs, but the TeleportFinish message is sent by the server just after the silent retry occurred or while it is occurring. The TeleportFinish is then lost, so what happens in this case is:

    1. The sim server sent a previous message (e.g. ParcelProperties) with id=N, and the viewer replied with ack=N in the following request (with that new request not yet used for anything, and N+1 being the next ”id” the server would send).
    2. The user triggers a TP just as the ”server side” (be it at the sim server or the Apache server level, this I do not know) is about to time out on us, which happens 30s after it received the poll request from the viewer. At this point a Teleport*Request UDP message is sent to the sim server.
    3. The poll request started after the ParcelProperties receipt by the viewer times out server-side, and Teleport*Request (which took the faster UDP route) is also received by the sim server. What exactly happens at this point server-side is unknown to me: is there a race between Apache and the sim server, a race between the Teleport*Request and the HTTP timeout causing a failure to queue TeleportFinish, or is TeleportFinish queued in the wrong request queue (the N+1 one, which the viewer did not even start, because the sim server would consider the N one dead) ?... You'll have to find out.
    4. Viewer-side, libcurl gets the server timeout and silently retries the request (unknown to the viewer code in LLEventPoll), and a ”new” request (actually the same request, retried ”as is” by libcurl) with the same ack=N is sent to the server (this is likely why you get 3 million ”repeated acks”: each libcurl retry reuses the same request body).
    5. The viewer never receives TeleportFinish and never started a new poll request (as seen from LLEventPoll), so it is still at ack=N, with the request started after ParcelProperties still live/active/valid/waiting for a server reply, from its perspective (since it was successfully retried by libcurl).

     

    With my new code and its default settings (25s viewer-side timeout, TP race workaround OFF), the same thing as above occurs, but the request times out at the LLEventPoll level (meaning the race only reproduces after 24.5s or so of request age) instead of server-side (and then retried at the libcurl level); the only difference you will see server-side is that a ”new” request (still with ack=N) from the viewer arrives before the former timed out server-side (which might not be much ”safer” either, race-condition-wise, server-side).

    This at least allows a more deterministic ”danger window”, hence the ease of reproducing the race, and made possible my attempt at the TP race workaround (in which the sending of the UDP message corresponding to the user's TP request is delayed until outside the ”danger window”), which is sadly insufficient to prevent all TP failures.

     

    As for the ack=0 issues, they too are irrelevant to the cases where TPs and region crossings fail: in these two cases, the poll request with the agent region is live, and so is the one for the neighbour region involved in a region crossing. There will be no reset to ack=0 from the viewer in these cases, since the viewer never kills the poll request coroutines (on whose stack ack is stored) for the agent region and the close ( = within draw distance) neighbour regions.

     

    But I want to reiterate: all these timeout issues/races would vanish altogether, if only the server could send a dummy message when nothing else needs to be sent, before the dreaded 30s HTTP timeout barrier (say, one message every 20s, to be safe).

    • Like 1
  12. 5 hours ago, Monty Linden said:

    and the viewer needs to retry on 499/5xx receipt anyway

    LL's current viewer code considers these cases as errors, which are only retried a limited number of times before the viewer gives up on the event polls for that sim server; they should therefore not happen in ”normal” conditions, and they do not happen only because the code currently lets libcurl retry and time out by itself, at which point the viewer gets a libcurl-level timeout, which is considered normal (not an error) and retried indefinitely.

  13. 5 hours ago, Monty Linden said:

    With these two done, a different experimental mode would be to set the viewer timeout to something that should almost never trigger:  45s or more.

    3 hours ago, animats said:

    Viewer event poll timeout bad, simulator timeout normal. Sounds good.

    You can increase the timeout to 45s with the Cool VL Viewer now, but sadly, in some regions (*) this will translate into a ”spurious” libcurl-level retry after 30s or so (i.e. a first server-side timeout gets silently retried by libcurl), before you do get a viewer-side timeout after the configured 45s delay; why this happens is unclear (*), but sadly, it does happen, meaning there is no way, for now, to always get a genuine server-side timeout in the agent region (the one that matters), nor to prevent a race during the first ”silent retry” by libcurl...

    (*) I would need to delve into libcurl's code and/or instrument it, but I saw cases (thus ”in some regions”) where there were no silent retries by libcurl and I did get a proper server-side timeout after 30s, meaning there might be a way to fix this issue server-side, since it looks like it depends on some server(s) configuration (at the Apache level, perhaps... Are all your Apache servers configured the same ?)...

    3 hours ago, animats said:

    If we have problems with duplicates, that's detectable and fixable

    I already determined that a duplicate TeleportFinish message could possibly cause a failed TP with the existing viewer code, because there is no guard in process_teleport_finish() against a TeleportFinish received after the TP state machine has moved to another state than TELEPORT_MOVING, and process_teleport_finish() is the function responsible for setting that state machine to TELEPORT_MOVING... So, if the second TeleportFinish message (which is sent by the departure sim) is received after the AgentMovementComplete message (which is sent by the arrival sim, itself connected on the first TeleportFinish message occurrence, and which sets TELEPORT_START_ARRIVAL), you get a ”roll back” in the TP state machine from TELEPORT_START_ARRIVAL to TELEPORT_MOVING, which will cause the viewer to fail to finish the TP process properly.
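
    A hedged sketch of the kind of guard this calls for (the enum below is a reduced, illustrative subset of the viewer's teleport states, and the function is not the actual process_teleport_finish() code):

      // Reduced, illustrative subset of the viewer's TP states, in the order
      // they are traversed during a TP.
      enum ETeleportState
      {
          TELEPORT_NONE,
          TELEPORT_REQUESTED,       // TP request sent, waiting for TeleportFinish.
          TELEPORT_MOVING,          // TeleportFinish processed, moving to arrival sim.
          TELEPORT_START_ARRIVAL    // AgentMovementComplete received from arrival sim.
      };

      // Ignore a (duplicate) TeleportFinish once the arrival sim has taken
      // over, so the state machine cannot roll back to TELEPORT_MOVING.
      bool should_process_teleport_finish(ETeleportState state)
      {
          return state != TELEPORT_NONE && state < TELEPORT_START_ARRIVAL;
      }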

     

    So, basically, a procedure must be put in place so that viewers without future hardened/modified code will not get those duplicate event poll messages. My proposal is as follows (a rough sketch in code is given after the list):

    • The server sends a first (normal) event poll reply with the message of interest (TeleportFinish in our example) and registers the ”id” of that poll reply for that message.
    • The viewer should receive it and immediately restart a poll request with that ”id” in the ”ack” field; if it does not, or if the ”ack” field contains an older ”id”, the viewer probably missed the message, but the server cannot know for sure, because the poll request it receives might be one started just as it was sending TeleportFinish to the viewer (the request timeout race condition case).
    • To make sure, when the ”ack” field of the new poll does not match the ”id” of the TeleportFinish reply it sent, the server can reply to the viewer's new poll with an empty array of ”events”, registering the ”id” of that empty reply.
    • If the viewer's next poll still does not contain the ”id” of the ”TeleportFinish” reply but does contain the ”id” of the empty reply, it obviously did not get the first ”TeleportFinish” message, and it is safe for the server to resend it...
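
    A hedged sketch of that proposed server-side decision (all names are hypothetical; this illustrates the proposal above, not existing simulator code):

      // Hypothetical server-side decision for the resend procedure proposed
      // above. 'ack' is the value found in the incoming poll request; the two
      // ids are the ones the server registered when sending the corresponding
      // replies (teleport_finish_id for the reply carrying TeleportFinish,
      // empty_probe_id for the empty "events" reply, once sent).
      enum class ResendAction { Nothing, SendEmptyProbe, ResendTeleportFinish };

      ResendAction on_next_poll_request(int ack, int teleport_finish_id,
                                        bool probe_sent, int empty_probe_id)
      {
          if (ack == teleport_finish_id)
          {
              return ResendAction::Nothing;          // The viewer got the message.
          }
          if (!probe_sent)
          {
              // Cannot tell yet (the request may have raced the reply): answer
              // with an empty "events" array and register its id.
              return ResendAction::SendEmptyProbe;
          }
          if (ack == empty_probe_id)
          {
              // The probe was acknowledged but the message never was: the
              // viewer definitely missed TeleportFinish, resend it.
              return ResendAction::ResendTeleportFinish;
          }
          return ResendAction::Nothing;              // Still ambiguous: keep waiting.
      }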

     

    EDIT: but the more I think about it, the more I am persuaded that the definitive solution to prevent race conditions is to suppress entirely the risk of poll request timeouts anywhere in the chain (sim server, Apache, libcurl, viewer). This would ”simply” entail implementing the proposal made above. By ensuring a dummy/empty message is sent before any timeout would occur, we ensure there is no race at all, since the initiative of closing the HTTP connection then belongs exclusively to the sim server (via the reply to the current event poll request, be it a ”normal” message or a ”dummy” message when there is nothing to do but prevent a timeout), while the initiation of the poll request HTTP connection only happens at the viewer code level.

  14. Cool VL Viewer releases (v1.30.2.28 and v1.31.0.6) published, with my new LLEventPoll code and an experimental (partial) workaround for the race condition causing TP failures.

    The new goodies work as follows:

    • LLEventPoll was made robust against the 499 and 500 errors often seen in SL when letting the server time out on its side (which is not the case with LL's current code, since libcurl retries long enough and times out by itself). 502 errors (which were already accepted for OpenSim) are now also treated as ”normal” timeouts for SL. It will also retry 404 errors (instead of committing suicide) when they happen for the agent's sim (the agent sim should never be disconnected spuriously, or at least not before many retries have been attempted).
    • LLEventPoll now sets HTTP retries to 0 and a viewer-side timeout of 25 seconds by default for SL. This can be changed via the ”EventPollTimeoutForSL” debug setting, whose new value is taken into account on the next start of an event poll.
    • LLEventPoll got its debug messages made very explicit (with human-readable sim names, detailed HTTP error dumps, etc). You can toggle the ”EventPoll” debug tag (from ”Advanced” -> ”Consoles” -> ”Debug tags”) at any time to see them logged.
    • LLEventPoll now uses an LLTimer to measure the poll request age. The timer is started/reset just before a new request is posted. Two methods have been added: one to get the event poll age (getPollAge(), in seconds) and a boolean one (isPollInFlight()) which is true when a poll request is waiting for server events and its age is within the ”safe” window (i.e. when it is believed to be old enough for the server to have received it and not too close to the timeout). The ”safe window” is determined by the viewer-side timeout and a new ”EventPollAgeWindowMargin” debug setting: when the poll request age is larger than that margin and smaller than the timeout minus that margin, the poll is considered ”safe enough” for a TP request to be sent to the server without risking a race condition. Note that, for the ”minimum age” side of the safe window, EventPollAgeWindowMargin is automatically adjusted down if needed for each LLEventPoll instance (by measuring the minimum time taken by the server to reply to a request), and the frame time is also taken into account (else you could end up never being able to TP, when the event rate equals the frame rate or is smaller than EventPollAgeWindowMargin).
    • The age of the agent region event poll can be displayed in the bottom right corner of the viewer window via the ”Advanced” -> ”HUD info” -> ”Show poll request age” toggle: the time (in seconds) gets a ”*” appended whenever the poll request age is outside the ”safe window”.
    • An experimental TP race workaround has been implemented (off by default), which can be toggled via the new ”TPRaceWorkAround” debug setting. It works by checking isPollInFlight() whenever a TP request is made, and if not in the safe window, it ”queues” the request until isPollInFlight() returns true, at which point the corresponding TP request UDP message is sent to the server. To debug TPs and log their progress, use the ”Teleport” debug tag.

     

  15. Thank you for a really useful paper !

    It indeed explains a lot of things I could observe here with my improved LLEventPoll logging and my new debug settings for playing with poll request timeouts... As well as some so far ”unexplainable” TP failure modes that resist my new TP queuing code (queuing until the next poll has started, when too close to a timeout).

    Tomorrow's releases of the Cool VL Viewer (both the stable and experimental branches) will have all the code changes I made and will allow you to experiment with it. I will post details about the debug settings and log tags here after release.

    Looking forward to the server changes. I'll have a look at what duplicated messages could entail viewer-side (especially CrossedRegion and TeleportFinish, which could possibly be problematic if received twice) and whether it would mandate viewer code changes or not.

    • Like 1
  16. 22 minutes ago, MarissaOrloff said:

    I ran Superposition 3 times in Windows but the result is so far off (almost half of the Linux result) I don't trust it.

    Unigine Superposition is far from optimized for OpenGL... You'd get better results under Windows with DirectX than under Linux with OpenGL, even though Windows' OpenGL performance with it is indeed abysmal. So yes, better not to trust its OpenGL results too much.

    The results of Valley are however perfectly in line with what I get with the viewer: around +10% fps in favour of Linux.

    22 minutes ago, MarissaOrloff said:

    I'm not running Windows 8, Valley is just that old!

    In fact, you'd get better results with Windows 7/8 (less overhead than Win10 or Win11)... The problem being that you won't find valid drivers for them with such a modern GPU...

    • Like 1
  17. 1 hour ago, AmeliaJ08 said:

    Good lord, Linus Torvalds is stupid according to you

    When he is giving the finger, yes, he definitely looks like the stupidest man in the world... Linus Torvalds is no god, and while quite intelligent, he can also prove totally stupid at times, like everyone (us included): giving the finger to people, for whatever reason, is one of the stupidest and most pointless things to do (and will likely achieve the exact opposite of what the person giving the finger expects/hopes) !

    1 hour ago, AmeliaJ08 said:

    to this day the Nvidia driver is full of missing and broken features even on the most recent GPUs

    Oh, and what would those be, please ?... I have been using NVIDIA cards and their proprietary drivers for over 19 years (my first NVIDIA card was a 6600GT), and never missed a single feature !

    1 hour ago, AmeliaJ08 said:

    Settle down

    Settle down yourself, pretty please... I am not the person who is spreading FUD...

    1 hour ago, AmeliaJ08 said:

    and answer the simple question about GL performance

    I already replied to this question, but of course, if you only read the first sentence of my previous post, you missed it... Read again: it was in the second sentence... 🫣

    • Like 1
  18. 1 hour ago, Anna Nova said:

    Whenever I  'attach to' I get an extra .(wherever) folder added.  Not saying you are wrong to do that, but I don't like it.  So instead of one folder with 5 things in it, I get one folder with 5 subfolders each with one thing in it.

    This is only the case in the #RLV folder, and you may disable this behaviour...

    This is strictly how RLV is supposed to work for no-mod attachments, and how Marine Kelley specified it (see the text after ”HOW TO SHARE NO-MODIFY ITEMS”); the Cool VL Viewer uses my own fork of Marine's implementation, which abides strictly by her specifications. Attachments get renamed when they are in #RLV, to add the joint name to their name (this avoids accidentally detaching attachments on locked joints when you change outfits, and prevents the detach/auto-reattach sequence that would ensue and could break some scripts or trigger anti-detach alarms in some objects); for no-mod attachments (which cannot be renamed), RLV instead moves them into a newly created sub-folder bearing the joint name.

    However, since some people are used to the RLVa viewers' way of doing things (RLVa is a rewrite of RLV and differs in many subtle and less subtle ways from it), I implemented a setting to disable the auto-renaming of attachments in #RLV (which also stops the viewer from creating sub-folders for no-mod attachments): the toggle is ”Advanced” -> ”RestrainedLove” -> ”Add joint name to attachments in #RLV/”.

    A simple question on the Cool VL Viewer support forum would have given you the answer...

    • Like 1
    • Thanks 1
  19. 11 hours ago, AmeliaJ08 said:

    Curious how does Nvidia's much maligned (rightly) Linux driver perform? is it as good as on Windows?

    Maligned rightly ?... Only by stupid people, I'm afraid...

    The proprietary NVIDIA drivers under Linux work beautifully (and around 10% faster than under Windows), with first class and super-long-term support: all the bugs I reported to NVIDIA in the 19+ years I have been using their drivers have been addressed, most of them quite promptly (first class indeed, especially when compared with the AMD and ATI cards I owned in the distant past, for which Linux support was abysmal), and today I can still run my old GTX 460 (a 13 year old card !) with the latest Linux LTS kernels and the latest Xorg version.

    They are also super-stable, and adhere strictly to OpenGL specs.

    The Vulkan drivers and the CUDA stack are great too (with CUDA much faster and often actually better supported under Linux than OpenCL: e.g. with Blender, which only recently started implementing support for OpenCL, while CUDA has been supported for years).

    It should also be noted that NVIDIA open-sourced their drivers for their recent GPUs, and that, while AMD and Intel (used to) contribute more Open Source to Linux, they still rely on the Mesa folks for their Linux drivers (meaning less performance than a closed-source driver, because the Mesa maintainers do not have access to all the secret architecture details of the GPUs), and you still need closed-source software ”blobs” to run their GPUs under Linux...

    • Like 2
  20. 11 hours ago, Anna Nova said:

    I did try @Henri Beauchamp's viewer, but there are terrible things it does to my inventory

    What things exactly ? O.O

    My viewer is in fact MUCH safer than any other viewer, since it never touches your inventory structure unless you manually trigger an action it offers (such as consolidating base folders, or recreating missing calling cards), unlike what even LL's viewer does behind your back (consolidation and calling card recreation are systematic at each login with LL's v2+ viewers and all the TPVs forked from them).

    It also has safeguards against deleting essential folders or moving them to another folder (such as the COF: deleting or moving it could get you into BIG trouble), while allowing you to delete (if and only if you so wish) some unnecessary folders that got introduced with v2+ viewers and are just clutter for v1 viewer old-timers like me.

    As for its consolidation algorithm (only triggered on demand), it is more elaborate than LL's and also able, sometimes, to repair ”broken” inventories (inventories with duplicate base folders, for example).

    It is not because the inventory is presented differently (like a v1 viewer does) that ”terrible things” have been done to it !

  21. 2 hours ago, Dorientje Woller said:

    Good, what about people that are using ”dumb” phones for various reasons. Having a smarphone is still a free choice? How will you use those authenticators.

    In fact, you can use MFA in SL without a smartphone, but it is rather complicated and I wish LL would provide MFA via email...

    Here is the procedure I described in the opensource-dev mailing list (at the end of that archived email).

  22. This is likely due to an inventory server issue: the ”Marketplace Listings” folder is created in the merchant's inventory as soon as they connect to SL for the first time as a merchant.

    In LL's original code, any failure to create that folder (which may happen in case of inventory server issues) causes an LL_ERRS(), which ”voluntarily” crashes the viewer... Rather user-unfriendly, and not very helpful either.

    Try and connect with the Cool VL Viewer instead: it won't crash, and should it also fail to create the Marketplace Listings folder, you can try disabling AISv3 (an HTTP-based inventory protocol, which sometimes goes mad and stops working for a while) from the ”Advanced” -> ”Network” menu (un-check the ”Use AISv3 protocol for inventory” entry), then relog and check the ”Inventory” floater (there is no separate Marketplace floater: it's all v1-like UI, with all inventory folders showing in the Inventory floater, even though you may choose to hide the Marketplace Listings folder via the corresponding entry in the ”Folder” menu of the Inventory floater). It will also log useful diagnostic messages (CTRL SHIFT 4 to toggle the log console) when something goes wrong, instead of crashing...

    After the Marketplace Listings folder has been successfully created, you can relog with any other viewer and it won't crash any more (at least not at this place in that crude code 😜 )

     

    • Thanks 1