Henri Beauchamp

Resident
  • Posts

    1,180
  • Joined

Posts posted by Henri Beauchamp

  1. 2 hours ago, Monty Linden said:

    This initial phase is going to be constrained by the http server we have in the simulator.  It's based on that pump/pipe model.  Without rework, the only timeout response it can formulate is a closed socket.  :(

    Which would be an argument in favour of @animats' suggestion to send an empty array of events instead of letting the request time out... Of course, it means more work for the sim server (monitoring the request timing for each connected viewer and sending an empty events array when it gets close to the HTTP timeout, so as to avoid the latter), but it should not prove too overwhelming either...

    • Like 1
  2. 31 minutes ago, JeromFranzic said:

    works well for me in Linux and Windows

    As I wrote above, it has always been extremely finicky: some driver and monitor (*) combinations more or less work, others cause crashes (usually the stack trace points deep into the OpenGL driver), or a failure to set the proper resolution or aspect ratio: your screenshot looks to me as if the ratio is not properly set (look at the oblong shape given to the camera control ball, for example)...

    (*) Yup, it also depends on the monitor which transmits (or not, or late) its characteristics to the driver via the EDID protocol.

     

    Here is what I get with the Cool VL Viewer in full screen mode under Linux: notice the proper aspect ratio, visible in the circular UI elements (in the camera controls), the HUD radar at the bottom right, and the bicycle wheels.

    CoolVLViewerFullScreen.png

    • Like 1
  3. 50 minutes ago, animats said:

    That's why I suggested that event poller timeout should be server side and should consist of sending a normal LLSD reply with an empty array of events.

    That won't be a timeout, but a periodic empty message sent by the server before the request would actually time out at the HTTP stack level. It means that, if the server did not send any message in the past 28 seconds or so to a viewer with an active request (to avoid the 30s HTTP timeout, counting the "ping" time, the frame time, and possible server-side lag at the next frame), it must send a reply with an empty events array.

    But yes, it would work with the current code used by viewers, and would definitely prevent some race conditions on TP (the race happening when the TP request is sent just as the server times out and libcurl silently retries the request, with TeleportFinish sent by the server too soon before libcurl could reconnect).
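
    A minimal sketch of that server-side keep-alive decision, for illustration only: PendingPoll, kKeepAliveAgeS and shouldReplyNow() are hypothetical names, not simulator code, and the 28s threshold is just the rough figure mentioned above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical server-side view of one viewer's outstanding poll request.
struct PendingPoll
{
    double age_s = 0.0;               // seconds since the request arrived
    std::vector<std::string> events;  // events queued for that viewer
};

// Assumed margin: answer about 28s into a 30s HTTP timeout.
constexpr double kKeepAliveAgeS = 28.0;

// True when the server should answer the poll now: either real events are
// queued, or the request is old enough that an empty events array must be
// sent as a keep-alive before the HTTP stack times out.
inline bool shouldReplyNow(const PendingPoll& poll)
{
    assert(poll.age_s >= 0.0);
    return !poll.events.empty() || poll.age_s >= kKeepAliveAgeS;
}
```

    With such a check run once per frame for each connected viewer, the viewer would always receive a normal reply (possibly with no events) and would simply start the next poll.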

  4. 4 hours ago, Monty Linden said:

    Event poller.  I suspect it's a case where the simulator has tossed the LLAgentCommunication state after a region crossing and avatar drift.  But the viewer keeps its LLEventPoll object alive so viewer and sim are now desynchronized.  Haven't dug into it yet - future project.  Even the 3e6 dropped events per day are just 4 / hour / region when amortized.

    Well, the viewer indeed keeps the event poll alive after the agent has left the region, which is needed to keep region communications alive in the case when the region border was simply crossed, or in the case of a ”medium range” TP in a neighbour region and still within draw distance.

    Of course, in the ”far TP” case, the viewer will keep polling until it finds out the region is to be disconnected, so it might restart a poll after a far TP, acknowledging (again) the last received message Id (same Id as previous poll)...

    Double acks will also happen whenever a poll request ”fails” (or simply times out) for a live region, and the viewer restarts a second poll: here again, the ”id” of the last received message is repeated in the ”ack” field of the poll request.
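
    To illustrate why the same "ack" gets repeated, here is a rough sketch of the viewer-side bookkeeping. PollState, PollRequest and the two helpers are made-up names (this is not the LLEventPoll code), and the "id" is assumed to be a plain integer for simplicity.

```cpp
#include <cassert>

// Hypothetical viewer-side state: the "id" of the last received reply.
struct PollState
{
    int last_id = 0;  // 0 means "nothing received yet" (assumption)
};

// Hypothetical poll request, reduced to its "ack" field.
struct PollRequest
{
    int ack;
};

// Every new request acknowledges the last reply that was received.
inline PollRequest makePollRequest(const PollState& state)
{
    assert(state.last_id >= 0);
    return PollRequest{state.last_id};
}

// Only a successful reply updates the state; a failed or timed out poll
// leaves it untouched, so the next request repeats the same "ack".
inline void onPollReply(PollState& state, int id)
{
    state.last_id = id;
}
```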

  5. IIRC, Ubuntu got Wayland enabled by default... Firestorm (like almost all other Linux viewers) uses X11, and the Xwayland compatibility layer provided for Wayland is known to be buggy in many respects.

    What happens if you disable Wayland usage in Ubuntu 22?

    Note that the full screen mode has always been extremely finicky and crashy in SL viewers. For the Cool VL Viewer, I fully reworked it so that, when enabled, the viewer goes full screen from the very start, instead of attempting to switch from windowed to full screen (i.e. fully restart GL from scratch) on login: it solves many issues (mainly OpenGL driver level crashes, but also resolution detection issues). For Linux, it also got an optional ”full desktop” mode, instead of genuine full screen (i.e. the viewer uses your current desktop resolution with a decoration-less window, the bonus being that it can also run with other managed windows on top of it).

    Finally, it may be possible, depending on the window manager in use, to add a rule in the latter so that it does not decorate the viewer window and forces it full screen; you then could run the viewer in ”windowed” mode, but full screen and border-less, similar to what you would get in ”full desktop” mode.

  6. 4 hours ago, kyte Lanley said:

    For the moment I have only had one testimonial from a 7000 series owner and he was not in favor of the AMD card due to a lack of stability. I don't think there are many testimonials. I have the impression that owners of AMD cards are disappointed by the behavior of this card in Second Life. I'm waiting for more than just one miserable testimony to give me an idea.

    I'm rather under the impression that you are seeking just one favourable testimony to use as an excuse to follow your personal feeling/belief that an AMD card would be better suited for you...

    Just go ahead, buy whatever suits your own needs/preferences, and take responsibility for your choice. Just don't come back here to complain that "we" gave you bad advice, should you find out you made a mistake. 😜

    As for graphics card prices, it might be wiser/smarter to wait a little bit: NVIDIA's cards are already seeing a price adjustment as a result of AMD's newest card releases (competition is a Good Thing ™). It will take some time to propagate to France (but you could just as well buy a card from a more reactive German supplier), but prices are going to drop a bit in the coming weeks. The second half of October is usually a good moment to buy computer hardware (long enough after people's return from summer vacations, soon enough before Christmas). There is also the option to wait for a sale/opportunity on the previous generation of cards (even an RTX 3070 is plenty powerful enough for SLing).

    • Thanks 1
  7. 3 hours ago, Monty Linden said:

    Viewer, on the other hand, may keep simulator information around for the life of the session.

    Not for the full session, no, but typically up to two minutes or so after the "departure" (actually after the region is out of draw distance): the event poll is terminated when it times out after the departure, then the region itself is removed from memory within one minute, the last thing to go being the UDP data circuit, about two minutes after departure.

    Of course, should the avatar go back to the region, things still around may be reused...

  8. 1 hour ago, kyte Lanley said:

    So if I understand correctly Second Life uses Open GL 4.6 contrary to what you said.

    The viewer will use the highest possible Open GL version available from your drivers. The higher the better (better optimized, faster, more features).

    The viewer currently does not need features specifically introduced in the latest OpenGL versions (this might change "soon": the PBR viewer already needs OpenGL v3.2 features), but drivers with v4.6 and a good core profile implementation (the core profile gets rid of the cruft of old OpenGL versions) are much faster.

    NVIDIA proprietary drivers in OpenGL v4.6 core profile are typically +50% to +100% faster than in compatibility profile, something that is not seen happening with AMD proprietary drivers.

    • Like 1
    • Thanks 2
  9. 20 minutes ago, Monty Linden said:

    This is going to be fun.  One of the recent discoveries:  simulator only uses the supplied 'id' value for logging.  It has no functional effect in the current event queue handling scheme.

    I'd say I am not surprised, given the absence of retries from the server part, when TeleportFinish is not received by viewers...

    22 minutes ago, Monty Linden said:

    This is going to require TPV testing against test regions given that things like 'id' will become semantically meaningful again.

    Count me in among the ”guinea pigs”. 😜

    I'll gladly help test a fix (or several) for this years-long TP bug that is plaguing SL.

  10. 13 hours ago, animats said:

    Right. I never see a legit 502 error from Second Life. I handle timeout as described above, which does seem to work. But I have more control of the protocol stack in Sharpview, since ”curl” is not involved.

    Yes, I tried to set up a larger timeout than that of the SL servers, and to take into account the bogus "502 in disguise" errors I get, treating them as simple poll timeouts. I then observed a strange thing, which can only be explained by some weird internal working of libcurl: the timeouts occur after 61.25 seconds or so (instead of 30.25 seconds or so, which would correspond to the server timeout plus the ping time), and I do see in Wireshark libcurl retrying the connection once on the first server timeout (i.e. after ~30s) instead of passing the latter to the application, as it was instructed to do (setRetries(0))!... Maybe it is due to that "502 in disguise" issue (libcurl would not recognize a "genuine" timeout and would retry once?)...

    So in the end, the only way for me to see a genuine timeout occurring is to set the viewer-side timeout below the server one...

     

    Also, everyone, you can all stop holding your breath: after stress-testing it (and despite further refinements brought to its code), my workaround does not prove robust enough, and I can still see TP failures happening sometimes (more rarely than without it, but still happening nonetheless)... I will publish it in the next Cool VL Viewer releases (with debug settings for a kill switch and several knobs to play with, and that handy poll request age debug display for easy repros of TP failures), but it is not a solution to TP failures, sadly. 😢

    So, we will have to wait for Monty to fix the server side of things... 😛

  11. 28 minutes ago, animats said:

    502 - normal server side timeout per documentation. Poll again. (The Other Simulator sends this).

    The problem is that you do not get that when logged in to SL: you get a 499 or 500 error in the header (and "502 error" printed in the body). Meaning that, somehow, the 502 error gets mutated into another one, and is then not recognized as such by the viewers. This is why you cannot let the server time out when connected to SL (everything works as expected when connected to OpenSim, where I do let the server time out in my code).
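
    For illustration only, here is how such a "502 in disguise" could be told apart from a real failure. This is a hypothetical check, not something existing viewers do (they only see the status code), and searching the body for the "502" substring is my assumption about what the error page contains.

```cpp
#include <cassert>
#include <string>

// Hypothetical classifier: should this HTTP reply be treated as a normal
// server-side poll timeout (i.e. just poll again)?
inline bool isServerPollTimeout(int status, const std::string& body)
{
    assert(status >= 100 && status <= 599);
    if (status == 502)
    {
        return true;  // genuine 502, as sent by OpenSim
    }
    // In SL, the header may carry 499 or 500 while the body still reports
    // a 502 error (assumption: the body contains the "502" substring).
    return (status == 499 || status == 500) &&
           body.find("502") != std::string::npos;
}
```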

  12. Success !

    I managed to:

    1. Reliably reproduce the TP failure modes related to event poll request expiration and restart delays (race condition with the servers).
    2. Find and implement a robust workaround for those.

    The problem seen is indeed due to how a TP request by the user can be sent to the server while the poll request is about to time out, or was just closed and is being restarted as the result of the receipt of an event poll message. If the server queues the TeleportFinish message (or any message, but this one is unique and supposed to be 100% reliable, unlike ParcelProperties & Co) while the viewer is in the process of restarting a poll request, somehow that message will never be received.

    To confirm this, I use an LLTimer which is reset just before I post (and yield) the request in LLEventPoll. I also use a 25s timeout and no libcurl-level retries for those requests, so that they always time out on the viewer side and so that the said timeout is always seen by the LLEventPoll code. I also implemented a debug display for that timer in the viewer window, so that I can easily trigger a TP manually just before or just after the event poll request has expired or started; doing so, I can reliably reproduce the TP failures that so far seemed to happen "randomly".

    As for the workaround, it is implemented in the form of TP queuing and event poll timer window checking: whenever a TP request is made within 500ms of the agent region poll request timing out or being restarted, the TP is queued (via a new LLAgent::TELEPORT_QUEUED state, which allows reusing the existing state machine implemented in llagent.cpp and llviewerdisplay.cpp), and the corresponding UDP message (either TeleportLocationRequest, TeleportLandmarkRequest or TeleportLureRequest) requesting the TP from the server is put on hold until the event poll request timer is again in the stable/established connection window, at which point the TP request message is sent. So far (stress-testing still in progress), it works wonders and I do not experience failed TPs any more.
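
    The window check itself boils down to something like the sketch below. It is not the actual Cool VL Viewer code: TeleportAction, PollWindow and checkTeleport() are made-up names, and I assume here that the 500ms guard applies both right after a poll restart and right before the 25s viewer-side timeout.

```cpp
#include <cassert>

// Hypothetical outcome of the check done when the user requests a TP.
enum class TeleportAction { SendNow, Queue };

// Assumed timings, taken from the figures quoted in this post.
struct PollWindow
{
    double timeout_s = 25.0;  // viewer-side poll request timeout
    double margin_s = 0.5;    // guard band at both ends of the request
};

// poll_age_s: seconds elapsed since the current poll request was posted.
inline TeleportAction checkTeleport(double poll_age_s,
                                    const PollWindow& w = PollWindow())
{
    assert(poll_age_s >= 0.0);
    const bool just_started = poll_age_s <= w.margin_s;
    const bool about_to_expire = poll_age_s >= w.timeout_s - w.margin_s;
    return (just_started || about_to_expire) ? TeleportAction::Queue
                                             : TeleportAction::SendNow;
}
```

    A queued TP would then be re-checked on each frame, and its UDP request message sent as soon as the check returns SendNow.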

    If everything runs as expected and I am satisfied with the stress-testing, this code will be implemented in the next releases of the Cool VL Viewer.

  13. 11 hours ago, Monty Linden said:

    Might include temporary change in timeout to 25 or 20 seconds.

    EEEK !

    Don't do that: viewers would see those ugly "502 in disguise" errors, which would be considered poll request failures by the current viewers' code, and only retried a limited number of times!

    With the current viewer code and in SL (*), the poll request timeout must occur on the viewer side (yes, even though it is ”transparently” retried on libcurl level: the important point is that the fake 502 error is not seen by the viewer code).

    If anything, increasing the server-side timeout from 30s to 65s or so (so that a "ParcelProperties" message would make it through before each request would time out) would reduce the opportunities for race conditions.

    (*) For OpenSim-compatible viewers, a (true) 502 error test is added, which is considered a timeout and retried like a viewer-side libcurl timeout, but this test is only performed while connected to OpenSim servers, which do not lie about 502 errors by disguising them as 499 or 500 ones in their header.

    Quote

    Robust event transfer.  Might require viewer changes.

    Pretty please, make it so that these changes remain backward-compatible...

    One possible such change would be as follows:

    Currently, viewers acknowledge the previous poll event "id" on restarting a request, by setting the "ack" request field equal to the previous result "id". It means that, for TeleportFinish, the server would normally see the "id" it used to transmit it coming back immediately in the "ack" field of the request that follows its receipt by the viewer. If the server does not get it (because it does not get a new request posted by the viewer), then the TeleportFinish was not received and should be resent.

    To be 100% sure that the request is not just in flight or delayed, the server could send two different commands in a row on TP: TeleportFinish first, then, for example, ParcelProperties in a different message (different id); if no "ack" for TeleportFinish has been received by then, it would re-issue it.
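
    A rough sketch of how the server side of this scheme could track acknowledgements. ReliableEventTracker is a made-up name, not simulator code, and I assume here that ids are positive and increase with each message, so that an "ack" for a later id without a prior "ack" for an earlier one means the earlier message was lost.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical server-side tracker for events that must be acknowledged.
class ReliableEventTracker
{
public:
    // Record that a reliable event was sent with the given message id.
    void onSent(int id, const std::string& name)
    {
        assert(id > 0);
        mUnacked[id] = name;
    }

    // Called with the "ack" field of each new poll request. Returns the
    // events that should be re-issued: those sent with an older id than
    // the acknowledged one and which were never acknowledged themselves.
    // They stay tracked until a matching "ack" arrives.
    std::vector<std::string> onPollRequest(int ack)
    {
        mUnacked.erase(ack);
        std::vector<std::string> resend;
        for (std::map<int, std::string>::const_iterator it = mUnacked.begin();
             it != mUnacked.end() && it->first < ack; ++it)
        {
            resend.push_back(it->second);
        }
        return resend;
    }

private:
    std::map<int, std::string> mUnacked;  // message id -> event name
};
```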

  14. 31 minutes ago, Monty Linden said:

    Probably my fault.  The policy group for long-poll should probably not attempt retries.  It hides things that should have been brought up to viewer awareness.

    The LLAppCoreHttp::AP_LONG_POLL policy group does not define the retry attempts, at least not in my viewer... But explicitly setting mHttpOptions->setRetries(0) causes ”502” errors in disguise (502 body, 500 or 499 header) to happen...

    However, setting mHttpOptions->setTransferTimeout(25) (25s timeouts, i.e. below the server timeout) with mHttpOptions->setRetries(0) seems to work just fine: libcurl then times out after 25s and the viewer fires a new poll, as expected (and there is no trace of retries in Wireshark)... This would eliminate a possible cause for a race condition.

    And I got an idea to avoid TP failures that would possibly be the result of a race between a received event processing, the triggering of a TP by the user just at that moment, the firing of a new poll, and the TeleportFinish transmission. I'll try to set an ”in flight” flag on starting the poll request, reset it on request return, and on TP test that flag: when not ”in flight”, yield to coroutines until the coroutine for the event poll can fire a new request (setting the flag); the TP would then be fired while the poll request is ”stable” and waiting for a server transmission.
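
    The "in flight" flag idea could look roughly like this. PollGate and waitForStablePoll() are hypothetical names, and the yield callback merely stands in for yielding to the viewer coroutines; this is only a sketch, not tested viewer code.

```cpp
#include <cassert>

// Hypothetical flag holder shared by the event poll and the TP code.
class PollGate
{
public:
    // Set when the poll request is posted.
    void onPollStarted() { mInFlight = true; }

    // Reset when the request returns (event received, error or timeout).
    void onPollReturned()
    {
        assert(mInFlight);  // a return without a start is a logic error
        mInFlight = false;
    }

    bool inFlight() const { return mInFlight; }

private:
    bool mInFlight = false;
};

// Called before sending the TP request: yields until the event poll
// coroutine has posted a new request, i.e. until the poll is "stable".
template <typename Yield>
void waitForStablePoll(const PollGate& gate, Yield yield)
{
    while (!gate.inFlight())
    {
        yield();
    }
}
```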

  15. 40 minutes ago, Monty Linden said:

    You might be monitoring at the wrong level.  Try llcorehttp or wireshark/tcpdump as a second opinion.  The simulator will not keep an event-get request alive longer than 30s.  If you find it is, it is either retries masking the timeout

    Yup, you are right... Can see this with Wireshark. The retry is likely done at libcurl level... More race condition opportunities ! 😢

    Which only advocates for a return of reliable message events such as TeleportFinish to the ”reliable UDP” path provided by the viewer...

  16. 23 hours ago, Monty Linden said:

    [Viewer]  Request timeout fires.  Begins to commit to connection tear down.

    In fact, I could verify today that this scenario cannot happen at all in SL. I instrumented my viewer code with better DEBUG messages and a timer for event poll requests.

    Under normal conditions (no network issue, sim server running normally), event polls never time out in the agent region before an event comes in. Even in an empty sim, without any neighbour regions, the ParcelProperties message is always transmitted every 60 seconds (and for an agent region with neighbours within draw distance, you also get EnableSimulator for each neighbour every minute).

    Timeouts only occur for neighbour regions, when nothing happens in the latter, and after 293.8 seconds only.

    So, when a user requests a TP, the agent region will not risk seeing the poll request timing out just at the moment TeleportFinish arrives, causing a race condition in the HTTP connection tear down sequence, like you described.

    However, what would happen if, say, a ParcelProperties message (or any other event in the agent region) arrives milliseconds before the user triggers a TP request ?... The poll request N finishes with ParcelProperties, the TP request fires, and what if TeleportFinish is sent by the server just before the viewer can initiate poll request N+1 (reminder: llcore uses a thread for HTTP requests) ?... Maybe a race condition could happen here (depending on how events are queued server side, and how the Apache delays in connection building and tear down could lag/race things; this might explain why TeleportFinish is sometimes queued but never sent, maybe ?)...

    In any case, I would suggest reconsidering the way TeleportFinish is sent to viewers: what about restoring the old UDP reliable path for it ?... Or implementing a message for viewers to re-request it, when they did not get it ”in time”...

  17. 1 hour ago, Monty Linden said:

    There are several failure scenarios including one where the TeleportFinish message is queued but the simulator refuses to send it for reasons unknown yet.  A more elaborate scenario is this:

    • [Viewer]  Request timeout fires.  Begins to commit to connection tear down.
    • [Sim]  Outside of Viewer's 'light cone', queues TeleportFinish for delivery, commits to writing events to outstanding request.
    • [Sim]  Having prepared content for outstanding request, declares events delivered and clears queue.
    • [Apache]  Adds more delay and reordering between Sim and Viewer because that's what apache does.
    • [Viewer]  Having committed to tear down, abandons request dropping connection and any TeleportFinish event, re-issues request.
    • [Viewer]  Never sees TeleportFinish, stalls, then disconnects from Sim.

    Yes, this might indeed happen... I will have to try and log one such scenario (got nice DEBUG level messages for event polls and, now, server/viewer messaging)...

    The problem here is that we do not have a way for the viewer to acknowledge the server's TeleportFinish message... The latter used to be a UDP "reliable" message (with its own private handler), but got UDPDeprecated then UDPBlacklisted in favour of the event poll queue/processing... It was not the wisest move...

    A possible workaround would be to allow the viewer to (re)request TeleportFinish; in this case, a simple short (5 seconds or so) timeout could be implemented viewer side after a TP has started, and if TeleportFinish has not been received when it expires, then it would re-request it...

    EDIT: I'm also seeing a possible viewer-side workaround for such cases via the implementation of a ”teleport window” timer. That timer would be reset each time the viewer starts an event poll request for the agent region: when the user asks for a TP, the timer would be checked and if less than, say, 2 seconds are left before the timeout would fire (since it is set viewer side, at least in my code, for SL, it is known), the TP request would be delayed till the next poll is started...
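
    The re-request idea above amounts to a simple viewer-side watchdog, sketched below. Everything here is hypothetical (TeleportWatch, shouldRerequestFinish(), and of course the re-request message itself, which does not exist in the current protocol); the 5 seconds delay is the one suggested above.

```cpp
#include <cassert>

// Hypothetical viewer-side record of a teleport in progress.
struct TeleportWatch
{
    double started_at_s = 0.0;     // time at which the TP request was sent
    bool finish_received = false;  // set when TeleportFinish arrives
};

// Assumed delay before asking the server for TeleportFinish again.
constexpr double kFinishTimeoutS = 5.0;

// True when TeleportFinish is overdue and should be (re)requested.
inline bool shouldRerequestFinish(const TeleportWatch& tp, double now_s)
{
    assert(now_s >= tp.started_at_s);
    return !tp.finish_received &&
           (now_s - tp.started_at_s) >= kFinishTimeoutS;
}
```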

    • Like 1
  18. 1 hour ago, Monty Linden said:

    Well....  you don't really know that unless what you've received matches what was intended to be sent.  And it turns out, they don't necessarily match and retries are part of the problem.  I already know event-get is one mode of TP/RC failure. 

    All the TP failure modes I get happen before event polls are even started: the UDP message from the arrival sim just never gets to the viewer, so the latter is left in the dark and ends up timing out on the departure sim as well...

  19. 1 hour ago, Monty Linden said:

    This keeps getting worse the more I look.  So *both* viewer and simulator implement a 30-second timeout on these requests.  A guaranteed race condition.  Simulator's spans a different part of the request lifecycle than does the viewer's curl timeout.  More variability.

    Do have a look at the comments I added in linden/indra/newview/lleventpoll.cpp, in the Cool VL Viewer sources, for the various modifications I implemented to deal with both SL and OpenSim idiosyncrasies... In particular:

    	LLAppCoreHttp& app_core_http = gAppViewerp->getAppCoreHttp();
    	// NOTE: be sure to use this policy, or to set the timeout to what it used
    	// to be before changing it; using too large a viewer-side timeout would
    	// cause to receive bogus timeout responses from the server (especially in
    	// SL, where 502 replies may come in disguise of 499 or 500 HTTP errors)...
    	// HB
    	mHttpPolicy = app_core_http.getPolicy(LLAppCoreHttp::AP_LONG_POLL);
    	if (!gIsInSecondLife)
    	{
    		// In OpenSim, wait for the server to timeout on us (will report a 502
    		// error), while in SL, we now timeout viewer-side (in libcurl) before
    		// the server would send us a bogus HTTP error (502 error report HTML
    		// page disguised with a 499 or 500 error code in the header) on its
    		// own timeout... HB
    		mHttpOptions->setTransferTimeout(90);
    		mHttpOptions->setRetries(0);
    	}

    Yes, it is indeed as bad as it looks... This said, my modified code performs just fine in both SL and OpenSim now, and the failed TP issues still seen are not related to event polls anyway (event polls are simply retried on timeouts).

  20. 15 minutes ago, kyte Lanley said:

    Yes, but the problem is that I cannot edit my first post, because I do not have the edit option:

    https://imgur.com/a/0V2r3f8

    I only have the edit function on my last message.

    It does indeed seem that, after a certain time ((c)1955 Fernand Raynaud 😛), the edit option disappears... 😢

    Another suggestion: add a message to this thread with a short summary in English and the "updated" question.

  21. 2 hours ago, kyte Lanley said:

    Yet the table I posted in my first message clearly shows that the AMD 7000 generation cards have very good OpenGL performance.

    A table taken from a benchmark that has nothing to do with SL and that says nothing about the test conditions. For example, which mode were the drivers running in? Compatibility profile (i.e. a profile that supports deprecated/obsolete OpenGL commands) or core profile?...

    With NVIDIA drivers, we see, in the viewer, +50% to +100% performance (depending on the rendered scene) in core profile mode, unlike with AMD drivers, where performance is almost identical. So, if the test was run in compatibility profile, the NVIDIA results look worse than they could be in comparison with AMD...

    Another point is the use of shared GL profiles, from which, here again, NVIDIA benefits more than AMD; it is a matter of synchronizing the OpenGL command queue, which must be done in the main thread with AMD drivers, whereas it can happen in secondary threads with NVIDIA, avoiding "hiccups" in the frame rate.

    Moreover, the test refers to OpenGL v4.5, whereas the version that really matters is the latest one, i.e. v4.6; one may therefore wonder about the extent of the features tested, in particular in shaders...

    Besides, performance is only one aspect of things. Robustness (no crashes with NVIDIA drivers, where AMD literally falls flat on its face) and compliance with the OpenGL standard (*) are two others, which NVIDIA wins hands down.

    (*) The viewers' code contains bug workarounds for AMD drivers (and Intel ones, for that matter), which NVIDIA drivers do not need thanks to their strict compliance with the OpenGL specification.

     

    2 hours ago, kyte Lanley said:

    That is why I would like to hear from someone who owns a 7900 (XT or XTX), so that they can tell me how their card behaves in Second Life.

    That is why the link I chose in my previous message points to the testimony of a user who tried AMD, was disappointed, and finally returned the card to get an NVIDIA one, which did satisfy him...

     

    Note that I have nothing against AMD (my latest PC even uses a Ryzen 7900X, which is a great CPU that I am very happy with and can only warmly recommend). I simply rely on my past experience (admittedly old) and on the feedback from users of the viewers (mine, and the others), which match perfectly.

     

    2 hours ago, kyte Lanley said:

    So, a call to the owners of these cards: I am waiting for your testimonies.

    You would have a better chance of getting an answer by asking your question in English...
