Jump to content

animats

Resident
  • Posts

    6,143
  • Joined

  • Last visited

Posts posted by animats

  1. 3 hours ago, Monty Linden said:

    I want to do enough eventually to detect and log that your peer is doing something unexpected.  Might just be an 'event lost' flag or a sequence number scheme.  Not certain it will go as far as a digest of failure.  Open to suggestions.

    An "event lost" indicator, for both event poller events and UDP messages, would be a good start. That would distinguish overload-related situations (busy shopping event with many avatars entering and leaving) from other kinds of problems. If there are no missing events, and the fixes mentioned above have been made, but things still reach a "stuck" situation, then there's a well defined problem to look for.

  2. 1 hour ago, Monty Linden said:

    This is going to be fun.  One of the recent discoveries:  simulator only uses the supplied 'id' value for logging.  It has no functional effect in the current event queue handling scheme.

    I kind of suspected that, since my first implementation simply sent the string "TEST" instead of the proper LLSD poll request. It still worked.

    1 hour ago, Monty Linden said:

    This is going to require TPV testing against test regions given that things like 'id' will become semantically meaningful again.

    Willing to help there.

    36 minutes ago, Monty Linden said:

    Just to keep expectations set correctly:  this isn't the only TP bug.  It's just one of the most glaring.  We won't have 100% TP/RC success after this but it should be better.

    That's OK. Having a sound communications protocol underneath will reduce the noise level, and make it easier to find the remaining intermittent bugs.

    I'd like to plug again for the idea that, in the unusual event the protocol has to drop something due to overload, put something in the event stream that tells the viewer that data was lost. Then we can distinguish between overload (as in, we need more capacity somewhere for busy event regions, or something has to be done about the disco with all the material-changing floor tiles), and missing messages (as in, viewer has been waiting for EstablishAgentCommunication for 10 seconds now.). Something like that in the UDP message protocol, should a reliable message be lost, would also be helpful. Silently losing important messages made debugging difficult.

    This is real progress.

  3. 27 minutes ago, Henri Beauchamp said:

    The problem is that you do not get that when logged to SL: you get a 499 or 500 error header (and ”502 error” printed in body). Meaning, somehow, the 502 error gets mutated into another, and is then not recognized as such by the viewers.

    Right. I never see a legit 502 error from Second Life. I handle timeout as described above, which does seem to work. But I have more control of the protocol stack in Sharpview, since "curl" is not involved.

    So, if this is being overhauled, I'd suggest that sim server side send a 200 (OK) status with the no-events LLSD body given above. That avoids problems with error case handing in Apache, Curl, AWS, VPNs, firewalls, and other not entirely transparent middle boxes.

  4. 10 hours ago, Henri Beauchamp said:

    Don't do that: viewers would see those ugly ”502 in disguise” errors, which would be considered as poll request failures in the current viewers' code, and only retried a limited amount of times !

    With the current viewer code and in SL (*), the poll request timeout must occur on the viewer side (yes, even though it is ”transparently” retried on libcurl level: the important point is that the fake 502 error is not seen by the viewer code).

    If anything, increasing the server side timeout from 30s to 65s or so (so that a ”ParcelProperties” message would make it through before each request would timeout), would reduce the opportunities for race conditions.

    (*) For OpenSim-compatible viewers, a (true) 502 error test is added, which is considered a timeout and retried like for a viewer-side libcurl timeout, but this test is only performed while connected to OpenSim servers, which do not lie on 502 errors by disguising them as 499 or 500 ones in their header.

    That's the reverse of what I'm doing in Sharpview. I let the server side time out.

    Timeout server side is well defined - you get a complete HTTP transaction with some status. Timeout viewer side is a TCP connection abort. If that happens while the server is starting to send something, the server probably won't notice. Certainly not if it goes through Apache. So I wanted to time out server side, and use a 90 second timeout. As usual, I start out by doing what the documentation says to do, then look for problems.

    Current Sharpview error status handling for the event poller:

    • 200 - normal. Zero length "events" array in LLSD is OK. LLSD parsing errors or missing expected fields are logged. ID must be in sequence and errors are logged, but no action is taken.
    • 404 - end of file. Quit polling.
    • 502 - normal server side timeout per documentation. Poll again. (The Other Simulator sends this).
    • Other 5xx errors - log error, after 3 errors wait 2 secs to avoid excessive retry on server fails, poll again. Never give up as long as region is connected.
    • Connection close - log error, poll again. This seems to be the normal SL behavior on server side timeout.

    The viewer side HTTPS client is the Rust crate "ureq". "Curl" is not used.

    I'd suggest is that the server should time out first, and it should send an empty events array with a normal 200 status. that is,

    <llsd><map><key>events</key><array></array><key>id</key><integer>28</integer></map></llsd>

    That empty LLSD array

    <array></array>

    unambiguously says "everything is fine, no new events right now." So no troublesome HTTP error paths are taken in normal operation. Normal operation stays on the happy path. Existing viewers should accept such a no-events message. This is a once or twice a minute event, if that, so it's low overhead.

  5. 3 hours ago, bigmoe Whitfield said:

    this was the wrong move by epic and tons of people are looking at their options.  

    Not Epic. Epic sells Unreal Engine and Fortnite, not Unity.

    Unity's pricing policy change is causing great turmoil in game dev right now. Is SL's mobile viewer  using Unity?

    • Thanks 1
  6. Plan sounds good. Useful technique from robotics: if you have to put something on a queue and the queue is full, put in a message that says "some messages were dropped". Then, at the receiving end, you can tell the difference between a missing message because of overload, and a missing message because it wasn't sent at all. Those errors have completely different causes, and this tells you where to look.

    • Thanks 1
  7. From the Tilia terms of service:

    10.2. California Users

    If you have complaints with respect to any aspect of the money transmission activities conducted at this location, you may contact the California Department of Financial Protection and Innovation at its toll-free telephone number, 1-866-275-2677, by e-mail at consumer.services@dbo.ca.gov, or by mail at the Department of Financial Protection and Innovation, Consumer Services, 2101 Arena Boulevard, Sacramento, CA 95834.

    Tilia is a regulated financial service and is subject to state supervision. It's worth reporting this.

  8. 9 hours ago, Monty Linden said:

    Even that requires negotiation after going that far down the TP path.

    That's a good argument for making the event polling system reliable. Making it reliable won't break existing behavior. Trying to wrap recovery around lost events gets complicated very fast.

    Quote

    The event queue was very deliberately designed to implement event compaction/overwrite in place and, at some point in time, it was neither ordered (between events) nor reliable.  Someone disabled that UDP-like behavior at some point and I'm not certain why and if that has contract status at this point.  In fact, it is still unreliable now as we'll drop queue overruns on the floor.  They do not get regenerated unless a higher-level application protocol times out and causes a regeneration from a new query or other re-synchronizing action.

    Oh, joy. Most of the events the event poller deals with are not in the "unreliable/OK to miss" category.

    • Thanks 1
  9. 2 hours ago, Monty Linden said:

    This keeps getting worse the more I look.  So *both* viewer and simulator implement a 30-second timeout on these requests.  A guaranteed race condition.  Simulator's spans a different part of the request lifecycle than does the viewer's curl timeout.  More variability.  Simulator can send at least three different responses related to its timeout actions:

    • A 200 status with empty body and no content-type header
    • A 200 status with llsd content-type and a body of:  '<llsd><undef /></llsd>'
    • A 499 status with html content-type and a body I haven't looked at but it's about 400 bytes.

    Sharpview uses a 90 second timeout viewer side, so the simulator's server always times out first. Henri has chosen to time out first viewer side. Both will probably work. Same timeout at  both ends, though, seems iffy. I wonder if the long timeout I'm using is why I see skipped ID numbers from the Other Simulator, but not SL. I'll take that up with their devs.

    Sharpview processes the 200 status if LLSD is present, stops on a 404 status per documentation, and retries on any other status. I use the Rust crate "ureq" and make routine HTTP requests. I still have a bypass to deal with that 20 year old SL self-signed SSL root cert which uses an obsolete encryption system for which there is no Rust support. (SSL's ancient root cert expires in 2025; the clock is ticking.) All this seems to work.

    (How is the SL mobile dev team coming along with this? They're re-implementing, not re-using the C++ code, correct? So they get to face all the legacy issues.)

    (For less technical readers: SL does well at not crashing, as in "region down". Most of the remaining immersion-breaking problems, from teleport fails to clothing not rezzing, involve the simulator or viewer stalled, waiting for something the other side is never going to send, or that was previously sent and lost. Thus the need for this fussy attention to detail down at the communications protocol level.)

  10. 7 hours ago, Cinnamon Mistwood said:

    I just want something fun and DIFFERENT. 

    That's what mainland and private sims are for. There are well-run estates - the Chung properties, the Grove, etc., where everything looks good but there is more imagination.

    Bellessaria is, by design, rather banal. There's a market for that. A big one. If they get away from banal, as with the fantasy regions, nobody knows what to do. Curved walls! Help!

    • Like 7
  11. deepbumpcobblestonesdemo.jpg.3b5994650d3f436b465b6efcfb0d484d.jpg

    Cobblestone texture. DeepBump generated normals on left, no normals on right.

    It now looks like real, rough cobblestones. A bit rougher than I really wanted here, but I could turn down the "strength" in Blender if I wanted.

    Now I'm tempted to go around adding normals to everything I have that looks too flat.

    This is a machine learning system trained on good textures with normals. So it knows about common objects. Bricks, cobblestones, and clothing seams all work well. If you get too far from the training set, it may guess wrong, of course.

    • Like 3
  12. This is something new - DeepBump. It's a free Blender add-on from Hugo Tini in Switzerland which takes in images and generates plausible normals using machine learning. It's smart about this; it's doing this from training sets, not just converting color to depth. Works great on easy stuff like bricks. So I tried it on something harder, one of my NPCs.

    beccatexturetest.thumb.jpg.c4551e251a05a5102c1084d9cbec4484.jpg

    Left, with DeepBump-generated normals on the body. Right, no normals. Only the clothing has been processed, not the face and hands.

    Look at the cargo pockets on the pants and the sleeve seam at the shoulder. The clothing is better defined. DeepBump recognizes seams and wrinkles, and gives them depth.

    Don't need the model in Blender for this, just the texture. I exported the texture from SL, brought the texture into Blender, mapped it onto one face of a cube, applied DeepBump to that material, exported the resulting image to a .png file, uploaded it into SL, and applied it to the appropriate face of the model. Default settings, no manual tweaking.

    This could be a useful tool for retrofitting normals into older SL content. The main texture wasn't changed; there's just an additional normal texture. Normals cost zero LI, so it's useful for animesh models such as this one, where huge triangle count means huge land impact and you can't afford to model every seam.

    • Like 5
  13. I need a .bvh file which just has all the neutral positions for a bento skeleton. Pose stance would be fine. I just need to be able to reset the extra bento joints after a bento anim has terminated and I need to run a non-bento anim. My NPCs are slightly lopsided after they shift from running (bento) to some of the non bento standing anims.

    • Thanks 1
  14. It looks like there is a state machine inside, fighting to get out. Login, connection to neighbor, and teleports all follow a pattern. So here's an attempt to express this as a state machine, to understand the constraints on what can happen when.

    A region connection state machine (first draft)

    IDLE state - no connection

    DISCOVERY state - finding out or being told about a region to connect to

    • User makes login request to login server, or
    • Home region server sends EnableSimulator to viewer to display a nearby region, or
    • Home region server sends TeleportFinish to viewer. (I think)

    CONNECTING state - actually making the connection to the new region.

    • State entry requires knowing requires knowing IP, and port of new region.
    • Seed capability comes in via EstablishAgentCommunication
    • Set up UDP circuit.
    • Do RegionHandshake and reply.
    • Fetch capabilities.
    • Start event poller.

    CAMERA_AIMING state - agreeing on the viewpoint.

    • Viewer and server have to agree on where the camera is.
    • This is the messy part where, for login and teleport we need the ObjectUpdate for the avatar so the viewer calculate the camera location, but can't get it without sending a dummy location first in an AgentUpdate to start object update sending. This state can be skipped for neighbor connections, because the camera isn't moving.

    LIVE state - fully connected and operating normally.

    • AgentUpdates going out from the viewer, ObjectUpdates coming in, region live, user interacting and happy.
    • Enjoy SL.

    FAULT state - something went wrong.

    • Invalid message or timeout in DISCOVERY, CONNECTING, or CAMERA_AIMING state.
    • Log problem, maybe send report, including event sequence that got us here, to a server.
    • For login, go back to login page. For teleport, teleport has failed, try to continue from old location. For neighbor region connection, display a region down hole in the ground.

    Discussion

    This almost fits what's happening.  Almost. There are are two known ordering problems:

    • The seed capability comes in later than we'd like viewer side. It would be nice to have the seed capability before the first thing that needs an asset. That would be the terrain textures in the RegionHandshake. But RegionHandshake must precede EstablishAgentCommunication (at least for SL; the Other Simulator sends it earlier). Some object updates and terrain info come in before we have a seed capability. So asset requests have to held until the region transitions to LIVE state. Not too bad. 
    • Timing of RegionHandshakeReply is touchy. Deliberately delaying RegionHandshakeReply by 2 seconds seems to prevent login interest list problems. This suggests it should be sent later in the sequence. Comments on how that works server side would be helpful. 

    There's an underlying model to all this. I think. What have I missed?

    Can we fit region crossings into this model?

    • Thanks 1
  15. 49 minutes ago, Monty Linden said:

    And it's not necessarily ordered-reliable.  The event queue was very deliberately designed to implement event compaction/overwrite in place and, at some point in time, it was neither ordered (between events) nor reliable.  Someone disabled that UDP-like behavior at some point and I'm not certain why and if that has contract status at this point.  In fact, it is still unreliable now as we'll drop queue overruns on the floor.  They do not get regenerated unless a higher-level application protocol times out and causes a regeneration from a new query or other re-synchronizing action.

    Ouch. Need to think about the implications of that. With reliable UDP messages, at least you get a few retries. I'm aware that UDP message dropping under overload can occur on both sender and receiver side, and that those are usually repaired by retransmission. But if the event queue drops messages under overload, there's no similar backup.

    Some messages you never want to drop. If an EnableSimulator is dropped, the viewer has no idea that it hasn't been told about a region, and will never do anything to discover it. That makes me think of those rare situations where you reach the edge of a region, and the other region isn't showing, even though it's up. That was much more common before AWS uplift.

    I haven't looked at the side of communications that involves group messages and such. Beq Janus is the expert on that and might have something to say.

    49 minutes ago, Monty Linden said:

    It should have been treated more like ETag.

    As in RSS? Right, there it's interpreted as "I've seen and processed everything through #117, send me anything from #118 and later." If you lost the TCP poll connection with data in transit, there might be loss. The sender side has to be be careful about the write completion and close status on the TCP connection for this to be airtight. The TCP protocol knows on the write side if there was a clean close with all data delivered, but whether that info makes it through the socket interface is questionable. Or whether that even still works. There's a good chance that info does not make it all the way up through the HTTP and HTTPS protocol stacks, because web servers don't care if they delivered the entire web page. So you probably do need the ability to go back one ID number.

    Incidentally, on a timed out event poll, SL does not send the documented 502 status; it just closes the connection. (The Other Simulator does send a 502). Timeout occurs around 58 seconds on both SL and the Other Simulator.

    Since asset sending over UDP is finally dead and removed, UDP traffic isn't that heavy. It's probably appropriate to go more for reliability and worry less about overload.

  16. 19 minutes ago, Monty Linden said:

    And as always, never mistake anecdotes for a contract.  You might see the same pattern ten times in a row but that doesn't necessarily mean there's a temporal constraint.

    Exactly. Situations like that probably underlie some fraction of login, teleport, and region crossing failures. When you see those viewer-side, they take the form of waiting for something that didn't show up. "Why" is usually a mystery, although with suitable logging, you can see where things went off the "happy path" and got stuck.

    I'm glad this is getting attention.

    20 minutes ago, Monty Linden said:

    Once you send the UCC into the neighbor, there will be a RegionHandshake in your future.  You may be able to delay the RHR (I mean, it's async anyway so pretend you're on Mars).  What I'd meant before was that after sending UCC (and probably better after RH receipt), send AgentUpdates into the Host region. 

    Ah. AgentUpdate messages are already being regularly sent to the host region simulator at this point, since that region is already live and the agent is present there. No problem there.

    What starts object updates from the neighbor region? Sending it an AgentUpdate? If so, that's good, because that can be put off until  the region handshake and capability fetch exchanges are out of the way. (The chicken-and-egg problem with the first AgentUpdate seen at login and teleport doesn't apply here; we know where the camera is when connecting to a neighbor region.)

    Quote

    You may not need quite as much deferred logic as hinted at.

    Might, might not. What else server side sends an asset UUID to the viewer and wants an asset fetch started before we know the seed capability? Land textures have UUIDs, for example. And they seem to come in before object updates start. I probably should bite the bullet here and be prepared for arbitrary delays in getting seed capabilities and the caps obtained from them. I've seen Firestorm logs with errors in capability fetches, so this isn't a theoretical risk.

    All of this is much easier if it goes 1) connect up new region and get all info needed for communication, 2) tell it to start sending content. When those overlap, as they do at login and teleport, it's harder.

    1 hour ago, Monty Linden said:

    And yet the viewer has the ability to tear down this entity via arguments to 'EventQueueGet' requests.

    Those are used for something? For a while, I was sending the string "TEST" instead of the proper LLSD message with ID number, and it still worked. Sequentially numbered events showed up. (SL has consistent event ascending numbers from 1; the Other Simulator will sometimes skip a number, and starts from some arbitrary large integer.)

    The event queue has ordered reliable delivery, which is useful where order matters.

    • Thanks 1
  17. Back to connecting to a region.

    OK, so I have to do more before I get to find out the seed capability. The headache is that if the viewer sends an AgentUpdate and a RegionHandshakeReply to the new  region, it should start getting ObjectUpdate messages. But it can't start processing those fully until it has a seed capability and the neighbor region's capabilities, because it can't start the asset fetches without those. So those have to be deferred until the capabilities are fetched. I wasn't expecting that constraint, and it's a pain, involving a defer queue.

    Can I avoid getting object updates from the new region before I've received the seed capability? If I have to handle the hard case above, so be it. But do I have to?

    The way the Other Simulator does this is much cleaner. You do all the communication with the current region before you start talking to the neighbor region. There, you get EstablishAgentCommunication from the current region before doing anything with the neighbor region.

    (Why does EstablishAgentCommunication even exist, anyway? All it contains is the IP address and port, which was already delivered with EnableSimulator, and the seed capability. Just putting the seed capability in EnableSimulator would have been much simpler. Sending them at different points in the sequence hints that it sent to prevent the viewer from doing something too soon. If so, what? Starting the EventQueueGet poller too early?)

    • Like 1
  18. 25 minutes ago, Zalificent Corvinus said:

    thus making say wet meat on a market stall look like an red-anodised aluminium sculpture of meat

    meat1.thumb.jpg.dc9f8843a1e18ce1524754bd8d5b705a.jpg

    Wet meat in market stall, Port Babbage. Sharpview.

    fsmeat1.thumb.jpg.0377e2b7667111b088d56b4f88013341.jpg

    Wet meat in market stall, Port Babbage. Firestorm.

    • Like 2
    • Thanks 1
  19. 3 hours ago, Zalificent Corvinus said:

    Texture Thrashing" ... JPEG2000

    The main problem with JPEG 2000 is that the decode process is really slow. (JPEG decoders are either slow and buggy, or expensive to license.) And the way to read part of the file and get a lower rez version is badly designed. But that has nothing to do with VRAM utilization or texture thrashing. Those are separate problems.

    LL started to address texture loading in "Project Interesting" some years back, and there's an abandoned project viewer where you can see the beginnings of that.

    Here's my Sharpview tech demo of that problem, solved. It's a speed run through Port Babbage. There isn't enough VRAM to put the whole region in the GPU. Textures are being frantically loaded and unloaded by background threads. If you look closely, you'll see that, especially where the camera approaches and enters "Savory Street Market". Note the sign being blurry, then snapping into focus. Some of the vendors in the stores are blurry at first, then load at higher resolution. It's not perfect, but it's pretty good. (If that transition was a 3/4 second dissolve, like it is in GTA V, you probably wouldn't notice at all. Blink and you'll miss it.)

    There's no reason the mainstream SL viewers should be stalling for 30 seconds to get some textures upgraded to the needed resolution. That's just a bug.

    Anyway, no, it's not JPEG 2000. And it's unrelated to ray tracing.

    The rendering feature I think SL needs is subsurface scattering for skin. This is the effect where light hits skin, bounces around inside, and a bit of red light comes back out.

    14fig10.jpg

    Light goes into skin, comes out somewhere else with a color change. And we humans react "It's alive."

    Humans are hard-wired to detect this as "alive". Without that, you're stuck where we are now, in the uncanny valley, with a range of choices between "dead" and "plastic". The theory of this is complicated, but it becomes one shader and one more material layer for skins. This layer should have been in the glTF material standard. (It's an extension.) It's in Blender and Unreal Engine.

    There's realism, and there's "ooh, shiny thing". Both have their uses. People love shiny thing features. Go visit "Materials 1" on the beta grid. So much shiny. Very soon, even shinier, with mirrors! If you saw the '70s disco in test at Rumpus Room 1, you saw how far this can go. Ray tracing is really good for shiny things. But you get a lot more bang for the GPU buck with good normal maps.

    Enough rendering for one day.

    • Like 2
    • Thanks 1
    • Haha 1
  20. I'm impressed with how well Linden Lab has been able to make the society work.

    Everybody else, from Meta Horizon to Roblox, has an army of low-wage moderators armed with ban hammers. The Second Life property rights system gives users enough authority to deal with minor problems. There are few ways to annoy people at long range. So SL Governance is a small operation that handles only major problems at a slow pace.

    This is a remarkable and under-rated achievement.

     

    • Like 11
    • Thanks 2
  21. I want to set up some more NPCs. I have a good scripting system for that, but need more models.

    I'm looking for generic background characters useful for making a place look busy. Like this.

    becca.thumb.png.b02c286c0065ce539738a5051858a280.png

    Generic child NPC. Suitable as a background character for fantasy roleplay sims, etc. 22 LI.

    • Generic human background street characters.
    • At least as good a model as the above.
    • Less than 30 LI dressed.
    • Mod or full perm. No scripts needed; I'll do that.
    • Usable LODs, so they don't disappear at distance.
    • Fully rigged and ready to animate. Prefer Bento.
    • Not ripped off from some Disney/Marvel franchise, Game of Thrones, etc. (This excludes a sizable fraction of what's on Marketplace.)
    • Thanks 1
×
×
  • Create New...