
Character set for "Notation" LLSD.


You are about to reply to a thread that has been inactive for 285 days.

Please take a moment to consider if this thread is worth bumping.


An obscure question which becomes more relevant as glTF and puppetry, both of which use LLSD "Notation" format, move forward: What's the string representation inside LLSD?

LLSD, SL's internal serialization format, comes in three flavors: "Binary", which is well defined but not human readable; "XML", which is readable but bulky; and something called "Notation", which is something like JSON. The new Materials system sends glTF encapsulated inside Notation encapsulated inside XML LLSD to the viewer. (Yes, really.)

The spec for LLSD is here: https://wiki.secondlife.com/wiki/LLSD

The LL implementation in C++ is here: https://github.com/secondlife/viewer/blob/ec4135da63a3f3877222fba4ecb59b15650371fe/indra/llcommon/llsdserialize.cpp#L789

The trouble spot involves storing "binary" data inside Notation LLSD in the format described as

'b(' str(size) ')"' raw_data '"' | 'b' base '"' encoded_data '"'

What that says is: read the number NNN inside b(NNN), then read that many bytes (or characters, the spec doesn't say which) and interpret them as raw binary data. The grammar's other binary forms use hex or Base64 encoding, but this raw counted form is also supported. It has problems.
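To make the ambiguity concrete, here is a minimal sketch in Rust of parsing the counted b(NNN)"..." form, reading the count strictly as a number of bytes. This is not the viewer's code, just one possible reading of the grammar above:

```rust
// Sketch (not the LL implementation): parse the b(NNN)"raw bytes" form
// from a Notation LLSD byte stream, treating NNN as a byte count.
// Returns the payload and the number of input bytes consumed.
fn parse_counted_binary(input: &[u8]) -> Option<(Vec<u8>, usize)> {
    // Expect: b ( digits ) " <NNN raw bytes> "
    if input.first() != Some(&b'b') || input.get(1) != Some(&b'(') {
        return None;
    }
    let close = input.iter().position(|&c| c == b')')?;
    let size: usize = std::str::from_utf8(&input[2..close]).ok()?.parse().ok()?;
    // Opening quote, then exactly `size` raw bytes, then closing quote.
    if input.get(close + 1) != Some(&b'"') {
        return None;
    }
    let start = close + 2;
    let end = start + size;
    if input.get(end) != Some(&b'"') {
        return None;
    }
    Some((input[start..end].to_vec(), end + 1))
}
```

Note that the payload is returned as bytes, not as a string: nothing in the grammar says what encoding, if any, those bytes are in.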

There's an example in the wiki:

b(158)"default
{
    state_entry()
    {
        llSay(0, "Hello, Avatar!");
    }

    touch_start(integer total_number)
    {
        llSay(0, "Touched.");
    }
}",

Now this is kind of weird, because it is string data in a string format represented as binary. So the encoding (ASCII, UTF-8, or UTF-16) matters. The C++ code just reads 158 bytes and treats that as a string. For ASCII, this works fine. It's not clear what will happen for non-ASCII characters. The character count and the byte count will not agree when multi-byte characters are present. Parts of SL code use UTF-16, which was a thing 20 years ago.
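The divergence between byte count and character count is easy to demonstrate. A small helper (my own, for illustration) returns both counts for a string:

```rust
// Byte count vs character count: they diverge as soon as the payload
// contains multi-byte UTF-8, so "read NNN of them" is ambiguous.
fn counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count()) // (UTF-8 bytes, Unicode scalar values)
}
```

For `"Hello, Avatar!"` both counts are 14, so an ASCII-only parser never notices the difference. Append a single emoji and the byte count pulls ahead of the character count, and a parser that reads "NNN characters" no longer agrees with one that reads "NNN bytes".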

I suspect that something will break if you try to put an emoji in a script.

There are some relevant notes in the wiki:

 

Quote

 

Questions & Things To Do

Would Binary be more convenient as unsigned char* buffer semantics?

Should Binary be convertible to/from String, and if so how?

  • as UTF8 encoded strings (making not like UUID<->String)
  • as Base64 or Base96 encoded (making like UUID<->String)

Conversions to std::string and LLUUID do not result in easy assignment to std::string, LLString or LLUUID due to non-unique conversion paths.

 

and a comment by Oz Linden that was removed in 2015:

Quote

 

== Notation versus JSON ==

The plan is to eventually move to JSON, after its spec in ECMAScript 5th edition is finalized. There are three reasons this is not a current priority:

# The binary and XML serialization formats work for our use cases, so there's no driving need for JSON.
# The notation format has not been useful in our experience and we expect JSON to fill a similar use niche.
# JSON was not in wide, common use when LLSD was invented.

So this is clearly a known headache.

I ran into this implementing Sharpview, where I have to be bug-compatible with the Linden code. Because this is an area where new LL code is being written, it's worth clearing up this ambiguity.

   

LLSD has several good ways of representing pure binary data - binary LLSD, which is efficient, and base 64 in XML and Notation LLSD, which is unambiguous. It also has good ways of representing strings. Storing strings as binary data in counted LLSD notation format, though, is something likely to break for some characters. I'd suggest deprecating that, and if it can't go away (it may be used for uploading scripts) document the encoding. Thanks.
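As a sketch of the unambiguous alternative: emitting the same payload in the hex form instead of the counted raw form. The b16 prefix here is my reading of the `'b' base '"' encoded_data '"'` grammar quoted earlier, not something I've verified against the viewer:

```rust
// Sketch: encode arbitrary bytes in the base-prefixed Notation form
// (assumed here to be b16"..." for hex, per the grammar's 'b' base
// alternative). Hex doubles the size but survives any byte value,
// including quotes and multi-byte UTF-8 sequences.
fn to_notation_hex(data: &[u8]) -> String {
    let hex: String = data.iter().map(|b| format!("{:02x}", b)).collect();
    format!("b16\"{}\"", hex)
}
```

Unlike the counted raw form, the parser for this never has to decide whether a count means bytes or characters; the payload is pure ASCII by construction.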

 

 

     
     
     
   

 

Edited by animats

I'm not very surprised at all.

For example, LSL is not so great at being consistent with allowable characters / when to require escaping. 

Just yesterday, I discovered that any quoted "[" or "]" in a JSON_ARRAY breaks the ability of the LSL JSON functions to parse the JSON.  Not "{" or "}", just "[" and "]". The fix is to escape the string if it contains a "problem" character.

It makes me wonder if some of these issues are due to bad coding in the LL code itself, or in the "hopefully standard" libraries they used.

While not emojis, I used Extended UTF characters in scripts years ago as "data markers" without issue (different range than emojis) but haven't done that since.


Filed JIRA.

I'm writing code in Rust, which makes a strong distinction between strings, which must be valid UTF-8 and can represent all of Unicode, and arrays of bytes, which have no structure and print as an array of numbers. Notation LLSD mixes the two concepts, and Rust code that conflates them won't even compile.
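In Rust the conversion between the two is explicit and fallible, which is exactly where the Notation ambiguity surfaces. A small illustrative helper (my own, not from Sharpview):

```rust
// Rust keeps the two concepts apart at the type level: a Vec<u8> can
// hold any bytes, but a String must be valid UTF-8, and the conversion
// between them is checked rather than implicit.
fn try_as_text(raw: Vec<u8>) -> Result<String, Vec<u8>> {
    // String::from_utf8 validates; on failure, hand the bytes back.
    String::from_utf8(raw).map_err(|e| e.into_bytes())
}
```

A Notation parser written this way has to decide, for every b(NNN) payload, whether to attempt this conversion or keep the bytes raw; the spec gives no guidance either way.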


15 hours ago, animats said:

and a comment by Oz Linden that was removed in 2015:

Quote

 

== Notation versus JSON ==

The plan is to eventually move to JSON, after its spec in ECMAScript 5th edition is finalized. There are three reasons this is not a current priority:

# The binary and XML serialization formats work for our use cases, so there's no driving need for JSON.
# The notation format has not been useful in our experience and we expect JSON to fill a similar use niche.
# JSON was not in wide, common use when LLSD was invented.

That's too bad; they COULD use JSON with suitable encoding, at the cost of size.


27 minutes ago, Love Zhaoying said:

That's too bad, they COULD use JSON with suitable encoding, at the cost of size.

LL is using JSON-encoded glTF, for describing materials, and soon, meshes.

There's an incredible amount of encapsulation of one format in another format. SL meshes have XML LLSD, binary LLSD, and their very own mesh format. Plus, some inner parts are compressed with ZIP compression. This all works out OK and doesn't take all that much code. But once "Notation LLSD" gets involved, things tend to go downhill. Notation LLSD is one of those things which is supposed to be simple, not too formal, and human friendly, and turns out to be none of those.

