SOLVED: Convert HTML from llHTTPRequest into plain text? (HTML and XML coded characters converted to Unicode)

AndreRush · February 12, 2022

Edit: Solved below!

Hey y'all, first time poster, short time lurker!

I'm currently using llHTTPRequest to fetch profile information from https://world.secondlife.com, e.g. https://world.secondlife.com/resident/3b1b8cdf-6768-44e7-88a7-669f88d8a864

However, what is being returned uses HTML character codes for punctuation.

E.g.

I'm new, god help me

is returned as

I&#39;m new, god help me

I understand that it's just fetching the raw HTML. But is there a way to either:

Convert that into normal text?
Tell it to fetch the normal text in the first place?

Thanks in advance!

Edited February 13, 2022 by AndreRush
Updated with solution

Ron Khondji · February 12, 2022

You need to llUnescapeURL() the returned text I think.

Jenna Huntsman · February 12, 2022

17 minutes ago, Ron Khondji said:

You need to llUnescapeURL() the returned text I think.

That doesn't work because the response is using HTML coded characters, not URL escaped characters.

Example:

Unicode: I'm new, god help me

HTML coded: I&#39;m new, god help me

Escaped: I%26%2339%3Bm%20new%2C%20god%20help%20me

Your best bet might be to find another API to request info from a user's profile, or you could either try to find a function which can replace coded characters with their unicode equivalents, or write one yourself.

Edited February 12, 2022 by Jenna Huntsman
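
To see why llUnescapeURL() leaves the text alone, a minimal sketch can run both URL functions on the sample string. This is only an illustration, not part of anyone's solution; the escaped form it should print is the one quoted in the post above.

```lsl
default
{
    state_entry()
    {
        string coded = "I&#39;m new, god help me";

        // URL escaping is a different scheme: it percent-encodes characters,
        // so the "&", "#" and ";" of the HTML code each become %XX sequences.
        llOwnerSay(llEscapeURL(coded));

        // The HTML-coded text contains no %XX sequences, so unescaping it
        // has nothing to convert and the "&#39;" is still there afterwards.
        llOwnerSay(llUnescapeURL(coded));
    }
}
```

The second line of output still contains "&#39;", which is why a separate conversion step is needed.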

AndreRush · February 12, 2022

I went through llUnescapeURL() earlier hoping/praying, but no dice.
Are you aware of any other ways to access profile text? From what I saw, there's no direct access in world, but feel free to tell me there is llGetProfileTextAlsoAndreHowDidYouMissThat haha

Xiija · February 12, 2022

there is a nice replace function ... Combined_Library#Replace

here is a small example ..

string strReplace(string str, string search, string replace) {
    return llDumpList2String(llParseStringKeepNulls((str = "") + str, [search], []), replace);
}
default
{
    state_entry()
    {       
    }
    touch_start(integer total_number)
    {  string old =  "I&#39;m new, god help me";     
       string redo =  strReplace(old, "&#39;", "'");
       llOwnerSay(" fixed: " + redo);
    }
}

KT Kingsley · February 12, 2022

Would I be right in thinking that llChar returns the character specified in the numerical part of the HTML code? If so this could be used in a relatively simple conversion routine that doesn't require its own conversion table.

Edited February 12, 2022 by KT Kingsley
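
As a quick check of that idea, here is a minimal sketch that feeds the number from "&#39;" straight into llChar. The sample sentence is the one from the first post; nothing else is assumed.

```lsl
default
{
    state_entry()
    {
        // 39 is the decimal number inside "&#39;".
        // llChar turns a Unicode code point into a one-character string.
        llOwnerSay("I" + llChar(39) + "m new, god help me");
    }
}
```

This should say "I'm new, god help me", so a conversion routine only has to find the number between "&#" and ";" and pass it to llChar.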

Jenna Huntsman · February 12, 2022

7 hours ago, KT Kingsley said:

Would I be right in thinking that llChar returns the character specified in the numerical part of the HTML code? If so this could be used in a relatively simple conversion routine that doesn't require its own conversion table.

You are right!

Here's a function which replaces the HTML coded characters with unicode characters:

string HTMLreplaceCodedChars(string s_rawHTML)
{
    //Replace coded HTML characters with Unicode equivalents. Credit: Jenna Huntsman, KT Kingsley, Quistess Alpha, Haravikk Mistral
    //Uses code from http://wiki.secondlife.com/wiki/Combined_Library#Replace
    while(llSubStringIndex(s_rawHTML,"&#") != -1) //Loop until no more encoded characters exist in the string.
    {
        integer i_hit = llSubStringIndex(s_rawHTML,"&#"); //Find coded character start.
        integer i_charID = (integer)llGetSubString(s_rawHTML,i_hit+2,i_hit+5); //Find the character number to convert to unicode character.
        if(llGetSubString(s_rawHTML,i_hit+5,i_hit+5) == ";") //If it's a 3 value coded character
        {
            s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+5)], []), llChar(i_charID)); //Replace coded character with unicode equiv.
        }
        else //Nope, it's a 2 value character.
        {
            s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+4)], []), llChar(i_charID)); //Replace coded character with unicode equiv.
        }
    }
    return s_rawHTML; //pass the new string back
}

Edited February 12, 2022 by Jenna Huntsman
Added Quistess' suggestion

Quistess Alpha · February 12, 2022

3 hours ago, Jenna Huntsman said:

Here's a function which replaces the HTML coded characters with unicode characters:

Nice! The only thing I'm confused about is why you duplicate the string at the beginning. LSL allows you to modify arguments to a function directly, which saves memory.

I.E.

// changed variable name : (or search-replace in the main body of the function s_Unicode->s_rawHtml)
string HTMLreplaceCodedChars(string s_Unicode)
{	// remove this line:
    //string s_Unicode = s_rawHTML; //New string to store the unicode version in.
  	//everything else the same
    while(llSubStringIndex(s_Unicode,"&#") != -1) //Loop until no more encoded characters exist in the string.
    { // . . .

Jenna Huntsman · February 12, 2022

1 hour ago, Quistess Alpha said:
Nice! The only thing I'm confused about is why you duplicate the string at the beginning. LSL allows you to modify arguments to a function directly, which saves memory.

I.E.
// changed variable name : (or search-replace in the main body of the function s_Unicode->s_rawHtml)
string HTMLreplaceCodedChars(string s_Unicode)
{	// remove this line:
    //string s_Unicode = s_rawHTML; //New string to store the unicode version in.
  	//everything else the same
    while(llSubStringIndex(s_Unicode,"&#") != -1) //Loop until no more encoded characters exist in the string.
    { // . . .

Oops!

I originally put that in there so I could return an error code should something go wrong, but changed my approach while I was writing it which made an error code redundant, then forgot about that value.

Lucia Nightfire · February 13, 2022

There are also 2000+ named entities with unicode values.

LSL simply needs encode() and decode() equivalent support with HTML version specification.

AndreRush · February 13, 2022

Thanks so much gang

I added this in this afternoon, and it caught most of the problematic characters. Thanks especially Jenna Huntsman for handling the initial heavy lifting!

The only thing I added was handling for XML characters mentioned here: https://en.wikipedia.org/wiki/Character_encodings_in_HTML#XML_character_references
It seems like all the characters that were being encoded are ones that might have an impact on HTML/XML.
Other than that, 𝓾𝓷𝓬𝓸𝓶𝓶𝓸𝓷 𝓬𝓱𝓪𝓻𝓪𝓬𝓽𝓮𝓻𝓼 𝓲𝓷 𝓹𝓻𝓸𝓯𝓲𝓵𝓮𝓼 𝓬𝓸𝓶𝓮 𝓽𝓱𝓻𝓸𝓾𝓰𝓱 𝓳𝓾𝓼𝓽 𝓯𝓲𝓷𝓮 ♥●•٠·˙

There's probably a better way to structure them, but here's my edits anyhow

string HTMLreplaceCodedChars(string s_rawHTML)
{
    //Replace coded HTML/XML characters with Unicode equivalents. Credit: Jenna Huntsman, KT Kingsley, Quistess Alpha, Haravikk Mistral, AndreRush
    //Uses code from http://wiki.secondlife.com/wiki/Combined_Library#Replace
    while(llSubStringIndex(s_rawHTML,"&#") != -1) //Loop until no more encoded characters exist in the string.
    {
        integer i_hit = llSubStringIndex(s_rawHTML,"&#"); //Find coded character start.
        integer i_charID = (integer)llGetSubString(s_rawHTML,i_hit+2,i_hit+5); //Find the character number to convert to unicode character.
        if(llGetSubString(s_rawHTML,i_hit+5,i_hit+5) == ";") //If it's a 3 value coded character
        {
            s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+5)], []), llChar(i_charID)); //Replace coded character with unicode equiv.
        }
        else //Nope, it's a 2 value character.
        {
            s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+4)], []), llChar(i_charID)); //Replace coded character with unicode equiv.
        }
    }
    //Below handles various XML characters that are encoded differently. Since we know exactly what character it needs to be,
    //we can just replace it without needing to look it up. 
    while (llSubStringIndex(s_rawHTML,"&quot;") != -1)
    {
        integer i_hit = llSubStringIndex(s_rawHTML,"&quot;");
        string i_charID = "\"";
        s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+5)], []), i_charID);
    }
    while (llSubStringIndex(s_rawHTML,"&amp;") != -1)
    {
        integer i_hit = llSubStringIndex(s_rawHTML,"&amp;");
        string i_charID = "&";
        s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+4)], []), i_charID);
    }
    while (llSubStringIndex(s_rawHTML,"&gt;") != -1)
    {
        integer i_hit = llSubStringIndex(s_rawHTML,"&gt;");
        string i_charID = ">";
        s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+3)], []), i_charID);
    }
    while (llSubStringIndex(s_rawHTML,"&lt;") != -1)
    {
        integer i_hit = llSubStringIndex(s_rawHTML,"&lt;");
        string i_charID = "<";
        s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+3)], []), i_charID);
    }
    while (llSubStringIndex(s_rawHTML,"&apos;") != -1)
    {
        integer i_hit = llSubStringIndex(s_rawHTML,"&apos;");
        string i_charID = "'";
        s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+5)], []), i_charID);
    }
    return s_rawHTML; //pass the new string back
}

Edited February 13, 2022 by AndreRush

Xiija · February 13, 2022

@AndreRush

some code that may be useful to someone... just need to add to the replacer list to cover all characters

( no error handling for multiple instances etc )

// https://www.rapidtables.com/web/html/html-codes.html

string replacer(string raw)
{  
    integer begin = llSubStringIndex(raw,"&");
    integer end   = llSubStringIndex(raw,";");
    string  chars = llGetSubString(raw, begin, end);
    string  new;
   
    integer index = llListFindList(xml, [chars]);
    if (index != -1)
    { new = llList2String( xml, index + 1);         
    }
    else
    { llOwnerSay("newp: "); 
    }
    return llDumpList2String(llParseStringKeepNulls(  raw , [chars], []), new);  
}

list xml = [

    // xml codes
    "&amp;","&",
    "&lt;","<",
    "&gt;",">",
    "&quot;","\"",
    "&apos;","'" ,
      
    // html codes
    "&#38;", "&",
    "&#39;", "'" 
];

default
{
    state_entry()
    {
    }
    touch_start(integer total_number)
    {  
        string old =  "I&#39;m new, god help me";  
        string old2 = "hello &amp; good morning";
      
        string returned = replacer( old );
        llOwnerSay( "Data: \n\n" + returned);
      
        string returned2 = replacer( old2 );
        llOwnerSay( "Data: \n\n" + returned2 );
    }
}

Edited February 13, 2022 by Xiija

Jenna Huntsman · February 13, 2022

9 hours ago, AndreRush said:

There's probably a better way to structure them, but here's my edits anyhow

You can actually shorten that up by quite a bit, like this:

string HTMLreplaceCodedChars(string s_rawHTML)
{
    //Replace coded HTML characters with Unicode equivalents. Credit: Jenna Huntsman, KT Kingsley, Quistess Alpha, Haravikk Mistral, AndreRush
    //Uses code from http://wiki.secondlife.com/wiki/Combined_Library#Replace
    while(llSubStringIndex(s_rawHTML,"&#") != -1) //Loop until no more encoded characters exist in the string.
    {
        integer i_hit = llSubStringIndex(s_rawHTML,"&#"); //Find coded character start.
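        integer i_charID = (integer)llGetSubString(s_rawHTML,i_hit+2,i_hit+8); //Find the character number to convert to unicode character. Reads up to 7 characters so longer codes are not cut short; the (integer) cast stops at the first non-digit.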
        s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,(i_hit + 2 + llStringLength((string)i_charID)))], []), llChar(i_charID)); //Replace coded character with unicode equiv.
    }

    //Replace XML characters with Unicode equivalents.
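    list l_XMLchars = ["lt","<","gt",">","quot","\"","apos","'","amp","&"]; //"amp" goes last so that text such as "&amp;lt;" ends up as the literal "&lt;" instead of being decoded twice into "<".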
    integer i;
    for(; i < llGetListLength(l_XMLchars); i = i+2)
    { //iterate through the list to find each of the characters present in the raw string.
        string s_CurChar = "&" + llList2String(l_XMLchars,i) + ";"; //create the full character to look for.
        while(llSubStringIndex(s_rawHTML,s_CurChar) != -1)
        { //Loop until all instances of the current character have been replaced.
            integer i_hit = llSubStringIndex(s_rawHTML,s_CurChar);
            s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+(llStringLength(s_CurChar)-1))], []), llList2String(l_XMLchars, i+1));
        }
    }
    return s_rawHTML; //pass the new string back
}

Edited February 13, 2022 by Jenna Huntsman
More opts!
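
The numeric loop above assumes decimal codes with no leading zeros, so forms such as "&#039;" or the hexadecimal "&#x27;" would not convert cleanly. Below is a rough sketch of one way to handle those as well. The function name DecodeNumericRefs and the 8-character limit are made up for illustration, and it assumes the (integer) cast reads a zero-padded string as decimal and a "0x" prefix as hexadecimal.

```lsl
// Hypothetical helper (not from the thread): decodes &#NNN; and &#xHHH; codes.
string DecodeNumericRefs(string s)
{
    string out = "";
    integer hit = llSubStringIndex(s, "&#");
    while (hit != -1)
    {
        // Move the text before the code into the output.
        if (hit > 0) out += llGetSubString(s, 0, hit - 1);
        s = llDeleteSubString(s, 0, hit + 1); // drops that text plus the "&#"

        integer semi = llSubStringIndex(s, ";");
        integer code = 0;
        if (semi > 0 && semi <= 8) // 8 is an assumed upper bound on code length
        {
            string body = llGetSubString(s, 0, semi - 1);
            if (llToLower(llGetSubString(body, 0, 0)) == "x")
                code = (integer)("0x" + llDeleteSubString(body, 0, 0)); // hex form
            else
                code = (integer)body; // decimal form
        }

        if (code > 0)
        {
            out += llChar(code);
            s = llDeleteSubString(s, 0, semi); // consume the digits and the ";"
        }
        else
        {
            out += "&#"; // not a code this sketch understands; keep it as text
        }
        hit = llSubStringIndex(s, "&#");
    }
    return out + s;
}

default
{
    touch_start(integer total_number)
    {
        llOwnerSay(DecodeNumericRefs("I&#39;m new, god help me")); // I'm new, god help me
    }
}
```

Because the remaining string is consumed from the front, each code is looked at once and an unrecognised "&#" cannot cause an endless loop. The cast is lenient, so something like "&#12abc;" would still be read as 12; stricter validation is left out to keep the sketch short.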
