
SOLVED: Convert HTML from llHTTPRequest into plain text? (HTML and XML coded characters converted to Unicode)


AndreRush


Edit: Solved below! 

Hey y'all, first time poster, short time lurker! 

I'm currently using llHTTPRequest to fetch profile information from https://world.secondlife.com, eg https://world.secondlife.com/resident/3b1b8cdf-6768-44e7-88a7-669f88d8a864

However, what is being returned is using HTML character codes for punctuation.

Eg 

I'm new, god help me

is returned as

I&#39;m new, god help me

I understand that it's just fetching the raw HTML. But is there a way to either.... 

  • Convert that into normal text?
  • Tell it to fetch the normal text in the first place? 

Thanks in advance! :D

Edited by AndreRush
Updated with solution

17 minutes ago, Ron Khondji said:

You need to llUnescapeURL() the returned text I think. 

That doesn't work because the response is using HTML coded characters, not URL escaped characters.

Example:

Unicode: I'm new, god help me

HTML coded: I&#39;m new, god help me

Escaped: I%26%2339%3Bm%20new%2C%20god%20help%20me

 

Your best bet might be to find another API to request info from a user's profile, or you could either try to find a function which can replace coded characters with their unicode equivalents, or write one yourself.
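To see the difference for yourself, here is a minimal sketch (the variable name `coded` is just for illustration). llUnescapeURL only decodes %XX sequences, so the HTML coded string should come back unchanged, while llEscapeURL produces the percent-escaped form shown above.

```lsl
default
{
    state_entry()
    {
        string coded = "I&#39;m new, god help me";

        // llUnescapeURL only touches %XX sequences, so "&#39;" passes through untouched.
        llOwnerSay(llUnescapeURL(coded));

        // URL escaping is a different encoding: "&", "#" and ";" become %26, %23 and %3B.
        llOwnerSay(llEscapeURL(coded));
    }
}
```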

Edited by Jenna Huntsman

I went through llUnescapeURL() earlier hoping/praying, but no dice. 
Are you aware of any other ways to access profile text? From what I saw, there's no direct access in world, but feel free to tell me there is llGetProfileTextAlsoAndreHowDidYouMissThat haha


there is a nice replace function ... Combined_Library#Replace

here is a small example ..

string strReplace(string str, string search, string replace) {
    return llDumpList2String(llParseStringKeepNulls((str = "") + str, [search], []), replace);
}
default
{
    state_entry()
    {       
    }
    touch_start(integer total_number)
    {  string old =  "I&#39;m new, god help me";     
       string redo =  strReplace(old, "&#39;", "'");
       llOwnerSay(" fixed: " + redo);
    }
}

 


7 hours ago, KT Kingsley said:

Would I be right in thinking that llChar returns the character specified in the numerical part of the HTML code? If so this could be used in a relatively simple conversion routine that doesn't require its own conversion table.

You are right!
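As a quick check of the idea, a minimal sketch: the numeric part of the coded apostrophe is 39, and llChar turns that number back into the character. The integer cast reads the leading digits and stops at the semicolon, which the function below also relies on.

```lsl
default
{
    state_entry()
    {
        // Numeric part of the coded character "&#39;", with the trailing ";" still attached.
        integer code = (integer)"39;";

        // llChar returns the Unicode character for that code point, here an apostrophe.
        llOwnerSay("[" + llChar(code) + "]");
    }
}
```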

Here's a function which replaces the HTML coded characters with unicode characters:

string HTMLreplaceCodedChars(string s_rawHTML)
{
    //Replace coded HTML characters with Unicode equivalents. Credit: Jenna Huntsman, KT Kingsley, Quistess Alpha, Haravikk Mistral
    //Uses code from http://wiki.secondlife.com/wiki/Combined_Library#Replace
    while(llSubStringIndex(s_rawHTML,"&#") != -1) //Loop until no more encoded characters exist in the string.
    {
        integer i_hit = llSubStringIndex(s_rawHTML,"&#"); //Find coded character start.
        integer i_charID = (integer)llGetSubString(s_rawHTML,i_hit+2,i_hit+5); //Find the character number to convert to unicode character.
        if(llGetSubString(s_rawHTML,i_hit+5,i_hit+5) == ";") //If it's a 3 value coded character
        {
            s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+5)], []), llChar(i_charID)); //Replace coded character with unicode equiv.
        }
        else //Nope, it's a 2 value character.
        {
            s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+4)], []), llChar(i_charID)); //Replace coded character with unicode equiv.
        }
    }
    return s_rawHTML; //pass the new string back
}

 

Edited by Jenna Huntsman
Added Quistess' suggestion

3 hours ago, Jenna Huntsman said:

Here's a function which replaces the HTML coded characters with unicode characters:

Nice! The only thing I'm confused about is why you duplicate the string at the beginning. LSL allows you to modify arguments to a function directly, which saves memory.

I.e.

// changed variable name : (or search-replace in the main body of the function s_Unicode->s_rawHtml)
string HTMLreplaceCodedChars(string s_Unicode)
{	// remove this line:
    //string s_Unicode = s_rawHTML; //New string to store the unicode version in.
  	//everything else the same
    while(llSubStringIndex(s_Unicode,"&#") != -1) //Loop until no more encoded characters exist in the string.
    { // . . .

1 hour ago, Quistess Alpha said:

Nice! the only thing I'm confused about is why you duplicate the string at the beginning. LSL allows you to modify arguments to a function directly, which saves memory.

Oops!

I originally put that in there so I could return an error code should something go wrong, but changed my approach while I was writing it which made an error code redundant, then forgot about that value.


Thanks so much gang :D

I added this in this afternoon, and it caught most of the problematic characters. Thanks especially Jenna Huntsman for handling the initial heavy lifting!

The only thing I added was handling for XML characters mentioned here: https://en.wikipedia.org/wiki/Character_encodings_in_HTML#XML_character_references
It seems like all the characters that were being encoded are ones that might have an impact on HTML/XML.
Other than that, 𝓾𝓷𝓬𝓸𝓶𝓶𝓸𝓷 𝓬𝓱𝓪𝓻𝓪𝓬𝓽𝓮𝓻𝓼 𝓲𝓷 𝓹𝓻𝓸𝓯𝓲𝓵𝓮𝓼 𝓬𝓸𝓶𝓮 𝓽𝓱𝓻𝓸𝓾𝓰𝓱 𝓳𝓾𝓼𝓽 𝓯𝓲𝓷𝓮 ♥●•٠·˙

There's probably a better way to structure them, but here's my edits anyhow :)

string HTMLreplaceCodedChars(string s_rawHTML)
{
    //Replace coded HTML/XML characters with Unicode equivalents. Credit: Jenna Huntsman, KT Kingsley, Quistess Alpha, Haravikk Mistral, AndreRush
    //Uses code from http://wiki.secondlife.com/wiki/Combined_Library#Replace
    while(llSubStringIndex(s_rawHTML,"&#") != -1) //Loop until no more encoded characters exist in the string.
    {
        integer i_hit = llSubStringIndex(s_rawHTML,"&#"); //Find coded character start.
        integer i_charID = (integer)llGetSubString(s_rawHTML,i_hit+2,i_hit+5); //Find the character number to convert to unicode character.
        if(llGetSubString(s_rawHTML,i_hit+5,i_hit+5) == ";") //If it's a 3 value coded character
        {
            s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+5)], []), llChar(i_charID)); //Replace coded character with unicode equiv.
        }
        else //Nope, it's a 2 value character.
        {
            s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+4)], []), llChar(i_charID)); //Replace coded character with unicode equiv.
        }
    }
    //Below handles various XML characters that are encoded differently. Since we know exactly what character it needs to be,
    //we can just replace it without needing to look it up. 
    while (llSubStringIndex(s_rawHTML,"&quot;") != -1)
    {
        integer i_hit = llSubStringIndex(s_rawHTML,"&quot;");
        string i_charID = "\"";
        s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+5)], []), i_charID);
    }
    while (llSubStringIndex(s_rawHTML,"&amp;") != -1)
    {
        integer i_hit = llSubStringIndex(s_rawHTML,"&amp;");
        string i_charID = "&";
        s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+4)], []), i_charID);
    }
    while (llSubStringIndex(s_rawHTML,"&gt;") != -1)
    {
        integer i_hit = llSubStringIndex(s_rawHTML,"&gt;");
        string i_charID = ">";
        s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+3)], []), i_charID);
    }
    while (llSubStringIndex(s_rawHTML,"&lt;") != -1)
    {
        integer i_hit = llSubStringIndex(s_rawHTML,"&lt;");
        string i_charID = "<";
        s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+3)], []), i_charID);
    }
    while (llSubStringIndex(s_rawHTML,"&apos;") != -1)
    {
        integer i_hit = llSubStringIndex(s_rawHTML,"&apos;");
        string i_charID = "'";
        s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+5)], []), i_charID);
    }
    return s_rawHTML; //pass the new string back
}

 

Edited by AndreRush

  • AndreRush changed the title to SOLVED: Convert HTML from llHTTPRequest into plain text? (HTML and XML coded characters converted to Unicode)

@AndreRush

some code that may be useful to someone...  just need to add to the replacer list to cover all characters :)

( no error handling for multiple instances etc )

// https://www.rapidtables.com/web/html/html-codes.html

string replacer(string raw)
{  
    integer begin = llSubStringIndex(raw,"&");
    integer end   = llSubStringIndex(raw,";");
    string  chars = llGetSubString(raw, begin, end);
    string  new;
   
    integer index = llListFindList(xml, [chars]);
    if (index != -1)
    { new = llList2String( xml, index + 1);         
    }
    else
    { llOwnerSay("newp: "); 
    }
    return llDumpList2String(llParseStringKeepNulls(  raw , [chars], []), new);  
}

list xml = [

    // xml codes
    "&amp;","&",
    "&lt;","<",
    "&gt;",">",
    "&quot;","\"",
    "&apos;","'" ,
      
    // html codes
    "&#38;", "&",
    "&#39;", "'" 
];

default
{
    state_entry()
    {
    }
    touch_start(integer total_number)
    {  
        string old =  "I&#39;m new, god help me";  
        string old2 = "hello &amp; good morning";
      
        string returned = replacer( old );
        llOwnerSay( "Data: \n\n" + returned);
      
        string returned2 = replacer( old2 );
        llOwnerSay( "Data: \n\n" + returned2 );
    }
}

 

Edited by Xiija

9 hours ago, AndreRush said:

There's probably a better way to structure them, but here's my edits anyhow :)

You can actually shorten that up by quite a bit, like this:

string HTMLreplaceCodedChars(string s_rawHTML)
{
    //Replace coded HTML characters with Unicode equivalents. Credit: Jenna Huntsman, KT Kingsley, Quistess Alpha, Haravikk Mistral, AndreRush
    //Uses code from http://wiki.secondlife.com/wiki/Combined_Library#Replace
    while(llSubStringIndex(s_rawHTML,"&#") != -1) //Loop until no more encoded characters exist in the string.
    {
        integer i_hit = llSubStringIndex(s_rawHTML,"&#"); //Find coded character start.
        integer i_charID = (integer)llGetSubString(s_rawHTML,i_hit+2,i_hit+8); //Read up to 7 digits (enough for any Unicode code point); the integer cast stops at the first non-digit, such as ";".
        s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,(i_hit + 2 + llStringLength((string)i_charID)))], []), llChar(i_charID)); //Replace coded character with unicode equiv.
    }

    //Replace XML characters with Unicode equivalents.
    list l_XMLchars = ["amp","&","lt","<","gt",">","quot","\"","apos","'"];
    integer i;
    for(; i < llGetListLength(l_XMLchars); i = i+2)
    { //iterate through the list to find each of the characters present in the raw string.
        string s_CurChar = "&" + llList2String(l_XMLchars,i) + ";"; //create the full character to look for.
        while(llSubStringIndex(s_rawHTML,s_CurChar) != -1)
        { //Loop until all instances of the current character have been replaced.
            integer i_hit = llSubStringIndex(s_rawHTML,s_CurChar);
            s_rawHTML = llDumpList2String(llParseStringKeepNulls(s_rawHTML, [llGetSubString(s_rawHTML,i_hit,i_hit+(llStringLength(s_CurChar)-1))], []), llList2String(l_XMLchars, i+1));
        }
    }
    return s_rawHTML; //pass the new string back
}
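
One thing to watch with the replace-in-a-loop approach: because "&amp;" is replaced before "&lt;", text such as "&amp;lt;" ends up decoded twice and comes out as "<". Below is a rough single-pass sketch that decodes each coded character once, left to right, and also accepts the hexadecimal form (e.g. "&#x27;"). The function name decodeEntities and the 8-character limit on what sits between "&" and ";" are my own assumptions, not anything from the wiki or the code above, and it has not been tested in-world against real profile data.

```lsl
// Hypothetical helper (not from the thread): decodes each coded character once, left to right.
string decodeEntities(string src)
{
    // Name/replacement pairs; names sit at even indices.
    list named = ["amp", "&", "lt", "<", "gt", ">", "quot", "\"", "apos", "'"];
    string out = "";
    integer amp = llSubStringIndex(src, "&");
    while (amp != -1)
    {
        // Move the text before the ampersand to the output, then drop it and the ampersand.
        if (amp > 0) out += llGetSubString(src, 0, amp - 1);
        src = llDeleteSubString(src, 0, amp);

        string repl = "&"; // fallback: keep an unrecognised ampersand as-is
        integer semi = llSubStringIndex(src, ";");
        if (semi > 0 && semi <= 8) // assumed limit on the length between "&" and ";"
        {
            string body = llGetSubString(src, 0, semi - 1);
            integer code = 0;
            if (llGetSubString(body, 0, 0) == "#")
            {
                if (llToLower(llGetSubString(body, 1, 1)) == "x")
                    code = (integer)("0x" + llDeleteSubString(body, 0, 1)); // hex form
                else
                    code = (integer)llDeleteSubString(body, 0, 0); // decimal form
                if (code > 0) repl = llChar(code);
            }
            else
            {
                integer idx = llListFindList(named, [body]);
                if (idx >= 0 && idx % 2 == 0) // only match names, never replacement values
                {
                    repl = llList2String(named, idx + 1);
                    code = 1; // mark as recognised
                }
            }
            // Consume the coded character only if it was recognised.
            if (code > 0) src = llDeleteSubString(src, 0, semi);
        }
        out += repl;
        amp = llSubStringIndex(src, "&");
    }
    return out + src; // whatever is left contains no ampersands
}

default
{
    touch_start(integer total_number)
    {
        // "&amp;lt;" should come out as the literal text "&lt;", not "<".
        llOwnerSay(decodeEntities("I&#39;m &amp;lt; ok"));
    }
}
```

The trade-off is a little more code in exchange for not rescanning the whole string on every replacement. Like the versions above, the integer cast is lenient, so something like "&#39abc;" would still be read as 39.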

 

Edited by Jenna Huntsman
More opts!