  1. UTF-8 conversion doesn't always work
    <p>I searched other questions before posting here and didn't find anything similar. I have to scrape several UTF-8 web pages that contain text like</p>
    <p>"Oggi è una bellissima giornata"</p>
    <p>The problem is with the character "è".</p>
    <p>I extract this text with JTidy and an XPath query expression, and I convert it with</p>
    <pre><code>byte[] content = filteredEncodedString.getBytes("utf-8");
String result = new String(content,"utf-8");
</code></pre>
    <p>where filteredEncodedString contains the text "Oggi è una bellissima giornata". This procedure works on most of the web pages analyzed so far, but in some cases it doesn't produce a correct UTF-8 string. The page encoding is always the same, and the text is similar.</p>
    <p><strong>Edit on September 14th</strong></p>
    <p>I modified my code as follows to get pages in UTF-8 encoding:</p>
    <pre><code>URL url = new URL(currentUrl);
URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), getEncode()));
String line="";
String domString="";
while((line = in.readLine()) != null) {
    domString+=line.toString();
}
byte[] bytes = domString.getBytes("UTF-8");
in.close();
return bytes; //return text.getBytes();
</code></pre>
    <p>where <strong>getEncode()</strong> returns the page encoding, UTF-8 in this case. But I still noticed that ì or é are not read correctly. Is there something wrong with this code? Thanks again!</p>
    <p><strong>Edited on October 2nd</strong></p>
    <p>This code seems to work. The problem was in the creation of a DOM Document, which I didn't post (sorry about this!), from the bytes returned by the method above.</p>
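A minimal sketch of why the getBytes("utf-8") / new String(content, "utf-8") pair in the first snippet cannot repair anything: encoding a String and decoding it with the same charset returns an equal String, so characters are only decoded correctly (or incorrectly) at the point where raw bytes first become a String. The class and method names below (Utf8RoundTrip, roundTrip, decode) are hypothetical and not part of the original code, and the wrong charset is assumed to be ISO-8859-1 purely for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {

    // Encode to UTF-8 and decode again: the result equals the input,
    // so this cannot fix text that was already decoded incorrectly.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decode raw bytes with an explicitly chosen charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // The accented letter is written as a Unicode escape so the example
        // does not depend on the encoding of the source file.
        String original = "Oggi \u00e8 una bellissima giornata";

        // Bytes as they would arrive from a UTF-8 page.
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(wire, StandardCharsets.UTF_8);
        String wrong = decode(wire, StandardCharsets.ISO_8859_1); // assumed wrong charset

        System.out.println(right.equals(original));          // true
        System.out.println(wrong.equals(original));          // false
        System.out.println(roundTrip(wrong).equals(wrong));  // true: still garbled
    }
}
```

Under these assumptions, the place to look is whichever step first turns bytes into characters (the InputStreamReader, the parser, or the DOM builder), which is consistent with the asker's final edit.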