How do I convert web content to a consistent character set when crawling the web?
<p>I've done a lot of research and testing on this.</p>
<p>As I understand it, HTTP headers declare an encoding only if the web server is set up to send one, and the server may default to a particular encoding even if the developers didn't intend it. Meta tags are set only if the developer put them in the markup, or if a development framework added them automatically (which is problematic if the developer never considered encoding).</p>
<p>I've found that when both are set, they often conflict with each other. For example, the HTTP header says the page is <code>iso-8859-1</code> while the meta tag specifies <code>windows-1252</code>. I could assume one supersedes the other (likely the meta tag), but that seems fairly unreliable. It also seems that very few developers consider encoding when handling their data, so dynamically generated sites often mix encodings, or serve an encoding they don't intend because the data coming from their database is encoded differently.</p>
<p><strong>My conclusion has been to do the following:</strong></p>
<ol>
<li>Check the encoding of every page using <code>mb_detect_encoding()</code>.</li>
<li>If that fails, use the meta encoding (<code>http-equiv="Content-Type"...</code>).</li>
<li>If there is no meta content-type, use the HTTP headers (<code>content_type</code>).</li>
<li>If there is no HTTP content-type, assume UTF-8.</li>
<li>Finally, convert the document using <code>mb_convert_encoding()</code>, then scrape it for content. (I've purposely left out the encoding to convert to, to avoid that discussion here.)</li>
</ol>
<p>I'm trying to get as much accurate content as possible, rather than ignoring webpages just because the developers didn't set their headers properly.</p>
<p><strong>What problems do you see with this approach?</strong></p>
<p><strong>Am I going to run into problems using the <code>mb_detect_encoding()</code> and <code>mb_convert_encoding()</code> functions?</strong></p>
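For reference, the five steps above could be sketched roughly as follows. This is only an illustration of the fallback order described in the question, not a vetted crawler component. The function names (`charsetFromContentType`, `charsetFromMeta`, `isKnownEncoding`, `convertPage`) are hypothetical, the regular expressions are deliberately simple, and the `UTF-8` default target is an assumption, since the question leaves the target encoding open. The detection step is limited to `ASCII` and `UTF-8` in strict mode, which is also an assumption: single-byte encodings such as ISO-8859-1 accept almost any byte sequence, so including them in the candidate list would mean detection rarely "fails" and the later steps would never run.

```php
<?php
// Hypothetical helper: extracts a charset label from a Content-Type style
// string such as "text/html; charset=iso-8859-1". Returns null if absent.
function charsetFromContentType(?string $value): ?string
{
    if ($value !== null
        && preg_match('/charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $value, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: finds a charset declared in a <meta> tag. Covers both
// <meta charset="..."> and <meta http-equiv="Content-Type" content="...">.
function charsetFromMeta(string $html): ?string
{
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: true if mbstring lists this encoding name.
// Aliases (for example "latin1") are not covered by this simple check.
function isKnownEncoding(string $name): bool
{
    $known = array_map('strtolower', mb_list_encodings());
    return in_array(strtolower($name), $known, true);
}

// Applies the fallback order from the question, then converts.
// $target defaults to UTF-8 purely as an assumption for this sketch.
function convertPage(string $html, ?string $httpContentType = null, string $target = 'UTF-8'): string
{
    // Step 1: strict detection against a short candidate list.
    $source = mb_detect_encoding($html, ['ASCII', 'UTF-8'], true);

    if ($source === false) {
        // Steps 2 to 4: meta tag, then HTTP header, then assume UTF-8.
        $source = charsetFromMeta($html)
            ?? charsetFromContentType($httpContentType)
            ?? 'UTF-8';
    }

    // Pages sometimes declare labels mbstring does not know; fall back
    // to UTF-8 instead of passing an unknown name to the converter.
    if (!isKnownEncoding($source)) {
        $source = 'UTF-8';
    }

    // Step 5: convert to the target encoding.
    return mb_convert_encoding($html, $target, $source);
}

// Usage: the meta tag and the HTTP header disagree; the meta tag wins here.
$page = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\"><p>caf\xE9</p>";
echo convertPage($page, 'text/html; charset=iso-8859-1'), PHP_EOL;
```

Note that a declared label is trusted as-is in this sketch; it does not verify that the bytes are actually valid in the declared encoding, which is exactly the kind of conflict the question describes.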