Java: store crawled pages to MySQL in a unified encoding
<p>I am crawling web pages into a MySQL database using Java.</p>
<p>These pages are in various encodings (e.g. GBK, UTF-8, ...) and may contain non-ASCII characters. I have managed to detect each page's encoding and obtain a readable string (by "readable" I mean that it displays the same in the <code>Eclipse console</code> as in a <code>Web Browser</code>).</p>
<p>I take the page encoding from the <code>&lt;meta&gt;</code> tag and default to <code>UTF-8</code> if none is found. See the following snippet:</p>
<pre><code>InputStream is = hconn.getInputStream();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
int b = -1;
// Read the whole response body into memory.
while (-1 != (b = is.read())) {
    baos.write((byte) b);
}
String charset = "UTF-8";
// First parse (platform default charset) only to locate the meta tag.
Document doc = Jsoup.parse(baos.toString());
Elements metas = doc.select("meta[http-equiv=Content-Type]");
Pattern p = Pattern.compile("charset=([0-9a-zA-Z_\\-]+)");
Matcher m;
for (Element meta : metas) {
    m = p.matcher(meta.toString());
    if (m.find())
        charset = m.group(1);
}
// Decode the raw bytes again with the detected charset.
String str = new String(baos.toByteArray(), charset);
</code></pre>
<p>Then I store the string in MySQL. The MySQL connection URL is <code>jdbc:mysql://localhost:3306/db?characterEncoding=gbk</code>, and the column that holds the text uses the <code>GBK</code> encoding.</p>
<p>The problem is that strings which display correctly in the <code>Eclipse console</code> turn out to be unrecognizable sequences in MySQL, and sometimes the insert raises an <code>SQLException</code>. From what I have observed, it is the non-<code>GBK</code> strings that go wrong.</p>
<p>I think converting non-<code>GBK</code> strings to <code>GBK</code> would work, but how do I do that? Are there any workarounds? My final goal is to construct an inverted index.</p>
<p>Answers about encoding conversion are preferred.</p>
<p>Any help would be appreciated. Thanks in advance.</p>
<hr>
<p><strong>Added:</strong></p>
<p>Create table SQL:</p>
<pre><code>CREATE TABLE `indexer`.`pages` (
  `content` TEXT CHARACTER SET gbk COLLATE gbk_chinese_ci,
  `url` VARCHAR(512) NOT NULL,
  `id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE = InnoDB;
</code></pre>
<p>Error Message:</p>
<p><code>You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'é”??μ¢Wé”??μ?é”??μ—é”??–¤??·DPIyé”????é”??–¤??·é”????0")Sé”????&lt;é”????cé”??–¤??' at line 1</code></p>
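<p>A minimal sketch of one way to approach this, not a definitive fix. It relies on the fact that a Java <code>String</code> is already Unicode once the bytes have been decoded with the right charset, so the string itself needs no separate "convert to GBK" step; what matters is whether each character exists in GBK at all. The class name <code>GbkCheck</code> and its method names are hypothetical. The idea that the failing insert builds its SQL by string concatenation is only a guess based on the syntax error quoted above. Table and column names are taken from the <code>CREATE TABLE</code> statement in the question.</p>

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper class, for illustration only.
final class GbkCheck {

    // Decode raw page bytes with the charset detected from the <meta> tag.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // True if every character of s can be represented in GBK.
    static boolean fitsInGbk(String s) {
        CharsetEncoder encoder = Charset.forName("GBK").newEncoder();
        return encoder.canEncode(s);
    }

    // Insert through bound parameters so page text is never parsed as SQL.
    // Assumes the `indexer`.`pages` table from the question.
    static void insertPage(Connection conn, String url, String content)
            throws SQLException {
        String sql = "INSERT INTO `indexer`.`pages` (`content`, `url`) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, content);
            ps.setString(2, url);
            ps.executeUpdate();
        }
    }
}
```

<p>Text for which <code>fitsInGbk</code> returns <code>false</code> cannot be stored losslessly in a <code>gbk</code> column, whatever conversion is applied. One possible workaround is to give the column a Unicode character set such as <code>utf8mb4</code> and use a matching <code>characterEncoding</code> in the connection URL.</p>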