
    <p>The simplest way might be to parse the page twice, once as UTF-8 and once as GB2312, and then extract the relevant section from the GB2312 parse.</p>
    <p>I don't know much about GB2312, but from what I can find it agrees with ASCII on at least the basic letters, digits, and punctuation. So you should still be able to parse the HTML structure using GB2312, which would hopefully give you enough information to extract the part you need.</p>
    <p>This may actually be the only way to do it. In general, GB2312-encoded text won't be valid UTF-8, so trying to decode it as UTF-8 should lead to errors. The BeautifulSoup documentation says:</p>
    <blockquote> <p>In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode may be to replace some characters with the special Unicode character “REPLACEMENT CHARACTER” (U+FFFD, �). If Unicode, Dammit needs to do this, it will set the .contains_replacement_characters attribute to True on the UnicodeDammit or BeautifulSoup object.</p> </blockquote>
    <p>This makes it sound like BeautifulSoup does not raise on decoding errors and instead replaces the offending bytes with U+FFFD. If that is the case (i.e., if your document has <code>contains_replacement_characters == True</code>), then there is no way to get the original data back from the document once it has been decoded as UTF-8. You will have to do something like what I suggested above: decode the entire document twice, with different codecs.</p>
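The two-pass idea can be sketched with nothing but the standard library. This is only an illustration under assumptions: the sample page, the `<div id="gb">` markers, and the helper names `build_sample_page`, `decode_twice`, and `extract_between` are all hypothetical, and the check for `"\ufffd"` stands in for BeautifulSoup's `contains_replacement_characters` flag rather than reproducing it.

```python
def build_sample_page() -> bytes:
    """Build a hypothetical page: UTF-8 markup wrapping one GB2312 section."""
    head = '<html><body><p>café</p><div id="gb">'.encode("utf-8")
    body = "你好世界".encode("gb2312")  # the section in the "other" encoding
    tail = "</div></body></html>".encode("utf-8")
    return head + body + tail


def decode_twice(raw: bytes) -> tuple[str, str]:
    """Decode the same bytes once as UTF-8 and once as GB2312.

    errors="replace" substitutes U+FFFD for undecodable bytes instead of
    raising, so both passes always produce a string.
    """
    as_utf8 = raw.decode("utf-8", errors="replace")
    as_gb2312 = raw.decode("gb2312", errors="replace")
    return as_utf8, as_gb2312


def extract_between(text: str, start_marker: str, end_marker: str) -> str:
    """Return the text between two ASCII markers (assumed to be present)."""
    start = text.index(start_marker) + len(start_marker)
    end = text.index(end_marker, start)
    return text[start:end]


if __name__ == "__main__":
    raw = build_sample_page()
    as_utf8, as_gb2312 = decode_twice(raw)

    # The UTF-8 pass has lost the GB2312 bytes: they became U+FFFD.
    print("replacement characters in UTF-8 pass:", "\ufffd" in as_utf8)

    # The ASCII markup survives the GB2312 pass, so the section can be located.
    print(extract_between(as_gb2312, '<div id="gb">', "</div>"))
```

The UTF-8 pass is still the one to trust for everything outside the marked section; in the GB2312 pass, non-ASCII UTF-8 text such as the "é" in "café" comes out wrong, which is why only the extracted section is taken from it.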
 
