    <p>First, a clarification: your approach doesn't convert from <code>GB2312</code> to ASCII, nor would you want it to, since ASCII can't represent the string <code>'╠µ╔Ý╩²¥¦┤ª└Ý│╠ð‗'</code>. What <code>decode</code> returns is a sequence of abstract characters that can't be written to disk directly; an encoding is the serialisation rule that turns those characters into bytes. This character type is called <code>unicode</code> in Python 2 and <code>str</code> in Python 3; <code>stdout</code>, by contrast, is a byte string: <code>str</code> in Python 2 and <code>bytes</code> in Python 3.</p>
    <p>When you pass raw bytes to <code>json.loads</code>, it tries to deserialise (decode) them into a character string using UTF-8. That produces the error you see, because your input is serialised using a different, incompatible encoding. Decoding it yourself first is the right approach. Python 3 releases before 3.6 require this anyway, since <code>json.loads</code> there accepts only a character sequence; from 3.6 onwards it also accepts bytes, but only in UTF-8, UTF-16 or UTF-32, so GB2312 input still has to be decoded first.</p>
    <p>There is one caveat: guessing the encoding, the way chardet does, is <em>hard</em> and error-prone. It happens to work in this particular case, but there is no guarantee that it will work for other files. It <em>may</em> be the best approach available to you: usually you would expect the encoding to be declared early in the file's metadata, but that doesn't seem to be the case here. Still, always look for authoritative information about the encoding before resorting to guesswork.</p>
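To make the decode-first approach concrete, here is a minimal Python 3 sketch. The payload and the `handler_name` key are assumptions made up for illustration (the byte values are the ones quoted in the comments), and `loads_bytes` is a hypothetical helper, not part of the `json` module.

```python
import json

# Hypothetical payload: JSON whose string value is GB2312-encoded.
raw = (b'{"handler_name": "Apple '
       b'\xcc\xe6\xc9\xed\xca\xfd\xbe\xdd\xb4\xa6\xc0\xed\xb3\xcc\xd0\xf2"}')


def loads_bytes(payload, encoding):
    """Decode bytes to text first, then parse the text as JSON."""
    text = payload.decode(encoding)  # bytes -> str (abstract characters)
    return json.loads(text)          # str -> Python objects


# Parsing the raw bytes directly fails, because they are not valid UTF-8.
try:
    json.loads(raw)
    direct_parse_failed = False
except (UnicodeDecodeError, TypeError):
    direct_parse_failed = True

data = loads_bytes(raw, "gb2312")
handler = data["handler_name"]

print(direct_parse_failed)  # True
print(type(handler))        # <class 'str'>
```

The encoding is passed in explicitly so that the guess (from chardet or anywhere else) stays visible at the call site instead of being buried inside the parsing step.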
    1. @Ivc Thanks, this is great! I'm not sure I understand everything you've said, but I think I've extracted the useful parts. It seems the real problem is that the source data doesn't carry the encoding information needed to parse it correctly, so I'm forced to guess the encoding. As it stands, my solution seems to be a necessary evil.
    2. I am a bit unsure about the clarification part of your response, but perhaps I left out some crucial information. The abstract characters you refer to are output by the ffprobe command, but once captured as stdout they actually look like this: `Apple \xcc\xe6\xc9\xed\xca\xfd\xbe\xdd\xb4\xa6\xc0\xed\xb3\xcc\xd0\xf2`. In other words, the stdout is of type `str`. When I use chardet to detect the encoding of this string, it returns `GB2312`, and when I use `str.decode("GB2312")` I get an `ascii` encoding. Can you please explain the first part of your response again?
    3. @Yani Check `type(stdout)` and `type(stdout.decode("GB2312"))`: they should be different. If you've read the article linked under the question, it should make sense that Py2 `unicode`/Py3 `str` is a sequence of unencoded Unicode code points, while Py2 `str`/Py3 `bytes` is a sequence of raw bytes. `decode` doesn't *change* the encoding (to ASCII), it *removes* it, giving a sequence of code points rather than bytes. This is a bit clearer in Python 3 than in Python 2, thanks to the better type names and other improvements. Consider upgrading if you can.
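
The type check suggested in the last comment can be sketched as follows. This assumes Python 3, where the captured output is `bytes`; the byte values are the ones quoted earlier in the thread, and the variable names are illustrative only.

```python
# Bytes as captured from the subprocess (values quoted earlier in the thread).
stdout = b"Apple \xcc\xe6\xc9\xed\xca\xfd\xbe\xdd\xb4\xa6\xc0\xed\xb3\xcc\xd0\xf2"

text = stdout.decode("gb2312")  # bytes -> code points; no encoding attached

print(type(stdout))  # <class 'bytes'>
print(type(text))    # <class 'str'>

# The decoded text is not ASCII: encoding it as ASCII fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The same code points serialise to different bytes under different encodings.
as_gb2312 = text.encode("gb2312")
as_utf8 = text.encode("utf-8")
print(as_gb2312 == stdout)           # True
print(len(as_gb2312), len(as_utf8))  # 22 30
```

The differing lengths show that an encoding is a property of the bytes, not of the decoded text: one `str` value, two different byte serialisations.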