Python mailbox encoding errors
<p>First, let me say that I'm a complete beginner at Python. I've never learned the language, I just thought "how hard can it be" when Google turned up nothing but Python snippets to solve my problem. :)</p>
<p>I have a bunch of mailboxes in Maildir format (a backup from the mail server on my old web host), and I need to extract the emails from them. So far, the simplest way I've found is to convert them to the mbox format, which Thunderbird supports, and Python has classes for reading and writing both formats. Seems perfect.</p>
<p>The Python docs even have this little code snippet doing exactly what I need:</p>
<pre><code>import mailbox

src = mailbox.Maildir('maildir', factory=None)
dest = mailbox.mbox('/tmp/mbox')
for msg in src:    #1
    dest.add(msg)  #2
</code></pre>
<p><em>Except</em> it doesn't work. And here's where my complete lack of knowledge about Python sets in. On a few messages, I get a UnicodeDecodeError during the iteration (that is, when it's trying to read <code>msg</code> from <code>src</code>, on line <code>#1</code>). On others, I get a UnicodeEncodeError when trying to add <code>msg</code> to <code>dest</code> (line <code>#2</code>).</p>
<p>Clearly it makes some wrong assumptions about the encoding used. But I have no clue how to specify an encoding on the mailbox. (For that matter, I don't know what the encoding should be either, but I can probably figure that out once I find a way to actually specify one.)</p>
<p>For the UnicodeDecodeErrors, I get stack traces similar to the following:</p>
<pre><code>  File "E:\Python30\lib\mailbox.py", line 102, in itervalues
    value = self[key]
  File "E:\Python30\lib\mailbox.py", line 74, in __getitem__
    return self.get_message(key)
  File "E:\Python30\lib\mailbox.py", line 317, in get_message
    msg = MaildirMessage(f)
  File "E:\Python30\lib\mailbox.py", line 1373, in __init__
    Message.__init__(self, message)
  File "E:\Python30\lib\mailbox.py", line 1345, in __init__
    self._become_message(email.message_from_file(message))
  File "E:\Python30\lib\email\__init__.py", line 46, in message_from_file
    return Parser(*args, **kws).parse(fp)
  File "E:\Python30\lib\email\parser.py", line 68, in parse
    data = fp.read(8192)
  File "E:\Python30\lib\io.py", line 1733, in read
    eof = not self._read_chunk()
  File "E:\Python30\lib\io.py", line 1562, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "E:\Python30\lib\io.py", line 1295, in decode
    output = self.decoder.decode(input, final=final)
  File "E:\Python30\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 37: character maps to &lt;undefined&gt;
</code></pre>
<p>And on the UnicodeEncodeErrors:</p>
<pre><code>  File "E:\Python30\lib\email\message.py", line 121, in __str__
    return self.as_string()
  File "E:\Python30\lib\email\message.py", line 136, in as_string
    g.flatten(self, unixfrom=unixfrom)
  File "E:\Python30\lib\email\generator.py", line 76, in flatten
    self._write(msg)
  File "E:\Python30\lib\email\generator.py", line 108, in _write
    self._write_headers(msg)
  File "E:\Python30\lib\email\generator.py", line 141, in _write_headers
    header_name=h, continuation_ws='\t')
  File "E:\Python30\lib\email\header.py", line 189, in __init__
    self.append(s, charset, errors)
  File "E:\Python30\lib\email\header.py", line 262, in append
    input_bytes = s.encode(input_charset, errors)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe5' in position 16: ordinal not in range(128)
</code></pre>
<p>Anyone able to help me out here? (Suggestions for completely different solutions not involving Python are obviously welcome too. I just need a way to import the mails from these Maildir files.)</p>
<p><em>Updates:</em></p>
<p><code>sys.getdefaultencoding()</code> returns 'utf-8'.</p>
<p>I uploaded sample messages which cause both errors. <a href="http://jalf.dk/python_problem/1187691008.H199308P14265.c1p.hostingzoom.com_2,S" rel="nofollow noreferrer">This one</a> throws UnicodeEncodeError, and <a href="http://jalf.dk/python_problem/1219438193.H364790P13554.c1p.hostingzoom.com_2,S" rel="nofollow noreferrer">this</a> throws UnicodeDecodeError.</p>
<p>I tried running the same script in Python 2.6, and got TypeErrors instead:</p>
<pre><code>  File "c:\python26\lib\mailbox.py", line 529, in add
    self._toc[self._next_key] = self._append_message(message)
  File "c:\python26\lib\mailbox.py", line 665, in _append_message
    offsets = self._install_message(message)
  File "c:\python26\lib\mailbox.py", line 724, in _install_message
    self._dump_message(message, self._file, self._mangle_from_)
  File "c:\python26\lib\mailbox.py", line 220, in _dump_message
    raise TypeError('Invalid message type: %s' % type(message))
TypeError: Invalid message type: &lt;type 'instance'&gt;
</code></pre>
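<p>Both tracebacks come from text decoding or encoding of the message, so one possible workaround is to never turn the messages into text at all. The following is a minimal sketch, not a confirmed fix for the Python 3.0 setup above: it assumes a later Python 3 release (3.2 or newer), where <code>Maildir.get_bytes()</code> exists and <code>mbox.add()</code> accepts <code>bytes</code>. The function name <code>maildir_to_mbox</code> and its parameters are hypothetical, chosen only for illustration.</p>

```python
import mailbox


def maildir_to_mbox(maildir_path, mbox_path):
    """Copy every message from a Maildir into an mbox file as raw bytes.

    Hypothetical helper (not part of the standard library). Assumes
    Python 3.2+, where Maildir.get_bytes() and bytes input to mbox.add()
    are available. No charset is ever guessed, because the messages are
    never decoded to str.
    """
    # create=False: fail instead of silently creating an empty Maildir.
    src = mailbox.Maildir(maildir_path, factory=None, create=False)
    dest = mailbox.mbox(mbox_path)
    copied = 0
    try:
        for key in src.iterkeys():
            # get_bytes() returns the message exactly as stored on disk.
            dest.add(src.get_bytes(key))
            copied += 1
    finally:
        # close() flushes pending writes to the mbox file.
        dest.close()
    return copied


if __name__ == "__main__":
    # Paths taken from the snippet in the question.
    print(maildir_to_mbox("maildir", "/tmp/mbox"))
```

<p>Because nothing is decoded, messages with undeclared or mixed charsets are copied through unchanged; how their headers are displayed is then up to the mail client that opens the mbox file.</p>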
 
