StackOverflow2013


Python: special characters giving me problems (from PDFminer)
<p>I used pdf2text from PDFminer to reduce a PDF to text. Unfortunately, the output contains special characters. Here is the output from my console:</p>
<pre><code>&gt;&gt;&gt; a = pdf_to_text("ap.pdf")
</code></pre>
<p>Here's a sample of it, slightly truncated:</p>
<pre><code>&gt;&gt;&gt; a[5000:5500]
'f one architect. Decades ...... but to re\xef\xac\x82ect\none set of design ideas, than to have one that contains many\ngood but independent and uncoordinated ideas.\n1 Joshua Bloch, \xe2\x80\x9cHow to Design a Good API and Why It Matters\xe2\x80\x9d, G......=-3733'
</code></pre>
<p>I understood that I must encode it:</p>
<pre><code>&gt;&gt;&gt; a[5000:5500].encode('utf-8')
Traceback (most recent call last):
  File "&lt;interactive input&gt;", line 1, in &lt;module&gt;
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 237: ordinal not in range(128)
</code></pre>
<p>I searched around a bit and tried the suggestions I found, notably <a href="https://stackoverflow.com/questions/4705793/replace-special-characters-in-python">Replace special characters in python</a>. The input comes from PDFminer, so it's tough (AFAIK) to control that. What is the way to make proper <strong>plaintext</strong> from this output?</p>
<p>What am I doing wrong?</p>
<p><strong>--A quick fix: change PDFminer's codec to ascii, but it's not a lasting solution--</strong></p>
<p><strong>--Abandoned the quick fix for the answer: changing the codec removes information--</strong></p>
<p><strong>--A relevant topic, as mentioned by Maxim: <a href="http://en.wikipedia.org/wiki/Windows-1251" rel="nofollow noreferrer">http://en.wikipedia.org/wiki/Windows-1251</a>--</strong></p>
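The byte sequences in the sample output above (\xef\xac\x82, \xe2\x80\x9c, \xe2\x80\x9d) look like UTF-8 for the "fl" ligature and curly double quotes, which suggests the string is already UTF-8 encoded and needs decoding rather than encoding. The following is a minimal sketch under that assumption, written in Python 3 syntax with the bytes copied from the sample. The names raw, PUNCT and plain are hypothetical, and the punctuation table is only an illustration, not a complete mapping.

```python
import unicodedata

# Bytes copied from the sample output; assumed to be UTF-8 encoded.
raw = (b"but to re\xef\xac\x82ect\none set of design ideas, than to have one "
       b"that contains many\ngood but independent and uncoordinated ideas.\n"
       b"1 Joshua Bloch, \xe2\x80\x9cHow to Design a Good API and Why It "
       b"Matters\xe2\x80\x9d")

# Decode first: bytes -> text. In Python 2, calling .encode() on a byte
# string triggers an implicit ASCII decode, which raises UnicodeDecodeError.
text = raw.decode("utf-8")

# NFKC normalization expands compatibility characters such as the
# "fl" ligature (U+FB02) into plain letters.
normalized = unicodedata.normalize("NFKC", text)

# Hypothetical table for punctuation that NFKC leaves unchanged.
PUNCT = {0x201C: '"', 0x201D: '"', 0x2018: "'", 0x2019: "'"}
plain = normalized.translate(PUNCT)

print(plain)
```

Decoding keeps all the information from the PDF. The normalization and translation steps are optional and only matter if the result really has to be ASCII-only plain text.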
