Python, pyPdf OCR error: pyPdf.utils.PdfReadError: EOF marker not found
<p>pyPdf throws this exception:</p>
<p>pyPdf.utils.PdfReadError: EOF marker not found</p>
<p>I don't need to fix pyPdf. I just need the EOF error to trigger an "except" block so the file gets skipped, but that doesn't happen: the error still stops the program.</p>
<p>Background:</p>
<p><a href="https://stackoverflow.com/questions/6026287/batch-ocr-program-for-pdfs">Batch OCR Program for PDFs</a></p>
<p><a href="https://stackoverflow.com/questions/6053064/python-pypdf-adobe-pdf-ocr-error-unsupported-filter-lzwdecode">Python, pyPdf, Adobe PDF OCR error: unsupported filter /lzwdecode</a></p>
<p>... the saga continues.</p>
<p>I have 10,000 PDFs in a folder. Some are OCR'd, some are not, and I can't tell them apart. Step 1 was to figure out which ones are not OCR'd and OCR only those (see the other threads for details).</p>
<p>So I'm using pyPdf. I get some exceptions related to unrecognized characters and unsupported filters when I try to read the text. So I guesstimated that if a file throws an exception, it has some text in it, and then it doesn't go in the list. Problem solved, right? 
Like so:</p>
<pre><code>from pyPdf import PdfFileWriter, PdfFileReader
import sys, os, pyPdf, re

path = 'C:\Users\Homer\Documents\My Pdfs'

filelist = os.listdir(path)

has_text_list = []
does_not_have_text_list = []

for pdf_name in filelist:
    pdf_file_with_directory = os.path.join(path, pdf_name)
    pdf = pyPdf.PdfFileReader(open(pdf_file_with_directory, 'rb'))
    print pdf_name
    for i in range(0, pdf.getNumPages()):
        try:
            pdf.write("%%EOF")
            content = pdf.getPage(i).extractText()
            does_it_have_text = re.findall(r'\w{2,}', content)
            if does_it_have_text == []:
                does_not_have_text_list.append(pdf_name)
                print pdf_name
            else:
                has_text_list.append(pdf_name)
        except:
            has_text_list.append(pdf_name)
print does_not_have_text_list
</code></pre>
<p>But then I get this error:</p>
<p>pyPdf.utils.PdfReadError: EOF marker not found</p>
<p>It seems to come up a lot (from Google):</p>
<p><a href="http://pdfposter.origo.ethz.ch/node/31" rel="nofollow noreferrer">http://pdfposter.origo.ethz.ch/node/31</a></p>
<p>I think it means that pyPdf opened the file, attempted its text processing, raised whatever exception, ran the except: block, but is now unable to go on to the next step because it doesn't know that the file has ended.</p>
<p>There are other threads like this and they allege that this has been fixed, but it doesn't seem to have been.</p>
<p>Then someone has a function here where they write the EOF marker to the .pdf first.</p>
<p><a href="http://code.activestate.com/lists/python-list/589529/" rel="nofollow noreferrer">http://code.activestate.com/lists/python-list/589529/</a></p>
<p>I stuck in the "pdf.write("%%EOF")" line to try to mimic this, but no dice.</p>
<p>So how do I get that error to run the except block? I'm also using Wing IDE, so if there's a way to use the debugger to just skip over these files, that would work too. Thx.</p>
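One likely reason the except block never runs: in the code above, the pyPdf.PdfFileReader(...) call sits outside the try block, and pyPdf appears to parse the end of the file (where it looks for the EOF marker) as soon as the reader is constructed. An exception raised there would never reach the except clause. The following is a minimal sketch of that idea, not a tested pyPdf fix. The names classify_pdfs and open_reader are hypothetical, introduced only for illustration; open_reader stands for any callable that returns an object with getNumPages() and getPage(i).extractText(), as PdfFileReader does. It also records each file once rather than once per page, which is an assumption about the intended result.

```python
import re

# Same "two or more word characters" test as in the question.
WORD = re.compile(r"\w{2,}")


def classify_pdfs(pdf_names, open_reader):
    """Return (has_text, no_text, skipped) lists of file names.

    open_reader is a hypothetical callable: name -> reader object with
    getNumPages() and getPage(i).extractText().
    """
    has_text, no_text, skipped = [], [], []
    for name in pdf_names:
        try:
            # Opening the reader happens inside the try block, so an error
            # raised while the file is first parsed is caught here.
            reader = open_reader(name)
            page_count = reader.getNumPages()
        except Exception:
            skipped.append(name)  # unreadable file: skip it and move on
            continue
        try:
            found = any(
                WORD.search(reader.getPage(i).extractText())
                for i in range(page_count)
            )
        except Exception:
            # The question's heuristic: an extraction error is taken to
            # mean the file has some text in it.
            found = True
        (has_text if found else no_text).append(name)
    return has_text, no_text, skipped
```

With pyPdf, open_reader could be something like lambda name: pyPdf.PdfFileReader(open(os.path.join(path, name), 'rb')). The sketch leaves out the pdf.write("%%EOF") line: as far as I know PdfFileReader has no write method, so that call would itself raise inside the try block and send every page to the except branch.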