StackOverflow2013


plurals

POPython, pyPdf OCR error: pyPdf.utils.PdfReadError: EOF marker not found
primarykey
Id
6079593
data
AcceptedAnswerId
0
AnswerCount
1
ClosedDate
CommentCount
0
CommunityOwnedDate
CreationDate
2011-05-21T04:28:45.667
FavoriteCount
0
LastActivityDate
2016-08-09T12:39:52.123
LastEditDate
2017-05-23T12:10:36.947
LastEditorUserId
-1
OwnerUserId
703251
ParentId
0
PostTypeId
1
Score
2
ViewCount
3520
LastEditorDisplayName
text
Body
pyPdf throws this exception:
pyPdf.utils.PdfReadError: EOF marker not found
I don't need to fix pyPdf; I just need the EOF error to trigger an "except" block so the file gets skipped, but it doesn't work. The error still stops the program.
Background: <a href="https://stackoverflow.com/questions/6026287/batch-ocr-program-for-pdfs">Batch OCR Program for PDFs</a> and <a href="https://stackoverflow.com/questions/6053064/python-pypdf-adobe-pdf-ocr-error-unsupported-filter-lzwdecode">Python, pyPdf, Adobe PDF OCR error: unsupported filter /lzwdecode</a> ... the saga continues.
I've got 10,000 PDFs in a folder. Some are OCR'd, some are not, and I can't tell them apart. Step 1 was to figure out which ones are not OCR'd and OCR only those (see the other threads for details). So I'm using pyPdf. I get some exceptions related to unrecognized characters and unsupported filters when I try to read the text, so I guesstimated that if a file throws an exception, it has some text in it and doesn't go in the "no text" list. Problem solved, right?
Like so:
<pre><code>from pyPdf import PdfFileWriter, PdfFileReader
import sys, os, pyPdf, re

path = 'C:\Users\Homer\Documents\My Pdfs'

filelist = os.listdir(path)

has_text_list = []
does_not_have_text_list = []

for pdf_name in filelist:
    pdf_file_with_directory = os.path.join(path, pdf_name)
    pdf = pyPdf.PdfFileReader(open(pdf_file_with_directory, 'rb'))

    print pdf_name

    for i in range(0, pdf.getNumPages()):
        try:
            pdf.write("%%EOF")
            content = pdf.getPage(i).extractText()
            does_it_have_text = re.findall(r'\w{2,}', content)

            if does_it_have_text == []:
                does_not_have_text_list.append(pdf_name)
                print pdf_name
            else:
                has_text_list.append(pdf_name)
        except:
            has_text_list.append(pdf_name)

print does_not_have_text_list
</code></pre>
But then I get this error:
pyPdf.utils.PdfReadError: EOF marker not found
It seems to come up a lot (from Google): <a href="http://pdfposter.origo.ethz.ch/node/31" rel="nofollow noreferrer">http://pdfposter.origo.ethz.ch/node/31</a>
I think it means that pyPdf opened the file, attempted its text processing, raised some exception, ran the except: block, and is now unable to go on to the next step because it doesn't know that the file has ended. There are other threads like this that allege the problem has been fixed, but it doesn't seem to have been.
Then someone has a function here that writes the EOF marker to the .pdf first: <a href="http://code.activestate.com/lists/python-list/589529/" rel="nofollow noreferrer">http://code.activestate.com/lists/python-list/589529/</a>
I stuck in the pdf.write("%%EOF") line to try to mimic this, but no dice.
So how do I get that error to run the except block? I'm also using Wing IDE, so if there's a way to use the debugger to just skip over these files, that would work too. Thx.
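One possible reason the except block never runs: in the code above, pyPdf.PdfFileReader(...) is constructed before the try: begins. pyPdf parses the file in that constructor, and that is where it looks for the EOF marker, so the error would be raised outside any handler. Below is a minimal sketch of a restructured loop, assuming that is the cause. The names classify_pdfs, open_with_pypdf, and skipped are hypothetical and not from the original post. The reader is passed in as an argument so the loop can be tried without pyPdf installed, and the code avoids print statements so it runs on both Python 2 and 3.

```python
import re


def open_with_pypdf(path):
    """Hypothetical helper: open one file with pyPdf (imported lazily)."""
    import pyPdf
    # The constructor parses the file, so this is where
    # "EOF marker not found" would be raised for a damaged PDF.
    # The file is left open because pyPdf reads pages from it on demand.
    return pyPdf.PdfFileReader(open(path, 'rb'))


def classify_pdfs(paths, open_reader=open_with_pypdf):
    """Sort paths into (has_text, no_text, skipped) lists."""
    has_text, no_text, skipped = [], [], []
    for path in paths:
        try:
            # Opening and parsing happen inside the try block, so a
            # read error skips this file instead of ending the run.
            reader = open_reader(path)
            page_count = reader.getNumPages()
        except Exception:
            skipped.append(path)
            continue

        found_text = False
        for i in range(page_count):
            try:
                content = reader.getPage(i).extractText()
            except Exception:
                # Heuristic from the post: an extraction error is taken
                # to mean the page carries some kind of text.
                found_text = True
                break
            if re.search(r'\w{2,}', content):
                found_text = True
                break

        # Each file is recorded once, not once per page.
        (has_text if found_text else no_text).append(path)
    return has_text, no_text, skipped
```

Calling classify_pdfs with the full paths built from os.listdir(path) would return the three lists. The pdf.write("%%EOF") call is left out: as far as I can tell PdfFileReader has no write method, so that line would itself raise inside the try and send every page to the except branch. Catching Exception is broad on purpose; it could be narrowed to pyPdf.utils.PdfReadError once the skipped list looks right.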
Tags
<python><exception><eof><pypdf>
Title
Python, pyPdf OCR error: pyPdf.utils.PdfReadError: EOF marker not found
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USCommunity
UserOwnerUserId
1. USPatentDeathSquad
plurals
PostLinksPostIdRelatedPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
2. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POPython, pyPdf OCR error: pyPdf.utils.PdfReadError: EOF marker not found
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTDownMod
2. VO
 singulars
 PostPostId
 POPython, pyPdf OCR error: pyPdf.utils.PdfReadError: EOF marker not found
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 POPython, pyPdf OCR error: pyPdf.utils.PdfReadError: EOF marker not found
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. This table or related slice is empty.
