StackOverflow2013


plurals

POBatch OCR Program for PDFs
primarykey
Id
6026287
data
AcceptedAnswerId
0
AnswerCount
5
ClosedDate
2016-08-15T05:45:13.723
CommentCount
0
CommunityOwnedDate
CreationDate
2011-05-17T04:36:23.210
FavoriteCount
14
LastActivityDate
2016-08-15T05:13:26.707
LastEditDate
2017-05-23T12:09:36.360
LastEditorUserId
-1
OwnerUserId
756720
ParentId
0
PostTypeId
1
Score
16
ViewCount
19123
LastEditorDisplayName
text
Body
This has been asked before, but I don't really know if the answers help me. Here is my problem: I have a bunch (10,000 or so) of PDF files. Some were text files that were saved using Adobe's print feature (so their text is perfect and I don't want to risk screwing them up). And some were scanned images (so they don't have any text and I will have to settle for OCR). The files are in the same directory and I can't tell which is which. Ultimately I want to turn them into .txt files and then do string processing on them, so I want the most accurate OCR possible. It seems like people have recommended: <ol> <li>Adobe PDF (I don't have a licensed copy of this, plus if ABBYY FineReader or something is better, why pay for it if I won't use it)</li> <li>OCRopus (I can't figure out how to use this thing)</li> <li>Tesseract (which seems like it was great in 1995, but I'm not sure if there's something more accurate. Plus it doesn't do PDFs natively, so I'd have to convert to TIFF. That raises its own problem, as I don't have a licensed copy of Acrobat, so I don't know how I'd convert 10,000 files to TIFF. Plus I don't want 10,000 30-page documents turned into 300,000 individual TIFF images.)</li> <li>wowocr</li> <li>pdftextstream (that was from 2009)</li> <li>ABBYY FineReader (apparently it's $$$, but I will spend $600 to get this done if this thing is significantly better, i.e. has more accurate OCR)</li> </ol> Also I am a n00b to programming, so if it's going to take like weeks to learn how to do something, I would rather pay the $$$. Thx for input/experiences. BTW, I'm running Linux Mint 11 64-bit and/or Windows 7 64-bit. 
Here are the other threads: <a href="https://stackoverflow.com/questions/1561635/batch-ocring-pdfs-that-havent-already-been-ocrd">Batch OCRing PDFs that haven't already been OCR'd</a> <a href="https://stackoverflow.com/questions/5151798/open-source-ocr">Open source OCR</a> <a href="https://stackoverflow.com/questions/778145/pdf-text-extraction-approach-using-ocr">PDF Text Extraction Approach Using OCR</a> <a href="https://superuser.com/questions/107678/batch-ocr-for-many-pdf-files-not-already-ocred">https://superuser.com/questions/107678/batch-ocr-for-many-pdf-files-not-already-ocred</a>
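For the "I can't tell which is which" part, here is a minimal sketch of one possible triage step, not a tested recipe. It assumes poppler's `pdftotext` command is installed, and it guesses that a file with almost no extractable text is a scan. The function names and the 50-character threshold are made up for illustration and would need tuning on real files.

```python
import subprocess
from pathlib import Path


def needs_ocr(extracted_text: str, min_chars: int = 50) -> bool:
    """Guess that a PDF is a scan if its text layer has very few visible characters.

    The threshold is an assumption; scanned PDFs usually yield only
    whitespace and form feeds, but short text PDFs could be misclassified.
    """
    visible = "".join(extracted_text.split())  # drop spaces, newlines, form feeds
    return len(visible) < min_chars


def extract_text_layer(pdf_path: Path) -> str:
    """Return the embedded text of a PDF via pdftotext ('-' writes to stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def triage(directory: Path) -> None:
    """Write .txt files for PDFs with a text layer; report the rest for OCR."""
    for pdf_path in sorted(directory.glob("*.pdf")):
        text = extract_text_layer(pdf_path)
        if needs_ocr(text):
            print(f"{pdf_path.name}: no usable text layer, queue for OCR")
        else:
            pdf_path.with_suffix(".txt").write_text(text, encoding="utf-8")
```

Calling `triage(Path("/path/to/pdfs"))` would leave the text-based files untouched apart from writing a `.txt` next to each one, so only the scanned files ever go through whichever OCR tool ends up being chosen.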
Tags
<pdf><ocr><batch-processing>
Title
Batch OCR Program for PDFs
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USCommunity
UserOwnerUserId
1. USAquat33nfan
plurals
PostLinksPostIdRelatedPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
2. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
3. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostLinksRelatedPostIdPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
2. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
3. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
3. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POBatch OCR Program for PDFs
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POBatch OCR Program for PDFs
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 POBatch OCR Program for PDFs
 UserUserId
 USWavyx
 VoteTypeVoteTypeId
 VTFavorite
CommentsPostId
1. This table or related slice is empty.


Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Related rows are linked in either a to-one or a to-many fashion. The caption of each to-many relation links to a list view showing a filtered view of the related table.
