Batch OCR Program for PDFs
    <p>This has been asked before, but I don't really know if the answers help me. Here is my problem: I have a bunch of PDF files (10,000 or so). Some were text files saved using Adobe's print feature (so their text is perfect and I don't want to risk screwing them up). Others were scanned images (so they don't have any text and I will have to settle for OCR). The files are all in the same directory and I can't tell which is which. Ultimately I want to turn them into .txt files and then do string processing on them, so I want the most accurate OCR possible.</p>
    <p>It seems like people have recommended:</p>
    <ol> <li>Adobe Acrobat (I don't have a licensed copy of this so ... plus if ABBYY FineReader or something is better, why pay for it if I won't use it)</li> <li>OCRopus (I can't figure out how to use this thing)</li> <li>Tesseract (which seems like it was great in 1995, but I'm not sure if there's something more accurate now. Plus it doesn't do PDFs natively and I'd have to convert to TIFF. That raises its own problem, as I don't have a licensed copy of Acrobat, so I don't know how I'd convert 10,000 files to TIFF. Plus I don't want 10,000 30-page documents turned into 300,000 individual TIFF images)</li> <li>wowocr</li> <li>PDFTextStream (that was from 2009)</li> <li>ABBYY FineReader (apparently it's $$$, but I will spend $600 to get this done if this thing is significantly better, i.e. has more accurate OCR)</li> </ol>
    <p>Also, I am a n00b to programming, so if it's going to take weeks to learn how to do something, I would rather pay the $$$. Thanks for any input/experiences.</p>
    <p>BTW, I'm running Linux Mint 11 64-bit and/or Windows 7 64-bit.</p>
    <p>Here are the other threads:</p>
    <p><a href="https://stackoverflow.com/questions/1561635/batch-ocring-pdfs-that-havent-already-been-ocrd">Batch OCRing PDFs that haven&#39;t already been OCR&#39;d</a></p>
    <p><a href="https://stackoverflow.com/questions/5151798/open-source-ocr">Open source OCR</a></p>
    <p><a href="https://stackoverflow.com/questions/778145/pdf-text-extraction-approach-using-ocr">PDF Text Extraction Approach Using OCR</a></p>
    <p><a href="https://superuser.com/questions/107678/batch-ocr-for-many-pdf-files-not-already-ocred">https://superuser.com/questions/107678/batch-ocr-for-many-pdf-files-not-already-ocred</a></p>
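    <p>To make the problem concrete, here is a rough sketch of the kind of batch job I have in mind. It is not a tested solution. It assumes the poppler command-line tools (<code>pdftotext</code>, <code>pdftoppm</code>) and <code>tesseract</code> are installed and on the PATH. The function names, the 100-character threshold, and the 300 dpi setting are all made up for illustration. The idea: try plain text extraction first, and only fall back to OCR when almost nothing comes out.</p>

```python
import subprocess
import tempfile
from pathlib import Path

# Assumed cutoff: fewer visible characters than this means "probably a scan".
MIN_VISIBLE_CHARS = 100


def has_text_layer(text: str, min_chars: int = MIN_VISIBLE_CHARS) -> bool:
    """Return True if the extracted text has enough non-whitespace characters."""
    visible = sum(1 for ch in text if not ch.isspace())
    return visible >= min_chars


def extract_text(pdf: Path) -> str:
    """Extract the embedded text; pdftotext writes to stdout when given '-'."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf), "-"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return result.stdout


def ocr_pdf(pdf: Path, dpi: int = 300) -> str:
    """Render pages to temporary PNGs, OCR each one, and join the results."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)],
            check=True,
        )
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            out = subprocess.run(
                ["tesseract", str(image), "stdout"],
                capture_output=True, encoding="utf-8", errors="replace",
                check=True,
            )
            pages.append(out.stdout)
    return "\f".join(pages)


def convert_dir(src: Path, dst: Path) -> None:
    """Write one .txt per PDF, using OCR only when no text layer is found."""
    dst.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(src.glob("*.pdf")):
        text = extract_text(pdf)
        if not has_text_layer(text):
            text = ocr_pdf(pdf)
        (dst / (pdf.stem + ".txt")).write_text(text, encoding="utf-8")
```

    <p>Calling <code>convert_dir(Path("pdfs"), Path("txt"))</code> (both directory names are placeholders) would leave the original PDFs untouched. Page images only live in a temporary directory that is deleted after each file, so nothing piles up on disk. Whether the OCR accuracy is good enough is exactly what I'm asking about.</p>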