<p>This is an interesting problem. If you are willing to work on Windows in .NET, you can do this with <a href="http://www.atalasoft.com" rel="nofollow">dotImage</a> (disclaimer: I work for Atalasoft and wrote most of the OCR engine code). Let's break the problem down into pieces. The first is iterating over all your PDFs:</p>

<pre><code>string[] candidatePDFs = Directory.GetFiles(sourceDirectory, "*.pdf");
PdfDecoder decoder = new PdfDecoder();

foreach (string path in candidatePDFs) {
    using (FileStream stm = new FileStream(path, FileMode.Open)) {
        if (decoder.IsValidFormat(stm)) {
            ProcessPdf(path, stm);
        }
    }
}
</code></pre>

<p>This gets a list of all files that end in .pdf and, if the file is a valid PDF, calls a routine to process it:</p>

<pre><code>public void ProcessPdf(string path, Stream stm)
{
    using (Document doc = new Document(stm)) {
        int i = 0;
        foreach (Page p in doc.Pages) {
            if (p.SingleImageOnly) {
                ProcessWithOcr(path, stm, i);
            }
            else {
                ProcessWithTextExtract(path, stm, i);
            }
            i++;
        }
    }
}
</code></pre>

<p>This opens the file as a Document object and asks whether each page is image-only. If so, it OCRs the page; otherwise it extracts the text:</p>

<pre><code>public void ProcessWithOcr(string path, Stream pdfStm, int page)
{
    using (Stream textStream = GetTextStream(path, page)) {
        PdfDecoder decoder = new PdfDecoder();
        using (AtalaImage image = decoder.Read(pdfStm, page)) {
            ImageCollection coll = new ImageCollection();
            coll.Add(image);
            ImageCollectionImageSource source = new ImageCollectionImageSource(coll);
            OcrEngine engine = GetOcrEngine();
            engine.Initialize();
            engine.Translate(source, "text/plain", textStream);
            engine.Shutdown();
        }
    }
}
</code></pre>

<p>What this does is rasterize the PDF page into an image and put it into a form that is palatable to engine.Translate. This doesn't strictly need to be done this way - one could get an OcrPage object from the engine from an AtalaImage by calling Recognize, but then it would be up to client code to loop over the structure and write out the text.</p>

<p>You'll note that I've left out GetOcrEngine() - we make four OCR engines available for client use: Tesseract, GlyphReader, RecoStar, and Iris. You would select the one that best fits your needs.</p>

<p>Next, you need the code to extract text from the pages that already have perfectly good text on them:</p>

<pre><code>public void ProcessWithTextExtract(string path, Stream pdfStream, int page)
{
    using (Stream textStream = GetTextStream(path, page))
    using (StreamWriter writer = new StreamWriter(textStream)) {
        using (PdfTextDocument doc = new PdfTextDocument(pdfStream)) {
            PdfTextPage textPage = doc.GetPage(page);
            writer.Write(textPage.GetText(0, textPage.CharCount));
        }
    }
}
</code></pre>

<p>This extracts the text from the given page and writes it to the output stream.</p>

<p>Finally, you need GetTextStream():</p>

<pre><code>public Stream GetTextStream(string sourcePath, int pageNo)
{
    string dir = Path.GetDirectoryName(sourcePath);
    string fname = Path.GetFileNameWithoutExtension(sourcePath);
    string finalPath = Path.Combine(dir, String.Format("{0}p{1}.txt", fname, pageNo));
    return new FileStream(finalPath, FileMode.Create);
}
</code></pre>

<p>Will this be a 100% solution? No, certainly not. You could imagine a PDF page that contains a single image with a box drawn around it - this would fail the image-only test yet yield no useful text. A better approach is probably to use the extracted text first and, if that doesn't return anything, then try an OCR engine. Switching from one approach to the other is a matter of writing a different predicate.</p>
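<p>That text-first variant might look roughly like this. The helper name ProcessPdfTextFirst is mine, and I'm assuming PdfTextDocument reports its page count - check the current dotImage API for the exact member name:</p>

<pre><code>// Sketch only: extract text first, and OCR a page only when extraction
// yields nothing. Assumes PdfTextDocument exposes a page count.
public void ProcessPdfTextFirst(string path, Stream stm)
{
    using (PdfTextDocument doc = new PdfTextDocument(stm)) {
        for (int i = 0; i &lt; doc.PageCount; i++) {
            if (doc.GetPage(i).CharCount &gt; 0) {
                ProcessWithTextExtract(path, stm, i);
            }
            else {
                ProcessWithOcr(path, stm, i);
            }
        }
    }
}
</code></pre>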