    <p>I've used <a href="http://pdftohtml.sourceforge.net/" rel="noreferrer">pdftohtml</a> to successfully strip tables out of PDF into CSV. It's based on <a href="http://www.foolabs.com/xpdf/portsntools.html" rel="noreferrer">Xpdf</a>, a more general-purpose toolkit that includes <a href="http://en.wikipedia.org/wiki/Pdftotext" rel="noreferrer">pdftotext</a>. I just wrap it as a Process.Start call from C#.</p>
    <p>If you're looking for something a little more DIY, there's the <a href="http://itextsharp.sourceforge.net/" rel="noreferrer">iTextSharp</a> library - a port of Java's <a href="http://www.1t3xt.com/products/index.php" rel="noreferrer">iText</a> - and <a href="http://www.pdfbox.org/" rel="noreferrer">PDFBox</a> (yes, it says Java - but they have a .NET version by way of <a href="http://www.ikvm.net/" rel="noreferrer">IKVM.NET</a>). Here are some CodeProject articles on using <a href="http://www.codeproject.com/KB/cs/PDFToText.aspx" rel="noreferrer">iTextSharp</a> and <a href="http://www.codeproject.com/KB/string/pdf2text.aspx" rel="noreferrer">PDFBox</a> from C#.</p>
    <p>And, if you're <em>really</em> a masochist, you could call into Adobe's <a href="http://www.adobe.com/support/downloads/detail.jsp?ftpID=2611" rel="noreferrer">PDF IFilter</a> with COM interop. The <a href="http://msdn.microsoft.com/en-us/library/ms691105.aspx" rel="noreferrer">IFilter spec</a> is pretty simple, but I would guess that the interop overhead would be significant.</p>
    <p>Edit: After re-reading the question and subsequent answers, it's become clear that the OP is dealing with <em>images</em> in his PDF. In that case, you'll need to extract the images (the PDF libraries above are able to do that fairly easily) and run them through an OCR engine.</p>
    <p>I've used <a href="http://en.wikipedia.org/wiki/Microsoft_Office_Document_Imaging" rel="noreferrer">MODI</a> interactively before, with decent results. 
It's COM, so calling it from C# via interop is also <a href="http://secure.codeproject.com/KB/office/OCRSampleApplication.aspx" rel="noreferrer">doable</a> and pretty <a href="http://msdn.microsoft.com/en-us/library/aa167607.aspx" rel="noreferrer">simple</a>:</p>
<pre><code>' lifted from http://en.wikipedia.org/wiki/Microsoft_Office_Document_Imaging
' needs a COM reference to MODI and "Imports System.IO" at the top of the file
Dim inputFile As String = "C:\test\multipage.tif"
Dim strRecText As String = ""
Dim Doc1 As MODI.Document

Doc1 = New MODI.Document
Doc1.Create(inputFile)
Doc1.OCR() ' this will ocr all pages of a multi-page tiff file
Doc1.Save() ' this will save the deskewed reoriented images, and the OCR text, back to the inputFile

For imageCounter As Integer = 0 To (Doc1.Images.Count - 1) ' work your way through each page of results
    strRecText &amp;= Doc1.Images(imageCounter).Layout.Text ' this puts the ocr results into a string
Next

File.AppendAllText("C:\test\testmodi.txt", strRecText) ' write the OCR file out to disk

Doc1.Close() ' clean up
Doc1 = Nothing
</code></pre>
<p>Others like <a href="http://code.google.com/p/tesseract-ocr/" rel="noreferrer">Tesseract</a>, but I have no direct experience with it. I've heard both good and bad things about it, so I imagine it greatly depends on your source quality.</p>
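<p>For the command-line route mentioned at the top, wrapping an Xpdf tool in a Process.Start call could look roughly like the sketch below. It is only an illustration, not necessarily the wrapper described above: it uses pdftotext rather than pdftohtml, assumes the pdftotext executable is on the PATH, and the names PdfTextExtractor and ExtractText are hypothetical.</p>

```csharp
using System;
using System.Diagnostics;

static class PdfTextExtractor   // hypothetical helper name, for illustration only
{
    // Runs pdftotext on the given PDF and returns the extracted text.
    // Assumption: the pdftotext executable can be found on the PATH.
    public static string ExtractText(string pdfPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "pdftotext",
            // -layout keeps the physical page layout (useful for tables);
            // "-" as the output file sends the text to standard output.
            Arguments = $"-layout \"{pdfPath}\" -",
            UseShellExecute = false,        // required to redirect output
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            // Read all output before waiting, so a full pipe cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    $"pdftotext exited with code {process.ExitCode}");

            return text;
        }
    }
}
```

<p>Calling <code>PdfTextExtractor.ExtractText(@"C:\test\report.pdf")</code> (the path is just a placeholder) would return the text as one string, which you can then split into rows and columns yourself. Reading standard output directly avoids a temporary file, at the cost of holding the whole document in memory.</p>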