StackOverflow2013

plurals

PO
primarykey
Id
158824
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
2
CommunityOwnedDate
CreationDate
2008-10-01T17:43:25.157
FavoriteCount
0
LastActivityDate
2008-10-01T17:59:47.940
LastEditDate
2008-10-01T17:59:47.940
LastEditorUserId
2199
OwnerUserId
2199
ParentId
158479
PostTypeId
2
Score
22
ViewCount
0
LastEditorDisplayName
Mark Brackett
text
Body
I've used <a href="http://pdftohtml.sourceforge.net/" rel="noreferrer">pdftohtml</a> to successfully strip tables out of PDF into CSV. It's based on <a href="http://www.foolabs.com/xpdf/portsntools.html" rel="noreferrer">Xpdf</a>, a more general-purpose toolset that includes <a href="http://en.wikipedia.org/wiki/Pdftotext" rel="noreferrer">pdftotext</a>. I just wrap it as a Process.Start call from C#. If you're looking for something a little more DIY, there's the <a href="http://itextsharp.sourceforge.net/" rel="noreferrer">iTextSharp</a> library - a port of Java's <a href="http://www.1t3xt.com/products/index.php" rel="noreferrer">iText</a> - and <a href="http://www.pdfbox.org/" rel="noreferrer">PDFBox</a> (yes, it says Java - but they have a .NET version by way of <a href="http://www.ikvm.net/" rel="noreferrer">IKVM.NET</a>). Here are some CodeProject articles on using <a href="http://www.codeproject.com/KB/cs/PDFToText.aspx" rel="noreferrer">iTextSharp</a> and <a href="http://www.codeproject.com/KB/string/pdf2text.aspx" rel="noreferrer">PDFBox</a> from C#. And, if you're really a masochist, you could call into Adobe's <a href="http://www.adobe.com/support/downloads/detail.jsp?ftpID=2611" rel="noreferrer">PDF IFilter</a> with COM interop. The <a href="http://msdn.microsoft.com/en-us/library/ms691105.aspx" rel="noreferrer">IFilter spec</a> is pretty simple, but I would guess that the interop overhead would be significant. Edit: After re-reading the question and subsequent answers, it's become clear that the OP is dealing with images in his PDF. In that case, you'll need to extract the images (the PDF libraries above are able to do that fairly easily) and run them through an OCR engine. I've used <a href="http://en.wikipedia.org/wiki/Microsoft_Office_Document_Imaging" rel="noreferrer">MODI</a> interactively before, with decent results. 
It's COM, so calling it from C# via interop is also <a href="http://secure.codeproject.com/KB/office/OCRSampleApplication.aspx" rel="noreferrer">doable</a> and pretty <a href="http://msdn.microsoft.com/en-us/library/aa167607.aspx" rel="noreferrer">simple</a>: <pre><code>' lifted from http://en.wikipedia.org/wiki/Microsoft_Office_Document_Imaging
' requires Imports System.IO (for File) and a reference to the MODI COM library
Dim inputFile As String = "C:\test\multipage.tif"
Dim strRecText As String = ""
Dim Doc1 As MODI.Document

Doc1 = New MODI.Document
Doc1.Create(inputFile)
Doc1.OCR()  ' this will ocr all pages of a multi-page tiff file
Doc1.Save() ' this will save the deskewed reoriented images, and the OCR text, back to the inputFile

For imageCounter As Integer = 0 To (Doc1.Images.Count - 1) ' work your way through each page of results
    strRecText &= Doc1.Images(imageCounter).Layout.Text    ' this puts the ocr results into a string
Next

File.AppendAllText("C:\test\testmodi.txt", strRecText) ' write the OCR file out to disk

Doc1.Close() ' clean up
Doc1 = Nothing
</code></pre> Others like <a href="http://code.google.com/p/tesseract-ocr/" rel="noreferrer">Tesseract</a>, but I have no direct experience with it. I've heard both good and bad things about it, so I imagine it greatly depends on your source quality.
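For illustration, wrapping pdftotext in a Process.Start call might look roughly like the following. This is a minimal sketch, not the code from the answer itself: the class and method names (PdfToTextRunner, BuildArguments, ExtractText) are hypothetical, and it assumes that pdftotext is on the PATH and that the -layout option and the "-" output argument (write to stdout) behave as in the Xpdf command-line tool.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper; names are illustrative only.
public static class PdfToTextRunner
{
    // Builds the pdftotext argument string.
    // "-layout" asks for the physical page layout to be preserved,
    // and the trailing "-" sends the extracted text to stdout.
    public static string BuildArguments(string pdfPath)
    {
        return "-layout \"" + pdfPath + "\" -";
    }

    // Runs pdftotext on the given PDF and returns the extracted text.
    // Assumes the executable can be found via exePath (default: on the PATH).
    public static string ExtractText(string pdfPath, string exePath = "pdftotext")
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = BuildArguments(pdfPath),
            UseShellExecute = false,        // required to redirect stdout
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            // Read all output before waiting, so a full pipe cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "pdftotext exited with code " + process.ExitCode);
            }
            return text;
        }
    }
}
```

Calling PdfToTextRunner.ExtractText(@"C:\test\input.pdf") would then return the text of the PDF as a string. Note that this only helps for PDFs with a text layer; scanned pages still need the OCR step described above.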
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. PO Programmatically recognize text from scans in a PDF File
 singulars
 PostTypePostTypeId
 PT Question
PostTypePostTypeId
1. PT Answer
UserLastEditorUserId
1. US Mark Brackett
UserOwnerUserId
1. US Mark Brackett
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. PO Programmatically recognize text from scans in a PDF File
 singulars
 PostTypePostTypeId
 PT Question
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VT UpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VT UpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VT UpMod
CommentsPostId
1. CO This was an excellent list of resources.. thanks
 singulars
 PostPostId
 PO
 UserUserId
 US torial

Querying!

Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Related rows are linked in either a to-one or a to-many fashion. The caption of a to-many relation links to a list view of the related table, filtered to the matching rows.

Try browsing until you find a non-empty to-many entry, then click its caption to open that list view. To return to the root, click the database name in the header.