StackOverflow2013


plurals

PO
primarykey
Id
3722606
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
9
CommunityOwnedDate
CreationDate
2010-09-15T23:25:40.527
FavoriteCount
0
LastActivityDate
2010-09-15T23:25:40.527
LastEditDate
LastEditorUserId
0
OwnerUserId
359307
ParentId
3650957
PostTypeId
2
Score
31
ViewCount
0
LastEditorDisplayName
text
Body
As of today I know: the best tool for text extraction from PDFs is <a href="http://www.pdflib.com/newsticker/single-news/article/pdflib-tet-4-product-family-available/" rel="noreferrer">TET, the text extraction toolkit</a>. TET is part of the PDFlib.com family of products. PDFlib.com is Thomas Merz's company. In case you don't recognize his name: Thomas Merz is the author of the "PostScript and PDF Bible".
TET's first incarnation is <a href="http://www.pdflib.com/products/tet/" rel="noreferrer">a library</a>. It can probably do everything Budda006 wanted, including providing positional information about every element on the page. It can also extract images, and it recombines images that are fragmented into pieces. PDFlib.com also offers another incarnation of this technology, the <a href="http://www.pdflib.com/products/tet-plugin/" rel="noreferrer">TET plugin for Acrobat</a>. The third incarnation is the <a href="http://www.pdflib.com/products/tet-pdf-ifilter/" rel="noreferrer">PDFlib TET iFilter</a>. This is a standalone tool for user desktops. Both of these are free (as in beer) to use for private, non-commercial purposes.
And it's really powerful, way better than Adobe's own text extraction. It extracted text for me where other tools (including Adobe's) spit out only garbage. I just tested the desktop standalone tool, and what they say on their webpage is true. It has a very good command line. It handled some of my "problematic" PDF test files to my full satisfaction. From now on, this will be my recommendation for every sophisticated and challenging PDF text extraction requirement.
TET is simply awesome. It detects tables. Inside tables, it identifies cells spanning multiple columns. It identifies table rows and the contents of each table cell separately. It deals very well with hyphenation: it removes hyphens and restores complete words. It supports non-ASCII languages (including CJK, Arabic and Hebrew). When encountering ligatures, it restores the original characters... Give it a try.
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POHow to extract text from a PDF?
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. This table or related slice is empty.
UserOwnerUserId
1. USKurt Pfeifle
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. COThere is no trial version, and $440 is a bit much to "Give it a try."
 singulars
 PostPostId
 PO
 UserUserId
 USRok Strniša
2. CO@Darthenius: You must have missed this sentence: "[PDFlib TET can be evaluated without a license, but will only process PDF documents with up to 10 pages and 1 MB size unless a valid license key is applied](http://www.pdflib.com/download/tet/)".
 singulars
 PostPostId
 PO
 UserUserId
 USKurt Pfeifle

Querying!

Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Related data can be linked in a to-one or a to-many fashion. Captions of to-many relations link to a list view showing a filtered view of the table.

Try moving around until you find a non-empty to-many entry and click on the label to get to one. You can move back to the root by clicking on the database name in the header.
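
As a rough illustration of the to-one and to-many relations described above, here is a minimal Python sketch. It uses the post Ids and the vote count shown on this page; the vote Ids, the dictionary layout, and the helper names `to_one` and `to_many` are assumptions made for the example, not part of the actual database or viewer.

```python
# Rows keyed by primary key. Post Ids and the question title come from the
# detail view above; the in-memory layout itself is hypothetical.
posts = {
    3650957: {"Title": "How to extract text from a PDF?", "ParentId": None},
    3722606: {"Title": None, "ParentId": 3650957},
}

# Three up-votes reference the answer. The vote Ids are placeholders.
votes = [
    {"Id": 1, "PostId": 3722606, "VoteType": "UpMod"},
    {"Id": 2, "PostId": 3722606, "VoteType": "UpMod"},
    {"Id": 3, "PostId": 3722606, "VoteType": "UpMod"},
]


def to_one(row, foreign_key, target_table):
    """Follow a foreign key in `row` to at most one row of `target_table`."""
    key = row.get(foreign_key)
    return target_table.get(key) if key is not None else None


def to_many(rows, foreign_key, key):
    """Collect every row whose foreign key points at `key` (a filtered list view)."""
    return [r for r in rows if r[foreign_key] == key]


answer = posts[3722606]
parent = to_one(answer, "ParentId", posts)          # to-one: the question
answer_votes = to_many(votes, "PostId", 3722606)    # to-many: the answer's votes
```

A to-one lookup yields a single row or nothing (an "empty slice"), while a to-many lookup yields a list that may be empty, which mirrors how the sections above are displayed.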