<p>Yes, with Ghostscript, you <em>can</em> extract text from PDFs. But no, it is not the best tool for the job. And no, you cannot do it in "portions" (parts of single pages). What you can do: extract the text of a certain range of pages only.</p>

<h2>First: <a href="http://www.ghostscript.com/releases/" rel="noreferrer">Ghostscript's</a> <code>txtwrite</code> output device (not so good)</h2>

<pre><code>gs \
  -dBATCH \
  -dNOPAUSE \
  -sDEVICE=txtwrite \
  -dFirstPage=3 \
  -dLastPage=5 \
  -sOutputFile=- \
  /path/to/your/pdf
</code></pre>

<p>This will output all text contained on pages 3-5 to stdout. If you want output in a text file, use <code>-sOutputFile=textfilename.txt</code>.</p>

<hr>

<p><strong><code>gs</code> Update:</strong></p>

<p>Recent versions of Ghostscript have seen major improvements in the <code>txtwrite</code> device, along with bug fixes. See the <a href="http://git.ghostscript.com/?p=ghostpdl.git;a=blob_plain;f=gs/doc/History9.htm;hb=HEAD" rel="noreferrer">recent Ghostscript changelogs</a> (search for <em>txtwrite</em> on that page) for details.</p>

<hr>

<h2>Second: Ghostscript's <a href="http://git.ghostscript.com/?p=ghostpdl.git;a=blob_plain;f=gs/lib/ps2ascii.ps;hb=HEAD" rel="noreferrer"><code>ps2ascii.ps</code> PostScript utility</a> (better)</h2>

<p>This one requires you to download the latest version of the file <em>ps2ascii.ps</em> from the <a href="http://git.ghostscript.com/?p=ghostpdl.git;a=tree;f=gs/lib" rel="noreferrer">Ghostscript Git source code repository</a>. You'd have to convert your PDF to PostScript first, then run this command on the PS file:</p>

<pre><code>gs \
  -q \
  -dNODISPLAY \
  -P- \
  -dSAFER \
  -dDELAYBIND \
  -dWRITESYSTEMDICT \
  -dSIMPLE \
  /path/to/ps2ascii.ps \
  input.ps \
  -c quit
</code></pre>

<p>If the <code>-dSIMPLE</code> parameter is not defined, each output line contains, in addition to the pure text content, some extra info about the fonts and font sizes used.
</p>

<p>If you replace that parameter with <code>-dCOMPLEX</code>, you'll get additional info about the colors and images used.</p>

<p>Read the comments inside <em>ps2ascii.ps</em> to learn more about this utility. It's not comfortable to use, but it has worked for me in most cases where I needed it.</p>

<h2>Third: <a href="http://www.foolabs.com/xpdf/download.html" rel="noreferrer">XPDF's</a> <code>pdftotext</code> CLI utility (more comfortable than Ghostscript)</h2>

<p>A more comfortable way to do text extraction: use <code>pdftotext</code> (available for Windows as well as Linux/Unix or Mac OS X). This utility is based either on Poppler or on XPDF. This is a command you could try:</p>

<pre><code>pdftotext \
  -f 13 \
  -l 17 \
  -layout \
  -opw supersecret \
  -upw secret \
  -eol unix \
  -nopgbrk \
  /path/to/your/pdf \
  - | less
</code></pre>

<p>This will display the page range from 13 (<b>f</b>irst page) to 17 (<b>l</b>ast page) of a double-password-protected PDF file (using the user and owner passwords <em>secret</em> and <em>supersecret</em>), preserving the layout, with the Unix EOL convention, but without inserting page breaks between PDF pages, piped through <code>less</code>...</p>

<p><code>pdftotext -h</code> displays all available command-line options.</p>

<p>Of course, both tools only work on the text parts of PDFs (if they have any). Oh, and mathematical formulas also won't work too well... ;-)</p>

<hr>

<p><strong><em><code>pdftotext</code> Update:</em></strong></p>

<p>Recent versions of Poppler's <code>pdftotext</code> now have options to extract <em>"a portion (using coordinates) of PDF"</em> pages, like the OP asked for.
The parameters are:</p>

<ul>
<li><strong><code>-x &lt;int&gt;</code></strong>: top left corner's x-coordinate of the crop area</li>
<li><strong><code>-y &lt;int&gt;</code></strong>: top left corner's y-coordinate of the crop area</li>
<li><strong><code>-W &lt;int&gt;</code></strong>: crop area's width in pixels (defaults to 0)</li>
<li><strong><code>-H &lt;int&gt;</code></strong>: crop area's height in pixels (defaults to 0)</li>
</ul>

<p>Best if used together with the <code>-layout</code> parameter.</p>

<hr>

<h2>Fourth: MuPDF's <code>mutool draw</code> command can also extract text</h2>

<p>The cross-platform, open source <a href="http://mupdf.com/" rel="noreferrer">MuPDF</a> application (made by the same company that also develops Ghostscript) bundles a command-line tool, <code>mutool</code>. To extract text from a PDF with this tool, use:</p>

<pre><code>mutool draw -F txt the.pdf
</code></pre>

<p>This will emit the extracted text to <code>&lt;stdout&gt;</code>. Use <code>-o filename.txt</code> to write it into a file.</p>

<h2>Fifth: PDFLib's Text Extraction Toolkit (TET) (best of all... but it is PayWare)</h2>

<p><a href="http://www.pdflib.com/products/tet/" rel="noreferrer"><strong>TET</strong></a>, the Text Extraction Toolkit from the <a href="http://www.pdflib.com/" rel="noreferrer">pdflib</a> family of products, can find the x-y coordinates of text content in a PDF file (and much more). TET has a command-line interface, and it's the most powerful of all text extraction tools I'm aware of. (It can even handle ligatures...) Quote from their website:</p>

<blockquote>
<p><strong><em>Geometry</em></strong><br>
<em>TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g.
to ignore headers and footers or margins.</em></p>
</blockquote>

<p>In my experience, while it does not sport the most straightforward CLI you can imagine, once you get used to it, it does what it promises to do, for most PDFs you throw at it...</p>

<hr>

<p>And there are even more options:</p>

<ol>
<li><a href="http://podofo.sf.net/" rel="noreferrer"><code>podofotxtextract</code></a> (CLI tool) from the PoDoFo project (Open Source)</li>
<li><a href="http://calibre-ebook.com/download/" rel="noreferrer"><code>calibre</code></a> (normally a GUI program to handle eBooks, Open Source) has a command-line option that can extract text from PDFs</li>
<li><a href="http://www.abisource.com/download/" rel="noreferrer"><code>AbiWord</code></a> (a GUI word processor, Open Source) can import PDFs and save its files as .txt: <code>abiword --to=txt --to-name=output.txt input.pdf</code></li>
</ol>
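<p>The Ghostscript page-range extraction from the first section can be wrapped in a small POSIX-shell helper. This is only a sketch: the <code>extract_pages</code> name and the <code>DRY_RUN</code> convention are inventions of this example (not part of Ghostscript); the flags are exactly the ones shown above.</p>

```shell
#!/bin/sh
# Sketch: build (and optionally run) the Ghostscript txtwrite command
# for a page range.  The function name and DRY_RUN flag are this
# example's own conventions, not part of Ghostscript.
#
# Usage: extract_pages FIRST LAST INPUT [OUTPUT]
#   OUTPUT defaults to "-" (stdout).  With DRY_RUN=1 the command line
#   is only printed, so no Ghostscript install is needed to inspect it.
extract_pages() {
    first=$1; last=$2; input=$3; output=${4:--}
    set -- gs -dBATCH -dNOPAUSE -dSAFER \
        -sDEVICE=txtwrite \
        "-dFirstPage=$first" "-dLastPage=$last" \
        "-sOutputFile=$output" \
        "$input"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s ' "$@"    # print the assembled command, one line
        printf '\n'
    else
        "$@"                 # actually run Ghostscript
    fi
}
```

<p>For example, <code>DRY_RUN=1; extract_pages 3 5 /path/to/your.pdf</code> prints the same command line as the first example above; without <code>DRY_RUN</code> it runs Ghostscript directly.</p>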
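<p>Similarly, the Poppler crop-area options (<code>-x</code>/<code>-y</code>/<code>-W</code>/<code>-H</code>) can be sketched as a helper; again, the <code>crop_text</code> name and the <code>DRY_RUN</code> flag are assumptions of this sketch, not part of <code>pdftotext</code>.</p>

```shell
#!/bin/sh
# Sketch: wrap pdftotext's crop-area options for a single page.
# The helper name and DRY_RUN flag are this example's assumptions.
#
# Usage: crop_text X Y W H PAGE INPUT
#   Extracts the W x H rectangle at (X, Y) on page PAGE to stdout.
#   With DRY_RUN=1 the command is only printed (pdftotext not required).
crop_text() {
    x=$1; y=$2; w=$3; h=$4; page=$5; input=$6
    set -- pdftotext -layout \
        -f "$page" -l "$page" \
        -x "$x" -y "$y" -W "$w" -H "$h" \
        "$input" -
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s ' "$@"    # print the assembled command, one line
        printf '\n'
    else
        "$@"                 # actually run pdftotext
    fi
}
```

<p>E.g. <code>crop_text 50 100 400 300 2 in.pdf</code> would extract the 400x300 region at (50, 100) on page 2, with the layout preserved.</p>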