StackOverflow2013


Extract text from PDF document based on position C++
<p>I am trying to extract text from a PDF document based on its coordinates, and I have come across two notions in the <a href="http://partners.adobe.com/public/developer/en/pdf/PDFReference.pdf" rel="nofollow">Adobe PDF Reference</a> (chap. 5.3):</p> <ol> <li>Text positioning operators</li> <li>Text showing operators</li> </ol> <p>For now I am interested in the Td and Tm positioning operators. With <strong>Td</strong> you have <em>tx</em> and <em>ty</em>, an offset relative to the start of the current line, which is clearly specified in the PDF document as <code>tx ty Td</code>. I have used this operator to extract text by its <em>tx</em> and <em>ty</em> coordinates. The problem is that I don't know how to extract text that is positioned with Tm while supplying only <em>tx</em> and <em>ty</em>.</p> <pre><code>a b c d e f Tm </code></pre> <p>This is the general form of the Tm operator. What do the a-f values represent? This would be my input for Tm:</p> <pre><code>BT /F1 8.88 Tf 0 0 0 rg 0.9998 0 0 1 401.52 448.08 Tm [<0014>-11<0015>-11<0013>-11<000F>-19<0014>-11<0019>] TJ ET </code></pre> <p>Why does each group of four digits have a leading 00? Is this hex? Should I convert each group from hex to an integer and then to the corresponding character?</p> <p>This would be my input for Td:</p> <pre><code>BT 43.20 421.90 Td 0 Tw /C001 10.00 Tf 0.00 Tw <BlablablaTextInHexThatICanProcess>Tj ET </code></pre> <p>This is much clearer, because the coordinates are explicit. How could I extract the text from a Tm-positioned PDF text object based on simple X and Y coordinates? I am using C++ and the PoDoFo library.</p>
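<p>For orientation, here is a minimal sketch of how the two operand forms relate, going by the PDF Reference: the six Tm operands stand for the matrix [a b 0; c d 0; e f 1], so e and f carry the translation that tx and ty carry in Td. The names <code>TextMatrix</code>, <code>applyTd</code>, <code>originNear</code> and <code>decodeHexCodes</code> are hypothetical helpers, not PoDoFo API. The sketch assumes the page's current transformation matrix is the identity and that the font uses two-byte character codes, which would explain the leading 00. Turning a code into a character still needs the font's encoding or ToUnicode map, which is not shown here.</p>

```cpp
#include <cassert>
#include <cctype>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper type (not a PoDoFo class): the operands of `a b c d e f Tm`.
// a-d scale, rotate or skew the text; e and f translate it.
struct TextMatrix {
    double a, b, c, d, e, f;
};

// The text matrix right after BT is the identity.
TextMatrix identityMatrix() { return TextMatrix{1.0, 0.0, 0.0, 1.0, 0.0, 0.0}; }

// `tx ty Td`: multiply the translation [1 0 0; 0 1 0; tx ty 1] onto the line matrix.
TextMatrix applyTd(const TextMatrix& m, double tx, double ty) {
    TextMatrix r = m;
    r.e = tx * m.a + ty * m.c + m.e;
    r.f = tx * m.b + ty * m.d + m.f;
    return r;
}

// True if the text origin (e, f) lies within `tolerance` units of (x, y).
// Assumption: the current transformation matrix of the page is the identity.
bool originNear(const TextMatrix& m, double x, double y, double tolerance) {
    assert(tolerance >= 0.0);
    return std::fabs(m.e - x) <= tolerance && std::fabs(m.f - y) <= tolerance;
}

// Decode the body of a hex string such as "<0014>" into character codes.
// Assumption: two bytes (four hex digits) per code; leftover digits are ignored.
std::vector<std::uint16_t> decodeHexCodes(const std::string& hexString) {
    std::string digits;
    for (char ch : hexString) {
        if (std::isxdigit(static_cast<unsigned char>(ch))) digits += ch;
    }
    std::vector<std::uint16_t> codes;
    for (std::size_t i = 0; i + 4 <= digits.size(); i += 4) {
        codes.push_back(static_cast<std::uint16_t>(
            std::stoul(digits.substr(i, 4), nullptr, 16)));
    }
    return codes;
}
```

<p>With the values from the question, <code>originNear(TextMatrix{0.9998, 0.0, 0.0, 1.0, 401.52, 448.08}, 401.52, 448.08, 0.01)</code> holds, <code>applyTd(identityMatrix(), 43.20, 421.90)</code> gives a matrix with e = 43.20 and f = 421.90, and <code>decodeHexCodes("<0014>")</code> yields the single code 0x0014.</p>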
