
    <p>In an uncompressed PDF document, the "stream objects" that draw table rules have roughly the following format (the origin (0,0) is the bottom-left corner), in pseudo-regexp form:</p>
<pre><code>(x1 y1 m x2 y2 l [blank or newline separator])* S (BT .* ET)*
</code></pre>
    <p>where</p>
<pre><code>x1, y1, x2, y2  are coordinates
m               "move to": start a new subpath at (x1, y1)
l               "line to": append a straight line to (x2, y2)
S               stroke (draw) the path built so far
BT              Begin Text
ET              End Text
</code></pre>
    <p>All operators are postfix: the operands come first, then the operator.</p>
    <p>EDIT:</p>
    <p>One possible Java regexp (ref. PDF32000_2008.pdf), to be applied after replacing newlines with blanks in the uncompressed PDF source, is:</p>
<pre><code>((\s+\d+(\.\d+)?){2}(\s+m|\s+l|(\s+\d+(\.\d+)?){2}(\s+re|\s+y|\s+v|(\s+\d+(\.\d+)?){2}\s+c))\s*)+([SsFn]|[fBb](\*)?)
</code></pre>
    <p>There are other operators in the stream, such as "W*" or "q"/"Q". According to the same reference, "W*" sets the clipping path and "q"/"Q" save and restore the graphics state; they do not draw anything themselves. I could not find a language specification at first, so the description before the edit is what I inferred from experiments.</p>
    <p>Using this information and the coordinates of the text tokens (between BT and ET), one can infer table cell widths and the start and end positions of each table (to tell different tables apart).</p>
    <p>The remaining problem is uncompressing streams of any kind. With pdftk I was able to uncompress PDF files created with OpenOffice Writer, but arbitrary PDF files still contain cryptic symbols afterwards.</p>
    <p>Further information:</p>
    <p><a href="http://www.gnupdf.org/Introduction_to_PDF" rel="nofollow noreferrer">http://www.gnupdf.org/Introduction_to_PDF</a></p>
    <p><a href="http://blog.idrsolutions.com/2011/05/understanding-the-pdf-file-format-%E2%80%93-carriage-returns-spaces-and-other-gaps/" rel="nofollow noreferrer">http://blog.idrsolutions.com/2011/05/understanding-the-pdf-file-format-%E2%80%93-carriage-returns-spaces-and-other-gaps/</a></p>
    <p><a href="http://blog.idrsolutions.com/2012/03/understanding-the-pdf-file-format-names-locations/" rel="nofollow noreferrer">http://blog.idrsolutions.com/2012/03/understanding-the-pdf-file-format-names-locations/</a></p>
    <p><a href="http://blog.idrsolutions.com/2011/05/understanding-the-pdf-file-format-%E2%80%93-pdf-xref-tables-explained/" rel="nofollow noreferrer">http://blog.idrsolutions.com/2011/05/understanding-the-pdf-file-format-%E2%80%93-pdf-xref-tables-explained/</a></p>
    <p><a href="https://stackoverflow.com/questions/4150069/pdf-page-stream-optimizer-library">PDF page-stream optimizer library?</a></p>
    <p><a href="http://www.gnupdf.org/Stream" rel="nofollow noreferrer">http://www.gnupdf.org/Stream</a></p>
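    <p>As a rough illustration of how the regexp could be used to collect table rules, here is a minimal Java sketch. It uses the regexp above with the whitespace after each path-construction operator written as <code>\s*</code>, so that operators separated by a single blank still chain into one path. The class name <code>PathOperatorScan</code>, the helper pattern <code>SEGMENT</code> and the sample stream are assumptions made for this example only; the sketch handles just the simple "x1 y1 m x2 y2 l" case and ignores curves, rectangles and negative coordinates.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathOperatorScan {

    // Path-construction operators (m, l, re, v, y, c) with their operands,
    // repeated, followed by one path-painting operator (S, s, F, n, f, B, b, ...).
    static final Pattern PATH = Pattern.compile(
        "((\\s+\\d+(\\.\\d+)?){2}(\\s+m|\\s+l|(\\s+\\d+(\\.\\d+)?){2}"
        + "(\\s+re|\\s+y|\\s+v|(\\s+\\d+(\\.\\d+)?){2}\\s+c))\\s*)+"
        + "([SsFn]|[fBb](\\*)?)");

    // Hypothetical helper: one straight segment "x1 y1 m x2 y2 l".
    // A move-to followed by several line-to operators yields only the first segment.
    static final Pattern SEGMENT = Pattern.compile(
        "(\\d+(?:\\.\\d+)?)\\s+(\\d+(?:\\.\\d+)?)\\s+m\\s+"
        + "(\\d+(?:\\.\\d+)?)\\s+(\\d+(?:\\.\\d+)?)\\s+l");

    /** Returns {x1, y1, x2, y2} for every move-to/line-to pair inside a painted path. */
    static List<double[]> segments(String stream) {
        // Replace newlines by blanks; the leading blank lets \s+ match at offset 0.
        String flat = " " + stream.replaceAll("\\s+", " ");
        List<double[]> result = new ArrayList<>();
        Matcher path = PATH.matcher(flat);
        while (path.find()) {
            Matcher seg = SEGMENT.matcher(path.group());
            while (seg.find()) {
                result.add(new double[] {
                    Double.parseDouble(seg.group(1)),
                    Double.parseDouble(seg.group(2)),
                    Double.parseDouble(seg.group(3)),
                    Double.parseDouble(seg.group(4))
                });
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up content stream: one text object, two horizontal rules, one vertical rule.
        String sample = "BT /F1 12 Tf (Name) Tj ET\n"
            + "50 700 m 300 700 l\n"
            + "50 680 m 300 680 l\n"
            + "S\n"
            + "50 700 m 50 680 l S\n";
        for (double[] s : segments(sample)) {
            String kind = s[0] == s[2] ? "vertical"
                        : s[1] == s[3] ? "horizontal" : "other";
            System.out.println(kind + " " + Arrays.toString(s));
        }
    }
}
```

    <p>For the sample stream this reports two horizontal segments and one vertical segment. Sorting the distinct x values of the vertical segments and the distinct y values of the horizontal ones would then give candidate column and row boundaries for a table.</p>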
 
