<p>Unfortunately, I have not worked with Java, so you will have to implement this in Java yourself. Here is how I finally did it:</p>
<p><strong>1)</strong> I fetched the file from your link. In PHP this is done with <a href="http://www.php.net/manual/en/wrappers.php" rel="nofollow">@fopen("http://...")</a>.</p>
<p><strong>2)</strong> I opened it as <strong>binary</strong> (this is important) and extracted two parts:</p>
<p><strong>2.1)</strong> The data of the 3 0 obj part, which holds the creation and modification dates. I did it with a <a href="http://www.php.net/manual/en/ref.pcre.php" rel="nofollow">regex</a>. It was simple, and I mentioned it above.</p>
<p><strong>2.2)</strong> The data stream from 5 0 obj, which holds the deflated data. IMPORTANT! Microsoft Excel inserts the two bytes <code>0D 0A</code> as a line break. Do not forget this when you filter the content with a regexp: these bytes at the start and at the end must not be included in the extracted string.</p>
<p><strong>3)</strong> I inflated the compressed data with <a href="http://php.net/manual/en/function.gzuncompress.php" rel="nofollow">$uncompressed = @gzuncompress($compressed)</a> and wrote it to an external file. You can see the results <a href="https://docs.google.com/open?id=0B1YEM-11PerqVEFRSXM0M1ZVZG8" rel="nofollow">there</a>.</p>
<p><strong>4)</strong> The funniest part. The raw data inside the file is in textual format. It looks like <code>[(V)-4(RI)16(J)] TJ</code>, which means <code>VRIJ</code>: the strings in parentheses are the text, and the numbers between them are glyph position adjustments. You can read about text in PDF in the <a href="http://www.google.com/search?client=safari&amp;rls=en&amp;q=PDF+Reference+v1.7&amp;ie=UTF-8&amp;oe=UTF-8#hl=ru&amp;safe=off&amp;client=safari&amp;rls=en&amp;sclient=psy-ab&amp;q=PDF+Reference+v1.7+file%3Apdf&amp;oq=PDF+Reference+v1.7+file%3apdf&amp;gs_l=serp.3...7017.13704.0.14310.11.10.1.0.0.0.132.1000.6j4.10.0...0.0...1c.7e0s-Abrluo&amp;pbx=1&amp;bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&amp;fp=c5c0acba9dbab366&amp;biw=774&amp;bih=694" rel="nofollow">PDF Reference v1.7</a>, part 5.</p>
<p><strong>5)</strong> I believe regular expressions can help you extract and/or transform the data.</p>
<p>IMPORTANT: I said "data stream from 5 0 obj", but the object number is subject to change. You must follow the reference to the object through the dictionary->pages->page->content chain. You can find a description of these "bread crumbs" in the manual I mentioned above.</p>
<p>Unfortunately, Excel does not embed any table structure in the PDF, but you can find the coordinates of the text portions and interpret them. Either way, it is a mess.</p>
<p>Do you think, dear Merlin, that it is hard? No, dear, it is not. It is not hard, because there are no Unicode symbols. Unicode in PDF REALLY SUCKS!</p>
<p>Good luck!</p>
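<p>Steps 2.2 to 4 can be sketched roughly as follows. This is an illustration in Python, not the PHP I actually ran, and the names <code>inflate_streams</code> and <code>tj_text</code> are made up for the example. It assumes the streams are Flate-compressed, finds them by scanning for the <code>stream</code> and <code>endstream</code> keywords instead of following the object references, and ignores hex strings, nested parentheses and octal escapes.</p>

```python
import re
import zlib

# "stream" keyword (not the tail of "endstream"), its end-of-line marker
# (CRLF or LF), then the raw bytes up to the next "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)
# A literal PDF string such as (RI); backslash escapes are allowed inside.
STRING_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)
# A TJ array such as [(V)-4(RI)16(J)] TJ: strings mixed with numbers.
TJ_RE = re.compile(rb"\[((?:\((?:\\.|[^\\()])*\)|[^\]()])*)\]\s*TJ", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the inflated body of every zlib-compressed stream found."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj stops at the end of the zlib data, so the line
            # break in front of "endstream" is left over instead of breaking
            # the decompression.
            results.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not Flate-compressed; skip it
    return results


def tj_text(content):
    """Join the string pieces of each TJ array, dropping the numbers."""
    lines = []
    for array in TJ_RE.finditer(content):
        pieces = STRING_RE.findall(array.group(1))
        # Simplification: backslash + char becomes that char (no octal, no \n).
        text = b"".join(
            re.sub(rb"\\(.)", rb"\1", piece, flags=re.DOTALL) for piece in pieces
        )
        lines.append(text.decode("latin-1"))
    return lines
```

<p>For example, <code>tj_text(b"[(V)-4(RI)16(J)] TJ")</code> returns <code>['VRIJ']</code>. To use it on a real file, read the file in binary mode, pass the bytes to <code>inflate_streams</code>, and run <code>tj_text</code> on each result.</p>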