Note that there are some explanatory texts on larger screens.

plurals
  1. POScraping large pdf tables which span across multiple pages
    text
    copied!<p>I am trying to scrape <a href="http://www20.gencat.cat/docs/meteocat/Continguts/Climatologia/Anuaris/Estacions%20meteorologiques/static_files/EMAtaules2012.pdf" rel="nofollow noreferrer">PDF tables which span across multiple pages</a>. I tried many things but the best seems to be <code>pdftotext -layout</code> as <a href="https://stackoverflow.com/a/857800/684229">advised here</a>. The problem is that the resultant text file is not easy to work with, as the table layout differs across pages, so the columns are not aligned. Also note missing values in lines beginning with "Solsonès":</p> <pre><code> TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012 COMARCA CODI i NOM EMA GEN FEB MAR ABR MAI JUN JUL AGO SET OCT N Alt Camp VY Nulles 7,5 5,5 10,9 12,3 16,7 21,6 22,3 24,4 20,1 15,9 Alt Camp DQ Vila-rodona 7,9 5,6 11,0 12,0 16,6 21,6 22,0 24,3 19,9 15,8 Alt Empordà U1 Cabanes 8,2 6,5 11,7 12,6 17,5 22,0 23,1 24,4 20,4 16,6 Alt Empordà W1 Castelló d'Empúries 8,1 6,4 11,6 12,9 17,0 21,1 22,0 23,4 20,1 16,4 [...] TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012 COMARCA CODI i NOM EMA GEN FEB MAR ABR MAI JUN JUL AGO SET OCT Baix Empordà DF la Bisbal d'Empordà 6,6 5,3 10,9 12,6 17,2 21,9 22,9 24,6 20,3 16 Baix Empordà UB la Tallada d'Empordà 6,1 5,2 10,7 12,3 16,6 21,3 22,2 23,8 19,7 15 Baix Empordà UC Monells 6,1 4,6 9,9 11,4 16,5 21,7 23,0 24,5 19,6 15 [...] TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012 COMARCA CODI i NOM EMA GEN FEB MAR ABR MAI JUN JUL AGO SET OCT [...] Solsonès CA Clariana de Cardener 4,6 3,3 10,3 10,2 16,7 22,3 d.i. Solsonès Z8 el Port del Comte (2.316 m) -0,9 -6,3 -0,2 -2,0 5,3 10,5 10,9 13,8 7,8 4,2 Solsonès VO Lladurs 3,0 2,6 9,5 9,0 15,3 21,4 21,6 24,3 17,5 13,0 Solsonès VP Pinós 3,0 1,6 8,9 9,2 15,4 21,1 21,3 23,8 17,6 13,3 Solsonès XT Solsona d.i. 24,3 18,0 13,5 Tarragonès VQ Constantí 7,9 6,0 11,2 13,1 17,1 21,9 22,6 24,6 20,6 16,6 Tarragonès XE Tarragona - Complex Educatiu 10,2 7,8 12,3 14,6 18,3 23,0 24,2 26,2 23,0 * 18,4 Tarragonès DK Torredembarra 9,7 7,7 12,3 14,3 17,9 22,8 24,3 26,2 22,7 18,5 Terra Alta WD Batea 6,3 5,0 11,2 12,1 18,3 23,0 23,3 25,5 20,2 15,9 Terra Alta XP Gandesa 6,6 5,2 11,2 12,2 18,1 22,9 23,4 25,6 20,4 16,0 </code></pre> <p><a href="http://artax.karlin.mff.cuni.cz/~ttel5535/pub/so/EMAtaules2012.txt" rel="nofollow noreferrer">complete file for download - UTF8</a></p> <p><strong>So, this output is not very easy to parse. What other approach is available?</strong></p> <p>It seems that every tool I use is only capable to extract information about <em>layout</em> of the table cells, but it doesn't extract the information of belonging to particular column. This is very much apparent if the cells are empty - the empty cells are not in the output, you only get non-empty "cells" with their layout. <strong>Does the PDF itself contain this tabular information?</strong> If not, it doesn't make sense to search for tool that will extract it.</p> <p><strong>Paid solutions are not out of question</strong>, as it might in the end be cheaper than invest several working days of my time...</p> <hr> <p>What I have tried:</p> <ul> <li>copy paste - makes problems with missing values (pg 5)</li> <li>save as text from Acrobat (even worse result than copy-paste)</li> <li>open in Excel as external data source - will not recognize the table</li> <li><a href="https://www.pdftoexcelonline.com/" rel="nofollow noreferrer">https://www.pdftoexcelonline.com/</a> - results in error</li> <li><a href="http://www.pdftoexcel.org/" rel="nofollow noreferrer">http://www.pdftoexcel.org/</a> as well as their trial of Able2Extract - <a href="https://stackoverflow.com/questions/17638090/redirect-system2-stdout-to-a-file-on-windows">they messed up some columns</a>. They recognized the columns correctly in the preview but in the excel output they were messed up</li> <li><a href="http://www.pdftoword.com/" rel="nofollow noreferrer">http://www.pdftoword.com/</a> - just takes my email and never sends anything</li> <li>using python on <strong>scraperwiki</strong> <a href="http://schoolofdata.org/2013/06/18/get-started-with-scraping-extracting-simple-tables-from-pdf-documents/" rel="nofollow noreferrer">http://schoolofdata.org/2013/06/18/get-started-with-scraping-extracting-simple-tables-from-pdf-documents/</a> <strong>seems very complicated especially for non-python users</strong> and <a href="https://scraperwiki.com/" rel="nofollow noreferrer">https://scraperwiki.com/</a> is not free</li> <li><p>I have encountered several python libraries like <a href="http://blog.scraperwiki.com/2013/07/29/pdftables-a-python-library-for-getting-tables-out-of-pdf-files/" rel="nofollow noreferrer">pdftables</a> but they are not easy to use for non-python developer like me (I was not even able to run these things). Is there any easier way to accomplish the task?</p></li> <li><p>I am trying to use <code>tm</code> library in R as <a href="https://stackoverflow.com/q/9185831/684229">recommended here</a>, but <a href="https://stackoverflow.com/q/18080378/684229">I have encountered some problems</a></p></li> </ul> <p>EDIT: the Cloud SDK recommended by Ian. I registered but I absolutely don't know where to go from here - how to upload pages, recognize them etc:</p> <p><img src="https://i.stack.imgur.com/qTAMH.jpg" alt="enter image description here"></p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload