
<p>This is called PDF scraping, and it is very hard because:</p>
<ul>
<li>PDF is a document format designed to be printed, not to be parsed. Inside a PDF document, text is in no particular order (unless order matters for printing); most of the time the original text structure is lost (letters may not be grouped as words and words may not be grouped into sentences, and the order in which they are placed on the page is often random).</li>
<li>There are tons of programs that generate PDFs, and many of them are defective.</li>
</ul>
<p>Tools like PDFminer use heuristics to regroup letters and words based on their position on the page. I agree the interface is pretty low level, but it makes more sense once you know what problem they are trying to solve (in the end, what matters is choosing how close to its neighbors a letter/word/line has to be in order to be considered part of a paragraph).</p>
<p>An expensive alternative (in terms of time and computing power) is generating an image of each page and feeding it to OCR; this may be worth a try if you have a very good OCR.</p>
<p>So my answer is no, there is no such thing as a simple, effective method for extracting text from PDF files - if your documents have a known structure, you can fine-tune the rules and get good results, but it is always a gamble.</p>
<p>I would really like to be proven wrong.</p>
<p>[update]</p>
<p>The answer has not changed, but recently I was involved with two projects: one of them uses computer vision to extract data from scanned hospital forms; the other extracts data from court records. What I learned is:</p>
<ol>
<li><p>Computer vision is within reach of mere mortals in 2018.
If you have a good sample of already classified documents, you can use OpenCV or scikit-image to extract features and train a machine learning classifier to determine what type a document is.</p></li>
<li><p>If the PDF you are analyzing is "searchable", you can get very far by extracting all the text with software like <a href="https://linux.die.net/man/1/pdftotext" rel="nofollow noreferrer">pdftotext</a> and a Bayesian filter (the same kind of algorithm used to classify spam).</p></li>
</ol>
<p>So there is no reliable and effective method for extracting text from PDF files, but you may not need one in order to solve the problem at hand (document type classification).</p>
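<p>To see what the positional heuristic amounts to, here is a minimal sketch (the function name, the threshold, and the sample data are all invented for illustration, not taken from any real tool): characters arrive as (x, y, glyph) tuples in arbitrary order, and words are recovered by sorting on x and splitting wherever the horizontal gap exceeds a threshold.</p>

```python
# Hypothetical sketch of the positional grouping that PDF tools perform.
# chars: (x, y, glyph) tuples for one text line, in arbitrary order.
def group_chars_into_words(chars, max_gap=2.0):
    ordered = sorted(chars, key=lambda c: c[0])  # restore left-to-right order
    words, current = [], [ordered[0][2]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur[0] - prev[0] > max_gap:   # wide gap => word boundary
            words.append("".join(current))
            current = []
        current.append(cur[2])
    words.append("".join(current))
    return words

# The glyphs of "to be", deliberately shuffled, one x-unit per glyph:
scattered = [(5.0, 0, "b"), (0.0, 0, "t"), (6.0, 0, "e"), (1.0, 0, "o")]
print(group_chars_into_words(scattered))  # -> ['to', 'be']
```

<p>Real tools apply the same idea in two dimensions (grouping lines, then paragraphs), which is exactly why the neighbor-distance knobs exposed by their low-level interfaces matter.</p>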
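<p>The Bayesian-filter idea for document type classification can be sketched with the standard library alone. Everything below (class name, labels, training snippets) is invented for illustration; in practice the text would come from pdftotext output on "searchable" PDFs, and you would train on many documents per label.</p>

```python
# Minimal word-frequency Naive Bayes classifier, stdlib only.
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            # log prior + log likelihoods with add-one smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            total, vocab = sum(counts.values()), len(counts)
            for w in words:
                score += math.log((counts[w] + 1) / (total + vocab + 1))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes()
nb.train("plaintiff defendant hearing verdict", "court_record")
nb.train("patient admission diagnosis discharge", "hospital_form")
print(nb.classify("the defendant waived the hearing"))  # -> court_record
```

<p>This is the same family of algorithm spam filters use: it does not need the text to be perfectly reconstructed, only for label-bearing words to survive extraction, which is why it pairs well with noisy pdftotext output.</p>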
 
