Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    text
    copied!<p>I know it is poor taste to answer your own question, but I think I may have figured this out and I don't want anyone else to waste their time looking for a solution to my problem.</p> <p>I followed the suggestion in a one of the links posted in my question and re-purposed the current pdf2txt.py script included with pdfminer. Here is the function in case it is useful to anyone else. Thanks to the user skyl for posting that answer, all I had to to was make a couple of changes to make it work with the current version of pdfminer. </p> <p>This function take a pdf and creates a .txt file in the same directory with the same name.</p> <pre><code>def convert_pdf(path, outtype='txt', opts={}): import sys from pdfminer.pdfparser import PDFDocument, PDFParser from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf from pdfminer.pdfdevice import PDFDevice, TagExtractor from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter from pdfminer.cmapdb import CMapDB from pdfminer.layout import LAParams import getopt outfile = path[:-3] + outtype outdir = '/'.join(path.split('/')[:-1]) # debug option debug = 0 # input option password = '' pagenos = set() maxpages = 0 # output option # ?outfile = None # ?outtype = None outdir = None #layoutmode = 'normal' codec = 'utf-8' pageno = 1 scale = 1 showpageno = True laparams = LAParams() for (k, v) in opts: if k == '-d': debug += 1 elif k == '-p': pagenos.update( int(x)-1 for x in v.split(',') ) elif k == '-m': maxpages = int(v) elif k == '-P': password = v elif k == '-o': outfile = v elif k == '-n': laparams = None elif k == '-A': laparams.all_texts = True elif k == '-V': laparams.detect_vertical = True elif k == '-M': laparams.char_margin = float(v) elif k == '-L': laparams.line_margin = float(v) elif k == '-W': laparams.word_margin = float(v) elif k == '-F': laparams.boxes_flow = float(v) elif k == '-Y': layoutmode = v elif k == '-O': outdir = v elif k == '-t': outtype = v elif k == '-c': codec = v elif k == '-s': scale = float(v) # #PDFDocument.debug = debug #PDFParser.debug = debug CMapDB.debug = debug PDFResourceManager.debug = debug PDFPageInterpreter.debug = debug PDFDevice.debug = debug # rsrcmgr = PDFResourceManager() outtype = 'text' if outfile: outfp = file(outfile, 'w') else: outfp = sys.stdout device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams) fp = file(path, 'rb') process_pdf(rsrcmgr, device, fp, pagenos, maxpages=maxpages, password=password, check_extractable=True) fp.close() device.close() outfp.close() return </code></pre>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload