<p>The <code>PDFTextStripper</code> class you use is documented (cf. its JavaDoc comment) like this:</p>
<pre><code>* This class will take a pdf document and strip out all of the text and ignore the
* formatting and such.
</code></pre>
<p>Therefore, to get specific font information, you have to change it somewhat.</p>
<p>The font information is available in this class all along and is only discarded when a line is output; have a look at its <a href="http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java" rel="nofollow">source</a>:</p>
<pre><code>protected void writePage() throws IOException {
    [...]
    for( int i = 0; i &lt; charactersByArticle.size(); i++) {
        [...]
        List&lt;TextPosition&gt; line = new ArrayList&lt;TextPosition&gt;();
        [...]
        while( textIter.hasNext() ) {
            [...]
            if( lastPosition != null ) {
                [...]
                if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine)) {
                    writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
                    line.clear();
                    [...]
                }
............
</code></pre>
<p>The <code>TextPosition</code> instances in that list <code>line</code> still carry all formatting information, including the font used; only when <code>line</code> is "normalized" is it reduced to plain characters.</p>
<p>To keep the font information, therefore, you have different options, depending on how you want to retrieve it:</p>
<ul>
<li><p>If you want to continue retrieving all page content information (including fonts) in a single String via <code>getText</code>, change the method</p>
<pre><code>private List&lt;String&gt; normalize(List&lt;TextPosition&gt; line, boolean isRtlDominant, boolean hasRtl)
</code></pre>
<p>to include font tags of your choice (e.g. <code>[Arial]</code>) whenever the font changes. Unfortunately, this method is private. Thus, you have to copy the whole <a href="http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java" rel="nofollow"><code>PDFTextStripper</code></a> class and change the code of the copy.</p></li>
<li><p>If you want to retrieve the specific font information in a different structure (e.g. as <code>List&lt;List&lt;TextPosition&gt;&gt;</code>), you can derive your own stripper class from <a href="http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java" rel="nofollow"><code>PDFTextStripper</code></a>, add a variable of your desired type, and override the <code>protected</code> method <code>writePage</code> mentioned above: copy it and enhance it only right before or after the line</p>
<pre><code>writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
</code></pre>
<p>with code that adds the information to your new variable. E.g.</p>
<pre><code>public class MyPDFTextStripper extends PDFTextStripper {
    public List&lt;List&lt;TextPosition&gt;&gt; myLines = new ArrayList&lt;List&lt;TextPosition&gt;&gt;();
    [...]
                if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine)) {
                    writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
                    myLines.add(new ArrayList&lt;TextPosition&gt;(line));
                    line.clear();
                    [...]
                }
</code></pre>
<p>Now you can call <code>getText</code> on an instance of your <code>MyPDFTextStripper</code>, retrieve the plain text as the result, and access the additional data via the new variable.</p></li>
</ul>
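<p>The font-tag idea from the first option can be sketched independently of PDFBox. The following is only a minimal illustration under stated assumptions: <code>Glyph</code> is a hypothetical stand-in for <code>TextPosition</code> (it holds just a piece of text and a font name), and <code>tagFontChanges</code> is a hypothetical helper, not a PDFBox method. In a modified copy of <code>PDFTextStripper</code>, the font would come from the <code>TextPosition</code> itself and the logic would live inside <code>normalize</code>.</p>

```java
import java.util.ArrayList;
import java.util.List;

public class FontTagSketch {

    /** Hypothetical stand-in for PDFBox's TextPosition: a piece of text plus its font name. */
    static final class Glyph {
        final String text;
        final String fontName;

        Glyph(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    /**
     * Joins the glyphs of one line into a String and emits a tag such as
     * [Arial] whenever the font differs from that of the previous glyph.
     */
    static String tagFontChanges(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        String currentFont = null; // no font seen yet, so the first glyph always gets a tag
        for (Glyph g : line) {
            if (!g.fontName.equals(currentFont)) {
                out.append('[').append(g.fontName).append(']');
                currentFont = g.fontName;
            }
            out.append(g.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<Glyph>();
        line.add(new Glyph("Hello ", "Arial"));
        line.add(new Glyph("bold", "Arial-Bold"));
        line.add(new Glyph(" world", "Arial"));
        // prints: [Arial]Hello [Arial-Bold]bold[Arial] world
        System.out.println(tagFontChanges(line));
    }
}
```

<p>Tracking only the previous font keeps the output compact: a tag appears once per run of identically formatted text rather than once per character.</p>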