<p>Let me try pointing you in a different direction. iTextSharp has a really beautiful and simple text extraction system that handles some of the basic tokens. Unfortunately, it doesn't handle color information, but <a href="https://stackoverflow.com/questions/5872051/how-to-get-text-with-a-certain-color-from-a-pdf-c/5873831#5873831">according to @Mark Storer it might not be too hard to implement yourself</a>.</p>

<p><strong>BEGIN EDIT</strong></p>

<p>I started work on implementing color information. See <a href="http://chrishaas.wordpress.com/2011/07/31/getting-color-information-from-itextsharps-textrenderinfo-and-itextextractionstrategy/" rel="noreferrer">my blog post here</a> for more details. (Sorry for the bad formatting, heading off to dinner now.)</p>

<p><strong>END EDIT</strong></p>

<p>The code below combines several questions and answers here, including <a href="https://stackoverflow.com/questions/2375674/itextsharp-how-to-get-the-position-of-word-on-a-page/4866110#4866110">this one to get the font height</a> (although it's not exact), as well as another one (that for the life of me I can't seem to find anymore) that shows how to detect faux bold.</p>

<p>The <code>PostscriptFontName</code> returns some additional characters in front of the font name; I think it has to do with embedding font subsets.</p>

<p>Below is a complete WinForms application that targets iTextSharp 5.1.1.0 and extracts text as HTML.</p>

<p><strong>Screenshot of sample PDF</strong></p>

<p><strong><img src="https://i.stack.imgur.com/zNNk7.png" alt="Screenshot of sample PDF"></strong></p>

<p><strong>Sample text extracted as HTML</strong></p>

<pre><code>&lt;span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407"&gt;Hello &lt;/span&gt;
&lt;span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:11.61407"&gt;w&lt;/span&gt;
&lt;span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:37.87201"&gt;o&lt;/span&gt;
&lt;span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:11.61407"&gt;rl&lt;/span&gt;
&lt;span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407"&gt;d &lt;/span&gt;
&lt;br /&gt;
&lt;span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407"&gt;Test &lt;/span&gt;
</code></pre>

<p><strong>Code</strong></p>

<pre><code>using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;

namespace WindowsFormsApplication2
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            PdfReader reader = new PdfReader(System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Document.pdf"));
            TextWithFontExtractionStategy S = new TextWithFontExtractionStategy();
            string F = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, S);
            Console.WriteLine(F);
            this.Close();
        }

        public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
        {
            //HTML buffer
            private StringBuilder result = new StringBuilder();

            //Store last used properties
            private Vector lastBaseLine;
            private string lastFont;
            private float lastFontSize;

            //http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html
            private enum TextRenderMode
            {
                FillText = 0,
                StrokeText = 1,
                FillThenStrokeText = 2,
                Invisible = 3,
                FillTextAndAddToPathForClipping = 4,
                StrokeTextAndAddToPathForClipping = 5,
                FillThenStrokeTextAndAddToPathForClipping = 6,
                AddTextToPaddForClipping = 7
            }

            public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
            {
                string curFont = renderInfo.GetFont().PostscriptFontName;

                //Check if faux bold is used
                if ((renderInfo.GetTextRenderMode() == (int)TextRenderMode.FillThenStrokeText))
                {
                    curFont += "-Bold";
                }

                //This code assumes that if the baseline changes then we're on a newline
                Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
                Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
                iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
                Single curFontSize = rect.Height;

                //See if something has changed, either the baseline, the font or the font size
                if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) || (curFontSize != lastFontSize) || (curFont != lastFont))
                {
                    //If we've put down at least one span tag, close it
                    if ((this.lastBaseLine != null))
                    {
                        this.result.AppendLine("&lt;/span&gt;");
                    }

                    //If the baseline has changed then insert a line break
                    if ((this.lastBaseLine != null) &amp;&amp; curBaseline[Vector.I2] != lastBaseLine[Vector.I2])
                    {
                        this.result.AppendLine("&lt;br /&gt;");
                    }

                    //Create an HTML tag with appropriate styles
                    this.result.AppendFormat("&lt;span style=\"font-family:{0};font-size:{1}\"&gt;", curFont, curFontSize);
                }

                //Append the current text
                this.result.Append(renderInfo.GetText());

                //Set currently used properties
                this.lastBaseLine = curBaseline;
                this.lastFontSize = curFontSize;
                this.lastFont = curFont;
            }

            public string GetResultantText()
            {
                //If we wrote anything then we'll always have a missing closing tag, so close it here
                if (result.Length &gt; 0)
                {
                    result.Append("&lt;/span&gt;");
                }
                return result.ToString();
            }

            //Not needed
            public void BeginTextBlock() { }
            public void EndTextBlock() { }
            public void RenderImage(ImageRenderInfo renderInfo) { }
        }
    }
}
</code></pre>
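<p>If the extra characters in front of <code>PostscriptFontName</code> get in the way, here is a minimal sketch for stripping them. It assumes the prefix is a font-subset tag of six uppercase letters followed by a plus sign, as in <code>NJNSWD+Papyrus-Regular</code> above. <code>FontNameHelper</code> and <code>StripSubsetPrefix</code> are hypothetical names of my own, not part of iTextSharp.</p>

<pre><code>using System;
using System.Text.RegularExpressions;

public static class FontNameHelper
{
    //Assumption: a subset tag is exactly six uppercase ASCII letters followed by '+'
    private static readonly Regex SubsetPrefix = new Regex("^[A-Z]{6}\\+", RegexOptions.Compiled);

    //Returns the font name without a leading subset tag; names without a tag are returned unchanged
    public static string StripSubsetPrefix(string postscriptFontName)
    {
        if (string.IsNullOrEmpty(postscriptFontName))
        {
            return postscriptFontName;
        }
        return SubsetPrefix.Replace(postscriptFontName, string.Empty);
    }
}
</code></pre>

<p>Inside <code>RenderText</code> you could then write <code>string curFont = FontNameHelper.StripSubsetPrefix(renderInfo.GetFont().PostscriptFontName);</code>, which would turn <code>NJNSWD+Papyrus-Regular</code> into <code>Papyrus-Regular</code>.</p>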