<p>I've been down the same road as you, with even more complex tasks.</p>
<p>After trying out everything, I ended up using C# under Mono (so it runs on Linux) with iTextSharp.</p>
<p>Even with a very complete library such as iTextSharp, some tasks required a lot of trial and error :)</p>
<p>Extracting the text from a page is easy (see the snippet below). However, if you intend to keep the text coordinates, fonts and sizes, you will have more work to do.</p>
<pre><code>// Requires: using iTextSharp.text.pdf;
int pdf_page = 5;
string page_text = "";

PdfReader reader = new PdfReader("path/to/pdf/file.pdf");
// Tokenise the raw content stream of the chosen page
PRTokeniser token = new PRTokeniser(reader.GetPageContent(pdf_page));
while (token.NextToken())
{
    if (token.TokenType == PRTokeniser.TokType.STRING)
    {
        // String operands hold the text that is shown on the page
        page_text += token.StringValue;
    }
    else if (token.StringValue == "Tj")
    {
        // "Tj" is the show-text operator; separate consecutive strings with a space
        page_text += " ";
    }
}
</code></pre>
<p>Do a <code>Console.WriteLine(token.StringValue)</code> on all tokens to see how paragraphs of text are structured in PDFs. This way you can detect coordinates, font, font size, etc.</p>
<p><strong>Addition:</strong></p>
<p>Given the task you are required to do, I have a suggestion for you:</p>
<p>Extract the text together with its coordinates, font families and sizes, that is, all the information about each paragraph. Then do a PDF-to-images conversion, and in your online viewer overlay invisible, selectable text on the paragraphs in the image where needed.</p>
<p>This way your users can select part of the text where needed, without you having to reconstruct the whole PDF in HTML :)</p>
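<p>To make the token-dump idea above more concrete, here is a minimal sketch of how font, size and position could be recovered from the operators that surround each string. It does not use iTextSharp at all: <code>ContentStreamSketch</code>, <code>TextRun</code> and <code>Parse</code> are hypothetical names made up for this illustration, and the input is assumed to be an already decoded, simplified content stream (whitespace-separated tokens, literal strings without escapes or nested parentheses, only the <code>BT</code>, <code>Tf</code>, <code>Td</code> and <code>Tj</code> operators interpreted, no <code>Tm</code> or page transformation). Real PDFs will need considerably more handling.</p>

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper for illustration only; not part of iTextSharp.
public static class ContentStreamSketch
{
    public sealed class TextRun
    {
        public string Text = "";
        public string Font = "";
        public double Size, X, Y;
    }

    // Assumption: simplified stream, literal strings have no escapes or nested parentheses.
    public static List<TextRun> Parse(string content)
    {
        var runs = new List<TextRun>();
        var operands = new List<string>();   // operands seen since the last operator
        string font = "";
        double size = 0, x = 0, y = 0;       // x, y: start of the current text line
        int i = 0;
        while (i < content.Length)
        {
            char c = content[i];
            if (char.IsWhiteSpace(c)) { i++; continue; }
            if (c == '(')
            {
                // Literal string operand: take everything up to the next ')'
                int end = content.IndexOf(')', i);
                if (end < 0) break;          // unterminated string: stop parsing
                operands.Add(content.Substring(i + 1, end - i - 1));
                i = end + 1;
                continue;
            }
            int start = i;
            while (i < content.Length && !char.IsWhiteSpace(content[i]) && content[i] != '(') i++;
            string tok = content.Substring(start, i - start);

            double ignored;
            bool isNumber = double.TryParse(tok, NumberStyles.Float,
                                            CultureInfo.InvariantCulture, out ignored);
            if (isNumber || tok.StartsWith("/"))
            {
                operands.Add(tok);           // number or name: keep as operand
                continue;
            }

            // Anything else is treated as an operator that consumes the operands
            int n = operands.Count;
            if (tok == "BT") { x = 0; y = 0; }
            else if (tok == "Tf" && n >= 2)
            {
                font = operands[n - 2].TrimStart('/');   // e.g. "/F1" -> "F1"
                size = Num(operands[n - 1]);
            }
            else if (tok == "Td" && n >= 2)
            {
                x += Num(operands[n - 2]);   // Td offsets the start of the line
                y += Num(operands[n - 1]);
            }
            else if (tok == "Tj" && n >= 1)
            {
                runs.Add(new TextRun { Text = operands[n - 1], Font = font, Size = size, X = x, Y = y });
            }
            operands.Clear();
        }
        return runs;
    }

    private static double Num(string s)
    {
        return double.Parse(s, CultureInfo.InvariantCulture);
    }
}
```

<p>For example, calling <code>ContentStreamSketch.Parse("BT /F1 12 Tf 72 712 Td (Hello) Tj 0 -14 Td (World) Tj ET")</code> yields two runs, both with font <code>F1</code> at size 12, the first starting at (72, 712) and the second at (72, 698). The recorded position is the start of the text line, not the exact position of each glyph, which is usually enough for placing an invisible text layer over a paragraph.</p>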