An "Empty" Character Extracted from a PDF
<p>I recently tried to use PDFBox to extract text from a PDF file. It works fine for most PDFs, but for one PDF (which unfortunately I am not permitted to share), none of the periods in the sentences are extracted. Instead, I get phrases like the following:</p>
<pre><code>...what it would be It’ll be important later on...
</code></pre>
<p>It looks as if each period-space has been replaced by a single space, but that is not the case (at least on Mac OS X). If you copy the text into a text editor and move the text cursor through the phrase, there is an "empty character" right after the "t" in "feet". To reproduce:</p>
<ul>
<li>Place the cursor right before the letter "t" in "feet" and press the right arrow key. The cursor moves one step to the right.</li>
<li>Press the right arrow key again, and the cursor stays where it is.</li>
<li>Press the right arrow key one more time, and the cursor moves to the other side of the space.</li>
<li>Continuing to press the right arrow key behaves as expected.</li>
</ul>
<p>It appears that PDFBox extracted some sort of "empty character" in place of each period. I've tried to replace it in a few different ways but have had no luck:</p>
<pre><code>String oldText = text;
text = text.replace('\u0000', '.'); // Unicode null
text = text.replace('\0', '.');     // C null
System.out.println(oldText.equals(text)); // Prints true, so nothing was replaced
// Also tried text.replace(null, '.'), but it doesn't compile
</code></pre>
<p>What is this "empty character", and how can I replace it with the text that is supposed to be there?</p>
<p>EDIT: <a href="https://stackoverflow.com/a/3396779/531762">This answer</a> suggested that the character might be something like <code>\uFEFF</code>, but trying to replace it with a regex as suggested did not work.</p>
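Since guessing at candidate characters (`\u0000`, `\uFEFF`) did not work, one way to narrow this down is to print the code point of every suspicious character in the extracted string. The following is only a minimal sketch: the class name `InvisibleChars`, both helper methods, and the sample string are hypothetical, and the soft hyphen (U+00AD) in the sample is a stand-in, because the actual character in the PDF is not known from the question.

```java
import java.util.ArrayList;
import java.util.List;

class InvisibleChars {

    /**
     * Returns one description per code point whose Unicode category suggests
     * it is invisible or unusual: control, format, private use, unassigned,
     * or non-spacing mark. Ordinary whitespace (space, tab, newline) is skipped.
     */
    static List<String> describeSuspects(String text) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int type = Character.getType(cp);
            boolean suspicious = type == Character.CONTROL
                    || type == Character.FORMAT
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.NON_SPACING_MARK;
            if (suspicious && !Character.isWhitespace(cp)) {
                // Character.getName may return null for unassigned code points.
                found.add(String.format("index %d: U+%04X %s",
                        i, cp, Character.getName(cp)));
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return found;
    }

    /** Replaces every occurrence of a single code point with the given text. */
    static String replaceCodePoint(String text, int codePoint, String replacement) {
        String target = new String(Character.toChars(codePoint));
        return text.replace(target, replacement);
    }

    public static void main(String[] args) {
        // Hypothetical sample: a soft hyphen stands in for the unknown character.
        String extracted = "feet\u00AD It";
        for (String line : describeSuspects(extracted)) {
            System.out.println(line);
        }
        // Once the real code point is known, substitute it here for 0x00AD.
        System.out.println(replaceCodePoint(extracted, 0x00AD, "."));
    }
}
```

Once the real code point is known, it can be passed to `replaceCodePoint` in place of `0x00AD`. Whether a period is the right replacement in every position is an assumption that would need checking against the PDF.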