
  1. POCidfonts and mapping
    <p>OK, I've done some research on the subject, but as the title indicates I'm no expert. So here's the problem: I'm extracting text from PDFs using Python and the pdfminer library.</p> <p>I've only tried documents with Latin characters, and it works well in most cases, except when the font is not Latin/Western. The document that bugs me now uses Latin characters from a Japanese font. Adobe tells me the encoding is <code>Adobe-Identity</code>. All I get is the CID of each character, and I can't find the related CID map.</p> <p>I know I'm not using the right terms. What I mean is that the PDF tells me <code>cid=3</code> and I know the character is a space. I've manually written a map for the characters in the range <code>0x00-0xFF</code>. Some sources say it matches the "mac-roman" encoding; others disagree. Other sources say it matches the OpenType mapping, but I couldn't find anything beyond <code>0xFF</code>, and I've got CIDs above 3000.</p> <p>You can tell I'm very confused, so you're invited to correct my terminology, but what I want is a map that matches my own, extended to the range <code>0x0100-0xFFFF</code>.</p> <p>ETA: the link to the PDF in question: <a href="http://www.sas.upenn.edu/~jtigay/JapanVol.pdf" rel="nofollow">http://www.sas.upenn.edu/~jtigay/JapanVol.pdf</a><br> ETA2: I found this: <a href="ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/adobe/aj14.tar.Z" rel="nofollow">ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/adobe/aj14.tar.Z</a>. The cid2code.txt within the archive is the kind of map I'm looking for. But for all those fonts the CID column seems "shifted" by two: cid1 maps to space.<br> ETA3: corrected encoding</p>
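To make the "shifted by two" observation concrete, here is a minimal sketch of how a cid2code-style table could be loaded into a CID-to-character map with an adjustable offset. It is not a confirmed solution for this PDF. The function name `load_cid_map`, the column names `CID` and `Unicode`, the tab-separated layout, and the sample rows are all assumptions made for illustration; the real cid2code.txt should be inspected before relying on any of them.

```python
import csv

# Hypothetical sample in a cid2code-like layout (assumed: tab-separated,
# '#' comment lines, one header row, hex code points in a "Unicode" column).
SAMPLE = """\
# illustrative data only, not copied from the real cid2code.txt
CID\tUnicode
0\t*
1\t0020
2\t0021
3\t0022
"""

def load_cid_map(lines, unicode_column="Unicode", cid_offset=0):
    """Return {cid + cid_offset: character} for rows with a hex code point."""
    rows = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    mapping = {}
    for row in csv.DictReader(rows, delimiter="\t"):
        try:
            code_point = int(row[unicode_column], 16)
        except ValueError:
            continue  # skip entries that are not a plain hex value
        mapping[int(row["CID"]) + cid_offset] = chr(code_point)
    return mapping

# The table has CID 1 as space, while the PDF reports cid=3 for space,
# so the table's CIDs are shifted up by two to match the PDF.
cid_map = load_cid_map(SAMPLE.splitlines(), cid_offset=2)
print(repr(cid_map[3]))
```

Whether a constant offset holds for the whole range, including the CIDs above 3000, is something the sketch cannot confirm; it only shows how to test that hypothesis against the characters already identified by hand.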