StackOverflow2013


plurals

PO
primarykey
Id
20022119
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
2
CommunityOwnedDate
CreationDate
2013-11-16T18:31:28.293
FavoriteCount
0
LastActivityDate
2013-11-16T18:31:28.293
LastEditDate
LastEditorUserId
0
OwnerUserId
2846289
ParentId
19999809
PostTypeId
2
Score
12
ViewCount
0
LastEditorDisplayName
text
Body
<h3>A very short and simplified introduction</h3> Fonts in PDF are PDF objects: <code>Font</code> dictionaries containing numerous parameters and sub-dictionaries needed to select glyphs, show them, and translate character codes to a logical (Unicode) representation for content extraction. Fonts in layman's terms, as we see them in *.ttf or *.pfb files, are called font programs; they are either embedded or external, and are referred to by one of the sub-dictionaries of the <code>Font</code> object. <code>Font</code> objects are divided into two groups: <ul> <li>Simple fonts (Type1, Type3 or TrueType), in which glyphs are selected by single-byte character codes obtained from a string that is shown by the text-showing operators. The mapping from codes to glyphs is called the font’s encoding; it can be built into the font program, defined by the <code>Font</code> object (by predefined name or explicitly) or, under special circumstances, constructed by the viewer application according to defined rules. </li> </ul> The file in question doesn't contain simple fonts, so we won't discuss them any further. Note, though, that this over-simplified description doesn't even begin to reflect their real-life complexity. <ul> <li>Composite fonts (Type0), used to show text in which character codes can have variable length (up to 4 bytes) and which, therefore, isn't restricted to 256 code points. A Type0 font always has one descendant, a font-like object called a <code>CIDFont</code>, and, similar to the encoding of simple fonts, a <code>CMap</code> object that maps character codes to character selectors, which in PDF are always <code>CIDs</code>: integers up to 65535.</li> </ul> Now, the character selector (<code>CID</code>) is not, in general, used directly to select glyphs from the font program. For a <code>CIDFont</code> of type <code>CIDFontType2</code>, the dictionary contains a <code>CIDToGIDMap</code> entry which, as the name suggests, maps <code>CIDs</code> to glyph identifiers. 
Those <code>GIDs</code> are, at last, used to select glyphs from the embedded font program (which, for a <code>CIDFontType2</code> font, is a TrueType font program; do not confuse it with a <code>Font</code> object of <code>Subtype</code> TrueType). A <code>Font</code> object can have a <code>ToUnicode</code> resource that maps character codes to Unicode values for indexing, searching and extraction. It's called a <code>ToUnicode CMap</code> (as it follows similar syntax), but it should not be confused with the <code>CMap</code> object mentioned above. In what I call the simple case (and, I think, a sensible decision), the <code>CMap</code> is the predefined name Identity-H and the <code>CIDToGIDMap</code> is the predefined name Identity; therefore, character codes extracted from a string (the argument to a text-showing operator) are always 2-byte numbers that, effectively, directly select glyphs from the embedded TrueType program. In my experience, this is the most common scenario and, as it appears, the case against which common software is tested. But it's not the case with the file in question. 
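The CID-to-GID step described above can be sketched in a few lines of Python. This is only an illustration, not code from any PDF library: the function names are hypothetical, and the sample values are made up for the demonstration. It assumes the layout the PDF specification gives for a <code>CIDToGIDMap</code> stream, where the glyph index for CID <code>c</code> is stored in bytes <code>2*c</code> and <code>2*c + 1</code>, high-order byte first.

```python
import struct


def gid_for_cid(cid_to_gid_map, cid):
    """Return the glyph index (GID) for a CID.

    cid_to_gid_map is either the predefined name "Identity" or the raw
    bytes of a CIDToGIDMap stream (2 bytes per CID, big-endian).
    """
    if cid_to_gid_map == "Identity":
        # Identity: the CID is used directly as the glyph index.
        return cid
    (gid,) = struct.unpack_from(">H", cid_to_gid_map, 2 * cid)
    return gid


def build_cid_to_gid_stream(entries, cid_count):
    """Hypothetical helper: build a stream from a {cid: gid} dict.

    CIDs that are not listed map to GID 0.
    """
    table = bytearray(2 * cid_count)
    for cid, gid in entries.items():
        struct.pack_into(">H", table, 2 * cid, gid)
    return bytes(table)


# Illustrative values only: CID 4 mapped to GID 159.
sample_stream = build_cid_to_gid_stream({4: 159}, cid_count=8)
```

With Identity-H as the <code>CMap</code> and Identity as the <code>CIDToGIDMap</code>, <code>gid_for_cid("Identity", code)</code> simply returns the 2-byte code itself, which is the simple case described above.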
<h3>(The end of a short and simplified introduction)</h3> In our file, the text-showing operator effectively gets this string: <pre><code>0x000a 0x000a 0x000a
0x20
0x0020 0x0020 0x0020
0x20
0x0025 0x0025 0x0025
</code></pre> Of course, there are no 'groups' in the string itself; I split it this way based on the <code>CMap</code>, which contains 2 codespace ranges: <pre><code><20> <20>
<0000> <19FF>
</code></pre> To make a long story short, if we look up the character codes in the <code>CMap</code> to get CIDs, then look up the CIDs in the <code>CIDToGIDMap</code> to get GIDs, then look up the GIDs in the embedded David-Bold font to get Unicode values, here's the table: <pre><code>Code    CID  GID  Unicode  Name
0x000a  10   180  05EA     tav
0x0020  32   159  05D5     vav
0x0025  37   154  05D0     alef
0x20    228  03   0020     space
</code></pre> Now we have enough information to speculate about what confuses viewer applications. <hr> In my first attempt, I suggested that the problem is the code <code>32</code> (and <code>CID</code> 32) being used for a non-space character (see comment above). This assumption was based on a case, several years ago, in which an older version of Acrobat didn't show a character with code <code>0x20</code> when it was at the end of a string. Acrobat assumed it to be <code>space</code> when in fact, according to the encoding vector (of a simple font), it was another character. I changed this: <ul> <li><code>0x0020</code> to <code>0x0004</code> in the content stream; </li> <li>bytes 08 and 09 in the <code>CIDToGIDMap</code> to GID=159; </li> <li>the value in the <code>Widths</code> array for CID=4 to the 'vav' width; </li> <li>the <code>ToUnicode CMap</code> was adjusted accordingly. </li> <li>(+ later I tried to remove the <code><0020> 32</code> string from the <code>CMap</code>; this is not reflected in the file linked in the comment)</li> </ul> Well, it did help, but unfortunately some viewers still refused to comply with the specification. <hr> Then I thought that maybe the variable character-code width was the issue. 
I returned to the original file and changed this: <ul> <li><code>0x20</code> to <code>0x00e4</code> in the content stream;</li> <li><code><20> 228</code> to <code><00e4> 228</code> in the <code>CMap</code>;</li> <li>the <code>codespacerange</code> <code><20> <20></code> in the <code>CMap</code> deleted;</li> <li>the <code>codespacerange</code> <code><20> <20></code> in the <code>ToUnicode CMap</code> deleted.</li> </ul> <a href="https://www.dropbox.com/s/5myq23zd2w4w1k5/problematic_fixed.pdf" rel="noreferrer">This</a> file appears to open perfectly in all viewers mentioned in the original question and in the comments below. Miraculously, code <code>0x0020</code> and <code>CID</code> <code>32</code> do not interfere. <hr> The conclusion, I think, can be this: given the current state of affairs, PDF creators are NOT advised to mix single-byte and double-byte codes in a font encoding (<code>CMap</code>).
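To make the mixed-width problem concrete, here is a minimal Python sketch of how the string from the original file splits into character codes under the two codespace ranges, compared with a reader that assumes every code is 2 bytes. It is a simplification under stated assumptions: the function names are hypothetical, the matching rule (try the shortest code length first) is a shortcut rather than the full matching procedure from the PDF specification, and the fixed 2-byte reader is hypothetical, not a claim about how any particular viewer works. The code-to-CID pairs are the four rows of the table above.

```python
# Codespace ranges of the original CMap: <20> <20> and <0000> <19FF>.
CODESPACE_RANGES = [(b"\x20", b"\x20"), (b"\x00\x00", b"\x19\xff")]

# Code -> CID pairs from the table above (only the four codes in use).
CODE_TO_CID = {b"\x00\x0a": 10, b"\x00\x20": 32, b"\x00\x25": 37, b"\x20": 228}

# The string shown by the text-showing operator in the original file.
TEXT = bytes.fromhex("000a000a000a20" "002000200020" "20" "002500250025")


def in_range(code, low, high):
    """True if code has the range's length and every byte is within bounds."""
    return len(code) == len(low) and all(
        lo <= c <= hi for c, lo, hi in zip(code, low, high)
    )


def split_codes(data, ranges):
    """Split data into character codes using the codespace ranges."""
    lengths = sorted({len(low) for low, _ in ranges})
    codes, pos = [], 0
    while pos < len(data):
        for length in lengths:
            code = data[pos:pos + length]
            if any(in_range(code, lo, hi) for lo, hi in ranges):
                codes.append(code)
                pos += length
                break
        else:
            raise ValueError(f"no codespace range matches at offset {pos}")
    return codes


def naive_two_byte_split(data):
    """Hypothetical reader that assumes every code is exactly 2 bytes."""
    return [data[i:i + 2] for i in range(0, len(data), 2)]


codes = split_codes(TEXT, CODESPACE_RANGES)
cids = [CODE_TO_CID[code] for code in codes]
naive = naive_two_byte_split(TEXT)
```

Here <code>codes</code> has 11 entries, while <code>naive</code> has only 10 and goes out of step at the first single-byte <code>0x20</code>, producing codes such as <code>0x2000</code> that the <code>CMap</code> never defines. After replacing <code>0x20</code> with <code>0x00e4</code>, as in the fixed file, both ways of reading the string agree.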
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POPDF doc text shows differently in IE / Firefox / Chrome
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. This table or related slice is empty.
UserOwnerUserId
1. USuser2846289
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. POPDF doc text shows differently in IE / Firefox / Chrome
 singulars
 PostTypePostTypeId
 PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTAcceptedByOriginator
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. CO+1; nice analysis
 singulars
 PostPostId
 PO
 UserUserId
 USmkl
2. CONo doubt - you are a PDF expert. I thank you **deeply** for your efforts on this great answer. It really made me understand.
 singulars
 PostPostId
 PO
 UserUserId
 USuser1028741
