    <p>I do not know which library you use to access the inner structures of a PDF file, but the problem at hand has three distinct subproblems:</p>
    <ol> <li>Find all images in the PDF file</li> <li>Decode the images to their components</li> <li>Convert the decoded image to a DIB</li> </ol>
    <p><em>Find all Images</em></p>
    <p>Images can occur inline inside content streams or in streams attached to dictionaries. To find all images referenced from content streams, you need to find all content streams in Pages, XObjects or Patterns. Each of those can have a Resources -> XObject dictionary that references all XObjects (and an XObject can be an Image).</p>
    <p>If you ignore inline images, you can simply scan the PDF file: every dictionary of type XObject with subtype Image can be decoded.</p>
    <p><em>Decode</em></p>
    <p>All streams, whether inline in content streams or in separate objects in the PDF file, are encoded and might need post-processing using the Decode arrays. There are several filters you need to be able to undo when decoding. Flate decode (ZLIB), JPEG and CCITT (fax G3/G4) are probably the most used for images. Hopefully the PDF library you use will know how to decode the streams.</p>
    <p>Next there are Decode arrays (a bit rare), where each color component can be scaled from an input value to an output value. This is a linear transformation.</p>
    <p><em>To DIB</em></p>
    <p>Next in line is the conversion of the decoded image to a DIB. This means you need to convert the color components to something Windows can 'get' (e.g. palette, grayscale (a special palette) or RGB). PDF supports a very large variety of color spaces, and converting them to RGB is no sinecure. Your best hope here is that the PDFs you need to process only use a select subset (like RGB and palette).
    Now a DIB can be created simply by creating the bitmap header (BITMAPINFO), filling in all the data and calling the DIB creation function CreateDIBSection; then process the DIB the way your application needs.</p>
    <p><em>Epilogue</em></p>
    <p>All in all: being able to process all PDF files and find all images is quite a daunting task. If you control the source of the PDFs and you know the images are always in DeviceRGB format, always JPEG etc., and never inlined into the content stream, it is doable.</p>
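<p>To make the Decode step a little more concrete, here is a minimal sketch in Python (chosen only for brevity; it is not the API of any particular PDF library). It assumes you already hold the raw stream bytes of an image XObject, that the filter is FlateDecode without a predictor, and that samples are 8 bits per component. The helper names <code>flate_decode</code> and <code>apply_decode_array</code> are made up for this illustration.</p>

```python
import zlib


def flate_decode(stream_bytes: bytes) -> bytes:
    """Undo the FlateDecode filter, which is plain zlib/deflate.

    Assumption: no Predictor is set in DecodeParms. If one is set,
    the PNG/TIFF predictor still has to be reversed after this step.
    """
    return zlib.decompress(stream_bytes)


def apply_decode_array(samples: bytes, decode: list, n_components: int) -> list:
    """Apply a Decode array to 8-bit samples.

    The Decode array holds one (d_min, d_max) pair per color component.
    Each sample x in [0, 255] is mapped linearly to
    d_min + x * (d_max - d_min) / 255.
    """
    if len(decode) != 2 * n_components:
        raise ValueError("Decode array needs two entries per component")
    max_in = 255  # 2**bits - 1 for 8 bits per component
    out = []
    for i, sample in enumerate(samples):
        c = i % n_components  # samples are interleaved: c0 c1 c2 c0 c1 c2 ...
        d_min, d_max = decode[2 * c], decode[2 * c + 1]
        out.append(d_min + sample * (d_max - d_min) / max_in)
    return out


# Round trip with made-up data: two RGB pixels, compressed the way a
# FlateDecode stream would be.
raw = bytes([0, 128, 255, 255, 0, 0])
encoded = zlib.compress(raw)
decoded = flate_decode(encoded)

# The default Decode array for RGB is [0 1 0 1 0 1]; [1 0 1 0 1 0] inverts the image.
inverted = apply_decode_array(decoded, [1, 0, 1, 0, 1, 0], 3)
```

<p>JPEG (DCTDecode) and CCITT streams need their own decoders and are not covered by this sketch.</p>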
    1. (1) Find all images: I want to convert PDEImage objects, which are XObjects, right? I loop through the contents of each PDF page and acquire them, which I can do fine. (2) Decode: this is where I think I have my problem. I was under the impression that PDEImageGetData(image, 0, data) fills the data buffer with the color data of the image. I'm very new to PDFs, so I'm not sure what you mean by decode. Do you happen to have a code sample? (3) To DIB: I figure that I can use CreateDIBSection for this purpose. So won't it work if I simply pass the data array obtained above to it?
    2. @Sach: I do not know what decoding your PDF library already does, but let's assume it does not decode the streams. The two most used encodings for color/palette images are Flate (ZLib) and JPEG. So if you know how to use ZLib and have a JPEG decoder, your code might look like: open the PDF file -> scan all objects in the file. If an object is a dictionary, has a stream, and has Type = XObject and Subtype = Image -> decode the stream if needed. Now you have the raw bytes, which you can feed to a DIB if and only if the color space used is supported by Windows. What if it is a Lab or Separation color space...
    3. I didn't understand the bit where you say "decode stream if needed". What I do is extract PDEImage objects; then I want to convert them to DIBs so I can send those DIBs (in BYTE array form) to a separate library which does some color editing/correction. I'm not familiar with ZLib or with how to decode images. I have the luxury of knowing that the color space will only ever be sRGB or AdobeRGB.
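
On the question of passing the data array straight to a DIB: even for plain 8-bit RGB, the byte layout usually differs. PDF image samples run top to bottom in R, G, B order with tightly packed rows, whereas a 24-bit DIB stores B, G, R, pads each row to a multiple of 4 bytes, and is bottom-up when the header height is positive. A rough sketch of that repacking, written in Python only to show the byte layout (the function name `rgb_to_dib` is hypothetical, and it assumes fully decoded 8-bit, 3-component data):

```python
import struct


def rgb_to_dib(rgb: bytes, width: int, height: int) -> bytes:
    """Pack top-down 8-bit RGB samples into a 24-bit bottom-up DIB.

    Returns a 40-byte BITMAPINFOHEADER followed by the pixel rows.
    """
    row_len = width * 3
    if len(rgb) != row_len * height:
        raise ValueError("expected width * height * 3 bytes of RGB data")

    stride = (row_len + 3) & ~3          # DIB rows are padded to 4 bytes
    padding = bytes(stride - row_len)

    rows = []
    for y in range(height - 1, -1, -1):  # bottom-up: last source row first
        row = rgb[y * row_len:(y + 1) * row_len]
        bgr = bytearray(row)
        bgr[0::3], bgr[2::3] = row[2::3], row[0::3]  # swap R and B
        rows.append(bytes(bgr) + padding)
    pixels = b"".join(rows)

    # BITMAPINFOHEADER fields, little-endian: biSize, biWidth, biHeight,
    # biPlanes, biBitCount, biCompression (0 = BI_RGB), biSizeImage,
    # biXPelsPerMeter, biYPelsPerMeter, biClrUsed, biClrImportant.
    header = struct.pack("<IiiHHIIiiII",
                         40, width, height, 1, 24, 0, len(pixels), 0, 0, 0, 0)
    return header + pixels


# Made-up 2x2 image: top row red, green; bottom row blue, white.
rgb = bytes([255, 0, 0,   0, 255, 0,
             0, 0, 255,   255, 255, 255])
dib = rgb_to_dib(rgb, 2, 2)
```

In C the same loop would write into the buffer returned by CreateDIBSection. Note that this only rearranges bytes: it does no color management, so sRGB and AdobeRGB data end up with identical byte values and the receiving library has to know which profile applies.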