Note that there are some explanatory texts on larger screens.

plurals
  1. PODetecting empty pages in scanned documents
    primarykey
    data
    text
    <p>So we need to detect whether an image, created by a scanner, represents an empty page. I'm way out of my depth when it comes to image processing, so I have to run this by the community. </p> <p>Here's what I have come up with so far: </p> <ul> <li><p>Empty pages can be glaringly white, gray recycled paper, or yellowed old paper. The current idea is to create a histogram for a page, look for a steep increase of the curve, and get the percentage of pixels are darker than that. If that exceeds a threshold, the page is likely not empty. </p></li> <li><p>Since this would likely classify a page containing a single line of text at the top as empty, we would tile the page and gather statistics about each tile. </p></li> <li><p>We would need to detect scanned staplers and holes from binding (likely only in certain tiles), but this can be put off to some later stage. However, if you have an idea what to look out for besides these two, please mention it in a comment. </p></li> <li><p>This needs to be fast. It's part of a document processing workflow that processes (tens of) thousands of pages per day. If processing a page takes ten seconds longer, than our customers will have to tell their customers that they'll have to wait several days longer for their results. (If this results in more false positives, some customers would rather have someone check a few dozen found "empty" pages, than have their customer wait one more day.)</p></li> </ul> <p>So here's my questions: </p> <ol> <li><p>Is it a good idea to take this route or is there something better? </p></li> <li><p>If we do it this way, how would I do this? What's a good (cheap) algorithm for finding a threshold for a page? Could we gain significant speed by assuming a similar threshold for a batch of documents? To which precision could brightness values be rounded, before getting logged? What quirks could we expect? </p></li> </ol>
    singulars
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload