<p>I wanted to share my solution. It may not work for everyone, but since nothing else has come up, maybe it will help someone else. I went with the first option in my question: use <code>pdfimages</code> to extract the large images, which come out rotated every which way. I then found a way to use OCR and word counts to guess the orientation, which took me from an estimated 25% correctly rotated to above 90%.</p>
<p>The flow is as follows:</p>
<ol>
<li>Use <code>pdfimages</code> (<code>apt-get install poppler-utils</code>) to get a set of pbm files (not shown below).</li>
<li>For each file:
<ol>
<li>Make four versions, rotated 0, 90, 180, and 270 degrees (I refer to them as "north", "east", "south", and "west" in my code).</li>
<li>OCR each. The two with the lowest word count are likely the right-side-up and upside-down versions. This was over 99% accurate in my set of images processed to date.</li>
<li>Run the OCR output of the two with the lowest word count through a spell check. The file with the fewest spelling errors (i.e. the most recognizable words) is likely to be correct. For my set this was about 93% accurate (up from 25%), based on a sample of 500.</li>
</ol></li>
</ol>
<p>YMMV. My files are bitonal and highly textual. The source images average 3300 px on the long side. I can't speak to greyscale or color, or to files with a lot of images. Most of my source PDFs are bad scans of old photocopies, so the accuracy might be even better with cleaner files. Using <code>-despeckle</code> during the rotation made no difference and slowed things down considerably (~5×). I chose <code>ocrad</code> for speed rather than accuracy, since I only need rough numbers and throw the OCR output away. Re: performance, my nothing-special Linux desktop machine can run the whole script at about 2-3 files per second.</p>
<p>Here's the implementation in a simple bash script:</p>
<pre><code>#!/bin/bash
# Rotates a pbm file in place.
# Pass a .pbm as the only arg.
file=$1

# Use a fresh temp dir per run so leftovers from an earlier run (or a
# concurrent one) cannot end up in the tables below.
TMP=$(mktemp -d)

# Dependencies:
# convert: apt-get install imagemagick
# ocrad: apt-get install ocrad
# aspell: apt-get install aspell aspell-en
ASPELL="/usr/bin/aspell"
AWK="/usr/bin/awk"
BASENAME="/usr/bin/basename"
CONVERT="/usr/bin/convert"
HEAD="/usr/bin/head"
OCRAD="/usr/bin/ocrad"
SORT="/usr/bin/sort"
WC="/usr/bin/wc"

# Make copies in all four orientations (the src file is north; copy it to make
# things less confusing)
file_name=$($BASENAME "$file")
north_file="$TMP/$file_name-north"
east_file="$TMP/$file_name-east"
south_file="$TMP/$file_name-south"
west_file="$TMP/$file_name-west"

cp "$file" "$north_file"
$CONVERT "$file" -rotate 90 "$east_file"
$CONVERT "$file" -rotate 180 "$south_file"
$CONVERT "$file" -rotate 270 "$west_file"

# OCR each (just append ".txt" to the path/name of the image)
north_text="$north_file.txt"
east_text="$east_file.txt"
south_text="$south_file.txt"
west_text="$west_file.txt"

$OCRAD -f -F utf8 "$north_file" -o "$north_text"
$OCRAD -f -F utf8 "$east_file" -o "$east_text"
$OCRAD -f -F utf8 "$south_file" -o "$south_text"
$OCRAD -f -F utf8 "$west_file" -o "$west_text"

# Get the word count for each txt file (least 'words' == least whitespace junk
# resulting from vertical lines of text that should be horizontal.)
# Each record is "&lt;word count&gt; &lt;txt file&gt; &lt;image file&gt;".
wc_table="$TMP/wc_table"
echo "$($WC -w "$north_text") $north_file" &gt; "$wc_table"
echo "$($WC -w "$east_text") $east_file" &gt;&gt; "$wc_table"
echo "$($WC -w "$south_text") $south_file" &gt;&gt; "$wc_table"
echo "$($WC -w "$west_text") $west_file" &gt;&gt; "$wc_table"

# Take the bottom two; these are likely right side up and upside down, but
# generally too close to call beyond that.
bottom_two_wc_table="$TMP/bottom_two_wc_table"
$SORT -n "$wc_table" | $HEAD -2 &gt; "$bottom_two_wc_table"

# Spellcheck. The lowest number of misspelled words is most likely the
# correct orientation.
misspelled_words_table="$TMP/misspelled_words_table"
while read -r record; do
  txt=$(echo "$record" | $AWK '{ print $2 }')
  misspelled_word_count=$($ASPELL -l en list &lt; "$txt" | $WC -w)
  echo "$misspelled_word_count $record" &gt;&gt; "$misspelled_words_table"
done &lt; "$bottom_two_wc_table"

# Do the sort, overwrite the input file, save out the text.
# Fields are now "&lt;misspelled&gt; &lt;word count&gt; &lt;txt file&gt; &lt;image file&gt;".
winner=$($SORT -n "$misspelled_words_table" | $HEAD -1)
rotated_file=$(echo "$winner" | $AWK '{ print $4 }')
mv "$rotated_file" "$file"

# Clean up.
if [ -d "$TMP" ]; then
  rm -r "$TMP"
fi
</code></pre>
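<p>To see the selection logic on its own, without running any OCR, here is a minimal sketch. The function name <code>pick_orientation</code>, its three-column input format, and the counts fed to it are all made up for illustration; the real script gets its numbers from <code>ocrad</code> and <code>aspell</code>.</p>

```shell
#!/bin/bash
# Hypothetical helper (not part of the script above): reads lines of
#   <word_count> <misspelled_count> <label>
# on stdin and prints the label judged most likely to be upright.
pick_orientation() {
  sort -n -k1,1 |   # fewest OCR "words" first
    head -n 2 |     # keep the two candidates with horizontal text
    sort -n -k2,2 | # of those, fewest misspellings first
    head -n 1 |
    awk '{ print $3 }'
}

# Made-up counts, for illustration only.
printf '%s\n' \
  '412 97 north' \
  '1530 1201 east' \
  '398 310 south' \
  '1498 1187 west' | pick_orientation
```

<p>With these sample numbers, "south" has the lowest word count but far more misspellings than "north", so the second pass picks "north". That is why the word count alone is not used to decide between the last two candidates.</p>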
 
