<p>If you have examples of scanned texts in both the as-scanned (raw) version and the corrected version, it should be relatively simple to generate a list of the character corrections. Gather this data from enough texts, then sort it by frequency. Decide how frequent a correction has to be for it to count as "common," then keep only the common corrections in the list.</p>

<p>Turn the list into a map keyed by the correct letter, with each value being an array of the common mis-scans for that letter. Use a recursive function to take a word and generate all of its variations.</p>

<p>This example, in Ruby, shows the recursive function. Gathering up the possible mis-scans is up to you:</p>

<pre><code># Map from each correct character to its common mis-scans
VARIATIONS = {
  'l' =&gt; ['1'],
  'b' =&gt; ['8'],
}

def variations(word)
  return [''] if word.empty?
  first_character = word[0..0]
  remainder = word[1..-1]
  # The original character plus any known mis-scans of it
  possible_first_characters = [first_character] | VARIATIONS.fetch(first_character, [])
  possible_remainders = variations(remainder)
  # Combine every candidate first character with every candidate remainder
  possible_first_characters.product(possible_remainders).map(&amp;:join)
end

p variations('Alphabet')
# =&gt; ["Alphabet", "Alpha8et", "A1phabet", "A1pha8et"]
</code></pre>

<p>The original word is included in the list of variations. If you want <em>only</em> possible mis-scans, then remove the original word:</p>

<pre><code>def misscans(word)
  variations(word) - [word]
end

p misscans('Alphabet')
# =&gt; ["Alpha8et", "A1phabet", "A1pha8et"]
</code></pre>

<hr>

<p>A quick-and-dirty (and untested) version of a command-line program would couple the above functions with this "main" function. It reads one word per line from the input file and writes each mis-scan on its own line of the output file:</p>

<pre><code>input_path, output_path = ARGV

File.open(input_path, 'r') do |infile|
  File.open(output_path, 'w') do |outfile|
    while (line = infile.gets)
      # Strip the trailing newline so it is not treated as part of the word
      outfile.puts misscans(line.chomp)
    end
  end
end
</code></pre>
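<p>The first step described above, gathering the corrections and keeping only the common ones, could be sketched roughly as follows. This is only an illustration under stated assumptions: <code>build_variations</code> and its <code>min_count</code> parameter are hypothetical names, and the sketch assumes each raw/corrected pair has the same length, so that it only detects one-for-one character substitutions.</p>

```ruby
# Hypothetical helper: builds a map shaped like VARIATIONS from word pairs.
# Assumption: each pair is [raw_word, corrected_word] and both words have the
# same length, so characters can be compared position by position.
def build_variations(pairs, min_count: 2)
  # Count how often each (correct character, mis-scanned character) occurs
  counts = Hash.new(0)
  pairs.each do |raw, corrected|
    next unless raw.length == corrected.length
    raw.chars.zip(corrected.chars).each do |raw_char, correct_char|
      counts[[correct_char, raw_char]] += 1 if raw_char != correct_char
    end
  end

  # Most frequent corrections first; keep only those seen often enough
  variations = {}
  counts.sort_by { |_, count| -count }.each do |(correct_char, raw_char), count|
    next if count < min_count
    (variations[correct_char] ||= []) << raw_char
  end
  variations
end

# Made-up sample data, for illustration only
sample_pairs = [
  ['A1phabet', 'Alphabet'],
  ['1etter', 'letter'],
  ['8all', 'ball'],
  ['ta8le', 'table'],
  ['cot', 'cat'], # seen only once, so it falls below the threshold
]
p build_variations(sample_pairs)
```

<p>With this sample data the result maps <code>'l'</code> to <code>['1']</code> and <code>'b'</code> to <code>['8']</code>, the same shape as the <code>VARIATIONS</code> constant. Real scans also insert and drop characters, which this position-by-position comparison ignores; handling those would need a proper alignment step.</p>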
 
