StackOverflow2013

<p><em>I'm happy to say I just found "<a href="https://github.com/gfx/ruby-regexp_trie" rel="nofollow noreferrer">RegexpTrie</a>", which is a usable replacement for, and removes the need for, Perl's Regexp::Assemble.</em></p>
<p>Install it, and give it a try:</p>
<pre><code>require 'regexp_trie'

foo = %w(miss misses missouri mississippi)
RegexpTrie.union(foo)
# => /miss(?:(?:es|ouri|issippi))?/

RegexpTrie.union(foo, option: Regexp::IGNORECASE)
# => /miss(?:(?:es|ouri|issippi))?/i
</code></pre>
<p>Here's a comparison of the outputs. The commented patterns inside the array are from Regexp::Assemble, and the trailing <code># >></code> output is from RegexpTrie:</p>
<pre><code>require 'regexp_trie'

[
  'how now brown cow',                           # /(?:[chn]ow|brown)/
  'the rain in spain stays mainly on the plain', # /(?:(?:(?:(?:pl|r)a)?i|o)n|s(?:pain|tays)|mainly|the)/
  'jackdaws love my giant sphinx of quartz',     # /(?:jackdaws|quartz|sphinx|giant|love|my|of)/
  'fu foo bar foobar',                           # /(?:f(?:oo(?:bar)?|u)|bar)/
  'ms miss misses missouri mississippi'          # /m(?:iss(?:(?:issipp|our)i|es)?|s)/
].each do |s|
  puts "%-43s # /%s/" % [s, RegexpTrie.union(s.split).source]
end

# >> how now brown cow                           # /(?:how|now|brown|cow)/
# >> the rain in spain stays mainly on the plain # /(?:the|rain|in|s(?:pain|tays)|mainly|on|plain)/
# >> jackdaws love my giant sphinx of quartz     # /(?:jackdaws|love|my|giant|sphinx|of|quartz)/
# >> fu foo bar foobar                           # /(?:f(?:oo(?:bar)?|u)|bar)/
# >> ms miss misses missouri mississippi         # /m(?:iss(?:(?:es|ouri|issippi))?|s)/
</code></pre>
<p>Regarding how to use the Wikipedia link and misspelled words:</p>
<pre><code>require 'nokogiri'
require 'open-uri'
require 'regexp_trie'

URL = 'https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines'
# URI.open is required here: Kernel#open no longer handles URLs as of Ruby 3.0.
doc = Nokogiri::HTML(URI.open(URL))

corrections = doc.at('div#mw-content-text pre').text.lines[1..-1].map { |s|
  a, b = s.chomp.split('->', 2)
  [a, b.split(/,\s+/) ]
}.to_h
# {"abandonned"=>["abandoned"],
#  "aberation"=>["aberration"],
#  "abilityes"=>["abilities"],
#  "abilties"=>["abilities"],
#  "abilty"=>["ability"],
#  "abondon"=>["abandon"],
#  "abbout"=>["about"],
#  "abotu"=>["about"],
#  "abouta"=>["about a"],
#  ...
# }

misspelled_words_regex = /\b(?:#{RegexpTrie.union(corrections.keys, option: Regexp::IGNORECASE).source})\b/i
# => /\b(?:(?:a(?:b(?:andonned|eration|il(?:ityes|t(?:ies|y))|o(?:ndon(?:(?:ed|ing|s))?|tu|ut(?:it|the|a)...
</code></pre>
<p>At this point you can use <code>gsub(misspelled_words_regex, corrections)</code>; however, the values in <code>corrections</code> are arrays, because more than one word or phrase may be a valid replacement for a misspelled word. You'll have to decide which of the choices to use.</p>
<hr>
<p>Ruby is missing a very useful module found in Perl, called <a href="http://search.cpan.org/dist/Regexp-Assemble/Assemble.pm" rel="nofollow noreferrer">Regexp::Assemble</a>. Python has <a href="https://pypi.python.org/pypi/hachoir-regex" rel="nofollow noreferrer">hachoir-regex</a>, which appears to do the same sort of thing.</p>
<p>Regexp::Assemble creates a very efficient regular expression, based on lists of words and simple expressions. It's really remarkable ... or ... diabolical?</p>
<p>Check out the example for the module; it's extremely simple to use in its basic form:</p>
<pre><code>use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
$ra->add( 'ab+c' );
$ra->add( 'ab+-' );
$ra->add( 'a\w\d+' );
$ra->add( 'a\d+' );
print $ra->re; # prints a(?:\w?\d+|b+[-c])
</code></pre>
<p>Notice how it's combining the patterns.
It'd do the same with regular words, only it would be even more efficient because common strings will be combined:</p>
<pre><code>use Regexp::Assemble;

my $lorem = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.';

my $ra = Regexp::Assemble->new('flags' => 'i');
$lorem =~ s/[^a-zA-Z ]+//g;
$ra->add(split(' ', lc($lorem)));
print $ra->anchor_word(1)->as_string, "\n";
</code></pre>
<p>Which outputs:</p>
<pre><code>\b(?:a(?:dipisicing|liqua|met)|(?:consectetu|tempo)r|do(?:lor(?:emagna)?)?|e(?:(?:li)?t|iusmod)|i(?:ncididunt|psum)|l(?:abore|orem)|s(?:ed|it)|ut)\b
</code></pre>
<p>This code ignores case and honors word boundaries.</p>
<p>I'd recommend writing a little Perl app that takes a list of words and uses that module to output the stringified version of the regex pattern. You should be able to import that pattern into Ruby, which would let you find misspelled words very quickly. You could even have it write the pattern to a YAML file, then load that file into your Ruby code. Periodically parse the misspelled-word pages, run the output through the Perl code, and your Ruby code will have an up-to-date pattern.</p>
<p>You could run that pattern against a whole chunk of text just to see whether it contains any misspelled words. If it does, break the text down into sentences or words and check against the regex again. Don't start by testing individual words, because most words will be spelled correctly. It's almost like a binary search against your text: test the whole thing, and if there's a hit, break it into smaller blocks to narrow the search until you've found the individual misspellings. How you break down the chunks depends on the amount of incoming text. A regex match returns nil or an index whether you run it against the entire text block or against a single word, so you gain a lot of speed by testing big chunks of the text first.
</p>
<p>Then, once you know you have a misspelled word, you can do a hash lookup for the correct spelling. It would be a big hash, but sifting the good spellings from the bad is what takes the longest. The lookup itself would be extremely fast.</p>
<hr>
<p>Here's some example code:</p>
<p>get_words.rb</p>
<pre><code>#!/usr/bin/env ruby

require 'open-uri'
require 'nokogiri'
require 'yaml'

words = {}
['0-9', *('A'..'Z').to_a].each do |l|
  begin
    print "Reading #{l}... "
    # URI.open is required here: Kernel#open no longer handles URLs as of Ruby 3.0.
    html = URI.open("http://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/#{l}").read
    puts 'ok'
  rescue StandardError => e
    puts "got \"#{e}\""
    next
  end

  doc = Nokogiri::HTML(html)
  doc.search('div#bodyContent > ul > li').each do |n|
    # Skip list items that don't look like "misspelling (correction)".
    next unless n.content =~ /^(\w+) \s+ \(([^)]+)/x
    words[$1] = $2
  end
end

File.open('wordlist.yaml', 'w') do |wordfile|
  wordfile.puts words.to_yaml
end
</code></pre>
<p>regex_assemble.pl</p>
<pre><code>#!/usr/bin/env perl

use Regexp::Assemble;
use YAML;

use warnings;
use strict;

my $ra = Regexp::Assemble->new('flags' => 'i');
my %words = %{YAML::LoadFile('wordlist.yaml')};
$ra->add(map{ lc($_) } keys(%words));
print $ra->chomp(1)->anchor_word(1)->as_string, "\n";
</code></pre>
<p>Run the first script, then run the second, redirecting its output to a file to capture the emitted regex.</p>
<hr>
<p>And more examples of words and the generated output:</p>
<pre><code>'how now brown cow' => /\b(?:[chn]ow|brown)\b/
'the rain in spain stays mainly on the plain' => /\b(?:(?:(?:(?:pl|r)a)?i|o)n|s(?:pain|tays)|mainly|the)\b/
'jackdaws love my giant sphinx of quartz' => /\b(?:jackdaws|quartz|sphinx|giant|love|my|of)\b/
'fu foo bar foobar' => /\b(?:f(?:oo(?:bar)?|u)|bar)\b/
'ms miss misses missouri mississippi' => /\bm(?:iss(?:(?:issipp|our)i|es)?|s)\b/
</code></pre>
<p>Ruby's <code>Regexp.union</code> is nowhere close to the sophistication of <code>Regexp::Assemble</code>. The captured list of misspelled words contains 4,225 words totaling 41,817 characters. Running Perl's Regexp::Assemble against that list generated a 30,954-character regex. I'd say that's efficient.</p>
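<p>The narrowing search and hash lookup described above could be sketched in Ruby roughly like this. It is only an illustration under stated assumptions: <code>CORRECTIONS</code>, <code>MISSPELLED</code>, <code>find_misspellings</code> and <code>correct</code> are hypothetical names, the three-entry table stands in for the full Wikipedia list, <code>Regexp.union</code> stands in for the assembled pattern so the sketch runs on the standard library alone, and taking the first candidate correction is an arbitrary choice.</p>

```ruby
# Hypothetical miniature corrections table; the real one would be built
# from the Wikipedia list, with arrays of candidate replacements as values.
CORRECTIONS = {
  'abotu'      => ['about'],
  'abandonned' => ['abandoned'],
  'abilty'     => ['ability']
}.freeze

# Stand-in for the assembled pattern. Regexp.union is stdlib, so this runs
# without RegexpTrie or Regexp::Assemble; either could supply the source.
MISSPELLED = /\b(?:#{Regexp.union(CORRECTIONS.keys).source})\b/i

# Test the whole block first, then halve it and descend only into the
# halves that still match, until single tokens remain.
def find_misspellings(words, pattern = MISSPELLED)
  return [] if words.empty?
  return [] unless pattern.match?(words.join(' '))
  return words.first.scan(pattern) if words.size == 1

  mid = words.size / 2
  find_misspellings(words[0...mid], pattern) +
    find_misspellings(words[mid..-1], pattern)
end

# Replace each hit via a hash lookup. Picking the first candidate is an
# assumption; a real tool would need a better way to choose.
def correct(text, pattern = MISSPELLED, corrections = CORRECTIONS)
  text.gsub(pattern) { |m| corrections.fetch(m.downcase, [m]).first }
end

text = 'I abandonned the idea abotu an hour ago.'
p find_misspellings(text.split) # => ["abandonned", "abotu"]
puts correct(text)              # => I abandoned the idea about an hour ago.
```

<p>Text with no misspellings costs a single regex test, which is the point of checking the big chunk before the individual words.</p>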
