I believe in this case it is best to just use a flat file where each word stands on its own line. With this you can conveniently use the power of a regular expression search, which is highly optimized and will probably beat any data structure you can devise yourself for this problem.

## Solution #1: Using Regex

This is working Ruby code for this problem:

```ruby
def query(str, data)
  # Turn the pattern into a regex: each "?" matches exactly one character.
  r = Regexp.new("^#{str.gsub("?", ".")}$")
  idx = 0
  begin
    idx = data.index(r, idx)
    if idx
      yield data[idx, str.size]
      idx += str.size + 1 # skip past the match and its trailing newline
    end
  end while idx
end

start_time = Time.now
query("?r?te", File.read("wordlist.txt")) do |w|
  puts w
end
puts Time.now - start_time
```

The file `wordlist.txt` contains 45425 words (downloadable [here](http://martin.ankerl.com/2008/08/09/two-word-anagram-finder-algorithm/)). The program's output for the query `?r?te` is:

```
brute
crate
Crete
grate
irate
prate
write
wrote
0.013689
```

So it takes just under 14 milliseconds to both read the whole file and find all matches in it. And it scales very well for all kinds of query patterns, even ones where a trie is very slow:

Query `????????????????e`:

```
counterproductive
indistinguishable
microarchitecture
microprogrammable
0.018681
```

Query `?h?a?r?c?l?`:

```
theatricals
0.013608
```

This looks fast enough for me.

## Solution #2: Regex with Prepared Data

If you want to go even faster, you can split the wordlist into strings that contain words of equal length and search only the correct one based on your query's length. Replace the last 5 lines with this code:

```ruby
def query_split(str, data)
  # Search only the bucket of words whose length matches the query.
  query(str, data[str.length]) do |w|
    yield w
  end
end

# prepare data: bucket the words by length
data = Hash.new("")
File.read("wordlist.txt").each_line do |w|
  data[w.length - 1] += w # length - 1 ignores the trailing newline
end

# use prepared data for query
start_time = Time.now
query_split("?r?te", data) do |w|
  puts w
end
puts Time.now - start_time
```

Building the data structure now takes about 0.4 seconds, but all queries are about 10 times faster (depending on the number of words of that length):

- `?r?te` 0.001112 sec
- `?h?a?r?c?l?` 0.000852 sec
- `????????????????e` 0.000169 sec

## Solution #3: One Big Hashtable (Updated Requirements)

Since you have changed your requirements, you can easily expand on your idea and use just one big hashtable that contains all precalculated results. But instead of working around collisions yourself, you can rely on the performance of a properly implemented hashtable.

Here I create one big hashtable where each possible query maps to a list of its results:

```ruby
def create_big_hash(data)
  # A hash whose values default to a fresh array on first access.
  h = Hash.new do |h, k|
    h[k] = Array.new
  end
  data.each_line do |l|
    w = l.strip
    # add all words with one ?
    w.length.times do |i|
      q = String.new(w)
      q[i] = "?"
      h[q].push w
    end
    # add all words with two ??
    (w.length - 1).times do |i|
      q = String.new(w)
      q[i, 2] = "??"
      h[q].push w
    end
  end
  h
end

# prepare data
t = Time.new
h = create_big_hash(File.read("wordlist.txt"))
puts "#{Time.new - t} sec preparing data\n#{h.size} entries in big hash"

# use prepared data for query
t = Time.new
h["?ood"].each do |w|
  puts w
end
puts(Time.new - t)
```

The output is:

```
4.960255 sec preparing data
616745 entries in big hash
food
good
hood
mood
wood
2.0e-05
```

Query performance is O(1); it is just a lookup in the hashtable. The time 2.0e-05 is probably below the timer's precision. When running the query 1000 times, I get an average of 1.958e-6 seconds per query. To get it faster, I would switch to C++ and use the [Google Sparse Hash](http://goog-sparsehash.sourceforge.net/), which is extremely memory efficient and fast.
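A minimal sketch of such a timing loop, assuming the `h` built above (the loop itself is illustrative, not the code behind the quoted numbers):

```ruby
# Illustrative sketch: average the O(1) lookup over many runs to get
# below the timer's precision. Assumes `h` from create_big_hash above.
runs = 1000
t = Time.new
runs.times { h["?ood"] } # repeated hash lookups; results ignored here
puts (Time.new - t) / runs
```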
## Solution #4: Get Really Serious

All of the above solutions work and should be good enough for many use cases. If you really want to get serious and have lots of spare time on your hands, read some good papers:

- [Tries for Approximate String Matching](https://scholar.google.at/scholar?q=Tries+for+Approximate+String+Matching) - If well implemented, tries can have very compact memory requirements (50% less space than the dictionary itself) and are very fast; a minimal sketch follows this list.
- [Agrep - A Fast Approximate Pattern-Matching Tool](https://scholar.google.at/scholar?q=Agrep+-+A+Fast+Approximate+Pattern-Matching+Tool) - Agrep is based on a new, efficient, and flexible algorithm for approximate string matching.
- [Google Scholar search for approximate string matching](https://scholar.google.at/scholar?hl=en&q=approximate+string+matching+dictionary) - More than enough to read on this topic.
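To make the first bullet concrete, here is a minimal, unoptimized trie sketch with `?` wildcard matching. It is my own illustration, not the compact structure from the papers; the class and method names are invented for the example:

```ruby
# Minimal trie with "?" wildcard matching -- an illustrative sketch only.
class Trie
  def initialize
    @root = {}
  end

  # Insert a word, one character per trie level.
  def insert(word)
    node = @root
    word.each_char { |c| node = (node[c] ||= {}) }
    node[:end] = true # marks a complete word
  end

  # Yields every word matching pattern, where "?" matches any one character.
  def query(pattern, node = @root, prefix = "", &block)
    if pattern.empty?
      yield prefix if node[:end]
      return
    end
    c = pattern[0]
    rest = pattern[1..-1]
    if c == "?"
      # A wildcard has to explore every child branch -- this is the
      # slow case for patterns that are mostly "?"s.
      node.each do |ch, child|
        query(rest, child, prefix + ch, &block) unless ch == :end
      end
    elsif node[c]
      query(rest, node[c], prefix + c, &block)
    end
  end
end

trie = Trie.new
File.read("wordlist.txt").each_line { |l| trie.insert(l.strip) }
trie.query("?r?te") { |w| puts w }
```

A pattern such as `????????????????e` forces the wildcard branch to fan out over almost the entire tree, which is exactly why the regex scan in Solution #1 handles such queries better than a naive trie.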