I believe in this case it is best to just use a flat file where each word stands on its own line. With this you can conveniently use the power of a regular expression search, which is highly optimized and will probably beat any data structure you can devise yourself for this problem.

## Solution #1: Using Regex

This is working Ruby code for this problem:

```ruby
def query(str, data)
  # Turn the pattern into a regex: each "?" matches exactly one character.
  r = Regexp.new("^#{str.gsub("?", ".")}$")
  idx = 0
  begin
    idx = data.index(r, idx)
    if idx
      yield data[idx, str.size]
      idx += str.size + 1 # skip past the match and its trailing newline
    end
  end while idx
end

start_time = Time.now
query("?r?te", File.read("wordlist.txt")) do |w|
  puts w
end
puts Time.now - start_time
```

The file `wordlist.txt` contains 45425 words (downloadable [here](http://martin.ankerl.com/2008/08/09/two-word-anagram-finder-algorithm/)). The program's output for the query `?r?te` is:

```
brute
crate
Crete
grate
irate
prate
write
wrote
0.013689
```

So it takes just under 14 milliseconds to both read the whole file and find all matches in it. And it scales very well for all kinds of query patterns, even ones where a trie is very slow:

Query `????????????????e`:

```
counterproductive
indistinguishable
microarchitecture
microprogrammable
0.018681
```

Query `?h?a?r?c?l?`:

```
theatricals
0.013608
```

This looks fast enough for me.

## Solution #2: Regex with Prepared Data

If you want to go even faster, you can split the wordlist into strings that contain words of equal length and search only the correct one based on your query's length. Replace the last 5 lines with this code:

```ruby
def query_split(str, data)
  # Search only the bucket of words whose length matches the query.
  query(str, data[str.length]) do |w|
    yield w
  end
end

# prepare data: bucket the words by length
data = Hash.new("")
File.read("wordlist.txt").each_line do |w|
  data[w.length - 1] += w # length - 1 ignores the trailing newline
end

# use prepared data for query
start_time = Time.now
query_split("?r?te", data) do |w|
  puts w
end
puts Time.now - start_time
```

Building the data structure now takes about 0.4 seconds, but all queries are about 10 times faster (depending on the number of words of that length):

- `?r?te` 0.001112 sec
- `?h?a?r?c?l?` 0.000852 sec
- `????????????????e` 0.000169 sec

## Solution #3: One Big Hashtable (Updated Requirements)

Since you have changed your requirements, you can easily expand on your idea and use just one big hashtable that contains all precalculated results. But instead of working around collisions yourself, you can rely on the performance of a properly implemented hashtable.

Here I create one big hashtable where each possible query maps to a list of its results:

```ruby
def create_big_hash(data)
  # A hash whose values default to a fresh array on first access.
  h = Hash.new do |h, k|
    h[k] = Array.new
  end
  data.each_line do |l|
    w = l.strip
    # add all words with one ?
    w.length.times do |i|
      q = String.new(w)
      q[i] = "?"
      h[q].push w
    end
    # add all words with two ??
    (w.length - 1).times do |i|
      q = String.new(w)
      q[i, 2] = "??"
      h[q].push w
    end
  end
  h
end

# prepare data
t = Time.new
h = create_big_hash(File.read("wordlist.txt"))
puts "#{Time.new - t} sec preparing data\n#{h.size} entries in big hash"

# use prepared data for query
t = Time.new
h["?ood"].each do |w|
  puts w
end
puts(Time.new - t)
```

The output is:

```
4.960255 sec preparing data
616745 entries in big hash
food
good
hood
mood
wood
2.0e-05
```

Query performance is O(1); it is just a lookup in the hashtable. The time 2.0e-05 is probably below the timer's precision. When running the query 1000 times, I get an average of 1.958e-6 seconds per query. To get it faster, I would switch to C++ and use the [Google Sparse Hash](http://goog-sparsehash.sourceforge.net/), which is extremely memory efficient and fast.
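A minimal sketch of such a timing loop, assuming the `h` built above (the loop itself is illustrative, not the code behind the quoted numbers):

```ruby
# Illustrative sketch: average the O(1) lookup over many runs to get
# below the timer's precision. Assumes `h` from create_big_hash above.
runs = 1000
t = Time.new
runs.times { h["?ood"] } # repeated hash lookups; results ignored here
puts (Time.new - t) / runs
```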
## Solution #4: Get Really Serious

All of the above solutions work and should be good enough for many use cases. If you really want to get serious and have lots of spare time on your hands, read some good papers:

- [Tries for Approximate String Matching](https://scholar.google.at/scholar?q=Tries+for+Approximate+String+Matching) - If well implemented, tries can have very compact memory requirements (50% less space than the dictionary itself) and are very fast; a minimal sketch follows this list.
- [Agrep - A Fast Approximate Pattern-Matching Tool](https://scholar.google.at/scholar?q=Agrep+-+A+Fast+Approximate+Pattern-Matching+Tool) - Agrep is based on a new, efficient, and flexible algorithm for approximate string matching.
- [Google Scholar search for approximate string matching](https://scholar.google.at/scholar?hl=en&q=approximate+string+matching+dictionary) - More than enough to read on this topic.
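To make the first bullet concrete, here is a minimal, unoptimized trie sketch with `?` wildcard matching. It is my own illustration, not the compact structure from the papers; the class and method names are invented for the example:

```ruby
# Minimal trie with "?" wildcard matching -- an illustrative sketch only.
class Trie
  def initialize
    @root = {}
  end

  # Insert a word, one character per trie level.
  def insert(word)
    node = @root
    word.each_char { |c| node = (node[c] ||= {}) }
    node[:end] = true # marks a complete word
  end

  # Yields every word matching pattern, where "?" matches any one character.
  def query(pattern, node = @root, prefix = "", &block)
    if pattern.empty?
      yield prefix if node[:end]
      return
    end
    c = pattern[0]
    rest = pattern[1..-1]
    if c == "?"
      # A wildcard has to explore every child branch -- this is the
      # slow case for patterns that are mostly "?"s.
      node.each do |ch, child|
        query(rest, child, prefix + ch, &block) unless ch == :end
      end
    elsif node[c]
      query(rest, node[c], prefix + c, &block)
    end
  end
end

trie = Trie.new
File.read("wordlist.txt").each_line { |l| trie.insert(l.strip) }
trie.query("?r?te") { |w| puts w }
```

A pattern such as `????????????????e` forces the wildcard branch to fan out over almost the entire tree, which is exactly why the regex scan in Solution #1 handles such queries better than a naive trie.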