Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    text
    copied!<p>Given the amount of redundancy, you may get good compression using a <a href="http://en.wikipedia.org/wiki/Trie" rel="nofollow">Trie</a>, which collapses all common prefixes.</p> <p>It looks like you're aiming in that direction with your nested dictionary structure, but a hash map isn't necessarily the most space-efficient store for this (you don't mention any requirement for lookup performance).</p> <p>For example, consider storing it as nested lists, with a 1-tuple to distinguish the leaf value (although we can probably come up with something cleaner):</p> <pre><code>trie = [ [1, (2,), [2, (1,)], [3, [2, (2,)]], [2, (1,)], [4, (1,)] ] </code></pre> <p>With average length 3, this should be pretty shallow, and hopefully there is less slack at each level using a list instead of a dictionary.</p> <p>Note if some branches are long and bare, you can compress redundant depth at the cost of code complexity. Consider the subtree</p> <pre><code>[1, [2, [3, ... [n, (1,)]...]]] </code></pre> <p>could be coded instead as</p> <pre><code>[1, 2, 3, ..., n, (1,)] </code></pre> <p>so long as you distinguish between flat sequences and siblings (here, a <em>leaf</em> is always a 1-tuple, a <em>subtree</em> is always a list, and anything else is the next element in a flat sequence).</p> <hr> <p>The new subset requirement is more complex. I can think of some possible approaches:</p> <ol> <li>naive/brute-force search: depending again on your access pattern, you could return a generator instead of an eager full-depth copy of the matching subtrees (sub-tries)</li> <li>store a bloom filter at the top of each subtree, indicating the values found within. Your search works the same as (1), except you can quickly skip subtrees that cannot match. <ul> <li>Each level's filter is just a logical OR of the filters in its children</li> </ul></li> <li><p>if you need to do the lookup quickly, you could a reverse lookup. For example this could be a set of each sequence containing a key (so you'd have one million keys). After taking the union of the sequences containing <em>x</em> and those containing <em>y</em>, you'd need to iterate over the results to see which had <em>x</em> and <em>y</em> in the right order.</p> <p>The difficulty with this scheme is <em>representing</em> each sequence. You either need a link to the leaf node in your Trie (and be able somehow to iterate from leaf to root), or you need a separate flat copy of each sequence, which seems wasteful</p></li> </ol>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload