StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
text
Body
copied!<p>Are you willing to spend a significant time preprocessing the string?</p> <p>If you are, what you can do is build a list of n-grams with offsets.</p> <p>Suppose your alphabet is hex bytes and you are using 1-grams. </p> <p>Then for 00-ff, you can create a dictionary that looks like this(perlese, sorry)</p> <pre><code>$offset_list{00} = @array_of_offsets $offset_list{01} = #...etc </code></pre> <p>where you walk down the string and build the @array_of_offsets from all points where bytes happen. You can do this for arbitrary n-grams. </p> <p>This provides a "start point for search" that you can use to walk. </p> <p>Of course, the downside is that you have to preprocess the string, so that's your tradeoff.</p> <p>edit:</p> <hr> <p>The basic idea here is to match prefixes. This may bomb out badly if the information is super-similar, but if it has a fair amount of divergence between n-grams, you should be able to match prefixes pretty well.</p> <p>Let's quantify divergence, since you've not discussed the kind of info you're analyzing. For the purposes of this algorithm, we can characterize divergence as a distance function: you need a <em>decently</em> high <a href="http://en.wikipedia.org/wiki/Hamming_distance" rel="nofollow noreferrer">Hamming distance</a>. If the hamming distance between n-grams is, say, 1, the above idea won't work. But if it's n-1, the above algorithm will be much easier.</p> <p>To improve on my algorithm, let's build an algorithm that does some successive elimination of possibilities:</p> <p>We can invoke <a href="http://en.wikipedia.org/wiki/Entropy_(information_theory)" rel="nofollow noreferrer">Shannon Entropy</a> to define information of a given n-gram. Take your search string and successively build a prefix based upon the first m characters. When the entropy of the m-prefix is 'sufficiently high', use it later.</p> <ol> <li>Define p to be an m-prefix of the search string</li> <li>Search your 1 GB string and create an array of offsets that match p.</li> <li>Extend the m-prefix to be some k-prefix, k > m, entropy of k-prefix higher than m-prefix.</li> <li>Keep the elements offset array defined above, such that they match the k-prefix string. Discard the non-matching elements.</li> <li>Goto 4 until the entire search string is met. </li> </ol> <p>In a sense, this is like reversing Huffman encoding.</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload