Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>Here's a fast and easy method, then a variant with better performance at the cost of more memory.</p> <p>In: array Uint X[], e.g. 1M 32-bit words<br> Wanted: a function <code>near( Uint q )</code> --> j with small <code>hammingdist( q, X[j] )</code><br> Method: binary search <code>q</code> in sorted <code>X</code>, then linear search a block around that.<br> Pseudocode:</p> <pre><code>def near( q, X, Blocksize=100 ): preprocess: sort X Uint* p = binsearch( q, X ) # match q in leading bits linear-search Blocksize words around p return the hamming-nearest of these. </code></pre> <p>This is fast -- <a href="http://www.sgi.com/tech/stl/lower_bound.html" rel="nofollow">Binary search</a> 1M words + nearest hammingdist in a block of size 100 takes &lt; 10 us on my Mac ppc. (This is highly cache-dependent &mdash; your mileage <em>will</em> vary.)</p> <p>How close does this come to finding the true nearest X[j] ? I can only experiment, can't do the math:<br> for 1M random queries in 1M random words, the nearest match is on average 4-5 bits away, vs. 3 away for the true nearest (linear scan all 1M):</p> <pre><code>near32 N 1048576 Nquery 1048576 Blocksize 100 binary search, then nearest +- 50 7 usec distance distribution: 0 4481 38137 185212 443211 337321 39979 235 0 near32 N 1048576 Nquery 100 Blocksize 1048576 linear scan all 1048576 38701 usec distance distribution: 0 0 7 58 35 0 </code></pre> <p>Run your data with blocksizes say 50 and 100 to see how the match distances drop.</p> <p><hr> To get even nearer, at the cost of twice the memory,<br> make a copy <code>Xswap</code> of <code>X</code> with upper / lower halfwords swapped, and return the better of</p> <pre><code>near( q, X, Blocksize ) near( swap q, Xswap, Blocksize ) </code></pre> <p>With lots of memory, one can use many more bit-shuffled copies of <code>X</code>, e.g. 32 rotations.<br> I have no idea how performance varies with Nshuffle and Blocksize &mdash; a question for LSH theorists.</p> <p><hr> (Added): To near-match bit strings of say 320 bits, 10 words, make 10 arrays of pointers, sorted on word 0, word 1 ... and search blocks with binsearch as above:</p> <pre><code>nearest( query word 0, Sortedarray0, 100 ) -&gt; min Hammingdist e.g. 42 of 320 nearest( query word 1, Sortedarray1, 100 ) -&gt; min Hammingdist 37 nearest( query word 2, Sortedarray2, 100 ) -&gt; min Hammingdist 50 ... -&gt; e.g. the 37. </code></pre> <p>This will of course miss near-matches where no single word is close, but it's very simple, and sort and binsearch are <em>blazingly</em> fast. The pointer arrays take exactly as much space as the data bits. 100 words, 3200 bits would work in exactly the same way.<br> But: this works only if there are roughly equal numbers of 0 bits and 1 bits, not 99 % 0 bits.</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload