Note that there are some explanatory texts on larger screens.

plurals
  1. POSorting strings so that hamming distance is low between adjacent strings
    primarykey
    data
    text
    <p><strong>Problem:</strong></p> <p>I have N (~100k-1m) strings each D (e.g. 2000) characters long and with a low alphabet (eg 3 possible characters). I would like to sort these strings such that there are as few possible changes between adjacent strings (eg hamming distance is low). Solution doesn't have to be the best possible but closer the better. </p> <p><strong>Example</strong></p> <pre><code>N=4 D=5 //initial strings 1. aaacb 2. bacba 3. acacb 4. cbcba //sorted so that hamming distance between adjacent strings is low 1. aaacb 3. acacb (Hamming distance 1-&gt;3 = 1) 4. cbcba (Hamming distance 3-&gt;4 = 4) 2. bacba (Hamming distance 4-&gt;2 = 2) </code></pre> <p><strong>Thoughts about the problem</strong></p> <p>I have a bad feeling this is a non trivial problem. If we think of each string as a node and the distances to other strings as an edge, then we are looking at a travelling salesman problem. The large number of strings means that calculating all of the pairwise distances beforehand is potentially infeasible, I think turning the problem into some more like the <a href="http://en.wikipedia.org/wiki/Canadian_traveller_problem" rel="noreferrer">Canadian Traveller Problem</a>.</p> <p>At the moment my solution has been to use a <a href="http://en.wikipedia.org/wiki/Vp-tree" rel="noreferrer">VP tree</a> to find a greedy nearest neighbour type solution to the problem</p> <pre><code>curr_string = a randomly chosen string from full set while(tree not empty) found_string = find nearest string in tree tree.remove(found_string) sorted_list.add(curr_string) curr_string = found_string </code></pre> <p>but initial results appear to be poor. Hashing strings so that more similar ones are closer may be another option but I know little about how good a solution this will provide or how well it will scale to data of this size. </p>
    singulars
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload