StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POIn-Place Radix Sort
text
Body
copied!<p>This is a long text. Please bear with me. Boiled down, the question is: <strong>Is there a workable in-place radix sort algorithm</strong>?</p> <hr> <h2>Preliminary</h2> <p>I've got a huge number of <em>small fixed-length</em> strings that only use the letters “A”, “C”, “G” and “T” (yes, you've guessed it: <a href="https://en.wikipedia.org/wiki/DNA" rel="noreferrer">DNA</a>) that I want to sort.</p> <p>At the moment, I use <code>std::sort</code> which uses <a href="https://en.wikipedia.org/wiki/Introsort" rel="noreferrer">introsort</a> in all common implementations of the <a href="https://en.wikipedia.org/wiki/Standard_Template_Library" rel="noreferrer">STL</a>. This works quite well. However, I'm convinced that <a href="https://en.wikipedia.org/wiki/Radix_sort" rel="noreferrer">radix sort</a> fits my problem set perfectly and should work <strong>much</strong> better in practice.</p> <h2>Details</h2> <p>I've tested this assumption with a very naive implementation and for relatively small inputs (on the order of 10,000) this was true (well, at least more than twice as fast). However, runtime degrades abysmally when the problem size becomes larger (<em>N</em> > 5,000,000).</p> <p>The reason is obvious: radix sort requires copying the whole data (more than once in my naive implementation, actually). This means that I've put ~ 4 GiB into my main memory which obviously kills performance. Even if it didn't, I can't afford to use this much memory since the problem sizes actually become even larger.</p> <h2>Use Cases</h2> <p>Ideally, this algorithm should work with any string length between 2 and 100, for DNA as well as DNA5 (which allows an additional wildcard character “N”), or even DNA with <a href="https://en.wikipedia.org/wiki/International_Union_of_Pure_and_Applied_Chemistry" rel="noreferrer">IUPAC</a> <a href="https://en.wikipedia.org/wiki/Nucleic_acid_notation" rel="noreferrer">ambiguity codes</a> (resulting in 16 distinct values). However, I realize that all these cases cannot be covered, so I'm happy with any speed improvement I get. The code can decide dynamically which algorithm to dispatch to.</p> <h2>Research</h2> <p>Unfortunately, the <a href="https://en.wikipedia.org/wiki/Radix_sort" rel="noreferrer">Wikipedia article on radix sort</a> is useless. The section about an in-place variant is complete rubbish. The <a href="http://xlinux.nist.gov/dads/HTML/radixsort.html" rel="noreferrer">NIST-DADS section on radix sort</a> is next to nonexistent. There's a promising-sounding paper called <a href="https://www.mii.vu.lt/informatica/pdf/INFO562.pdf" rel="noreferrer">Efficient Adaptive In-Place Radix Sorting</a> which describes the algorithm “MSL”. Unfortunately, this paper, too, is disappointing.</p> <p>In particular, there are the following things.</p> <p>First, the algorithm contains several mistakes and leaves a lot unexplained. In particular, it doesn’t detail the recursion call (I simply assume that it increments or reduces some pointer to calculate the current shift and mask values). Also, it uses the functions <code>dest_group</code> and <code>dest_address</code> without giving definitions. I fail to see how to implement these efficiently (that is, in O(1); at least <code>dest_address</code> isn’t trivial).</p> <p>Last but not least, the algorithm achieves in-place-ness by swapping array indices with elements inside the input array. This obviously only works on numerical arrays. I need to use it on strings. Of course, I could just screw strong typing and go ahead assuming that the memory will tolerate my storing an index where it doesn’t belong. But this only works as long as I can squeeze my strings into 32 bits of memory (assuming 32 bit integers). That's only 16 characters (let's ignore for the moment that 16 > log(5,000,000)).</p> <p>Another paper by one of the authors gives no accurate description at all, but it gives MSL’s runtime as sub-linear which is flat out wrong.</p> <p><strong>To recap</strong>: Is there any hope of finding a working reference implementation or at least a good pseudocode/description of a working in-place radix sort that works on DNA strings?</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload