Note that there are some explanatory texts on larger screens.

plurals
  1. POExtract first N unique integers from an Array
    primarykey
    data
    text
    <p>I have a large list of integers (thousands), and I want to extract the first N (in the order of 10-20) unique elements from it. Each integer in the list occurs roughly three times.</p> <p>Writing an algorithm to do this is trivial, but I wonder what's the most speed and memory efficient way to do it.</p> <p>There are some additional constraints and informations in my case:</p> <ul> <li><p>In my use-case I extract my uniques multiple times on the array, each time skipping some elements from the beginning. The amount of elements that I skip is not known during unique-extraction. I don't even have a upper bound. Therefore sorting is not speed efficient (I have to preserve the order of the array).</p></li> <li><p>The integers are all over the place, so a bit-array as a lookup solution is not feasible.</p></li> <li><p>I want to avoid temporary allocations during the search at all costs.</p></li> </ul> <p>My current solution looks roughly like this:</p> <pre><code> int num_uniques = 0; int uniques[16]; int startpos = 0; while ((num_uniques != N) &amp;&amp; (start_pos &lt; array_length)) { // a temporary used later. int insert_position; // Get next element. int element = array[startpos++]; // check if the element exist. If the element is not found // return the position where it could be inserted while keeping // the array sorted. if (!binary_search (uniques, element, num_uniques, &amp;insert_position)) { // insert the new unique element while preserving // the order of the array. insert_into_array (uniques, element, insert_position); uniques++; } } </code></pre> <p>The binary_search / insert into array algorithm gets the job done, but the performance is not great. The insert_into_array call moves elements around a lot, and this slows everythign down.</p> <p>Any ideas?</p> <hr> <p><strong>EDIT</strong></p> <p>Great answers, guys! Everyone deserves an accepted answer, but I can give only one. I'll implement a bunch of your ideas and do a performance-shootout with some typical data. The one with the idea that lead to the quickest implementation get's the accepted answer. </p> <p>I'll run the code on a modern PC and a embedded CortexA8-CPU and I'll weight the results somehow. Will post the results as well.</p> <hr> <p><strong>EDIT: Results of the shoot-out</strong></p> <p>Timings on a Core-Duo, 100 iterations over a 160kb test-dataset.</p> <pre><code>Bruteforce (Pete): 203 ticks Hash and Bruteforce (Antti): 219 ticks Inplace Binary Tree (Steven): 390 ticks Binary-Search (Nils): 438 ticks </code></pre> <p><a href="http://torus.untergrund.net/code/unique_search_shootout.zip" rel="nofollow noreferrer">http://torus.untergrund.net/code/unique_search_shootout.zip</a> (C-source and testdata)</p> <p>Additional remarks:</p> <ul> <li><p>The Inplace Binary Tree absolutely rocks for true random distributions (my test-data has a tendency to be ascending). </p></li> <li><p>The Binary-Search works very well on my testdata for more than 32 uniques. It performs almost linear.</p></li> </ul>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload