
<h2>What's the most efficient way to partition a large hash into N smaller hashes in Ruby?</h2>
<h3>The Problem</h3>

<p>I'm working on a problem that involves sharding. As part of the problem I need to find the fastest way to partition a large Ruby hash (&gt; 200,000 entries) into two or more pieces.</p>

<ul>
<li><p>Are there any non-O(n) approaches?</p></li>
<li><p>Is there a non-Ruby, i.e. C/C++, implementation?</p></li>
</ul>

<p>Please don't reply with examples using the trivial approach of converting the hash to an array and rebuilding N distinct hashes.</p>

<p>My concern is that Ruby is too slow to do this kind of work.</p>

<h3>The initial approach</h3>

<p>This was the first solution I tried. What was appealing about it was:</p>

<ul>
<li>it didn't need to loop slavishly across the hash</li>
<li>it didn't need to manage a counter to allocate the members evenly among the shards</li>
<li>it's short and neat looking</li>
</ul>

<p>OK, it's still O(n), but it relies on methods in the standard library, which I figured would be faster than writing my own Ruby code.</p>

<pre><code>pivot = s.size / 2
slices = s.each_slice(pivot)
s1 = Hash[*slices.entries[0].flatten]
s2 = Hash[*slices.entries[1].flatten]
</code></pre>

<h3>A better solution</h3>

<p>Mark and Mike were kind enough to suggest approaches. I have to admit that Mark's approach felt wrong - it did exactly what I didn't want - it looped over all of the members of the hash and evaluated a conditional as it went - but since he'd taken the time to do the evaluation, I figured that I should try a similar approach and benchmark it.
This is my adapted version of his approach (my keys aren't numbers, so I can't take his approach verbatim):</p>

<pre><code>def split_shard(s)
  shard1 = {}
  shard2 = {}

  t = Benchmark.measure do
    n = 0
    pivot = s.size / 2

    s.each_pair do |k, v|
      if n &lt; pivot
        shard1[k] = v
      else
        shard2[k] = v
      end
      n += 1
    end
  end

  $b += t.real
  $e += s.size

  return shard1, shard2
end
</code></pre>

<h3>The results</h3>

<p>In both cases, a large number of hashes are split into shards. The total number of elements across all of the hashes in the test data set was 1,680,324.</p>

<p>My initial solution - which surely had to be faster because it uses methods in the standard library and minimises the amount of Ruby code (no loop, no conditional) - runs in just over <em>9s</em>.</p>

<p>Mark's approach runs in just over <em>5s</em>.</p>

<p>That's a significant win.</p>

<h3>Take away</h3>

<p>Don't be fooled by 'intuition' - measure the performance of competing algorithms.</p>

<p>Don't worry about Ruby's performance as a language - my initial concern was that if I'm doing ten million of these operations, it could take a significant amount of time in Ruby, but it really doesn't.</p>

<p>Thanks to Mark and Mike, who both get points from me for their help.</p>

<p>Thanks!</p>
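<p>As an aside, the counter-based idea extends naturally from two shards to N. The sketch below is not from the post - <code>split_into_shards</code> is a hypothetical helper - but it shows one way to generalise: walk the hash once and deal entries round-robin (a variation on the pivot counter), so the shards stay balanced for any N without computing a pivot.</p>

```ruby
# A minimal sketch, assuming a round-robin variant of the counter approach.
# `split_into_shards` is a hypothetical name, not from the original post.
def split_into_shards(hash, count)
  shards = Array.new(count) { {} }  # one empty hash per shard
  n = 0
  hash.each_pair do |k, v|
    shards[n % count][k] = v        # deal entries round-robin across shards
    n += 1
  end
  shards
end

sample = (1..10).map { |i| ["key#{i}", i] }.to_h
halves = split_into_shards(sample, 2)
# each shard receives every second entry, so both end up with five entries
```

<p>Like Mark's version, this is still a single O(n) pass with one conditional-free body per entry; the modulo simply replaces the pivot comparison when N is greater than two.</p>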