StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
text
Body
copied!<p>I don't know much about Java, so I guess my solution will HAVE to be general :)</p> <p><strong>1. Compact the lists</strong></p> <p>Since Booleans are inefficient, each <code>List<Boolean></code> should be compacted into a <code>List<Byte></code>, it's easy, just grab them 8 at a time.</p> <p>The last "byte" may be incomplete, so you need to store how many bits have been encoded of course.</p> <p><strong>2. Serializing a list of elements</strong></p> <p>You have 2 ways to proceed: either you encode the number of items of the list, either you use a pattern to mark an end. I would recommend encoding the number of items, the pattern approach requires escaping and it's creepy, plus it's more difficult with packed bits.</p> <p>To encode the length you can use a variable scheme: ie the number of bytes necessary to encode a length should be proportional to the length, one I already used. You can indicate how many bytes are used to encode the length itself by using a prefix on the first byte:</p> <pre><code>0... .... > this byte encodes the number of items (7 bits of effective) 10.. .... / .... .... > 2 bytes 110. .... / .... .... / .... .... > 3 bytes </code></pre> <p>It's quite space efficient, and decoding occurs on whole bytes, so not too difficult. One could remark it's very similar to the UTF8 scheme :)</p> <p><strong>3. Apply recursively</strong></p> <p><code>List< List< Boolean > ></code> becomes <code>[Length Item ... Item]</code> where each <code>Item</code> is itself the representation of a <code>List<Boolean></code></p> <p><strong>4. Zip</strong></p> <p>I suppose there is a <code>zlib</code> library available for Java, or anything else like <code>deflate</code> or <code>lcw</code>. Pass it your buffer and make sure to precise you wish as much compression as possible, whatever the time it takes.</p> <p>If there is any repetitive pattern (even ones you did not see) in your representation it should be able to compress it. Don't trust it dumbly though and DO check that the "compressed" form is lighter than the "uncompressed" one, it's not always the case.</p> <p><strong>5. Examples</strong></p> <p>Where one notices that keeping track of the edge of the lists is space consuming :)</p> <pre><code>// Tricky here, we indicate how many bits are used, but they are packed into bytes ;) List<Boolean> list = [false,false,true,true,false,false,true,true] encode(list) == [0x08, 0x33] // [00001000, 00110011] (2 bytes) // Easier: the length actually indicates the number of elements List<List<Boolean>> super = [list,list] encode(super) == [0x02, 0x08, 0x33, 0x08, 0x33] // [00000010, ...] (5 bytes) </code></pre> <p><strong>6. Space consumption</strong></p> <p>Suppose we have a <code>List<Boolean></code> of <code>n</code> booleans, the space consumed to encode it is:</p> <pre><code>booleans = ceil( n / 8 ) </code></pre> <p>To encode the number of bits (<code>n</code>), we need:</p> <pre><code>length = 1 for 0 <= n < 2^7 ~ 128 length = 2 for 2^7 <= n < 2^14 ~ 16384 length = 3 for 2^14 <= n < 2^21 ~ 2097152 ... length = ceil( log(n) / 7 ) # for n != 0 ;) </code></pre> <p>Thus to fully encode a list:</p> <pre><code>bytes = if n == 0: 1 else : ceil( log(n) / 7 ) + ceil( n / 8 ) </code></pre> <p><strong>7. Small Lists</strong></p> <p>There is one corner case though: the low end of the spectrum (ie almost empty list).</p> <p>For <code>n == 1</code>, <code>bytes</code> is evaluated to <code>2</code>, which may indeed seem wasteful. I would not however try to guess what will happen once the compression kicks in.</p> <p>You may wish though to pack even more. It's possible if we abandon the idea of preserving whole bytes...</p> <ol> <li>Keep the length encoding as is (on whole bytes), but do not "pad" the <code>List<Boolean></code>. A one element list becomes <code>0000 0001 x</code> (9 bits)</li> <li>Try to 'pack' the length encoding as well</li> </ol> <p>The second point is more difficult, we are effectively down to a double length encoding:</p> <ol> <li>Indicates how many bits encode the length</li> <li>Actually encode the length on these bits</li> </ol> <p>For example:</p> <pre><code>0 -> 0 0 1 -> 0 1 2 -> 10 10 3 -> 10 11 4 -> 110 100 5 -> 110 101 8 -> 1110 1000 16 -> 11110 10000 (=> 1 byte and 2 bits) </code></pre> <p>It works pretty well for very small lists, but quickly degenerate:</p> <pre><code># Original scheme length = ceil( ( log(n) / 7) # New scheme length = 2 * ceil( log(n) ) </code></pre> <p>The breaking point ? <code>8</code></p> <p>Yep, you read it right, it's only better for list with less than 8 elements... and only better by "bits".</p> <pre><code>n -> bits spared [0,1] -> 6 [2,3] -> 4 [4,7] -> 2 [8,15] -> 0 # Turn point [16,31] -> -2 [32,63] -> -4 [64,127] -> -6 [128,255] -> 0 # Interesting eh ? That's the whole byte effect! </code></pre> <p>And of course, once the compression kicks in, chances are it won't really matter.</p> <p>I understand you may appreciate <code>recursive</code>'s algorithm, but I would still advise to compute the figures of the actual space consumption or even better to actually test it with archiving applied on real test sets.</p> <p><strong>8. Recursive / Variable coding</strong></p> <p>I have read with interest <code>TheDon</code>'s answer, and the link he submitted to <a href="http://en.wikipedia.org/wiki/Elias_omega_coding" rel="noreferrer">Elias Omega Coding</a>.</p> <p>They are sound answers, in the theoretical domain. Unfortunately they are quite unpractical. The main issue is that they have extremely interesting asymptotic behaviors, but when do we actually need to encode a Gigabyte worth of data ? Rarely if ever.</p> <p>A recent study of memory usage at work suggested that most containers were used for a dozen items (or a few dozens). Only in some very rare case do we reach the thousand. Of course for your particular problem the best way would be to actually examine your own data and see the distribution of values, but from experience I would say you cannot just concentrate on the high end of the spectrum, because your data lay in the low end.</p> <p>An example of <code>TheDon</code>'s algorithm. Say I have a list <code>[0,1,0,1,0,1,0,1]</code></p> <pre><code>len('01010101') = 8 -> 1000 len('1000') = 4 -> 100 len('100') = 3 -> 11 len('11') = 2 -> 10 encode('01010101') = '10' '0' '11' '0' '100' '0' '1000' '1' '01010101' len(encode('01010101')) = 2 + 1 + 2 + 1 + 3 + 1 + 4 + 1 + 8 = 23 </code></pre> <p>Let's make a small table, with various 'tresholds' to stop the recursion. It represents the number of bits of overhead for various ranges of <code>n</code>.</p> <pre><code>threshold 2 3 4 5 My proposal ----------------------------------------------- [0,3] -> 3 4 5 6 8 [4,7] -> 10 4 5 6 8 [8,15] -> 15 9 5 6 8 [16,31] -> 16 10 5 6 8 [32,63] -> 17 11 12 6 8 [64,127] -> 18 12 13 14 8 [128,255]-> 19 13 14 15 16 </code></pre> <p>To be fair, I concentrated on the low end, and my proposal is suited for this task. I wanted to underline that it's not so clear cut though. Especially because near <code>1</code>, the <code>log</code> function is almost linear, and thus the recursion loses its charm. The treshold helps tremendously and <code>3</code> seems to be a good candidate...</p> <p>As for <code>Elias omega coding</code>, it's even worse. From the wikipedia article:</p> <pre><code>17 -> '10 100 10001 0' </code></pre> <p>That's it, a whooping 11 bits.</p> <p><em>Moral: You cannot chose an encoding scheme without considering the data at hand.</em></p> <p>So, unless your <code>List<Boolean></code> have a length in the hundreds, don't bother and stick to my little proposal.</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload