It sounds like a custom compression scheme that exploits the structure of the data could be very efficient.

Firstly, using a `short[]` (a 16-bit data type) instead of an `int[]` will halve (!) the amount of data sent. You can do this because the numbers easily fit between `-2^15` (-32768) and `2^15-1` (32767). This is ridiculously easy to implement.

Secondly, you could use a scheme similar to run-length encoding: a positive number represents that number literally, while a negative number represents that many zeros (after taking the absolute value). For example:

```
[10, 40, 0, 0, 0, 30, 0, 100, 0, 0, 0, 0]  <=>  [10, 40, -3, 30, -1, 100, -4]
```

This is harder to implement than just substituting `short` for `int`, but will provide ~80% compression in the very worst case (1000 numbers, 100 non-zero, none of which are consecutive). A sketch of this encoding is given at the end of this answer.

I just did some simulations to work out the compression ratios. I tested the method I described above and the one suggested by Louis Wasserman and sbridges. Both performed very well.

Assuming the length of the array and the number of non-zero numbers are both uniformly distributed between their bounds, both methods save about 5400 `int`s (or `short`s) on average, with a compressed size of about 2.5% of the original! The run-length encoding method seems to save about 1 additional `int` (an average compressed size that is 0.03% smaller), i.e. basically no difference, so you should use whichever is easiest to implement. The following are histograms of the compression ratios for 50000 random samples (they are very similar!).

![histograms](https://i.stack.imgur.com/bHHMe.png)

**Summary**: using `short`s instead of `int`s and one of the compression methods, you will be able to compress the data to about 1% of its original size!

For the simulation, I used the following R script:

```r
SIZE <- 50000

lengths  <- sample(1000:10000, SIZE, replace=T)
nonzeros <- sample(1:100, SIZE, replace=T)

f.rle <- function(len, nonzero) {
  indexes <- sort(c(0, sample(1:len, nonzero, F)))
  steps <- diff(indexes)
  sum(steps > 1) + nonzero  # one short per run of zeros, and one per non-zero value
}
f.index <- function(len, nonzero) { nonzero * 2 }

# using the [value, -1 * number of zeros, ...] method
rle.comprs <- mapply(f.rle, lengths, nonzeros)
print(mean(lengths - rle.comprs))  # average number of shorts saved
rle.ratios <- rle.comprs / lengths * 100
print(mean(rle.ratios))            # average compression ratio

# using the [(index, value), ...] method
index.comprs <- mapply(f.index, lengths, nonzeros)
print(mean(lengths - index.comprs))  # average number of shorts saved
index.ratios <- index.comprs / lengths * 100
print(mean(index.ratios))            # average compression ratio

par(mfrow=c(2,1))
hist(rle.ratios, breaks=100, freq=F, xlab="Compression ratio (%)", main="Run length encoding")
hist(index.ratios, breaks=100, freq=F, xlab="Compression ratio (%)", main="Store indices")
```
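In case it helps, here is a minimal Java sketch of the negative-count scheme described above. The class and method names are just illustrative, and it assumes the actual values are positive (negative entries are reserved as zero-run markers) and that no run of zeros exceeds 32767, which holds for arrays of at most 10000 elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ZeroRunCodec {

    // Encode: keep non-zero values literally; replace each run of zeros
    // with a single negative entry whose absolute value is the run length.
    static short[] encode(short[] data) {
        List<Short> out = new ArrayList<>();
        int i = 0;
        while (i < data.length) {
            if (data[i] != 0) {
                out.add(data[i]);
                i++;
            } else {
                int run = 0;
                while (i < data.length && data[i] == 0) {
                    run++;
                    i++;
                }
                out.add((short) -run);  // assumes run <= 32767
            }
        }
        short[] result = new short[out.size()];
        for (int j = 0; j < result.length; j++) {
            result[j] = out.get(j);
        }
        return result;
    }

    // Decode: negative entries expand back into that many zeros.
    static short[] decode(short[] encoded) {
        int length = 0;
        for (short s : encoded) {
            length += (s < 0) ? -s : 1;
        }
        short[] result = new short[length];  // zero-initialised by default
        int pos = 0;
        for (short s : encoded) {
            if (s < 0) {
                pos += -s;          // skip over the run of zeros
            } else {
                result[pos++] = s;  // literal non-zero value
            }
        }
        return result;
    }

    public static void main(String[] args) {
        short[] original = {10, 40, 0, 0, 0, 30, 0, 100, 0, 0, 0, 0};
        short[] packed = encode(original);
        System.out.println(Arrays.toString(packed));          // [10, 40, -3, 30, -1, 100, -4]
        System.out.println(Arrays.toString(decode(packed)));  // round-trips to the original
    }
}
```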