StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
text
Body
copied!<p>John Reiser <a href="http://groups.google.com/group/comp.compression/msg/9a72681e14cda1d4" rel="nofollow noreferrer">explained on comp.compression</a>:</p> <blockquote> <p>For the dictionary: make a <strong>histogram of short substrings</strong>, sort by payoff (number of occurrences times number of bits saved when compressed) and put the highest-payoff substrings into the dictionary. For example, if k is the length of the shortest substring that can be compressed (usually 3==k or 2==k), then make a histogram of all the substrings of lengths k, 1+k, 2+k, and 3+k. <em>Of course there is some art to placing those substrings into the dictionary, taking advantage of substrings, overlapping, short strings nearer to the high-address end</em>, etc.</p> <p>The Linux kernel uses a similar technique to compress names of symbols that are used for printing backtraces of the subroutine calling stack. See the file scripts/kallsyms.c. For instance, <a href="https://code.woboq.org/linux/linux/scripts/kallsyms.c.html" rel="nofollow noreferrer">https://code.woboq.org/linux/linux/scripts/kallsyms.c.html</a></p> </blockquote> <p>The <a href="http://www.zlib.net/manual.html" rel="nofollow noreferrer">zlib manual recommends</a> to place the most common ocurrences at the end of the dictionary.</p> <blockquote> <p>The dictionary should consist of strings (byte sequences) that are likely to be encountered later in the data to be compressed, with the most commonly used strings preferably put towards the end of the dictionary. Using a dictionary is most useful when the data to be compressed is short and can be predicted with good accuracy; the data can then be compressed better than with the default empty dictionary.</p> </blockquote> <p>This is because LZ77 has a sliding window algorithm, so the later substrings will be reachable further on your stream of data than the first few.</p> <p>I'd play with generating the dictionary with a higher level language with good support of strings. A crude JavaScript example:</p> <pre><code>var str = "The dictionary should consist of strings (byte sequences) that" + " are likely to be encountered later in the data to be compressed," + " with the most commonly used strings preferably put towards the " + "end of the dictionary. Using a dictionary is most useful when the" + " data to be compressed is short and can be predicted with good" + " accuracy; the data can then be compressed better than with the " + "default empty dictionary."; // Extract words, remove punctuation (extra: replace(/\s/g, " ")) var words = str.replace(/[,\;.:\(\)]/g, "").split(" ").sort(); var wcnt = [], w = "", cnt = 0; // pairs, current word, current word count for (var i = 0, cnt = 0, w = ""; i < words.length; i++) { if (words[i] === w) { cnt++; // another match } else { if (w !== "") wcnt.push([cnt, w]); // Push a pair (count, word) cnt = 1; // Start counting for this word w = words[i]; // Start counting again } } if (w !== "") wcnt.push([cnt, w]); // Push last word wcnt.sort(); // Greater matches at the end for (var i in wcnt) wcnt[i] = wcnt[i][1]; // Just take the words var dict = wcnt.join("").slice(-70); // Join the words, take last 70 chars </code></pre> <p>Then dict is a string of 70 chars with:</p> <pre><code>rdsusedusefulwhencanismostofstringscompresseddatatowithdictionarybethe </code></pre> <p>You can try it copy-paste-run <a href="http://www.squarefree.com/shell/shell.html" rel="nofollow noreferrer">here</a> (add: "print(dict)")</p> <p>That's just whole words, not substrings. Also there are ways to overlap common substrings to save space on the dictionary.</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload