Because lossless compression works better on some areas than others, if you store compressed data in blocks of a convenient length BLOCKSIZE, even though each block contains exactly the same number of compressed bytes, some blocks will expand to a much longer piece of plaintext than others.

You might look at "Compression: A Key for Next-Generation Text Retrieval Systems" by Nivio Ziviani, Edleno Silva de Moura, Gonzalo Navarro, and Ricardo Baeza-Yates in *Computer* magazine, November 2000: http://doi.ieeecomputersociety.org/10.1109/2.881693

Their decompressor takes 1, 2, or 3 whole bytes of compressed data and decompresses it (using a vocabulary list) into a whole word. One can directly search the compressed text for words or phrases, which turns out to be even faster than searching uncompressed text.

Their decompressor lets you point to any word in the text with a normal (byte) pointer and start decompressing immediately from that point.

You can give every word a unique two-byte code, since you probably have fewer than 65,000 unique words in your text. (There are almost 13,000 unique words in the KJV Bible.) Even if there are more than 65,000 words, it's pretty simple to assign the first 256 two-byte code "words" to all possible bytes, so you can spell out words that aren't in the lexicon of the 65,000 or so "most frequent words and phrases". (The compression gained by packing frequent words and phrases into two bytes is usually worth the "expansion" of occasionally spelling out a word using two bytes per letter.)

There are a variety of ways to pick a lexicon of "frequent words and phrases" that will give adequate compression. For example, you could tweak an LZW compressor to dump the "phrases" it uses more than once to a lexicon file, one line per phrase, and run it over all your data. Or you could arbitrarily chop up your uncompressed data into 5-byte phrases in a lexicon file, one line per phrase. Or you could chop up your uncompressed data into actual English words and put each word -- including the space at the beginning of the word -- into the lexicon file. Then use `sort --unique` to eliminate duplicate words in that lexicon file. (Is picking the perfect "optimum" lexicon wordlist still considered NP-hard?)

Store the lexicon at the beginning of your huge compressed file, pad it out to some convenient BLOCKSIZE, and then store the compressed text -- a series of two-byte "words" -- from there to the end of the file. Presumably the searcher will read this lexicon once and keep it in some quick-to-decode format in RAM during decompression, to speed up decompressing "two-byte code" into "variable-length phrase". My first draft would start with a simple one-line-per-phrase list, but you might later switch to storing the lexicon in a more compressed form using some sort of incremental coding or zlib.

You can pick any random even byte offset into the compressed text and start decompressing from there. I don't think it's possible to make a finer-grained random-access compressed file format.
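
As a concrete illustration of the two-byte-code scheme, here is a minimal sketch in Python. This is my own illustration, not the actual format from the Ziviani et al. paper; the function names, the little-endian code layout, and the UTF-8 spelling-out of unknown words are all assumptions. Codes 0-255 are reserved for literal bytes, and codes 256 and up index into a lexicon of words, each stored with its leading space.

```python
import struct

def build_lexicon(text, max_words=65536 - 256):
    """Collect unique words, each keeping its leading space -- the moral
    equivalent of dumping one word per line and running `sort --unique`."""
    lexicon, seen = [], set()
    for word in text.split(" "):
        token = " " + word
        if token not in seen and len(lexicon) < max_words:
            seen.add(token)
            lexicon.append(token)
    return lexicon

def compress(text, lexicon):
    """Emit one little-endian two-byte code per word; words that are not
    in the lexicon are spelled out with the 256 literal-byte codes."""
    index = {token: 256 + i for i, token in enumerate(lexicon)}
    out = bytearray()
    for word in text.split(" "):
        token = " " + word  # every token keeps its leading space
        if token in index:
            out += struct.pack("<H", index[token])
        else:
            for b in token.encode("utf-8"):
                out += struct.pack("<H", b)  # "two bytes per letter"
    return bytes(out)
```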
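Laying the file out as described -- lexicon first, padded to a BLOCKSIZE boundary, then the stream of two-byte codes -- could look like the sketch below. The 4096-byte BLOCKSIZE and the one-phrase-per-line header format are placeholder choices; a real format would also need to record where the lexicon ends, for example by storing its padded length up front, which the scheme above leaves open.

```python
BLOCKSIZE = 4096  # any convenient block size

def write_compressed_file(path, text):
    lexicon = build_lexicon(text)
    header = "\n".join(lexicon).encode("utf-8") + b"\n"
    padding = (-len(header)) % BLOCKSIZE        # pad header to a block boundary
    with open(path, "wb") as f:
        f.write(header + b"\x00" * padding)     # lexicon, one phrase per line
        f.write(compress(text, lexicon))        # two-byte codes to end of file
```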
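The payoff is random access and direct search: every code is exactly two bytes, so any even offset into the code stream is a valid starting point, and searching for a lexicon word reduces to a substring search for its two-byte code. Again this is only a sketch with made-up names:

```python
def decompress_from(compressed, lexicon, offset):
    """Start decoding at any even byte offset into the code stream."""
    assert offset % 2 == 0
    pieces, literal = [], bytearray()
    for i in range(offset, len(compressed) - 1, 2):
        (code,) = struct.unpack_from("<H", compressed, i)
        if code < 256:
            literal.append(code)                 # spelled-out byte
        else:
            if literal:
                pieces.append(literal.decode("utf-8", "replace"))
                literal.clear()
            pieces.append(lexicon[code - 256])   # whole word from the lexicon
    if literal:
        pieces.append(literal.decode("utf-8", "replace"))
    return "".join(pieces)

def find_word(compressed, lexicon, word):
    """Search the compressed stream directly for a lexicon word."""
    index = {token: 256 + i for i, token in enumerate(lexicon)}
    code = index.get(" " + word)
    if code is None:
        return -1
    needle = struct.pack("<H", code)
    pos = compressed.find(needle)
    while pos != -1 and pos % 2 != 0:            # only even offsets are code boundaries
        pos = compressed.find(needle, pos + 1)
    return pos
```

Any even position returned by `find_word` can be handed straight to `decompress_from`, which is the "point a normal byte pointer at a word and start decompressing immediately" property described above.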
 
