Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>You've omitted critical information about the number of strings you intend to index.</p> <p>But given that you say you expect the <em>minimum</em> length of an indexed string to be 256, storing the indices as 64% incurs <em>at most</em> 3% overhead. If the total length of the string file is less than 4GB, you could use 32-bit indices and incur 1.5% overhead. These numbers suggest to me that if compression matters, <strong>you're better off compressing the strings, not the indices</strong>. For that problem a variation on LZ77 seems in order.</p> <p>If you want to try a wild idea, put each string in a separate file, pull them all into a zip file, and see how you can do with <a href="http://zziplib.sourceforge.net/" rel="nofollow noreferrer"><code>zziplib</code></a>. This probably won't be great, but it's nearly zero work on your part.</p> <p>More data on the problem would be welcome:</p> <ul> <li>Number of strings</li> <li>Average length of a string</li> <li>Maximum length of a string</li> <li>Median length of strings</li> <li>Degree to which the strings file compresses with <code>gzip</code></li> <li>Whether you are allowed to change the order of strings to improve compression</li> </ul> <h2>EDIT</h2> <p>The comment and revised question makes the problem much clearer. I like your idea of grouping, and I would try a simple delta encoding, group the deltas, and use a variable-length code within each group. I wouldn't wire in 64 as the group size&ndash;I think you will probably want to determine that empirically. </p> <p>You asked for existing libraries. For the grouping and delta encoding I doubt you will find much. For variable-length integer codes, I'm not seeing much in the way of C libraries, but you can find variable-length codings in <a href="http://www.dlugosz.com/ZIP2/VLI.html" rel="nofollow noreferrer">Perl</a> and <a href="http://www.codecodex.com/wiki/index.php?title=Variable-Length_Integers" rel="nofollow noreferrer">Python</a>. There are a ton of papers and some patents on this topic, and I suspect you're going to wind up having to roll your own. But there are some simple codes out there, and you could give UTF-8 a try&mdash;it can code unsigned integers up to 32 bits, and you can grab C code from <a href="http://plan9.bell-labs.com/sources/plan9/sys/src/" rel="nofollow noreferrer">Plan&nbsp;9</a> and I'm sure many other sources.</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload