Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    text
    copied!<p>Go with a single file. At the front of the file, store the ID-to-offset mapping. Assuming your ID space is sparse, store an array of ID+offset pairs, sorted by ID. Use binary search to find the right entry. Roughly <em>log(n/K)</em> seeks, where "K" is the number of ID+offset pairs you can store on a single disk block (though the OS might need an additional additional seek or two to find each block).</p> <p>If you want spend some memory to reduce disk seeks, store an in-memory sorted array of every 10,000th ID. When looking up an ID, find the closest ID without going over. This will give you a 10,000-ID range in the header that you can binary search over. You can very precisely scale up/down your memory usage by increasing/decreasing the number of keys in the in-memory table.</p> <p><strong>Dense ID space</strong>: But all of this is completely unnecessary if your ID space is relatively dense, which it seems it might be since you have 1 billion IDs out of a total possible ~4 billion (assuming <code>uint</code> is 32-bits).</p> <p>The sorted array technique described above requires storing ID+offset for 1 billion IDs. Assuming offsets are 8 bytes, this requires 12 GB in the file header. If you went with a straight array of offsets it would require 32 GB in the file header, but now only a single disk seek (plus the OS's seeks) and no in-memory lookup table.</p> <p>If 32 GB is too much, you can use a hybrid scheme where you use an array on the first 16 or 24 bits and use a sorted array for the last 16 or 8. If you have multiple levels of arrays, then you basically have a trie (as someone else suggested).</p> <p><strong>Note on multiple files</strong>: With multiple files you're basically trying to use the operating system's name lookup mechanism to handle one level of your ID-to-offset lookup. This is not as efficient as handling the entire lookup yourself.</p> <p>There may be other reasons to store things as multiple files, though. With a single file, you need to rewrite your entire dataset if anything changes. With multiple files you only have to rewrite a single file. This is where the operating system's name lookup mechanism comes in handy.</p> <p>But if you do end up using multiple files, it's probably more efficient for ID lookup to make sure they have roughly the same number of keys rather than the same file size.</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload