The critical parameter in that kind of job is the cost of an individual disk lookup. A disk lookup has an inherent latency, because the read/write heads must be moved to the right position. On a typical disk, you can count on about a hundred lookups per second. On the other hand, disks are very good at sequential reading, so for each lookup you can read, say, one megabyte worth of data for little extra cost.

Here I assume that the "text file" has a regular format. For instance, each hash value uses exactly 33 bytes: 32 for the MD5 result itself (in hexadecimal) and 1 extra byte for a newline character. Adjust if needed, depending on the exact format. With these figures, your text file is about 3.3 GB long.

Since MD5 acts mostly like a random function, the 100 million hashes should be uniformly spread over the space of 128-bit values. This means that, given a hash value, you can compute the approximate position of that value in the file (if it is in the file). For instance, the hash value `9378ec093d09863d008154f1c8f5ca8f` should be at an offset close to *0.5761·n·33*, where *n* is the number of hashes in the big file and "33" is the record size explained in the paragraph above; *0.5761* is *0x9378EC* divided by *0x1000000*. Hence you can read one megabyte worth of your text file, centered on that computed position. This will contain about 30000 hashes. The standard deviation for 100 million random values is on the order of 10000, so chances are high that those 30000 hashes will contain the right values to decide whether your hash is in the list or not. If the estimate was off, you will have to read another megabyte, but this will not happen often. Possibly, you could read a bit more than a megabyte to make that occurrence rare: there is a trade-off, to be adjusted through actual measurements.

Once you have a (small) block of hash values in RAM, use a binary search. The initial lookup cost will completely dwarf that part anyway.

An alternate solution uses an extra index file. Build a secondary file which contains one out of every 10000 hashes from the big file. This file will be about 330 kB long. Keep it in RAM as much as possible. Use it (with a binary search) to determine which run of 10000 hashes is relevant for your lookup, then read that chunk from the big file. The index file must be rebuilt whenever the list of hashes changes; this is somewhat expensive, but less so than changing the big file itself. Depending on the system which produces the big file, you may be able to integrate the index file generation at negligible extra cost.
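As a concrete illustration of the direct approach, here is a minimal Python sketch. It assumes the 33-byte-per-line format and lowercase hexadecimal described above, and a file sorted by hash value; the function name `probe` and the way a miss near the window edge is reported are illustrative choices, not part of the original recipe.

```python
import bisect

RECORD = 33        # 32 hex characters + 1 newline per line, as assumed above
WINDOW = 1 << 20   # read about one megabyte around the estimated offset

def probe(path, target_hex, n_hashes):
    """Check whether target_hex (32 lowercase hex chars) is in the sorted hash file."""
    # Estimate the record index from the top 24 bits of the hash,
    # e.g. 0x9378EC / 0x1000000 = 0.5761 for the hash value quoted above.
    frac = int(target_hex[:6], 16) / 0x1000000
    center = int(frac * n_hashes) * RECORD

    with open(path, "rb") as f:
        start = max(0, center - WINDOW // 2)
        start -= start % RECORD          # align the read to a record boundary
        f.seek(start)
        block = f.read(WINDOW)

    # Cut the block into 32-byte hash values (dropping newlines) and binary-search it.
    hashes = [block[i:i + 32] for i in range(0, len(block) - 31, RECORD)]
    target = target_hex.encode("ascii")
    i = bisect.bisect_left(hashes, target)
    if i < len(hashes) and hashes[i] == target:
        return True
    if i == 0 or i == len(hashes):
        return None   # estimate was off: the caller should read a shifted window
    return False
```

A `None` result means the target fell outside the window, so the caller would re-read a shifted or wider window; as noted above, that case should be rare.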
 

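The index-file variant could look like the following sketch, under the same 33-byte format assumption. The helper names `build_index`, `load_index`, and `lookup` are hypothetical; the choice of keeping one hash out of every 10000 follows the description above.

```python
import bisect

RECORD = 33      # 32 hex characters + 1 newline per line
STRIDE = 10000   # keep one hash out of every 10000 in the index

def build_index(big_path, index_path):
    """Copy every STRIDE-th record of the big file into a small index file (~330 kB)."""
    with open(big_path, "rb") as src, open(index_path, "wb") as dst:
        k = 0
        while True:
            src.seek(k * STRIDE * RECORD)
            rec = src.read(RECORD)
            if len(rec) < RECORD:
                break
            dst.write(rec)
            k += 1

def load_index(index_path):
    """Load the sampled hashes into RAM as a sorted list of 32-byte values."""
    with open(index_path, "rb") as f:
        data = f.read()
    return [data[i:i + 32] for i in range(0, len(data), RECORD)]

def lookup(big_path, index, target_hex):
    """Binary-search the in-RAM index, then read and search one ~330 kB chunk."""
    target = target_hex.encode("ascii")
    # index[k] is the first hash of chunk k; pick the last chunk starting at or below target.
    chunk = max(bisect.bisect_right(index, target) - 1, 0)
    with open(big_path, "rb") as f:
        f.seek(chunk * STRIDE * RECORD)
        block = f.read(STRIDE * RECORD)
    hashes = [block[i:i + 32] for i in range(0, len(block) - 31, RECORD)]
    j = bisect.bisect_left(hashes, target)
    return j < len(hashes) and hashes[j] == target
```

Loading the index once (`index = load_index(...)`) and reusing it across queries keeps the per-lookup cost to a single seek plus one sequential read of about 330 kB.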