Someone hinted at a possible way to index this thing if you have abundant RAM (or possibly even disk/swap) available.

Imagine if you performed a simple 32-bit CRC on a 1K block extending from each character in the original gigabyte string. This would result in 4 bytes of checksum data for each byte offset from the beginning of the data.

By itself this might give a modest improvement in search speed: the checksum of each 1K search target could be checked against each CRC, with each collision tested for a true match. That should still be a couple of orders of magnitude faster than a normal linear search.

That, obviously, costs us 4GB of RAM for the CRC array (plus the original gigabyte for the original data and a little more overhead for the environment and our program).

If we have ~16GB we could sort the checksums and store a list of offsets where each is found. That becomes an indexed search (an average of about 16 probes per target search, worst case around 32 or 33; there might be a fencepost error in there).

It's possible that a 16GB file index would still give better performance than a linear checksum search, and it would almost certainly be better than a linear raw search (unless you have extremely slow filesystems/storage).

(Adding:) I should clarify that this strategy is only beneficial given that you've described a need to do many searches on the same one-gigabyte data blob.

You might use a threaded approach to building the index: one thread reading the data while multiple threads perform the checksumming. You might also offload the indexing into separate processes or a cluster of nodes (particularly if you use a file-based index, the ~16GB option described above). With a simple 32-bit CRC you might be able to perform the checksums/indexing as fast as your reader thread can get the data (but we are talking about 1024 checksums for each 1K of data, so perhaps not).

You might further improve performance by coding a Python module in C for actually performing the search, and/or possibly for performing the checksumming/indexing.

The development and testing of such C extensions entail other trade-offs, obviously enough, and it sounds like this would have near-zero re-usability.
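Since the post mentions Python, here is a minimal Python sketch of the sorted-index variant, assuming the blob fits in memory and using a 64 KiB stand-in for the 1GB data. The names `build_index` and `find` and the demo sizes are my own illustration, not from the post; at the real scale you would want compact storage (e.g. `array` or numpy rather than Python lists) plus the rolling/threaded checksumming discussed above.

```python
import bisect
import os
import zlib

WINDOW = 1024  # the 1K block size from the post

def build_index(data: bytes):
    """Return parallel (sorted_crcs, offsets) lists covering the
    WINDOW-byte block starting at every byte offset in data.

    Note: this recomputes each CRC from scratch, which is
    O(len(data) * WINDOW); a production build would want a rolling
    checksum or a C extension, as the post suggests."""
    pairs = sorted(
        (zlib.crc32(data[i:i + WINDOW]), i)
        for i in range(len(data) - WINDOW + 1)
    )
    # Parallel lists keep the searchable keys compact and bisect-friendly.
    crcs = [c for c, _ in pairs]
    offsets = [o for _, o in pairs]
    return crcs, offsets

def find(data: bytes, target: bytes, crcs, offsets):
    """Yield every offset where the WINDOW-byte target occurs in data."""
    assert len(target) == WINDOW
    key = zlib.crc32(target)
    # Binary search locates the first entry with a matching checksum ...
    i = bisect.bisect_left(crcs, key)
    # ... then each CRC collision is verified against the raw data.
    while i < len(crcs) and crcs[i] == key:
        off = offsets[i]
        if data[off:off + WINDOW] == target:
            yield off
        i += 1

if __name__ == "__main__":
    blob = os.urandom(1 << 16)          # 64 KiB stand-in for the 1GB blob
    crcs, offsets = build_index(blob)
    needle = blob[5000:5000 + WINDOW]
    print(list(find(blob, needle, crcs, offsets)))  # almost surely [5000]
```

Keeping the checksums and offsets as two parallel, fixed-width arrays (rather than Python tuples) is what would bring the memory footprint near the post's ~16GB estimate for the full index.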
 
