Note that there are some explanatory texts on larger screens.

plurals
  1. POUnivocal hash function for a string 76 chars long
    text
    copied!<p>Here's my problem (I'm programming in C):</p> <p>I have some huge text files containing DNA sequences (each file has something like 65 million rows and a size of about 4~5 GB). In these files there are many duplicates (don't know how many yet, but there should be many millions of them) and I want to return in output a file with only distinct values. Each string has a quality value associated, so if e.g I have 5 equal strings with different quality values I'll hold the best one and discard the other 4.</p> <p>Reducing memory requirements and improving speed efficiency as far as I can is VITAL. My idea was to create a JudyHS array using an hash function in order to convert the String DNA sequence (which is 76 letters long and has 7 possible characters) into an integer to reduce memory usage (4 or 8 bytes instead of 76 bytes on many millions of entries should be quite an achievement). This way I could use the integer as index and store only the best quality value for that index. The problem is that I can't find an hash function that UNIVOCALLY defines such a long string and produces a value that can be stored inside an integer or even a long long!</p> <p>My first idea for an hash function was something like the default string hash function in Java: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], but I could obtain a maximal value of 8,52*10^59.. way tooo big. What about doing the same thing and store it in a double? Would the computation become a lot slower? Please note that I'd like a way to UNIVOCALLY define a string, avoiding collisions (or at least they should be extremely rare, because I would have to access the disk at every collision, quite a costly operation...)</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload