Note that there are some explanatory texts on larger screens.

plurals
  1. POAlgorithm for determining a file’s identity (Optimisation)
    primarykey
    data
    text
    <p>Further to this question: <a href="https://stackoverflow.com/questions/456175/algorithm-for-determining-a-files-identity">Algorithm for determining a file’s identity</a></p> <p><strong>Recap</strong>: I'm looking for a cheap algorithm for determining a files identity which works the vast majority of the time. </p> <p>I went ahead and implemented an algorithm that gives me a "<em>pretty unique</em>" hash per file.</p> <p>The way my algorithm works is:</p> <ul> <li><p>For files smaller than a certain threshold I use the full files content for the identity hash. </p></li> <li><p>For files larger than the threshold I take random N samples of X size.</p></li> <li><p>I include the filesize in the hashed data. (meaning all files with different sizes result in a different hash) </p></li> </ul> <p>Questions: </p> <ul> <li><p>What values should I choose for N and X (how many random samples should I take of which size?) I went with 4 samples of 8K each and am not able to stump the algorithm. I found that increasing the amount of samples quickly decreases the speed of the algorithm (cause seeks are pretty expensive) </p></li> <li><p>The maths one: how non-different do my files need to be for this algorithm to blow up. (2 different files with same length end up having the same hash) </p></li> <li><p>The optimization one: Are there any ways I can optimize my concrete implementation to improve throughput (I seem to be able to do about 100 files a second on my system).</p></li> <li><p>Does this implementation look sane? Can you think of any real world examples where this will fail. (My focus is on media files) </p></li> </ul> <p>Relevant information:</p> <p><a href="http://gist.github.com/101591" rel="nofollow noreferrer">The algorithm I implemented</a></p> <p>Thanks for your help!</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload