Algorithm for determining a file's identity
<p>For an open source project of mine, I am writing an abstraction layer on top of the filesystem.</p>
<p>This layer allows me to attach metadata and relationships to each file.</p>
<p>I would like the layer to handle renames gracefully and keep the metadata when a file is renamed, moved, or copied.</p>
<p>To do this I need a mechanism for calculating the identity of a file. The obvious solution is to calculate a SHA-1 hash of each file and attach the metadata to that hash. But that is really expensive, especially for movies.</p>
<p>So I have been thinking of an algorithm that, though not 100% correct, will be right the vast majority of the time and is cheap.</p>
<p>One such algorithm could be to calculate the hash from the file size and a sample of bytes from the file.</p>
<p>Which bytes should I choose for the sample? How do I keep the calculation cheap and reasonably accurate? I understand there is a tradeoff here, but performance is critical, and the user will be able to handle situations where the system makes mistakes.</p>
<p>I need this algorithm to work for both very large files (1 GB+) and tiny files (5 KB).</p>
<p><strong>EDIT</strong></p>
<p>I need this algorithm to work on NTFS and on all SMB shares (Linux or Windows based). I would like it to support situations where a file is copied from one spot to another (the two physical copies are treated as one identity). I may even want this to work in situations where MP3s are re-tagged (the physical file changes, so I may need an identity provider per file type).</p>
<p><strong>EDIT 2</strong></p>
<p>Related question: <a href="https://stackoverflow.com/questions/788761/algorithm-for-determining-a-files-identity-optimisation">Algorithm for determining a file’s identity (Optimisation)</a></p>
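<p>To make the "file size plus a sample of bytes" idea concrete, here is a minimal sketch in Python. It is only one possible answer to the question of which bytes to sample, not a recommended or tested scheme: the function name <code>sampled_identity</code> and the parameters <code>chunk_size</code> and <code>samples</code> are hypothetical, and the choice of evenly spaced chunks (always including the start and the end of the file) is an assumption. Files small enough to fit in the sample budget are hashed in full.</p>

```python
import hashlib
import os


def sampled_identity(path, chunk_size=4096, samples=8):
    """Return a hex digest built from the file size plus a few sampled chunks.

    Hypothetical helper: chunk_size and samples are illustrative defaults,
    not tuned values.
    """
    if samples < 2:
        raise ValueError("samples must be at least 2")

    size = os.path.getsize(path)
    digest = hashlib.sha1()
    # Mix the size in first so files of different lengths are very
    # unlikely to end up with the same identity.
    digest.update(str(size).encode("ascii") + b":")

    with open(path, "rb") as f:
        if size <= chunk_size * samples:
            # Small file: hashing everything costs no more than sampling.
            digest.update(f.read())
        else:
            # Evenly spaced offsets: the first chunk starts at byte 0 and
            # the last chunk ends exactly at end-of-file.
            last_start = size - chunk_size
            for i in range(samples):
                offset = (last_start * i) // (samples - 1)
                f.seek(offset)
                digest.update(f.read(chunk_size))

    return digest.hexdigest()
```

<p>With these defaults a 1 GB file costs eight seeks and 32 KB of reads. The tradeoff is the one described above: two copies of the same file get the same identity regardless of path, but a change that falls entirely between the sampled chunks goes unnoticed, and re-tagging an MP3 would most likely change the identity because the sample includes the start of the file.</p>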
 
