Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    text
    copied!<p>There is a lot of dubious information in the answers here. You tagged your question with <code>cryptography</code> and only mention cryptographic hash functions, but it sounds like you don't really need cryptographic security, in particular because you say:</p> <blockquote> <p>I have over 1 billion records I need to hash and verify if they are unique.</p> </blockquote> <p>There are four properties to a <a href="http://en.wikipedia.org/wiki/Cryptographic_hash_function" rel="nofollow">cryptographic hash function</a>:</p> <blockquote> <ul> <li>it is easy to compute the hash value for any given message</li> <li>it is infeasible to generate a message that has a given hash</li> <li>it is infeasible to modify a message without changing the hash</li> <li>it is infeasible to find two different messages with the same hash.</li> </ul> </blockquote> <p>You're really only interested in the first quality and the uniqueness is a smaller scale requirement only partially related to the other three properties of cryptographic security.</p> <h2>Why do you care?</h2> <p>There is overhead in cryptographic security. You don't need it, and you're interested in speed, so why not skip it? The hash width of MD5 and the SHA family are admittedly large enough for your purposes.</p> <p>Check out the list of <a href="http://en.wikipedia.org/wiki/List_of_hash_functions" rel="nofollow">hash functions</a> on wikipedia, or check out the article on <a href="http://en.wikipedia.org/wiki/Hash_function" rel="nofollow">normal hash functions</a>. More to the point, what's wrong with the built in .NET hashing functions? Have you tried just deferring to the <a href="http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx" rel="nofollow"><code>Object.GetHashCode()</code></a> method? That MSDN reference has a lot to say about using hash functions. You don't say much about the data you're hashing, so it's hard to say whether the output would be unique between your objects or not. How are you feeding the object into the MD5 hasher? I presume you're taking the binary representation of it. A similar approach could be used to use the built-in non-crypto hash function.</p> <p>You may be concerned about the uniqueness of the built in hash functions. They do only return a regular int, which is 2^32, only about 4 times larger than the data set you're working with. However, you always need to have a back up plan for hash functions. Collisions are infeasible, not impossible. The standard fallback is to perform a more expensive comparison, usually a reference comparison and a field-wise value comparison.</p> <p>If you aren't prepared to do an exact comparison on your hash outputs, you're basically counting down until you get a false positive. That might not be a big deal for you: only you can judge what the downside there is.</p> <p>Additionally, performing another hash function computation is probably not much faster than the direct comparison. You're better off on all counts to go with the sure thing and perform the lengthy, direct comparison.</p> <p>Another common anti-collision technique is to use multiple keys. So if your data points have several large subcomponents, you hash and compare the independently. If it has some large and some small components (say some simple numeric types), you hash the large ones and do a direct comparison on the small ones. If they have some data that's easy to take the ordinal of (say the lengths of strings or the size of some containers), you can perform the direct comparison on those bits.</p> <p>If that doesn't work out for you, take a look at implementations of the other hash functions listed on the wiki. Here's a <a href="http://blog.teamleadnet.com/2012/08/murmurhash3-ultra-fast-hash-algorithm.html" rel="nofollow">pretty good reference for MurmerHash3</a>, which can compute 32 bit or 128 bit hash values. There are other hash functions in the list that have long hash widths as well and also have C# libraries available. But as that reference points out, Murmurhash is way faster than MD5 and SHA functions, although it doesn't do a direct comparison to the Object.GetHashCode method I mentioned above.</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload