
Finding similarity between datasets
<p>I have data sets containing different values:</p>
<blockquote> <p>Set1 = {X1, X2, ..., Xn}</p> <p>Set2 = {X1, X2, ..., Xn}</p> <p>...</p> </blockquote>
<p>The X values have different ranges (which is exactly why I can't figure out the algorithm I need): some are strictly in [0.0, 1.0], others might be in any range.</p>
<p>I need to figure out a way to "group" these sets, or in other words, find the "similarity" between two given sets.</p>
<p>Obviously I could simply write long chains of "IF" statements comparing each value with its counterpart, and if they differ by more than some DELTA amount, indicate that the two given sets are not "similar". The problem is that my sets are huge and contain dynamic data. Therefore I need a <em>generic</em> function to calculate some sort of hash value for each set (at least that's the way I'm thinking):</p>
<blockquote> <p>int hash1 = HashFunction(Set1)</p> <p>int hash2 = HashFunction(Set2)</p> <p>if (|hash1 - hash2| &lt; DELTA): return "Sets are similar"</p> </blockquote>
<p>I would really appreciate any tips or ideas on how to implement it.</p>
<p><strong>Update:</strong></p>
<p>Reading through the comments, I realized maybe I should change my question a bit as well: <em>What are your suggestions for a good "similarity" metric?</em></p>
<p>By "similarity" I mean some dynamic value indicating how "close" the sets' values are. For example, if I have a test set <em>SetA { 0.5, 100 }</em>, then <em>SetB { 0.5, 100 }</em> should yield 1 (or some other value indicating a perfect match). At the same time, comparing <em>SetA</em> with <em>SetC { 0.1, 300 }</em> should return a much lower "match" value, while <em>SetD { 0.45, 101 }</em> should return a value close to a "perfect match". The key thing to notice here is that, for example, the values 0.45 and 0.5 are "more similar" than the values 100 and 300, because: <em>|0.45 - 0.5| / max(0.45, 0.5) &lt; |100 - 300| / max(100, 300)</em>.</p>
<p>If I simply calculate the sum of the value differences between two sets, it won't give me any meaningful result (two sets can contain completely different numbers, from a logical point of view, yet the differences can cancel each other out, giving an incorrect result).</p>
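<p>To make the metric described above concrete, here is a minimal Python sketch of one possible reading of it. It is an illustration, not a definitive answer. It assumes that both sets have the same length and that values are compared position by position (X1 with X1, X2 with X2, ...). The names <em>relative_difference</em> and <em>similarity</em> are hypothetical. It uses absolute values in the denominator and caps each term at 1 so that zero or negative inputs do not break the formula; that goes beyond the positive-valued example in the question.</p>

```python
from typing import Sequence


def relative_difference(a: float, b: float) -> float:
    """Scale-free difference |a - b| / max(|a|, |b|), limited to [0.0, 1.0]."""
    scale = max(abs(a), abs(b))
    if scale == 0.0:
        return 0.0  # both values are zero, so they match exactly
    # Assumption: cap at 1.0 so values of opposite sign count as
    # "completely different" rather than exceeding the range.
    return min(abs(a - b) / scale, 1.0)


def similarity(set_a: Sequence[float], set_b: Sequence[float]) -> float:
    """Return 1.0 for a perfect match, approaching 0.0 as the values diverge."""
    if len(set_a) != len(set_b):
        raise ValueError("both sets must contain the same number of values")
    if not set_a:
        return 1.0  # assumption: two empty sets are treated as identical
    # Each term is non-negative, so differences cannot cancel each other out.
    total = sum(relative_difference(a, b) for a, b in zip(set_a, set_b))
    return 1.0 - total / len(set_a)


set_a = [0.5, 100]
set_b = [0.5, 100]
set_c = [0.1, 300]
set_d = [0.45, 101]

print(similarity(set_a, set_b))            # perfect match
print(round(similarity(set_a, set_c), 2))  # much lower
print(round(similarity(set_a, set_d), 2))  # close to a perfect match
```

<p>With the example sets from the question, this gives 1.0 for SetA vs. SetB, roughly 0.27 for SetA vs. SetC, and roughly 0.95 for SetA vs. SetD. Because each value pair is normalized by its own magnitude, a value in [0.0, 1.0] and a value in the hundreds contribute equally to the result.</p>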
 
