Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>I'm actually not surprised they differ. You could also add in Weka's implementation of LOF, and you will probably get yet another answer.</p> <p>Here is one more difference for you to add to your equations: as far as I know, the rapidminer implementation <em>merges</em> points that have the same coordinates. But maybe, they forgot to take these weights into account when computing the nearest neighbors!</p> <p>In the classic database context, you would <strong>not merge duplicate coordinates</strong> into a single observation. They are still valid database records and should be counted as full records.</p> <p>I do not know if any of them perform some automatic data preprocessing such as rescaling the data set.</p> <p><strong>The ELKI implementation has been verified</strong> against a number of textbook examples we use for teaching.</p> <p>However, there are corner cases in the algorithm that are not 100% fixed, so there is room for difference even in "literal" implementations of the algorithm. You have already run into three of these:</p> <ol> <li><p>How to treat duplicate points: A) aggregate, B) drop, C) consider different</p> <p>From a data mining point of view, C is correct, and A (when implemented correctly) is an optimization that can save you unnecessary distance computations. B is the common mathematical view, but does not make as much sense for a database context. If I have two "John Doe", are they the same person?</p></li> <li><p>Definition of k nearest neighbors and k-distance.</p> <p>The usual definition of k-distance is: the smallest distance, such that at least k observations are contained. When excluding the query point, this yields the inverval up to 5.7457 from the starting point: there are 10 other observations in a radius of 5.7457 - 4.32323.</p> <p>The k nearest neighbors are usually defined as any point within this distance, which may be more than k. But then all additional objects must have the <em>same distance as the kth</em>! It seems as if rapidminer uses <em>exactly k</em>, which does not align with the LOF publication (see Definition 4 in the LOF publication!)</p> <p>It's really the k nearest neighbors (including ties, but other than that not more than k objects), <strong>not the k-ths smallest <em>distinct</em> distance</strong>. Where did you get the "distinct" from?</p> <p>Defintions 3 and 4 in the LOF publication are pretty clear on the kNN set LOF uses.</p> <p>Your neighborhood of 48 objects thus is not correct.</p></li> <li><p>What to do if there are more than minPts duplicate points (a literal implementation will yield a division by zero, but for obvious reasons the point should be given a LOF of 1.0)</p> <p>This is maybe what is happening to Rapidminer.</p></li> </ol> <p>And then there is reachability distance: this one is <strong>really tricky</strong>, because it is not a mathematical distance. It is <em>asymmetric</em>.</p> <p>The reachability of the first observation <em>from</em> the second happens to be the k-distance of the second, which from a quick look (did not double check) <code>reach-dist(x[0], x[1]) = max(5.97766 - 5.12595, 5.12595 - 4.32323) = 0.80272</code></p> <p>See <a href="http://www.dbs.ifi.lmu.de/Lehre/KDD/SS12/uebung/Tutorial06Outlier.pdf">my extensive tutorial slides on outlier detection</a> for a step-by-step demonstration of how to compute LOF. As far as I can tell, this is literal LOF. It doesn't touch all the corner cases, but it motivates the design of the LOF algorithm and is quite exhaustive.</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload