Note that there are some explanatory texts on larger screens.

plurals
  1. POUsing a smoother with the L Method to determine the number of K-Means clusters
    primarykey
    data
    text
    <p>Has anyone tried to apply a smoother to the evaluation metric before applying the L-method to determine the number of k-means clusters in a dataset? If so, did it improve the results? Or allow a lower number of k-means trials and hence much greater increase in speed? Which smoothing algorithm/method did you use?</p> <p>The "L-Method" is detailed in: <a href="http://cs.fit.edu/~pkc/papers/ictai04salvador.pdf" rel="noreferrer"><strong>Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms</strong>, Salvador &amp; Chan</a></p> <p>This calculates the evaluation metric for a range of different trial cluster counts. Then, to find the knee (which occurs for an optimum number of clusters), two lines are fitted using linear regression. A simple iterative process is applied to improve the knee fit - this uses the existing evaluation metric calculations and does not require any re-runs of the k-means.</p> <p>For the evaluation metric, I am using a reciprocal of a simplified version of the Dunns Index. Simplified for speed (basically my diameter and inter-cluster calculations are simplified). The reciprocal is so that the index works in the correct direction (ie. lower is generally better).</p> <p>K-means is a stochastic algorithm, so typically it is run multiple times and the best fit chosen. This works pretty well, but when you are doing this for 1..N clusters the time quickly adds up. So it is in my interest to keep the number of runs in check. Overall processing time may determine whether my implementation is practical or not - I may ditch this functionality if I cannot speed it up.</p>
    singulars
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. COThinking about this further, I don't think an even (ie. running average) smoother would have any notable effect, because the L-Method then fits lines using least squares. However, a shaped smoother such as a Gaussian might behave differently. I am going to try and implement a Gaussian of moderate size (half width of about 6-10 seems about right to me). It is going to be a qualitative test.
      singulars
    2. COI do think this will be a good moderate sized research project. If there are any college students looking for a project, I'd be interested in collaborating/mentoring/co-authoring. Such a project should perform quantitative comparisons and be more general than my specific application. I'll add the project-ideas tag to the question.
      singulars
    3. COI have some very rough, unscientific and qualitative results: I tried Gaussian filters of HalfWidthHalfHeight of 5 and 3. In both cases, it increased the estimated number of clusters, but the estimated error dropped (I ran tests of about 8-10 runs with each configuration). This is real world data, and an increase in the estimate is plausible. So I think this provides enough to warrant a mini research project with controlled data and under better conditions.
      singulars
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload