Note that there are some explanatory texts on larger screens.

plurals
  1. POWhich data clustering algorithm is appropriate to detect an unknown number of clusters in a time series of events?
    primarykey
    data
    text
    <p>Here's my scenario. Consider a set of events that happen at various places and times - as an example, consider someone high above recording the lightning strikes in a city during a storm. For my purpose, lightnings are instantaneous and can only hit certain locations (such as high buildings). Also imagine each lightning strike has a unique id so one can reference the strike later. There are about 100,000 such locations in this city (as you guess, this is an analogy as my current employer is sensitive about the actual problem).</p> <p>For phase 1, my input is the set of (strike id, strike time, strike location) tuples. The desired output is the set of the clusters of more than 1 event that hit the same location within a short time. The number of clusters is not known in advance (so k-means is not that useful here). What is being considered as 'short' could be predefined for a given clustering attempt. That is, I can set it to, say, 3 minutes, than run the algorithm; later try with 4 minutes or 10 minutes. Perhaps a nice touch would be for the algorithm to determine a 'strength' of clustering and recommend that for a given input, the most compact clustering is achieved by using a particular value for 'short', but this is not required initially.</p> <p>For phase 2, I'd like to take into consideration the amplitude of the strike (i.e., a real number) and look for clusters that are both within a short time and with similar amplitudes. </p> <p>I googled and checked the answers here about data clustering. The information is a bit bewildering (below is the list of links I found useful). AFAIK, k-means and related algorithms would not be useful because they require the number of clusters to be specified apriori. I'm not asking for someone to solve my problem (I like solving it), but some orientation in the large world of data clustering algorithms would be useful in order to save some time. Specifically, what clustering algorithms are appropriate for when the number of clusters is unknown.</p> <p>Edit: I realized the location is irrelevant, in the sense that although events happen all the time, I only need to cluster them <i>per location</i>. So each location has its own time-series of events that can thus be analyzed independently.</p> <p>Some technical details:<br> - as the dataset is not that large, it can fit all in memory.<br> - parallel processing is a nice to have, but not essential. I only have a 4-core machine and MapReduce and Hadoop would be too much.<br> - the language I'm mostly familiar with is Java. I haven't yet used R and the learning curve for it would probably be too much for what time I was given. I'll have a look at it anyway in my spare time.<br> - for the time being, using tools to run the analysis is ok, I don't have to produce just code. I'm mentioning this because probably <a href="http://www.cs.waikato.ac.nz/ml/weka/" rel="nofollow noreferrer">Weka</a> will be suggested.<br> - visualization would be useful. As the dataset is large enough so it doesn't fit in memory, the visualization should at least support zooming and panning. And to clarify: I don't need to build a visualization GUI, it's just a nice capability to use for checking the results produced with a tool. </p> <p>Thank you. Questions that I found useful are: <a href="https://stackoverflow.com/questions/2027252">How to find center of clusters of numbers? statistics problem?</a>, <a href="https://stackoverflow.com/questions/562904">Clustering Algorithm for Paper Boys</a>, <a href="https://stackoverflow.com/questions/2129269">Java Clustering Library</a>, <a href="https://stackoverflow.com/questions/691922">How to cluster objects (without coordinates)</a>, <a href="https://stackoverflow.com/questions/356035">Algorithm for detecting &quot;clusters&quot; of dots</a> </p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. COJust to be certain: is the clustering parameter the size of the neighborhood that contains the points (as opposed to the maximum distance between data points)? Can a data point be a member of more than one cluster? For example, with a cluster parameter of 3 minutes, if lightning struck the Empire State three times, with two minutes between strikes, what are the clusters?
      singulars
    2. COThe clustering parameter is the maximum distance between adjacent events. How many happen to be in a cluster depends; in fact the primary goal of this analysis is to identify which events are those that happened together (in the real world case, the strikes that were closer together than the others need to be analyzed further). A data point can only be part of 1 cluster (so fuzzy clustering is not applicable). With the above clarification that a cluster's parameter is the max distance between adjacent points, a value of 3 minutes puts all those strikes in 1 cluster.
      singulars
    3. COAfter more googling, I found out about (RapidMiner)[http://www.rapidminer.com] which has a number of clustering algorithms (for the curious, they are: k-Means, k-Means (Kernel), k-Medoids, DBSCAN, Expectation Maximization Clustering, Support Vector Clustering, Random Clustering, Agglomerative Clustering, Top Down Clustering, Flatten Clustering, Extract Cluster Prototypes) plus several others supported via the Weka plugin (Weka: W-CLOPE, W-Cobweb, W-EM, W-FarthestFirst, W-HierarchicalClusterer, W-SimpleKMeans, W-XMeans, W-sIB). I'm not familiar with most so need to grab a coffee and learn.
      singulars
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload