StackOverflow2013


plurals

POWhich data clustering algorithm is appropriate to detect an unknown number of clusters in a time series of events?
primarykey
Id
2301290
data
AcceptedAnswerId
2308501
AnswerCount
3
ClosedDate
CommentCount
3
CommunityOwnedDate
CreationDate
2010-02-20T06:19:19.037
FavoriteCount
6
LastActivityDate
2015-11-08T04:50:23.097
LastEditDate
2017-05-23T11:53:23.150
LastEditorUserId
-1
OwnerUserId
277434
ParentId
0
PostTypeId
1
Score
11
ViewCount
3374
LastEditorDisplayName
text
Body
Here's my scenario. Consider a set of events that happen at various places and times. As an example, consider someone high above recording the lightning strikes in a city during a storm. For my purposes, lightning strikes are instantaneous and can only hit certain locations (such as tall buildings). Also imagine that each lightning strike has a unique id so one can reference the strike later. There are about 100,000 such locations in this city (as you may guess, this is an analogy, as my current employer is sensitive about the actual problem).

For phase 1, my input is the set of (strike id, strike time, strike location) tuples. The desired output is the set of clusters of more than one event that hit the same location within a short time. The number of clusters is not known in advance (so k-means is not that useful here). What counts as 'short' can be predefined for a given clustering attempt. That is, I can set it to, say, 3 minutes, then run the algorithm, and later try again with 4 minutes or 10 minutes. Perhaps a nice touch would be for the algorithm to determine a 'strength' of clustering and recommend that, for a given input, the most compact clustering is achieved by using a particular value for 'short', but this is not required initially.

For phase 2, I'd like to take into consideration the amplitude of the strike (i.e., a real number) and look for clusters that are both within a short time and of similar amplitude.

I googled and checked the answers here about data clustering. The information is a bit bewildering (below is the list of links I found useful). AFAIK, k-means and related algorithms would not be useful because they require the number of clusters to be specified a priori. I'm not asking for someone to solve my problem (I like solving it), but some orientation in the large world of data clustering algorithms would be useful in order to save some time. Specifically, which clustering algorithms are appropriate when the number of clusters is unknown?
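To make phase 1 concrete, here is a minimal sketch of one possible reading of it, not a settled design. It assumes that "within a short time" means the gap between consecutive strikes at the same location is at most a chosen threshold, so a cluster is a chain of such strikes. The names `GapClustering`, `Strike`, `cluster` and `maxGapMillis` are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GapClustering {

    /** One input tuple: (strike id, strike time in milliseconds, location id). */
    static final class Strike {
        final long id;
        final long timeMillis;
        final int location;

        Strike(long id, long timeMillis, int location) {
            this.id = id;
            this.timeMillis = timeMillis;
            this.location = location;
        }
    }

    /**
     * Returns every cluster of more than one strike at the same location.
     * Assumption: two consecutive strikes belong to the same cluster when
     * they are at most maxGapMillis apart.
     */
    static List<List<Strike>> cluster(List<Strike> strikes, long maxGapMillis) {
        // Split the input into one time series per location.
        Map<Integer, List<Strike>> byLocation = new TreeMap<>();
        for (Strike s : strikes) {
            byLocation.computeIfAbsent(s.location, k -> new ArrayList<>()).add(s);
        }

        List<List<Strike>> clusters = new ArrayList<>();
        for (List<Strike> series : byLocation.values()) {
            // Order the strikes of this location by time.
            series.sort((a, b) -> Long.compare(a.timeMillis, b.timeMillis));

            List<Strike> current = new ArrayList<>();
            for (Strike s : series) {
                boolean gapTooLarge = !current.isEmpty()
                        && s.timeMillis - current.get(current.size() - 1).timeMillis > maxGapMillis;
                if (gapTooLarge) {
                    // Close the running cluster; keep it only if it has more than one event.
                    if (current.size() > 1) {
                        clusters.add(current);
                    }
                    current = new ArrayList<>();
                }
                current.add(s);
            }
            if (current.size() > 1) {
                clusters.add(current);
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<Strike> input = new ArrayList<>();
        input.add(new Strike(1, 0L, 7));
        input.add(new Strike(2, 60_000L, 7));
        input.add(new Strike(3, 600_000L, 7));
        // Try 'short' = 3 minutes; rerun with other values to compare.
        System.out.println(cluster(input, 180_000L).size() + " cluster(s) found");
    }
}
```

Under this reading the number of clusters falls out of the threshold, so it never has to be specified up front, and trying a different value for 'short' is just another call with a different `maxGapMillis`.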
Edit: I realized the location is irrelevant, in the sense that although events happen all the time, I only need to cluster them per location. So each location has its own time series of events, which can thus be analyzed independently.

Some technical details:
- as the dataset is not that large, it can all fit in memory.
- parallel processing is nice to have, but not essential. I only have a 4-core machine, and MapReduce and Hadoop would be too much.
- the language I'm most familiar with is Java. I haven't used R yet, and the learning curve for it would probably be too steep for the time I was given. I'll have a look at it anyway in my spare time.
- for the time being, using tools to run the analysis is OK; I don't have to produce only code. I'm mentioning this because <a href="http://www.cs.waikato.ac.nz/ml/weka/" rel="nofollow noreferrer">Weka</a> will probably be suggested.
- visualization would be useful. As the dataset is fairly large, the visualization should at least support zooming and panning. And to clarify: I don't need to build a visualization GUI; it's just a nice capability to have for checking the results produced with a tool.

Thank you.

Questions that I found useful are:
- <a href="https://stackoverflow.com/questions/2027252">How to find center of clusters of numbers? statistics problem?</a>
- <a href="https://stackoverflow.com/questions/562904">Clustering Algorithm for Paper Boys</a>
- <a href="https://stackoverflow.com/questions/2129269">Java Clustering Library</a>
- <a href="https://stackoverflow.com/questions/691922">How to cluster objects (without coordinates)</a>
- <a href="https://stackoverflow.com/questions/356035">Algorithm for detecting "clusters" of dots</a>
Tags
<algorithm><language-agnostic><cluster-analysis>
Title
Which data clustering algorithm is appropriate to detect an unknown number of clusters in a time series of events?
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USCommunity
UserOwnerUserId
1. USwishihadabettername
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POWhich data clustering algorithm is appropriate to detect an unknown number of clusters in a time series of events?
 UserUserId
 USJimmy
 VoteTypeVoteTypeId
 VTFavorite
2. VO
 singulars
 PostPostId
 POWhich data clustering algorithm is appropriate to detect an unknown number of clusters in a time series of events?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 POWhich data clustering algorithm is appropriate to detect an unknown number of clusters in a time series of events?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
