StackOverflow2013


plurals

PODoubts about clustering methods for tweets
primarykey
Id
19054062
data
AcceptedAnswerId
19112934
AnswerCount
2
ClosedDate
CommentCount
0
CommunityOwnedDate
CreationDate
2013-09-27T15:04:09.263
FavoriteCount
3
LastActivityDate
2015-09-16T17:04:52.687
LastEditDate
2015-09-16T17:04:52.687
LastEditorUserId
1060350
OwnerUserId
309926
ParentId
0
PostTypeId
1
Score
2
ViewCount
1835
LastEditorDisplayName
text
Body
I'm fairly new to clustering and related topics, so please forgive my questions. I'm trying to get into this area by running some tests, and as a first experiment I'd like to cluster tweets based on content similarity. The basic idea is to store tweets in a database and periodically recompute the clustering (e.g. with a cron job). Please note that the database would receive new tweets from time to time. Being ignorant in this field, my (probably naive) idea would be to do something like this:
<pre><code>1. For each new tweet in the db, extract N-grams (N=3 for example) into a set
2. Compute the Jaccard similarity between the tweet and each of the existing clusters. If the result > threshold, assign the tweet to that cluster
3. Once finished I'd get M clusters containing similar tweets
</code></pre>
Now I see some problems with this basic approach. Putting computational cost aside, how would the comparison between a tweet and a cluster be done? Assuming I have a tweet Tn and a cluster C1 containing T1, T4 and T10, which one should I compare it to? Since we're talking about similarity, it could well happen that sim(Tn,T1) > threshold but sim(Tn,T4) < threshold. My gut feeling tells me that something like an average over the cluster should be used to avoid this problem.
Also, it could happen that sim(Tn, C1) and sim(Tn, C2) are both > threshold, but the similarity with C1 is higher. In that case Tn should go to C1. This could be done by brute force as well, assigning the tweet to the cluster with maximum similarity.
Last of all, there is the computational issue. I've been reading a bit about minhash and it seems to be the answer to this problem, although I need to do some more research on it.
Anyway, my main question is: could someone with experience in the area recommend which approach I should aim for? I've read some mentions of LSA and other methods, but trying to cope with everything is getting a bit overwhelming, so I'd appreciate some guidance. From what I'm reading, a suitable tool for this would be hierarchical clustering, as it would allow regrouping of clusters whenever new data enters. Is this correct?
Please note that I'm not looking for any complicated case. My use case would be to cluster similar tweets into groups without any previous information. For example, tweets from Foursquare ("I'm checking in ...") that are similar to each other would be one case, and "My klout score is ..." tweets would be another. Also note that I'd like this to be language independent, so I'm not interested in having to deal with language-specific issues.
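For concreteness, the naive approach described in this question could be sketched in Python roughly as follows. This is only an illustration under assumptions the question does not fix: the N-grams are taken to be character trigrams (one way to stay language independent), the similarity between a tweet and a cluster is the average Jaccard similarity to the cluster's members (the "gut feeling" above), and the tweet goes to the best-scoring cluster if that score exceeds the threshold. The function names and the threshold of 0.3 are hypothetical, and the brute-force comparison makes no attempt to address the computational cost.

```python
from typing import List, Set, Tuple


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a lowercased, whitespace-normalised text."""
    normalised = " ".join(text.lower().split())
    if len(normalised) < n:
        # Too short for a full n-gram: fall back to the whole string (or nothing).
        return {normalised} if normalised else set()
    return {normalised[i:i + n] for i in range(len(normalised) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_similarity(grams: Set[str], cluster: List[Tuple[str, Set[str]]]) -> float:
    """Average Jaccard similarity between one tweet and every member of a cluster."""
    return sum(jaccard(grams, member_grams) for _, member_grams in cluster) / len(cluster)


def cluster_tweets(tweets: List[str], threshold: float = 0.3) -> List[List[str]]:
    """Greedy single-pass clustering: each tweet joins its most similar cluster
    if that similarity exceeds the threshold, otherwise it starts a new cluster."""
    clusters: List[List[Tuple[str, Set[str]]]] = []
    for tweet in tweets:
        grams = char_ngrams(tweet)
        best_index, best_score = -1, 0.0
        for index, cluster in enumerate(clusters):
            score = cluster_similarity(grams, cluster)
            if score > best_score:
                best_index, best_score = index, score
        if best_index >= 0 and best_score > threshold:
            clusters[best_index].append((tweet, grams))
        else:
            clusters.append([(tweet, grams)])
    # Return only the tweet texts, grouped by cluster.
    return [[text for text, _ in cluster] for cluster in clusters]


if __name__ == "__main__":
    sample = [
        "I'm checking in at Starbucks",
        "My klout score is 55",
        "I'm checking in at Central Park",
        "My klout score is 61",
    ]
    for group in cluster_tweets(sample):
        print(group)
```

With these four sample tweets, the two check-in messages and the two Klout messages end up in separate groups. Note that the result of a greedy pass like this depends on the order in which tweets arrive, which is one reason the regrouping offered by hierarchical methods is attractive.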
Tags
<cluster-analysis><data-mining><hierarchical-clustering>
Title
Doubts about clustering methods for tweets
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USAnony-Mousse
UserOwnerUserId
1. USDan
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PODoubts about clustering methods for tweets
 UserUserId
 USDan
 VoteTypeVoteTypeId
 VTBountyStart
2. VO
 singulars
 PostPostId
 PODoubts about clustering methods for tweets
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PODoubts about clustering methods for tweets
 UserUserId
 USBytemain
 VoteTypeVoteTypeId
 VTFavorite
CommentsPostId
1. This table or related slice is empty.
