StackOverflow2013

plurals

POEffective clustering of a similarity matrix
primarykey
Id
10086551
data
AcceptedAnswerId
0
AnswerCount
3
ClosedDate
CommentCount
1
CommunityOwnedDate
CreationDate
2012-04-10T09:53:53.897
FavoriteCount
4
LastActivityDate
2013-12-29T06:40:43.223
LastEditDate
2017-05-23T10:33:55.573
LastEditorUserId
-1
OwnerUserId
519270
ParentId
0
PostTypeId
1
Score
5
ViewCount
8575
LastEditorDisplayName
text
Body
My topic is similarity and clustering of a collection of texts. In a nutshell: I want to cluster collected texts so that they end up in meaningful clusters. My approach so far is described below; my problem is with the clustering step. The current software is written in PHP.
1) Similarity: I treat every document as a "bag of words" and convert words into vectors. I use
<ul>
<li>filtering (only "real" words)</li>
<li>tokenization (split sentences into words)</li>
<li>stemming (reduce words to their base form; Porter's stemmer)</li>
<li>pruning (cut off words with too high or too low frequency)</li>
</ul>
as methods for dimensionality reduction. After that, I use cosine similarity (as suggested / described on various sites on the web <a href="https://stackoverflow.com/a/1290286/519270">and here</a>). The result is a similarity matrix like this:
<pre><code>
   A   B   C   D   E
A  0  30  51  75  80
B  X   0  21  55  70
C  X   X   0  25  10
D  X   X   X   0  15
E  X   X   X   X   0
</code></pre>
A…E are my texts and the number is the similarity in percent; the higher the number, the more similar the texts are. Because sim(A,B) == sim(B,A), only half of the matrix is filled in. So the similarity of text A to text D is 75%.
I now want to generate an a priori unknown(!) number of clusters from this matrix. Each cluster should group similar items together (up to a certain stopping criterion). I tried a basic implementation myself, which was essentially this (with 60% as a fixed similarity threshold):
<pre><code>
foreach article
    get similar entries where sim > 60
    foreach similar entry
        check if one of the entries already has a cluster number
        if no: assign new cluster number to all similar entries
        if yes: use that number
</code></pre>
It worked (somewhat), but the results weren't good at all and often consisted of monster clusters. So I want to redo this and have already looked into all kinds of clustering algorithms, but I'm still not sure which one will work best.
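To illustrate why a fixed threshold tends to produce monster clusters, here is a minimal sketch in Python (chosen for illustration only; the actual software is in PHP). It assumes one reading of the pseudocode above, namely that any two texts linked directly or indirectly by a similarity above the threshold end up with the same cluster number. The names `LABELS`, `SIM` and `threshold_clusters` are made up for this example, and the values are taken from the example matrix.

```python
# Upper triangle of the example similarity matrix (percent).
LABELS = ["A", "B", "C", "D", "E"]
SIM = {
    ("A", "B"): 30, ("A", "C"): 51, ("A", "D"): 75, ("A", "E"): 80,
    ("B", "C"): 21, ("B", "D"): 55, ("B", "E"): 70,
    ("C", "D"): 25, ("C", "E"): 10,
    ("D", "E"): 15,
}


def threshold_clusters(labels, sim, threshold):
    """Group items that are linked, directly or indirectly, by sim > threshold.

    Assumption: this is the connected-components reading of the pseudocode,
    implemented with a small union-find structure.
    """
    parent = {x: x for x in labels}

    def find(x):
        # Follow parent pointers to the representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), s in sim.items():
        if s > threshold:
            # One link above the threshold is enough to merge two groups.
            parent[find(a)] = find(b)

    groups = {}
    for x in labels:
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())


print(threshold_clusters(LABELS, SIM, 60))  # [['A', 'B', 'D', 'E'], ['C']]
```

With the 60% threshold, B, D and E land in the same cluster as A, even though sim(D,E) is only 15 and sim(A,B) only 30: one strong link per text is enough to chain them together.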
I think it should be an agglomerative algorithm, because every text can be seen as its own cluster in the beginning. But the open questions are what the stopping criterion should be and whether the algorithm should split and/or merge existing clusters. Sorry if some of this seems basic, but I am relatively new to this field. Thanks for the help.
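One possible answer to the stopping question, sketched here in Python for illustration (not the PHP of the actual software): start with every text as its own cluster, repeatedly merge the two clusters with the highest average pairwise similarity, and stop once the best available merge falls below a minimum similarity. Average linkage and the parameter `min_similarity` are assumptions made for this sketch, not something settled above; the values come from the example matrix.

```python
from itertools import combinations

# Upper triangle of the example similarity matrix (percent).
LABELS = ["A", "B", "C", "D", "E"]
SIM = {
    ("A", "B"): 30, ("A", "C"): 51, ("A", "D"): 75, ("A", "E"): 80,
    ("B", "C"): 21, ("B", "D"): 55, ("B", "E"): 70,
    ("C", "D"): 25, ("C", "E"): 10,
    ("D", "E"): 15,
}


def pair_sim(sim, a, b):
    """Look up a symmetric similarity stored only once per pair."""
    return sim[(a, b)] if (a, b) in sim else sim[(b, a)]


def average_linkage(sim, c1, c2):
    """Mean similarity over all cross-cluster pairs."""
    total = sum(pair_sim(sim, a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerate(labels, sim, min_similarity):
    """Merge the most similar clusters until no merge reaches min_similarity."""
    clusters = [[x] for x in labels]  # every text starts as its own cluster
    while len(clusters) > 1:
        best_score, (i, j) = max(
            (average_linkage(sim, clusters[i], clusters[j]), (i, j))
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best_score < min_similarity:
            break  # stopping criterion: the best merge is not similar enough
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


print(agglomerate(LABELS, SIM, 60))  # [['A', 'E'], ['B'], ['C'], ['D']]
print(agglomerate(LABELS, SIM, 50))  # [['A', 'E'], ['B', 'D'], ['C']]
```

The number of clusters is not fixed in advance; it follows from the threshold. Because a merge needs a high average similarity across all member pairs, a single strong link cannot pull dissimilar texts into one large cluster.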
Tags
<matrix><machine-learning><cluster-analysis><distance><similarity>
Title
Effective clustering of a similarity matrix
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USCommunity
UserOwnerUserId
1. USMartin
plurals
PostLinksPostIdRelatedPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostLinksRelatedPostIdPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
3. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POEffective clustering of a similarity matrix
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POEffective clustering of a similarity matrix
 UserUserId
 USuser1149913
 VoteTypeVoteTypeId
 VTFavorite
3. VO
 singulars
 PostPostId
 POEffective clustering of a similarity matrix
 UserUserId
 USBhavik Maneck
 VoteTypeVoteTypeId
 VTFavorite
CommentsPostId
1. CODid you get any good answers? It's not even clear to me how many dimensions the clustering should work in...
 singulars
 PostPostId
 POEffective clustering of a similarity matrix
 UserUserId
 USJim
