<p>The two questions in the OP are separate topics (i.e., no overlap in the answers), so I'll try to answer them one at a time, starting with item 1 on the list.</p> <blockquote> <p><em>How would I determine if my [clustering] algorithm works correctly?</em></p> </blockquote> <p>k-means, like other unsupervised ML techniques, lacks a good selection of diagnostic tests to answer questions like "are the cluster assignments returned by k-means more meaningful for k=3 or k=5?"</p> <p>Still, there is one widely accepted test that yields intuitive results and that is straightforward to apply. This diagnostic metric is just <em>this ratio</em>: </p> <p><strong><em>inter-centroidal separation</em></strong> / <strong><em>intra-cluster variance</em></strong></p> <p><em>As the value of this ratio increases, the quality of your clustering result increases.</em> </p> <p>This is intuitive. The first of these metrics is just how far apart each cluster is from the others (measured between cluster centers). </p> <p>But inter-centroidal separation alone doesn't tell the whole story, because two clustering algorithms could return results having the same inter-centroidal separation even though one is clearly better, because its clusters are "tighter" (i.e., have smaller radii); in other words, the cluster edges have more separation. The second metric--intra-cluster variance--accounts for this. It is just the mean variance, calculated per cluster. </p> <p>In sum, the <em>ratio of inter-centroidal separation to intra-cluster variance</em> is a quick, consistent, and reliable technique for comparing results from different clustering algorithms, or for comparing results from the same algorithm run under different parameters--e.g., number of iterations, choice of distance metric, number of centroids (value of k). 
</p> <p>The desired result is tight (small) clusters, each one far away from the others.</p> <p>The calculation is simple:</p> <p>For <em>inter-centroidal separation</em>:</p> <ul> <li><p>calculate the pair-wise distances between cluster centers; then</p></li> <li><p>calculate the median of those distances. <br/><br/></p></li> </ul> <p>For <em>intra-cluster variance</em>:</p> <ul> <li><p>for each cluster, calculate the distance of every data point in that cluster from its cluster center; next</p></li> <li><p>(for each cluster) calculate the variance of the sequence of distances from the step above; then</p></li> <li><p>average these variance values.</p></li> </ul> <hr> <p>That's my answer to the first question. Here's the second question:</p> <blockquote> <p><em>Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30?</em></p> </blockquote> <p>First, the easier question--is Euclidean distance a valid metric as the number of dimensions/features increases? </p> <p>Euclidean distance is perfectly scalable--it works for two dimensions or two thousand. For any pair of data points:</p> <ul> <li><p>subtract their feature vectors element-wise,</p></li> <li><p>square each item in the resulting vector,</p></li> <li><p>sum those squares,</p></li> <li><p>take the square root of that scalar.</p></li> </ul> <p>Nowhere in this sequence of calculations is scale implicated.</p> <p>But whether Euclidean distance is the appropriate similarity metric for your problem depends on your data. For instance, is it purely numeric (continuous)? Or does it include discrete (categorical) variables as well (e.g., gender: M/F)? If one of your dimensions is "current location" and, of 200 users, 100 have the value "San Francisco" and the other 100 have "Boston", you can't really say that, on average, your users are from somewhere in Kansas, but that's sort of what Euclidean distance would do.</p> <p>In any event, since we don't know anything about your data, I'll just give you a simple flow diagram so that you can apply it to your data and identify an appropriate similarity metric.</p> <p><strong>To identify an appropriate similarity metric given your data:</strong></p> <p><img src="https://i.stack.imgur.com/mLMvx.png" alt="flow diagram for selecting a similarity metric"></p>
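<p>The two metrics above are simple enough to compute directly. Here is a minimal pure-Python sketch (the function names are my own, not from any library) that follows the steps exactly as listed: Euclidean distance as element-wise difference, square, sum, square root; inter-centroidal separation as the median of pairwise center distances; intra-cluster variance as the average, across clusters, of the variance of point-to-center distances.</p>

```python
import math
from statistics import mean, median, pvariance

def euclidean(a, b):
    # subtract element-wise, square, sum, take the square root
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    # per-coordinate mean of all points in one cluster
    return [mean(coord) for coord in zip(*points)]

def clustering_quality(clusters):
    """clusters: a list of clusters, each a list of feature vectors.
    Returns inter-centroidal separation / intra-cluster variance;
    higher means tighter clusters that sit farther apart."""
    centers = [centroid(c) for c in clusters]
    # inter-centroidal separation: median of pairwise center distances
    pairwise = [euclidean(centers[i], centers[j])
                for i in range(len(centers))
                for j in range(i + 1, len(centers))]
    separation = median(pairwise)
    # intra-cluster variance: variance of point-to-center distances,
    # computed per cluster, then averaged across clusters
    variances = [pvariance([euclidean(p, ctr) for p in c])
                 for c, ctr in zip(clusters, centers)]
    return separation / mean(variances)
```

<p>Given two partitions of comparable data, the one with tight, well-separated clusters will score higher, so the function can be used as-is to compare runs with different values of k or different distance settings.</p>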