StackOverflow2013


plurals

PO
primarykey
Id
1234620
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
2
CommunityOwnedDate
CreationDate
2009-08-05T17:23:39.987
FavoriteCount
0
LastActivityDate
2009-08-05T17:23:39.987
LastEditDate
LastEditorUserId
0
OwnerUserId
95810
ParentId
1233593
PostTypeId
2
Score
6
ViewCount
0
LastEditorDisplayName
text
Body
Not all these observations are directly relevant to your issues as expressed, but...:
a. Why are the keys in <code>prefs</code>, as shown, longs? Unless you have billions of users, simple ints will be fine and save you a little memory.
b. Your code: <pre><code>centroids = [prefs[random.choice(users)] for i in range(k)]</code></pre> can give you repeats (two identical centroids), which in turn would not make the K-means algorithm happy. Just use the faster and more solid <pre><code>centroids = [prefs[u] for u in random.sample(users, k)]</code></pre>
c. In your code as posted you're calling a function <code>simple_pearson</code> which you never define anywhere; I assume you mean to call <code>sim_func</code>, but it's really hard to help on different issues while at the same time having to guess how the code you posted differs from any code that might actually be working.
d. One more indication that this posted code may be different from anything that might actually work: you set <code>bestmatch=(0,0)</code> but then test with <code>if d < bestmatch[1]:</code> -- how is the test ever going to succeed? Is the distance function returning negative values?
e. The point of a defaultdict is that just accessing <code>row[m]</code> magically adds an item to <code>row</code> at index <code>m</code> (with the value obtained by calling the defaultdict's factory, here 0.0). That item will then take up memory forevermore. You absolutely DON'T need this behavior, and therefore your code: <pre><code>row = prefs[user_id]
for m in items:
    if row[m] > 0.0:
        centroids[i][m] += row[m] / len_best</code></pre> is wasting a huge amount of memory, turning <code>prefs</code> into a dense matrix (mostly full of 0.0 values) from the sparse one it used to be.
If you code instead <pre><code>row = prefs[user_id]
for m in row:
    centroids[i][m] += row[m] / len_best</code></pre> there will be no growth in <code>row</code>, and therefore none in <code>prefs</code>, because you're looping over the keys that <code>row</code> already has. There may be many other such issues, major like the last one or minor. As an example of the latter:
f. Don't divide a bazillion times by <code>len_best</code>: compute its inverse once outside the loop and multiply by that inverse. Also, you don't need to do that multiplication inside the loop; you can do it at the end in a separate pass, since it's the same value that's multiplying every item. This saves no memory but avoids wantonly wasting CPU time ;-). OK, these are two minor issues, I guess, not just one ;-).
As I mentioned there may be many others, but with the density of issues already shown by these six (or seven), plus the separate suggestion already advanced by S.Lott (which I think would not fix your main out-of-memory problem, since his code is still addressing the <code>row</code> defaultdict by too many keys it doesn't contain), I think it wouldn't be very productive to keep looking for even more. Maybe start by fixing these and, if problems persist, post a separate question about those...?
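A minimal sketch of points b, e and f from the answer, run on made-up toy data. The contents of `prefs`, the `items` list and the `average_rows` helper are assumptions for illustration only; they are not the asker's code, and the real clustering loop is not reproduced here.

```python
import random
from collections import defaultdict

# Hypothetical toy data: user id -> sparse ratings (item id -> score).
prefs = {
    1: defaultdict(float, {10: 4.0, 11: 2.0}),
    2: defaultdict(float, {11: 5.0}),
    3: defaultdict(float, {12: 3.0, 13: 1.0}),
}
items = [10, 11, 12, 13]
users = list(prefs)

# Point b: random.sample draws k distinct users, so no two centroids
# start out as the same row.
seed_users = random.sample(users, 2)
centroids = [prefs[u] for u in seed_users]

# Point e: merely reading a missing key from a defaultdict stores it.
row = prefs[2]
size_before = len(row)
for m in items:
    if row[m] > 0.0:  # every miss inserts a 0.0 entry into row
        pass
size_after_probe = len(row)

# Iterating over the row's own keys touches only existing entries.
row = prefs[1]
size_before_iter = len(row)
total = 0.0
for m in row:
    total += row[m]
size_after_iter = len(row)


# Point f: accumulate raw sums, then scale once by a precomputed inverse.
def average_rows(rows):
    sums = defaultdict(float)
    for r in rows:
        for m, value in r.items():
            sums[m] += value
    inv = 1.0 / len(rows)  # computed once, outside every loop
    return {m: s * inv for m, s in sums.items()}


centroid = average_rows([prefs[1], prefs[3]])
print(size_before, size_after_probe, size_before_iter, size_after_iter)
print(centroid)
```

The probing loop over `items` is what turns a sparse row into a dense one; looping over the row itself, or over `r.items()`, leaves its size unchanged.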
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POk-means clustering implementation in python, running out of memory
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. This table or related slice is empty.
UserOwnerUserId
1. USAlex Martelli
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. POk-means clustering implementation in python, running out of memory
 singulars
 PostTypePostTypeId
 PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTAcceptedByOriginator
CommentsPostId
1. COThanks for your input, I'll update the original question with a link to simple_pearson (which I've pasted elsewhere to avoid clutter here). The sim_func in the method definition is a remnant of older code.
 singulars
 PostPostId
 PO
 UserUserId
 USAndrew Ingram
2. COThese hints seem to have done the trick, it's managing to reach the second iteration and beyond. Iterations past the first one are very slow though (about 10 minutes each), I'll let it run overnight and see what happens. Once I've got my centroids I don't imagine I'll have to recalculate them very often anyway.
 singulars
 PostPostId
 PO
 UserUserId
 USAndrew Ingram
