StackOverflow2013

plurals

POk-means clustering implementation in python, running out of memory
primarykey
Id
1233593
data
AcceptedAnswerId
1234620
AnswerCount
2
ClosedDate
CommentCount
1
CommunityOwnedDate
CreationDate
2009-08-05T14:24:04.627
FavoriteCount
6
LastActivityDate
2014-03-09T00:50:06.013
LastEditDate
2014-03-09T00:50:06.013
LastEditorUserId
321731
OwnerUserId
15687
ParentId
0
PostTypeId
1
Score
6
ViewCount
3596
LastEditorDisplayName
text
Body
<h3>Note: updates/solutions at the bottom of this question</h3>
As part of a product recommendation engine, I'm trying to segment my users based on their product preferences, starting with the k-means clustering algorithm.

My data is a dictionary of the form:
<pre><code>prefs = {
    'user_id_1': { 1L: 3.0f, 2L: 1.0f, },
    'user_id_2': { 4L: 1.0f, 8L: 1.5f, },
}
</code></pre>
where the product ids are the longs and the ratings are floats. The data is sparse. I currently have about 60,000 users, most of whom have only rated a handful of products. The dictionary of values for each user is implemented using a defaultdict(float) to simplify the code.

My implementation of k-means clustering is as follows:
<pre><code>def kcluster(prefs,sim_func=pearson,k=100,max_iterations=100):
    from collections import defaultdict

    users = prefs.keys()
    centroids = [prefs[random.choice(users)] for i in range(k)]

    lastmatches = None
    for t in range(max_iterations):
        print 'Iteration %d' % t
        bestmatches = [[] for i in range(k)]

        # Find which centroid is closest for each row
        for j in users:
            row = prefs[j]
            bestmatch=(0,0)

            for i in range(k):
                d = simple_pearson(row,centroids[i])
                if d < bestmatch[1]: bestmatch = (i,d)
            bestmatches[bestmatch[0]].append(j)

        # If the results are the same as last time, this is complete
        if bestmatches == lastmatches: break
        lastmatches=bestmatches

        centroids = [defaultdict(float) for i in range(k)]

        # Move the centroids to the average of their members
        for i in range(k):
            len_best = len(bestmatches[i])

            if len_best > 0:
                items = set.union(*[set(prefs[u].keys()) for u in bestmatches[i]])

                for user_id in bestmatches[i]:
                    row = prefs[user_id]
                    for m in items:
                        if row[m] > 0.0: centroids[i][m]+=(row[m]/len_best)
    return bestmatches
</code></pre>
As far as I can tell, the algorithm is handling the first part (assigning each user to its nearest centroid) fine.
The problem is in the next part: taking the average rating for each product in each cluster and using these average ratings as the centroids for the next pass. Before it has even finished the calculations for the first cluster (i=0), the algorithm bombs out with a MemoryError at this line:
<pre><code>if row[m] > 0.0: centroids[i][m]+=(row[m]/len_best)
</code></pre>
Originally the division step was in a separate loop, but that fared no better.

This is the exception I get:
<pre><code>malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
</code></pre>
Any help would be greatly appreciated.
<hr>
<h3>Update: Final algorithms</h3>
Thanks to the help received here, this is my fixed algorithm. If you spot anything blatantly wrong, please add a comment.

First, the simple_pearson implementation:
<pre><code>from math import sqrt  # needed for the denominator below

def simple_pearson(v1,v2):

    # Only products rated in both vectors contribute to the score
    si = [val for val in v1 if val in v2]

    n = len(si)

    if n==0: return 0.0

    sum1 = 0.0
    sum2 = 0.0
    sum1_sq = 0.0
    sum2_sq = 0.0
    p_sum = 0.0

    for v in si:
        sum1+=v1[v]
        sum2+=v2[v]
        sum1_sq+=pow(v1[v],2)
        sum2_sq+=pow(v2[v],2)
        p_sum+=(v1[v]*v2[v])

    # Calculate Pearson score
    num = p_sum-(sum1*sum2/n)
    temp = (sum1_sq-pow(sum1,2)/n) * (sum2_sq-pow(sum2,2)/n)
    if temp < 0.0:
        temp = -temp
    den = sqrt(temp)
    if den==0: return 1.0

    r = num/den

    return r
</code></pre>
A simple method to turn simple_pearson into a distance value:
<pre><code>def distance(v1,v2):
    return 1.0-simple_pearson(v1,v2)
</code></pre>
And finally, the k-means clustering implementation:
<pre><code>import random  # needed for random.sample below

def kcluster(prefs,k=21,max_iterations=50):
    from collections import defaultdict

    users = prefs.keys()
    centroids = [prefs[u] for u in random.sample(users, k)]

    lastmatches = None
    for t in range(max_iterations):
        print 'Iteration %d' % t
        bestmatches = [[] for i in range(k)]

        # Find which centroid is closest for each row
        for j in users:
            row = prefs[j]
            bestmatch=(0,2.0)

            for i in range(k):
                d = distance(row,centroids[i])
                if d <= bestmatch[1]: bestmatch = (i,d)
            bestmatches[bestmatch[0]].append(j)

        # If the results are the same as last time, this is complete
        if bestmatches == lastmatches: break
        lastmatches=bestmatches

        centroids = [defaultdict(float) for i in range(k)]

        # Move the centroids to the average of their members
        for i in range(k):
            len_best = len(bestmatches[i])

            if len_best > 0:
                for user_id in bestmatches[i]:
                    row = prefs[user_id]
                    for m in row:
                        centroids[i][m]+=row[m]
            for key in centroids[i].keys():
                centroids[i][key]/=len_best
                # We may have made the centroids quite dense which significantly
                # slows down subsequent iterations, so we delete values below a
                # threshold to speed things up
                if centroids[i][key] < 0.001:
                    del centroids[i][key]
    return centroids, bestmatches
</code></pre>
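A likely reason the first version ran out of memory, sketched here as an assumption rather than a confirmed diagnosis: indexing a defaultdict with a missing key stores that key with the default value, so testing row[m] for every product in the cluster's union makes each user's row as dense as the union. The snippet below uses hypothetical toy data and helper names (make_prefs, average_by_indexing, average_by_iterating, count_entries) and is written in Python 3 syntax, unlike the Python 2 code in the post. It compares the original indexing loop with iterating only over the keys each row already has, as the final algorithm does.

```python
from collections import defaultdict

def count_entries(prefs):
    """Total number of stored (product, rating) pairs across all users."""
    return sum(len(row) for row in prefs.values())

def make_prefs():
    # Hypothetical toy data: two users who rated disjoint products.
    return {
        'user_id_1': defaultdict(float, {1: 3.0, 2: 1.0}),
        'user_id_2': defaultdict(float, {4: 1.0, 8: 1.5}),
    }

def average_by_indexing(prefs, members):
    """Mirrors the original loop: every row is indexed with every product in the union."""
    centroid = defaultdict(float)
    items = set.union(*[set(prefs[u].keys()) for u in members])
    for u in members:
        row = prefs[u]
        for m in items:
            # row[m] silently inserts m with value 0.0 when it is missing
            if row[m] > 0.0:
                centroid[m] += row[m] / len(members)
    return centroid

def average_by_iterating(prefs, members):
    """Mirrors the fixed loop: only keys already present in each row are visited."""
    centroid = defaultdict(float)
    for u in members:
        for m, rating in prefs[u].items():
            centroid[m] += rating / len(members)
    return centroid

members = ['user_id_1', 'user_id_2']
dense = make_prefs()
sparse = make_prefs()
c1 = average_by_indexing(dense, members)
c2 = average_by_iterating(sparse, members)

print(count_entries(dense))   # rows grew to the size of the union
print(count_entries(sparse))  # rows unchanged
```

Both helpers produce the same centroid on this data; only the first one mutates the input rows. With around 60,000 users and a large product catalogue, that growth could plausibly account for the allocation failure, though the post itself does not state the cause.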
Tags
<python>
Title
k-means clustering implementation in python, running out of memory
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USTshepang
UserOwnerUserId
1. USAndrew Ingram
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POk-means clustering implementation in python, running out of memory
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POk-means clustering implementation in python, running out of memory
 UserUserId
 USNelson
 VoteTypeVoteTypeId
 VTFavorite
3. VO
 singulars
 PostPostId
 POk-means clustering implementation in python, running out of memory
 UserUserId
 USAnthony Kong
 VoteTypeVoteTypeId
 VTFavorite
CommentsPostId
1. COI am doing this on a Windows Vista laptop with 4GB and the memory use appears to be about 100MB using 100k users, so I am not getting the problem you describe. However, I do get this: >>> print kcluster(prefs,k=100,max_iterations=100) Iteration 0 Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<stdin>", line 38, in kcluster KeyError: 3 So maybe there is something wrong with your algorithm. Or the indentation: I could not cut and paste your code from SO without reformatting. Might have got that wrong.
 singulars
 PostPostId
 POk-means clustering implementation in python, running out of memory
 UserUserId
 UShughdbrown
