Note that there are some explanatory texts on larger screens.

plurals
  1. POk-means clustering implementation in python, running out of memory
    primarykey
    data
    text
    <h3> Note: updates/solutions at the bottom of this question</h3> <p>As part of a product recommendation engine, I'm trying to segment my users based on their product preferences starting with using the k-means clustering algorithm.</p> <p>My data is a dictionary of the form:</p> <pre><code>prefs = { 'user_id_1': { 1L: 3.0f, 2L: 1.0f, }, 'user_id_2': { 4L: 1.0f, 8L: 1.5f, }, } </code></pre> <p>where the product ids are the longs, and ratings are floats. the data is sparse. I currently have about 60,000 users, most of whom have only rated a handful of products. The dictionary of values for each user is implemented using a defaultdict(float) to simplify the code.</p> <p>My implementation of k-means clustering is as follows:</p> <pre><code>def kcluster(prefs,sim_func=pearson,k=100,max_iterations=100): from collections import defaultdict users = prefs.keys() centroids = [prefs[random.choice(users)] for i in range(k)] lastmatches = None for t in range(max_iterations): print 'Iteration %d' % t bestmatches = [[] for i in range(k)] # Find which centroid is closest for each row for j in users: row = prefs[j] bestmatch=(0,0) for i in range(k): d = simple_pearson(row,centroids[i]) if d &lt; bestmatch[1]: bestmatch = (i,d) bestmatches[bestmatch[0]].append(j) # If the results are the same as last time, this is complete if bestmatches == lastmatches: break lastmatches=bestmatches centroids = [defaultdict(float) for i in range(k)] # Move the centroids to the average of their members for i in range(k): len_best = len(bestmatches[i]) if len_best &gt; 0: items = set.union(*[set(prefs[u].keys()) for u in bestmatches[i]]) for user_id in bestmatches[i]: row = prefs[user_id] for m in items: if row[m] &gt; 0.0: centroids[i][m]+=(row[m]/len_best) return bestmatches </code></pre> <p>As far as I can tell, the algorithm is handling the first part (assigning each user to its nearest centroid) fine.</p> <p>The problem is when doing the next part, taking the average rating for each product in each cluster and using these average ratings as the centroids for the next pass.</p> <p>Basically, before it's even managed to do the calculations for the first cluster (i=0), the algorithm bombs out with a MemoryError at this line:</p> <pre><code>if row[m] &gt; 0.0: centroids[i][m]+=(row[m]/len_best) </code></pre> <p>Originally the division step was in a seperate loop, but fared no better.</p> <p>This is the exception I get:</p> <pre><code>malloc: *** mmap(size=16777216) failed (error code=12) *** error: can't allocate region *** set a breakpoint in malloc_error_break to debug </code></pre> <p>Any help would be greatly appreciated.</p> <hr> <h3>Update: Final algorithms</h3> <p>Thanks to the help recieved here, this is my fixed algorithm. If you spot anything blatantly wrong please add a comment.</p> <p>First, the simple_pearson implementation</p> <pre><code>def simple_pearson(v1,v2): si = [val for val in v1 if val in v2] n = len(si) if n==0: return 0.0 sum1 = 0.0 sum2 = 0.0 sum1_sq = 0.0 sum2_sq = 0.0 p_sum = 0.0 for v in si: sum1+=v1[v] sum2+=v2[v] sum1_sq+=pow(v1[v],2) sum2_sq+=pow(v2[v],2) p_sum+=(v1[v]*v2[v]) # Calculate Pearson score num = p_sum-(sum1*sum2/n) temp = (sum1_sq-pow(sum1,2)/n) * (sum2_sq-pow(sum2,2)/n) if temp &lt; 0.0: temp = -temp den = sqrt(temp) if den==0: return 1.0 r = num/den return r </code></pre> <p>A simple method to turn simple_pearson into a distance value:</p> <pre><code>def distance(v1,v2): return 1.0-simple_pearson(v1,v2) </code></pre> <p>And finally, the k-means clustering implementation:</p> <pre><code>def kcluster(prefs,k=21,max_iterations=50): from collections import defaultdict users = prefs.keys() centroids = [prefs[u] for u in random.sample(users, k)] lastmatches = None for t in range(max_iterations): print 'Iteration %d' % t bestmatches = [[] for i in range(k)] # Find which centroid is closest for each row for j in users: row = prefs[j] bestmatch=(0,2.0) for i in range(k): d = distance(row,centroids[i]) if d &lt;= bestmatch[1]: bestmatch = (i,d) bestmatches[bestmatch[0]].append(j) # If the results are the same as last time, this is complete if bestmatches == lastmatches: break lastmatches=bestmatches centroids = [defaultdict(float) for i in range(k)] # Move the centroids to the average of their members for i in range(k): len_best = len(bestmatches[i]) if len_best &gt; 0: for user_id in bestmatches[i]: row = prefs[user_id] for m in row: centroids[i][m]+=row[m] for key in centroids[i].keys(): centroids[i][key]/=len_best # We may have made the centroids quite dense which significantly # slows down subsequent iterations, so we delete values below a # threshold to speed things up if centroids[i][key] &lt; 0.001: del centroids[i][key] return centroids, bestmatches </code></pre>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload