StackOverflow2013

Subset of a matrix multiplication, fast, and sparse
<p>While converting collaborative filtering code to use sparse matrices, I'm puzzling over the following problem: given two full matrices X (m by l) and Theta (n by l), and a sparse matrix R (m by n), is there a fast way to calculate the sparse inner product, i.e. the entries of <code>X*Theta.T</code> only at the positions where R is nonzero? The large dimensions are m and n (order 100000), while l is small (order 10). This is probably a fairly common operation for big data, since it shows up in the cost function of most linear regression problems, so I'd expect a solution built into scipy.sparse, but I haven't found anything obvious yet.</p>
<p>The naive way to do this in Python is <code>R.multiply(X*Theta.T)</code>, but this evaluates the full matrix <code>X*Theta.T</code> (m by n, order 100000**2), which occupies too much memory, and then discards most of the entries because R is sparse.</p>
<p>There is a <a href="https://stackoverflow.com/questions/13731405/calculate-subset-of-matrix-multiplication">pseudo solution already here on stackoverflow</a>, but it is non-sparse in one step:</p>
<pre><code>import numpy as np
import scipy.sparse as sp

def sparse_mult_notreally(a, b, coords):
    rows, cols = coords
    rows, r_idx = np.unique(rows, return_inverse=True)
    cols, c_idx = np.unique(cols, return_inverse=True)
    C = np.array(np.dot(a[rows, :], b[:, cols]))  # this operation is dense
    return sp.coo_matrix( (C[r_idx,c_idx],coords), (a.shape[0],b.shape[1]) )
</code></pre>
<p>This works fine, and fast, for me on small enough arrays, but it barfs on my big datasets with the following error:</p>
<pre><code>... in sparse_mult(a, b, coords)
    132     rows, r_idx = np.unique(rows, return_inverse=True)
    133     cols, c_idx = np.unique(cols, return_inverse=True)
--> 134     C = np.array(np.dot(a[rows, :], b[:, cols]))  # this operation is not sparse
    135     return sp.coo_matrix( (C[r_idx,c_idx],coords), (a.shape[0],b.shape[1]) )

ValueError: array is too big.
</code></pre>
<p>A solution which IS actually sparse, but very slow, is:</p>
<pre><code>def sparse_mult(a, b, coords):
    rows, cols = coords
    n = len(rows)
    # assumes a and b are np.matrix, so * is a matrix product (1x1 result per entry)
    C = np.array([ float(a[rows[i],:]*b[:,cols[i]]) for i in range(n) ])  # this is sparse, but VERY slow
    return sp.coo_matrix( (C,coords), (a.shape[0],b.shape[1]) )
</code></pre>
<p>Does anyone know a fast, fully sparse way to do this?</p>
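<p>For illustration, here is a minimal sketch of one possible vectorized approach. It is not something built into scipy.sparse; it relies on <code>np.einsum</code> to take row-wise dot products only at the requested coordinates. The function name <code>sparse_mult_chunked</code> and the <code>chunk</code> parameter are hypothetical, and the sketch assumes <code>a</code> is (m, l), <code>b</code> is (l, n), and the coordinate pairs are unique.</p>

```python
import numpy as np
import scipy.sparse as sp


def sparse_mult_chunked(a, b, coords, chunk=1_000_000):
    """Evaluate (a @ b) only at the given (rows, cols) coordinates.

    a: dense (m, l) array; b: dense (l, n) array;
    coords: tuple (rows, cols) of equal-length index arrays.
    The name and the chunk parameter are illustrative assumptions.
    """
    a = np.asarray(a)  # also converts np.matrix input to a plain ndarray
    b = np.asarray(b)
    rows, cols = (np.asarray(c) for c in coords)
    vals = np.empty(len(rows), dtype=np.result_type(a, b))
    # Work in chunks so the temporary (chunk, l) arrays stay small.
    for start in range(0, len(rows), chunk):
        stop = start + chunk
        r, c = rows[start:stop], cols[start:stop]
        # Row-wise dot products: sum over the short dimension l only.
        vals[start:stop] = np.einsum("ij,ij->i", a[r, :], b[:, c].T)
    return sp.coo_matrix((vals, (rows, cols)), shape=(a.shape[0], b.shape[1]))
```

<p>With the matrices from the question, this could be called as <code>sparse_mult_chunked(X, Theta.T, R.nonzero())</code>. The temporary memory is proportional to <code>chunk * l</code> rather than m * n, so the dense m by n product is never formed.</p>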
