Subset of a matrix multiplication, fast, and sparse

Converting a collaborative filtering code to use sparse matrices, I'm puzzling over the following problem: given two full matrices X (m by l) and Theta (n by l), and a sparse matrix R (m by n), is there a fast way to calculate the sparse inner product, i.e. the entries of X * Theta.T only at the positions where R is nonzero? The large dimensions are m and n (order 100000), while l is small (order 10). This is probably a fairly common operation for big data, since it shows up in the cost function of most linear regression problems, so I'd expect a solution built into scipy.sparse, but I haven't found anything obvious yet.

The naive way to do this in Python is R.multiply(X * Theta.T), but this will evaluate the full matrix X * Theta.T (m by n, order 100000**2), which occupies too much memory, and then throw away most of the entries since R is sparse.

There is a [pseudo solution already here on Stack Overflow](https://stackoverflow.com/questions/13731405/calculate-subset-of-matrix-multiplication), but it is non-sparse in one step:

```python
def sparse_mult_notreally(a, b, coords):
    rows, cols = coords
    rows, r_idx = np.unique(rows, return_inverse=True)
    cols, c_idx = np.unique(cols, return_inverse=True)
    C = np.array(np.dot(a[rows, :], b[:, cols]))  # this operation is dense
    return sp.coo_matrix((C[r_idx, c_idx], coords), (a.shape[0], b.shape[1]))
```

This works fine, and fast, for me on small enough arrays, but it barfs on my big datasets with the following error:

```
... in sparse_mult(a, b, coords)
    132     rows, r_idx = np.unique(rows, return_inverse=True)
    133     cols, c_idx = np.unique(cols, return_inverse=True)
--> 134     C = np.array(np.dot(a[rows, :], b[:, cols]))  # this operation is not sparse
    135     return sp.coo_matrix( (C[r_idx,c_idx],coords), (a.shape[0],b.shape[1]) )

ValueError: array is too big.
```

A solution which IS actually sparse, but very slow, is:

```python
def sparse_mult(a, b, coords):
    rows, cols = coords
    n = len(rows)
    C = np.array([float(a[rows[i], :] * b[:, cols[i]]) for i in range(n)])  # this is sparse, but VERY slow
    return sp.coo_matrix((C, coords), (a.shape[0], b.shape[1]))
```

Does anyone know a fast, fully sparse way to do this?
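
For reference, one memory-light possibility is sketched below. It is not part of the original post, and the function name `sparse_mult_rowwise` is illustrative: it gathers only the rows of `a` and columns of `b` that correspond to R's nonzero coordinates and reduces over the small dimension l with `np.einsum`, so the largest temporaries are nnz-by-l arrays rather than an m-by-n product.

```python
import numpy as np
import scipy.sparse as sp

def sparse_mult_rowwise(a, b, coords):
    """Evaluate np.dot(a, b) only at the (row, col) pairs given in coords."""
    a = np.asarray(a)   # works whether the inputs are np.matrix or ndarray
    b = np.asarray(b)
    rows, cols = coords
    # a[rows, :] and b[:, cols].T are both (nnz, l); their row-wise dot products
    # are exactly the entries np.dot(a, b)[rows[i], cols[i]], with no (m, n) array.
    data = np.einsum('ij,ij->i', a[rows, :], b[:, cols].T)
    return sp.coo_matrix((data, coords), (a.shape[0], b.shape[1]))
```

With the same calling convention as the functions above, this would be used as `sparse_mult_rowwise(X, Theta.T, R.nonzero())`.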
Comments:

1. That operation is as sparse as it can be with a vectorized approach. It'd be interesting to split the offending line in three to see where the memory error is happening, i.e. `aa = a[rows, :]; bb = b[:, cols]; C = np.dot(aa, bb)`. You don't need the `np.array` call, and as it is it actually makes a copy of the array, so it may even be the culprit of your memory error. Can you destroy `X` and `Theta` in the process of generating `R`?
2. My inputs a, b are np.matrix, so without the np.array the result C[r_idx,c_idx] is a 2D matrix instead of a 1D array. This caused an error in the sp.coo_matrix call, so I put the np.array in there. Converting to arrays beforehand might save some time, though. (See the first illustration after these comments.)
3. X and Theta are both full matrices, but I can't see why you'd want to destroy them. I'll note that while my matrix R is sparse, the sparsity pattern is such that every row and every column has at least one entry, so the result of the unique calls is the full range of m and n. This causes the np.dot result to be the full dense product of the two arrays we're trying to multiply sparsely. I tried defining aa and bb as you suggested and it died with the same error at `C = ...` again. Maybe Cython is the solution? (See the chunked sketch after these comments.)
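
Regarding comment 2, here is a minimal illustration (the toy shapes are made up) of why the np.array wrapper was needed: fancy indexing an np.matrix always returns a 2-D result, while the same indexing on a plain ndarray returns the 1-D data vector that sp.coo_matrix expects, so converting the inputs with np.asarray up front removes the need for the extra copy.

```python
import numpy as np
import scipy.sparse as sp

M = np.matrix(np.arange(12).reshape(3, 4))   # np.matrix, as in the original code
A = np.asarray(M)                            # plain ndarray view of the same data

r_idx = np.array([0, 1, 2])
c_idx = np.array([1, 2, 3])

print(M[r_idx, c_idx].shape)   # 2-D: np.matrix results never drop below two dimensions
print(A[r_idx, c_idx].shape)   # (3,): a 1-D vector, which coo_matrix accepts directly

R = sp.coo_matrix((A[r_idx, c_idx], (r_idx, c_idx)), (3, 4))
print(R.toarray())
```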
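
Regarding comment 3, since every row and column of R is populated, the unique trick cannot shrink the dense product, but the row-wise approach sketched after the question does not rely on it. If even the nnz-by-l temporaries are too large, the nonzero coordinates can be processed in blocks; the sketch below assumes the same calling convention, and the function name and chunk size are arbitrary.

```python
import numpy as np
import scipy.sparse as sp

def sparse_mult_chunked(a, b, coords, chunk=500000):
    """Same result as sparse_mult, but vectorized in blocks of `chunk` nonzeros."""
    a = np.asarray(a)
    b = np.asarray(b)
    rows, cols = coords
    data = np.empty(len(rows))
    for start in range(0, len(rows), chunk):
        r = rows[start:start + chunk]
        c = cols[start:start + chunk]
        # temporaries are at most (chunk, l), independent of how many nonzeros R has
        data[start:start + chunk] = np.einsum('ij,ij->i', a[r, :], b[:, c].T)
    return sp.coo_matrix((data, coords), (a.shape[0], b.shape[1]))
```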
 
