<p>A couple of things:</p>
<ul>
<li><p>You use <code>[</code> a lot, which returns a list that you then have to <code>unlist</code>. Use <code>[[</code> instead to extract the actual value directly. This will be much faster.</p></li>
<li><p>Try to formulate the problem as a matrix (or vector) that you can operate on in one go. The <code>dist</code> function does that, but for knn it might use too much memory if the problem is large.</p></li>
<li><p>If you still need to use <code>sapply</code>, try <code>vapply</code> instead. It has much less overhead because you specify the result type, so it doesn't have to guess.</p></li>
<li><p>You might want to look at some other postings regarding knn, like <a href="https://stackoverflow.com/q/5560218/662787">Computing sparse pairwise distance matrix in R</a>. I suggested a way to calculate knn there that might be useful to you.</p></li>
</ul>
<p>That said, if I understand your code correctly, rewriting <code>knnEstimate</code> a bit provides a healthy speedup (16x):</p>
<pre><code># Using your original knnEstimate
system.time( a1 &lt;- crossValidate(knnEstimate, data) ) # 12.68 secs

# Using a vectorized version
knnEstimate &lt;- function(data, v1, k=3) {
  v &lt;- unlist(v1)
  # Get the matrix (one point per row)
  m &lt;- do.call(rbind, data[,'input'])
  # Distance from v to every point at once, then keep the k nearest
  idx &lt;- order(sqrt(colSums((t(m)-v)^2)))[seq_len(k)]
  mean(unlist(data[idx, 'result']))
}

system.time( a2 &lt;- crossValidate(knnEstimate, data) ) # 0.75 secs
</code></pre>
<p>The <code>sqrt(colSums((t(m)-v)^2))</code> is what calculates the Euclidean distance between the point <code>v</code> and <strong>all</strong> points in <code>m</code> in one go. Each row in <code>m</code> is a point, but it would be better to have each column be a point (then there is no need to transpose).</p>
<p>You can improve it further by keeping the input data in a matrix rather than as elements in a list. The same goes for the result vector. Also calculate <code>t(m)</code> outside <code>knnEstimate</code> to avoid doing it repeatedly.</p>
<p><strong>[UPDATE]</strong> Regarding your question about other distance metrics, here's a variant that calls a (more efficient) <code>euclidean</code> function. It also uses <code>vapply</code>:</p>
<pre><code>euclidean &lt;- function(v1, v2) sqrt(sum((v1 - v2) ^ 2))

knnEstimate &lt;- function(data, v1, k=3) {
  v &lt;- unlist(v1)
  # Get the matrix (one point per row)
  m &lt;- do.call(rbind, data[,'input'])
  # Call the distance function once per row; numeric(1) declares the result type
  idx &lt;- order(vapply(seq_len(nrow(m)), function(i) euclidean(m[i,], v), numeric(1)))[seq_len(k)]
  mean(unlist(data[idx, 'result']))
}

system.time( a3 &lt;- crossValidate(knnEstimate, data) ) # 5.22 secs
</code></pre>
<p>...but you should still consider handling the Euclidean case separately, since it performs so much better vectorized.</p>
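<p>To illustrate the "keep it in a matrix" suggestion, here is a minimal sketch. It assumes a layout that is not in your code: a numeric matrix <code>pts</code> with one point per <strong>column</strong> and a plain numeric vector <code>result</code>. The name <code>knnEstimateMat</code> and the toy data are made up for the example, so treat it as a starting point rather than a drop-in replacement for your <code>crossValidate</code> setup.</p>
<pre><code># Hypothetical layout (an assumption, not from the question):
# one point per COLUMN of `pts`, outcomes in the numeric vector `result`.
knnEstimateMat &lt;- function(pts, result, v, k = 3) {
  # v is recycled down each column, so it is subtracted from every point;
  # this requires length(v) == nrow(pts)
  d &lt;- sqrt(colSums((pts - v)^2))
  idx &lt;- order(d)[seq_len(k)]
  mean(result[idx])
}

# Toy data: four 2-d points and their outcomes
pts &lt;- cbind(c(0, 0), c(1, 0), c(0, 2), c(5, 5))
result &lt;- c(10, 20, 30, 40)

knnEstimateMat(pts, result, v = c(0, 0), k = 2) # 15
</code></pre>
<p>With this layout there is no <code>do.call(rbind, ...)</code>, no <code>t()</code> and no <code>unlist</code> inside the function, so the per-call work is just the distance computation and the <code>order</code>.</p>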
 
