StackOverflow2013

<p>Building on the great answer by @John Colby, we can get rid of the apply step and speed it up quite a bit (about 20x):</p>
<pre><code># Create a bigger test set
A <- c(1, NA, NA, 4, NA, 7, NA, NA, NA, NA)
B <- c(NA, 2, NA, NA, 5, NA, 8, NA, 11, NA)
C <- c(NA, NA, 3, NA, NA, NA, NA, 9, NA, 12)

n=1e6; test_df = data.frame(A=rep(A, len=n), B=rep(B, len=n), C=rep(C, len=n))

# John Colby's method, 9.66 secs
system.time({
  df1 = with(test_df, cbind(A[1:(length(A)-2)], B[2:(length(B)-1)], C[3:length(C)]))
  df1 = data.frame(df1[!apply(df1, 1, function(x) all(is.na(x))), ])
  colnames(df1) = c('A', 'B', 'C')
})

# My method, 0.48 secs
system.time({
  df2 = with(test_df, data.frame(A=A[1:(length(A)-2)], B=B[2:(length(B)-1)], C=C[3:length(C)]))
  df2 = df2[is.finite(with(df2, A|B|C)),]
  row.names(df2) <- NULL
})

identical(df1, df2) # TRUE
</code></pre>
<p>The trick here is that <code>A|B|C</code> is only <code>NA</code> if all values are <code>NA</code>. This holds as long as the non-missing values are nonzero: a <code>0</code> is treated as <code>FALSE</code>, and <code>FALSE | NA</code> is <code>NA</code>. This turns out to be much faster than calling <code>all(is.na(x))</code> on each row of a matrix using <code>apply</code>.</p>
<p><strong>EDIT</strong> @John has a different approach that also speeds it up. I added some code to turn the result into a data.frame with correct names and timed it. It seems to be pretty much the same speed as my solution.</p>
<pre><code># John's method, 0.50 secs
# (uses -1 as a stand-in for NA, so it assumes the data has no negative values)
system.time({
  test_m = with(test_df, cbind(A[1:(length(A)-2)], B[2:(length(B)-1)], C[3:length(C)]))
  test_m[is.na(test_m)] <- -1
  test_m <- test_m[rowSums(test_m) > -3,]
  test_m[test_m == -1] <- NA
  df3 <- data.frame(test_m)
  colnames(df3) = c('A', 'B', 'C')
})

identical(df1, df3) # TRUE
</code></pre>
<p><strong>EDIT AGAIN</strong> ...and @John Colby's updated answer is even faster!</p>
<pre><code># John Colby's method, 0.39 secs
system.time({
  df4 = with(test_df, cbind(A[1:(length(A)-2)], B[2:(length(B)-1)], C[3:length(C)]))
  df4 = data.frame(df4[rowSums(is.na(df4)) != ncol(df4), ])
  colnames(df4) = c('A', 'B', 'C')
})

identical(df1, df4) # TRUE
</code></pre>
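<p>To see how the <code>A|B|C</code> filter and the <code>rowSums(is.na(...))</code> filter behave on edge cases, here is a minimal sketch on toy vectors. The values and the names <code>or_result</code>, <code>keep_or</code> and <code>keep_rowsums</code> are made up for illustration and are not part of the timings above.</p>

```r
# Toy vectors (hypothetical values): row 2 is entirely NA, row 3 contains a zero
A <- c(1, NA, 0, NA)
B <- c(NA, NA, NA, 2)
C <- c(NA, NA, NA, NA)

# Element-wise OR: nonzero numbers count as TRUE, zero counts as FALSE.
# TRUE | NA is TRUE, while FALSE | NA and NA | NA are both NA.
or_result <- A | B | C

# Rows kept by the logical-OR filter (is.finite() is FALSE for NA)
keep_or <- is.finite(or_result)

# Rows kept by counting NAs per row instead
m <- cbind(A, B, C)
keep_rowsums <- rowSums(is.na(m)) != ncol(m)

print(keep_or)
print(keep_rowsums)
```

<p>In this sketch both filters drop the all-<code>NA</code> row, but only the <code>rowSums</code> version keeps the row whose single non-missing value is <code>0</code>, so it is probably the safer choice when zeros can occur in the data.</p>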
