Note that there are some explanatory texts on larger screens.

plurals
  1. POHow can I make this R matrix filling function faster?
    primarykey
    data
    text
    <p>A while back I wrote a function to fill time series matrices that had NA values up according to the specifications needed and it's had occational uses on a few matrices that are about 50000 rows, 350 columns. The matrix can contain either numeric or character values. The main problem is fixing the matrix is slow and I thought I'd gauge some experts on how to do this faster.</p> <p>I guess going to rcpp or paralleling it might help but I think it's might be my design rather than R itself that's inefficient. I generally vecotrize everything in R but since the missing values follow no pattern I've found no other way than to work with the matrix on a per row basis. </p> <p>The function needs to be called so it can carry forwards missing values and also be called to quickly just fill the latest values with the last known one.</p> <p>Here is an example matrix:</p> <pre><code>testMatrix &lt;- structure(c(NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, 4.03, NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, NA, NA, NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, NA, 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.39, NA, NA, NA, 29.98, 66.89, NA, -10.72, -11.65, NA, 4.39, NA, NA, NA, 29.98, 50.65, NA, -10.72, -11.65, NA, 4.39, NA, NA, 4.72, NA, 50.65, NA, -10.72, -38.61, 45.3, NA), .Dim = c(10L, 9L), .Dimnames = list(c("ID_a", "ID_b", "ID_c", "ID_d", "ID_e", "ID_f", "ID_g", "ID_h", "ID_i", "ID_j"), c("2010-09-30", "2010-10-31", "2010-11-30", "2010-12-31", "2011-01-31", "2011-02-28", "2011-03-31", "2011-04-30", "2011-05-31"))) print(testMatrix) 2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31 ID_a NA NA NA NA NA NA NA NA NA ID_b NA NA NA NA NA NA NA NA NA ID_c NA NA NA NA NA NA NA NA 4.72 ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 NA ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65 ID_f NA NA NA NA NA NA NA NA NA ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72 ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61 ID_i NA NA NA NA NA NA NA NA 45.30 ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 NA </code></pre> <p>This is the function I currently use:</p> <pre><code># ---------------------------------------------------------------------------- # GetMatrixWithBlanksFilled # ---------------------------------------------------------------------------- # # Arguments: # inputMatrix --- A matrix with gaps in the time series rows # fillGapMax --- The max number of columns to carry a number # forward if there are no more values in the # time series row. # # Returns: # A matrix with gaps filled. GetMatrixWithBlanksFilled &lt;- function(inputMatrix, fillGapMax = 6, forwardLooking = TRUE) { if("DEBUG_ON" %in% ls(globalenv())){browser()} cntRow &lt;- nrow(inputMatrix) cntCol &lt;- ncol(inputMatrix) # if (forwardLooking) { for (i in 1:cntRow) { # Store the location of the first non NA element in the row firstValueCol &lt;- (1:cntCol)[!is.na(inputMatrix[i,])][1] if (!(is.na(firstValueCol))) { if (!(firstValueCol == cntCol)) { nextValueCol &lt;- firstValueCol # If there is a a value number in the row and it's not at the end of the time # series, start iterating through the row while there are more NA values and # more data values and not at the end of the row continue. while ((sum(as.numeric(is.na(inputMatrix[i,nextValueCol:cntCol]))))&gt;0 &amp;&amp; (sum(as.numeric(!is.na(inputMatrix[i,nextValueCol:cntCol]))))&gt;0 &amp;&amp; !(nextValueCol == cntCol)) { # Find the next NA element nextNaCol &lt;- (nextValueCol:cntCol)[is.na(inputMatrix[i,nextValueCol:cntCol])][1] # Find the next value element nextValueCol &lt;- (nextNaCol:cntCol)[!is.na(inputMatrix[i,nextNaCol:cntCol])][1] # If there is another value element then fill up all NA elements in between with the last known value if (!is.na(nextValueCol)) { inputMatrix[i,nextNaCol:(nextValueCol-1)] &lt;- inputMatrix[i,(nextNaCol-1)] } else { # If there is no other value element then fill up all NA elements up to the max number supplied # with the last known value unless it's close to the end of the row then just fill up to the end. inputMatrix[i,nextNaCol:min(nextNaCol+fillGapMax,cntCol)] &lt;- inputMatrix[i,(nextNaCol-1)] nextValueCol &lt;- cntCol } } } } } } else { for (i in 1:cntRow) { if (is.na(inputMatrix[i,ncol(inputMatrix)])) { tempRow &lt;- inputMatrix[i,max(1,length(inputMatrix[i,])-fillGapMax):length(inputMatrix[i,])] if (length(tempRow[!is.na(tempRow)])&gt;0) { lastNonNaLocation &lt;- (length(tempRow):1)[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])] inputMatrix[i,(ncol(inputMatrix)-lastNonNaLocation+2):ncol(inputMatrix)] &lt;- tempRow[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])] } } } } return(inputMatrix) } </code></pre> <p>I'm then calling it with something like:</p> <pre><code>&gt; fixedMatrix1 &lt;- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=12,forwardLooking=TRUE) &gt; print(fixedMatrix1) 2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31 ID_a NA NA NA NA NA NA NA NA NA ID_b NA NA NA NA NA NA NA NA NA ID_c NA NA NA NA NA NA NA NA 4.72 ID_d 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65 ID_f NA NA NA NA NA NA NA NA NA ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72 ID_h -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -38.61 ID_i NA NA NA NA NA NA NA NA 45.30 ID_j 4.03 4.03 4.76 4.76 4.76 4.39 4.39 4.39 4.39 </code></pre> <p>or</p> <pre><code>&gt; fixedMatrix2 &lt;- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=1,forwardLooking=FALSE) &gt; print(fixedMatrix2) 2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31 ID_a NA NA NA NA NA NA NA NA NA ID_b NA NA NA NA NA NA NA NA NA ID_c NA NA NA NA NA NA NA NA 4.72 ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 29.98 ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65 ID_f NA NA NA NA NA NA NA NA NA ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72 ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61 ID_i NA NA NA NA NA NA NA NA 45.30 ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 4.39 </code></pre> <p>This example runs quickly but is there any way to make it faster for large matrices?</p> <pre><code>&gt; n &lt;- 38 &gt; m &lt;- 5000 &gt; bigM &lt;- matrix(rep(testMatrix,n*m),m*nrow(testMatrix),n*ncol(testMatrix),FALSE) &gt; system.time(output &lt;- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE)) user system elapsed 86.47 0.06 87.24 </code></pre> <p>This dummy one has a lot of NA only rows and completely filled ones but the normal ones can take about 15-20 min.</p> <p><strong>UPDATE</strong></p> <p>Regarding Charles' comment about na.locf not completely mirroring the logic of the above: Below is a simplified version of how the final function is excluding checks for inputs etc:</p> <pre><code>FillGaps &lt;- function( dataMatrix, fillGapMax ) { require("zoo") numRow &lt;- nrow(dataMatrix) numCol &lt;- ncol(dataMatrix) iteration &lt;- (numCol-fillGapMax) if(length(iteration)&gt;0) { for (i in iteration:1) { tempMatrix &lt;- dataMatrix[,i:(i+fillGapMax),drop=FALSE] tempMatrix &lt;- t(zoo::na.locf(t(tempMatrix), na.rm=FALSE, maxgap=fillGapMax)) dataMatrix[,i:(i+fillGapMax)] &lt;- tempMatrix } } return(dataMatrix) } </code></pre>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload