<p>Here's my stab at it. I never like to see a <code>for</code> loop in R, but in the case of a sparsely-NA vector, it looks like it will actually be more efficient (performance metrics below). The gist of the code is below.</p>

<pre><code>#get the index of all NA values
nas &lt;- which(is.na(dat))

#get the Boolean map of which values are NA, used later to distinguish
#genuine replacement values from values that were filled in earlier
namask &lt;- is.na(dat)

#calculate the maximum size of a run of NAs
maxRun &lt;- getLengthNAs(dat)

#the furthest away a replacement could be is half the length of the
#maximum NA run
windowSize &lt;- ceiling(maxRun/2)

#loop through all NAs
for (thisIndex in nas){
    #clamp the window to the bounds of the vector; R silently drops a
    #zero subscript and errors out when positive and negative subscripts
    #are mixed, so an unclamped window near either end misbehaves
    lo &lt;- max(1, thisIndex - windowSize)
    hi &lt;- min(length(dat), thisIndex + windowSize)

    #extract the neighborhood of this NA
    neighborhood &lt;- dat[lo:hi]

    #mask out any values which were originally NA, so that values
    #filled in on earlier iterations are not reused as replacements
    neighborhood[namask[lo:hi]] &lt;- NA

    #the position of this NA within the neighborhood
    center &lt;- thisIndex - lo + 1

    #signed distance to every non-NA value in the neighborhood
    delta &lt;- center - which(!is.na(neighborhood))

    #find the closest replacement
    replacement &lt;- delta[abs(delta) == min(abs(delta))]

    #in case of a tie, just pick the first
    replacement &lt;- replacement[1]

    #replace with the nearest non-NA value
    dat[thisIndex] &lt;- dat[thisIndex - replacement]
}
</code></pre>

<p>I liked the code you proposed, but I noticed that we were calculating the delta between every NA value and every other non-NA index in the vector. I think this was the biggest performance hog. Instead, I just extract the minimum-sized neighborhood or window around each NA and find the nearest non-NA value within that window.</p>

<p>So the performance scales linearly with the number of NAs and the window size -- where the window size is (the ceiling of) half the length of the maximum run of NAs. To calculate the length of the maximum run of NAs, you can use the following function:</p>

<pre><code>getLengthNAs &lt;- function(dat){
    nas &lt;- which(is.na(dat))
    spacing &lt;- diff(nas)
    runLength &lt;- 1

    while (any(spacing == 1)){
        runLength &lt;- runLength + 1
        spacing &lt;- diff(which(spacing == 1))
    }
    runLength
}
</code></pre>

<h2>Performance Comparison</h2>

<pre><code>#create a test vector with 10% NAs and length 50,000
dat &lt;- as.integer(runif(50000, min=0, max=10))
dat[dat==0] &lt;- NA

#the a() function is the code posted in the question
a &lt;- function(dat){
    na.pos &lt;- which(is.na(dat))
    if (length(na.pos) == length(dat)) {
        return(dat)
    }
    non.na.pos &lt;- setdiff(seq_along(dat), na.pos)
    nearest.non.na.pos &lt;- sapply(na.pos, function(x) {
        return(which.min(abs(non.na.pos - x)))
    })
    dat[na.pos] &lt;- dat[non.na.pos[nearest.non.na.pos]]
    dat
}

#my code: the same code posted above, with some additional
#helper code to sanitize the input
b &lt;- function(dat){
    if (is.null(dat)){
        return(NULL)
    }
    if (all(is.na(dat))){
        stop("Can't impute NAs if there are no non-NA values.")
    }
    if (!any(is.na(dat))){
        return(dat)
    }

    #starts with an NA (or multiple), handle these
    if (is.na(dat[1])){
        firstNonNA &lt;- which(!is.na(dat))[1]
        dat[1:(firstNonNA-1)] &lt;- dat[firstNonNA]
    }

    #ends with an NA (or multiple), handle these
    if (is.na(dat[length(dat)])){
        lastNonNA &lt;- which(!is.na(dat))
        lastNonNA &lt;- lastNonNA[length(lastNonNA)]
        dat[(lastNonNA+1):length(dat)] &lt;- dat[lastNonNA]
    }

    #get the index of all NA values
    nas &lt;- which(is.na(dat))

    #get the Boolean map of which values are NA
    namask &lt;- is.na(dat)

    #calculate the maximum size of a run of NAs
    maxRun &lt;- getLengthNAs(dat)

    #the furthest away a replacement could be is half the length of the
    #maximum NA run; the leading and trailing runs were already filled
    #above, so every remaining NA has a non-NA value within this window
    windowSize &lt;- ceiling(maxRun/2)

    #loop through all NAs
    for (thisIndex in nas){
        lo &lt;- max(1, thisIndex - windowSize)
        hi &lt;- min(length(dat), thisIndex + windowSize)
        neighborhood &lt;- dat[lo:hi]
        neighborhood[namask[lo:hi]] &lt;- NA
        center &lt;- thisIndex - lo + 1
        delta &lt;- center - which(!is.na(neighborhood))
        replacement &lt;- delta[abs(delta) == min(abs(delta))]
        replacement &lt;- replacement[1]
        dat[thisIndex] &lt;- dat[thisIndex - replacement]
    }
    dat
}

#nograpes' answer on this question
c &lt;- function(dat){
    nas &lt;- is.na(dat)
    if (!any(!nas)) return(dat)
    t &lt;- rle(nas)
    f &lt;- sapply(t$lengths[t$values], seq)
    a &lt;- unlist(f)
    b &lt;- unlist(lapply(f, rev))
    x &lt;- which(nas)
    l &lt;- length(dat)
    dat[nas] &lt;- ifelse(a &gt; b,
                       dat[ifelse((x+b) &gt; l, x-a, x+b)],
                       dat[ifelse((x-a) &lt; 1, x+b, x-a)])
    dat
}

#run each 10 times to get average performance
sum &lt;- 0
for (i in 1:10){ sum &lt;- sum + system.time(a(dat))["elapsed"] }
cat("A: ", sum/10)
A:  5.059

sum &lt;- 0
for (i in 1:10){ sum &lt;- sum + system.time(b(dat))["elapsed"] }
cat("B: ", sum/10)
B:  0.126

sum &lt;- 0
for (i in 1:10){ sum &lt;- sum + system.time(c(dat))["elapsed"] }
cat("C: ", sum/10)
C:  0.287
</code></pre>

<p>So it looks like this code (at least under these conditions) offers about a 40X speedup over the original code posted in the question, and a 2.2X speedup over @nograpes' answer below (though I imagine an <code>rle</code> solution would certainly be faster in some situations -- including a more NA-rich vector).</p>
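<p>To make the behavior concrete, here is a condensed, self-contained sketch of the windowed nearest-neighbor fill described above. It is not part of the original answer: the toy vector is made up for illustration, <code>imputeNearest</code> is a hypothetical name, and <code>rle()</code> stands in for <code>getLengthNAs()</code> to keep the example short. Note that ties between an equally distant left and right neighbor go to the left value, because <code>which()</code> returns positions in ascending order:</p>

```r
# Sketch of the windowed nearest-non-NA fill (assumed equivalent to the
# answer's b(), with rle() computing the longest NA run for brevity).
imputeNearest <- function(dat){
  # fill leading/trailing NA runs from the nearest interior value
  firstNonNA <- which(!is.na(dat))[1]
  if (firstNonNA > 1) dat[1:(firstNonNA-1)] <- dat[firstNonNA]
  lastNonNA <- max(which(!is.na(dat)))
  if (lastNonNA < length(dat)) dat[(lastNonNA+1):length(dat)] <- dat[lastNonNA]

  namask <- is.na(dat)
  if (!any(namask)) return(dat)

  # window size: half the longest remaining run of NAs
  runs <- rle(namask)
  windowSize <- ceiling(max(runs$lengths[runs$values]) / 2)

  for (thisIndex in which(namask)){
    # clamp the window to the vector bounds
    lo <- max(1, thisIndex - windowSize)
    hi <- min(length(dat), thisIndex + windowSize)
    neighborhood <- dat[lo:hi]
    neighborhood[namask[lo:hi]] <- NA   # never reuse filled-in values
    # signed distance to every non-NA neighbor; pick the closest
    delta <- (thisIndex - lo + 1) - which(!is.na(neighborhood))
    dat[thisIndex] <- dat[thisIndex - delta[which.min(abs(delta))]]
  }
  dat
}

imputeNearest(c(NA, 1, NA, NA, 4, NA, 9))
# 1 1 1 4 4 4 9
```

<p>The leading NA is copied from the first non-NA value, the interior run <code>NA, NA</code> splits between its two neighbors, and the tie at position 6 resolves to the left neighbor <code>4</code>.</p>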
 
