# `unlist` using a lot of time when summing a subset of a matrix
I have a program which pulls data out of a MySQL database, decodes a pair of binary columns, and then sums together a subset of the rows within the pair of binary columns. Running the program on a sample data set takes 12-14 seconds, with 9-10 of those taken up by `unlist`. I'm wondering if there is any way to speed things up.

## Structure of the table

The rows I'm getting from the database look like:

```
| array_length | mz_array        | intensity_array |
|--------------+-----------------+-----------------|
|           98 | 00c077e66340... | 002091c37240... |
|           74 | c04a7c7340...   | db87734000...   |
```

where `array_length` is the number of little-endian doubles in the two arrays (they are guaranteed to be the same length). So the first row has 98 doubles in each of `mz_array` and `intensity_array`. Across the 13,000 rows, `array_length` has a mean of 825 and a median of 620.

## Decoding the binary arrays

Each row gets decoded by being passed to the following function. Once the binary arrays have been decoded, `array_length` is no longer needed.

```r
DecodeSpectrum <- function(array_length, mz_array, intensity_array) {
  sapply(list(mz_array = mz_array, intensity_array = intensity_array),
         readBin, what = "double", endian = "little", n = array_length)
}
```

## Summing the arrays

The next step is to sum the values in `intensity_array`, but only if their corresponding entry in `mz_array` is within a certain window. The arrays are ordered by `mz_array`, ascending.
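As a side note, the decode step can be sanity-checked without the database: pack a couple of known doubles into little-endian bytes with `writeBin`, then decode them exactly as `DecodeSpectrum` does. All values below are invented for illustration; only `readBin`/`writeBin` behavior is being exercised.

```r
## Pack two known doubles per column as little-endian bytes, the same
## layout the mz_array / intensity_array columns use in the database.
raw_mz        <- writeBin(c(600.5, 601.25), raw(), endian = "little")
raw_intensity <- writeBin(c(10, 20),        raw(), endian = "little")

## Decode the way DecodeSpectrum does: one readBin per column, with
## sapply simplifying the two numeric vectors into a two-column matrix.
spectrum <- sapply(list(mz_array = raw_mz, intensity_array = raw_intensity),
                   readBin, what = "double", endian = "little", n = 2)

dim(spectrum)   # a 2 x 2 matrix, columns mz_array and intensity_array
```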
I am using the following function to sum up the `intensity_array` values:

```r
SumInWindow <- function(spectrum, lower, upper) {
  sum(spectrum[spectrum[, 1] > lower & spectrum[, 1] < upper, 2])
}
```

where `spectrum` is the output from `DecodeSpectrum`, a `matrix`.

## Operating over list of rows

Each row is handled by:

```r
ProcessSegment <- function(spectra, window_bounds) {
  lower <- window_bounds[1]
  upper <- window_bounds[2]

  ## Decode a single spectrum and sum the intensities within the window.
  SumDecode <- function(...) {
    SumInWindow(DecodeSpectrum(...), lower, upper)
  }

  do.call("mapply", c(SumDecode, spectra))
}
```

And finally, the rows are fetched and handed off to `ProcessSegment` with this function:

```r
ProcessAllSegments <- function(conn, window_bounds) {
  nextSeg <- function() odbcFetchRows(conn, max = batchSize, buffsize = batchSize)

  while ((res <- nextSeg())$stat == 1 && res$data[[1]] > 0) {
    print(ProcessSegment(res$data, window_bounds))
  }
}
```

I'm doing the fetches in segments so that R doesn't have to load the entire data set into memory at once (it was causing out-of-memory errors). I'm using the `RODBC` driver because the `RMySQL` driver isn't able to return unsullied binary values (as far as I could tell).

## Performance

For a sample data set of about 140 MiB, the whole process takes around 14 seconds to complete, which is not that bad for 13,000 rows.
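For readers without the database handy, the per-segment pipeline can be exercised end to end on synthetic rows. This is a minimal sketch that assumes the fetched segment is a list of three parallel columns, as in the functions above; every value is invented for illustration.

```r
## Definitions repeated from the question so this snippet runs standalone.
DecodeSpectrum <- function(array_length, mz_array, intensity_array) {
  sapply(list(mz_array = mz_array, intensity_array = intensity_array),
         readBin, what = "double", endian = "little", n = array_length)
}
SumInWindow <- function(spectrum, lower, upper) {
  sum(spectrum[spectrum[, 1] > lower & spectrum[, 1] < upper, 2])
}
ProcessSegment <- function(spectra, window_bounds) {
  lower <- window_bounds[1]
  upper <- window_bounds[2]
  SumDecode <- function(...) SumInWindow(DecodeSpectrum(...), lower, upper)
  do.call("mapply", c(SumDecode, spectra))
}

## Two synthetic rows: (array_length, mz bytes, intensity bytes).
Pack <- function(x) writeBin(x, raw(), endian = "little")
spectra <- list(array_length    = c(3, 2),
                mz_array        = list(Pack(c(500, 700, 900)), Pack(c(650, 800))),
                intensity_array = list(Pack(c(1, 2, 4)),       Pack(c(8, 16))))

## Window (600, 1200): row 1 keeps mz 700 and 900 -> 2 + 4 = 6;
## row 2 keeps both entries -> 8 + 16 = 24.
ProcessSegment(spectra, c(600, 1200))
```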
Still, I think there's room for improvement, especially when looking at the `Rprof` output:

```
$by.self
                 self.time self.pct total.time total.pct
"unlist"             10.26    69.99      10.30     70.26
"SumInWindow"         1.06     7.23      13.92     94.95
"mapply"              0.48     3.27      14.44     98.50
"as.vector"           0.44     3.00      10.60     72.31
"array"               0.40     2.73       0.40      2.73
"FUN"                 0.40     2.73       0.40      2.73
"list"                0.30     2.05       0.30      2.05
"<"                   0.22     1.50       0.22      1.50
"unique"              0.18     1.23       0.36      2.46
">"                   0.18     1.23       0.18      1.23
".Call"               0.16     1.09       0.16      1.09
"lapply"              0.14     0.95       0.86      5.87
"simplify2array"      0.10     0.68      11.48     78.31
"&"                   0.10     0.68       0.10      0.68
"sapply"              0.06     0.41      12.36     84.31
"c"                   0.06     0.41       0.06      0.41
"is.factor"           0.04     0.27       0.04      0.27
"match.fun"           0.04     0.27       0.04      0.27
"<Anonymous>"         0.02     0.14      13.94     95.09
"unique.default"      0.02     0.14       0.06      0.41

$by.total
                     total.time total.pct self.time self.pct
"ProcessAllSegments"      14.66    100.00      0.00     0.00
"do.call"                 14.50     98.91      0.00     0.00
"ProcessSegment"          14.50     98.91      0.00     0.00
"mapply"                  14.44     98.50      0.48     3.27
"<Anonymous>"             13.94     95.09      0.02     0.14
"SumInWindow"             13.92     94.95      1.06     7.23
"sapply"                  12.36     84.31      0.06     0.41
"DecodeSpectrum"          12.36     84.31      0.00     0.00
"simplify2array"          11.48     78.31      0.10     0.68
"as.vector"               10.60     72.31      0.44     3.00
"unlist"                  10.30     70.26     10.26    69.99
"lapply"                   0.86      5.87      0.14     0.95
"array"                    0.40      2.73      0.40     2.73
"FUN"                      0.40      2.73      0.40     2.73
"unique"                   0.36      2.46      0.18     1.23
"list"                     0.30      2.05      0.30     2.05
"<"                        0.22      1.50      0.22     1.50
">"                        0.18      1.23      0.18     1.23
".Call"                    0.16      1.09      0.16     1.09
"nextSeg"                  0.16      1.09      0.00     0.00
"odbcFetchRows"            0.16      1.09      0.00     0.00
"&"                        0.10      0.68      0.10     0.68
"c"                        0.06      0.41      0.06     0.41
"unique.default"           0.06      0.41      0.02     0.14
"is.factor"                0.04      0.27      0.04     0.27
"match.fun"                0.04      0.27      0.04     0.27

$sample.interval
[1] 0.02

$sampling.time
[1] 14.66
```

I'm surprised to see `unlist` taking up so much time; this says to me that there might be some redundant copying or rearranging going on.
I'm new to R, so it's entirely possible that this is normal, but I'd like to know if there's anything glaringly wrong.

# Update: sample data posted

I've posted the full version of the program [here](https://gist.github.com/3151005) and the sample data I use [here](http://dl.dropbox.com/u/19422084/dump.sql.gz). The sample data is the `gzip`ed output from `mysqldump`. You need to set the proper environment variables for the script to connect to the database:

- `MZDB_HOST`
- `MZDB_DB`
- `MZDB_USER`
- `MZDB_PW`

To run the script, you must specify the `run_id` and the window boundaries. I run the program like this:

```
Rscript ChromatoGen.R -i 1 -m 600 -M 1200
```

These window bounds are fairly arbitrary, but select roughly a third to a half of the range. If you want to print the results, put a `print()` around the call to `ProcessSegment` within `ProcessAllSegments`. Using those parameters, the first 5 values should be:

```
[1] 7139.682 4522.314 3435.512 5255.024 5947.999
```

You probably want to limit the number of results, unless you want 13,000 numbers filling your screen :) The simplest way is to add `LIMIT 5` at the end of `query`.
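On the profiling question itself: `sapply` simplifies its per-column results through `simplify2array`, which leans on `unlist`, and those two dominate the `$by.total` table above (78% and 70%). As a hedged sketch rather than a tested fix, the simplification pass can be sidestepped by building the matrix directly from two `readBin` calls; `DecodeSpectrum2` below is a hypothetical drop-in with the same output shape.

```r
## Hypothetical variant of DecodeSpectrum: two direct readBin calls glued
## together with cbind, so no sapply/simplify2array/unlist pass is involved.
DecodeSpectrum2 <- function(array_length, mz_array, intensity_array) {
  cbind(mz_array        = readBin(mz_array, what = "double",
                                  endian = "little", n = array_length),
        intensity_array = readBin(intensity_array, what = "double",
                                  endian = "little", n = array_length))
}

## Quick agreement check on synthetic little-endian bytes (values invented).
bytes_mz <- writeBin(c(600.5, 601.25), raw(), endian = "little")
bytes_in <- writeBin(c(10, 20),        raw(), endian = "little")
DecodeSpectrum2(2, bytes_mz, bytes_in)   # same 2 x 2 matrix layout as before
```

Whether this actually removes the hotspot on the full 13,000-row data set is an assumption that would need to be re-checked with `Rprof`.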