StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
primarykey
Id
14614734
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
3
CommunityOwnedDate
CreationDate
2013-01-30T22:06:29.793
FavoriteCount
0
LastActivityDate
2013-01-30T22:53:19.433
LastEditDate
2013-01-30T22:53:19.433
LastEditorUserId
559784
OwnerUserId
559784
ParentId
14614306
PostTypeId
2
Score
21
ViewCount
0
LastEditorDisplayName
text
Body
These differences can be attributed to 1) communication overhead (especially if you run across nodes) and 2) performance overhead (if your job is not that intensive compared to initiating a parallelisation, for example). Usually, if the task you are parallelising is not that time-consuming, then you will mostly find that parallelisation does NOT have much of an effect (which is much highly visible on huge datasets. Even though this may not directly answer your benchmarking, I hope this should be rather straightforward and can be related to. As an example, here, I construct a <code>data.frame</code> with <code>1e6</code> rows with <code>1e4</code> unique column <code>group</code> entries and some values in column <code>val</code>. And then I run using <code>plyr</code> in <code>parallel</code> using <code>doMC</code> and without parallelisation. <pre><code>df <- data.frame(group = as.factor(sample(1:1e4, 1e6, replace = T)), val = sample(1:10, 1e6, replace = T)) > head(df) group val # 1 8498 8 # 2 5253 6 # 3 1495 1 # 4 7362 9 # 5 2344 6 # 6 5602 9 > dim(df) # [1] 1000000 2 require(plyr) require(doMC) registerDoMC(20) # 20 processors # parallelisation using doMC + plyr P.PLYR <- function() { o1 <- ddply(df, .(group), function(x) sum(x$val), .parallel = TRUE) } # no parallelisation PLYR <- function() { o2 <- ddply(df, .(group), function(x) sum(x$val), .parallel = FALSE) } require(rbenchmark) benchmark(P.PLYR(), PLYR(), replications = 2, order = "elapsed") test replications elapsed relative user.self sys.self user.child sys.child 2 PLYR() 2 8.925 1.000 8.865 0.068 0.000 0.000 1 P.PLYR() 2 30.637 3.433 15.841 13.945 8.944 38.858 </code></pre> As you can see, the parallel version of <code>plyr</code> runs 3.5 times slower Now, let me use the same <code>data.frame</code>, but instead of computing <code>sum</code>, let me construct a bit more demanding function, say, <code>median(.) * median(rnorm(1e4)</code> ((meaningless, yes): You'll see that the tides are beginning to shift: <pre><code># parallelisation using doMC + plyr P.PLYR <- function() { o1 <- ddply(df, .(group), function(x) median(x$val) * median(rnorm(1e4)), .parallel = TRUE) } # no parallelisation PLYR <- function() { o2 <- ddply(df, .(group), function(x) median(x$val) * median(rnorm(1e4)), .parallel = FALSE) } > benchmark(P.PLYR(), PLYR(), replications = 2, order = "elapsed") test replications elapsed relative user.self sys.self user.child sys.child 1 P.PLYR() 2 41.911 1.000 15.265 15.369 141.585 34.254 2 PLYR() 2 73.417 1.752 73.372 0.052 0.000 0.000 </code></pre> Here, the parallel version is <code>1.752 times</code> faster than the non-parallel version. Edit: Following @Paul's comment, I just implemented a small delay using <code>Sys.sleep()</code>. Of course the results are obvious. But just for the sake of completeness, here's the result on a 20*2 data.frame: <pre><code>df <- data.frame(group=sample(letters[1:5], 20, replace=T), val=sample(20)) # parallelisation using doMC + plyr P.PLYR <- function() { o1 <- ddply(df, .(group), function(x) { Sys.sleep(2) median(x$val) }, .parallel = TRUE) } # no parallelisation PLYR <- function() { o2 <- ddply(df, .(group), function(x) { Sys.sleep(2) median(x$val) }, .parallel = FALSE) } > benchmark(P.PLYR(), PLYR(), replications = 2, order = "elapsed") # test replications elapsed relative user.self sys.self user.child sys.child # 1 P.PLYR() 2 4.116 1.000 0.056 0.056 0.024 0.04 # 2 PLYR() 2 20.050 4.871 0.028 0.000 0.000 0.00 </code></pre> The difference here is not surprising.
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POWhy is the parallel package slower than just using apply?
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USArun
UserOwnerUserId
1. USArun
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId

Querying!

Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Related data can be related in a to-one or a to-many fashion. Captions of data related in a to-many fashion link to a list view showing a filtered view of the table.

Try moving around until you find a non-empty to-many entry and click on the label to get to one. You can move back to the root by clicking on the database name in the header.