**Parallel application in Python becomes much slower when using MPI rather than the multiprocessing module**
Lately I've observed a weird effect while measuring the performance of my parallel application, using the multiprocessing module and mpi4py as communication tools.

The application runs evolutionary algorithms on sets of data. Most operations are done sequentially, with the exception of evaluation. After all evolutionary operators are applied, every individual needs to receive a new fitness value, which happens during the evaluation. Basically it's just a mathematical calculation performed on a list of (Python) floats. Before the evaluation the data set is scattered, either by MPI's scatter or by Python's Pool.map; then comes the parallel evaluation, and afterwards the data comes back through MPI's gather or, again, the Pool.map mechanism.

My benchmark platform is a virtual machine (VirtualBox) running Ubuntu 11.10 with Open MPI 1.4.3 on a Core i7 (4/8 cores), 8 GB of RAM and an SSD drive.

What I find truly surprising is that I get a nice speed-up, but depending on the communication tool the performance becomes worse after a certain threshold of processes. This is illustrated by the pictures below.

y axis - processing time
x axis - number of processes
colours - size of each individual (number of floats)

**1) Using the multiprocessing module - Pool.map**
[plot: https://i.stack.imgur.com/gUFW6.png]

**2) Using MPI - Scatter/Gather**
[plot: https://i.stack.imgur.com/sPcQt.png]

**3) Both pictures on top of each other**
[plot: https://i.stack.imgur.com/vsAeQ.png]

At first I thought it was hyper-threading's fault, because for large data sets it becomes slower after reaching 4 processes (4 physical cores). However, that should also be visible in the multiprocessing case, and it's not. My other guess is that the MPI communication methods are much less efficient than Python's, but I find that hard to believe.

Does anyone have an explanation for these results?

**ADDED:**

I'm starting to believe it's hyper-threading's fault after all. I tested my code on a machine with a Core i5 (2/4 cores) and the performance gets worse with 3 or more processes. The only explanation that comes to my mind is that the i7 I'm using doesn't have enough resources (cache?) to compute the evaluations concurrently under hyper-threading, so it needs to schedule more than 4 processes to run on the 4 physical cores.

What's interesting, though, is that when I use MPI, htop shows complete utilization of all 8 logical cores, which would suggest that the above statement is incorrect. On the other hand, when I use Pool.map it doesn't fully utilize all cores: it maxes out one or two of them and uses the rest only partially; again, no idea why it behaves this way. Tomorrow I will attach a screenshot showing this behaviour.

I'm not doing anything fancy in the code; it's really straightforward. (I'm not posting the entire code, not because it's secret, but because it needs additional libraries like DEAP to be installed. If someone is really interested in the problem and ready to install DEAP, I can prepare a short example.) The MPI code is a little different, because it can't deal with the population container (which inherits from list). There is some overhead of course, but nothing major.
Apart from the code I show below, the rest of it is the same.

**Pool.map:**

```python
from multiprocessing import Pool

def eval_population(func, pop):
    # assign a new fitness value to every individual
    for ind in pop:
        ind.fitness.values = func(ind)
    return pop

# ...
self.pool = Pool(8)
# ...
for iter_ in xrange(nr_of_generations):
    # ...
    # evaluate is really an eval_population alias with a certain function
    # assigned to its first argument
    self.pool.map(evaluate, pop)
    # ...
```

**MPI - Scatter/Gather:**

```python
from itertools import chain
from mpi4py import MPI

def divide_list(lst, n):
    # split lst into n interleaved chunks
    return [lst[i::n] for i in xrange(n)]

def chain_list(lst):
    # flatten a list of lists
    return list(chain.from_iterable(lst))

def evaluate_individuals_in_groups(func, rank, individuals):
    comm = MPI.COMM_WORLD
    size = MPI.COMM_WORLD.Get_size()
    packages = None
    if not rank:
        # root rank splits the population, one chunk per rank
        packages = divide_list(individuals, size)
    ind_for_eval = comm.scatter(packages)
    eval_population(func, ind_for_eval)
    pop_with_fit = comm.gather(ind_for_eval)
    if not rank:
        # root rank flattens the evaluated chunks back into the population
        pop_with_fit = chain_list(pop_with_fit)
        for index, elem in enumerate(pop_with_fit):
            individuals[index] = elem

for iter_ in xrange(nr_of_generations):
    # ...
    evaluate_individuals_in_groups(self.func, self.rank, pop)
    # ...
```

**ADDED 2:**

As I mentioned earlier, I ran some tests on my i5 machine (2/4 cores); here is the result:
[plot: https://i.stack.imgur.com/Zxoii.png]

I also found a machine with 2 Xeons (2x 6/12 cores) and repeated the benchmark:
[plot: https://i.stack.imgur.com/Vnkzx.png]

Now I have 3 examples of the same behaviour: when I run my computation on more processes than there are physical cores, it starts getting worse. I believe it's because the processes on the same physical core can't execute concurrently due to the lack of resources.
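On that last point, here is a minimal sketch (not part of my actual code; it assumes the optional psutil package, which I don't currently use) of how the number of workers could be capped at the physical core count instead of the logical one:

```python
# Hypothetical sketch: size the worker pool by physical cores, since the
# benchmarks above degrade once the process count exceeds them.
import multiprocessing

try:
    # psutil (optional, assumed here) can report physical cores;
    # multiprocessing.cpu_count() only reports logical (hyper-threaded) ones.
    import psutil
    n_workers = psutil.cpu_count(logical=False) or multiprocessing.cpu_count()
except ImportError:
    # fallback: logical core count (may over-subscribe the physical cores)
    n_workers = multiprocessing.cpu_count()

pool = multiprocessing.Pool(processes=n_workers)
```

For the MPI version the equivalent would simply be launching with `mpiexec -n <number of physical cores>`.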