**Parallel application in Python becomes much slower when using MPI rather than the multiprocessing module**
Lately I've observed a weird effect while measuring the performance of my parallel application, using the multiprocessing module and mpi4py as communication tools.

The application runs evolutionary algorithms on sets of data. Most operations are done sequentially, with the exception of evaluation. After all evolutionary operators are applied, every individual needs to receive a new fitness value, which happens during the evaluation. Basically it's just a mathematical calculation performed on a list of (Python) floats. Before the evaluation the data set is scattered, either by MPI's scatter or by Python's Pool.map; then comes the parallel evaluation, and afterwards the data comes back through MPI's gather or, again, the Pool.map mechanism.

My benchmark platform is a virtual machine (VirtualBox) running Ubuntu 11.10 with Open MPI 1.4.3 on a Core i7 (4/8 cores), 8 GB of RAM and an SSD drive.

What I find truly surprising is that I get a nice speed-up, but depending on the communication tool the performance becomes worse after a certain threshold of processes. This is illustrated by the pictures below.

y axis - processing time
x axis - number of processes
colours - size of each individual (number of floats)

**1) Using the multiprocessing module - Pool.map**
[plot: https://i.stack.imgur.com/gUFW6.png]

**2) Using MPI - Scatter/Gather**
[plot: https://i.stack.imgur.com/sPcQt.png]

**3) Both pictures on top of each other**
[plot: https://i.stack.imgur.com/vsAeQ.png]

At first I thought it was hyper-threading's fault, because for large data sets it becomes slower after reaching 4 processes (4 physical cores). However, that should also be visible in the multiprocessing case, and it's not. My other guess is that the MPI communication methods are much less efficient than Python's, but I find that hard to believe.

Does anyone have an explanation for these results?

**ADDED:**

I'm starting to believe it's hyper-threading's fault after all. I tested my code on a machine with a Core i5 (2/4 cores) and the performance gets worse with 3 or more processes. The only explanation that comes to my mind is that the i7 I'm using doesn't have enough resources (cache?) to compute the evaluations concurrently under hyper-threading, so it needs to schedule more than 4 processes to run on the 4 physical cores.

What's interesting, though, is that when I use MPI, htop shows complete utilization of all 8 logical cores, which would suggest that the above statement is incorrect. On the other hand, when I use Pool.map it doesn't fully utilize all cores: it maxes out one or two of them and uses the rest only partially; again, no idea why it behaves this way. Tomorrow I will attach a screenshot showing this behaviour.

I'm not doing anything fancy in the code; it's really straightforward. (I'm not posting the entire code, not because it's secret, but because it needs additional libraries like DEAP to be installed. If someone is really interested in the problem and ready to install DEAP, I can prepare a short example.) The MPI code is a little different, because it can't deal with the population container (which inherits from list). There is some overhead of course, but nothing major.
Apart from the code I show below, the rest of it is the same.

**Pool.map:**

```python
from multiprocessing import Pool

def eval_population(func, pop):
    # assign a new fitness value to every individual
    for ind in pop:
        ind.fitness.values = func(ind)
    return pop

# ...
self.pool = Pool(8)
# ...
for iter_ in xrange(nr_of_generations):
    # ...
    # evaluate is really an eval_population alias with a certain function
    # assigned to its first argument
    self.pool.map(evaluate, pop)
    # ...
```

**MPI - Scatter/Gather:**

```python
from itertools import chain
from mpi4py import MPI

def divide_list(lst, n):
    # split lst into n interleaved chunks
    return [lst[i::n] for i in xrange(n)]

def chain_list(lst):
    # flatten a list of lists
    return list(chain.from_iterable(lst))

def evaluate_individuals_in_groups(func, rank, individuals):
    comm = MPI.COMM_WORLD
    size = MPI.COMM_WORLD.Get_size()
    packages = None
    if not rank:
        # root rank splits the population, one chunk per rank
        packages = divide_list(individuals, size)
    ind_for_eval = comm.scatter(packages)
    eval_population(func, ind_for_eval)
    pop_with_fit = comm.gather(ind_for_eval)
    if not rank:
        # root rank flattens the evaluated chunks back into the population
        pop_with_fit = chain_list(pop_with_fit)
        for index, elem in enumerate(pop_with_fit):
            individuals[index] = elem

for iter_ in xrange(nr_of_generations):
    # ...
    evaluate_individuals_in_groups(self.func, self.rank, pop)
    # ...
```

**ADDED 2:**

As I mentioned earlier, I ran some tests on my i5 machine (2/4 cores); here is the result:
[plot: https://i.stack.imgur.com/Zxoii.png]

I also found a machine with 2 Xeons (2x 6/12 cores) and repeated the benchmark:
[plot: https://i.stack.imgur.com/Vnkzx.png]

Now I have 3 examples of the same behaviour: when I run my computation on more processes than there are physical cores, it starts getting worse. I believe it's because the processes on the same physical core can't execute concurrently due to the lack of resources.
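On that last point, here is a minimal sketch (not part of my actual code; it assumes the optional psutil package, which I don't currently use) of how the number of workers could be capped at the physical core count instead of the logical one:

```python
# Hypothetical sketch: size the worker pool by physical cores, since the
# benchmarks above degrade once the process count exceeds them.
import multiprocessing

try:
    # psutil (optional, assumed here) can report physical cores;
    # multiprocessing.cpu_count() only reports logical (hyper-threaded) ones.
    import psutil
    n_workers = psutil.cpu_count(logical=False) or multiprocessing.cpu_count()
except ImportError:
    # fallback: logical core count (may over-subscribe the physical cores)
    n_workers = multiprocessing.cpu_count()

pool = multiprocessing.Pool(processes=n_workers)
```

For the MPI version the equivalent would simply be launching with `mpiexec -n <number of physical cores>`.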