<p>To run an Rmpi-based program on a cluster, you need to request multiple nodes using your batch queueing system, and then execute your R script from the job script via a utility such as mpirun/mpiexec. Ideally, the mpirun utility has been built to automatically detect what nodes have been allocated by the batch queueing system; otherwise, you will need to use an mpirun argument such as <code>--hostfile</code> to tell it what nodes to use.</p>

<p>In your case, it sounds like you requested multiple nodes, so the problem is probably with the way that the R script is executed. Some people don't realize that they need to use mpirun/mpiexec, and the result is that the script runs on a single node. If you are using mpirun, it may be that your installation of Open MPI wasn't built with support for your batch queueing system. In that case, you would have to create an appropriate hostfile from information provided by your batch queueing system, usually via an environment variable and/or a file.</p>

<p>Here is a typical mpirun command that I use to execute my parallel R scripts from the job script:</p>

<pre><code>mpirun -np 1 R --slave -f par.R
</code></pre>

<p>Since we build Open MPI with support for Torque, I don't use the <code>--hostfile</code> option: mpirun figures out what nodes to use from the <code>PBS_NODEFILE</code> environment variable automatically. The use of <code>-np 1</code> may seem strange, but it is needed if your program is going to spawn workers, which is typically done when using the <code>snow</code> package. I've never used <code>snowfall</code>, but after looking over the source code, it appears to me that <code>sfInit</code> always calls <code>makeMPIcluster</code> with a "count" argument, which causes <code>snow</code> to spawn workers, so I think that <code>-np 1</code> is required for MPI clusters with <code>snowfall</code>. Otherwise, mpirun will start your R script on multiple nodes, and each one will spawn 10 workers on its own node, which is not what you want. The trick is to set the <code>sfInit</code> "cpus" argument to a value that is consistent with the number of nodes allocated to your job by the batch queueing system. You may find the <code>Rmpi</code> <code>mpi.universe.size</code> function useful for that.</p>

<p>If you think that all of this is done correctly, the problem may be with the way that the MPI cluster object is being created in your R script, but I suspect that it has to do with the use (or lack of use) of mpirun.</p>
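<p>To make that concrete, a minimal <code>par.R</code> along these lines might work. This is a sketch, not your actual script: the subtraction of 1 assumes that <code>mpi.universe.size</code> counts the master process among the available slots, and the <code>sfLapply</code> call is just a stand-in for your real work:</p>

<pre><code>library(Rmpi)      # provides mpi.universe.size()
library(snowfall)

# mpi.universe.size() reports the total number of process slots known
# to mpirun; one slot is occupied by this master process, so spawn
# one fewer worker (an assumption -- check against your allocation).
nworkers &lt;- mpi.universe.size() - 1

sfInit(parallel = TRUE, type = "MPI", cpus = nworkers)
result &lt;- sfLapply(1:100, function(i) i^2)  # placeholder workload
sfStop()
mpi.quit()
</code></pre>

<p>Launched via <code>mpirun -np 1 R --slave -f par.R</code>, this starts a single master that spawns one worker per remaining slot, rather than hard-coding a worker count that may disagree with what the batch queueing system allocated.</p>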
 
