Questions about checkpointing MPI jobs in Torque with BLCR
    <p>We're trying to use Torque to checkpoint MPI jobs, but it appears that Torque can only checkpoint jobs running on a single node. From reading the code: when a job is checkpointed with qhold, qhold sends a PBS_BATCH_HoldJob request to the PBS server, the PBS server relays the request to the master host, and the master host checkpoints the job processes running on itself with BLCR. The master host does not forward the request to its sister nodes, so multi-node MPI jobs do not seem to be checkpointable in Torque.</p> <p>A second problem: after the checkpoint succeeds (as reported by qhold), Torque sends signal 15 (SIGTERM) to the process on the master host to kill it, then copies the checkpoint file to pbs_server and removes all the local files. When qrls is used to restart the job, the scheduler allocates new nodes, copies the checkpoint file to them, and restarts the job from the checkpoint file. This raises two issues:</p> <ol> <li><p>Even assuming Torque could checkpoint the processes of an MPI job on every node, our jobs usually use a huge amount of memory, so the checkpoint files are very large, and the PBS server does not have a disk large enough to hold them.</p></li> <li><p>In our environment, before the MPI job starts, we pull some large metadata from another cluster directly onto the nodes allocated to the job. After checkpoint/restart, the job processes may resume on different nodes, where that metadata might be missing.</p></li> </ol> <p>Could someone tell me how you checkpoint MPI jobs? If solving this requires modifying the Torque code, I'm willing to do that.</p> <p>Thanks.</p>
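To make the relay behavior described above concrete, here is a toy model in Python. It is not Torque or BLCR code: the `Node` class, the `relay_hold` function, and the `forward_to_sisters` flag are all hypothetical names invented for illustration, and the rank layout is made up. It only shows which MPI ranks would be left without a checkpoint when the master host checkpoints its own processes and does not forward the hold request.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one compute node of a job."""
    name: str
    ranks: list                      # MPI ranks running on this node
    checkpointed: list = field(default_factory=list)

    def checkpoint_local(self):
        # Stand-in for a BLCR checkpoint of the processes on this node only.
        self.checkpointed = list(self.ranks)


def relay_hold(nodes, forward_to_sisters=False):
    """Model a hold request arriving at the master host (nodes[0]).

    Returns the sorted list of ranks that were NOT checkpointed.
    forward_to_sisters is an assumed switch, not a real Torque option.
    """
    master, sisters = nodes[0], nodes[1:]
    master.checkpoint_local()
    if forward_to_sisters:
        for sister in sisters:
            sister.checkpoint_local()
    saved = {rank for node in nodes for rank in node.checkpointed}
    all_ranks = {rank for node in nodes for rank in node.ranks}
    return sorted(all_ranks - saved)


def make_job():
    # Made-up layout: three nodes with two ranks each.
    return [
        Node("master", [0, 1]),
        Node("sister1", [2, 3]),
        Node("sister2", [4, 5]),
    ]


missing_now = relay_hold(make_job())
missing_if_forwarded = relay_hold(make_job(), forward_to_sisters=True)
print("ranks lost without forwarding:", missing_now)
print("ranks lost with forwarding:", missing_if_forwarded)
```

In this model, every rank outside the master host is lost unless the request is forwarded, which matches the behavior observed with qhold. A real fix would also need the ranks to be checkpointed at a consistent point in their communication, which this sketch does not attempt to model.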