Using mpirun to run torchrun jobs

Posted August 28, 2024

Launch multi-node distributed torchrun jobs with mpirun

So the path of least resistance for launching distributed training is torchrun.

For example, you would launch a distributed job using a command like:

torchrun \
    --nnodes=$NUM_NODES \
    --nproc-per-node=$NUM_TRAINERS \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

The catch is that you need to run this command on EVERY SINGLE NODE (machine, not GPU).
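To make that concrete, here is roughly what the manual version looks like. This is a sketch assuming torchrun's default static rendezvous; $NODE0_IP is a placeholder for whichever node you pick as master:

# Run by hand on every node, changing --node-rank per machine
# (0 on the first node, 1 on the second, and so on)
torchrun \
    --nnodes=$NUM_NODES \
    --nproc-per-node=$NUM_TRAINERS \
    --node-rank=0 \
    --master-addr=$NODE0_IP \
    --master-port=29500 \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)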

Now we don’t want to ssh into more than one machine to launch, kill, or debug jobs. But we also don’t want to change any of our training code, especially when using the Hugging Face Trainer.

Well, this problem of multi-node job management is old and solved, and we will use a solution called OpenMPI. OpenMPI can be installed as part of your training image, and it comes bundled with the AWS Deep Learning AMI (DLAMI).
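If your image doesn’t have it already, installing is usually a one-liner. A sketch for Debian/Ubuntu; package names differ on other distros:

# Install OpenMPI (Debian/Ubuntu package names; adjust for your distro)
sudo apt-get update && sudo apt-get install -y openmpi-bin libopenmpi-dev

# Sanity check that mpirun is on PATH
mpirun --version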

Now OpenMPI does not work with PyTorch out of the box, and getting it to work requires some tweaking. Under mpirun, a process’s rank lives in environment variables like OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_LOCAL_RANK, while torch.distributed expects RANK, LOCAL_RANK, and WORLD_SIZE.
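To see the mismatch concretely: if you let mpirun launch one process per GPU, each process would have to translate OpenMPI’s variables into the ones torch.distributed reads before starting the script. A sketch of that translation, which the per-node trick below lets us skip entirely:

# Translation you'd need with one mpirun process per GPU (we avoid this)
export RANK=${OMPI_COMM_WORLD_RANK}             # global rank across all nodes
export LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK} # rank within this node
export WORLD_SIZE=${OMPI_COMM_WORLD_SIZE}       # total number of processes
# you'd also need MASTER_ADDR and MASTER_PORT, which OpenMPI does not set
python YOUR_TRAINING_SCRIPT.py ...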

So instead of having OpenMPI run a process for each GPU, we run one process per node and let torchrun spawn the per-GPU workers. And since the actual workers are launched by torchrun, our Hugging Face Trainer is none the wiser.

First, collect the IP addresses of all your nodes in a file. Let’s call it my-hosts.

# my-hosts file
192.10.40.41
192.10.40.42
192.10.40.43
192.10.40.44
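One prerequisite: mpirun starts its remote processes over SSH, so you need passwordless SSH from the launch node to every host in my-hosts. A cheap sanity check that every host is reachable and gets its own rank:

# Should print one line per node with a distinct rank (4 = number of hosts)
mpirun --hostfile my-hosts -np 4 -npernode 1 \
    bash -c 'echo "$(hostname): rank ${OMPI_COMM_WORLD_RANK} of ${OMPI_COMM_WORLD_SIZE}"'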

Then run the following command from any one machine (the launch node):

NUM_NODES=$(wc -l < my-hosts)

# change this to the number of GPUs on each node
export NUM_GPUS_PER_NODE=8

# the first host in my-hosts will act as the rendezvous host for torchrun
export MASTER_ADDR=$(head -n 1 my-hosts)

# the pml/btl flags force plain TCP between nodes and skip the loopback and
# docker0 interfaces; --bind-to none stops OpenMPI from pinning torchrun's
# children to a single core
mpirun \
-x LD_LIBRARY_PATH -x PATH \
-x NUM_GPUS_PER_NODE -x MASTER_ADDR \
--hostfile my-hosts -np ${NUM_NODES} -npernode 1 \
--mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 \
--bind-to none \
dist_train.sh TRAIN_ARGS_HERE

where the content of dist_train.sh (which must exist, and be executable, at the same path on every node) is:

#!/bin/bash
# dist_train.sh
set -euo pipefail

# OMPI_COMM_WORLD_SIZE equals the number of nodes (mpirun ran with -npernode 1).
# The rdzv flags point every node at the same rendezvous endpoint; without
# them, each torchrun would try to rendezvous on localhost and the nodes
# would never find each other.
torchrun \
    --nnodes=${OMPI_COMM_WORLD_SIZE} \
    --nproc-per-node=${NUM_GPUS_PER_NODE} \
    --rdzv-backend=c10d \
    --rdzv-endpoint=${MASTER_ADDR}:29500 \
    YOUR_TRAINING_SCRIPT.py "$@"

And that is it. A hacky solution, but it works. As a bonus, the whole job now hangs off a single mpirun process on the launch node, so one Ctrl-C tears down torchrun everywhere.