Quickstart
==========

To launch a **fault-tolerant** job, run the following on all nodes.

.. code-block:: bash

    torchrun
        --nnodes=NUM_NODES
        --nproc-per-node=TRAINERS_PER_NODE
        --max-restarts=NUM_ALLOWED_FAILURES
        --rdzv-id=JOB_ID
        --rdzv-backend=c10d
        --rdzv-endpoint=HOST_NODE_ADDR
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

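For illustration, a hypothetical job spanning 4 nodes with 8 workers per node,
allowing up to 3 restarts, might be launched like this on every node (the job
id, host name, and training script below are placeholder values):

.. code-block:: bash

    torchrun
        --nnodes=4
        --nproc-per-node=8
        --max-restarts=3
        --rdzv-id=my_job_123
        --rdzv-backend=c10d
        --rdzv-endpoint=node1.example.com:29400
        train.py
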
To launch an **elastic** job, run the following on at least ``MIN_SIZE`` nodes
and at most ``MAX_SIZE`` nodes.

.. code-block:: bash

    torchrun
        --nnodes=MIN_SIZE:MAX_SIZE
        --nproc-per-node=TRAINERS_PER_NODE
        --max-restarts=NUM_ALLOWED_FAILURES_OR_MEMBERSHIP_CHANGES
        --rdzv-id=JOB_ID
        --rdzv-backend=c10d
        --rdzv-endpoint=HOST_NODE_ADDR
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

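As a sketch, the same hypothetical job as above, allowed to shrink to 1 node
and grow to 4 nodes, would only change the ``--nnodes`` argument to a range
(again, the job id, host name, and script are placeholders):

.. code-block:: bash

    torchrun
        --nnodes=1:4
        --nproc-per-node=8
        --max-restarts=3
        --rdzv-id=my_job_123
        --rdzv-backend=c10d
        --rdzv-endpoint=node1.example.com:29400
        train.py
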
.. note::
   TorchElastic models failures as membership changes. When a node fails,
   this is treated as a "scale down" event. When the failed node is replaced by
   the scheduler, it is a "scale up" event. Hence, for both fault-tolerant
   and elastic jobs, ``--max-restarts`` is used to control the total number of
   restarts before giving up, regardless of whether the restart was caused by
   a failure or by a scaling event.

``HOST_NODE_ADDR``, in the form ``<host>[:<port>]`` (e.g. ``node1.example.com:29400``),
specifies the node and the port on which the C10d rendezvous backend should be
instantiated and hosted. It can be any node in your training cluster, but
ideally you should pick a node that has high bandwidth.

.. note::
   If no port number is specified, ``HOST_NODE_ADDR`` defaults to ``<host>:29400``.

.. note::
   The ``--standalone`` option can be passed to launch a single-node job with a
   sidecar rendezvous backend. You don’t have to pass ``--rdzv-id``,
   ``--rdzv-endpoint``, and ``--rdzv-backend`` when the ``--standalone`` option
   is used.

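For instance, a single-node job with 4 workers per node could be launched with
just the following (the worker count is a placeholder):

.. code-block:: bash

    torchrun
        --standalone
        --nnodes=1
        --nproc-per-node=4
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
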

.. note::
   Learn more about writing your distributed training script
   `here <train_script.html>`_.

If ``torchrun`` does not meet your requirements, you may use our APIs directly
for more powerful customization. Start by taking a look at the
`elastic agent <agent.html>`_ API.
