Quickstart
===========

To launch a **fault-tolerant** job, run the following on all nodes.

.. code-block:: bash

    torchrun \
        --nnodes=NUM_NODES \
        --nproc-per-node=TRAINERS_PER_NODE \
        --max-restarts=NUM_ALLOWED_FAILURES \
        --rdzv-id=JOB_ID \
        --rdzv-backend=c10d \
        --rdzv-endpoint=HOST_NODE_ADDR \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

To launch an **elastic** job, run the following on at least ``MIN_SIZE`` nodes
and at most ``MAX_SIZE`` nodes.

.. code-block:: bash

    torchrun \
        --nnodes=MIN_SIZE:MAX_SIZE \
        --nproc-per-node=TRAINERS_PER_NODE \
        --max-restarts=NUM_ALLOWED_FAILURES_OR_MEMBERSHIP_CHANGES \
        --rdzv-id=JOB_ID \
        --rdzv-backend=c10d \
        --rdzv-endpoint=HOST_NODE_ADDR \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

.. note::
    TorchElastic models failures as membership changes. When a node fails,
    this is treated as a "scale down" event. When the failed node is replaced by
    the scheduler, it is a "scale up" event. Hence, for both fault-tolerant
    and elastic jobs, ``--max-restarts`` controls the total number of
    restarts before giving up, regardless of whether a restart was caused
    by a failure or by a scaling event.

``HOST_NODE_ADDR``, in the form ``<host>[:<port>]`` (e.g. ``node1.example.com:29400``),
specifies the node and the port on which the C10d rendezvous backend should be
instantiated and hosted. It can be any node in your training cluster, but
ideally you should pick a node with high bandwidth.

.. note::
    If no port number is specified, the port in ``HOST_NODE_ADDR`` defaults to 29400.

.. note::
    The ``--standalone`` option can be passed to launch a single-node job with a
    sidecar rendezvous backend. You don’t have to pass ``--rdzv-id``,
    ``--rdzv-endpoint``, and ``--rdzv-backend`` when the ``--standalone`` option
    is used.

.. note::
    Learn more about writing your distributed training script
    `here <train_script.html>`_.
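
The launch commands above differ only in the value passed to ``--nnodes``, so
they can be assembled programmatically. The following is a minimal sketch, not
part of ``torchrun`` itself: the helper names ``normalize_endpoint`` and
``build_torchrun_command``, the script name ``train.py``, and the job id
``job42`` are hypothetical and chosen for illustration only. It assumes a plain
host name or IPv4 address (no IPv6 literals) and fills in the default port
29400 explicitly when ``HOST_NODE_ADDR`` omits it.

```python
import shlex

DEFAULT_RDZV_PORT = 29400  # default port stated in the note above


def normalize_endpoint(host_node_addr):
    """Return "host:port", appending the default port when none is given.

    Hypothetical helper; assumes a host name or IPv4 address, not IPv6.
    """
    host, sep, port = host_node_addr.partition(":")
    if not host:
        raise ValueError("HOST_NODE_ADDR must include a host name")
    if sep and not port.isdigit():
        raise ValueError(f"invalid port in {host_node_addr!r}")
    return f"{host}:{port if sep else DEFAULT_RDZV_PORT}"


def build_torchrun_command(script, nnodes, nproc_per_node, max_restarts,
                           job_id, host_node_addr, script_args=()):
    """Assemble the torchrun argument list shown in the quickstart.

    nnodes is either a fixed count (fault-tolerant job) or a
    "MIN_SIZE:MAX_SIZE" string (elastic job).
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc-per-node={nproc_per_node}",
        f"--max-restarts={max_restarts}",
        f"--rdzv-id={job_id}",
        "--rdzv-backend=c10d",
        f"--rdzv-endpoint={normalize_endpoint(host_node_addr)}",
        script,
        *script_args,
    ]


if __name__ == "__main__":
    # Elastic job on 2 to 4 nodes; all values here are illustrative.
    cmd = build_torchrun_command(
        "train.py", "2:4", 8, 3, "job42", "node1.example.com",
        script_args=["--epochs", "1"],
    )
    print(shlex.join(cmd))
```

The resulting list can be passed to ``subprocess.run`` on each node. Passing an
integer such as ``4`` for ``nnodes`` yields the fault-tolerant form instead of
the elastic one.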

If ``torchrun`` does not meet your requirements, you may use our APIs directly
for more powerful customization. Start by taking a look at the
`elastic agent <agent.html>`_ API.