# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""
In the context of Torch Distributed Elastic we use the term *rendezvous* to
refer to a particular functionality that combines a **distributed
synchronization** primitive with **peer discovery**.

It is used by Torch Distributed Elastic to gather participants of a training
job (i.e. nodes) such that they all agree on the same list of participants and
everyone's roles, as well as make a consistent collective decision on when
training can begin/resume.

Torch Distributed Elastic rendezvous provides the following critical
functionalities:

**Barrier**:

Nodes performing rendezvous will all block until the rendezvous is considered
complete - this happens when at least ``min`` nodes have joined the rendezvous
barrier (for the same job). This also implies the barrier is not necessarily of
fixed size.

There is an additional small waiting time after reaching ``min`` nodes - this is
used to ensure the rendezvous is not completed "too quickly" (which could
potentially exclude additional nodes attempting to join at approximately the
same time).

If ``max`` nodes are gathered at the barrier, the rendezvous is completed
immediately.

There is also an overall timeout which causes the rendezvous to fail if ``min``
nodes are never reached - this is meant to be a simple fail-safe to help release
partially allocated job resources in case there is a problem with the resource
manager, and is meant to be interpreted as non-retryable.

**Exclusivity**:

A simple distributed barrier would not be sufficient, as we also need to ensure
that only one group of nodes exists at any given time (for a given job). In
other words, new nodes (i.e. joining late) should not be able to form a parallel
independent group of workers for the same job.

Torch Distributed Elastic rendezvous ensures that if a group of nodes has
already completed a rendezvous (and hence might already be training), then
additional "late" nodes attempting to rendezvous will only announce themselves
as waiting, and will have to wait until the (previously completed) existing
rendezvous is destroyed first.

**Consistency**:

When a rendezvous is completed, all its members will agree on the job membership
and everyone's role in it. This role is represented by an integer, called the
*rank*, in the range ``[0, world_size)``.

Note that ranks are *not stable*, in the sense that the same node can be
assigned a different rank in the next (re-)rendezvous.

**Fault-tolerance**:

Torch Distributed Elastic rendezvous is designed to tolerate node failures
during the rendezvous process. Should a process crash (or lose network
connectivity, etc.) between joining the rendezvous and its completion, a
re-rendezvous with the remaining healthy nodes will happen automatically.

A node can also fail *after* it has completed (or *has been observed* by other
nodes to have completed) the rendezvous - this scenario is handled by the Torch
Distributed Elastic ``train_loop`` instead (where it will also trigger a
re-rendezvous).

**Shared key-value store**:

When the rendezvous is completed, a shared key-value store is created and
returned. This store implements the ``torch.distributed.Store`` API (see
`distributed communication docs
<https://pytorch.org/docs/stable/distributed.html>`__).

This store is only shared by the members of the completed rendezvous. It is
intended to be used by Torch Distributed Elastic to exchange information
necessary to initialize job control and data planes.

**Waiting workers and rendezvous closing**:

The Torch Distributed Elastic rendezvous handler provides additional
functionality that is technically not part of the rendezvous process
(illustrated in the sketch following this list):

1. Querying how many workers arrived late at the barrier and can therefore
   participate in the *next* rendezvous.

2. Setting the rendezvous *closed* to signal all nodes not to participate in
   the next rendezvous.

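As an illustration only (not part of the rendezvous contract), the following is
a minimal sketch of how a node might use these facilities. It assumes that
``handler`` is an already constructed :py:class:`.RendezvousHandler` (for
example the :py:class:`.DynamicRendezvousHandler` shown in the next section),
that ``next_rendezvous()`` returns a :py:class:`.RendezvousInfo` as in recent
releases, and that ``run_round`` is a hypothetical helper called by the node's
own training driver::

    def run_round(handler):
        # Block at the barrier until the rendezvous completes, then read the
        # assigned rank, the world size and the shared key-value store.
        info = handler.next_rendezvous()
        store, rank, world_size = info.store, info.rank, info.world_size

        # ... training runs here, using rank, world_size and store ...

        # Late arrivals are only waiting; a positive count is the caller's cue
        # to trigger a re-rendezvous that admits them.
        if handler.num_nodes_waiting() > 0:
            return

        # Otherwise mark the rendezvous closed so that no node participates in
        # any further (re-)rendezvous for this run.
        handler.set_closed()
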
**DynamicRendezvousHandler**:

Torch Distributed Elastic comes with the :py:class:`.DynamicRendezvousHandler`
class that implements the rendezvous mechanism described above. It is a
backend-agnostic type that expects a particular :py:class:`.RendezvousBackend`
instance to be specified during construction.

Torch distributed users can either implement their own backend type or use one
of the following implementations that come with PyTorch:

- :py:class:`.C10dRendezvousBackend`: Uses a C10d store (by default
  ``TCPStore``) as the rendezvous backend. The main advantage of using a C10d
  store is that it requires no 3rd-party dependency (such as etcd) to establish
  a rendezvous.
- :py:class:`.EtcdRendezvousBackend`: Supersedes the legacy
  :py:class:`.EtcdRendezvousHandler` class. Passing an
  :py:class:`.EtcdRendezvousBackend` instance to
  :py:class:`.DynamicRendezvousHandler` is functionally equivalent to
  instantiating an :py:class:`.EtcdRendezvousHandler`.

  ::

     from torch.distributed import TCPStore
     from torch.distributed.elastic.rendezvous.c10d_rendezvous_backend import (
         C10dRendezvousBackend,
     )
     from torch.distributed.elastic.rendezvous.dynamic_rendezvous import (
         DynamicRendezvousHandler,
     )

     # The node hosting the rendezvous starts the TCPStore server on a free port.
     store = TCPStore("localhost", 29400, is_master=True)

     backend = C10dRendezvousBackend(store, "my_run_id")

     rdzv_handler = DynamicRendezvousHandler.from_backend(
         run_id="my_run_id",
         store=store,
         backend=backend,
         min_nodes=2,
         max_nodes=4,
     )
"""

from .api import (
    rendezvous_handler_registry,
    RendezvousClosedError,
    RendezvousConnectionError,
    RendezvousError,
    RendezvousGracefulExitError,
    RendezvousHandler,
    RendezvousHandlerCreator,
    RendezvousHandlerRegistry,
    RendezvousInfo,
    RendezvousParameters,
    RendezvousStateError,
    RendezvousStoreInfo,
    RendezvousTimeoutError,
)
from .registry import _register_default_handlers


_register_default_handlers()


__all__ = [
    "RendezvousClosedError",
    "RendezvousConnectionError",
    "RendezvousError",
    "RendezvousGracefulExitError",
    "RendezvousHandler",
    "RendezvousHandlerCreator",
    "RendezvousHandlerRegistry",
    "RendezvousInfo",
    "RendezvousParameters",
    "RendezvousStateError",
    "RendezvousStoreInfo",
    "RendezvousTimeoutError",
    "rendezvous_handler_registry",
]