# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

7"""
8In the context of Torch Distributed Elastic we use the term *rendezvous* to
9refer to a particular functionality that combines a **distributed
10synchronization** primitive with **peer discovery**.
11
12It is used by Torch Distributed Elastic to gather participants of a training
13job (i.e. nodes) such that they all agree on the same list of participants and
14everyone's roles, as well as make a consistent collective decision on when
15training can begin/resume.
16
17Torch Distributed Elastic rendezvous provides the following critical
18functionalities:
19
**Barrier**:

Nodes performing rendezvous will all block until the rendezvous is considered
complete - this happens when at least ``min`` nodes have joined the rendezvous
barrier (for the same job). This also implies the barrier is not necessarily
of fixed size.

There's an additional short waiting period after ``min`` nodes have been
reached - this is used to ensure the rendezvous is not completed "too quickly"
(which could potentially exclude additional nodes attempting to join at
approximately the same time).

If ``max`` nodes gather at the barrier, the rendezvous is completed
immediately.

There's also an overall timeout which causes the rendezvous to fail if ``min``
nodes are never reached - this is meant to be a simple fail-safe to help
release partially allocated job resources in case there's a problem with the
resource manager, and is meant to be interpreted as non-retryable.

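In practice these knobs are supplied through the :py:class:`.RendezvousParameters`
bag exported by this package. A minimal sketch, assuming the ``join_timeout``
(overall timeout) and ``last_call_timeout`` (post-``min`` waiting time) keys
recognized by the built-in dynamic rendezvous; the endpoint, run id, and all
values are placeholders::

   from torch.distributed.elastic.rendezvous import RendezvousParameters

   params = RendezvousParameters(
       backend="c10d",              # name of a registered rendezvous backend
       endpoint="localhost:29400",  # placeholder backend endpoint
       run_id="my_run_id",          # placeholder job identifier
       min_nodes=2,                 # barrier can complete once 2 nodes joined
       max_nodes=4,                 # barrier completes immediately at 4 nodes
       join_timeout=600,            # assumed key: overall timeout, in seconds
       last_call_timeout=30,        # assumed key: extra wait after ``min``
   )
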
**Exclusivity**:

A simple distributed barrier would not be sufficient, as we also need to ensure
that only one group of nodes exists at any given time (for a given job). In
other words, new nodes (i.e. joining late) should not be able to form a
parallel independent group of workers for the same job.

Torch Distributed Elastic rendezvous ensures that if a group of nodes has
already completed a rendezvous (and hence might already be training), then
additional "late" nodes attempting to rendezvous will only announce themselves
as waiting, and will have to wait until the (previously completed) existing
rendezvous is destroyed first.

**Consistency**:

When a rendezvous is completed, all its members will agree on the job
membership and everyone's role in it. This role is represented using an
integer, called rank, that is between 0 and world size - 1.

Note that ranks are *not stable*, in the sense that the same node can be
assigned a different rank in the next (re-)rendezvous.

**Fault-tolerance**:

Torch Distributed Elastic rendezvous is designed to tolerate node failures
during the rendezvous process. Should a process crash (or lose network
connectivity, etc.) between joining the rendezvous and its completion, a
re-rendezvous with the remaining healthy nodes will happen automatically.

A node can also fail *after* it has completed (or *has been observed* by other
nodes to have completed) the rendezvous - this scenario will be handled by the
Torch Distributed Elastic ``train_loop`` instead (where it will also trigger a
re-rendezvous).

**Shared key-value store**:

When the rendezvous is completed, a shared key-value store is created and
returned. This store implements a ``torch.distributed.Store`` API (see
`distributed communication docs
<https://pytorch.org/docs/stable/distributed.html>`__).

This store is only shared by the members of the completed rendezvous. It
is intended to be used by Torch Distributed Elastic to exchange information
necessary to initialize job control and data-planes.

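For illustration, a hedged sketch of how members might use the returned store;
``store`` stands for the object returned by the completed rendezvous, and the
key name is an arbitrary example, not a reserved key::

   # Rank 0 publishes a value; other ranks block until it appears.
   if rank == 0:
       store.set("master_addr", "10.0.0.1")    # placeholder value
   else:
       store.wait(["master_addr"])             # blocks until the key is set
       master_addr = store.get("master_addr")  # returns the value as bytes
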
**Waiting workers and rendezvous closing**:

The Torch Distributed Elastic rendezvous handler object provides additional
functionalities which are technically not part of the rendezvous process (see
the sketch after this list):

1. Querying how many workers arrived late at the barrier and can therefore
   participate in the *next* rendezvous.

2. Setting the rendezvous *closed* to signal all nodes not to participate in
   the next rendezvous.

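A minimal sketch, assuming ``rdzv_handler`` is any :py:class:`.RendezvousHandler`
instance; ``num_nodes_waiting`` and ``set_closed`` are the handler methods
backing the two functionalities above::

   # 1. Late arrivals: if any nodes are waiting, the caller may choose to
   #    wind down its current work and re-rendezvous to admit them.
   if rdzv_handler.num_nodes_waiting() > 0:
       ...  # placeholder: finish the step, then rendezvous again

   # 2. Closing: mark the rendezvous closed so that no node participates
   #    in the next rendezvous.
   rdzv_handler.set_closed()
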
**DynamicRendezvousHandler**:

Torch Distributed Elastic comes with the :py:class:`.DynamicRendezvousHandler`
class that implements the rendezvous mechanism described above. It is a
backend-agnostic type that expects a particular :py:class:`.RendezvousBackend`
instance to be specified during construction.

Torch distributed users can either implement their own backend type or use one
of the following implementations that come with PyTorch:

- :py:class:`.C10dRendezvousBackend`: Uses a C10d store (by default
  ``TCPStore``) as the rendezvous backend. The main advantage of using a C10d
  store is that it requires no 3rd-party dependency (such as etcd) to establish
  a rendezvous.
- :py:class:`.EtcdRendezvousBackend`: Supersedes the legacy
  :py:class:`.EtcdRendezvousHandler` class. Passing an
  :py:class:`.EtcdRendezvousBackend` instance to
  :py:class:`.DynamicRendezvousHandler` is functionally equivalent to
  instantiating an :py:class:`.EtcdRendezvousHandler`.

  ::

     from torch.distributed import TCPStore
     from torch.distributed.elastic.rendezvous.c10d_rendezvous_backend import (
         C10dRendezvousBackend,
     )
     from torch.distributed.elastic.rendezvous.dynamic_rendezvous import (
         DynamicRendezvousHandler,
     )

     # TCPStore requires an explicit port; 29400 is an arbitrary free port,
     # and this process acts as the store server (is_master=True).
     store = TCPStore("localhost", 29400, is_master=True)

     backend = C10dRendezvousBackend(store, "my_run_id")

     rdzv_handler = DynamicRendezvousHandler.from_backend(
         run_id="my_run_id",
         store=store,
         backend=backend,
         min_nodes=2,
         max_nodes=4,
     )
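
Once constructed, the handler is typically driven by an elastic agent, which
calls ``next_rendezvous()`` to join the barrier. A sketch, assuming the current
API in which the call returns a :py:class:`.RendezvousInfo`::

   rdzv_info = rdzv_handler.next_rendezvous()  # blocks until rendezvous completes
   store = rdzv_info.store            # the shared key-value store described above
   rank = rdzv_info.rank              # this node's rank, in [0, world_size)
   world_size = rdzv_info.world_size  # number of members in this rendezvous
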
129"""
130
from .api import (
    rendezvous_handler_registry,
    RendezvousClosedError,
    RendezvousConnectionError,
    RendezvousError,
    RendezvousGracefulExitError,
    RendezvousHandler,
    RendezvousHandlerCreator,
    RendezvousHandlerRegistry,
    RendezvousInfo,
    RendezvousParameters,
    RendezvousStateError,
    RendezvousStoreInfo,
    RendezvousTimeoutError,
)
from .registry import _register_default_handlers


_register_default_handlers()


__all__ = [
    "RendezvousClosedError",
    "RendezvousConnectionError",
    "RendezvousError",
    "RendezvousGracefulExitError",
    "RendezvousHandler",
    "RendezvousHandlerCreator",
    "RendezvousHandlerRegistry",
    "RendezvousInfo",
    "RendezvousParameters",
    "RendezvousStateError",
    "RendezvousStoreInfo",
    "RendezvousTimeoutError",
    "rendezvous_handler_registry",
]