Customization
=============

This section describes how to customize TorchElastic to fit your needs.

Launcher
------------------------

The launcher program that ships with TorchElastic
should be sufficient for most use-cases (see :ref:`launcher-api`).
You can implement a custom launcher by
programmatically creating an agent and passing it specs for your workers as
shown below.

.. code-block:: python

  # my_launcher.py

  import sys

  if __name__ == "__main__":
    args = parse_args(sys.argv[1:])
    rdzv_handler = RendezvousHandler(...)
    spec = WorkerSpec(
        local_world_size=args.nproc_per_node,
        fn=trainer_entrypoint_fn,
        args=(args.fn_args, ...),
        rdzv_handler=rdzv_handler,
        max_restarts=args.max_restarts,
        monitor_interval=args.monitor_interval,
    )

    agent = LocalElasticAgent(spec, start_method="spawn")
    try:
        run_result = agent.run()
        if run_result.is_failed():
            print(f"worker 0 failed with: {run_result.failures[0]}")
        else:
            print(f"worker 0 return value is: {run_result.return_values[0]}")
    except Exception as ex:
        # handle exception
        ...

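The ``trainer_entrypoint_fn`` above is an ordinary function that each worker
process executes; its return value becomes that worker's entry in
``run_result.return_values``. A minimal sketch is shown below (the function
body and argument are illustrative assumptions, not part of the TorchElastic
API); the agent exports variables such as ``RANK``, ``WORLD_SIZE``, and
``LOCAL_RANK`` into every worker process it starts.

.. code-block:: python

  import os

  # Hypothetical worker entrypoint for the launcher sketch above.
  def trainer_entrypoint_fn(*fn_args):
      # The elastic agent sets these variables in each worker process.
      rank = int(os.environ.get("RANK", 0))
      world_size = int(os.environ.get("WORLD_SIZE", 1))
      # ... initialize a process group, build the model, run the loop ...
      # The return value surfaces in run_result.return_values[rank].
      return {"rank": rank, "world_size": world_size}
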

Rendezvous Handler
------------------------

To implement your own rendezvous, extend ``torch.distributed.elastic.rendezvous.RendezvousHandler``
and implement its methods.

.. warning:: Rendezvous handlers are tricky to implement. Before you begin,
          make sure you completely understand the properties of rendezvous.
          Please refer to :ref:`rendezvous-api` for more information.
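
A skeleton of such a handler is sketched below. To stay self-contained, the
sketch does not actually subclass ``RendezvousHandler`` (in real code you
would), and you should verify the method names against the interface shipped
in your installed version of torch.

.. code-block:: python

  # Sketch of a custom rendezvous handler. In real code, extend
  # torch.distributed.elastic.rendezvous.RendezvousHandler; the base
  # class is omitted here so the sketch runs stand-alone.
  class MyRendezvousHandler:
      def __init__(self, run_id: str):
          self._run_id = run_id
          self._closed = False

      def get_backend(self) -> str:
          return "my-backend"

      def get_run_id(self) -> str:
          return self._run_id

      def next_rendezvous(self):
          # Block until a stable set of nodes has gathered, then
          # return the rendezvous result for this node.
          raise NotImplementedError

      def is_closed(self) -> bool:
          return self._closed

      def set_closed(self) -> None:
          self._closed = True

      def num_nodes_waiting(self) -> int:
          # Number of nodes that arrived after the rendezvous completed.
          return 0

      def shutdown(self) -> bool:
          self.set_closed()
          return True
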

Once implemented, you can pass your custom rendezvous handler to the worker
spec when creating the agent.

.. code-block:: python

    spec = WorkerSpec(
        rdzv_handler=MyRendezvousHandler(params),
        ...
    )
    elastic_agent = LocalElasticAgent(spec, start_method=start_method)
    elastic_agent.run(spec.role)


Metric Handler
-----------------------------

TorchElastic emits platform-level metrics (see :ref:`metrics-api`).
By default, metrics are emitted to ``/dev/null``, so you will not see them.
To have the metrics pushed to a metric handling service in your infrastructure,
implement a ``torch.distributed.elastic.metrics.MetricHandler`` and ``configure`` it in your
custom launcher.

.. code-block:: python

  # my_launcher.py

  import torch.distributed.elastic.metrics as metrics

  class MyMetricHandler(metrics.MetricHandler):
      def emit(self, metric_data: metrics.MetricData):
          # push metric_data to your metric sink
          ...

  def main():
    metrics.configure(MyMetricHandler())

    spec = WorkerSpec(...)
    agent = LocalElasticAgent(spec)
    agent.run()

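For example, ``emit`` could write each datapoint as one JSON object per line
for a log shipper to pick up. The sketch below is stand-alone: it redefines
``MetricData`` (in torch it is a namedtuple carrying a timestamp, group name,
metric name, and value) and the handler does not subclass
``metrics.MetricHandler``, so it runs without torch.

.. code-block:: python

  import json
  from collections import namedtuple

  # Stand-in for torch.distributed.elastic.metrics.MetricData so the
  # sketch runs without torch; the field names mirror the real type.
  MetricData = namedtuple("MetricData", ["timestamp", "group_name", "name", "value"])

  class JsonLineMetricHandler:  # real code: subclass metrics.MetricHandler
      def __init__(self, stream):
          self._stream = stream

      def emit(self, metric_data: MetricData) -> None:
          # One JSON object per line is easy to tail and ship.
          self._stream.write(json.dumps(metric_data._asdict()) + "\n")
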
Events Handler
-----------------------------

TorchElastic supports event recording (see :ref:`events-api`).
The events module defines an API that allows you to record events and
implement a custom event handler. The event handler is used for publishing events
produced during torchelastic execution to different sinks, e.g. AWS CloudWatch.
By default, the ``torch.distributed.elastic.events.NullEventHandler`` is used, which ignores
events. To configure a custom event handler, implement the
``torch.distributed.elastic.events.EventHandler`` interface and ``configure`` it
in your custom launcher.

.. code-block:: python

  # my_launcher.py

  import torch.distributed.elastic.events as events

  class MyEventHandler(events.EventHandler):
      def record(self, event: events.Event):
          # process event
          ...

  def main():
    events.configure(MyEventHandler())

    spec = WorkerSpec(...)
    agent = LocalElasticAgent(spec)
    agent.run()

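As a concrete example, ``record`` could forward each event to the standard
``logging`` module. The sketch below is stand-alone: it does not subclass
``events.EventHandler`` and treats the event as an opaque object, so adapt
it to the real interface in your launcher.

.. code-block:: python

  import logging

  class LoggingEventHandler:  # real code: subclass events.EventHandler
      """Forward each recorded event to a standard library logger."""

      def __init__(self, logger_name: str = "torchelastic.events"):
          self._logger = logging.getLogger(logger_name)

      def record(self, event) -> None:
          self._logger.info("%s", event)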