Customization
=============

This section describes how to customize TorchElastic to fit your needs.

Launcher
------------------------

The launcher program that ships with TorchElastic
should be sufficient for most use-cases (see :ref:`launcher-api`).
You can implement a custom launcher by
programmatically creating an agent and passing it specs for your workers as
shown below.

.. code-block:: python

    # my_launcher.py

    if __name__ == "__main__":
        args = parse_args(sys.argv[1:])
        rdzv_handler = RendezvousHandler(...)
        spec = WorkerSpec(
            local_world_size=args.nproc_per_node,
            fn=trainer_entrypoint_fn,
            args=(args.fn_args, ...),
            rdzv_handler=rdzv_handler,
            max_restarts=args.max_restarts,
            monitor_interval=args.monitor_interval,
        )

        agent = LocalElasticAgent(spec, start_method="spawn")
        try:
            run_result = agent.run()
            if run_result.is_failed():
                print(f"worker 0 failed with: {run_result.failures[0]}")
            else:
                print(f"worker 0 return value is: {run_result.return_values[0]}")
        except Exception as ex:
            # handle exception
            ...


Rendezvous Handler
------------------------

To implement your own rendezvous, extend ``torch.distributed.elastic.rendezvous.RendezvousHandler``
and implement its methods.

.. warning:: Rendezvous handlers are tricky to implement. Before you begin,
   make sure you completely understand the properties of rendezvous.
   Please refer to :ref:`rendezvous-api` for more information.

Once implemented, you can pass your custom rendezvous handler to the worker
spec when creating the agent.

.. code-block:: python

    spec = WorkerSpec(
        rdzv_handler=MyRendezvousHandler(params),
        ...
    )
    elastic_agent = LocalElasticAgent(spec, start_method=start_method)
    elastic_agent.run(spec.role)


Metric Handler
-----------------------------

TorchElastic emits platform-level metrics (see :ref:`metrics-api`).
By default, metrics are emitted to ``/dev/null``, so you will not see them.
To have the metrics pushed to a metric handling service in your infrastructure,
implement a ``torch.distributed.elastic.metrics.MetricHandler`` and ``configure`` it in your
custom launcher.

.. code-block:: python

    # my_launcher.py

    import torch.distributed.elastic.metrics as metrics

    class MyMetricHandler(metrics.MetricHandler):
        def emit(self, metric_data: metrics.MetricData):
            # push metric_data to your metric sink
            ...

    def main():
        metrics.configure(MyMetricHandler())

        spec = WorkerSpec(...)
        agent = LocalElasticAgent(spec)
        agent.run()

Events Handler
-----------------------------

TorchElastic supports events recording (see :ref:`events-api`).
The events module defines an API that allows you to record events and
implement custom event handlers. An ``EventHandler`` is used for publishing events
produced during TorchElastic execution to different destinations, e.g. AWS CloudWatch.
By default, TorchElastic uses ``torch.distributed.elastic.events.NullEventHandler``, which ignores
events. To configure a custom events handler, implement the
``torch.distributed.elastic.events.EventHandler`` interface and ``configure`` it
in your custom launcher.

.. code-block:: python

    # my_launcher.py

    import torch.distributed.elastic.events as events

    class MyEventHandler(events.EventHandler):
        def record(self, event: events.Event):
            # process event
            ...

    def main():
        events.configure(MyEventHandler())

        spec = WorkerSpec(...)
        agent = LocalElasticAgent(spec)
        agent.run()
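The metric and event handlers above share the same design: a module-level ``configure()`` installs a process-wide handler object, and later emit/record calls dispatch to whichever handler is installed (a null handler by default). The following torch-free sketch illustrates that pattern; all names here (``Event``, ``EventHandler``, ``configure``, ``record``) are illustrative stand-ins modeled on the TorchElastic API, not the actual classes.

.. code-block:: python

    # handler_pattern_sketch.py -- illustrative only, not the TorchElastic implementation
    from dataclasses import dataclass, field
    from typing import List


    @dataclass
    class Event:
        name: str
        metadata: dict = field(default_factory=dict)


    class EventHandler:
        def record(self, event: Event) -> None:
            raise NotImplementedError


    class NullEventHandler(EventHandler):
        """Default handler: silently drops every event."""
        def record(self, event: Event) -> None:
            pass


    _handler: EventHandler = NullEventHandler()


    def configure(handler: EventHandler) -> None:
        # Install a process-wide handler; subsequent record() calls go to it.
        global _handler
        _handler = handler


    def record(event: Event) -> None:
        _handler.record(event)


    class ListEventHandler(EventHandler):
        """Example sink that buffers events in memory instead of publishing them."""
        def __init__(self) -> None:
            self.events: List[Event] = []

        def record(self, event: Event) -> None:
            self.events.append(event)


    handler = ListEventHandler()
    configure(handler)                                   # as in events.configure(...)
    record(Event(name="worker.start", metadata={"rank": 0}))
    print(handler.events[0].name)  # -> worker.start

Swapping ``ListEventHandler`` for a class that forwards to your monitoring service is all a real handler needs to do; the launcher wires it in once via ``configure()`` before the agent runs.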