Elastic Agent
==============

.. automodule:: torch.distributed.elastic.agent
.. currentmodule:: torch.distributed.elastic.agent

Server
--------

.. automodule:: torch.distributed.elastic.agent.server

Below is a diagram of an agent that manages a local group of workers.

.. image:: agent_diagram.jpg

Concepts
--------

This section describes the high-level classes and concepts that
are relevant to understanding the role of the ``agent`` in torchelastic.

.. currentmodule:: torch.distributed.elastic.agent.server

.. autoclass:: ElasticAgent
    :members:

.. autoclass:: WorkerSpec
    :members:

.. autoclass:: WorkerState
    :members:

.. autoclass:: Worker
    :members:

.. autoclass:: WorkerGroup
    :members:

Implementations
-------------------

Below are the agent implementations provided by torchelastic.

.. currentmodule:: torch.distributed.elastic.agent.server.local_elastic_agent
.. autoclass:: LocalElasticAgent


Extending the Agent
---------------------

To extend the agent you can implement ``ElasticAgent`` directly; however,
we recommend extending ``SimpleElasticAgent`` instead, which provides
most of the scaffolding and leaves you with a few specific abstract methods
to implement.

.. currentmodule:: torch.distributed.elastic.agent.server
.. autoclass:: SimpleElasticAgent
    :members:
    :private-members:

.. autoclass:: torch.distributed.elastic.agent.server.api.RunResult


Watchdog in the Agent
---------------------

A named-pipe-based watchdog can be enabled in ``LocalElasticAgent`` by
defining the environment variable ``TORCHELASTIC_ENABLE_FILE_TIMER`` with
the value ``1`` in the ``LocalElasticAgent`` process.
Optionally, another environment variable, ``TORCHELASTIC_TIMER_FILE``,
can be set with a unique file name for the named pipe.
If the environment variable ``TORCHELASTIC_TIMER_FILE`` is not set,
``LocalElasticAgent`` internally creates a unique file name and assigns it
to ``TORCHELASTIC_TIMER_FILE``. This environment variable is then
propagated to the worker processes so that they can connect to the same
named pipe that ``LocalElasticAgent`` uses.


Health Check Server
-------------------

A health check monitoring server can be enabled in ``LocalElasticAgent``
by defining the environment variable ``TORCHELASTIC_HEALTH_CHECK_PORT``
in the ``LocalElasticAgent`` process.
The module below provides an interface for a health check server, which can
be extended to start a TCP/HTTP server on the specified port number.
Additionally, the health check server has a callback to check whether the
watchdog is alive.

.. automodule:: torch.distributed.elastic.agent.server.health_check_server

.. autoclass:: HealthCheckServer
    :members:

.. autofunction:: torch.distributed.elastic.agent.server.health_check_server.create_healthcheck_server
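
The environment variables described in the watchdog and health check
sections must be defined in the ``LocalElasticAgent`` process before the agent
is created. The following is a minimal sketch of one way to do that from
Python. It uses only the standard library; the helper name
``build_agent_env``, the pipe path and the port number ``8080`` are
illustrative assumptions, not part of torchelastic.

.. code-block:: python

    import os
    from typing import Dict, Optional


    def build_agent_env(
        enable_watchdog: bool = True,
        timer_file: Optional[str] = None,
        health_check_port: Optional[int] = None,
    ) -> Dict[str, str]:
        """Hypothetical helper that collects the variables named above."""
        env: Dict[str, str] = {}
        if enable_watchdog:
            # The watchdog is enabled when this variable has the value 1.
            env["TORCHELASTIC_ENABLE_FILE_TIMER"] = "1"
            if timer_file is not None:
                # Optional: when omitted, the agent picks a unique name itself.
                env["TORCHELASTIC_TIMER_FILE"] = timer_file
        if health_check_port is not None:
            # The health check server is enabled when this variable is defined.
            env["TORCHELASTIC_HEALTH_CHECK_PORT"] = str(health_check_port)
        return env


    # Illustrative values only: the pipe path and port are assumptions.
    env = build_agent_env(
        timer_file="/tmp/example_watchdog_timer",
        health_check_port=8080,
    )
    # Apply the variables to the current process before creating the agent.
    os.environ.update(env)

Leaving ``timer_file`` as ``None`` matches the default behavior described
above, where ``LocalElasticAgent`` generates the file name and propagates it
to the workers.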