Elastic Agent
==============

.. automodule:: torch.distributed.elastic.agent
.. currentmodule:: torch.distributed.elastic.agent

Server
--------

.. automodule:: torch.distributed.elastic.agent.server

Below is a diagram of an agent that manages a local group of workers.

.. image:: agent_diagram.jpg

Concepts
--------

This section describes the high-level classes and concepts that
are relevant to understanding the role of the ``agent`` in torchelastic.

.. currentmodule:: torch.distributed.elastic.agent.server

.. autoclass:: ElasticAgent
   :members:

.. autoclass:: WorkerSpec
   :members:

.. autoclass:: WorkerState
   :members:

.. autoclass:: Worker
   :members:

.. autoclass:: WorkerGroup
   :members:

Implementations
-------------------

Below are the agent implementations provided by torchelastic.

.. currentmodule:: torch.distributed.elastic.agent.server.local_elastic_agent
.. autoclass:: LocalElasticAgent


Extending the Agent
---------------------

To extend the agent, you can implement ``ElasticAgent`` directly. However,
we recommend extending ``SimpleElasticAgent`` instead, which provides
most of the scaffolding and leaves you with a few specific abstract methods
to implement.

.. currentmodule:: torch.distributed.elastic.agent.server
.. autoclass:: SimpleElasticAgent
   :members:
   :private-members:

.. autoclass:: torch.distributed.elastic.agent.server.api.RunResult


Watchdog in the Agent
---------------------

A named pipe based watchdog can be enabled in ``LocalElasticAgent`` if an
environment variable ``TORCHELASTIC_ENABLE_FILE_TIMER`` with value 1 has
been defined in the ``LocalElasticAgent`` process.
Optionally, another environment variable ``TORCHELASTIC_TIMER_FILE``
can be set with a unique file name for the named pipe. If the environment
variable ``TORCHELASTIC_TIMER_FILE`` is not set, ``LocalElasticAgent``
will internally create a unique file name and set it to the environment
variable ``TORCHELASTIC_TIMER_FILE``, and this environment variable will
be propagated to the worker processes to allow them to connect to the same
named pipe that ``LocalElasticAgent`` uses.

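
As a minimal sketch of the configuration described above, the environment
for the agent process could be prepared as follows. The helper name
``watchdog_env`` and the pipe file name are hypothetical and chosen only for
illustration; the two environment variable names are the ones documented in
this section.

```python
import os
import tempfile
import uuid
from typing import Dict, Optional


def watchdog_env(timer_file: Optional[str] = None) -> Dict[str, str]:
    """Return a copy of the current environment with the watchdog enabled.

    Hypothetical helper, not part of torchelastic.
    """
    env = dict(os.environ)
    # The value 1 enables the named pipe based watchdog in the agent process.
    env["TORCHELASTIC_ENABLE_FILE_TIMER"] = "1"
    if timer_file is not None:
        # Optional: when this is left unset, the agent is documented to
        # create a unique file name itself and pass it on to the workers.
        env["TORCHELASTIC_TIMER_FILE"] = timer_file
    return env


# Choose a unique path for the named pipe up front (illustrative file name).
pipe_path = os.path.join(tempfile.gettempdir(), f"watchdog_timer_{uuid.uuid4()}")
env = watchdog_env(pipe_path)
```

The resulting mapping can then be passed as the ``env`` argument of
``subprocess.run`` when starting whichever launcher creates the
``LocalElasticAgent``.
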

Health Check Server
-------------------

A health check monitoring server can be enabled in ``LocalElasticAgent``
if an environment variable ``TORCHELASTIC_HEALTH_CHECK_PORT`` has been defined
in the ``LocalElasticAgent`` process.
The module provides an interface for a health check server, which can be
extended by starting a TCP/HTTP server on the specified port number.
The health check server also receives a callback for checking that the
watchdog is alive.

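
To illustrate the idea of extending the interface with a TCP server, here is
a standard-library-only sketch. ``TcpHealthCheck`` and ``probe`` are
hypothetical names and do not subclass or import the torchelastic classes.
The sketch also assumes that the callback returns the time of the last
watchdog heartbeat in seconds since the epoch, and that ``timeout`` is the
maximum tolerated age of that heartbeat; check the API reference below for
the actual contract.

```python
import socket
import socketserver
import threading
import time
from typing import Callable


class TcpHealthCheck:
    """Hypothetical stand-in: answers each TCP connection with a status line."""

    def __init__(self, alive_callback: Callable[[], int], port: int, timeout: int) -> None:
        self._alive_callback = alive_callback
        self._timeout = timeout
        outer = self

        class Handler(socketserver.BaseRequestHandler):
            def handle(self) -> None:
                # Report the current status to whoever connected.
                self.request.sendall(outer.status().encode())

        # Passing port 0 lets the operating system pick a free port.
        self._server = socketserver.TCPServer(("127.0.0.1", port), Handler)
        self._thread = threading.Thread(target=self._server.serve_forever, daemon=True)

    @property
    def port(self) -> int:
        """Port the server is actually bound to."""
        return self._server.server_address[1]

    def status(self) -> str:
        # Assumption: the callback returns the last heartbeat as epoch seconds.
        age = time.time() - self._alive_callback()
        return "ok\n" if age <= self._timeout else "stale\n"

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._server.shutdown()
        self._server.server_close()


def probe(port: int) -> str:
    """Connect to the health check port and return the reported status."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as conn:
        return conn.makefile().readline().strip()
```

An external monitor could call ``probe`` periodically and treat anything
other than ``ok`` (including a refused connection) as an unhealthy agent.
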
.. automodule:: torch.distributed.elastic.agent.server.health_check_server

.. autoclass:: HealthCheckServer
   :members:

.. autofunction:: torch.distributed.elastic.agent.server.health_check_server.create_healthcheck_server