.. _distributed-autograd-design:

Distributed Autograd Design
===========================
This note presents the detailed design for distributed autograd and walks
through its internals. Make sure you're familiar with
:ref:`autograd-mechanics` and the :ref:`distributed-rpc-framework` before
proceeding.
The main motivation behind distributed autograd is to enable running a backward
pass on models whose forward pass spans multiple workers via RPC, and to record
appropriate gradients for all tensors that require them.
Autograd recording during the forward pass
-------------------------------------------
PyTorch builds the autograd graph during the forward pass and this graph is
used to execute the backward pass. For more details see
:ref:`how-autograd-encodes-history`.

For distributed autograd, we need to keep track of all RPCs during the forward
pass and ensure they are accounted for during the backward pass. For this purpose,
we attach ``send`` and ``recv`` functions to the autograd graph when we perform
an RPC. The ``send`` function is attached to the source of the RPC and its output
edges point to the autograd function for the input tensors of the RPC. The ``recv``
function is attached to the destination of the RPC, and during the backward pass
the gradients flowing into it are sent back to the source worker, where they become
the input of the corresponding ``send`` function. Each ``send``-``recv`` pair is
assigned a globally unique identifier so that the matching function can be looked
up on the remote worker during the backward pass.
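The following sketch is the kind of forward pass we use as the running example
below (a minimal sketch: the tensor names and the ``"worker1"`` destination are
illustrative, and ``rpc.init_rpc`` is assumed to have been called on both workers):

.. code::

    import torch
    import torch.distributed.rpc as rpc

    # On worker 0:
    t1 = torch.rand((3, 3), requires_grad=True)
    t2 = torch.rand((3, 3), requires_grad=True)

    # Perform some computation remotely on worker 1. Sending t1/t2 attaches a
    # send function on worker 0 and a recv function on worker 1; the response
    # attaches another send/recv pair in the reverse direction.
    t3 = rpc.rpc_sync("worker1", torch.add, args=(t1, t2))

    # Perform some computation locally based on the remote result.
    t4 = torch.rand((3, 3), requires_grad=True)
    t5 = torch.mul(t3, t4)

    # Compute some loss.
    loss = t5.sum()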
As an example, this is what the autograd graph for our example above would look
like: the local graphs on ``worker 0`` and ``worker 1`` are stitched together by
a ``send``/``recv`` pair for the RPC arguments and another ``send``/``recv``
pair for the RPC response.
Distributed Autograd Context
----------------------------
Each forward and backward pass that uses distributed autograd is assigned a
unique :class:`torch.distributed.autograd.context`, and this context has a
globally unique identifier. The context is created on each worker as needed and
serves the following purposes:

1. Multiple workers running distributed backward passes might accumulate
   gradients on the same tensor, so the ``.grad`` field of the tensor would mix
   gradients from a variety of distributed backward passes before we have the
   opportunity to run the optimizer. This is similar to
   calling :meth:`torch.autograd.backward` multiple times locally. In order to
   separate out the gradients of each backward pass, the
   gradients are accumulated in the :class:`torch.distributed.autograd.context`
   for that pass instead.
2. During the forward pass we store the ``send`` and ``recv`` functions for
   each autograd pass in this context. This ensures we hold references to the
   appropriate nodes in the autograd graph to keep it alive, and it makes it
   easy to look up the matching ``send`` and ``recv`` functions during the
   backward pass.
3. In general, we also use this context to store some metadata for each
   distributed autograd pass.
From the user's perspective, the autograd context is set up as follows:

.. code::

    import torch.distributed.autograd as dist_autograd

    with dist_autograd.context() as context_id:
        # Forward pass for some model defined elsewhere.
        loss = model.forward()
        dist_autograd.backward(context_id, [loss])
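To illustrate the first point above, here is a minimal single-worker sketch
(it assumes ``MASTER_ADDR`` and ``MASTER_PORT`` are set in the environment so
RPC can initialize) showing that gradients end up in the context rather than in
the ``.grad`` field:

.. code::

    import torch
    import torch.distributed.rpc as rpc
    import torch.distributed.autograd as dist_autograd

    # A single-worker RPC setup, just to be able to use distributed autograd.
    rpc.init_rpc("worker0", rank=0, world_size=1)

    t = torch.rand((3, 3), requires_grad=True)

    with dist_autograd.context() as context_id:
        loss = (t * 2).sum()
        dist_autograd.backward(context_id, [loss])

        # Gradients are stored per context, not on the tensor itself.
        grads = dist_autograd.get_gradients(context_id)
        print(t.grad)    # None
        print(grads[t])  # a 3x3 tensor of 2s

    rpc.shutdown()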
It is important to note that the model's forward pass must be invoked within
the distributed autograd context manager, as a valid context is needed to
ensure that all ``send`` and ``recv`` functions are stored properly so the
backward pass can be run across all participating workers.
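Distributed Backward Pass
-------------------------

Before looking at the distributed case, it helps to recall how the local
autograd engine computes dependencies. Consider the following single-machine
sketch (the variable names are illustrative):

.. code::

    import torch

    a = torch.rand((3, 3), requires_grad=True)
    b = torch.rand((3, 3), requires_grad=True)
    c = torch.rand((3, 3), requires_grad=True)

    d = a + b
    e = b * c
    d.sum().backward()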
The autograd graph for the code above contains an ``add`` function (producing
``d``) and a ``mul`` function (producing ``e``), but the backward pass is
started only from ``d``.
The first step the autograd engine performs as part of the backward pass is
computing the number of dependencies for each node in the autograd graph. This
helps the autograd engine know when a node in the graph is ready for execution.
In the example above, the ``add`` node needs one input during the backward
pass, whereas the ``mul`` node doesn't need any inputs (in other
words, it doesn't need to be executed). The local autograd engine computes these
dependencies by traversing the graph from the root nodes (``d`` in this case).
The fact that certain nodes in the autograd graph might not be executed in the
backward pass poses a challenge for distributed autograd.
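Consider a distributed variant of the code above which uses RPC (again a
minimal sketch: the tensor names and the ``"worker1"`` destination are
illustrative):

.. code::

    import torch
    import torch.distributed.rpc as rpc

    # On worker 0:
    a = torch.rand((3, 3), requires_grad=True)
    b = torch.rand((3, 3), requires_grad=True)
    c = torch.rand((3, 3), requires_grad=True)

    d = rpc.rpc_sync("worker1", torch.add, args=(a, b))
    e = rpc.rpc_sync("worker1", torch.mul, args=(b, c))
    loss = d.sum()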
The associated autograd graph for the code above spans both workers, with a
``send``/``recv`` pair recorded for each RPC; however, only the pair associated
with ``d`` actually participates in the backward pass for ``loss``.
Computing dependencies of this distributed autograd graph is much more
challenging and requires some overhead, either in terms of computation or
network communication.

For performance sensitive applications we can avoid a lot of this overhead by
assuming that every ``send`` and ``recv`` function is a valid part of the
backward pass (most applications don't perform RPCs that aren't
used). This simplifies the distributed autograd algorithm and is much more
efficient, at the cost that the application needs to be aware of this
limitation. This algorithm is called the FAST mode algorithm and is described
in detail below.
FAST mode algorithm
-------------------

The key assumption of this algorithm is that each ``send`` function has a
dependency of 1 when we run a backward pass; in other words, we assume we'll
receive a gradient over RPC from another worker. The algorithm is as follows:

1. We start from the worker which holds the roots of the backward pass
   (all roots must be local).
2. Look up all the ``send`` functions for the current
   `Distributed Autograd Context`_.
3. Compute dependencies locally, starting from the provided roots and all the
   ``send`` functions we retrieved.
4. After computing dependencies, kick off the local autograd engine with the
   provided roots and all the ``send`` functions.
5. When the autograd engine executes a ``recv`` function, the ``recv``
   function sends the input gradients via RPC to the appropriate worker, along
   with the identifiers needed to look up the matching ``send`` function there.
6. If this is the first time the remote worker has received a request for this
   autograd context, it computes its local dependencies as described in
   steps 1-3.
7. The matching ``send`` function is then enqueued for execution on the
   local autograd engine for that worker.
8. Finally, instead of accumulating the gradients on the ``.grad`` field of the
   Tensor, we accumulate them separately per
   `Distributed Autograd Context`_. The gradients are stored in a
   ``Dict[Tensor, Tensor]``, which is basically a map from Tensor to its
   associated gradient, and this map can be retrieved using the
   :meth:`~torch.distributed.autograd.get_gradients` API.
As an example, the complete code with distributed autograd would be as follows:

.. code::

    import torch
    import torch.distributed.autograd as dist_autograd
    import torch.distributed.rpc as rpc

    # On worker 0:

    # Setup the autograd context. Computations that take
    # part in the distributed backward pass must be within
    # the distributed autograd context manager.
    with dist_autograd.context() as context_id:
        t1 = torch.rand((3, 3), requires_grad=True)
        t2 = torch.rand((3, 3), requires_grad=True)

        # Perform some computation remotely.
        t3 = rpc.rpc_sync("worker1", torch.add, args=(t1, t2))

        # Perform some computation locally based on the remote result.
        t4 = torch.rand((3, 3), requires_grad=True)
        t5 = torch.mul(t3, t4)

        # Compute some loss.
        loss = t5.sum()

        # Run the backward pass.
        dist_autograd.backward(context_id, [loss])

        # Retrieve the gradients from the context.
        grads = dist_autograd.get_gradients(context_id)
The distributed autograd graph with dependencies would be as follows
(``t5.sum()`` is excluded for simplicity): the RPC produces a
``send1``/``recv1`` pair for the arguments sent from ``Worker 0`` to
``Worker 1``, and a ``send2``/``recv2`` pair for the result sent back.

Applying the FAST mode algorithm described above to this example works as follows:

1. On ``Worker 0`` we start from the roots ``loss`` and ``send1`` to compute
   dependencies. As a result, ``send1`` and the ``mul`` function on ``Worker 0``
   are each marked with a dependency of 1.
2. Now, we kick off the local autograd engine on ``Worker 0``. We first execute
   the ``mul`` function and accumulate its output in the autograd context as the
   gradient for ``t4``. Then, we execute ``recv2``, which sends the gradients
   to ``Worker 1``.
3. Since this is the first time ``Worker 1`` has heard about this backward pass,
   it starts dependency computation and marks the dependencies for ``send2``,
   ``add`` and ``recv1`` appropriately.
4. Next, we enqueue ``send2`` on the local autograd engine of ``Worker 1``, which
   in turn executes the ``add`` and ``recv1`` functions.
5. When ``recv1`` is executed, it sends the gradients to ``Worker 0``.
6. Since ``Worker 0`` has already computed dependencies for this backward pass,
   it just enqueues and executes ``send1`` locally.
7. Finally, gradients for ``t1``, ``t2`` and ``t4`` are accumulated in the
   `Distributed Autograd Context`_.
SMART mode algorithm
--------------------

In the general case it might not be true that every ``send`` and ``recv``
function is a valid part of the backward pass. To handle this, a SMART mode
algorithm has been proposed. Full details of this algorithm are still in the
works, but for the general idea you can refer to the **Distributed Autograd
Algorithm Smart mode** section in the distributed autograd RFC. Note that only
the FAST mode algorithm is implemented at the moment.
Distributed Optimizer
---------------------

The :class:`~torch.distributed.optim.DistributedOptimizer` takes a list of
remote references (``RRef``) to the parameters to optimize together with a
local optimizer class, and creates an instance of that local optimizer on each
worker that owns parameters. When
:meth:`~torch.distributed.optim.DistributedOptimizer.step` is invoked, the
distributed optimizer uses RPC to remotely execute all the local
optimizers on the appropriate remote workers. A distributed autograd
``context_id`` must be provided as input to
:meth:`~torch.distributed.optim.DistributedOptimizer.step`; the local
optimizers use it to apply the gradients stored in the corresponding context.
Simple end to end example
-------------------------

Putting it all together, the following is a simple end to end example using
distributed autograd and the distributed optimizer. If the code is placed into a
file and launched, it spawns two workers that talk to each other over RPC (the
sketch below assumes ``MASTER_ADDR`` and ``MASTER_PORT`` are set in the
environment for the RPC rendezvous):
.. code::

    import torch
    import torch.multiprocessing as mp
    import torch.distributed.autograd as dist_autograd
    import torch.distributed.rpc as rpc
    from torch import optim
    from torch.distributed.optim import DistributedOptimizer


    def random_tensor():
        return torch.rand((3, 3), requires_grad=True)


    def run_worker(rank, world_size):
        name = "worker{}".format(rank)
        dst_name = "worker{}".format((rank + 1) % world_size)

        # Initialize RPC.
        rpc.init_rpc(name=name, rank=rank, world_size=world_size)

        # Use a distributed autograd context.
        with dist_autograd.context() as context_id:
            # Forward pass (create references to tensors on the remote worker).
            rref1 = rpc.remote(dst_name, random_tensor)
            rref2 = rpc.remote(dst_name, random_tensor)
            loss = rref1.to_here() + rref2.to_here()

            # Backward pass (run distributed autograd).
            dist_autograd.backward(context_id, [loss.sum()])

            # Build the distributed optimizer over the remote parameters.
            dist_optim = DistributedOptimizer(
                optim.SGD,
                [rref1, rref2],
                lr=0.05,
            )

            # Run a distributed optimizer step using the gradients
            # accumulated in this context.
            dist_optim.step(context_id)

        rpc.shutdown()


    if __name__ == '__main__':
        # Run two workers.
        world_size = 2
        mp.spawn(run_worker, args=(world_size,), nprocs=world_size)