Lines Matching full:gather

17 stream to allow for overlapping an all-gather with ``forward`` compute issued before it (from the C…
18 perspective). For example, if we have layer 0 all-gather -> layer 0 ``forward`` compute -> layer 1
19 all-gather -> …, then layer 1 all-gather can overlap with layer 0 ``forward`` compute even though t…
20 CPU thread issued it afterwards. (The 1st all-gather will not be able to overlap with anything.)
23 all-gather -> layer 1 all-gather -> layer 0 ``forward`` compute -> …. In eager mode, there is no wa…
31 the cost that the next all-gather’s output tensor must be allocated while the current one is still
32 in use. By issuing the next all-gather before the current ``forward`` compute kernels, the next
33 all-gather can start sooner on GPU. For most LLM workloads, this is not the case, so there is no
38 all-gather and reduce-scatter (partially because in earlier NCCL versions, it was not safe to use
41 we explicitly reorder the CPU issue order to be next all-gather -> current reduce-scatter, then the
42 current reduce-scatter would block the next all-gather and hence the next ``backward`` computation,
52 1. all-gather on parameters in ``forward``
53 2. all-gather on parameters in ``backward``
65 Each communication group corresponds to a single all-gather call and single reduce-scatter call. In
75 * The ``forward`` pass will communicate in chunks of ``0.2*4 = 0.8GB`` in all-gather
76 * The ``backward`` pass will communicate 2 times ``0.8GB`` each (1x all-gather and 1x reduce-scatte…
96 ``forward`` currently requires 2x all-gather buffer size. Here is why:
99 (``forward_prefetch=True``) case of layer 0 all-gather -> layer 0 forward compute -> layer 1
100 all-gather there is a need for 2 all-gather-sized buffers, because one buffer is used in the curren…
102 gather-sized buffers. The reason is that in the flat-parameter FSDP design, we do not copy-out of …
104 … the recorded forward order as a possible 'failure mode'; a module's all-gather can always be foun…
106 ``backward`` currently requires at least 2x all-gather buffer size and potentially a bit more. Here…
148 …parameters. Hence, the nested ``nn.Module`` structure can affect the all-gather/free schedule and …
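
The per-group sizing quoted in matches 75 and 76 above can be sketched as simple arithmetic. This is a minimal illustration, not code from the source file: it assumes that ``0.2`` is the group's parameter count in billions and that ``4`` is the bytes per element (fp32), and the function names are hypothetical.

```python
def all_gather_gb(params_billions: float, bytes_per_param: int = 4) -> float:
    """Size in GB (1 GB = 1e9 bytes) of one all-gather over a communication group.

    Assumption: the group holds `params_billions` billion parameters, each
    stored in `bytes_per_param` bytes (4 for fp32).
    """
    return params_billions * bytes_per_param


def backward_traffic_gb(params_billions: float, bytes_per_param: int = 4) -> dict:
    """Per-group ``backward`` traffic: one all-gather plus one reduce-scatter,
    each the same size as the ``forward`` all-gather (per match 76)."""
    per_collective = all_gather_gb(params_billions, bytes_per_param)
    return {"all_gather": per_collective, "reduce_scatter": per_collective}


forward_gb = all_gather_gb(0.2)          # the 0.2*4 case from match 75
backward_gb = backward_traffic_gb(0.2)   # two collectives of that size
print(f"forward all-gather: {forward_gb:.1f} GB")
print(f"backward total: {sum(backward_gb.values()):.1f} GB")
```

Under these assumptions, halving ``bytes_per_param`` (for example a 2-byte dtype) halves each collective, which is one way to reason about how wrapping granularity and dtype interact with the buffer sizes discussed in matches 96 to 106.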