torch.utils.bottleneck
======================

.. automodule:: torch.utils.bottleneck
.. currentmodule:: torch.utils.bottleneck

`torch.utils.bottleneck` is a tool that can be used as an initial step for
debugging bottlenecks in your program. It summarizes runs of your script with
the Python profiler and PyTorch's autograd profiler.

Run it on the command line with

::

    python -m torch.utils.bottleneck /path/to/source/script.py [args]

where [args] are any number of arguments to `script.py`, or run
``python -m torch.utils.bottleneck -h`` for more usage instructions.
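
For illustration, a minimal script that ``bottleneck`` could be pointed at
might look like the following; the file name ``example.py`` and its contents
are hypothetical, not part of the tool:

::

    # example.py -- hypothetical workload (illustrative only)
    import torch

    def main():
        x = torch.randn(64, 1024)
        w = torch.randn(1024, 1024, requires_grad=True)
        for _ in range(100):
            loss = (x @ w).sum()  # forward pass
            loss.backward()       # backward pass recorded by autograd
            w.grad = None         # clear gradients between iterations

    if __name__ == "__main__":
        main()

Running ``python -m torch.utils.bottleneck example.py`` then prints summaries
from both profilers for this workload.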

.. warning::
    Because your script will be profiled, please ensure that it exits in a
    finite amount of time.

.. warning::
    Due to the asynchronous nature of CUDA kernels, when running against
    CUDA code, the cProfile output and CPU-mode autograd profilers may
    not show correct timings: the reported CPU time includes the time used
    to launch the kernels but not the time the kernels spent executing on
    the GPU, unless the operation does a synchronize. Ops that do
    synchronize appear to be extremely expensive under regular CPU-mode
    profilers. In such cases where timings are incorrect, the CUDA-mode
    autograd profiler may be helpful.
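
As a rough sketch of why this happens (it assumes a CUDA-enabled build; the
tensor sizes are arbitrary), naive wall-clock timing of a CUDA op measures
almost nothing unless you synchronize first:

::

    import time
    import torch

    x = torch.randn(4096, 4096, device="cuda")
    torch.cuda.synchronize()           # make sure pending work is done
    start = time.perf_counter()
    y = x @ x                          # kernel launch returns immediately
    torch.cuda.synchronize()           # wait for the kernel to actually finish
    elapsed = time.perf_counter() - start

Without the second ``torch.cuda.synchronize()``, ``elapsed`` would mostly
reflect the launch cost, which is the same distortion described above.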

.. note::
    To decide which (CPU-only-mode or CUDA-mode) autograd profiler output to
    look at, first check whether your script is CPU-bound
    ("CPU total time is much greater than CUDA total time").
    If it is CPU-bound, looking at the results of the CPU-mode autograd
    profiler will help. If, on the other hand, your script spends most of its
    time executing on the GPU, then it makes sense to start
    looking for responsible CUDA operators in the output of the CUDA-mode
    autograd profiler.

    Of course, the reality is more complicated: your script might not fall
    into either of those two extremes, depending on the part of the model
    you're evaluating. If the profiler outputs don't help, you could try
    looking at the result of :func:`torch.autograd.profiler.emit_nvtx()`
    with ``nvprof``. However, please take into account that the NVTX
    overhead is very high and often gives a heavily skewed timeline.
    Similarly, ``Intel® VTune™ Profiler`` can analyze performance further
    on Intel platforms via :func:`torch.autograd.profiler.emit_itt()`.
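
As a minimal sketch of the ``emit_nvtx`` workflow (the model and input here
are placeholders), the script is run under ``nvprof`` with profiling
initially off, and the region of interest is wrapped like this:

::

    # Run as: nvprof --profile-from-start off -o trace.prof -- python script.py
    import torch

    model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model
    x = torch.randn(64, 1024, device="cuda")    # placeholder input

    with torch.cuda.profiler.profile():          # turns nvprof on for this block
        model(x)                                 # warm-up iteration
        with torch.autograd.profiler.emit_nvtx():
            model(x)                             # each op emits an NVTX range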

.. warning::
    If you are profiling CUDA code, the first profiler that ``bottleneck`` runs
    (cProfile) will include the CUDA startup time (CUDA buffer allocation cost)
    in its time reporting. This should not matter if your bottlenecks are
    much slower than the CUDA startup time.

For more complicated uses of the profilers (such as in a multi-GPU case),
please see https://docs.python.org/3/library/profile.html
or :func:`torch.autograd.profiler.profile()` for more information.
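
For direct use of the autograd profiler outside ``bottleneck``, a minimal
sketch might look like the following (the model and input are placeholders,
and passing ``use_cuda=True`` only makes sense on a CUDA-enabled build):

::

    import torch
    from torch.autograd import profiler

    model = torch.nn.Linear(512, 512)   # placeholder model
    x = torch.randn(32, 512)            # placeholder input

    with profiler.profile() as prof:    # pass use_cuda=True to also time CUDA ops
        model(x)

    # Aggregate stats per operator, sorted by total CPU time
    print(prof.key_averages().table(sort_by="cpu_time_total"))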