.. _torch.compiler_get_started:

Getting Started
===============

Before you read this section, make sure to read the :ref:`torch.compiler_overview`.

Let's start by looking at a simple ``torch.compile`` example that demonstrates
how to use ``torch.compile`` for inference. The example uses ``torch.cos()``
and ``torch.sin()``, which are pointwise operators: they operate element by
element on a vector. This example might not show significant performance gains,
but it should help you form an intuitive understanding of how you can use
``torch.compile`` in your own programs.

.. note::
   To run this script, you need to have at least one GPU on your machine.
   If you do not have a GPU, you can remove the ``.to(device="cuda:0")`` code
   in the snippet below and it will run on CPU. You can also set device to
   ``xpu:0`` to run on Intel® GPUs.

.. code:: python

   import torch

   def fn(x):
       a = torch.cos(x)
       b = torch.sin(a)
       return b

   new_fn = torch.compile(fn, backend="inductor")
   input_tensor = torch.randn(10000).to(device="cuda:0")
   a = new_fn(input_tensor)

A more familiar pointwise operator you might want to use would be something
like ``torch.relu()``. Pointwise ops in eager mode are suboptimal because each
one needs to read a tensor from memory, make some changes, and then write
those changes back. The single most important optimization that Inductor
performs is fusion. In the example above, fusion turns 2 reads (``x``, ``a``)
and 2 writes (``a``, ``b``) into 1 read (``x``) and 1 write (``b``). This is
crucial especially for newer GPUs where the bottleneck is memory bandwidth
(how quickly you can send data to a GPU) rather than compute (how quickly your
GPU can crunch floating point operations).

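If you want to check whether fusion pays off on your own hardware, a rough
timing comparison is an easy sanity check. The following is a minimal sketch:
the tensor size and iteration count are arbitrary choices, and the measured
speedup (if any) depends on your GPU.

.. code-block:: python

   import torch

   def fn(x):
       a = torch.cos(x)
       b = torch.sin(a)
       return b

   compiled_fn = torch.compile(fn, backend="inductor")
   x = torch.randn(10_000_000, device="cuda:0")

   # Warm up so that compilation time is not included in the measurement.
   for _ in range(3):
       fn(x)
       compiled_fn(x)

   def time_fn(f, iters=100):
       start = torch.cuda.Event(enable_timing=True)
       end = torch.cuda.Event(enable_timing=True)
       start.record()
       for _ in range(iters):
           f(x)
       end.record()
       torch.cuda.synchronize()
       return start.elapsed_time(end) / iters  # milliseconds per call

   print(f"eager:    {time_fn(fn):.3f} ms")
   print(f"compiled: {time_fn(compiled_fn):.3f} ms")
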
Another major optimization that Inductor provides is automatic
support for CUDA graphs.
CUDA graphs help eliminate the overhead of launching individual
kernels from a Python program, which is especially relevant for newer GPUs.

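You typically do not need to set up CUDA graphs yourself: passing
``mode="reduce-overhead"`` to ``torch.compile`` asks Inductor to apply CUDA
graphs where it can. A minimal sketch, assuming a CUDA device is available:

.. code-block:: python

   import torch

   def fn(x):
       return torch.sin(torch.cos(x))

   # "reduce-overhead" uses CUDA graphs where possible to cut kernel launch overhead.
   opt_fn = torch.compile(fn, mode="reduce-overhead")

   x = torch.randn(10000, device="cuda:0")
   out = opt_fn(x)
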
TorchDynamo supports many different backends, but TorchInductor specifically works
by generating `Triton <https://github.com/openai/triton>`__ kernels. Let's save
our example above into a file called ``example.py``. We can inspect the
generated Triton kernel code by running ``TORCH_COMPILE_DEBUG=1 python example.py``.
As the script executes, you should see ``DEBUG`` messages printed to the
terminal. Closer to the end of the log, you should see a path to a folder
that contains ``torchinductor_<your_username>``. In that folder, you can find
the ``output_code.py`` file that contains generated kernel code similar to
the following:

.. code-block:: python

   @pointwise(size_hints=[16384], filename=__file__, triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32'}, 'device': 0, 'constants': {}, 'mutated_arg_names': [], 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=())]})
   @triton.jit
   def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
       xnumel = 10000
       xoffset = tl.program_id(0) * XBLOCK
       xindex = xoffset + tl.arange(0, XBLOCK)[:]
       xmask = xindex < xnumel
       x0 = xindex
       tmp0 = tl.load(in_ptr0 + (x0), xmask, other=0.0)
       tmp1 = tl.cos(tmp0)
       tmp2 = tl.sin(tmp1)
       tl.store(out_ptr0 + (x0 + tl.zeros([XBLOCK], tl.int32)), tmp2, xmask)

.. note:: The above code snippet is an example. Depending on your hardware,
   you might see different code generated.

You can verify that fusing ``cos`` and ``sin`` did actually occur: both
operations appear within a single Triton kernel, and the temporary values are
held in registers, which are very fast to access.

You can read more about Triton's performance
`here <https://openai.com/blog/triton/>`__. Because the code is written
in Python, it's fairly easy to understand even if you have not written many
CUDA kernels.

Next, let's try a real model like resnet50 from the PyTorch
hub.

.. code-block:: python

   import torch
   model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True)
   opt_model = torch.compile(model, backend="inductor")
   opt_model(torch.randn(1, 3, 64, 64))

Inductor is not the only available backend; you can run
``torch.compiler.list_backends()`` in a REPL to see all of the available backends.
Try out the ``cudagraphs`` backend next, as in the sketch below.

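Here is a minimal sketch that lists the registered backends and recompiles the
resnet50 model from above with the ``cudagraphs`` backend (which backends
appear can vary across PyTorch builds):

.. code-block:: python

   import torch

   # Show every compile backend registered in this PyTorch build.
   print(torch.compiler.list_backends())

   model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True).to(device="cuda:0")
   opt_model = torch.compile(model, backend="cudagraphs")
   opt_model(torch.randn(1, 3, 64, 64, device="cuda:0"))
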
Using a pretrained model
~~~~~~~~~~~~~~~~~~~~~~~~

PyTorch users frequently leverage pretrained models from
`transformers <https://github.com/huggingface/transformers>`__ or
`TIMM <https://github.com/rwightman/pytorch-image-models>`__, and one of
the design goals of TorchDynamo and TorchInductor is to work out of the box with
any model that people would like to author.

Let's download a pretrained model directly from the HuggingFace hub and optimize
it:

.. code-block:: python

   import torch
   from transformers import BertTokenizer, BertModel
   # Copy pasted from here https://huggingface.co/bert-base-uncased
   tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
   model = BertModel.from_pretrained("bert-base-uncased").to(device="cuda:0")
   model = torch.compile(model, backend="inductor") # This is the only line of code that we changed
   text = "Replace me by any text you'd like."
   encoded_input = tokenizer(text, return_tensors='pt').to(device="cuda:0")
   output = model(**encoded_input)

If you remove the ``to(device="cuda:0")`` from the model and
``encoded_input``, then TorchInductor will generate C++ kernels that are
optimized for running on your CPU. You can inspect both the Triton and the C++
kernels for BERT. They are more complex than the trigonometry
example we tried above, but you can similarly skim through them and see if you
understand how PyTorch works.

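As a concrete sketch of the CPU path, here is the same example with the device
moves removed (nothing else changes); running it with ``TORCH_COMPILE_DEBUG=1``,
as before, lets you inspect the generated code:

.. code-block:: python

   import torch
   from transformers import BertTokenizer, BertModel

   # Same BERT example as above, just without the .to(device="cuda:0") calls,
   # so the model and inputs stay on the CPU and Inductor emits C++ kernels.
   tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
   model = torch.compile(BertModel.from_pretrained("bert-base-uncased"), backend="inductor")

   encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors="pt")
   output = model(**encoded_input)
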
Similarly, let's try out a TIMM example:

.. code-block:: python

   import timm
   import torch
   model = timm.create_model('resnext101_32x8d', pretrained=True, num_classes=2)
   opt_model = torch.compile(model, backend="inductor")
   opt_model(torch.randn(64, 3, 7, 7))

Next Steps
~~~~~~~~~~

In this section, we have reviewed a few inference examples and developed a
basic understanding of how ``torch.compile`` works. Here is what you can check out next:

- `torch.compile tutorial on training <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_
- :ref:`torch.compiler_api`
- :ref:`torchdynamo_fine_grain_tracing`