.. _torch.compiler_get_started:

Getting Started
===============

Before you read this section, make sure to read the :ref:`torch.compiler_overview`.

Let's start by looking at a simple example that demonstrates how to use
``torch.compile`` for inference. The example uses ``torch.cos()`` and
``torch.sin()``, which are pointwise operators because they operate element by
element on a vector. This example might not show significant performance gains,
but it should help you form an intuitive understanding of how you can use
``torch.compile`` in your own programs.

.. note::
    To run this script, you need to have at least one GPU on your machine.
    If you do not have a GPU, you can remove the ``.to(device="cuda:0")`` code
    in the snippet below and it will run on CPU. You can also set the device to
    ``xpu:0`` to run on Intel® GPUs.

.. code:: python

    import torch
    def fn(x):
        a = torch.cos(x)
        b = torch.sin(a)
        return b
    new_fn = torch.compile(fn, backend="inductor")
    input_tensor = torch.randn(10000).to(device="cuda:0")
    a = new_fn(input_tensor)

A better-known pointwise operator you might want to use would be something like
``torch.relu()``. Pointwise ops in eager mode are suboptimal because each one
needs to read a tensor from memory, make some changes, and then write those
changes back. The single most important optimization that inductor performs is
fusion. In the example above we can turn 2 reads (``x``, ``a``) and
2 writes (``a``, ``b``) into 1 read (``x``) and 1 write (``b``), which
is crucial especially for newer GPUs where the bottleneck is memory
bandwidth (how quickly you can send data to a GPU) rather than compute
(how quickly your GPU can crunch floating point operations).

Another major optimization that inductor provides is automatic
support for CUDA graphs.
CUDA graphs help eliminate the overhead of launching individual
kernels from a Python program, which is especially relevant for newer GPUs.

TorchDynamo supports many different backends, but TorchInductor specifically works
by generating `Triton <https://github.com/openai/triton>`__ kernels. Let's save
our example above into a file called ``example.py``. We can inspect the
generated Triton kernel code by running ``TORCH_COMPILE_DEBUG=1 python example.py``.
As the script executes, you should see ``DEBUG`` messages printed to the
terminal. Closer to the end of the log, you should see a path to a folder
named ``torchinductor_<your_username>``. In that folder, you can find
the ``output_code.py`` file that contains the generated kernel code, similar to
the following:

.. code-block:: python

    @pointwise(size_hints=[16384], filename=__file__, triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32'}, 'device': 0, 'constants': {}, 'mutated_arg_names': [], 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=())]})
    @triton.jit
    def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
        xnumel = 10000
        xoffset = tl.program_id(0) * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x0 = xindex
        tmp0 = tl.load(in_ptr0 + (x0), xmask, other=0.0)
        tmp1 = tl.cos(tmp0)
        tmp2 = tl.sin(tmp1)
        tl.store(out_ptr0 + (x0 + tl.zeros([XBLOCK], tl.int32)), tmp2, xmask)

.. note:: The above code snippet is an example. Depending on your hardware,
    you might see different code generated.
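
To get a feel for what fusion buys you on your own machine, you can time the
eager and compiled versions of the function. The snippet below is a rough
benchmarking sketch rather than part of the example above: the ``bench`` helper
is something we define here for illustration, the first call to the compiled
function is made up front because it includes compilation time, and the actual
numbers will vary with your hardware.

.. code-block:: python

    import time
    import torch

    def fn(x):
        a = torch.cos(x)
        b = torch.sin(a)
        return b

    compiled_fn = torch.compile(fn, backend="inductor")
    x = torch.randn(10000, device="cuda:0")
    compiled_fn(x)  # warm up so compilation time is not measured below

    def bench(f, iters=100):
        # Time `iters` calls, synchronizing so the GPU work is included
        # in the measurement.
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            f(x)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

    print(f"eager:    {bench(fn) * 1e6:.1f} us/iter")
    print(f"compiled: {bench(compiled_fn) * 1e6:.1f} us/iter")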

You can verify that fusing ``cos`` and ``sin`` actually occurred, because both
operations appear within a single Triton kernel and the temporary values are
held in registers, which are very fast to access.

You can read more about Triton's performance
`here <https://openai.com/blog/triton/>`__. Because the code is written
in Python, it's fairly easy to understand even if you have not written all that
many CUDA kernels.

Next, let's try a real model like ``resnet50`` from the PyTorch hub.

.. code-block:: python

    import torch
    model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True)
    opt_model = torch.compile(model, backend="inductor")
    opt_model(torch.randn(1,3,64,64))

Inductor is not the only available backend; you can run
``torch.compiler.list_backends()`` in a REPL to see all of them. Try the
``cudagraphs`` backend next for inspiration.

Using a pretrained model
~~~~~~~~~~~~~~~~~~~~~~~~

PyTorch users frequently leverage pretrained models from
`transformers <https://github.com/huggingface/transformers>`__ or
`TIMM <https://github.com/rwightman/pytorch-image-models>`__, and one of
the design goals of TorchDynamo and TorchInductor is to work out of the box with
any model that people would like to author.

Let's download a pretrained model directly from the HuggingFace hub and optimize
it:

.. code-block:: python

    import torch
    from transformers import BertTokenizer, BertModel
    # Copy pasted from here https://huggingface.co/bert-base-uncased
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained("bert-base-uncased").to(device="cuda:0")
    model = torch.compile(model, backend="inductor") # This is the only line of code that we changed
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt').to(device="cuda:0")
    output = model(**encoded_input)

If you remove the ``to(device="cuda:0")`` from the model and
``encoded_input``, then inductor will generate C++ kernels
optimized for running on your CPU. You can inspect both the Triton and C++
kernels for BERT. They are more complex than the trigonometry
example we tried above, but you can similarly skim through them and see if you
understand how PyTorch works.

Similarly, let's try out a TIMM example:

.. code-block:: python

    import timm
    import torch
    model = timm.create_model('resnext101_32x8d', pretrained=True, num_classes=2)
    opt_model = torch.compile(model, backend="inductor")
    opt_model(torch.randn(64,3,7,7))

Next Steps
~~~~~~~~~~

In this section, we have reviewed a few inference examples and developed a
basic understanding of how ``torch.compile`` works. Here is what to check out next:

- `torch.compile tutorial on training <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_
- :ref:`torch.compiler_api`
- :ref:`torchdynamo_fine_grain_tracing`