===================================
Compiling CUDA C/C++ with LLVM
===================================

.. contents::
   :local:

Introduction
============

This document describes how to compile CUDA C/C++ with LLVM and outlines some
of the internals of LLVM's CUDA support. It is aimed at both users who want to
compile CUDA with LLVM and developers who want to improve LLVM for GPUs. This
document assumes a basic familiarity with CUDA. Information about CUDA
programming can be found in the `CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

How to Build LLVM with CUDA Support
===================================

CUDA support is still in development and works best with the trunk version of
LLVM. Below is a quick summary of downloading and building the trunk version.
Consult the `Getting Started
<http://llvm.org/docs/GettingStarted.html>`_ page for more details on setting
up LLVM.

#. Check out LLVM

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm

#. Check out Clang

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ cd llvm/tools
     $ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang

#. Configure and build LLVM and Clang

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ mkdir build
     $ cd build
     $ cmake [options] ..
     $ make

How to Compile CUDA C/C++ with LLVM
===================================

We assume you have installed the CUDA driver and runtime. Consult the `NVIDIA
CUDA installation guide
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ if
you have not.

Suppose you want to compile and run the following CUDA program (``axpy.cu``),
which multiplies a ``float`` array by a ``float`` scalar (AXPY).

.. code-block:: c++

  #include <iostream>

  __global__ void axpy(float a, float* x, float* y) {
    y[threadIdx.x] = a * x[threadIdx.x];
  }

  int main(int argc, char* argv[]) {
    const int kDataLen = 4;

    float a = 2.0f;
    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
    float host_y[kDataLen];

    // Copy input data to device.
    float* device_x;
    float* device_y;
    cudaMalloc(&device_x, kDataLen * sizeof(float));
    cudaMalloc(&device_y, kDataLen * sizeof(float));
    cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
               cudaMemcpyHostToDevice);

    // Launch the kernel.
    axpy<<<1, kDataLen>>>(a, device_x, device_y);

    // Copy output data to host.
    cudaDeviceSynchronize();
    cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Print the results.
    for (int i = 0; i < kDataLen; ++i) {
      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
    }

    cudaDeviceReset();
    return 0;
  }

The command line for compilation is similar to what you would use for C++.

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
      -L<CUDA install path>/<lib64 or lib> \
      -lcudart_static -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

``<CUDA install path>`` is the root directory where you installed the CUDA SDK,
typically ``/usr/local/cuda``. ``<GPU arch>`` is `the compute capability of
your GPU <https://developer.nvidia.com/cuda-gpus>`_. For example, if you want
to run your program on a GPU with compute capability 3.5, you should specify
``--cuda-gpu-arch=sm_35``.
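Putting these together: assuming the CUDA SDK is installed in the typical
location ``/usr/local/cuda`` on a 64-bit Linux system (so the runtime library
lives under ``lib64``) and the target GPU has compute capability 3.5, the full
command might look like this; adjust the path and architecture to match your
setup:

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=sm_35 \
      -L/usr/local/cuda/lib64 \
      -lcudart_static -ldl -lrt -pthread
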
Detecting clang vs NVCC
=======================

Although clang's CUDA implementation is largely compatible with NVCC's, you may
still want to detect when you're compiling CUDA code specifically with clang.

This is tricky, because NVCC may invoke clang as part of its own compilation
process! For example, NVCC uses the host compiler's preprocessor when
compiling for device code, and that host compiler may in fact be clang.

When clang is actually compiling CUDA code -- rather than being used as a
subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
defined only in device mode (but will be defined if NVCC is using clang as a
preprocessor). So you can use the following incantations to detect clang CUDA
compilation, in host and device modes:

.. code-block:: c++

  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
  // clang compiling CUDA code, host mode.
  #endif

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
  // clang compiling CUDA code, device mode.
  #endif

Both clang and nvcc define ``__CUDACC__`` during CUDA compilation. You can
detect NVCC specifically by looking for ``__NVCC__``.

Flags that control numerical code
=================================

If you're using GPUs, you probably care about making numerical code run fast.
GPU hardware allows for more control over numerical operations than most CPUs,
but this results in more compiler options for you to juggle.

Flags you may wish to tweak include:

* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
  compiling CUDA) Controls whether the compiler emits fused multiply-add
  operations.

  * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
    and add instructions.
  * ``on``: fuse multiplies and adds within a single statement, but never
    across statements (C11 semantics). Prevents ptxas from fusing other
    multiplies and adds.
  * ``fast``: fuse multiplies and adds wherever profitable, even across
    statements. Doesn't prevent ptxas from fusing additional multiplies and
    adds.

  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but because the intermediate result in an fma is not rounded,
  this flag can affect numerical code; see the sketch after this list.

* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
  floating point operations may flush `denormal
  <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
  Operations on denormal numbers are often much slower than the same operations
  on normal numbers.

* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
  compiler may emit calls to faster, approximate versions of transcendental
  functions, instead of using the slower, fully IEEE-compliant versions. For
  example, this flag allows clang to emit the ptx ``sin.approx.f32``
  instruction.

  This is implied by ``-ffast-math``.
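As a rough illustration of what ``-ffp-contract`` controls, consider the
hypothetical device function below (the function name is made up for this
sketch). Under ``on`` or ``fast``, the multiply and add in the single return
statement may be contracted into one fused multiply-add, so the intermediate
product is never rounded on its own; under ``off`` the two operations stay
separate.

.. code-block:: c++

  // Hypothetical example; not part of the axpy program above.
  __device__ float scale_and_add(float a, float x, float y) {
    // With -ffp-contract=on or -ffp-contract=fast, clang may turn this single
    // statement into one fused multiply-add (e.g. the PTX fma.rn.f32
    // instruction), so a * x is not rounded before the addition.
    // With -ffp-contract=off, a separate multiply and add are emitted.
    return a * x + y;
  }

If you suspect contraction is changing your results, recompiling with
``-ffp-contract=off`` and comparing outputs is a simple way to check.
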
Optimizations
=============

CPUs and GPUs have different design philosophies and architectures. For
example, a typical CPU has branch prediction, out-of-order execution, and is
superscalar, whereas a typical GPU has none of these. Due to such differences,
an optimization pipeline well-tuned for CPUs may not be suitable for GPUs.

LLVM performs several general and CUDA-specific optimizations for GPUs. The
list below shows some of the more important optimizations for GPUs. Most of
them have been upstreamed to ``lib/Transforms/Scalar`` and
``lib/Target/NVPTX``. A few of them have not been upstreamed due to lack of a
customizable target-independent optimization pipeline.

* **Straight-line scalar optimizations**. These optimizations reduce redundancy
  in straight-line code. Details can be found in the `design document for
  straight-line scalar optimizations <https://goo.gl/4Rb9As>`_.

* **Inferring memory spaces**. `This optimization
  <https://github.com/llvm-mirror/llvm/blob/master/lib/Target/NVPTX/NVPTXInferAddressSpaces.cpp>`_
  infers the memory space of an address so that the backend can emit faster
  special loads and stores from it.

* **Aggressive loop unrolling and function inlining**. Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on a GPU is more expensive. They also promote other
  optimizations, such as constant propagation and SROA, which sometimes speed
  up code by over 10x. An empirical inline threshold for GPUs is 1100. This
  configuration has yet to be upstreamed with a target-specific optimization
  pipeline. LLVM also provides `loop unrolling pragmas
  <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))`` for programmers to force unrolling and
  inlining.

* **Aggressive speculative execution**. `This transformation
  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_ is
  mainly for promoting straight-line scalar optimizations, which are most
  effective on code along dominator paths.

* **Memory-space alias analysis**. `This alias analysis
  <http://reviews.llvm.org/D12414>`_ infers that two pointers in different
  special memory spaces do not alias. It has yet to be integrated into the new
  alias analysis infrastructure; the new infrastructure does not run
  target-specific alias analysis.

* **Bypassing 64-bit divides**. `An existing optimization
  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_
  enabled in the NVPTX backend. 64-bit integer divides are much slower than
  32-bit ones on NVIDIA GPUs because the hardware lacks a divide unit. Many of
  the 64-bit divides in our benchmarks have a divisor and dividend which fit in
  32 bits at runtime. This optimization provides a fast path for this common
  case.

Publication
===========

| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
| `Slides for the CGO talk <http://wujingyue.com/docs/gpucc-talk.pdf>`_

Tutorial
========

`CGO 2016 gpucc tutorial <http://wujingyue.com/docs/gpucc-tutorial.pdf>`_

Obtaining Help
==============

To obtain help on LLVM in general and its CUDA support, see `the LLVM
community <http://llvm.org/docs/#mailing-lists>`_.