1========================= 2Compiling CUDA with clang 3========================= 4 5.. contents:: 6 :local: 7 8Introduction 9============ 10 11This document describes how to compile CUDA code with clang, and gives some 12details about LLVM and clang's CUDA implementations. 13 14This document assumes a basic familiarity with CUDA. Information about CUDA 15programming can be found in the 16`CUDA programming guide 17<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_. 18 19Compiling CUDA Code 20=================== 21 22Prerequisites 23------------- 24 25CUDA is supported in llvm 3.9, but it's still in active development, so we 26recommend you `compile clang/LLVM from HEAD 27<http://llvm.org/docs/GettingStarted.html>`_. 28 29Before you build CUDA code, you'll need to have installed the appropriate 30driver for your nvidia GPU and the CUDA SDK. See `NVIDIA's CUDA installation 31guide <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ 32for details. Note that clang `does not support 33<https://llvm.org/bugs/show_bug.cgi?id=26966>`_ the CUDA toolkit as installed 34by many Linux package managers; you probably need to install nvidia's package. 35 36You will need CUDA 7.0, 7.5, or 8.0 to compile with clang. 37 38CUDA compilation is supported on Linux, on MacOS as of 2016-11-18, and on 39Windows as of 2017-01-05. 40 41Invoking clang 42-------------- 43 44Invoking clang for CUDA compilation works similarly to compiling regular C++. 45You just need to be aware of a few additional flags. 46 47You can use `this <https://gist.github.com/855e277884eb6b388cd2f00d956c2fd4>`_ 48program as a toy example. Save it as ``axpy.cu``. (Clang detects that you're 49compiling CUDA code by noticing that your filename ends with ``.cu``. 50Alternatively, you can pass ``-x cuda``.) 51 52To build and run, run the following commands, filling in the parts in angle 53brackets as described below: 54 55.. code-block:: console 56 57 $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \ 58 -L<CUDA install path>/<lib64 or lib> \ 59 -lcudart_static -ldl -lrt -pthread 60 $ ./axpy 61 y[0] = 2 62 y[1] = 4 63 y[2] = 6 64 y[3] = 8 65 66On MacOS, replace `-lcudart_static` with `-lcudart`; otherwise, you may get 67"CUDA driver version is insufficient for CUDA runtime version" errors when you 68run your program. 69 70* ``<CUDA install path>`` -- the directory where you installed CUDA SDK. 71 Typically, ``/usr/local/cuda``. 72 73 Pass e.g. ``-L/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise, 74 pass e.g. ``-L/usr/local/cuda/lib``. (In CUDA, the device code and host code 75 always have the same pointer widths, so if you're compiling 64-bit code for 76 the host, you're also compiling 64-bit code for the device.) 77 78* ``<GPU arch>`` -- the `compute capability 79 <https://developer.nvidia.com/cuda-gpus>`_ of your GPU. For example, if you 80 want to run your program on a GPU with compute capability of 3.5, specify 81 ``--cuda-gpu-arch=sm_35``. 82 83 Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``; 84 only ``sm_XX`` is currently supported. However, clang always includes PTX in 85 its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would be 86 forwards-compatible with e.g. ``sm_35`` GPUs. 87 88 You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple archs. 89 90The `-L` and `-l` flags only need to be passed when linking. When compiling, 91you may also need to pass ``--cuda-path=/path/to/cuda`` if you didn't install 92the CUDA SDK into ``/usr/local/cuda``, ``/usr/local/cuda-7.0``, or 93``/usr/local/cuda-7.5``. 94 95Flags that control numerical code 96--------------------------------- 97 98If you're using GPUs, you probably care about making numerical code run fast. 99GPU hardware allows for more control over numerical operations than most CPUs, 100but this results in more compiler options for you to juggle. 101 102Flags you may wish to tweak include: 103 104* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when 105 compiling CUDA) Controls whether the compiler emits fused multiply-add 106 operations. 107 108 * ``off``: never emit fma operations, and prevent ptxas from fusing multiply 109 and add instructions. 110 * ``on``: fuse multiplies and adds within a single statement, but never 111 across statements (C11 semantics). Prevent ptxas from fusing other 112 multiplies and adds. 113 * ``fast``: fuse multiplies and adds wherever profitable, even across 114 statements. Doesn't prevent ptxas from fusing additional multiplies and 115 adds. 116 117 Fused multiply-add instructions can be much faster than the unfused 118 equivalents, but because the intermediate result in an fma is not rounded, 119 this flag can affect numerical code. 120 121* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled, 122 floating point operations may flush `denormal 123 <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0. 124 Operations on denormal numbers are often much slower than the same operations 125 on normal numbers. 126 127* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the 128 compiler may emit calls to faster, approximate versions of transcendental 129 functions, instead of using the slower, fully IEEE-compliant versions. For 130 example, this flag allows clang to emit the ptx ``sin.approx.f32`` 131 instruction. 132 133 This is implied by ``-ffast-math``. 134 135Standard library support 136======================== 137 138In clang and nvcc, most of the C++ standard library is not supported on the 139device side. 140 141``<math.h>`` and ``<cmath>`` 142---------------------------- 143 144In clang, ``math.h`` and ``cmath`` are available and `pass 145<https://github.com/llvm-mirror/test-suite/blob/master/External/CUDA/math_h.cu>`_ 146`tests 147<https://github.com/llvm-mirror/test-suite/blob/master/External/CUDA/cmath.cu>`_ 148adapted from libc++'s test suite. 149 150In nvcc ``math.h`` and ``cmath`` are mostly available. Versions of ``::foof`` 151in namespace std (e.g. ``std::sinf``) are not available, and where the standard 152calls for overloads that take integral arguments, these are usually not 153available. 154 155.. code-block:: c++ 156 157 #include <math.h> 158 #include <cmath.h> 159 160 // clang is OK with everything in this function. 161 __device__ void test() { 162 std::sin(0.); // nvcc - ok 163 std::sin(0); // nvcc - error, because no std::sin(int) override is available. 164 sin(0); // nvcc - same as above. 165 166 sinf(0.); // nvcc - ok 167 std::sinf(0.); // nvcc - no such function 168 } 169 170``<std::complex>`` 171------------------ 172 173nvcc does not officially support ``std::complex``. It's an error to use 174``std::complex`` in ``__device__`` code, but it often works in ``__host__ 175__device__`` code due to nvcc's interpretation of the "wrong-side rule" (see 176below). However, we have heard from implementers that it's possible to get 177into situations where nvcc will omit a call to an ``std::complex`` function, 178especially when compiling without optimizations. 179 180As of 2016-11-16, clang supports ``std::complex`` without these caveats. It is 181tested with libstdc++ 4.8.5 and newer, but is known to work only with libc++ 182newer than 2016-11-16. 183 184``<algorithm>`` 185--------------- 186 187In C++14, many useful functions from ``<algorithm>`` (notably, ``std::min`` and 188``std::max``) become constexpr. You can therefore use these in device code, 189when compiling with clang. 190 191Detecting clang vs NVCC from code 192================================= 193 194Although clang's CUDA implementation is largely compatible with NVCC's, you may 195still want to detect when you're compiling CUDA code specifically with clang. 196 197This is tricky, because NVCC may invoke clang as part of its own compilation 198process! For example, NVCC uses the host compiler's preprocessor when 199compiling for device code, and that host compiler may in fact be clang. 200 201When clang is actually compiling CUDA code -- rather than being used as a 202subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is 203defined only in device mode (but will be defined if NVCC is using clang as a 204preprocessor). So you can use the following incantations to detect clang CUDA 205compilation, in host and device modes: 206 207.. code-block:: c++ 208 209 #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__) 210 // clang compiling CUDA code, host mode. 211 #endif 212 213 #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__) 214 // clang compiling CUDA code, device mode. 215 #endif 216 217Both clang and nvcc define ``__CUDACC__`` during CUDA compilation. You can 218detect NVCC specifically by looking for ``__NVCC__``. 219 220Dialect Differences Between clang and nvcc 221========================================== 222 223There is no formal CUDA spec, and clang and nvcc speak slightly different 224dialects of the language. Below, we describe some of the differences. 225 226This section is painful; hopefully you can skip this section and live your life 227blissfully unaware. 228 229Compilation Models 230------------------ 231 232Most of the differences between clang and nvcc stem from the different 233compilation models used by clang and nvcc. nvcc uses *split compilation*, 234which works roughly as follows: 235 236 * Run a preprocessor over the input ``.cu`` file to split it into two source 237 files: ``H``, containing source code for the host, and ``D``, containing 238 source code for the device. 239 240 * For each GPU architecture ``arch`` that we're compiling for, do: 241 242 * Compile ``D`` using nvcc proper. The result of this is a ``ptx`` file for 243 ``P_arch``. 244 245 * Optionally, invoke ``ptxas``, the PTX assembler, to generate a file, 246 ``S_arch``, containing GPU machine code (SASS) for ``arch``. 247 248 * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a 249 single "fat binary" file, ``F``. 250 251 * Compile ``H`` using an external host compiler (gcc, clang, or whatever you 252 like). ``F`` is packaged up into a header file which is force-included into 253 ``H``; nvcc generates code that calls into this header to e.g. launch 254 kernels. 255 256clang uses *merged parsing*. This is similar to split compilation, except all 257of the host and device code is present and must be semantically-correct in both 258compilation steps. 259 260 * For each GPU architecture ``arch`` that we're compiling for, do: 261 262 * Compile the input ``.cu`` file for device, using clang. ``__host__`` code 263 is parsed and must be semantically correct, even though we're not 264 generating code for the host at this time. 265 266 The output of this step is a ``ptx`` file ``P_arch``. 267 268 * Invoke ``ptxas`` to generate a SASS file, ``S_arch``. Note that, unlike 269 nvcc, clang always generates SASS code. 270 271 * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a 272 single fat binary file, ``F``. 273 274 * Compile ``H`` using clang. ``__device__`` code is parsed and must be 275 semantically correct, even though we're not generating code for the device 276 at this time. 277 278 ``F`` is passed to this compilation, and clang includes it in a special ELF 279 section, where it can be found by tools like ``cuobjdump``. 280 281(You may ask at this point, why does clang need to parse the input file 282multiple times? Why not parse it just once, and then use the AST to generate 283code for the host and each device architecture? 284 285Unfortunately this can't work because we have to define different macros during 286host compilation and during device compilation for each GPU architecture.) 287 288clang's approach allows it to be highly robust to C++ edge cases, as it doesn't 289need to decide at an early stage which declarations to keep and which to throw 290away. But it has some consequences you should be aware of. 291 292Overloading Based on ``__host__`` and ``__device__`` Attributes 293--------------------------------------------------------------- 294 295Let "H", "D", and "HD" stand for "``__host__`` functions", "``__device__`` 296functions", and "``__host__ __device__`` functions", respectively. Functions 297with no attributes behave the same as H. 298 299nvcc does not allow you to create H and D functions with the same signature: 300 301.. code-block:: c++ 302 303 // nvcc: error - function "foo" has already been defined 304 __host__ void foo() {} 305 __device__ void foo() {} 306 307However, nvcc allows you to "overload" H and D functions with different 308signatures: 309 310.. code-block:: c++ 311 312 // nvcc: no error 313 __host__ void foo(int) {} 314 __device__ void foo() {} 315 316In clang, the ``__host__`` and ``__device__`` attributes are part of a 317function's signature, and so it's legal to have H and D functions with 318(otherwise) the same signature: 319 320.. code-block:: c++ 321 322 // clang: no error 323 __host__ void foo() {} 324 __device__ void foo() {} 325 326HD functions cannot be overloaded by H or D functions with the same signature: 327 328.. code-block:: c++ 329 330 // nvcc: error - function "foo" has already been defined 331 // clang: error - redefinition of 'foo' 332 __host__ __device__ void foo() {} 333 __device__ void foo() {} 334 335 // nvcc: no error 336 // clang: no error 337 __host__ __device__ void bar(int) {} 338 __device__ void bar() {} 339 340When resolving an overloaded function, clang considers the host/device 341attributes of the caller and callee. These are used as a tiebreaker during 342overload resolution. See `IdentifyCUDAPreference 343<http://clang.llvm.org/doxygen/SemaCUDA_8cpp.html>`_ for the full set of rules, 344but at a high level they are: 345 346 * D functions prefer to call other Ds. HDs are given lower priority. 347 348 * Similarly, H functions prefer to call other Hs, or ``__global__`` functions 349 (with equal priority). HDs are given lower priority. 350 351 * HD functions prefer to call other HDs. 352 353 When compiling for device, HDs will call Ds with lower priority than HD, and 354 will call Hs with still lower priority. If it's forced to call an H, the 355 program is malformed if we emit code for this HD function. We call this the 356 "wrong-side rule", see example below. 357 358 The rules are symmetrical when compiling for host. 359 360Some examples: 361 362.. code-block:: c++ 363 364 __host__ void foo(); 365 __device__ void foo(); 366 367 __host__ void bar(); 368 __host__ __device__ void bar(); 369 370 __host__ void test_host() { 371 foo(); // calls H overload 372 bar(); // calls H overload 373 } 374 375 __device__ void test_device() { 376 foo(); // calls D overload 377 bar(); // calls HD overload 378 } 379 380 __host__ __device__ void test_hd() { 381 foo(); // calls H overload when compiling for host, otherwise D overload 382 bar(); // always calls HD overload 383 } 384 385Wrong-side rule example: 386 387.. code-block:: c++ 388 389 __host__ void host_only(); 390 391 // We don't codegen inline functions unless they're referenced by a 392 // non-inline function. inline_hd1() is called only from the host side, so 393 // does not generate an error. inline_hd2() is called from the device side, 394 // so it generates an error. 395 inline __host__ __device__ void inline_hd1() { host_only(); } // no error 396 inline __host__ __device__ void inline_hd2() { host_only(); } // error 397 398 __host__ void host_fn() { inline_hd1(); } 399 __device__ void device_fn() { inline_hd2(); } 400 401 // This function is not inline, so it's always codegen'ed on both the host 402 // and the device. Therefore, it generates an error. 403 __host__ __device__ void not_inline_hd() { host_only(); } 404 405For the purposes of the wrong-side rule, templated functions also behave like 406``inline`` functions: They aren't codegen'ed unless they're instantiated 407(usually as part of the process of invoking them). 408 409clang's behavior with respect to the wrong-side rule matches nvcc's, except 410nvcc only emits a warning for ``not_inline_hd``; device code is allowed to call 411``not_inline_hd``. In its generated code, nvcc may omit ``not_inline_hd``'s 412call to ``host_only`` entirely, or it may try to generate code for 413``host_only`` on the device. What you get seems to depend on whether or not 414the compiler chooses to inline ``host_only``. 415 416Member functions, including constructors, may be overloaded using H and D 417attributes. However, destructors cannot be overloaded. 418 419Using a Different Class on Host/Device 420-------------------------------------- 421 422Occasionally you may want to have a class with different host/device versions. 423 424If all of the class's members are the same on the host and device, you can just 425provide overloads for the class's member functions. 426 427However, if you want your class to have different members on host/device, you 428won't be able to provide working H and D overloads in both classes. In this 429case, clang is likely to be unhappy with you. 430 431.. code-block:: c++ 432 433 #ifdef __CUDA_ARCH__ 434 struct S { 435 __device__ void foo() { /* use device_only */ } 436 int device_only; 437 }; 438 #else 439 struct S { 440 __host__ void foo() { /* use host_only */ } 441 double host_only; 442 }; 443 444 __device__ void test() { 445 S s; 446 // clang generates an error here, because during host compilation, we 447 // have ifdef'ed away the __device__ overload of S::foo(). The __device__ 448 // overload must be present *even during host compilation*. 449 S.foo(); 450 } 451 #endif 452 453We posit that you don't really want to have classes with different members on H 454and D. For example, if you were to pass one of these as a parameter to a 455kernel, it would have a different layout on H and D, so would not work 456properly. 457 458To make code like this compatible with clang, we recommend you separate it out 459into two classes. If you need to write code that works on both host and 460device, consider writing an overloaded wrapper function that returns different 461types on host and device. 462 463.. code-block:: c++ 464 465 struct HostS { ... }; 466 struct DeviceS { ... }; 467 468 __host__ HostS MakeStruct() { return HostS(); } 469 __device__ DeviceS MakeStruct() { return DeviceS(); } 470 471 // Now host and device code can call MakeStruct(). 472 473Unfortunately, this idiom isn't compatible with nvcc, because it doesn't allow 474you to overload based on the H/D attributes. Here's an idiom that works with 475both clang and nvcc: 476 477.. code-block:: c++ 478 479 struct HostS { ... }; 480 struct DeviceS { ... }; 481 482 #ifdef __NVCC__ 483 #ifndef __CUDA_ARCH__ 484 __host__ HostS MakeStruct() { return HostS(); } 485 #else 486 __device__ DeviceS MakeStruct() { return DeviceS(); } 487 #endif 488 #else 489 __host__ HostS MakeStruct() { return HostS(); } 490 __device__ DeviceS MakeStruct() { return DeviceS(); } 491 #endif 492 493 // Now host and device code can call MakeStruct(). 494 495Hopefully you don't have to do this sort of thing often. 496 497Optimizations 498============= 499 500Modern CPUs and GPUs are architecturally quite different, so code that's fast 501on a CPU isn't necessarily fast on a GPU. We've made a number of changes to 502LLVM to make it generate good GPU code. Among these changes are: 503 504* `Straight-line scalar optimizations <https://goo.gl/4Rb9As>`_ -- These 505 reduce redundancy within straight-line code. 506 507* `Aggressive speculative execution 508 <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_ 509 -- This is mainly for promoting straight-line scalar optimizations, which are 510 most effective on code along dominator paths. 511 512* `Memory space inference 513 <http://llvm.org/doxygen/NVPTXInferAddressSpaces_8cpp_source.html>`_ -- 514 In PTX, we can operate on pointers that are in a paricular "address space" 515 (global, shared, constant, or local), or we can operate on pointers in the 516 "generic" address space, which can point to anything. Operations in a 517 non-generic address space are faster, but pointers in CUDA are not explicitly 518 annotated with their address space, so it's up to LLVM to infer it where 519 possible. 520 521* `Bypassing 64-bit divides 522 <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ -- 523 This was an existing optimization that we enabled for the PTX backend. 524 525 64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs. 526 Many of the 64-bit divides in our benchmarks have a divisor and dividend 527 which fit in 32-bits at runtime. This optimization provides a fast path for 528 this common case. 529 530* Aggressive loop unrooling and function inlining -- Loop unrolling and 531 function inlining need to be more aggressive for GPUs than for CPUs because 532 control flow transfer in GPU is more expensive. More aggressive unrolling and 533 inlining also promote other optimizations, such as constant propagation and 534 SROA, which sometimes speed up code by over 10x. 535 536 (Programmers can force unrolling and inline using clang's `loop unrolling pragmas 537 <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_ 538 and ``__attribute__((always_inline))``.) 539 540Publication 541=========== 542 543The team at Google published a paper in CGO 2016 detailing the optimizations 544they'd made to clang/LLVM. Note that "gpucc" is no longer a meaningful name: 545The relevant tools are now just vanilla clang/LLVM. 546 547| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_ 548| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt 549| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)* 550| 551| `Slides from the CGO talk <http://wujingyue.com/docs/gpucc-talk.pdf>`_ 552| 553| `Tutorial given at CGO <http://wujingyue.com/docs/gpucc-tutorial.pdf>`_ 554 555Obtaining Help 556============== 557 558To obtain help on LLVM in general and its CUDA support, see `the LLVM 559community <http://llvm.org/docs/#mailing-lists>`_. 560