=============================
User Guide for NVPTX Back-end
=============================

.. contents::
   :local:
   :depth: 3


Introduction
============

To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
along with a defined set of conventions used to represent GPU programming
concepts. This document provides an overview of the general usage of the
back-end, including a description of the conventions used and the set of
accepted LLVM IR.

.. note::

   This document assumes a basic familiarity with CUDA and the PTX
   assembly language. Information about the CUDA Driver API and the PTX assembly
   language can be found in the `CUDA documentation
   <http://docs.nvidia.com/cuda/index.html>`_.



Conventions
===========

Marking Functions as Kernels
----------------------------

In PTX, there are two types of functions: *device functions*, which are only
callable by device code, and *kernel functions*, which are callable by host
code. By default, the back-end will emit device functions. Metadata is used to
declare a function as a kernel function. This metadata is attached to the
``nvvm.annotations`` named metadata object, and has the following format:

.. code-block:: text

  !0 = !{<function-ref>, metadata !"kernel", i32 1}

The first parameter is a reference to the kernel function. The following
example shows a kernel function calling a device function in LLVM IR. The
function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.

.. code-block:: llvm

  define float @my_fmad(float %x, float %y, float %z) {
    %mul = fmul float %x, %y
    %add = fadd float %mul, %z
    ret float %add
  }

  define void @my_kernel(float* %ptr) {
    %val = load float, float* %ptr
    %ret = call float @my_fmad(float %val, float %val, float %val)
    store float %ret, float* %ptr
    ret void
  }

  !nvvm.annotations = !{!1}
  !1 = !{void (float*)* @my_kernel, !"kernel", i32 1}

When compiled, the PTX kernel functions are callable by host-side code.


.. _address_spaces:

Address Spaces
--------------

The NVPTX back-end uses the following address space mapping:

============= ======================
Address Space Memory Space
============= ======================
0             Generic
1             Global
2             Internal Use
3             Shared
4             Constant
5             Local
============= ======================

Every global variable and pointer type is assigned to one of these address
spaces, with 0 being the default address space. Intrinsics are provided which
can be used to convert pointers between the generic and non-generic address
spaces.

As an example, the following IR will define an array ``@g`` that resides in
global device memory.

.. code-block:: llvm

  @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

LLVM IR functions can read and write to this array, and host-side code can
copy data to it by name with the CUDA Driver API.

Note that since address space 0 is the generic space, it is illegal to have
global variables in address space 0. Address space 0 is the default address
space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
variables.


Triples
-------

The NVPTX target uses the module triple to select between 32/64-bit code
generation and the driver-compiler interface to use. The triple architecture
can be one of ``nvptx`` (32-bit PTX) or ``nvptx64`` (64-bit PTX). The
operating system should be one of ``cuda`` or ``nvcl``, which determines the
interface used by the generated code to communicate with the driver. Most
users will want to use ``cuda`` as the operating system, which makes the
generated PTX compatible with the CUDA Driver API.

Example: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda``

Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda``



.. _nvptx_intrinsics:

NVPTX Intrinsics
================

Address Space Conversion
------------------------

'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics. You can use these on any pointer types.

.. code-block:: llvm

  declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
  declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
  declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
  declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
address space to a generic address space pointer.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid generic address space
pointer.


'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics. You can use these on any pointer types.

.. code-block:: llvm

  declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
  declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
  declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
  declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
address space to a pointer in the target address space. Note that these
intrinsics are only useful if the target address space of the pointer is
known. It is not legal to use address space conversion intrinsics to convert
a pointer from one non-generic address space to another non-generic address
space.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid pointer in the target
non-generic address space.

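For illustration, here is a minimal sketch (the global and function names are
hypothetical, not part of the NVPTX API) that converts a pointer to a
``shared`` array into a generic pointer, so that an ordinary generic-space
``load`` can be used:

.. code-block:: llvm

  @scratch = internal addrspace(3) global [16 x float] zeroinitializer

  declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)

  define float @read_first_shared() {
    ; View the shared array as an i8 pointer in address space 3.
    %raw = bitcast [16 x float] addrspace(3)* @scratch to i8 addrspace(3)*
    ; Convert the shared-space pointer to a generic pointer.
    %gen = call i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)* %raw)
    %fp  = bitcast i8* %gen to float*
    ; The access is now expressed through the generic address space.
    %v = load float, float* %fp, align 4
    ret float %v
  }
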
Reading PTX Special Registers
-----------------------------

'``llvm.nvvm.read.ptx.sreg.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()

Overview:
"""""""""

The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
special registers, in particular the kernel launch bounds. These registers
map in the following way to CUDA builtins:

============= =====================================
CUDA Builtin  PTX Special Register Intrinsic
============= =====================================
``threadIdx`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
``blockIdx``  ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
``blockDim``  ``@llvm.nvvm.read.ptx.sreg.ntid.*``
``gridDim``   ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
============= =====================================

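A common use of these intrinsics is computing a thread's global index across
the whole grid. The following sketch (the function name is illustrative) is
the LLVM IR equivalent of the CUDA expression
``blockIdx.x * blockDim.x + threadIdx.x``:

.. code-block:: llvm

  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()

  define i32 @global_thread_id() {
    %tid   = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
    %ntid  = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    ; gid = ctaid.x * ntid.x + tid.x
    %base  = mul i32 %ctaid, %ntid
    %gid   = add i32 %base, %tid
    ret i32 %gid
  }
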
Barriers
--------

'``llvm.nvvm.barrier0``'
^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.barrier0()

Overview:
"""""""""

The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
instruction, equivalent to the ``__syncthreads()`` call in CUDA.

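As a sketch of typical usage (the function and global names are hypothetical,
and kernel metadata is omitted for brevity), the following device code
reverses 16 floats within a single 16-thread CTA through a ``shared`` buffer;
the barrier guarantees that every thread has written its element before any
thread reads its mirrored one:

.. code-block:: llvm

  @sdata = internal addrspace(3) global [16 x float] zeroinitializer

  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
  declare void @llvm.nvvm.barrier0()

  define void @reverse16(float addrspace(1)* %in, float addrspace(1)* %out) {
  entry:
    %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()

    ; Stage this thread's element into shared memory.
    %inptr = getelementptr float, float addrspace(1)* %in, i32 %tid
    %val = load float, float addrspace(1)* %inptr, align 4
    %sptr = getelementptr [16 x float], [16 x float] addrspace(3)* @sdata,
                          i32 0, i32 %tid
    store float %val, float addrspace(3)* %sptr, align 4

    ; Wait until every thread in the CTA has completed its store.
    call void @llvm.nvvm.barrier0()

    ; Read the mirrored element and write it back to global memory.
    %mirror = sub i32 15, %tid
    %rptr = getelementptr [16 x float], [16 x float] addrspace(3)* @sdata,
                          i32 0, i32 %mirror
    %rval = load float, float addrspace(3)* %rptr, align 4
    %outptr = getelementptr float, float addrspace(1)* %out, i32 %tid
    store float %rval, float addrspace(1)* %outptr, align 4
    ret void
  }
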
Other Intrinsics
----------------

For the full set of NVPTX intrinsics, please see the
``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.


.. _libdevice:

Linking with Libdevice
======================

The CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that
implements many common mathematical functions. This library can be used as a
high-performance math library by any compiler that uses the LLVM NVPTX
target. The library can be found under ``nvvm/libdevice/`` in the CUDA
Toolkit, and there is a separate version for each compute architecture.

For a list of all math functions implemented in libdevice, see
`libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_.

To accommodate various math-related compiler flags that can affect code
generation of libdevice code, the library code depends on a special LLVM IR
pass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This
pass looks for calls to the ``@__nvvm_reflect`` function and replaces them
with constants based on the defined reflection parameters. Such conditional
code often follows a pattern:

.. code-block:: c++

  float my_function(float a) {
    if (__nvvm_reflect("FASTMATH"))
      return my_function_fast(a);
    else
      return my_function_precise(a);
  }

The default value for all unspecified reflection parameters is zero.

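In LLVM IR, the pattern above corresponds roughly to the following sketch
(the string constant and function names mirror the C++ example and are
illustrative). ``NVVMReflect`` folds the ``@__nvvm_reflect`` call to a
constant, and subsequent optimizations delete the branch that becomes dead:

.. code-block:: llvm

  @str = private unnamed_addr constant [9 x i8] c"FASTMATH\00"

  declare i32 @__nvvm_reflect(i8*)
  declare float @my_function_fast(float)
  declare float @my_function_precise(float)

  define float @my_function(float %a) {
  entry:
    %ptr = getelementptr [9 x i8], [9 x i8]* @str, i32 0, i32 0
    ; Replaced by a constant when NVVMReflect runs.
    %val = call i32 @__nvvm_reflect(i8* %ptr)
    %use_fast = icmp ne i32 %val, 0
    br i1 %use_fast, label %fast, label %precise
  fast:
    %f = call float @my_function_fast(float %a)
    ret float %f
  precise:
    %p = call float @my_function_precise(float %a)
    ret float %p
  }
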
The ``NVVMReflect`` pass should be executed early in the optimization
pipeline, immediately after the link stage. The ``internalize`` pass is also
recommended to remove unused math functions from the resulting PTX. For an
input IR module ``module.bc``, the following compilation flow is recommended:

1. Save list of external functions in ``module.bc``
2. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc``
3. Internalize all functions not in list from (1)
4. Eliminate all unused internal functions
5. Run ``NVVMReflect`` pass
6. Run standard optimization pipeline

.. note::

   ``linkonce`` and ``linkonce_odr`` linkage types are not suitable for the
   libdevice functions, because it is possible to link two IR modules that
   have been linked against libdevice using different reflection variables.

Since the ``NVVMReflect`` pass replaces conditionals with constants, it will
often leave behind dead code of the form:

.. code-block:: llvm

  entry:
    ..
    br i1 true, label %foo, label %bar
  foo:
    ..
  bar:
    ; Dead code
    ..

Therefore, it is recommended that ``NVVMReflect`` be executed early in the
optimization pipeline, before dead-code elimination.

The NVPTX TargetMachine knows how to schedule ``NVVMReflect`` at the beginning
of your pass manager; just use the following code when setting it up:

.. code-block:: c++

  std::unique_ptr<TargetMachine> TM = ...;
  PassManagerBuilder PMBuilder(...);
  if (TM)
    TM->adjustPassManager(PMBuilder);

Reflection Parameters
---------------------

The libdevice library currently uses the following reflection parameters to
control code generation:

==================== ======================================================
Flag                 Description
==================== ======================================================
``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero
==================== ======================================================

The value of this flag is determined by the "nvvm-reflect-ftz" module flag.
The following sets the ftz flag to 1.

.. code-block:: llvm

  !llvm.module.flags = !{!0}
  !0 = !{i32 4, !"nvvm-reflect-ftz", i32 1}

(``i32 4`` indicates that the value set here overrides the value in another
module we link with. See the `LangRef <LangRef.html#module-flags-metadata>`_
for details.)

Executing PTX
=============

The most common way to execute PTX assembly on a GPU device is to use the CUDA
Driver API. This API is a low-level interface to the GPU driver and allows for
JIT compilation of PTX code to native GPU machine code.

Initializing the Driver API:

.. code-block:: c++

  CUdevice device;
  CUcontext context;

  // Initialize the driver API
  cuInit(0);
  // Get a handle to the first compute device
  cuDeviceGet(&device, 0);
  // Create a compute device context
  cuCtxCreate(&context, 0, device);

JIT compiling a PTX string to a device binary:

.. code-block:: c++

  CUmodule module;
  CUfunction function;

  // JIT compile a null-terminated PTX string
  cuModuleLoadData(&module, (void*)PTXString);

  // Get a handle to the "myfunction" kernel function
  cuModuleGetFunction(&function, module, "myfunction");

For full examples of executing PTX assembly, please see the `CUDA Samples
<https://developer.nvidia.com/cuda-downloads>`_ distribution.


Common Issues
=============

ptxas complains of undefined function: __nvvm_reflect
-----------------------------------------------------

When linking with libdevice, the ``NVVMReflect`` pass must be used. See
:ref:`libdevice` for more information.


Tutorial: A Simple Compute Kernel
=================================

To start, let us take a look at a simple compute kernel written directly in
LLVM IR. The kernel implements vector addition, where each thread computes one
element of the output vector C from the input vectors A and B. To make this
easier, we also assume that only a single CTA (thread block) will be launched,
and that it will be one dimensional.


The Kernel
----------

.. code-block:: llvm

  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
  target triple = "nvptx64-nvidia-cuda"

  ; Intrinsic to read X component of thread ID
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

  define void @kernel(float addrspace(1)* %A,
                      float addrspace(1)* %B,
                      float addrspace(1)* %C) {
  entry:
    ; What is my ID?
    %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

    ; Compute pointers into A, B, and C
    %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id
    %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id
    %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id

    ; Read A, B
    %valA = load float, float addrspace(1)* %ptrA, align 4
    %valB = load float, float addrspace(1)* %ptrB, align 4

    ; Compute C = A + B
    %valC = fadd float %valA, %valB

    ; Store back to C
    store float %valC, float addrspace(1)* %ptrC, align 4

    ret void
  }

  !nvvm.annotations = !{!0}
  !0 = !{void (float addrspace(1)*,
               float addrspace(1)*,
               float addrspace(1)*)* @kernel, !"kernel", i32 1}


We can use the LLVM ``llc`` tool to directly run the NVPTX code generator:

.. code-block:: text

  # llc -mcpu=sm_20 kernel.ll -o kernel.ptx


.. note::

   If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32``
   in the module data layout string and use ``nvptx-nvidia-cuda`` as the
   target triple.


The output we get from ``llc`` (as of LLVM 3.4):

.. code-block:: text

  //
  // Generated by LLVM NVPTX Back-End
  //

  .version 3.1
  .target sm_20
  .address_size 64

    // .globl kernel
                                          // @kernel
  .visible .entry kernel(
    .param .u64 kernel_param_0,
    .param .u64 kernel_param_1,
    .param .u64 kernel_param_2
  )
  {
    .reg .f32   %f<4>;
    .reg .s32   %r<2>;
    .reg .s64   %rl<8>;

  // %bb.0:                               // %entry
    ld.param.u64    %rl1, [kernel_param_0];
    mov.u32         %r1, %tid.x;
    mul.wide.s32    %rl2, %r1, 4;
    add.s64         %rl3, %rl1, %rl2;
    ld.param.u64    %rl4, [kernel_param_1];
    add.s64         %rl5, %rl4, %rl2;
    ld.param.u64    %rl6, [kernel_param_2];
    add.s64         %rl7, %rl6, %rl2;
    ld.global.f32   %f1, [%rl3];
    ld.global.f32   %f2, [%rl5];
    add.f32         %f3, %f1, %f2;
    st.global.f32   [%rl7], %f3;
    ret;
  }


Dissecting the Kernel
---------------------

Now let us dissect the LLVM IR that makes up this kernel.

Data Layout
^^^^^^^^^^^

The data layout string determines the size in bits of common data types, their
ABI alignment, and their storage size. For NVPTX, you should use one of the
following:

32-bit PTX:

.. code-block:: llvm

  target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"

64-bit PTX:

.. code-block:: llvm

  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"


Target Intrinsics
^^^^^^^^^^^^^^^^^

In this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to
read the X component of the current thread's ID, which corresponds to a read
of register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of
intrinsics. A short list is shown below; please see
``include/llvm/IR/IntrinsicsNVVM.td`` for the full list.


================================================ ====================
Intrinsic                                        CUDA Equivalent
================================================ ====================
``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}``     threadIdx.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}``   blockIdx.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}``    blockDim.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}``  gridDim.{x,y,z}
``void @llvm.nvvm.barrier0()``                   __syncthreads()
================================================ ====================

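Because the tutorial assumes a single one-dimensional CTA, the thread ID alone
is a sufficient element index. As a sketch of how the kernel would generalize
to multiple CTAs (this variant is illustrative, not part of the tutorial
files), the index can instead be computed from the block and grid intrinsics
listed above:

.. code-block:: llvm

  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() readnone nounwind
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x() readnone nounwind

  define void @kernel_multi_cta(float addrspace(1)* %A,
                                float addrspace(1)* %B,
                                float addrspace(1)* %C) {
  entry:
    ; id = blockIdx.x * blockDim.x + threadIdx.x
    %tid   = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    %ctaid = tail call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
    %ntid  = tail call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    %base  = mul i32 %ctaid, %ntid
    %id    = add i32 %base, %tid

    ; The rest of the kernel is unchanged. There is no bounds check,
    ; so the launch must cover exactly the vector length.
    %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id
    %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id
    %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id
    %valA = load float, float addrspace(1)* %ptrA, align 4
    %valB = load float, float addrspace(1)* %ptrB, align 4
    %valC = fadd float %valA, %valB
    store float %valC, float addrspace(1)* %ptrC, align 4
    ret void
  }
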
Address Spaces
^^^^^^^^^^^^^^

You may have noticed that all of the pointer types in the LLVM IR example had
an explicit address space specifier. What is address space 1? NVIDIA GPU
devices (generally) have four types of memory:

- Global: Large, off-chip memory
- Shared: Small, on-chip memory shared among all threads in a CTA
- Local: Per-thread, private memory
- Constant: Read-only memory shared across all threads

These different types of memory are represented in LLVM IR as address spaces.
There is also a fifth address space used by the NVPTX code generator that
corresponds to the "generic" address space. This address space can represent
addresses in any other address space (with a few exceptions). This allows
users to write IR functions that can load/store memory using the same
instructions. Intrinsics are provided to convert pointers between the generic
and non-generic address spaces.

See :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information.


Kernel Metadata
^^^^^^^^^^^^^^^

In PTX, a function can be either a `kernel` function (callable from the host
program), or a `device` function (callable only from GPU code). You can think
of `kernel` functions as entry-points in the GPU program. To mark an LLVM IR
function as a `kernel` function, we make use of special LLVM metadata. The
NVPTX back-end will look for a named metadata node called
``nvvm.annotations``. This named metadata must contain a list of metadata that
describe the IR. For our purposes, we need to declare a metadata node that
assigns the "kernel" attribute to the LLVM IR function that should be emitted
as a PTX `kernel` function. These metadata nodes take the form:

.. code-block:: text

  !{<function ref>, metadata !"kernel", i32 1}

For the previous example, we have:

.. code-block:: llvm

  !nvvm.annotations = !{!0}
  !0 = !{void (float addrspace(1)*,
               float addrspace(1)*,
               float addrspace(1)*)* @kernel, !"kernel", i32 1}

Here, we have a single metadata declaration in ``nvvm.annotations``. This
metadata annotates our ``@kernel`` function with the ``kernel`` attribute.


Running the Kernel
------------------

Generating PTX from LLVM IR is all well and good, but how do we execute it on
a real GPU device? The CUDA Driver API provides a convenient mechanism for
loading PTX, JIT compiling it to native GPU machine code, and launching a
kernel. The API is similar to OpenCL. A simple example showing how to load and
execute our vector addition code is shown below. Note that for brevity this
code does not perform much error checking!

.. note::

   You can also use the ``ptxas`` tool provided by the CUDA Toolkit to compile
   PTX offline to machine code (SASS) for a specific GPU architecture. Such
   binaries can be loaded by the CUDA Driver API in the same way as PTX. This
   can be useful for reducing startup time by precompiling the PTX kernels.

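For instance, a sketch of such an offline compilation step (the target
architecture and file names are illustrative):

.. code-block:: text

  # ptxas -arch=sm_20 kernel.ptx -o kernel.cubin
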
.. code-block:: c++

  #include <iostream>
  #include <fstream>
  #include <cassert>
  #include "cuda.h"


  void checkCudaErrors(CUresult err) {
    assert(err == CUDA_SUCCESS);
  }

  /// main - Program entry point
  int main(int argc, char **argv) {
    CUdevice    device;
    CUmodule    cudaModule;
    CUcontext   context;
    CUfunction  function;
    CUlinkState linker;
    int         devCount;

    // CUDA initialization
    checkCudaErrors(cuInit(0));
    checkCudaErrors(cuDeviceGetCount(&devCount));
    checkCudaErrors(cuDeviceGet(&device, 0));

    char name[128];
    checkCudaErrors(cuDeviceGetName(name, 128, device));
    std::cout << "Using CUDA Device [0]: " << name << "\n";

    int devMajor, devMinor;
    checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
    std::cout << "Device Compute Capability: "
              << devMajor << "." << devMinor << "\n";
    if (devMajor < 2) {
      std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
      return 1;
    }

    std::ifstream t("kernel.ptx");
    if (!t.is_open()) {
      std::cerr << "kernel.ptx not found\n";
      return 1;
    }
    std::string str((std::istreambuf_iterator<char>(t)),
                      std::istreambuf_iterator<char>());

    // Create driver context
    checkCudaErrors(cuCtxCreate(&context, 0, device));

    // Create module for object
    checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0));

    // Get kernel function
    checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel"));

    // Device data
    CUdeviceptr devBufferA;
    CUdeviceptr devBufferB;
    CUdeviceptr devBufferC;

    checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16));
    checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16));
    checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16));

    float* hostA = new float[16];
    float* hostB = new float[16];
    float* hostC = new float[16];

    // Populate input
    for (unsigned i = 0; i != 16; ++i) {
      hostA[i] = (float)i;
      hostB[i] = (float)(2*i);
      hostC[i] = 0.0f;
    }

    checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16));
    checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16));


    unsigned blockSizeX = 16;
    unsigned blockSizeY = 1;
    unsigned blockSizeZ = 1;
    unsigned gridSizeX  = 1;
    unsigned gridSizeY  = 1;
    unsigned gridSizeZ  = 1;

    // Kernel parameters
    void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC };

    std::cout << "Launching kernel\n";

    // Kernel launch
    checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ,
                                   blockSizeX, blockSizeY, blockSizeZ,
                                   0, NULL, KernelParams, NULL));

    // Retrieve device data
    checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16));


    std::cout << "Results:\n";
    for (unsigned i = 0; i != 16; ++i) {
      std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n";
    }


    // Clean up after ourselves
    delete [] hostA;
    delete [] hostB;
    delete [] hostC;

    // Clean-up
    checkCudaErrors(cuMemFree(devBufferA));
    checkCudaErrors(cuMemFree(devBufferB));
    checkCudaErrors(cuMemFree(devBufferC));
    checkCudaErrors(cuModuleUnload(cudaModule));
    checkCudaErrors(cuCtxDestroy(context));

    return 0;
  }

You will need to link with the CUDA driver and specify the path to cuda.h.

.. code-block:: text

  # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda

We don't need to specify a path to ``libcuda.so`` since this is installed in a
system location by the driver, not the CUDA toolkit.

If everything goes as planned, you should see the following output when
running the compiled program:

.. code-block:: text

  Using CUDA Device [0]: GeForce GTX 680
  Device Compute Capability: 3.0
  Launching kernel
  Results:
  0 + 0 = 0
  1 + 2 = 3
  2 + 4 = 6
  3 + 6 = 9
  4 + 8 = 12
  5 + 10 = 15
  6 + 12 = 18
  7 + 14 = 21
  8 + 16 = 24
  9 + 18 = 27
  10 + 20 = 30
  11 + 22 = 33
  12 + 24 = 36
  13 + 26 = 39
  14 + 28 = 42
  15 + 30 = 45

.. note::

   You will likely see a different device identifier based on your hardware.

Tutorial: Linking with Libdevice
================================

In this tutorial, we show a simple example of linking LLVM IR with the
libdevice library. We will use the same kernel as the previous tutorial,
except that we will compute ``C = pow(A, B)`` instead of ``C = A + B``.
Libdevice provides an ``__nv_powf`` function that we will use.

.. code-block:: llvm

  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
  target triple = "nvptx64-nvidia-cuda"

  ; Intrinsic to read X component of thread ID
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
  ; libdevice function
  declare float @__nv_powf(float, float)

  define void @kernel(float addrspace(1)* %A,
                      float addrspace(1)* %B,
                      float addrspace(1)* %C) {
  entry:
    ; What is my ID?
    %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

    ; Compute pointers into A, B, and C
    %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id
    %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id
    %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id

    ; Read A, B
    %valA = load float, float addrspace(1)* %ptrA, align 4
    %valB = load float, float addrspace(1)* %ptrB, align 4

    ; Compute C = pow(A, B)
    %valC = call float @__nv_powf(float %valA, float %valB)

    ; Store back to C
    store float %valC, float addrspace(1)* %ptrC, align 4

    ret void
  }

  !nvvm.annotations = !{!0}
  !0 = !{void (float addrspace(1)*,
               float addrspace(1)*,
               float addrspace(1)*)* @kernel, !"kernel", i32 1}


To compile this kernel, we perform the following steps:

1. Link with libdevice
2. Internalize all but the public kernel function
3. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0
4. Optimize the linked module
5. Codegen the module


These steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc``
tools. In a complete compiler, these steps can also be performed entirely
programmatically by setting up an appropriate pass configuration (see
:ref:`libdevice`).

.. code-block:: text

  # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc
  # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc
  # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx

.. note::

   The ``-nvvm-reflect-list=__CUDA_FTZ=0`` option is not strictly required,
   as any undefined variables will default to zero. It is shown here for
   illustration purposes.


This gives us the following PTX (excerpt):

.. code-block:: text

  //
  // Generated by LLVM NVPTX Back-End
  //

  .version 3.1
  .target sm_20
  .address_size 64

    // .globl kernel
                                          // @kernel
  .visible .entry kernel(
    .param .u64 kernel_param_0,
    .param .u64 kernel_param_1,
    .param .u64 kernel_param_2
  )
  {
    .reg .pred  %p<30>;
    .reg .f32   %f<111>;
    .reg .s32   %r<21>;
    .reg .s64   %rl<8>;

  // %bb.0:                               // %entry
    ld.param.u64    %rl2, [kernel_param_0];
    mov.u32         %r3, %tid.x;
    ld.param.u64    %rl3, [kernel_param_1];
    mul.wide.s32    %rl4, %r3, 4;
    add.s64         %rl5, %rl2, %rl4;
    ld.param.u64    %rl6, [kernel_param_2];
    add.s64         %rl7, %rl3, %rl4;
    add.s64         %rl1, %rl6, %rl4;
    ld.global.f32   %f1, [%rl5];
    ld.global.f32   %f2, [%rl7];
    setp.eq.f32     %p1, %f1, 0f3F800000;
    setp.eq.f32     %p2, %f2, 0f00000000;
    or.pred         %p3, %p1, %p2;
    @%p3 bra        BB0_1;
    bra.uni         BB0_2;
  BB0_1:
    mov.f32         %f110, 0f3F800000;
    st.global.f32   [%rl1], %f110;
    ret;
  BB0_2:                                  // %__nv_isnanf.exit.i
    abs.f32         %f4, %f1;
    setp.gtu.f32    %p4, %f4, 0f7F800000;
    @%p4 bra        BB0_4;
  // %bb.3:                               // %__nv_isnanf.exit5.i
    abs.f32         %f5, %f2;
    setp.le.f32     %p5, %f5, 0f7F800000;
    @%p5 bra        BB0_5;
  BB0_4:                                  // %.critedge1.i
    add.f32         %f110, %f1, %f2;
    st.global.f32   [%rl1], %f110;
    ret;
  BB0_5:                                  // %__nv_isinff.exit.i

    ...

  BB0_26:                                 // %__nv_truncf.exit.i.i.i.i.i
    mul.f32         %f90, %f107, 0f3FB8AA3B;
    cvt.rzi.f32.f32 %f91, %f90;
    mov.f32         %f92, 0fBF317200;
    fma.rn.f32      %f93, %f91, %f92, %f107;
    mov.f32         %f94, 0fB5BFBE8E;
    fma.rn.f32      %f95, %f91, %f94, %f93;
    mul.f32         %f89, %f95, 0f3FB8AA3B;
    // inline asm
    ex2.approx.ftz.f32 %f88,%f89;
    // inline asm
    add.f32         %f96, %f91, 0f00000000;
    ex2.approx.f32  %f97, %f96;
    mul.f32         %f98, %f88, %f97;
    setp.lt.f32     %p15, %f107, 0fC2D20000;
    selp.f32        %f99, 0f00000000, %f98, %p15;
    setp.gt.f32     %p16, %f107, 0f42D20000;
    selp.f32        %f110, 0f7F800000, %f99, %p16;
    setp.eq.f32     %p17, %f110, 0f7F800000;
    @%p17 bra       BB0_28;
  // %bb.27:
    fma.rn.f32      %f110, %f110, %f108, %f110;
  BB0_28:                                 // %__internal_accurate_powf.exit.i
    setp.lt.f32     %p18, %f1, 0f00000000;
    setp.eq.f32     %p19, %f3, 0f3F800000;
    and.pred        %p20, %p18, %p19;
    @!%p20 bra      BB0_30;
    bra.uni         BB0_29;
  BB0_29:
    mov.b32         %r9, %f110;
    xor.b32         %r10, %r9, -2147483648;
    mov.b32         %f110, %r10;
  BB0_30:                                 // %__nv_powf.exit
    st.global.f32   [%rl1], %f110;
    ret;
  }