Requirements:

- automake, autoconf, libtool
    (not needed when compiling a release)
- pkg-config (http://www.freedesktop.org/wiki/Software/pkg-config)
    (not needed when compiling a release using the included isl and pet)
- gmp (http://gmplib.org/)
- libyaml (http://pyyaml.org/wiki/LibYAML)
    (only needed if you want to compile the pet executable)
- LLVM/clang libraries, 2.9 or higher (http://clang.llvm.org/get_started.html)
    Unless you have some other reason for wanting to use the svn version,
    it is best to install the latest release (3.9).
    For more details, see pet/README.

If you are installing on Ubuntu, then you can install the following packages:

automake autoconf libtool pkg-config libgmp3-dev libyaml-dev libclang-dev llvm

Note that you need at least version 3.2 of libclang-dev (Ubuntu raring).
Older versions of this package did not include the required libraries.
If you are using an older version of Ubuntu, then you need to compile and
install LLVM/clang from source.


Preparing:

Grab the latest release and extract it, or get the source from
the git repository as follows. This process requires autoconf,
automake, libtool and pkg-config.

    git clone git://repo.or.cz/ppcg.git
    cd ppcg
    ./get_submodules.sh
    ./autogen.sh


Compilation:

    ./configure
    make
    make check

If you have installed any of the required libraries in a non-standard
location, then you may need to use the --with-gmp-prefix,
--with-libyaml-prefix and/or --with-clang-prefix options
when calling "./configure".


Using PPCG to generate CUDA or OpenCL code

To convert a fragment of a C program to CUDA, insert a line containing

    #pragma scop

before the fragment and add a line containing

    #pragma endscop

after the fragment. To generate CUDA code, run

    ppcg --target=cuda file.c

where file.c is the file containing the fragment. The generated
code is stored in file_host.cu and file_kernel.cu.

To generate OpenCL code, run

    ppcg --target=opencl file.c

where file.c is the file containing the fragment. The generated code
is stored in file_host.c and file_kernel.cl.


Specifying tile, grid and block sizes

The iteration space tile size, grid size and block size can
be specified using the --sizes option. The argument is a union map
in isl notation mapping kernels identified by their sequence number
in a "kernel" space to singleton sets in the "tile", "grid" and "block"
spaces. The sizes are specified outermost to innermost.

The dimension of the "tile" space indicates the (maximal) number of loop
dimensions to tile. The elements of the single integer tuple
specify the tile sizes in each dimension.
In case of hybrid tiling, the first element is half the size of
the tile in the time (sequential) dimension. The second element
specifies the number of elements in the base of the hexagon.
The remaining elements specify the tile sizes in the remaining space
dimensions.

The dimension of the "grid" space indicates the (maximal) number of block
dimensions in the grid. The elements of the single integer tuple
specify the number of blocks in each dimension.

The dimension of the "block" space indicates the (maximal) number of thread
dimensions in a block. The elements of the single integer tuple
specify the number of threads in each dimension.
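
As an illustration (the size values and the file name below are arbitrary,
not recommendations), such a map is passed as a single quoted argument so
that the shell does not interpret the braces and semicolons:

    ppcg --target=cuda --sizes="{ kernel[i] -> tile[32,32]; kernel[i] -> grid[256,256]; kernel[i] -> block[16,16] }" file.c

This requests, for every kernel i, 32x32 tiles, at most 256x256 blocks in
the grid and 16x16 threads per block.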

As a further example,

    { kernel[0] -> tile[64,64]; kernel[i] -> block[16] : i != 4 }

specifies that in kernel 0, two loops should be tiled with a tile
size of 64 in both dimensions and that all kernels except kernel 4
should be run using a block of 16 threads.

Since PPCG performs some scheduling, it can be difficult to predict
what exactly will end up in a kernel. If you want to specify
tile, grid or block sizes, you may want to run PPCG first with the defaults,
examine the kernels and then run PPCG again with the desired sizes.
Instead of examining the kernels, you can also specify the option
--dump-sizes on the first run to obtain the effectively used default sizes.


Compiling the generated CUDA code with nvcc

To get optimal performance from nvcc, it is important to choose --arch
according to your target GPU. Specifically, use the flag "--arch sm_20"
for Fermi, "--arch sm_30" for GK10x Kepler and "--arch sm_35" for
GK110 Kepler. We discourage the use of older cards as we have seen
correctness issues with compilation for older architectures.
Note that in the absence of any --arch flag, nvcc defaults to
"--arch sm_13". This will not only be slower, but can also cause
correctness issues.
If you want to obtain results that are identical to those obtained
by the original code, then you may need to disable some optimizations
by passing the "--fmad=false" option.


Compiling the generated OpenCL code with gcc

To compile the host code, you need to link against the file
ocl_utilities.c, which contains utility functions used by the generated
OpenCL host code. To compile the host code with gcc, run

    gcc -std=c99 file_host.c ocl_utilities.c -lOpenCL

Note that we have experienced the generated OpenCL code freezing
on some inputs (e.g., the PolyBench symm benchmark) when using
at least some versions of the Nvidia OpenCL library, while the
corresponding CUDA code runs fine.
We have experienced no such freezes when using the AMD, ARM or Intel
OpenCL libraries.

By default, the compiled executable needs the _kernel.cl file at
run time. Alternatively, the option --opencl-embed-kernel-code may be
given to place the kernel code in a string literal. The kernel code is
then compiled into the host binary, so that the _kernel.cl file is no
longer needed at run time. Any kernel include files, in particular
those supplied using --opencl-include-file, will still be required at
run time.


Function calls

Function calls inside the analyzed fragment are reproduced
in the CUDA or OpenCL code, but for now it is left to the user
to make sure that the functions that are being called are
available from the generated kernels.

In the case of OpenCL code, the --opencl-include-file option
may be used to specify one or more files to be #include'd
from the generated code. These files may then contain
the definitions of the functions being called from the
program fragment. If the pathnames of the included files
are relative to the current directory, then you may need
to additionally specify --opencl-compiler-options=-I.
to make sure that the files can be found by the OpenCL compiler.
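
For instance, such an include file could provide a small helper function
that is called inside the annotated fragment. The file name my_funcs.h and
the function my_scale below are purely illustrative; file.c is the input
file from the examples above.

    /* my_funcs.h: hypothetical include file providing a helper
     * function that the annotated fragment calls */
    float my_scale(float x)
    {
        return 2.0f * x;
    }

The file can then be made available to the generated kernels with an
invocation along the lines of

    ppcg --target=opencl --opencl-include-file=my_funcs.h \
        --opencl-compiler-options=-I. file.c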

The included files may also contain definitions of types used by the
generated kernels. By default, PPCG generates definitions for
types as needed, but these definitions may collide with those in
the included files, as PPCG does not consider the contents of the
included files. The --no-opencl-print-kernel-types option prevents
PPCG from generating type definitions.


GNU extensions

By default, PPCG may print out macro definitions that involve
GNU extensions such as __typeof__ and statement expressions.
Some compilers may not support these extensions.
In particular, OpenCL 1.2 beignet 1.1.1 (git-6de6918)
has been reported not to support __typeof__.
The use of these extensions can be turned off with the
--no-allow-gnu-extensions option.


Processing PolyBench

When processing a PolyBench/C 3.2 benchmark, you should always specify
-DPOLYBENCH_USE_C99_PROTO on the ppcg command line. Otherwise, the source
files are inconsistent, having fixed-size arrays but parametrically
bounded loops iterating over them.
However, you should not specify this define when compiling
the PPCG-generated code using nvcc, since CUDA does not support VLAs.


CUDA and function overloading

While CUDA supports function overloading based on the argument types,
no such function overloading exists in the input language C. Since PPCG
simply prints out the same function name as in the original code, this
may result in a different function being called depending on the types
of the arguments. For example, if the original code contains a call
to the function sqrt() with a float argument, then the argument will
be promoted to a double and the sqrt() function will be called.
In the transformed (CUDA) code, however, overloading will cause the
function sqrtf() to be called. Until this issue has been resolved in PPCG,
we recommend that users either explicitly call the function sqrtf() or
explicitly cast the argument to double in the input code.


Contact

For bug reports, feature requests and questions,
contact http://groups.google.com/group/isl-development

Whenever you report a bug, please mention the exact version of PPCG
that you are using (output of "./ppcg --version"). If you are unable
to compile PPCG, then report the git version (output of "git describe")
or the version number included in the name of the tarball.


Citing PPCG

If you use PPCG for your research, you are invited to cite
the following paper.

@article{Verdoolaege2013PPCG,
    author = {Verdoolaege, Sven and Juega, Juan Carlos and Cohen, Albert and
              G\'{o}mez, Jos{\'e} Ignacio and Tenllado, Christian and
              Catthoor, Francky},
    title = {Polyhedral parallel code generation for CUDA},
    journal = {ACM Trans. Archit. Code Optim.},
    issue_date = {January 2013},
    volume = {9},
    number = {4},
    month = jan,
    year = {2013},
    issn = {1544-3566},
    pages = {54:1--54:23},
    doi = {10.1145/2400682.2400713},
    acmid = {2400713},
    publisher = {ACM},
    address = {New York, NY, USA},
}