<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ -->
<html>
<head>
  <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  <title>Polly - GPGPU Code Generation</title>
  <link type="text/css" rel="stylesheet" href="../menu.css">
  <link type="text/css" rel="stylesheet" href="../content.css">
</head>
<body>
<div id="box">
<!--#include virtual="../menu.html.incl"-->
<div id="content">
  <!--*********************************************************************-->
  <h1>Polly - GPGPU Code Generation</h1>
  <!--*********************************************************************-->
<p><em>WARNING: This project was part of the Google Summer of Code 2012.
It is not finished; it is currently in the design and implementation stage.
The ideas/plans described here may not yet be implemented in Polly and may
change later on.</em></p>

<p>This project adds a GPGPU code generation feature to Polly.</p>

<h2>Objective</h2>
<p>The overall objective of this GSoC project is to create a preliminary
   implementation of GPGPU code generation for Polly. With this addition, users
   can use Polly to parallelize certain perfectly nested loops for execution on
   a heterogeneous platform composed of a CPU and a GPU.</p>
<p>There are several successful projects on automatic source-to-source GPU
   code transformation. C-to-CUDA[1] uses the standard Pluto algorithms to
   compute an affine schedule and then applies a wavefront transformation to
   obtain one sequential and n-1 parallel loops. The parallel loops are then
   mapped onto the blocks and threads of the GPU. PPCG[2] introduces advanced
   algorithms that can expose more parallelism than earlier methods, along with
   affine partition heuristics and code generation algorithms for locality
   enhancement in registers and shared memory.</p>
<p>Since automatic GPGPU code generation is a complex problem, and since we
   target a low-level intermediate representation, LLVM IR, rather than a
   high-level source language, it is important to set a proper objective as a
   first step towards a complete GPGPU code generation solution for LLVM
   IR.</p>
<p>First, we plan to target two kinds of relatively simple test cases. One
   consists of purely parallel, perfectly nested loops, like the following
   code.</p>
<pre>
parfor(int i=0 to M)
  parfor(int j=0 to N)
    LoopBody(i, j);
</pre>
<p>In the other, all loops are parallel except the innermost one, like
   this:</p>
<pre>
parfor(int i=0 to M)
  parfor(int j=0 to N)
    non-parfor(int k=0 to K)
      LoopBody(i, j, k);
</pre>
<p>The LoopBody part should be limited to instructions or function calls
   (intrinsics) which can be handled by LLVM's NVPTX backend.</p>
<p>Second, we focus on building a preliminary and scalable framework for GPGPU
   code generation in Polly. We therefore plan to employ relatively simple
   tiling and mapping algorithms and optimize them later.</p>
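<p>To make the two patterns concrete, the following C functions are
   hypothetical examples of kernels matching each pattern (the array sizes and
   statement bodies are made up for illustration; the matrix-multiply reduction
   over k is the classic case of a sequential innermost loop):</p>
<pre>
#define M 128
#define N 128
#define K 128

/* Pattern 1: a perfectly nested, fully parallel loop nest. */
void init(float A[M][N]) {
  for (int i = 0; i < M; i++)      /* parfor */
    for (int j = 0; j < N; j++)    /* parfor */
      A[i][j] = i + j;
}

/* Pattern 2: parallel outer loops, sequential innermost loop;
   the k loop carries a dependence through C[i][j]. */
void matmul(float A[M][K], float B[K][N], float C[M][N]) {
  for (int i = 0; i < M; i++)      /* parfor */
    for (int j = 0; j < N; j++)    /* parfor */
      for (int k = 0; k < K; k++)  /* sequential */
        C[i][j] += A[i][k] * B[k][j];
}
</pre>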
<h2>Work Flow</h2>
<h3>GPGPU Code Generation In General</h3>
<p>C-to-CUDA[1] and PPCG[2] propose similar steps to solve the automatic GPGPU
   code generation problem.</p>
<ol>
<li>Look for parallel loops.</li>
<li>Create a polyhedral model from the loops.</li>
<li>Tile and map the loops to GPU blocks and threads.</li>
<li>Determine where to place the data.</li>
</ol>
<h3>What has been done in Polly</h3>
<p>Polly already implements the first two steps and part of the third, along
   with many other analysis and transformation passes.</p>
<h3>What to do in Polly</h3>
<p>Unlike source-to-source optimizers such as C-to-CUDA and PPCG, Polly is a
   low-level optimizer, which means we can't use a source-level compiler
   (e.g. NVCC) to generate the final assembly for the device. We need to
   manually insert device driver API calls to execute the generated kernel
   assembly text.</p>
<p>In this project, we assume that the device driver library provides an
   interface to launch kernels in the form of assembly text. Fortunately, most
   of the mainstream GPU vendors provide such a feature in their products (see
   ptxjit of NVIDIA GPUs and CAL of AMD GPUs). Generally speaking, what we are
   going to do in Polly is:</p>
<ul>
<li>Find a way to tile the parallel loops.</li>
<li>Find a way to extract the loop body and transform it into thread-centric
    parallel code.</li>
<li>Find a way to store/load the thread-centric code into/from a device
    module.</li>
<li>Find a way to pass the target machine information and generate code of the
    device module for the target.</li>
<li>Find a way to map the tiled loops to GPU blocks and threads.</li>
<li>Find a way to insert CUDA synchronization operations on demand.</li>
<li>Find a way to generate the memory copy operations between a host and a
    device.</li>
<li>Implement/wrap a runtime library to serve as the execution engine for the
    generated device code.</li>
</ul>

<h3>The Work Flow</h3>
<p>In this section, we assume that the host CPU is X86 and the device is NVIDIA
   CUDA-compatible. We will use the following test case to describe our work
   flow.</p>
<pre>
for(i = 0; i < 128; i++)
  for(j = 0; j < 128; j++)
    A[i][j] = i*128 + j;
</pre>
<p>The work flow of our code generator is as follows.</p>
<p>1. We first use Polly's jscop file importer to get the desired 4-level
   parallel tiled code.</p>
<p>The "schedule" part of the pre-optimization jscop file is the following:</p>
<pre>
"schedule" : "{ Stmt_for_body3[i0, i1] -> schedule[0, i0, 0, i1, 0] }"
</pre>
<p>The jscop file describing the tiling transformation is:</p>
<pre>
"schedule" : "{ Stmt_for_body3[i0, i1] -> schedule[0, o0, o1, o2, o3]:
               o0 >= 0 and o0 <= 7 and o1 >= 0 and o1 <= 15 and
               o2 >= 0 and o2 <= 7 and o3 >= 0 and o3 <= 15 and
               i0 = 16o0 + o1 and i1 = 16o2 + o3 }"
</pre>
<p>We can test the schedule with the following command line.</p>
<pre>
opt -load /path/to/polly/build/LLVMPolly.so -basic-aa -polly-import-jscop
    -polly-ast -analyze -q ./test.ll
    -polly-import-jscop-postfix=transformed+gpu
</pre>
<p>The output for this schedule is:</p>
<pre>
for (c2=0;c2<=7;c2++) {
  for (c3=0;c3<=15;c3++) {
    for (c4=0;c4<=7;c4++) {
      for (c5=0;c5<=15;c5++) {
        Stmt_for_body3(16*c2+c3,16*c4+c5);
      }
    }
  }
}
</pre>
<p>We now have a 4-dimensional parallel loop nest with a single SCoP statement
   in it.</p>
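<p>To illustrate where this nest is heading: the two outer tile loops (c2, c4)
   will be mapped to GPU block indices and the two inner point loops (c3, c5)
   to thread indices. A hand-written CUDA kernel equivalent to what we want
   Polly to generate for this test case could look like the sketch below (the
   assignment of loops to the x and y dimensions is an assumption made for
   illustration; this is not the actual generated code):</p>
<pre>
/* Sketch only: launch with grid = dim3(8, 8) and block = dim3(16, 16),
   i.e. an 8x8 grid of 16x16 thread blocks. */
__global__ void kernel(int *A) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  /* i = 16*c2 + c3 */
  int j = blockIdx.y * blockDim.y + threadIdx.y;  /* j = 16*c4 + c5 */
  A[i * 128 + j] = i * 128 + j;                   /* Stmt_for_body3(i, j) */
}
</pre>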
<p>2. We then extract the loop body (or the innermost non-parallel loop) into
   an LLVM function, tagging it with the PTX_Kernel calling convention.</p>
<p>3. We extract the PTX_Kernel function into a temporary module, set the
   target triple (e.g. nvptx64-unknown-linux) for the module, transform the
   temporary module into a string, store it in the original module, and erase
   the PTX_Kernel function.</p>
<p>4. We replace the loops with their GPGPU counterpart. The GPGPU part of the
   code is composed of a call to the llvm.codegen intrinsic and calls to our
   GPU runtime library.</p>
<p>5. Finally, we generate the executable program with <em>llc</em> or run the
   optimized LLVM IR with a JIT compiler like <em>lli</em>.</p>
<h2>Usage</h2>
<p>1. Apply the llvm.codegen intrinsic patch to the LLVM code base.</p>
<pre>cd /path/to/llvm/source
git am /path/to/polly/source/utils/0001-Add-llvm.codegen-intrinsic.patch</pre>
<p>2. Build the test case.</p>
<pre>/path/to/polly/source/test/create_ll.sh test.c</pre>
<p>3. Get and edit the jscop file (taking the function "gpu_codegen" as an
   example).</p>
<pre>opt -load /path/to/polly/build/lib/LLVMPolly.so -basic-aa
    -polly-export-jscop ./test.ll
cp gpu_codegen___%for.cond---%for.end8.jscop
   gpu_codegen___%for.cond---%for.end8.jscop.transformed+gpu
vi gpu_codegen___%for.cond---%for.end8.jscop.transformed+gpu</pre>
<p><em>(Please refer to section "The Work Flow" on how to edit the "schedule"
   part of a statement.)</em></p>
<p>4. Optimize the code with GPGPU code generation.</p>
<pre>opt -load /path/to/polly/build/lib/LLVMPolly.so -basic-aa
    -polly-import-jscop-postfix=transformed+gpu -enable-polly-gpgpu
    -polly-gpgpu-triple=nvptx64-unknown-unknown -polly-codegen ./test.ll -S
    -o test.gpued.ll</pre>
<p>5. Build the final assembly and executable.</p>
<pre>llc test.gpued.ll -o test.s
gcc test.s -lGPURuntime -o test</pre>
<p><em>(Please make sure that LD_LIBRARY_PATH is set properly so that
   /path/to/polly/build/lib/libGPURuntime.so is visible to gcc.)</em></p>
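<p>The GPURuntime library linked in step 5 wraps the device driver interface
   mentioned above. Its actual interface is not documented on this page; as a
   rough illustration only, the sketch below shows how such a runtime might
   JIT-compile and launch a kernel given as PTX text through the CUDA driver
   API (the ptxjit feature). The helper name and parameters are hypothetical,
   and error checking is omitted.</p>
<pre>
#include &lt;cuda.h&gt;
#include &lt;stddef.h&gt;

/* Hypothetical helper: JIT the PTX string, look up the kernel entry, and
   launch it on a gx x gy grid of bx x by thread blocks. */
void launch_ptx(const char *ptx, const char *name, void **params,
                int gx, int gy, int bx, int by) {
  CUdevice dev;
  CUcontext ctx;
  CUmodule mod;
  CUfunction fn;

  cuInit(0);
  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, 0, dev);
  cuModuleLoadData(&mod, ptx);          /* JIT-compile the PTX text */
  cuModuleGetFunction(&fn, mod, name);  /* find the PTX_Kernel entry */
  cuLaunchKernel(fn, gx, gy, 1, bx, by, 1, 0, NULL, params, NULL);
  cuCtxSynchronize();
  cuModuleUnload(mod);
  cuCtxDestroy(ctx);
}
</pre>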
</em><br /> 221 Chris Gregg and Kim Hazelwood<br /> 222 International Symposium on Performance Analysis of Systems and Software 223 (ISPASS) 2011. 224</li> 225<p></p> 226</div> 227</div> 228</body> 229</html> 230