/*
 * Copyright (C) 2019 Collabora, Ltd.
 *
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * the rights to use, copy, modify, merge, publish, distribute, sublicense,
 * and/or sell copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice (including the next
 * paragraph) shall be included in all copies or substantial portions of the
 * Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
 * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 *
 * Authors:
 *   Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
 */

#include "util/u_math.h"
#include "util/macros.h"
#include "pan_device.h"
#include "pan_encoder.h"

/* Mali GPUs are tiled-mode renderers, rather than immediate-mode.
 * Conceptually, the screen is divided into 16x16 tiles. Vertex shaders run.
 * Then, a fixed-function hardware block (the tiler) consumes the gl_Position
 * results. For each triangle specified, it marks each tile the triangle
 * overlaps as containing that triangle. This set of "triangles per tile"
 * forms the "polygon list". Finally, the rasterization unit consumes the
 * polygon list to invoke the fragment shader.
 *
 * In practice, it's a bit more complicated than this. On Midgard chips with an
 * "advanced tiling unit" (all except T720/T820/T830), 16x16 is the logical
 * tile size, but Midgard features "hierarchical tiling", where power-of-two
 * multiples of the base tile size can be used: hierarchy level 0 (16x16),
 * level 1 (32x32), level 2 (64x64), per public information about Midgard's
 * tiling. In fact, tiling goes up to 4096x4096 (!), although in practice
 * 128x128 is the largest usually used (though higher modes are enabled).  The
 * idea behind hierarchical tiling is to use low tiling levels for small
 * triangles and high levels for large triangles, to minimize memory bandwidth
 * and repeated fragment shader invocations (the former issue inherent to
 * immediate-mode rendering and the latter common in traditional tilers).
 *
 * The tiler itself works by reading varyings in and writing a polygon list
 * out. Unfortunately (for us), both of these buffers are managed in main
 * memory; although they ideally will be cached, it is the driver's
 * responsibility to allocate these buffers. Varying buffer allocation is
 * handled elsewhere, as it is not tiler specific; the real issue is allocating
 * the polygon list.
 *
 * This is hard, because from the driver's perspective, we have no information
 * about what geometry will actually look like on screen; that information is
 * only gained from running the vertex shader. (Theoretically, we could run the
 * vertex shaders in software as a prepass, or in hardware with transform
 * feedback as a prepass, but either idea is ludicrous on so many levels).
 *
 * Instead, Mali uses a bit of a hybrid approach, splitting the polygon list
 * into three distinct pieces. First, the driver statically determines which
 * tile hierarchy levels to use (more on that later). At this point, we know the
 * framebuffer dimensions and all the possible tilings of the framebuffer, so
 * we know exactly how many tiles exist across all hierarchy levels. The first
 * piece of the polygon list is the header, which is exactly 8 bytes per tile,
 * plus padding and a small 64-byte prologue. (If that doesn't remind you of
 * AFBC, it should. See pan_afbc.c for some fun parallels). The next part is
 * the polygon list body, which seems to contain 512 bytes per tile, again
 * across every level of the hierarchy. These two parts form the polygon list
 * buffer. This buffer has a statically determinable size, approximately equal
 * to the # of tiles across all hierarchy levels * (8 bytes + 512 bytes), plus
 * alignment / minimum restrictions / etc.
 *
 * The third piece is the easy one (for us): the tiler heap. In essence, the
 * tiler heap is a gigantic slab that's as big as could possibly be necessary
 * in the worst case imaginable. Just... a gigantic allocation that we give a
 * start and end pointer to. What's the catch? The tiler heap is lazily
 * allocated; that is, a huge amount of memory is _reserved_, but only a tiny
 * bit is actually allocated upfront. The GPU just keeps using the
 * unallocated-but-reserved portions as it goes along, generating page faults
 * if it goes beyond the allocation, and then the kernel is instructed to
 * expand the allocation on page fault (known in the vendor kernel as growable
 * memory). This is quite a bit of bookkeeping of its own, but that task is
 * pushed to kernel space and we can mostly ignore it here, just remembering to
 * set the GROWABLE flag so the kernel actually uses this path rather than
 * allocating a gigantic amount up front and burning a hole in RAM.
 *
 * As far as determining which hierarchy levels to use, the simple answer is
 * that right now, we don't. In the tiler configuration fields (consistent from
 * the earliest Midgard's SFBD through the latest Bifrost traces we have),
 * there is a hierarchy_mask field, controlling which levels (tile sizes) are
 * enabled. Ideally, the hierarchical tiling dream -- mapping big polygons to
 * big tiles and small polygons to small tiles -- would be realized here as
 * well. As long as there are polygons at all needing tiling, we always have to
 * have big tiles available, in case there are big polygons. But we don't
 * necessarily need small tiles available. Ideally, when there are small
 * polygons, small tiles are enabled (to avoid waste from putting small
 * triangles in the big tiles); when there are not, small tiles are disabled to
 * avoid enabling more levels than necessary, which potentially costs in memory
 * bandwidth / power / tiler performance.
 *
 * Of course, the driver has to figure this out statically. When tile
 * hierarchies are actually established, this is done by the tiler in
 * fixed-function hardware, after the vertex shaders have run and there is
 * sufficient information to figure out the size of triangles. The driver has
 * no such luxury, again barring insane hacks like additionally running the
 * vertex shaders in software or in hardware via transform feedback. Thus, for
 * the driver, we need a heuristic approach.
 *
 * You could imagine lots of heuristics to guess triangle size statically,
 * but one approach shines as particularly simple-stupid: assume all
 * on-screen triangles are equal size and spread equidistantly throughout the
 * screen. Let's be clear, this is NOT A VALID ASSUMPTION. But if we roll with
 * it, then we see:
 *
 *      Triangle Area   = (Screen Area / # of triangles)
 *                      = (Width * Height) / (# of triangles)
 *
 * Or if you prefer, we can also make a third CRAZY assumption that we only draw
 * right triangles with edges parallel/perpendicular to the sides of the screen
 * with no overdraw, forming a triangle grid across the screen:
 *
 * |--w--|
 *  _____   |
 * | /| /|  |
 * |/_|/_|  h
 * | /| /|  |
 * |/_|/_|  |
 *
 * Then you can use some middle school geometry and algebra to work out the
 * triangle dimensions. I started working on this, but realised I didn't need
 * to in order to make my point, and couldn't bear to erase that ASCII art.
 * Anyway.
 *
 * POINT IS, by considering the ratio of screen area and triangle count, we can
 * estimate the triangle size. For a small size, use small bins; for a large
 * size, use large bins. Intuitively, this metric makes sense: when there are
 * few triangles on a large screen, you're probably compositing a UI and
 * therefore the triangles are large; when there are a lot of triangles on a
 * small screen, you're probably rendering a 3D mesh and therefore the
 * triangles are tiny. (Or better said -- there will be tiny triangles, even if
 * there are also large triangles. There have to be unless you expect crazy
 * overdraw. Generally, it's better to allow more small bin sizes than
 * necessary than not allow enough.)
 *
 * From this heuristic (or whatever), we determine the minimum allowable tile
 * size, and we use that to decide the hierarchy masking, selecting from the
 * minimum "ideal" tile size to the maximum tile size (2048x2048 in practice).
 *
 * Once we have that mask and the framebuffer dimensions, we can compute the
 * size of the statically-sized polygon list structures, allocate them, and go!
 *
 * -----
 *
 * On T720, T820, and T830, there is no support for hierarchical tiling.
 * Instead, the hardware allows the driver to select the tile size dynamically
 * on a per-framebuffer basis, including allowing rectangular/non-square tiles.
 * Rules for tile size selection are as follows:
 *
 *  - Dimensions must be powers-of-two.
 *  - The smallest tile is 16x16.
 *  - The tile width/height is at most the framebuffer w/h (clamp up to 16 pix)
 *  - There must be no more than 64 tiles in either dimension.
 *
 * Within these constraints, the driver is free to pick a tile size according
 * to some heuristic, similar to units with an advanced tiling unit.
 *
 * To pick a size without any heuristics, we may satisfy the constraints by
 * defaulting to 16x16 (a power-of-two). This fits the minimum. For the size
 * constraint, consider:
 *
 *      # of tiles < 64
 *      ceil (fb / tile) < 64
 *      (fb / tile) <= (64 - 1)
 *      tile >= fb / (64 - 1)
 *
 * The smallest power-of-two satisfying the last line is
 * next_power_of_two(fb / (64 - 1)). Hence we clamp up to
 * align_pot(fb / (64 - 1)).
 *
 * Extending to use a selection heuristic is left for future work.
 *
 * Once the tile size (w, h) is chosen, we compute the hierarchy "mask":
 *
 *      hierarchy_mask = (log2(h / 16) << 6) | log2(w / 16)
 *
 * Of course with no hierarchical tiling, this is not a mask; it's just a field
 * specifying the tile size. But I digress.
 *
 * We also compute the polygon list sizes (with framebuffer size W, H) as:
 *
 *      full_size = 0x200 + 0x200 * ceil(W / w) * ceil(H / h)
 *      offset = 8 * ceil(W / w) * ceil(H / h)
 *
 * It further appears necessary to round down offset to the nearest 0x200.
 * Possibly we would also round down full_size to the nearest 0x200 but
 * full_size/0x200 = (1 + ceil(W / w) * ceil(H / h)) is an integer so there's
 * nothing to do.
 */

/* Hierarchical tiling spans from 16x16 to 4096x4096 tiles */

#define MIN_TILE_SIZE 16
#define MAX_TILE_SIZE 4096

/* Constants as shifts for easier power-of-two iteration */

#define MIN_TILE_SHIFT util_logbase2(MIN_TILE_SIZE)
#define MAX_TILE_SHIFT util_logbase2(MAX_TILE_SIZE)

/* The hierarchy has a 64-byte prologue */
#define PROLOGUE_SIZE 0x40

/* For each tile (across all hierarchy levels), there is 8 bytes of header */
#define HEADER_BYTES_PER_TILE 0x8

/* Likewise, each tile per level has 512 bytes of body */
#define FULL_BYTES_PER_TILE 0x200

/* If the width-x-height framebuffer is divided into tile_size-x-tile_size
 * tiles, how many tiles are there? Rounding up in each direction. For the
 * special case of tile_size=16, this aligns with the usual Midgard count.
 * tile_size must be a power-of-two. Not really repeat code from AFBC/checksum,
 * because those care about the stride (not just the overall count) and only at
 * a fixed tile size (not any of a number of power-of-twos) */

static unsigned
pan_tile_count(unsigned width, unsigned height, unsigned tile_width, unsigned tile_height)
{
        unsigned aligned_width = ALIGN_POT(width, tile_width);
        unsigned aligned_height = ALIGN_POT(height, tile_height);

        unsigned tile_count_x = aligned_width / tile_width;
        unsigned tile_count_y = aligned_height / tile_height;

        return tile_count_x * tile_count_y;
}
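
/* Illustrative sketch only, not part of the driver: the tile count is just
 * ceil(W / w) * ceil(H / h). The sketch_* names are hypothetical, and
 * sketch_align_pot is a stand-in that assumes ALIGN_POT rounds up to a
 * multiple of a power-of-two. For example, a 1920x1080 framebuffer has
 * 120 * 68 tiles at 16x16, since 1080 / 16 rounds up to 68. */

```c
#include <assert.h>

/* Hypothetical stand-in for ALIGN_POT: round x up to a multiple of the
 * power-of-two y. */
static unsigned
sketch_align_pot(unsigned x, unsigned y)
{
        assert(y != 0 && (y & (y - 1)) == 0);
        return (x + y - 1) & ~(y - 1);
}

/* Same arithmetic as pan_tile_count: ceil(W / w) * ceil(H / h) */
static unsigned
sketch_tile_count(unsigned width, unsigned height,
                  unsigned tile_width, unsigned tile_height)
{
        unsigned tiles_x = sketch_align_pot(width, tile_width) / tile_width;
        unsigned tiles_y = sketch_align_pot(height, tile_height) / tile_height;

        return tiles_x * tiles_y;
}
```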

/* For `masked_count` of the smallest tile sizes masked out, computes the
 * size of the polygon list header. We iterate the tile sizes (16x16 through
 * 2048x2048). For each tile size, we figure out how many tiles there are at
 * this hierarchy level and therefore how many bytes this level is, leaving us
 * with a byte count for each level. We then just sum up the byte counts across
 * the levels to find a byte count for all levels. */

static unsigned
panfrost_hierarchy_size(
                unsigned width,
                unsigned height,
                unsigned mask,
                unsigned bytes_per_tile)
{
        unsigned size = PROLOGUE_SIZE;

        /* Iterate hierarchy levels */

        for (unsigned b = 0; b < (MAX_TILE_SHIFT - MIN_TILE_SHIFT); ++b) {
                /* Check if this level is enabled */
                if (!(mask & (1 << b)))
                        continue;

                /* Shift from a level to a tile size */
                unsigned tile_size = (1 << b) * MIN_TILE_SIZE;

                unsigned tile_count = pan_tile_count(width, height, tile_size, tile_size);
                unsigned level_count = bytes_per_tile * tile_count;

                size += level_count;
        }

        /* This size will be used as an offset, so ensure it's aligned */
        return ALIGN_POT(size, 0x200);
}
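
/* Illustrative sketch only, not part of the driver: the same summation with
 * the constants written out (0x40 prologue, levels 0 through 7, result
 * aligned up to 0x200). The sketch_* names are hypothetical. As a worked
 * case, a 256x256 framebuffer with mask 0xFF has 256 + 64 + 16 + 4 + 1 + 1 +
 * 1 + 1 = 344 tiles across the levels, so the header is 0x40 + 8 * 344 = 2816
 * bytes, which aligns up to 3072. */

```c
#include <assert.h>

/* ceil(n / d) for d > 0 */
static unsigned
sketch_div_round_up(unsigned n, unsigned d)
{
        assert(d > 0);
        return (n + d - 1) / d;
}

/* Sum bytes_per_tile over every enabled level, plus the prologue */
static unsigned
sketch_hierarchy_bytes(unsigned width, unsigned height,
                       unsigned mask, unsigned bytes_per_tile)
{
        unsigned size = 0x40;

        for (unsigned level = 0; level < 8; ++level) {
                /* Skip levels masked out by the caller */
                if (!(mask & (1u << level)))
                        continue;

                /* Level 0 is 16x16, each further level doubles the side */
                unsigned tile = 16u << level;
                unsigned tiles = sketch_div_round_up(width, tile) *
                                 sketch_div_round_up(height, tile);

                size += bytes_per_tile * tiles;
        }

        /* Align up to 0x200, as the result is used as an offset */
        return (size + 0x1FF) & ~0x1FFu;
}
```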

/* Implement the formula:
 *
 *      0x200 + bytes_per_tile * ceil(W / w) * ceil(H / h)
 *
 * rounding down the answer to the nearest 0x200. This is used to compute both
 * header and body sizes for GPUs without hierarchical tiling. Essentially,
 * computing a single hierarchy level, since there isn't any hierarchy!
 */

static unsigned
panfrost_flat_size(unsigned width, unsigned height, unsigned dim, unsigned bytes_per_tile)
{
        /* First, extract the tile dimensions */

        unsigned tw = (1 << (dim & 0b111)) * 8;
        unsigned th = (1 << ((dim & (0b111 << 6)) >> 6)) * 8;

        /* tile_count is ceil(W/w) * ceil(H/h) */
        unsigned raw = pan_tile_count(width, height, tw, th) * bytes_per_tile;

        /* Round down and add offset */
        return 0x200 + ((raw / 0x200) * 0x200);
}

/* Given a hierarchy mask and a framebuffer size, compute the header size */

unsigned
panfrost_tiler_header_size(unsigned width, unsigned height, unsigned mask, bool hierarchy)
{
        if (hierarchy)
                return panfrost_hierarchy_size(width, height, mask, HEADER_BYTES_PER_TILE);
        else
                return panfrost_flat_size(width, height, mask, HEADER_BYTES_PER_TILE);
}

/* The combined header/body is sized similarly (but it is significantly
 * larger), except that it can be empty when the tiler is disabled, rather
 * than getting clamped to a minimum size.
 */

unsigned
panfrost_tiler_full_size(unsigned width, unsigned height, unsigned mask, bool hierarchy)
{
        if (hierarchy)
                return panfrost_hierarchy_size(width, height, mask, FULL_BYTES_PER_TILE);
        else
                return panfrost_flat_size(width, height, mask, FULL_BYTES_PER_TILE);
}

/* On GPUs without hierarchical tiling, we choose a tile size directly and
 * stuff it into the field otherwise known as hierarchy mask (not a mask). */

static unsigned
panfrost_choose_tile_size(
        unsigned width, unsigned height, unsigned vertex_count)
{
        /* Figure out the ideal tile size. Eventually a heuristic should be
         * used for this */

        unsigned best_w = 16;
        unsigned best_h = 16;

        /* Clamp so there are fewer than 64 tiles in each direction */

        best_w = MAX2(best_w, util_next_power_of_two(width / 63));
        best_h = MAX2(best_h, util_next_power_of_two(height / 63));

        /* We have our ideal tile size, so encode */

        unsigned exp_w = util_logbase2(best_w / 16);
        unsigned exp_h = util_logbase2(best_h / 16);

        return exp_w | (exp_h << 6);
}
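
/* Illustrative sketch only, not part of the driver: the clamp and encode
 * steps with local helpers. The sketch_* names are hypothetical;
 * sketch_next_pot and sketch_log2 are stand-ins that assume
 * util_next_power_of_two rounds up to a power-of-two (treating 0 as 1) and
 * util_logbase2 is floor(log2(x)). For 1920x1080, 1920 / 63 = 30 rounds up
 * to a 32-pixel tile width and 1080 / 63 = 17 rounds up to a 32-pixel tile
 * height, so both exponents are 1 and the field is 1 | (1 << 6) = 0x41. */

```c
#include <assert.h>

/* Smallest power-of-two >= x, treating 0 as 1 */
static unsigned
sketch_next_pot(unsigned x)
{
        unsigned p = 1;

        assert(x <= (1u << 31));

        while (p < x)
                p <<= 1;

        return p;
}

/* floor(log2(x)) for x > 0 */
static unsigned
sketch_log2(unsigned x)
{
        unsigned l = 0;

        assert(x > 0);

        while (x >>= 1)
                ++l;

        return l;
}

/* Tile extent along one axis: at least 16, grown to a power-of-two that
 * keeps the tile count along that axis bounded */
static unsigned
sketch_flat_dim(unsigned fb_dim)
{
        unsigned tile = sketch_next_pot(fb_dim / 63);

        return tile < 16 ? 16 : tile;
}

/* Pack log2(w / 16) into the low bits and log2(h / 16) at bit 6 */
static unsigned
sketch_encode_tile_size(unsigned width, unsigned height)
{
        unsigned exp_w = sketch_log2(sketch_flat_dim(width) / 16);
        unsigned exp_h = sketch_log2(sketch_flat_dim(height) / 16);

        return exp_w | (exp_h << 6);
}
```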

/* In the future, a heuristic to choose a tiler hierarchy mask would go here.
 * At the moment, we just default to 0xFF, which enables all possible hierarchy
 * levels. Overall this yields good performance but presumably incurs a cost in
 * memory bandwidth / power consumption / etc, at least on smaller scenes that
 * don't really need all the smaller levels enabled */

unsigned
panfrost_choose_hierarchy_mask(
        unsigned width, unsigned height,
        unsigned vertex_count, bool hierarchy)
{
        /* If there is no geometry, we don't bother enabling anything */

        if (!vertex_count)
                return 0x00;

        if (!hierarchy)
                return panfrost_choose_tile_size(width, height, vertex_count);

        /* Otherwise, default everything on. TODO: Proper tests */

        return 0xFF;
}
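
/* Illustrative sketch only, NOT what the driver does and untested on
 * hardware: one way the "screen area / triangle count" estimate described at
 * the top of this file could be turned into a mask. Everything here is an
 * assumption made for the example: the function name, treating the draw as a
 * plain triangle list (three vertices per triangle), and the rule of masking
 * out a small level only when the next level's tile still fits within the
 * estimated triangle area. Large levels always stay enabled. */

```c
#include <assert.h>
#include <stdint.h>

static unsigned
sketch_heuristic_mask(unsigned width, unsigned height, unsigned vertex_count)
{
        /* No geometry, nothing to enable (same as the real function) */
        if (!vertex_count)
                return 0x00;

        /* ASSUMPTION: three vertices per triangle, at least one primitive */
        unsigned triangles = vertex_count / 3;

        if (!triangles)
                triangles = 1;

        /* Triangle Area = (Width * Height) / (# of triangles) */
        uint64_t area = ((uint64_t) width * height) / triangles;

        /* ASSUMPTION: raise the minimum level while the next level's tile
         * area does not exceed the estimated triangle area */
        unsigned min_level = 0;

        while (min_level < 7) {
                uint64_t next_side = 16u << (min_level + 1);

                if (next_side * next_side > area)
                        break;

                ++min_level;
        }

        /* Enable min_level through level 7, mask out the smaller levels */
        return 0xFF & ~((1u << min_level) - 1);
}
```

/* Under these assumptions, two fullscreen triangles at 1920x1080 keep only
 * the 512x512 and larger levels, while a dense mesh keeps every level. */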