Panfrost
========

The Panfrost driver stack includes an OpenGL ES implementation for Arm Mali
GPUs based on the Midgard, Bifrost, and Valhall microarchitectures. It is
**conformant** on Mali-G52 and Mali-G57 but **non-conformant** on other GPUs.
The following hardware is currently supported:

=========  ============ ============ =======
Product    Architecture OpenGL ES    OpenGL
=========  ============ ============ =======
Mali T620  Midgard (v4) 2.0          2.1
Mali T720  Midgard (v4) 2.0          2.1
Mali T760  Midgard (v5) 3.1          3.1
Mali T820  Midgard (v5) 3.1          3.1
Mali T830  Midgard (v5) 3.1          3.1
Mali T860  Midgard (v5) 3.1          3.1
Mali T880  Midgard (v5) 3.1          3.1
Mali G72   Bifrost (v6) 3.1          3.1
Mali G31   Bifrost (v7) 3.1          3.1
Mali G51   Bifrost (v7) 3.1          3.1
Mali G52   Bifrost (v7) 3.1          3.1
Mali G76   Bifrost (v7) 3.1          3.1
Mali G57   Valhall (v9) 3.1          3.1
=========  ============ ============ =======

Other Midgard and Bifrost chips (T604, G71) are not yet supported.

Older Mali chips based on the Utgard architecture (Mali 400, Mali 450) are
supported in the :doc:`Lima <lima>` driver, not Panfrost. Lima is also
available in Mesa.

Other graphics APIs (Vulkan, OpenCL) are not supported at this time.

Building
--------

Panfrost's OpenGL support is a Gallium driver. Since Mali GPUs are 3D-only and
do not include a display controller, Mesa uses kmsro to support display
controllers paired with Mali GPUs. If your board with a Panfrost-supported GPU
has a display controller that has mainline Linux support but is not yet
supported by kmsro, adding support is easy; see the commit ``cff7de4bb597e9``
for an example.

LLVM is *not* required by Panfrost's compilers. LLVM support in Mesa can
safely be disabled for most OpenGL ES users with Panfrost.

To configure a build directory named ``build``, run
``meson . build/ -Dvulkan-drivers= -Dgallium-drivers=panfrost -Dllvm=disabled``.

For general information on building Mesa, read :doc:`the install documentation
<../install>`.

Chat
----

Panfrost developers and users hang out on IRC at ``#panfrost`` on OFTC. Note
that registering and authenticating with ``NickServ`` is required to prevent
spam. `Join the chat. <https://webchat.oftc.net/?channels=panfrost>`_

Compressed texture support
--------------------------

In the driver, Panfrost supports ASTC, ETC, and all BCn formats (e.g. RGTC,
S3TC, etc.). However, Panfrost depends on the hardware to support these formats
efficiently. All supported Mali architectures support these formats, but not
every system-on-chip with a Mali GPU supports all of them. Many lower-end
systems lack support for some BCn formats, which can cause problems when playing
desktop games with Panfrost. To check whether this issue applies to your
system-on-chip, Panfrost includes a ``panfrost_texfeatures`` tool to query
supported formats.

To use this tool, include the option ``-Dtools=panfrost`` when configuring Mesa.
Then inside your Mesa build directory, the tool is located at
``src/panfrost/tools/panfrost_texfeatures``. Copy it to your target device,
mark it executable if necessary, and run it there. A table of supported
formats will be printed to standard output.

drm-shim
--------

Panfrost implements ``drm-shim``, stubbing out the Panfrost kernel interface.
Use cases for this functionality include:

- Future hardware bring up
- Running shader-db on non-Mali workstations
- Reproducing compiler (and some driver) bugs without Mali hardware

Although Mali hardware is usually paired with an Arm CPU, Panfrost is portable C
code and should work on any Linux machine. In particular, you can test the
compiler on shader-db on an Intel desktop.

To build Mesa with Panfrost drm-shim, configure Meson with
``-Dgallium-drivers=panfrost`` and ``-Dtools=drm-shim``. See the above
building section for a full invocation. The drm-shim binary will be built to
``build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so``.

To use, set the ``LD_PRELOAD`` environment variable to the drm-shim binary.  It
may also be necessary to set ``LIBGL_DRIVERS_PATH`` to the location where Mesa
was installed.

By default, drm-shim mocks a Mali-G52 system. To select a specific Mali GPU,
set the ``PAN_GPU_ID`` environment variable to the desired GPU ID:

=========  ============ =======
Product    Architecture GPU ID
=========  ============ =======
Mali-T720  Midgard (v4) 720
Mali-T860  Midgard (v5) 860
Mali-G72   Bifrost (v6) 6221
Mali-G52   Bifrost (v7) 7212
Mali-G57   Valhall (v9) 9093
=========  ============ =======

Additional GPU IDs are enumerated in the ``panfrost_model_list`` list in
``src/panfrost/lib/pan_props.c``.

As an example: assuming Mesa is installed to a local path ``~/lib`` and Mesa's
build directory is ``~/mesa/build``, a shader can be compiled for Mali-G52 as:

.. code-block:: sh

   ~/shader-db$ BIFROST_MESA_DEBUG=shaders \
   LIBGL_DRIVERS_PATH=~/lib/dri/ \
   LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
   PAN_GPU_ID=7212 \
   ./run shaders/glmark/1-1.shader_test

The same shader can be compiled for Mali-T720 as:

.. code-block:: sh

   ~/shader-db$ MIDGARD_MESA_DEBUG=shaders \
   LIBGL_DRIVERS_PATH=~/lib/dri/ \
   LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
   PAN_GPU_ID=720 \
   ./run shaders/glmark/1-1.shader_test

These examples set the compilers' ``shaders`` debug flags to dump the optimized
NIR, backend IR after instruction selection, backend IR after register
allocation and scheduling, and a disassembly of the final compiled binary.

As another example, this invocation runs a single dEQP test "on" Mali-G52,
pretty-printing GPU data structures and disassembling all shaders
(``PAN_MESA_DEBUG=trace``) as well as dumping raw GPU memory
(``PAN_MESA_DEBUG=dump``). The ``EGL_PLATFORM=surfaceless`` environment variable
and various flags to dEQP mimic the surfaceless environment that our
continuous integration (CI) uses. This eliminates window system dependencies,
although it requires a specially built CTS:

.. code-block:: sh

   ~/VK-GL-CTS/build/external/openglcts/modules$ PAN_MESA_DEBUG=trace,dump \
   LIBGL_DRIVERS_PATH=~/lib/dri/ \
   LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
   PAN_GPU_ID=7212 EGL_PLATFORM=surfaceless \
   ./glcts --deqp-surface-type=pbuffer \
   --deqp-gl-config-name=rgba8888d24s8ms0 --deqp-surface-width=256 \
   --deqp-surface-height=256 -n \
   dEQP-GLES31.functional.shaders.builtin_functions.common.abs.float_highp_compute

U-interleaved tiling
--------------------

Panfrost supports u-interleaved tiling. U-interleaved tiling is
indicated by the ``DRM_FORMAT_MOD_ARM_16X16_BLOCK_U_INTERLEAVED`` modifier.

The tiling reorders whole pixels (blocks). It does not compress or modify the
pixels themselves, so it can be used for any image format. Internally, images
are divided into tiles. Tiles occur in source order, but pixels (blocks) within
each tile are reordered according to a space-filling curve.

For regular formats, 16x16 tiles are used. This harmonizes with the default tile
size for binning and CRCs (transaction elimination). It also means a single line
(16 pixels) at 4 bytes per pixel equals a single 64-byte cache line.

For formats that are already block compressed (S3TC, RGTC, etc.), 4x4 tiles are
used, where entire blocks are reordered. Most of these formats compress 4x4
blocks, so this gives an effective 16x16 tiling. This justifies the tile size
intuitively, though it's not a rule: ASTC may use larger blocks.

Within a tile, the X and Y bits are interleaved (like Morton order), but with a
twist: adjacent bit pairs are XORed. The reason to add XORs is not obvious.
Visually, addresses take the form::

   | y3 | (x3 ^ y3) | y2 | (y2 ^ x2) | y1 | (y1 ^ x1) | y0 | (y0 ^ x0) |

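As a minimal sketch of the bit layout above (not the optimized code in
``pan_tiling.c``), the index of a pixel within a 16x16 tile can be computed in
Python. The function name ``u_interleaved_index`` is made up for this
illustration, and the sketch assumes the rightmost field in the diagram is
bit 0:

.. code-block:: python

   def u_interleaved_index(x, y):
       """Index of pixel (x, y) within one 16x16 tile, with 0 <= x, y < 16."""
       index = 0
       for bit in range(4):
           xb = (x >> bit) & 1
           yb = (y >> bit) & 1
           # Even bits hold x ^ y, odd bits hold y (assumed LSB on the right).
           index |= (xb ^ yb) << (2 * bit)
           index |= yb << (2 * bit + 1)
       return index


   # Ordering the top-left 2x2 pixels by index traces a "U":
   # right, down, then back left.
   first_four = sorted(
       ((x, y) for y in range(2) for x in range(2)),
       key=lambda p: u_interleaved_index(*p),
   )

Every (x, y) pair in the tile maps to a distinct index from 0 to 255, so the
reordering is a permutation of the tile's pixels.
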
Reference routines to encode/decode u-interleaved images are available in
``src/panfrost/shared/test/test-tiling.cpp``, which documents the space-filling
curve. This reference implementation is used to unit test the optimized
implementation used in production. The optimized implementation is available in
``src/panfrost/shared/pan_tiling.c``.

Although these routines are part of Panfrost, they are also used by Lima, as Arm
introduced the format with Utgard. It is the only tiling supported on Utgard. On
Mali-T760 and newer, Arm Framebuffer Compression (AFBC) is more efficient and
should be used instead where possible. However, not all formats are
compressible, so u-interleaved tiling remains an important fallback on Panfrost.

Instancing
----------

The attribute descriptor lets the attribute unit compute the address of an
attribute given the vertex and instance ID. Unfortunately, the way this works is
rather complicated when instancing is enabled.

To explain this, first we need to explain how compute and vertex threads are
dispatched.  When a quad is dispatched, it receives a single, linear index.
However, we need to translate that index into a (vertex id, instance id) pair.
One option would be to do:

.. math::
   \text{vertex id} = \text{linear id} \% \text{num vertices}

   \text{instance id} = \text{linear id} / \text{num vertices}

but this involves a costly division and modulus by an arbitrary number.
Instead, we could pad num_vertices. We dispatch padded_num_vertices *
num_instances threads instead of num_vertices * num_instances, which results
in some "extra" threads with vertex_id >= num_vertices, which we have to
discard.  The more we pad num_vertices, the more "wasted" threads we
dispatch, but the division is potentially easier.

One straightforward choice is to pad num_vertices to the next power of two,
which means that the division and modulus are just simple bit shifts and
masking. But the actual algorithm is a bit more complicated. The thread
dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
to dividing by a power of two. As a result, padded_num_vertices can be
1, 3, 5, 7, or 9 times a power of two. This results in fewer wasted threads,
since we need less padding.

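To make the padding trade-off concrete, here is a small Python sketch of the
thread-to-pair mapping. It illustrates the arithmetic only; the helper name
``live_threads`` is hypothetical, and the padded count is passed in rather than
derived:

.. code-block:: python

   def live_threads(num_vertices, num_instances, padded_num_vertices):
       """Yield (vertex_id, instance_id) for each thread that is not discarded."""
       for linear_id in range(padded_num_vertices * num_instances):
           vertex_id = linear_id % padded_num_vertices
           instance_id = linear_id // padded_num_vertices
           # Threads that fall in the padding have no vertex to process.
           if vertex_id < num_vertices:
               yield vertex_id, instance_id


   # 3 vertices padded to 4, drawn twice: 8 threads dispatched, 6 kept.
   pairs = list(live_threads(3, 2, 4))

With a power-of-two padded count such as 4, the modulus and division reduce to
``linear_id & 3`` and ``linear_id >> 2``.
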
padded_num_vertices is picked by the hardware. The driver just specifies the
actual number of vertices. Note that padded_num_vertices is a multiple of four
(presumably because threads are dispatched in groups of 4). Also,
padded_num_vertices is always at least one more than num_vertices, which seems
like a quirk of the hardware. For larger num_vertices, the hardware uses the
following algorithm: using the binary representation of num_vertices, we look at
the most significant set bit as well as the following 3 bits. Let n be the
number of bits after those 4 bits. Then we set padded_num_vertices according to
the following table:

==========  =======================
high bits   padded_num_vertices
==========  =======================
1000        :math:`9 \cdot 2^n`
1001        :math:`5 \cdot 2^{n+1}`
101x        :math:`3 \cdot 2^{n+2}`
110x        :math:`7 \cdot 2^{n+1}`
111x        :math:`2^{n+4}`
==========  =======================

For example, if num_vertices = 70 is passed to glDraw(), its binary
representation is 1000110, so n = 3 and the high bits are 1000, and
therefore padded_num_vertices = :math:`9 \cdot 2^3` = 72.

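The table lookup can be sketched in Python as follows. This is an illustration
of the rule for larger counts only, not the driver's code: the function name is
hypothetical, and because the text does not say where "larger" begins, the
sketch simply rejects inputs with fewer than four significant bits:

.. code-block:: python

   def padded_num_vertices_large(num_vertices):
       """Apply the high-bits table to a vertex count of at least 16."""
       if num_vertices < 16:
           raise ValueError("sketch only covers counts with 4+ significant bits")
       n = num_vertices.bit_length() - 4   # bits after the top four
       high = num_vertices >> n            # top four bits, 0b1000..0b1111
       if high == 0b1000:
           return 9 << n
       if high == 0b1001:
           return 5 << (n + 1)
       if high >> 1 == 0b101:
           return 3 << (n + 2)
       if high >> 1 == 0b110:
           return 7 << (n + 1)
       return 1 << (n + 4)                 # remaining case: 111x

For the example above, ``padded_num_vertices_large(70)`` returns 72, and every
result is strictly greater than its input, in line with the quirk noted
earlier.
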
The attribute unit works in terms of the original linear_id. If
num_instances = 1, then linear_id and the vertex ID are the same, and
everything is simple. However, with instancing things get more complicated.
There are four possible modes, two of which we can group together:

1. Use the linear_id directly. Only used when there is no instancing.

2. Use the linear_id modulo a constant. This is used for per-vertex
   attributes with instancing enabled by making the constant equal
   padded_num_vertices. Because the modulus is always padded_num_vertices, this
   mode only supports a modulus that is a power of 2 times 1, 3, 5, 7, or 9.
   The shift field specifies the power of two, while the extra_flags field
   specifies the odd number. If shift = n and extra_flags = m, then the modulus
   is :math:`(2m + 1) \cdot 2^n`. As an example, if num_vertices = 70, then as
   computed above, padded_num_vertices = :math:`9 \cdot 2^3`, so we should set
   extra_flags = 4 and shift = 3. Note that we must exactly follow the hardware
   algorithm used to get padded_num_vertices in order to correctly implement
   per-vertex attributes.

3. Divide the linear_id by a constant. In order to correctly implement
   instance divisors, we have to divide linear_id by padded_num_vertices times
   the user-specified divisor. So first we compute padded_num_vertices, again
   following the exact same algorithm that the hardware uses, then multiply it
   by the GL-level divisor to get the hardware-level divisor. This case is
   further divided into two more cases. If the hardware-level divisor is a
   power of two, then we just need to shift. The shift amount is specified by
   the shift field, so that the hardware-level divisor is just
   :math:`2^\text{shift}`.

   If it isn't a power of two, then we have to divide by an arbitrary integer.
   For that, we use the well-known technique of multiplying by an approximation
   of the inverse. The driver must compute the magic multiplier and shift
   amount, and then the hardware does the multiplication and shift. The
   hardware and driver also use the "round-down" optimization as described in
   https://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
   The hardware further assumes the multiplier is between :math:`2^{31}` and
   :math:`2^{32}`, so the high bit is implicitly set to 1 even though it is set
   to 0 by the driver -- presumably this simplifies the hardware multiplier a
   little. The hardware first multiplies linear_id by the multiplier and
   takes the high 32 bits, then applies the round-down correction if
   extra_flags = 1, then finally shifts right by the shift field.

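The shift and extra_flags encodings for the modulus mode and for a power-of-two
divisor can be sketched in Python. The function names below are hypothetical
and the sketch only mirrors the arithmetic described above; it does not model
the descriptor layout:

.. code-block:: python

   def encode_modulus(padded_num_vertices):
       """Split (2m + 1) * 2**n into (shift, extra_flags) = (n, m)."""
       # Isolate the lowest set bit to find the power-of-two factor.
       shift = (padded_num_vertices & -padded_num_vertices).bit_length() - 1
       odd = padded_num_vertices >> shift
       if odd not in (1, 3, 5, 7, 9):
           raise ValueError("not 1, 3, 5, 7, or 9 times a power of two")
       return shift, (odd - 1) // 2


   def encode_pot_divisor(padded_num_vertices, gl_divisor):
       """Return the shift for a power-of-two hardware divisor, else None."""
       hw_divisor = padded_num_vertices * gl_divisor
       if hw_divisor & (hw_divisor - 1):
           return None  # not a power of two; needs the multiply-and-shift path
       return hw_divisor.bit_length() - 1

``encode_modulus(72)`` gives ``(3, 4)``, matching the shift = 3 and
extra_flags = 4 example for num_vertices = 70.
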
There are some differences between ridiculousfish's algorithm and the Mali
hardware algorithm, which means that the reference code from ridiculousfish
doesn't always produce the right constants. Mali does not use the pre-shift
optimization, since that would make a hardware implementation slower (it
would have to always do the pre-shift, multiply, and post-shift operations).
It also forces the multiplier to be at least :math:`2^{31}`, which means
that the exponent is entirely fixed, so there is no trial-and-error.
Altogether, given the divisor d, the algorithm the driver must follow is:

1. Set shift = :math:`\lfloor \log_2(d) \rfloor`.
2. Compute :math:`m = \lceil 2^{\text{shift} + 32} / d \rceil` and
   :math:`e = 2^{\text{shift} + 32} \bmod d`.
3. If :math:`e \le 2^{\text{shift}}`, then we need to use the round-down
   algorithm. Set magic_divisor = m - 1 and extra_flags = 1.
4. Otherwise, set magic_divisor = m and extra_flags = 0.

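These steps translate directly into integer arithmetic. The Python sketch below
is an illustration rather than the driver's implementation: the function names
are hypothetical, and ``divide`` assumes that the round-down correction amounts
to incrementing the dividend before the multiply, as in the cited paper,
instead of modelling the hardware datapath:

.. code-block:: python

   def compute_magic_divisor(d):
       """Return (magic_divisor, shift, extra_flags) for a non-power-of-two d."""
       if d <= 0 or d & (d - 1) == 0:
           raise ValueError("d must be positive and not a power of two")
       shift = d.bit_length() - 1                 # floor(log2(d))
       numerator = 1 << (shift + 32)
       m = -(-numerator // d)                     # ceiling division
       e = numerator % d
       if e <= (1 << shift):
           return m - 1, shift, 1                 # round-down variant
       return m, shift, 0


   def divide(linear_id, d):
       """Check the constants by dividing a 32-bit linear_id (assumed method)."""
       magic, shift, extra_flags = compute_magic_divisor(d)
       # extra_flags = 1 bumps the dividend by one (assumed correction).
       return (magic * (linear_id + extra_flags)) >> (32 + shift)

For d = 3 this yields shift = 1, extra_flags = 1, and magic_divisor =
``0xAAAAAAAA``. The computed value lies between :math:`2^{31}` and
:math:`2^{32}`, so its top bit is the one the hardware treats as implicit.
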