VC4
===

Mesa's ``vc4`` graphics driver supports multiple implementations of
Broadcom's VideoCore IV GPU. It is notably used in the Raspberry Pi 0
through Raspberry Pi 3 hardware, and the driver is included as an
option as of the 2016-02-09 Raspbian release using ``raspi-config``.
On most other distributions such as Debian or Fedora, you need no
configuration to enable the driver.

This Mesa driver talks directly to the `vc4
<https://www.kernel.org/doc/html/latest/gpu/vc4.html>`__ kernel DRM
driver for scheduling graphics commands, and that module also provides
KMS display support.  The driver makes no use of the closed-source VPU
firmware on the VideoCore IV block, instead talking directly to the
GPU block from Linux.

GLES2 support
-------------

The vc4 driver is a nearly conformant GLES2 driver, and the hardware
has achieved GLES2 conformance with other driver stacks.

OpenGL support
--------------

Along with GLES 2.0, the Mesa driver also exposes OpenGL 2.1, which is
mostly correct but with a few caveats.

* 4-byte index buffers.

GLES 2.0, and vc4, don't have ``GL_UNSIGNED_INT`` index buffers. To support
them in vc4, we create a shadow copy of your index buffer with the
indices truncated to 2 bytes. This is incorrect (and will fail an
assertion in debug builds of Mesa) if any of the indices are greater
than 65535. To fix that, we would need to detect this case and rewrite
the index buffer and vertex buffers to do a series of draws, each with
small indices and new vertex attrib bindings.

To avoid this problem, ensure that all index buffers are written using
``GL_UNSIGNED_SHORT``, even at the cost of doing multiple draw calls
with updated vertex attrib bindings.
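
As a rough application-side illustration of that advice, the sketch
below splits a triangle list with large indices into chunks whose
local indices each fit in ``GL_UNSIGNED_SHORT``. The function name
``split_triangles`` and its return layout are hypothetical and not
part of Mesa. The sketch assumes a plain triangle list and leaves the
actual vertex buffer copies and draw calls to the application.

.. code-block:: python

    MAX_VERTICES = 65536  # GL_UNSIGNED_SHORT can address indices 0..65535


    def split_triangles(indices, max_vertices=MAX_VERTICES):
        """Split a triangle list into chunks with small local indices.

        Returns a list of (vertex_ids, local_indices) pairs. vertex_ids
        lists the original vertex numbers to copy into that chunk's
        vertex buffer; local_indices indexes into vertex_ids.
        (Hypothetical helper, not a Mesa API.)
        """
        if len(indices) % 3 != 0:
            raise ValueError("expected a triangle list")
        if max_vertices < 3:
            raise ValueError("a chunk must hold at least one triangle")

        chunks = []
        remap, vertex_ids, local_indices = {}, [], []
        for i in range(0, len(indices), 3):
            tri = indices[i:i + 3]
            new = {v for v in tri if v not in remap}
            # Start a new chunk if this triangle would overflow the
            # number of vertices the current chunk can address.
            if len(vertex_ids) + len(new) > max_vertices:
                chunks.append((vertex_ids, local_indices))
                remap, vertex_ids, local_indices = {}, [], []
            for v in tri:
                if v not in remap:
                    remap[v] = len(vertex_ids)
                    vertex_ids.append(v)
                local_indices.append(remap[v])
        if local_indices:
            chunks.append((vertex_ids, local_indices))
        return chunks

Each returned chunk would then get its own vertex buffer contents and
one draw call using 16-bit indices.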

* Occlusion queries

The VC4 hardware has no support for occlusion queries.  GL 2.0
requires that you support the occlusion queries extension, but you can
report 0 from
``glGetQueryiv(GL_SAMPLES_PASSED, GL_QUERY_COUNTER_BITS)``. This is
absurd, but it's how OpenGL handles "we want the functions to be
present everywhere, but we want it to be optional for hardware to
support it." Sadly, gallium doesn't yet allow the driver to report 0
query bits.

* Primitive mode

VC4 doesn't support reducing triangles/quads/polygons to lines and
points like desktop GL. If the front and back modes matched, we could
rewrite the index buffer to the new primitive type, but we don't. If
the front and back modes don't match, we would need to run the vertex
shader in software, classify the prims, write new index buffers, and
emit (possibly many) new draw calls to rasterize the new prims in the
same order.

Bug Reporting
-------------

VC4 rendering bugs should go to Mesa's gitlab `issues
<https://gitlab.freedesktop.org/mesa/mesa/-/issues>`__ page.

By far the easiest way to communicate bug reports for rendering
problems is to take an apitrace. This passes exactly the drawing you
saw to the developer, without the developer needing to download and
build the application and replicate whatever steps you took to produce
the problem.  Traces attached to bug reports should ideally be small.

For GPU hangs, if you can get a short apitrace that produces the
problem, that's still the best.  If the problem takes a long time to
reproduce or you can't capture it in a trace, describing how to
reproduce and including a GPU hang dump would be the most
useful. Install `vc4-gpu-tools
<https://github.com/anholt/vc4-gpu-tools/>`__ and use
``vc4_dump_hang_state my-app.hang``. Sometimes the hang file will
provide useful information.

Tiled Rendering
---------------

VC4 is a tiled renderer, chopping the screen into 64x64 (non-MSAA) or
32x32 (MSAA) tiles and rendering the scene per tile. Rasterization
looks like::

    (CPU) Allocate space to store a list of draw commands per tile
    (CPU) Set up a command list per tile that does:
        Either load the current tile's color buffer from memory, or clear it.
        Either load the current tile's depth buffer from memory, or clear it.
        Branch into the draw list for the tile
        Store the depth buffer if anybody might read it.
        Store the color buffer if anybody might read it.
    (GPU) Initialize the per-tile draw call lists to empty.
    (GPU) Run all draw calls collecting vertex data
    (GPU) For each tile covered by a draw call's primitive:
        Emit state packets to the list to update it to the current draw call's state.
        Emit a primitive description into the tile's draw call list.

Tiled rendering avoids the need for large render target caches, at the
expense of increasing the cost of vertex processing. Unlike some tiled
renderers, VC4 has no non-tiled rendering mode.
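
To get a feel for the amount of per-tile setup involved, the sketch
below estimates how many tiles cover a render target, using the 64x64
and 32x32 tile sizes given above. The helper name ``tile_count`` is
hypothetical, and it assumes that partially covered tiles at the right
and bottom edges count as whole tiles.

.. code-block:: python

    def tile_count(width, height, msaa=False):
        """Estimate the number of tiles covering a width x height target.

        Assumes edge tiles that are only partly covered still count as
        full tiles. (Hypothetical helper, not a Mesa API.)
        """
        tile = 32 if msaa else 64
        tiles_x = -(-width // tile)   # ceiling division
        tiles_y = -(-height // tile)
        return tiles_x * tiles_y

For a 1920x1080 target this gives 30 x 17 = 510 tiles without MSAA and
60 x 34 = 2040 tiles with MSAA.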

Performance Tricks
------------------

* Reducing memory bandwidth by clearing.

Even if your drawing is going to cover the entire render target, it's
more efficient for VC4 if you emit a ``glClear()`` of the color and
depth buffers. This means we can skip the load of the previous state
from memory, in favor of a cheap GPU-side ``memset()`` of the tile
buffer before we start running the draw calls.

* Reducing memory bandwidth with scissoring.

If all draw calls for the frame use a ``glScissor()`` that restricts
them to only part of the screen, then we can skip setting up the tiles
for the rest of the screen, which means a little less memory used
setting up the empty bins, and a lot less memory used loading/storing
the unchanged tiles.

* Reducing memory bandwidth with ``glInvalidateFramebuffer()``.

If we don't know who might use the contents of the framebuffer's depth
or color in the future, then we have to store it for later. If you use
``glInvalidateFramebuffer()`` before accessing the results of your
rendering, then we can skip the store of the depth or color
buffer. Note that this is unimplemented.

* Avoid non-constant GLSL array indexing

In VC4 the only non-constant-index array access supported in hardware
is uniforms. For everything else (inputs, outputs, temporaries), we
have to lower them to an IF ladder like::

  if (index == 0)
    return array[0]
  else if (index == 1)
    return array[1]
  ...

This is very expensive, as we probably have to execute every branch of
every IF statement because the hardware is a SIMD machine. So, it is
recommended (if you can) to avoid non-uniform non-constant array
indexing.

Note that if you do variable indexing within a bounded loop that Mesa
can unroll, that can actually count as constant indexing.

* Increasing GPU memory by increasing the CMA pool size

The memory for the VC4 driver is allocated from the standard Linux CMA
pool. The size of this pool defaults to 64 MB.  To increase this, pass
an additional parameter on the kernel command line.  Edit the boot
partition's ``cmdline.txt`` to add::

  cma=256M@256M

``cmdline.txt`` is a single line with whitespace-separated parameters.

The first value is the size of the pool and the second value is
the start address of the pool. The pool size can be increased further,
but it must fit into the memory, so size + start address must be below
1024M (Pi 2, 3, 3+) or 512M (Pi B, B+, Zero, Zero W). Also, this
reduces the memory available to Linux.
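
The size + start address rule can be checked with a few lines of
arithmetic. The sketch below is only an illustration: the helper names
``parse_cma`` and ``cma_fits`` are hypothetical, it handles only the
``cma=<size>@<start>`` form with ``K``, ``M`` or ``G`` suffixes, and it
treats a pool that ends exactly at the limit as fitting, which is an
assumption on top of the text above.

.. code-block:: python

    import re

    _UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


    def parse_cma(param):
        """Parse 'cma=<size>@<start>' into (size_bytes, start_bytes)."""
        m = re.fullmatch(r"cma=(\d+)([KMG])@(\d+)([KMG])", param)
        if m is None:
            raise ValueError(f"unsupported cma parameter: {param!r}")
        size = int(m.group(1)) * _UNITS[m.group(2)]
        start = int(m.group(3)) * _UNITS[m.group(4)]
        return size, start


    def cma_fits(param, ram_mib):
        """Return True if the pool ends at or below ram_mib MiB.

        Assumption: ending exactly at the limit counts as fitting.
        """
        size, start = parse_cma(param)
        return (size + start) <= (ram_mib << 20)

For example, ``cma_fits("cma=256M@256M", 1024)`` is true, while
``cma_fits("cma=512M@256M", 512)`` is false.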

* Decrease firmware memory

The firmware allocates a fixed chunk of memory before booting
Linux. If firmware functions are not required, this amount can be
reduced.

In ``config.txt``, set ``gpu_mem`` to 16 if you do not need video
decoding, or to 64 if you do.

Performance debugging
---------------------

* Step 1: Known issues

The first tool to look at is running your application with the
environment variable ``VC4_DEBUG=perf`` set. This will report debug
information for many known causes of performance problems on the
console. Not all of them will cause visible performance improvements
when fixed, but it's a good first step to see what might be going wrong.

* Step 2: CPU vs GPU

The primary question is figuring out whether the CPU is busy in your
application, the CPU is busy in the GL driver, the GPU is waiting for
the CPU, or the CPU is waiting for the GPU. Ideally, you get to the
point where the CPU is waiting for the GPU infrequently but for a
significant amount of time (however long it takes the GPU to draw a
frame).

Start with ``top`` while your application is running. Is the CPU usage
around 90%+? If so, then our performance analysis will be with
sysprof. If it's not very high, is the GPU staying busy? We don't have
a clean tool for this yet, but ``cat /debug/dri/0/v3d_regs`` could be
useful. If ``CT0CA`` != ``CT0EA`` or ``CT1CA`` != ``CT1EA``, that
means that the GPU is currently busy processing some rendering job.
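
The register comparison can be scripted. The sketch below is a minimal
illustration with hypothetical helper names (``parse_regs``,
``gpu_busy``). It assumes that each line of the register dump contains
the register name and ends with its value in hexadecimal; the exact
layout of ``v3d_regs`` may differ between kernel versions, so treat
the parsing as an assumption to adapt.

.. code-block:: python

    import re

    # Matches e.g. "V3D_CT0CA = 0x00001000" and takes the last hex
    # value on the line. The dump format is an assumption.
    _REG = re.compile(r"(CT[01][CE]A)\b.*\b0x([0-9a-fA-F]+)\s*$")


    def parse_regs(text):
        """Collect the CTnCA/CTnEA register values from a register dump."""
        regs = {}
        for line in text.splitlines():
            m = _REG.search(line)
            if m:
                regs[m.group(1)] = int(m.group(2), 16)
        return regs


    def gpu_busy(text):
        """Apply the CT0CA != CT0EA or CT1CA != CT1EA check."""
        regs = parse_regs(text)
        return (regs["CT0CA"] != regs["CT0EA"]
                or regs["CT1CA"] != regs["CT1EA"])

Reading the dump once gives only a snapshot, so sampling it repeatedly
while the application runs gives a better picture of how busy the GPU
is.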

* sysprof for CPU usage

If the CPU is totally busy and the GPU isn't terribly busy, there is
an excellent tool for debugging: sysprof. Install, run as root (so you
can get system-wide profiling), hit play and later stop. The top-left
area shows the flat profile sorted by total time of that symbol plus
its descendants. The top few are generally uninteresting (``main()`` and
its descendants consuming a lot), but eventually you can get down to
something interesting. Click it, and to the right you get the
callchains to descendants -- where all that time actually went. On the
other hand, the lower left shows callers -- double-clicking those
selects that as the symbol to view, instead.

Note that you need debug symbols for the callgraphs in sysprof to
work, which is where most of its value is. Most distributions offer
debug symbol packages from their builds which can be installed
separately, and sysprof will find them. I've found that on ARM, the
debug packages are not enough, and if someone could determine what is
necessary for callgraphs in debugging, that would be really helpful.

* perf for CPU waits on GPU

If the CPU is not very busy and the GPU is not very busy, then we're
probably ping-ponging between the two. Most cases of this would be
noticed by ``VC4_DEBUG=perf``, but not all. To see all cases where
this happens, use the perf tool from the Linux kernel (note: unrelated
to ``VC4_DEBUG=perf``)::

    sudo perf record -f -g -e vc4:vc4_wait_for_seqno_begin -c 1 openarena

If you want to see the whole system's stalls for a period of time
(very useful!), use the ``-a`` flag instead of a particular command
name. Just ``^C`` when you're done capturing data.

At exit, you'll have ``perf.data`` in the current directory. You can print
out the results with::

    perf report | less

* Debugging for GPU fully busy

As of Linux kernel 4.17 and Mesa 18.1, we now expose the hardware's
performance counters in OpenGL. Install apitrace, and trace your
application with::

    apitrace trace <application>          # for GLX applications
    apitrace trace -a egl <application>   # for EGL applications

Once you've captured a trace, you can see what counters are available
and replay it while looking at some of those counters::

    apitrace replay <application>.trace --list-metrics

    apitrace replay <application>.trace --pdraw=GL_AMD_performance_monitor:QPU-total-clk-cycles-vertex-coord-shading

Multiple counters can be captured at once with commas separating them.

Once you've found what draw calls are surprisingly expensive in one of
the counters, you can work out which ones they were at the GL level by
opening the trace up in qapitrace and using ``^-G`` to jump to that call
number and ``^-L`` to look up the GL state at that call.

shader-db
---------

shader-db is often used as a proxy for real-world app performance when
working on the compiler in Mesa.  On vc4, there is a lot of
state-dependent code in the shaders (like blending or vertex attribute
format handling), so the typical `shader-db
<https://gitlab.freedesktop.org/mesa/shader-db>`__ will miss important
areas for optimization.  Instead, anholt wrote a `new one
<https://cgit.freedesktop.org/~anholt/shader-db-2/>`__ based on
apitraces.  Once you have a collection of traces, starting from
`traces-db <https://gitlab.freedesktop.org/gfx-ci/tracie/traces-db/>`__,
you can test a compiler change in this shader-db with::

  ./run.py > before
  (cd ../mesa && make install)
  ./run.py > after
  ./report.py before after

Hardware Documentation
----------------------

For driver developers, Broadcom publicly released a `specification
<https://docs.broadcom.com/doc/12358545>`__ PDF for the 21553, which
is closely related to the vc4 GPU present in the Raspberry Pi.  They
also released a `snapshot <https://docs.broadcom.com/docs/12358546>`__
of a corresponding Android graphics driver.  That graphics driver was
ported to Raspbian for a demo, but was not expected to have ongoing
development.

Developers under NDA with Broadcom or Raspberry Pi can
potentially get access to "simpenrose", the C software simulator of
the GPU.  The Mesa driver includes a backend (``vc4_simulator.c``) to
use simpenrose from an x86 system with the i915 graphics driver, with
all of the vc4 rendering commands emulated on simpenrose and memcpyed
to the real GPU.