VC4
===

Mesa's ``vc4`` graphics driver supports multiple implementations of
Broadcom's VideoCore IV GPU. It is notably used in the Raspberry Pi 0
through Raspberry Pi 3 hardware, and the driver can be enabled with
``raspi-config`` as of the 2016-02-09 Raspbian release.
On most other distributions such as Debian or Fedora, you need no
configuration to enable the driver.

This Mesa driver talks directly to the `vc4
<https://www.kernel.org/doc/html/latest/gpu/vc4.html>`__ kernel DRM
driver for scheduling graphics commands, and that module also provides
KMS display support. The driver makes no use of the closed source VPU
firmware on the VideoCore IV block, instead talking directly to the
GPU block from Linux.

GLES2 support
-------------

The vc4 driver is a nearly conformant GLES2 driver, and the hardware
has achieved GLES2 conformance with other driver stacks.

OpenGL support
--------------

Along with GLES 2.0, the Mesa driver also exposes OpenGL 2.1, which is
mostly correct but with a few caveats.

* 4-byte index buffers.

GLES 2.0, and vc4, don't have ``GL_UNSIGNED_INT`` index buffers.
To support them in vc4, we create a shadow copy of your index buffer
with the indices truncated to 2 bytes. This is incorrect (and will
fail an assertion in debug builds of Mesa) if any of the indices are
greater than 65535. To fix that, we would need to detect this case and
rewrite the index buffer and vertex buffers to do a series of draws
each with small indices and new vertex attrib bindings.

To avoid this problem, ensure that all index buffers are written using
``GL_UNSIGNED_SHORT``, even at the cost of doing multiple draw calls
with updated vertex attrib bindings.

* Occlusion queries

The VC4 hardware has no support for occlusion queries. GL 2.0
requires that you support the occlusion queries extension, but you can
report 0 from ``glGetQueryiv(GL_SAMPLES_PASSED,
GL_QUERY_COUNTER_BITS)``. This is absurd, but it's how OpenGL handles
"we want the functions to be present everywhere, but we want it to be
optional for hardware to support it." Sadly, gallium doesn't yet allow
the driver to report 0 query bits.

* Primitive mode

VC4 doesn't support reducing triangles/quads/polygons to lines and
points like desktop GL. If front/back mode matched, we could rewrite
the index buffer to the new primitive type, but we don't.
If the front/back modes don't match, we would need to run the vertex
shader in software, classify the prims, write new index buffers, and
emit (possibly many) new draw calls to rasterize the new prims in the
same order.

Bug Reporting
-------------

VC4 rendering bugs should go to Mesa's GitLab `issues
<https://gitlab.freedesktop.org/mesa/mesa/-/issues>`__ page.

By far the easiest way to communicate bug reports for rendering
problems is to take an apitrace. This passes exactly the drawing you
saw to the developer, without the developer needing to download and
build the application and replicate whatever steps you took to produce
the problem. Traces attached to bug reports should ideally be small.

For GPU hangs, if you can get a short apitrace that produces the
problem, that's still the best. If the problem takes a long time to
reproduce or you can't capture it in a trace, describing how to
reproduce and including a GPU hang dump would be the most
useful. Install `vc4-gpu-tools
<https://github.com/anholt/vc4-gpu-tools/>`__ and use
``vc4_dump_hang_state my-app.hang``. Sometimes the hang file will
provide useful information.

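
As a rough sketch of the workflow above, the commands below capture a
trace, check that replaying it still shows the problem, and dump the
hang state for a GPU hang. ``my-app`` and the file names are
placeholders, and this assumes apitrace and vc4-gpu-tools are already
installed::

    # Capture a trace of the (hypothetical) application into my-app.trace
    apitrace trace -o my-app.trace my-app

    # Replay the trace to confirm that it reproduces the problem
    apitrace replay my-app.trace

    # For GPU hangs, dump the hang state to attach to the bug report
    vc4_dump_hang_state my-app.hang
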

Tiled Rendering
---------------

VC4 is a tiled renderer, chopping the screen into 64x64 (non-MSAA) or
32x32 (MSAA) tiles and rendering the scene per tile. Rasterization
looks like::

    (CPU) Allocate space to store a list of draw commands per tile
    (CPU) Set up a command list per tile that does:
        Either load the current tile's color buffer from memory, or clear it.
        Either load the current tile's depth buffer from memory, or clear it.
        Branch into the draw list for the tile
        Store the depth buffer if anybody might read it.
        Store the color buffer if anybody might read it.
    (GPU) Initialize the per-tile draw call lists to empty.
    (GPU) Run all draw calls collecting vertex data
    (GPU) For each tile covered by a draw call's primitive.
        Emit state packets to the list to update it to the current draw call's state.
        Emit a primitive description into the tile's draw call list.

Tiled rendering avoids the need for large render target caches, at the
expense of increasing the cost of vertex processing. Unlike some tiled
renderers, VC4 has no non-tiled rendering mode.

Performance Tricks
------------------

* Reducing memory bandwidth by clearing.

Even if your drawing is going to cover the entire render target, it's
more efficient for VC4 if you emit a ``glClear()`` of the color and
depth buffers. This means we can skip the load of the previous state
from memory, in favor of a cheap GPU-side ``memset()`` of the tile
buffer before we start running the draw calls.

* Reducing memory bandwidth with scissoring.

If all draw calls for the frame use a ``glScissor()`` that covers only
part of the screen, then we can skip setting up the tiles outside that
area, which means a little less memory used setting up the empty bins,
and a lot less memory used loading/storing the unchanged tiles.

* Reducing memory bandwidth with ``glInvalidateFramebuffer()``.

If we don't know who might use the contents of the framebuffer's depth
or color in the future, then we have to store it for later. If you use
``glInvalidateFramebuffer()`` before accessing the results of your
rendering, then we can skip the store of the depth or color
buffer. Note that this is unimplemented.

* Avoid non-constant GLSL array indexing

In VC4 the only non-constant-index array access supported in hardware
is uniforms.
For everything else (inputs, outputs, temporaries), we
have to lower them to an IF ladder like::

    if (index == 0)
        return array[0]
    else if (index == 1)
        return array[1]
    ...

This is very expensive as we probably have to execute every branch of
every IF statement due to it being a SIMD machine. So, it is
recommended (if you can) to avoid non-uniform non-constant array
indexing.

Note that if you do variable indexing within a bounded loop that Mesa
can unroll, that can actually count as constant indexing.

* Increasing GPU memory: increase CMA pool size

The memory for the VC4 driver is allocated from the standard Linux CMA
pool. The size of this pool defaults to 64 MB. To increase this, pass
an additional parameter on the kernel command line. Edit the boot
partition's ``cmdline.txt`` to add::

    cma=256M@256M

``cmdline.txt`` is a single line with whitespace separated parameters.

The first value is the size of the pool and the second value is
the start address of the pool. The pool size can be increased further,
but it must fit into the memory, so size + start address must be below
1024M (Pi 2, 3, 3+) or 512M (Pi B, B+, Zero, Zero W).
Note that this also reduces the memory available to Linux.

* Decrease firmware memory

The firmware allocates a fixed chunk of memory before booting
Linux. If firmware functions are not required, this amount can be
reduced.

In ``config.txt``, set ``gpu_mem`` to 16 if you do not need video
decoding, or to 64 if you do.

Performance debugging
---------------------

* Step 1: Known issues

The first tool to look at is running your application with the
environment variable ``VC4_DEBUG=perf`` set. This will report debug
information for many known causes of performance problems on the
console. Not all of them will cause visible performance improvements
when fixed, but it's a good first step to see what might be going wrong.

* Step 2: CPU vs GPU

The primary question is figuring out whether the CPU is busy in your
application, the CPU is busy in the GL driver, the GPU is waiting for
the CPU, or the CPU is waiting for the GPU. Ideally, you get to the
point where the CPU is waiting for the GPU infrequently but for a
significant amount of time (however long it takes the GPU to draw a
frame).

Start with ``top`` while your application is running.
Is the CPU usage
around 90%+? If so, then our performance analysis will be with
sysprof. If it's not very high, is the GPU staying busy? We don't have
a clean tool for this yet, but ``cat /debug/dri/0/v3d_regs`` could be
useful. If ``CT0CA`` != ``CT0EA`` or ``CT1CA`` != ``CT1EA``, that
means that the GPU is currently busy processing some rendering job.

* sysprof for CPU usage

If the CPU is totally busy and the GPU isn't terribly busy, there is
an excellent tool for debugging: sysprof. Install, run as root (so you
can get system-wide profiling), hit play and later stop. The top-left
area shows the flat profile sorted by total time of that symbol plus
its descendants. The top few are generally uninteresting (main() and
its descendants consuming a lot), but eventually you can get down to
something interesting. Click it, and to the right you get the
callchains to descendants -- where all that time actually went. On the
other hand, the lower left shows callers -- double-clicking those
selects that as the symbol to view, instead.

Note that you need debug symbols for the callgraphs in sysprof to
work, which is where most of its value is. Most distributions offer
debug symbol packages from their builds which can be installed
separately, and sysprof will find them.
I've found that on ARM, the
debug packages are not enough, and if someone could determine what is
necessary for callgraphs in debugging, that would be really helpful.

* perf for CPU waits on GPU

If the CPU is not very busy and the GPU is not very busy, then we're
probably ping-ponging between the two. Most cases of this would be
noticed by ``VC4_DEBUG=perf``, but not all. To see all cases where
this happens, use the perf tool from the Linux kernel (note: unrelated
to ``VC4_DEBUG=perf``)::

    sudo perf record -f -g -e vc4:vc4_wait_for_seqno_begin -c 1 openarena

If you want to see the whole system's stalls for a period of time
(very useful!), use the ``-a`` flag instead of a particular command
name. Press ``^C`` when you're done capturing data.

At exit, you'll have ``perf.data`` in the current directory. You can print
out the results with::

    perf report | less

* Debugging for GPU fully busy

As of Linux kernel 4.17 and Mesa 18.1, we now expose the hardware's
performance counters in OpenGL.
Install apitrace, and trace your
application with::

    apitrace trace <application>          # for GLX applications
    apitrace trace -a egl <application>   # for EGL applications

Once you've captured a trace, you can see what counters are available
and replay it while looking at some of those counters::

    apitrace replay <application>.trace --list-metrics

    apitrace replay <application>.trace --pdraw=GL_AMD_performance_monitor:QPU-total-clk-cycles-vertex-coord-shading

Multiple counters can be captured at once with commas separating them.

Once you've found what draw calls are surprisingly expensive in one of
the counters, you can work out which ones they were at the GL level by
opening the trace up in qapitrace and using ``^-G`` to jump to that call
number and ``^-L`` to look up the GL state at that call.

shader-db
---------

shader-db is often used as a proxy for real-world app performance when
working on the compiler in Mesa. On vc4, there is a lot of
state-dependent code in the shaders (like blending or vertex attribute
format handling), so the typical `shader-db
<https://gitlab.freedesktop.org/mesa/shader-db>`__ will miss important
areas for optimization.
Instead, anholt wrote a `new one
<https://cgit.freedesktop.org/~anholt/shader-db-2/>`__ based on
apitraces. Once you have a collection of traces, starting from
`traces-db <https://gitlab.freedesktop.org/gfx-ci/tracie/traces-db/>`__,
you can test a compiler change in this shader-db with::

    ./run.py > before
    (cd ../mesa && make install)
    ./run.py > after
    ./report.py before after

Hardware Documentation
----------------------

For driver developers, Broadcom publicly released a `specification
<https://docs.broadcom.com/doc/12358545>`__ PDF for the 21553, which
is closely related to the vc4 GPU present in the Raspberry Pi. They
also released a `snapshot <https://docs.broadcom.com/docs/12358546>`__
of a corresponding Android graphics driver. That graphics driver was
ported to Raspbian for a demo, but was not expected to have ongoing
development.

Developers under NDA with Broadcom or Raspberry Pi can
potentially get access to "simpenrose", the C software simulator of
the GPU. The Mesa driver includes a backend (``vc4_simulator.c``) to
use simpenrose from an x86 system with the i915 graphics driver, with
all of the vc4 rendering commands emulated on simpenrose and memcpyed
to the real GPU.