Open, per-instruction SQTT instruction-stitch for Linux/RADV .rgp captures: an
RGP-equivalent "Instruction Timing" view without the closed Radeon GPU Profiler GUI, focused on
graphics (PS/GS/VS) shaders on gfx11/RDNA3.
"RGP" / "Radeon GPU Profiler" are trademarks of AMD. This is an independent, open-source tool built on AMD's open
rocprof-trace-decoder; it is not affiliated with or endorsed by AMD.
The SQTT decode engine is AMD's open rocprof-trace-decoder; rgp-cli does not reimplement it. rgp-cli is:
- A patch (`patches/graphics-stitch.patch`) that makes that decoder stitch gfx11 graphics frames. The stock decoder is tuned for compute and derails on real graphics workloads.
- A validation harness (`src/oracle_isa.c`) that feeds byte-exact `amdgpu-dis` disassembly to the decoder and joins every traced instruction back to its real ISA line (reproducing RGP's instruction view). `UNRES=1` classifies unresolved tokens; `PERFDUMP=1` prints a frame-timing breakdown.
- A capture pipeline (`tools/`) that turns a `.rgp` (B00P/RADV, or AMD_RDF/Windows via the optional `rdf_spike` reader) into the raw SQTT streams + an absolute-address ISA map.
The patch is intended to go upstream to ROCm so every consumer benefits.
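
The harness's join step (traced instruction back to its ISA line) can be illustrated with a small lookup. This is only a sketch, not the code in `src/oracle_isa.c`: it assumes each traced instruction carries an absolute address and that the ISA map is a two-column `address<TAB>disassembly` file. The real `isa_map.tsv` column layout is not documented here, so `load_isa_map`, `join_trace`, and the sample rows are hypothetical.

```python
import csv
import io


def load_isa_map(tsv_text):
    """Parse 'address<TAB>disassembly' rows into {address: text}.

    The two-column layout is an assumption made for this sketch.
    """
    isa = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, malformed, or comment rows
        isa[int(row[0], 16)] = row[1]
    return isa


def join_trace(pcs, isa):
    """Pair each traced address with its ISA line; None marks an unresolved one."""
    return [(pc, isa.get(pc)) for pc in pcs]


# Hypothetical sample data, not taken from a real capture.
SAMPLE = "0x1000\ts_mov_b32 s0, 0\n0x1004\tv_add_f32 v0, v1, v2\n"
isa = load_isa_map(SAMPLE)
joined = join_trace([0x1000, 0x1004, 0x2000], isa)
```

In this sketch the address `0x2000` has no entry in the map and comes back as `None`, which corresponds loosely to a token that `UNRES=1` would need to classify.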
taowen/rgp-analyzer-cli wraps the stock decoder
for compute tuning and reports a stitch-confidence number; it has no exec-mask / graphics
support. rgp-cli is complementary: it fixes the decoder for graphics.
Verified on gfx11 (RX 7800 XT, RADV 26.1.1):
| Capture | Stock decoder | rgp-cli (patched) |
|---|---|---|
| vkcube / vkgears (gfx11 demos) | 100% | 100% |
| gfx12 nBody (compute) | n/a | 99.9% |
| Real-game gfx11 frame | 40.6% | 99.95% (7,833,986 / 7,837,716 instructions) |
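
The percentages in the table read as stitched instructions over total traced instructions. As a quick check of the real-game row, under that assumption (the helper name `stitch_rate` is made up for this example):

```python
def stitch_rate(stitched, total):
    """Return the stitched fraction as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * stitched / total


# Counts from the real-game gfx11 row of the table.
rate = stitch_rate(7_833_986, 7_837_716)
unstitched = 7_837_716 - 7_833_986
print(f"{rate:.2f}% stitched, {unstitched} instructions not stitched")
```

So the residual gap on that frame is 3,730 instructions out of roughly 7.8 million.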
The headline fixes:
- `s_waitcnt_depctr` → IMMED: gfx11 emits a timed token for it, where the stock gfx12 analogy marked it SKIP and orphaned the token.
- GS/PS shader-base disambiguation: PS/GS/HS bases share one slot (last-write-wins), so GS waves inherited the PS entry and derailed; the stitcher now disambiguates per wave by token-category fit.
- Matcher robustness for sparse graphics waves: don't derail on exec-mask control flow, skip tokens that carry no instruction, and recover instructions a loop re-executed via a backward scan.
The residual gap is category-matcher imprecision under sparse SQTT anchors; closing it fully would need an exact per-instruction sequencer rather than more heuristics.
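
To make "disambiguates per wave by token-category fit" concrete, here is a minimal sketch of the idea. It is not the logic in `patches/graphics-stitch.patch`: the category labels (`SALU`, `VALU`, `VMEM`), the base addresses, and the prefix-match scoring are all assumptions chosen for illustration.

```python
def fit_score(expected, observed):
    """Count leading positions where the wave's observed token categories
    match the candidate shader's instruction categories."""
    score = 0
    for exp, obs in zip(expected, observed):
        if exp != obs:
            break
        score += 1
    return score


def pick_shader_base(candidates, observed):
    """candidates maps base address -> instruction category list.
    Return the base whose categories best fit the observed tokens."""
    return max(candidates, key=lambda base: fit_score(candidates[base], observed))


# Hypothetical candidates that would share one shader-base slot.
candidates = {
    0x10000: ["SALU", "VALU", "VALU", "VMEM"],  # e.g. a PS
    0x20000: ["SALU", "SALU", "VMEM", "VALU"],  # e.g. a GS
}
wave_tokens = ["SALU", "SALU", "VMEM"]  # categories seen in one wave's SQTT tokens
best = pick_shader_base(candidates, wave_tokens)
```

Here the wave's tokens fit the second candidate for three instructions and the first for only one, so the wave is attributed to base `0x20000` instead of inheriting whichever base was written last. A heuristic like this can still misfire when a wave has few tokens, which is the imprecision the residual gap refers to.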

    src/oracle_isa.c               stitch validation harness (the "oracle")
    tools/build_capture.py         .rgp -> se*_raw.bin + co_*.elf + isa_map.tsv (orchestrator)
    tools/build_codeobjects.py     code-object extraction (B00P + AMD_RDF)
    tools/build_isa_map.py         amdgpu-dis disassembly -> absolute-address ISA map
    patches/graphics-stitch.patch  the decoder fixes (apply onto the pinned commit)
    decoder/setup.sh               clone ROCm decoder @ pinned commit, apply patch, build the .so

- A C toolchain + `cmake` + `ninja`, and an LLVM with the AMDGPU backend (`LLVM_DIR`, default points at gentoo `llvm-22`).
- `amdgpu-dis` (ships with the Radeon Developer Tool Suite / ROCm); set `AMDGPU_DIS=/path/to/amdgpu-dis`.
- `python3`.

    make decoder                          # one-time: clone + patch + build the decoder .so
    make oracle                           # build bin/oracle_isa
    make run CAPTURE=path/to/frame.rgp    # build capture data (into /dev/shm/rgpcli) + stitch

    # from the OUT dir, with ROCPROF_SO pointing at the patched .so:
    UNRES=1 bin/oracle_isa se0_raw.bin       # classify unresolved tokens
    PERFDUMP=1 bin/oracle_isa se0_raw.bin    # per-frame timing / stall breakdown

MIT (LICENSE). The decoder patch modifies MIT-licensed ROCm code and is intended for upstream contribution.