This project studies and optimizes Transformer MatMul kernels (starting with BERT) using Apache TVM (TIR + MetaSchedule).
The primary goals are to:
- Extract real Transformer MatMul workloads
- Construct canonical TIR kernels
- Systematically evaluate manual scheduling strategies
- Compare against automated schedule search (MetaSchedule)
- Derive a rule-based schedule from empirical evidence that is deterministic, zero-cost, and interpretable
- Produce reproducible, quantitative performance results
The project emphasizes correctness, controlled experimentation, and explainable performance gains.
This section documents the complete reasoning chain — from raw benchmark observations through MetaSchedule trace analysis to the final rule-based schedule design. Every rule is traceable to quantitative evidence collected on the target hardware.
This research is validated across two distinct CPU environments (both exposing 12 processing threads):
| Property | Environment A (Mobile/Native) | Environment B (Desktop/VM) |
|---|---|---|
| CPU | Intel Core i5-1235U (Alder Lake, 12th Gen) | Intel Core i7-13700 (Raptor Lake, 13th Gen) |
| Core topology | 2 Performance cores (HT) + 8 Efficiency = 12 threads | VirtualBox VM: 12 vCPUs (1 thread/core allocated) |
| RAM | Native capacity | 16 GB (VM allocation) |
| L1-D cache | 48 KB (P-core), 32 KB (E-core) | 48 KB (P-core), 32 KB (E-core); host Raptor Lake values exposed to the VM. Total L1 is 80 KB per P-core (32 KB I + 48 KB D) and 96 KB per E-core (64 KB I + 32 KB D) |
| L2 cache | 1.25 MB (P-core), 2 MB (shared E-core cluster) | 1.25–2 MB per P-core; 2–4 MB per E-core cluster (varies by SKU) |
| SIMD | AVX2 — 256-bit registers, 8 × float32 per instruction | AVX2 — 256-bit registers, 8 × float32 per instruction |
| OS / Compiler | Linux, LLVM backend via TVM | Linux (VirtualBox VM), LLVM backend via TVM |
The three MatMul kernels studied correspond to BERT-base Transformer layers:
| Kernel | Shape (M × K × N) | K | N | Role |
|---|---|---|---|---|
| QKV | M × 768 × 768 | 768 | 768 | Query / Key / Value projection |
| MLP-expand | M × 768 × 3072 | 768 | 3072 | Feed-forward expansion |
| MLP-reduce | M × 3072 × 768 | 3072 | 768 | Feed-forward compression |
M (sequence length / batch rows) is swept across [16, 32, 64, 96, 128, 192, 256, 384] to cover realistic inference batch sizes.
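The benchmark grid implied by the kernel table and the M sweep can be enumerated as follows. This is a minimal sketch: the names `KERNELS`, `M_VALUES`, and `enumerate_shapes` are illustrative and are not taken from the project's code.

```python
from itertools import product

# (K, N) per kernel, copied from the kernel table above.
KERNELS = {
    "QKV": (768, 768),
    "MLP-expand": (768, 3072),
    "MLP-reduce": (3072, 768),
}
M_VALUES = [16, 32, 64, 96, 128, 192, 256, 384]


def enumerate_shapes():
    """Return (kernel, M, K, N) tuples for every benchmarked shape."""
    return [
        (name, m, k, n)
        for (name, (k, n)), m in product(KERNELS.items(), M_VALUES)
    ]


shapes = enumerate_shapes()
print(len(shapes))  # 3 kernels x 8 M values
```

The grid has 3 × 8 = 24 entries, which is the shape count referred to throughout this document.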
The rule-based schedule is not based on theoretical models or GPU-oriented heuristics. Every rule is derived from quantitative benchmarks collected on the target CPU across four evidence sources:
- Single-transform manual schedules (baseline, k4, k8, k16, k32, k64, parallel, vec_j, parallel_k16, parallel_vec_j, vec_j_k16, full): isolate the gain from each optimisation and reveal cross-transform interactions.
- MetaSchedule auto-tuning (256 trials × 3 iterations per shape): establishes a performance ceiling and exposes optimal tile-size ranges.
- Cache working-set analysis: validates that the chosen tiles fit in the smallest L1-D on the chip (32 KB, E-core).
- MetaSchedule trace analysis: parsing the best tuning records from MetaSchedule's JSON logs revealed three structural transforms (cache_write, decompose_reduction, pragma_auto_unroll_max_step) universally present in top-performing schedules. This was the key insight that closed the majority of the performance gap.
Each manual schedule variant isolates one or two TIR schedule
transforms applied to the canonical matmul_tir(M, K, N) kernel:
| Variant | Transforms applied |
|---|---|
| baseline | None (triple-nested loop as written) |
| k4 / k8 / k16 / k32 / k64 | split(k, TK) + reorder(i, j, k0, k1) + vectorize(j) |
| parallel | parallel(i) + vectorize(j) |
| vec_j | vectorize(j) |
| vec_k | split(k, 8) + vectorize(k1) (expected failure: reduction-axis vectorisation is illegal) |
| parallel_k16 | split(k, 16) + reorder(i, j, k0, k1) + parallel(i) + unroll(k1) |
| parallel_vec_j | split(j, 8) + reorder(i, j0, j1, k) + parallel(i) + vectorize(j1) |
| vec_j_k16 | split(j, 8) + split(k, 16) + reorder(i, j0, k0, j1, k1) + vectorize(j1) + unroll(k1) |
| full | split(j, 8) + split(k, 16) + reorder(i, j0, k0, j1, k1) + parallel(i) + vectorize(j1) + unroll(k1) |
Each variant is benchmarked across all 3 kernels × 8 M values = 24 shapes.
Latency is measured as the median of 50 runs after 5 warm-up executions.
All results are stored in research/results/bert_matmul_results.json.
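The measurement protocol (5 warm-up executions, then the median of 50 timed runs) can be sketched in plain Python. The helper `measure_median_ms` and the stand-in `dummy_kernel` are hypothetical; the project's actual runner may rely on TVM's own timing utilities instead.

```python
import statistics
import time


def measure_median_ms(fn, warmup=5, repeats=50):
    """Median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):          # warm-up runs are not timed
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def dummy_kernel():
    """Hypothetical stand-in for a compiled kernel invocation."""
    sum(i * i for i in range(1000))


latency_ms = measure_median_ms(dummy_kernel)
print(f"median latency: {latency_ms:.4f} ms")
```

The median is robust to a few slow outliers, which matters given the scheduling noise discussed next.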
On modern hybrid mobile architectures (like the target Alder Lake CPU with Performance and Efficiency cores), aggressive power limits (PL1/PL2 thresholds) and OS-level thread scheduling can introduce significant timing noise during back-to-back kernel execution. Because a deterministic matrix-multiplication kernel cannot run faster than its instruction-throughput ceiling, execution variance is skewed almost exclusively upward by external system factors.
In our workflow we observed that host environments such as WSL (and other native multi-scheduler setups) can exhibit occasional P↔E core migrations, frequency throttling, and scheduler-driven jitter that increase measurement variance. To obtain more consistent, reproducible microbenchmark results we therefore run the main experiments inside an isolated VirtualBox VM on the i7-13700 host. The VM provides a stable 12‑vCPU allocation and a controlled execution context (16 GB RAM) that reduces host-level thread migration and thermal interference, while leveraging the stronger host hardware. When publishing results we annotate the profile (for example, i7-13700) and the VM allocation so readers can reproduce the environment.
As such, observing the standard deviation alongside the absolute minimum execution time becomes a critical indicator of true hardware-level algorithmic efficiency. A high standard deviation signifies that the CPU encountered frequency throttling or thread migration to slower cores during a test batch. Recognizing this variance is essential when evaluating micro-optimizations, as relying purely on a mean or median metric can obfuscate genuine architectural gains under temporary thermal strain.
Continuous Execution and Thermal Throttling: During extensive continuous execution (such as running many iterations or sweeping all variants back-to-back), modern CPUs quickly exhaust their turbo boost time bounds (e.g., PL2 state) and step down to lower sustained power limits (PL1). This thermal and power throttling can inflate latencies by up to 2× starting from the 2nd or 3rd consecutive iteration. To ensure baseline consistency and measure true algorithmic limits rather than the cooling capacity of the host system, benchmark runners must incorporate artificial cooldown periods (e.g., 3-second sleep) between heavy test batches, allowing the CPU to shed heat and reset boost timers.
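A runner that reports minimum, median, and standard deviation per batch and inserts a cooldown between batches could look like the sketch below. The function names and the batch interface (a callable returning a list of latency samples) are assumptions made for illustration. The `sleep` parameter exists only so the cooldown can be stubbed out in tests.

```python
import statistics
import time


def summarize(samples_ms):
    """Minimum, median and sample standard deviation of one latency batch."""
    return {
        "min": min(samples_ms),
        "median": statistics.median(samples_ms),
        "stdev": statistics.stdev(samples_ms) if len(samples_ms) > 1 else 0.0,
    }


def run_batches(batches, cooldown_s=3.0, sleep=time.sleep):
    """Run (name, run_batch) pairs, sleeping between consecutive batches.

    run_batch is a callable returning a list of latency samples in ms.
    """
    results = {}
    for idx, (name, run_batch) in enumerate(batches):
        if idx > 0:
            sleep(cooldown_s)        # let the CPU shed heat and reset boost timers
        results[name] = summarize(run_batch())
    return results
```

A batch whose standard deviation is large relative to its minimum is a candidate for re-measurement, since it likely hit throttling or core migration.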
Intel Hybrid Architecture Thread Allocation and Core Affinity Limitations:
Thread scheduling is a critical challenge on Intel hybrid processors (such as Alder Lake 2P+8E configurations) and on hypervisors running on top of them. By default, standard C++ runtimes and TVM's execution backend frequently halve the estimated hardware thread count (for example by ignoring hyperthreads or assuming only performance cores), reducing a 12-thread capacity to 6. In addition, the thread pool restricts task placement to kBig or kLittle core affinities and does not map independent computation blocks across both core types. To saturate all available threads, we remove the hardcoded hardware-concurrency cap in the C++ threading backend and enable custom kSpecifyThreadShareAllCore logic. We also override TVM's environment parameters (TVM_NUM_THREADS=12, TVM_BIND_THREADS=0) so that our manually tuned, parallelised TIR schedules can distribute dense execution blocks over all logical processing threads, which yielded immediate latency improvements.
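The environment overrides can be applied from Python before the runtime is loaded. This sketch covers only the two environment variables named above; the C++ threading-backend change is not shown. The helper name `configure_tvm_threads` is hypothetical, and setting the variables before TVM is first imported is a conservative assumption rather than a documented requirement.

```python
import os


def configure_tvm_threads(num_threads=12, bind_threads=False):
    """Set TVM's thread-pool environment variables for this process."""
    os.environ["TVM_NUM_THREADS"] = str(num_threads)
    os.environ["TVM_BIND_THREADS"] = "1" if bind_threads else "0"


# Call this before the first `import tvm` in the benchmarking script.
configure_tvm_threads()
print(os.environ["TVM_NUM_THREADS"], os.environ["TVM_BIND_THREADS"])
```

Exporting the same variables in the shell that launches the benchmark achieves the same effect.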
| Kernel (K) | k16 / k32 | k16 / k64 |
|---|---|---|
| QKV (768) | 0.39–0.58× | 0.58–0.72× |
| MLP-expand (768) | 0.44–0.53× | 0.47–0.56× |
| MLP-reduce (3072) | 0.43–0.49× | 0.42–0.51× |
Ratios < 1 mean k16 is faster.
When testing manual, single-transform schedules in isolation, k16 empirically emerged as the fastest split. It consistently outperformed k32 by 1.7–2.6× and k64 by 1.4–2.4× across all three kernels — including MLP-reduce where K = 3072. The raw reason is cache locality: with TK = 16, the B-strip loaded per reduction step is naturally smaller, keeping it strictly in L1 and reducing cache misses.
While TK=16 is optimal as an isolated single transform, combining it with a full compilation stack (tiling M and N, spatial vectorization _VEC_WIDTH=8, decompose_reduction, and local block caching via cache_write) creates a paradox. A separate Ablation Study (research/analysis/analysis_tk.py) executed straight inside the full rule-based pipeline exposes a different truth:
| TK | Geometric-mean speedup (normalized to TK=8) |
|---|---|
| 4 | 0.927× |
| 8 | 1.000× |
| 16 | 0.862× |
| 32 | 0.759× |
| 64 | 0.638× |
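A normalized geometric-mean speedup of this kind can be computed as in the sketch below. The latencies in `example` are made up purely to exercise the arithmetic and are not measurements from this project; the function names are illustrative and are not taken from analysis_tk.py.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalized_speedup(latencies, reference):
    """Per-variant geomean over shapes of reference_latency / variant_latency."""
    ref = latencies[reference]
    return {
        name: geomean(r / c for r, c in zip(ref, lat))
        for name, lat in latencies.items()
    }


# Made-up latencies (ms) for two shapes, for illustration only.
example = {"TK=8": [1.0, 4.0], "TK=16": [2.0, 2.0]}
print(normalized_speedup(example, "TK=8"))
```

Values below 1 mean the variant is slower than the reference on (geometric) average. The geometric mean is used so that no single large shape dominates the summary.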
Why does a full pipeline demand TK=8 when manual sweeps love TK=16?
- L1 Cache and Register Spills: When cache_write block constraints are coupled with decompose_reduction loops, TK=16 generates an inner accumulation workload that frequently spills CPU registers.
- SIMD Alignment: TK=8 exactly matches the AVX2 spatial vector width (VEC_WIDTH=8). The compiler can execute tightly coupled FMA (Fused Multiply-Add) operations without splitting vectors awkwardly over reduction boundaries.
- Footprint Scaling: A TK=8 tile keeps the localized B-strip footprint at $8 \times 64 \times 4 = 2{,}048$ bytes, just ~6% of the 32 KB L1 cache of an Alder Lake E-core. TK=16 doubles this pressure during decomposed unrolling, crowding out the parallel local C-accumulation block.
→ Rule R1: TK = 8 universally coupled with cache_write bounds, safely balancing cache pressure with maximal vector lane usage.
| Kernel | parallel / baseline |
|---|---|
| QKV | 6–8× |
| MLP-expand | 9–10× |
| MLP-reduce | 21–28× |
parallel(i) alone delivers the largest single-transform speed-up.
MLP-reduce benefits most because K = 3072 makes the baseline loop
extremely slow and parallelism eliminates the primary bottleneck.
On the tested 12-thread layouts (both the i5-1235U with 2P+8E cores, and the 12-vCPU i7-13700 VM allocation), the parallel outer loop distributes M rows across available threads. Even modest M values (M = 16) provide enough iterations for reasonable utilisation.
→ Rule R2: Always parallelise the outer loop.
| Kernel | vec_j / baseline | parallel+vec_j / baseline |
|---|---|---|
| QKV | 1.5–1.8× | 10–13× |
| MLP-expand | 2.8–3.8× | 19–25× |
| MLP-reduce | 5.4–6.1× | 14–18× |
Combining parallel + vec_j yields speed-ups close to the
product of their individual gains — the transforms are nearly
orthogonal. Pure vectorisation alone is moderate (1.5–6×), but when
paired with parallelism the inner SIMD utilisation of each thread
multiplies throughput.
AVX2 processes 8 × float32 = 256 bits per SIMD instruction. The innermost column loop (j) is split so its innermost lane has exactly 8 elements, matching the hardware vector width.
At this stage, we introduced j_pack to solve a different problem than
SIMD lane width. VEC_WIDTH=8 is fixed by hardware; j_pack decides how
many vector-width chunks are grouped into one inner j micro-kernel tile.
Goal of j_pack:
- increase useful work per inner iteration (higher ILP, less loop-control overhead),
- keep contiguous vector-friendly memory access,
- stop before register pressure and write-back overhead dominate.
So j_pack is a software blocking knob (j_pack = VEC_WIDTH * pack_mult),
while VEC_WIDTH remains a hardware constant. Trend 8 later selects the
best fixed pack_mult empirically.
→ Rule R3: Vectorise the innermost j-lane at AVX2 width (8 × float32).
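The relationship between VEC_WIDTH, pack_mult, and the resulting j loop extents is simple arithmetic. The sketch below assumes TN = 64 and pack_mult = 4 as defaults (the values this document settles on in Rules R6 and R8); the function name `j_loop_extents` is illustrative.

```python
VEC_WIDTH = 8  # float32 lanes in one 256-bit AVX2 register


def j_loop_extents(n, tn=64, pack_mult=4):
    """Extents of (j_outer, j_inner_outer, j_vec) and vectors per pack."""
    j_pack = VEC_WIDTH * pack_mult
    if n % tn or tn % j_pack:
        raise ValueError("this sketch assumes exact divisibility")
    return {
        "j_outer": n // tn,
        "j_inner_outer": tn // j_pack,
        "j_vec": j_pack,
        "vectors_per_pack": j_pack // VEC_WIDTH,
    }


print(j_loop_extents(768))               # pack_mult = 4
print(j_loop_extents(768, pack_mult=1))  # plain AVX2-width inner lane
```

With pack_mult = 1 the inner tile holds a single AVX2 vector and j_inner_outer has 8 iterations; with pack_mult = 4 each inner iteration covers four vectors and j_inner_outer shrinks to 2.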
parallel_k16 is slower than parallel alone for QKV and MLP-reduce:
splitting the reduction axis and reordering without the j-axis column
split worsens memory access patterns. The k-split reorders the loop
nest so that adjacent memory accesses on the j (column) dimension are
no longer contiguous, breaking spatial locality.
The k-split becomes beneficial only when combined with a j-split +
vectorise (as in full), where the j-tiling restores column
locality within each tile.
→ Rule R4: Never apply k-tiling without j-tiling and vectorisation.
During rule derivation we deliberately placed decompose_reduction after
most structural transforms (tiling, reorder, cache_write, fuse,
parallel, vectorize, and unroll pragmas) rather than immediately
after tiling/reorder. This location is not arbitrary — it is required
by TVM's TensorIR scheduling invariants and verified by targeted A/B
experiments.
What we considered and why we did not move it earlier
- Idea: run sch.decompose_reduction immediately after tiling and sch.reorder so later passes (cache write, reverse_compute_at, vectorize/unroll) operate on the clean C_init / C_update split. This seems reasonable because the canonical matmul kernel includes an explicit T.init() path and decomposing early appears to simplify the block structure.
- Why this looked attractive: an early decomposition isolates the update-only block that cache_write could target more precisely, potentially producing cleaner local buffers and simpler vector/unroll decisions for the update path.
What we tested (A/B) and concrete failures observed
- We created minimal TIR reproducer scripts and ran two variants:
  - Position 1: decompose_reduction right after reorder and before cache_write.
  - Position 2: decompose_reduction after cache_write but before fuse/parallel/vectorize.
- Results:
  - Position 1 (early, before cache_write) crashes at the cache_write primitive with the diagnostic: "ScheduleError: The buffer C is expected to be written by single block, but got 2 blocks who write it." Explanation: decompose_reduction splits the single C writer into C_init and C_update, so cache_write can no longer assume a unique writer for the output buffer, an invariant cache_write requires.
  - Position 2 (after cache_write, before parallel) crashes at the parallel primitive with the diagnostic: "ScheduleError: The queried subtree root ... does not have compact dataflow, because its child block ... is neither a local complete block nor a local reduction block." Explanation: TVM requires the loop subtree targeted by parallel (and similar structural transforms) to have "compact dataflow": the block under that subtree must be a properly formed local reduction block (i.e., it must still contain the init statement) or a local complete block. Early decompose_reduction removes T.init() from the update block and breaks these invariants; parallel refuses the transformation.
Why we place decompose_reduction late (current design)
- Keeping the accumulation (T.init() + update) unified through the tiling/reorder/cache-write/fuse/parallel/vectorize/unroll stages preserves TVM's block invariants. This allows cache_write to see a single authoritative writer and allows parallel/vectorize to detect a legitimate reduction block and apply structural transformations safely.
- When decompose_reduction is executed after these passes, TVM automatically duplicates the applied loop structure and pragmas onto the newly created C_init block. This preserves the intended vectorisation, unrolling, and pragma annotations for both init and update paths, producing correct and stable codegen.
Practical takeaways
- The earlier intuition is correct for many compiler frameworks but not for TVM's current TIR schedule semantics: decompose_reduction cannot be freely moved ahead of cache_write or parallel without violating internal invariants.
- Therefore the rule-based schedule intentionally defers decompose_reduction until after the structural passes; this is the only placement that both (a) preserves TVM's correctness checks and (b) retains the micro-kernel semantics we want (vectorised, unrolled update + matching init path).
Files used for verification
- /tmp/test_decompose_positions.py: minimal TIR reproducer used to verify both positions and capture the error traces cited above.
- research/workloads/common/rule_based_schedule.py: the rule-based schedule that applies decompose_reduction after loop-level structural transforms (current canonical ordering).
| Kernel | full / baseline | rule_based / baseline | Gain |
|---|---|---|---|
| QKV | 10–12× | 21–30× | ~2.3× |
| MLP-expand | 27–34× | 52–73× | ~2.1× |
| MLP-reduce | 16–18× | 28–33× | ~1.9× |
The full manual schedule only parallelises the raw i loop. For
small M (e.g. M = 16), this yields only 16 parallel iterations, which
cannot be divided evenly over a 12-thread topology and leaves load
imbalance (compounded by P- versus E-core speed differences on native
hybrid silicon).
The rule-based schedule tiles both i and j, then fuses the
outer tile loops before calling parallel. This generates:
| M | TM | N | TN | Parallel tasks |
|---|---|---|---|---|
| 16 | 16 | 768 | 64 | 1 × 12 = 12 |
| 32 | 32 | 768 | 64 | 1 × 12 = 12 |
| 64 | 64 | 3072 | 64 | 1 × 48 = 48 |
| 128 | 64 | 768 | 64 | 2 × 12 = 24 |
| 384 | 64 | 3072 | 64 | 6 × 48 = 288 |
Even at M = 16, the fused loop provides exactly 12 tasks — one per thread — which is sufficient for the 12-thread topology. For larger M, oversubscription further improves load balancing.
→ Rule R5: Tile i and j, fuse outer tiles, then parallelise.
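The parallel-task counts in the table above follow directly from the tile sizes. A minimal sketch, with `parallel_tasks` as an illustrative helper name:

```python
import math


def parallel_tasks(m, n, tm, tn=64):
    """Number of fused (i_outer, j_outer) iterations handed to parallel()."""
    return math.ceil(m / tm) * math.ceil(n / tn)


# (M, TM, N) rows from the table above.
for m, tm, n in [(16, 16, 768), (32, 32, 768), (64, 64, 3072),
                 (128, 64, 768), (384, 64, 3072)]:
    print(f"M={m:3d} TM={tm:2d} N={n:4d} -> {parallel_tasks(m, n, tm)} tasks")
```

Because N / TN is at least 12 for both N values studied here, the fused loop never exposes fewer tasks than the 12 available threads, even when the i dimension collapses to a single tile.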
The initial v1 rule-based schedule (with 2-level tiling + parallel + vectorise + unroll) was 1.5–2.4× slower than MetaSchedule on average:
| Kernel | v1 rule_based / metaschedule |
|---|---|
| QKV | 1.46× |
| MLP-expand | 2.35× |
| MLP-reduce | 1.57× |
To understand why, we parsed MetaSchedule's tuning records
(database_tuning_record.json files in research/results/metaschedule/).
Each record contains the full schedule trace: a list of TIR schedule
instructions and the decisions (tile factors, annotation values) that
produced the best latency.
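A trace-analysis pass of this kind can be sketched as below. The record layout is an assumption: each line is taken to be a JSON array `[workload_index, [[instructions, decisions], run_secs, target, args_info]]`, with every instruction stored as `[name, inputs, attrs, outputs]`. This should be checked against the files produced by the TVM version in use. The `sample` record is hand-written and heavily abbreviated, not copied from a real log.

```python
import json
from collections import Counter


def instruction_names(record_line):
    """Instruction names of one tuning-record line (assumed layout)."""
    _workload_idx, record = json.loads(record_line)
    trace = record[0]            # [instructions, decisions]
    instructions = trace[0]
    return [inst[0] for inst in instructions]


def count_instructions(lines):
    """Number of records in which each instruction appears at least once."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts.update(set(instruction_names(line)))
    return counts


# Hand-written, abbreviated record for illustration only.
sample = json.dumps([0, [[[["GetBlock", [], ["C", "main"], ["b0"]],
                          ["CacheWrite", ["b0"], [0, "global"], ["b1"]],
                          ["DecomposeReduction", ["b0", "l2"], [], ["b3"]]],
                         []],
                        [0.001], {}, []]])
print(count_instructions([sample]))
```

Comparing each instruction's count with the number of records shows which transforms are present in every top-performing schedule.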
Key structural findings from trace analysis:
1. Every top-performing trace uses cache_write. MetaSchedule's CacheWrite instruction creates a local buffer for the C output tile. Instead of accumulating partial sums directly in the global C matrix (causing repeated stores to a large, potentially L2/L3-resident array), the local buffer fits in registers or L1. A single write-back occurs after all reduction iterations complete.
2. Every trace uses DecomposeReduction. This separates the zero-initialisation of the C tile from the accumulation (multiply-add) loop. Without decomposition, the init is fused into the reduction loop body, requiring a conditional branch on every iteration to check whether this is the first k-step.
3. Every trace annotates with pragma_auto_unroll_max_step. MetaSchedule picks from {0, 16, 64, 512} per shape. This pragma tells the LLVM backend to automatically unroll small inner loops (e.g. the j_inner_outer loop with TN/VEC = 8 iterations).
4. 4-level spatial tiling (SSRSRS pattern). MetaSchedule splits each spatial axis into 4 factors and interleaves them with 2 reduction factors: i0, j0, i1, j1, k0, i2, j2, k1, i3, j3. This gives finer control over register blocking than our 2-level split.
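The effect of decomposing a reduction can be illustrated with a scalar Python analogue. This is not TIR and not what TVM generates; it only contrasts an init guarded inside the loop body with an init hoisted out of it. Both function names are illustrative, and the fused variant assumes a non-empty input.

```python
def dot_fused_init(a, b):
    """Reduction with the init guarded inside the loop body (len(a) >= 1)."""
    acc = None
    for k in range(len(a)):
        if k == 0:               # evaluated on every iteration
            acc = 0.0
        acc += a[k] * b[k]
    return acc


def dot_decomposed(a, b):
    """Same reduction after hoisting the init out of the loop."""
    acc = 0.0                    # init block
    for k in range(len(a)):      # update block: multiply-add only
        acc += a[k] * b[k]
    return acc


a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
print(dot_fused_init(a, b), dot_decomposed(a, b))
```

Both variants return the same value; the decomposed form simply keeps the hot loop free of the first-iteration check.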
We adopted findings 1–3 (structural transforms) into the rule-based schedule, while keeping our simpler 2-level tiling structure:
| Transform | What it does | TVM API call |
|---|---|---|
| cache_write | Accumulate C tile in local buffer; single write-back per tile | sch.cache_write(block, 0, "global") + sch.reverse_compute_at(C_write, j_outer) |
| decompose_reduction | Separate zero-init from accumulation loop | sch.decompose_reduction(block, k_outer) |
| pragma_auto_unroll | Let LLVM unroll small inner spatial loops | sch.annotate(fused, "pragma_auto_unroll_max_step", 64) + sch.annotate(fused, "pragma_unroll_explicit", 1) |
Combined with the TK = 8 finding from the cache_write-enabled sweep (Trend 1), these changes yielded dramatic improvements:
| Kernel | v1 / meta | v2 / meta | Improvement factor |
|---|---|---|---|
| QKV | 1.46× | 1.23× | 1.19× |
| MLP-expand | 2.35× | 1.32× | 1.78× |
| MLP-reduce | 1.57× | 1.29× | 1.22× |
MLP-expand saw the largest gain (1.78×) because it has the widest N dimension (3072), making the cache_write transform most impactful — the C tile (TM × 3072 × 4 bytes) is far too large for L1 without local buffering.
The historical residual ~1.70× gap to MetaSchedule is explained by three factors inherent to the auto-tuning approach:
- 4-level spatial tiling (SSRSRS) vs our 2-level — MetaSchedule has finer register blocking with 4 i-splits and 4 j-splits.
- Per-shape tile tuning — MetaSchedule tries 256 random configurations per shape and picks the empirical best; our rules use fixed heuristics.
- Per-shape unroll factors — MetaSchedule picks from {0, 16, 64, 512} per shape; we use a fixed 64.
The rule-based system intentionally trades this residual gap for determinism (same schedule every run), zero tuning cost (no search trials needed), and interpretability (every decision is traceable to a documented rule).
After regenerating the manual-schedule dataset in
research/results/bert_matmul_results.json, we re-ran analysis and
rule-based benchmarks end-to-end.
Key findings:
- Manual-only trend update: in regenerated manual data, among pure K-tiling variants (k4, k8, k16, k32, k64), k16 is fastest across all 24 shapes. This does not invalidate R1, because those manual recipes do not include the full rule-based transform stack (cache_write + decompose_reduction + fused parallel tiling).
- Rule-based re-benchmark (fresh run, all 24 shapes):
  - Geometric-mean speedup vs baseline: 132.61×
  - Geometric-mean speedup vs full: 4.36×
  - Geometric-mean speedup vs best manual variant per shape: 4.05×
  - Geometric-mean ratio vs MetaSchedule: 1.05×
- Rule-ablation check: small changes tested after regeneration (TK=4, TK=16, and wider TN values) did not produce a stable improvement over the current rule set; the existing TK=8, TN=64, TM-divisibility policy remains the most robust deterministic choice.
The later j_pack refinement, including motivation and measured
uplift, is documented in Trend 8 to keep all j_pack evidence in one place.
For M values that do not divide evenly by TM, the last outer tile under-utilises its register allocation. For example, M = 96 with TM = 64 gives one full tile (64 rows) + one 50%-utilised tile (32 rows in a 64-row allocation) — wasting register/L1 capacity.
The heuristic therefore prefers TM values that divide M cleanly:
| M | TM | Outer i-tiles | Clean division? |
|---|---|---|---|
| ≤ 32 | M | 1 | ✓ |
| 64 | 64 | 1 | ✓ |
| 96 | 32 | 3 | ✓ |
| 128 | 64 | 2 | ✓ |
| 192 | 64 | 3 | ✓ |
| 256 | 64 | 4 | ✓ |
| 384 | 64 | 6 | ✓ |
For M ≤ 32, TM = M processes the entire row dimension in a single
tile, eliminating outer-loop overhead and improving A-strip reuse.
This is safe because cache_write keeps the C tile in a local
buffer rather than L1, so the larger spatial tile doesn't cause L1
pressure.
→ Rule R7: TM = M for M ≤ 32; TM = 64 if M % 64 == 0; else TM = 32.
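Rule R7 translates into a three-branch function. A minimal sketch, with `select_tm` as an illustrative name that may differ from the helper in rule_based_schedule.py:

```python
def select_tm(m):
    """Row-tile size per Rule R7."""
    if m <= 32:
        return m          # single tile covers the whole row dimension
    if m % 64 == 0:
        return 64
    return 32


for m in [16, 32, 64, 96, 128, 192, 256, 384]:
    tm = select_tm(m)
    print(f"M={m:3d} TM={tm:2d} outer tiles={m // tm} clean={m % tm == 0}")
```

For every M in the sweep the selected TM divides M exactly, so no partially filled tile is produced.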
We re-checked best_schedules.json and quantified innermost j split decisions
across the 24 best records:
- 16: 12/24
- 32: 5/24
- 8: 3/24
- 64: 2/24
- 1: 2/24
While 16 is the mode, 32 appears frequently enough to suggest that the
compiler benefits from a wider inner packed lane on several shapes. In the
2-level deterministic schedule, changing the inner partition from 16 to 32
increases instruction-level parallelism in the inner micro-kernel without
changing TM, TN, or TK.
As introduced in Trend 3, j_pack is the software blocking factor above
fixed AVX2 lane width (VEC_WIDTH=8). The remainder of this section
selects the best fixed j_pack value for this rule-based schedule.
For the current rule-based skeleton (TN=64, 2-level tiling, fixed loop order),
we ran paired ABBA tests for fixed j-pack choices. Reported numbers are
geometric-mean candidate / j_pack32 (so < 1 is better than 32):
| Candidate j-pack | Geomean ratio vs 32 | Interpretation |
|---|---|---|
| 8 (1x AVX2) | 1.1387 | 13.9% slower |
| 16 (2x AVX2) | 1.0358 | 3.6% slower |
| 64 (8x AVX2) | 1.0972 | 9.7% slower |
Interpretation:
- 8 under-utilises ILP inside the micro-kernel.
- 16 improves over 8 but still leaves throughput on the table.
- 64 is too coarse for this schedule shape (higher register pressure and less effective inner blocking behavior).
- 32 is the best fixed-point trade-off in this deterministic 2-level design.
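The document does not spell out the ABBA procedure, so the sketch below shows one common interpretation: each shape is measured in the order A, B, B, A so that slow drift (for example, gradual thermal throttling) affects both candidates about equally, and the per-shape ratios are then combined with a geometric mean. The callables `time_a` and `time_b` are assumed to return one latency measurement each; all names are illustrative.

```python
import math


def abba_ratio(time_a, time_b):
    """Latency ratio B/A from one A-B-B-A measurement pass."""
    a1, b1, b2, a2 = time_a(), time_b(), time_b(), time_a()
    return (b1 + b2) / (a1 + a2)


def geomean(values):
    """Geometric mean of positive numbers."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def abba_geomean(pairs):
    """Geomean of B/A over (time_a, time_b) pairs, one pair per shape."""
    return geomean(abba_ratio(a, b) for a, b in pairs)
```

With A as j_pack = 32 and B as the candidate, a result above 1 means the candidate is slower, matching the convention of the table above. A linear drift across the four measurements cancels out in the ratio.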
We also tested dynamic policies inferred from best traces. Those were applied
to the current rule-based skeleton and compared via ABBA against fixed 32.
Reported numbers are geometric-mean dynamic / fixed32:
| Dynamic policy | Geomean ratio vs fixed 32 | Interpretation |
|---|---|---|
| Exact per-(kernel,M) trace value | 1.0958 | 9.6% slower |
| Exact per-(kernel,M), floor at 8 | 1.0583 | 5.8% slower |
| Exact per-(kernel,M), clamp to [8,32] | 1.0133 | 1.3% slower |
| M-only majority map, clamp to [8,32] | 1.0550 | 5.5% slower |
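The "floor" and "clamp" policies in the table reduce to simple bounds on the trace-derived j factor. A minimal sketch with illustrative function names; the input values are the innermost j split factors observed in the best records:

```python
def clamp_j_pack(trace_value, lo=8, hi=32):
    """Clamp a trace-derived j factor into [lo, hi]."""
    return max(lo, min(hi, trace_value))


def floor_j_pack(trace_value, lo=8):
    """Apply only the lower bound (the 'floor at 8' policy)."""
    return max(lo, trace_value)


observed = (1, 8, 16, 32, 64)
print([clamp_j_pack(v) for v in observed])
print([floor_j_pack(v) for v in observed])
```

The lower bound of 8 keeps the inner lane at least one AVX2 vector wide; the upper bound of 32 caps it at the fixed j_pack value.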
Why this regresses (and why we do not transplant the full MetaSchedule structure here):
- We implemented a MetaSchedule-structured 4-level SSRSRS-like rule-based variant and benchmarked it end-to-end; it was legal after fixes but still much slower (geomean about 1.69x vs current rule-based).
- We then tested single-structure hybrids to isolate cause. Interleaving-only and deeper i-tiling-only variants also regressed strongly (about 2.35x and 1.36x geomean vs baseline rule-based, respectively).
- We tested alternate cache-write anchoring in isolation; dynamic anchor selection did not improve throughput (about 1.03x geomean, i.e., slower than baseline). More aggressive fuse-frontier anchoring attempts also hit schedule legality constraints (fuse sibling/predicate restrictions).
- Conclusion: these MetaSchedule decisions are co-tuned as a package with loop structure, unroll choices, and cache placement. Porting only the j factor (or only one structural piece) into the 2-level deterministic skeleton breaks that co-optimization.
- Therefore, dynamic j_pack inferred from traces is not adopted: in this schedule context it is slower, less stable, and more complex than fixed j_pack=32.
Therefore we keep fixed j_pack=32: it is faster, simpler, and more robust
for the current deterministic schedule design.
Re-validation (fresh all-kernel run, 24 shapes) after switching to j_pack = 32:
- Geometric-mean speedup vs previous rule-based: 1.25×
- QKV: 1.38× faster
- MLP-expand: 1.30× faster
- MLP-reduce: 1.09× faster
- rule_based/meta geometric-mean ratio: 1.90× → 1.05×
→ Rule R8: Set j_vec inner partition to _VEC_WIDTH * 4 (32 for AVX2).
After adopting fixed j_pack=32 for compute, we tested whether the write-back
path from C_write should use the same packing width.
Tested write-back variants (strict ABBA, 24 shapes, same evaluator settings as main benchmarks):
| Write-back strategy | Geomean ratio vs previous write-back | Interpretation |
|---|---|---|
| Split by j_pack=32 + vectorize inner (new_writeback_jpack) | 0.9605 | 3.9% faster overall |
| Split by AVX2 width 8 + vectorize inner (alt_writeback_vec8) | 0.9850 | 1.5% faster overall |
Per-kernel geomean for adopted strategy (new/old):
- QKV: 0.9728 (2.7% faster)
- MLP-expand: 0.8939 (10.6% faster)
- MLP-reduce: 1.0190 (1.9% slower)
Interpretation:
- The previous write-back vectorization (vectorize(last_loop)) could generate a wider-than-needed vectorized write-back lane in this schedule shape.
- Explicitly splitting write-back by j_pack keeps compute and store blocking consistent and gives the best overall geomean on this suite.
Adopted update in rule_based_schedule.py:
- from: sch.vectorize(write_loops[-1])
- to: split(write_loops[-1], factors=[None, j_pack]) + vectorize(write_inner)
The following potential enhancements were experimentally evaluated but not adopted because they did not yield consistent improvements:
| Enhancement | Tested configuration | Result | Reason not adopted |
|---|---|---|---|
| cache_read for B | sch.cache_read(block, 1, "global") + compute_at(B_read, k_outer) | Neutral to 8% slower | B-strip (TK×TN×4 = 2 KB) already fits in L1; copying to a local buffer adds overhead without benefit. |
| TN = 128 | Double column tile width | Neutral (0.99–1.03×) | Halves the number of j-outer tiles, reducing parallel tasks without improving inner-loop efficiency. |
| TK = 4 | Half the current reduction tile | No stable win after regenerated-data re-validation | Increased variance and inconsistent cross-kernel gains vs TK=8. |
| Dynamic j-pack by M / trace | Per-shape j factors inferred from MetaSchedule best traces | 1.01–1.10× slower vs fixed j_pack=32 | j factors depend on a different (4-level) schedule context; transplanting them alone regresses this 2-level rule-based schedule. |
| Write-back split by vec8 | Split C_write innermost loop by 8 then vectorize | 0.9850× vs previous write-back (weaker win) | Improves less than j_pack=32 write-back split (0.9605×), so not selected as default. |
Note on TK = 4: Earlier experiments suggested potential gains in some ranges, but regenerated-data re-validation did not show a stable cross-kernel improvement. TK = 8 remains the default for consistency and reproducibility.
The final rule-based schedule applies 12 rules derived from the trends above:
| Rule | Parameter | Value | Source trend | Justification |
|---|---|---|---|---|
| R1 | TK (reduction tile) | 8 | Trend 1 | TK=8 + cache_write beats TK=16 by 25–40%; B-strip = 2 KB fits in L1 |
| R2 | Parallelism | Always | Trend 2 | 6–28× gain; highest-impact single transform |
| R3 | VEC_WIDTH | 8 | Trend 3 | AVX2 = 256 bit / 32-bit float; vectorise innermost j-lane |
| R4 | Loop order | Fixed | Trend 4 | fused(io,jo) → ko → ii → ji_o → ki → j_vec; k-tile only with j-tile |
| R5 | Outer fusion | Always | Trend 5 | Fuse io×jo for sufficient thread utilisation (≥ 12 tasks for 12 threads) |
| R6 | TN (column tile) | 64 | Trends 3,5 | 8 × VEC; good A-reuse vs parallel-task balance for N ∈ {768, 3072} |
| R7 | TM (row tile) | M-dep | Trend 7 | M (≤32) / 64 (M%64==0) / 32 (else); ensures clean tile division |
| R8 | j-pack (compute + write-back) | 32 | Trend 8 | 4× AVX2 pack width; best fixed-point trade-off and better write-back geomean than previous write-back vectorization |
| R9 | Unroll ki | Always | Trend 1 | TK = 8 ≤ UNROLL_LIMIT; eliminates branch overhead in hot loop |
| R10 | cache_write | Always | Trend 6 | Local C accumulation → register/L1 resident; single write-back per tile |
| R11 | decompose_reduction | Always | Trend 6 | Separate init from accumulation; removes branch from hot loop |
| R12 | auto_unroll | 64 | Trend 6 | pragma_auto_unroll_max_step = 64; lets LLVM unroll inner spatial loops |
The schedule is constructed in the following order within
apply_rule_based_schedule():
Step 1: Split i → (i_outer, i_inner) with factor TM
Step 2: Split j → (j_outer, j_inner) with factor TN
Step 3: Split k → (k_outer, k_inner) with factor TK
Step 4: Split j_inner → (j_inner_outer, j_vec) with factor J_PACK=32
Step 5: Reorder → io, jo, ko, ii, ji_o, ki, j_vec
Step 6: cache_write(C, 0, "global") + reverse_compute_at(C_write, jo)
Step 7: Fuse(io, jo) → fused; parallel(fused)
Step 8: Vectorize(j_vec)
Step 9: Split write-back innermost loop by J_PACK=32, then vectorize inner write-back lane
Step 10: Unroll(k_inner)
Step 11: Annotate(fused, pragma_auto_unroll_max_step, 64)
Step 12: Annotate(fused, pragma_unroll_explicit, 1)
Step 13: decompose_reduction(block, k_outer)
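The loop order produced by these steps can be emulated in plain Python to check that tiling, local accumulation, and write-back compute the same result as the naive kernel. This is a scalar sketch, not TVM code: the parallel, vectorised, and unrolled loops are ordinary `for` loops here, exact divisibility is assumed, and the function names are illustrative.

```python
def matmul_reference(a, b, m, k, n):
    """Naive triple loop: C[i][j] = sum over kk of A[i][kk] * B[kk][j]."""
    return [[sum(a[i][kk] * b[kk][j] for kk in range(k)) for j in range(n)]
            for i in range(m)]


def matmul_rule_order(a, b, m, k, n, tm, tn, tk, j_pack):
    """Walk the loop order fused(io, jo), ko, ii, ji_o, ki, j_vec."""
    assert m % tm == 0 and n % tn == 0 and k % tk == 0 and tn % j_pack == 0
    c = [[0.0] * n for _ in range(m)]
    n_jo = n // tn
    for fused in range((m // tm) * n_jo):            # Step 7: parallel in TVM
        io, jo = divmod(fused, n_jo)
        local = [[0.0] * tn for _ in range(tm)]      # Steps 6 and 13: zeroed local C tile
        for ko in range(k // tk):
            for ii in range(tm):
                for ji_o in range(tn // j_pack):
                    for ki in range(tk):             # Step 10: unrolled in TVM
                        a_val = a[io * tm + ii][ko * tk + ki]
                        b_row = b[ko * tk + ki]
                        for jv in range(j_pack):     # Step 8: vectorised in TVM
                            j = ji_o * j_pack + jv
                            local[ii][j] += a_val * b_row[jo * tn + j]
        for ii in range(tm):                         # single write-back per tile
            for j in range(tn):
                c[io * tm + ii][jo * tn + j] = local[ii][j]
    return c


M, K, N = 4, 4, 8
A = [[float(i + kk) for kk in range(K)] for i in range(M)]
B = [[float(kk * N + j) for j in range(N)] for kk in range(K)]
print(matmul_rule_order(A, B, M, K, N, tm=2, tn=4, tk=2, j_pack=2)
      == matmul_reference(A, B, M, K, N))
```

The small tile sizes in the usage example are chosen only to keep the check fast; the schedule itself uses TN = 64, TK = 8, and J_PACK = 32.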
With cache_write, the C tile is held in a local buffer (registers / L1)
and written back once after all reduction is complete. L1 pressure during
the hot accumulation loop comes from A-strip + B-strip only; the C
tile competes briefly during write-back.
| Config (TM, TN, TK) | A strip | B strip | C local | A+B (hot) | A+B+C | % of 32 KB L1-D |
|---|---|---|---|---|---|---|
| (16, 64, 8) | 512 B | 2 048 B | 4 096 B | 2 560 B | 6 656 B | 20.3% |
| (32, 64, 8) | 1 024 B | 2 048 B | 8 192 B | 3 072 B | 11 264 B | 34.4% |
| (64, 64, 8) | 2 048 B | 2 048 B | 16 384 B | 4 096 B | 20 480 B | 62.5% |
Formulas:
- A strip = TM × TK × 4 bytes
- B strip = TK × TN × 4 bytes
- C local = TM × TN × 4 bytes
All configurations fit within the smallest L1-D on the chip (32 KB E-core). The hot working set during accumulation (A-strip + B-strip) uses only 7.8–12.5% of that 32 KB (2 560–4 096 B), leaving ample room for C accumulation, prefetch buffers, and OS overhead.
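
The table values follow directly from the three formulas, and the short sketch below recomputes them. The helper name `working_set` and the dictionary keys are illustrative; the 32 KB budget and the float32 element size come from the text above.

```python
L1D_BYTES = 32 * 1024    # smallest L1-D considered (E-core)
FLOAT32_BYTES = 4


def working_set(tm, tn, tk):
    """Byte sizes of the A strip, B strip, and local C tile for one config."""
    a = tm * tk * FLOAT32_BYTES
    b = tk * tn * FLOAT32_BYTES
    c = tm * tn * FLOAT32_BYTES
    return {"A": a, "B": b, "C": c, "hot": a + b, "total": a + b + c}


for cfg in [(16, 64, 8), (32, 64, 8), (64, 64, 8)]:
    ws = working_set(*cfg)
    hot_pct = 100 * ws["hot"] / L1D_BYTES
    total_pct = 100 * ws["total"] / L1D_BYTES
    print(f"{cfg}: hot={ws['hot']} B ({hot_pct:.1f}%), "
          f"total={ws['total']} B ({total_pct:.1f}%)")
```

Running the same helper with a candidate TN or TK is a quick way to check whether a new configuration still fits the 32 KB budget before benchmarking it.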
The rule-based schedule prioritises three properties over raw peak performance:
- Determinism: The same (M, K, N, kernel) always produces the same schedule. No random search, no stochastic variation between runs.
- Zero tuning cost: No trials, no warm-up iterations, no database of tuning logs. The schedule is computed analytically from shape parameters in microseconds.
- Interpretability: Every decision traces to a numbered rule, which traces to a documented trend, which traces to benchmark data. This makes the system suitable for academic publication and reproducible research.
On the regenerated dataset and fresh re-runs, the current rule-based system typically stays within ~1.05× of MetaSchedule while satisfying all three properties.
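
One way to arrive at a single gap figure like the one above is a geometric mean of per-shape latency ratios. The sketch below assumes that aggregation (the text does not state how the ~1.05× figure is computed), and the latencies in it are placeholders, not measured results.

```python
import math


def geomean(values):
    """Geometric mean of a non-empty sequence of positive numbers."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geomean needs a non-empty sequence of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def latency_gap(rule_based_us, metaschedule_us):
    """Geomean of rule_based / MetaSchedule latency over the shapes both share."""
    shared = sorted(set(rule_based_us) & set(metaschedule_us))
    return geomean(rule_based_us[key] / metaschedule_us[key] for key in shared)


# Placeholder latencies in microseconds, keyed by (kernel, M). NOT real data.
rule_based = {("qkv", 64): 105.0, ("qkv", 128): 210.0}
metaschedule = {("qkv", 64): 100.0, ("qkv", 128): 200.0}
print(f"gap = {latency_gap(rule_based, metaschedule):.3f}x")
```

A geometric mean is the usual choice for ratios because a 2× slowdown on one shape and a 2× speedup on another cancel out, which an arithmetic mean would not reflect.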
While the deterministic rule-based schedule provides a zero-search-cost, highly interpretable baseline, it historically leaves a ~1.5× performance gap compared to deep evolutionary search methods like MetaSchedule. The ML-Guided approach sits between these paradigms: it seeks to predict the most impactful structural transformations and schedule parameters (knobs) a priori using extremely lightweight machine learning models, achieving near-MetaSchedule performance without the extensive trial-and-error compile times.
We employ LightGBM (using both LGBMRegressor and LGBMClassifier) to map canonical kernel geometries (M, K, N) to optimal structural schedule configurations. LightGBM was selected for its robustness on small, sparse tabular datasets and its fast inference time, meaning schedule prediction remains essentially instantaneous at compile time.
The training pipeline operates in three stages:
- Data Extraction: Raw MetaSchedule tuning logs (best_schedules.json) are parsed to extract the exact TIR schedule instructions and loop configurations that yielded the highest throughput per shape.
- Handling Sparse Datasets: To prevent degenerate model convergence on tiny datasets, the trainer dynamically adjusts min_child_samples and min_data_in_bin thresholds, drops constant features, and falls back to pure scalar payloads for invariant targets.
- Training & Persistence: Distinct models are serialized for each target schedule knob.
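
The sparse-dataset safeguards in the second stage can be sketched without LightGBM itself. Everything below is a hypothetical illustration: the helper names, the scaling rule for the thresholds, and the sample rows are assumptions, and only the caps (20 and 3) correspond to LightGBM's documented defaults for min_child_samples and min_data_in_bin.

```python
def drop_constant_features(rows, feature_names):
    """Keep only the features that take more than one distinct value."""
    return [f for f in feature_names if len({row[f] for row in rows}) > 1]


def invariant_target(rows, target):
    """Return the single observed value if the target never varies, else None."""
    values = {row[target] for row in rows}
    return next(iter(values)) if len(values) == 1 else None


def small_data_params(n_rows):
    """Shrink leaf and bin minimums so that tiny datasets can still split."""
    # Hypothetical scaling rule; the trainer's actual thresholds are not documented here.
    return {
        "min_child_samples": max(1, min(20, n_rows // 4)),
        "min_data_in_bin": max(1, min(3, n_rows // 4)),
    }


# Placeholder rows for a single kernel; values are illustrative, not tuning output.
rows = [
    {"M": 16, "K": 768, "N": 768, "unroll_factor": 64, "cache_write_used": 1},
    {"M": 64, "K": 768, "N": 768, "unroll_factor": 64, "cache_write_used": 1},
    {"M": 512, "K": 768, "N": 768, "unroll_factor": 16, "cache_write_used": 1},
]
print("features kept:", drop_constant_features(rows, ["M", "K", "N"]))
print("scalar payload for cache_write_used:", invariant_target(rows, "cache_write_used"))
print("params for 8 rows:", small_data_params(8))
```

In this sketch a target with a scalar payload would skip model fitting entirely and store the constant, while the remaining targets would be trained with the reduced thresholds.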
The pipeline predicts four distinct schedule transformation knobs that define the loop structure before tiling rules are applied:
| Knob | Type | Model Type | Relevance |
|---|---|---|---|
| vector_width | Continuous | Regressor | Defines the innermost spatial vectorization lane width. Typically maps to hardware AVX bounds (e.g., 32 or 64). |
| unroll_factor | Continuous | Regressor | Informs the pragma_auto_unroll_max_step bound, directly impacting LLVM's inner-loop instruction expansion (e.g., 16, 64, 512). |
| cache_write_used | Boolean | Classifier | Predicts whether to accumulate partial sums in a local register/L1 buffer before committing to the global C tensor. |
| reduction_decompose_used | Boolean | Classifier | Predicts whether the accumulation zero-initialization (T.init) should be decoupled from the core multiply-add loop. |
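
Because two of the knobs come from regressors, their raw outputs are continuous and generally need to be mapped onto values a schedule can actually use. The sketch below shows one plausible post-processing step, snapping to the nearest candidate. Whether the project does this, and which candidate sets it uses, is an assumption; the candidates here are simply the example values named in the table.

```python
def snap(value, candidates):
    """Return the candidate closest to value (ties resolve to the smaller one)."""
    return min(sorted(candidates), key=lambda c: abs(c - value))


VECTOR_WIDTHS = (32, 64)        # example values from the table
UNROLL_STEPS = (16, 64, 512)    # example values from the table


def finalize_knobs(raw):
    """Turn raw model outputs into discrete, schedule-ready knob values."""
    return {
        "vector_width": snap(raw["vector_width"], VECTOR_WIDTHS),
        "unroll_factor": snap(raw["unroll_factor"], UNROLL_STEPS),
        "cache_write_used": bool(raw["cache_write_used"]),
        "reduction_decompose_used": bool(raw["reduction_decompose_used"]),
    }


# Illustrative raw outputs, not real predictions.
raw = {"vector_width": 37.2, "unroll_factor": 70.9,
       "cache_write_used": 1, "reduction_decompose_used": 0}
print(finalize_knobs(raw))
```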
A bulk prediction sweep over all 24 MatMul shapes (3 kernels × 8 M values, covering the QKV and MLP layers) writes the predicted_knobs_all_shapes.json artifact, which is overwritten on every run.
The TVM ml_guided runtime dynamically ingests these knobs to construct the TIR schedule:
- Performance: Provides a data-driven warm start that structurally mimics MetaSchedule's best topologies, specifically capturing non-linear threshold crossings (such as the point where unroll_factor should drop off for very large batch rows).
- Stability mechanism: To handle out-of-distribution shapes or missing metrics, the scheduler is strictly defensive. If loop handles are invalidated (e.g., by sequencing reverse_compute_at incorrectly after a fuse operation) or ML artifacts are missing, the compiler falls back to the deterministic rule_based schedule.
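
The defensive fallback can be sketched as a try/except around artifact loading. This is a simplified illustration that covers only the missing-or-malformed-artifact case, not schedule-time failures such as invalidated loop handles. The function names, the JSON layout, and the `qkv_M64` key format are hypothetical; the fallback values mirror rules R8 to R12.

```python
import json
from pathlib import Path

REQUIRED = ("vector_width", "unroll_factor",
            "cache_write_used", "reduction_decompose_used")


def rule_based_knobs():
    """Stand-in for the deterministic path, using values from rules R8-R12."""
    return {"source": "rule_based", "vector_width": 32, "unroll_factor": 64,
            "cache_write_used": True, "reduction_decompose_used": True}


def load_ml_knobs(path, shape_key):
    """Read predicted knobs for one shape; raise if anything is missing."""
    data = json.loads(Path(path).read_text())
    knobs = data[shape_key]
    missing = [name for name in REQUIRED if name not in knobs]
    if missing:
        raise KeyError(f"missing knobs: {missing}")
    return {"source": "ml_guided", **{name: knobs[name] for name in REQUIRED}}


def choose_knobs(path, shape_key):
    """Prefer ML predictions, but never fail: fall back to the rule-based values."""
    try:
        return load_ml_knobs(path, shape_key)
    except (OSError, KeyError, ValueError, TypeError) as exc:
        print(f"ml_guided unavailable ({exc!r}); falling back to rule_based")
        return rule_based_knobs()


# The file below does not exist, so this call exercises the fallback path.
print(choose_knobs("does_not_exist.json", "qkv_M64"))
```

Catching a narrow set of exception types keeps genuine programming errors visible while still guaranteeing that a benchmark run always ends up with a usable schedule.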
This integration ultimately closes the loop between manual heuristics and black-box automation, creating a data-driven compiler pass that is fast, explainable, and resilient.
All commands are run from the Apache_TVM/ project root unless stated otherwise.
%%{init: {'theme': 'dark'}}%%
flowchart TB
subgraph SHAPES_AND_TEMPLATES["Shape Definitions & MatMul Template"]
E["matmul_templates.py → matmul_tir(M, K, N)<br>Build canonical TIR MatMul IRModule"]
D["bert_shapes.py<br>Expose qkv_shape, mlp_expanded_shape,<br>mlp_compressed_shape, M_LIST"]
end
A["env_check.py <br> Verify Python / PyTorch / Transformers"]
B["L0_canonical_verification.py<br>Verify TVM import, tvm.build, NDArray, LLVM"]
C["extract_matmul_shapes.py<br>Load pretrained BERT model<br>Inspect weight tensors<br>Write shapes to JSON"]
D --> E
A -- tvm_initialisation_checks --> B
A -- schedule_analysis --> C
H["schedule_recipes.py → apply_schedule()<br>Select variant: baseline / K-tiling / parallelisation / vectorisation / full / rule_based / ml_guided"]
H1["rule_based_schedule.py<br>apply_rule_based_schedule()<br>Auto-pick TM, TN, TK tiles<br>Split → Reorder → Fuse →<br>Parallelize → Vectorize → Unroll"]
G{"Choose scheduling<br>strategy"}
I["metaschedule_tune.py<br>ms.tir_integration.tune_tir()<br>Per kernel × per M<br>Store logs → research/results/metaschedule/"]
L["research/results/bert_matmul_results.json<br>All variant × kernel × M latencies"]
M["print_qkv_mlp_results.py<br>Load JSON → Tabulate by variant & M<br>Print summary"]
N["plot_qkv_mlp_results.py<br>Line plots + Heatmap<br>Optionally --save to file"]
T1["L1_vector_add.py<br>TIR vector add → build → verify vs NumPy"]
T2["L2_schedule_semantics.py<br>Schedule transforms → verify correctness"]
T3["L3_metaschedule.py<br>MetaSchedule smoke test"]
T4["L4_performance_and_ir.py<br>Performance measurement + IR dump"]
T5["L5_large_matmul.py<br>Large MatMul stress test + perf check"]
I1["metaschedule_log_parse.py<br>Parse tuning logs<br>Extract best latency"]
H2["Named manual recipe<br>Apply predefined schedule transforms"]
K["qkv_mlp_run.py<br>For each M in M_LIST:<br>• Create NDArrays (A, B, C)<br>• Warm-up runs<br>• Time rt_mod['main'] executions<br>• Append measurements"]
B --> T1
T1 --> T2
T2 --> T3
T3 --> T4
T4 --> T5
C --> G
G -- AutoTune (MetaSchedule) --> I
G -- Manual / Rule-based / ML-guided --> K
H -- rule_based --> H1
H -- baseline / K-tiling / parallelisation / vectorisation / full --> H2
I --> D
I --> I1
E --> I
E --> K
K --> D
K --> H
H1 --> L
H2 --> L
I1 --> L
L --> M
M --> N
%% ----------------------------
%% DARK MODE SAFE STYLES
%% ----------------------------
style A fill:#1f2a44,stroke:#58a6ff,color:#ffffff,stroke-width:2px
style B fill:#1f2a44,stroke:#58a6ff,color:#ffffff,stroke-width:2px
style C fill:#3a2f1f,stroke:#ffb86c,color:#ffffff,stroke-width:2px
style D fill:#3a2f1f,stroke:#ffb86c,color:#ffffff,stroke-width:2px
style E fill:#2d1f3a,stroke:#d2a8ff,color:#ffffff,stroke-width:2px
style H fill:#1f3a2a,stroke:#3fb950,color:#ffffff,stroke-width:2px
style H1 fill:#1f3a2a,stroke:#3fb950,color:#ffffff,stroke-width:2px
style H2 fill:#1f3a2a,stroke:#3fb950,color:#ffffff,stroke-width:2px
style G fill:#3a371f,stroke:#f2cc60,color:#ffffff,stroke-width:2px
style I fill:#3a1f2a,stroke:#ff7b72,color:#ffffff,stroke-width:2px
style I1 fill:#3a1f2a,stroke:#ff7b72,color:#ffffff,stroke-width:2px
style L fill:#243a1f,stroke:#7ee787,color:#ffffff,stroke-width:2px
style M fill:#2a1f3a,stroke:#a5a5ff,color:#ffffff,stroke-width:2px
style N fill:#2a1f3a,stroke:#a5a5ff,color:#ffffff,stroke-width:2px
style T1 fill:#2a2a2a,stroke:#8b949e,color:#ffffff,stroke-width:2px
style T2 fill:#2a2a2a,stroke:#8b949e,color:#ffffff,stroke-width:2px
style T3 fill:#2a2a2a,stroke:#8b949e,color:#ffffff,stroke-width:2px
style T4 fill:#2a2a2a,stroke:#8b949e,color:#ffffff,stroke-width:2px
style T5 fill:#2a2a2a,stroke:#8b949e,color:#ffffff,stroke-width:2px
style K fill:#1f3a3a,stroke:#56d4dd,color:#ffffff,stroke-width:2px
Execution Environment Setup — Run first
Before running any benchmark or schedule scripts, initialize the controlled environment and enter the pinned shell:
bash scripts/benchmark_settings.sh
Run all subsequent benchmark and schedule commands from within the shell started by this script. This locks CPU frequency, sets thread affinity, and reduces OS-level variability for stable microbenchmark results.
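
Before starting a long sweep it can be useful to confirm that the pinned shell actually restricted the process to the intended CPUs. The check below is an optional sketch and not part of the project scripts; it relies on `os.sched_getaffinity`, which is available on Linux only, and reports nothing useful elsewhere.

```python
import os


def allowed_cpus():
    """Return the set of CPU ids this process may run on, or None if unknown."""
    getter = getattr(os, "sched_getaffinity", None)   # Linux-only API
    return set(getter(0)) if getter else None


cpus = allowed_cpus()
if cpus is None:
    print("CPU affinity is not queryable on this platform")
else:
    print(f"process may run on {len(cpus)} logical CPUs: {sorted(cpus)}")
```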
python3 -m research.analysis.print_qkv_mlp_results # all kernels
python3 -m research.analysis.print_qkv_mlp_results qkv # QKV only
python3 -m research.analysis.print_qkv_mlp_results mlp_expand
python3 -m research.analysis.print_qkv_mlp_results mlp_reduce
Why:
Prints a consolidated pivot table of recorded MatMul latencies (µs) per kernel, grouped by
variant and M value. Shows shape info (HIDDEN, FF, K, N) and M-sweep config.
At the end it prompts Show plots? [y/N] — answering y launches the plotting script below.
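
The pivot described above amounts to grouping latency records by variant and M. The sketch below shows that grouping on placeholder records. The record fields (`kernel`, `variant`, `M`, `latency_us`) are a hypothetical schema chosen for illustration and may not match the layout of bert_matmul_results.json.

```python
from collections import defaultdict


def pivot_latencies(records, kernel):
    """Group latency records into {variant: {M: latency_us}} for one kernel."""
    table = defaultdict(dict)
    for rec in records:
        if rec["kernel"] == kernel:
            table[rec["variant"]][rec["M"]] = rec["latency_us"]
    return dict(table)


def render(table):
    """Format the pivot as fixed-width text, one row per variant."""
    ms = sorted({m for row in table.values() for m in row})
    lines = ["variant".ljust(12) + "".join(f"M={m}".rjust(10) for m in ms)]
    for variant in sorted(table):
        cells = "".join(
            (f"{table[variant][m]:.1f}" if m in table[variant] else "-").rjust(10)
            for m in ms
        )
        lines.append(variant.ljust(12) + cells)
    return "\n".join(lines)


# Placeholder records, NOT measured latencies.
records = [
    {"kernel": "qkv", "variant": "baseline", "M": 16, "latency_us": 100.0},
    {"kernel": "qkv", "variant": "baseline", "M": 64, "latency_us": 400.0},
    {"kernel": "qkv", "variant": "full", "M": 16, "latency_us": 10.0},
    {"kernel": "mlp_expand", "variant": "baseline", "M": 16, "latency_us": 300.0},
]
print(render(pivot_latencies(records, "qkv")))
```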
python3 -m research.analysis.plot_qkv_mlp_results # all kernels (interactive)
python3 -m research.analysis.plot_qkv_mlp_results qkv # single kernel
python3 -m research.analysis.plot_qkv_mlp_results --save # save PNGs (headless-safe)
python3 -m research.analysis.plot_qkv_mlp_results qkv --save # single kernel, save PNG
Why:
Generates one line chart per kernel (variant lines vs M, Y = latency µs) plus a
consolidated heatmap of all kernels on a single figure. Use --save to write PNGs to
research/results/plots/ instead of opening interactive windows (required for headless / no
DISPLAY environments).
source venv/bin/activate
python3 research/workloads/common/env_check.py
python3 research/workloads/bert/load_bert.py
python3 research/workloads/bert/extract_matmul_shapes.py
Note: filter_qkv.py is deprecated; extract_matmul_shapes.py now writes labelled shapes
directly to research/workloads/bert/bert_matmul_shapes.json.
python3 -m research.workloads.bert.matmul.qkv_mlp_run baseline --kernel qkv
python3 -m research.workloads.bert.matmul.qkv_mlp_run baseline --kernel mlp_expand
python3 -m research.workloads.bert.matmul.qkv_mlp_run baseline --kernel mlp_reduce
# k4
python3 -m research.workloads.bert.matmul.qkv_mlp_run k4 --kernel qkv
python3 -m research.workloads.bert.matmul.qkv_mlp_run k4 --kernel mlp_expand
python3 -m research.workloads.bert.matmul.qkv_mlp_run k4 --kernel mlp_reduce
# k8
python3 -m research.workloads.bert.matmul.qkv_mlp_run k8 --kernel qkv
python3 -m research.workloads.bert.matmul.qkv_mlp_run k8 --kernel mlp_expand
python3 -m research.workloads.bert.matmul.qkv_mlp_run k8 --kernel mlp_reduce
# k16
python3 -m research.workloads.bert.matmul.qkv_mlp_run k16 --kernel qkv
python3 -m research.workloads.bert.matmul.qkv_mlp_run k16 --kernel mlp_expand
python3 -m research.workloads.bert.matmul.qkv_mlp_run k16 --kernel mlp_reduce
# k32
python3 -m research.workloads.bert.matmul.qkv_mlp_run k32 --kernel qkv
python3 -m research.workloads.bert.matmul.qkv_mlp_run k32 --kernel mlp_expand
python3 -m research.workloads.bert.matmul.qkv_mlp_run k32 --kernel mlp_reduce
# k64
python3 -m research.workloads.bert.matmul.qkv_mlp_run k64 --kernel qkv
python3 -m research.workloads.bert.matmul.qkv_mlp_run k64 --kernel mlp_expand
python3 -m research.workloads.bert.matmul.qkv_mlp_run k64 --kernel mlp_reduce
# parallel
python3 -m research.workloads.bert.matmul.qkv_mlp_run parallel --kernel qkv
python3 -m research.workloads.bert.matmul.qkv_mlp_run parallel --kernel mlp_expand
python3 -m research.workloads.bert.matmul.qkv_mlp_run parallel --kernel mlp_reduce
# vec_j
python3 -m research.workloads.bert.matmul.qkv_mlp_run vec_j --kernel qkv
python3 -m research.workloads.bert.matmul.qkv_mlp_run vec_j --kernel mlp_expand
python3 -m research.workloads.bert.matmul.qkv_mlp_run vec_j --kernel mlp_reduce
# parallel_k16
python3 -m research.workloads.bert.matmul.qkv_mlp_run parallel_k16 --kernel qkv
python3 -m research.workloads.bert.matmul.qkv_mlp_run parallel_k16 --kernel mlp_expand
python3 -m research.workloads.bert.matmul.qkv_mlp_run parallel_k16 --kernel mlp_reduce
# parallel_vec_j
python3 -m research.workloads.bert.matmul.qkv_mlp_run parallel_vec_j --kernel qkv
python3 -m research.workloads.bert.matmul.qkv_mlp_run parallel_vec_j --kernel mlp_expand
python3 -m research.workloads.bert.matmul.qkv_mlp_run parallel_vec_j --kernel mlp_reduce
# vec_j_k16
python3 -m research.workloads.bert.matmul.qkv_mlp_run vec_j_k16 --kernel qkv
python3 -m research.workloads.bert.matmul.qkv_mlp_run vec_j_k16 --kernel mlp_expand
python3 -m research.workloads.bert.matmul.qkv_mlp_run vec_j_k16 --kernel mlp_reduce
# full
python3 -m research.workloads.bert.matmul.qkv_mlp_run full --kernel qkv
python3 -m research.workloads.bert.matmul.qkv_mlp_run full --kernel mlp_expand
python3 -m research.workloads.bert.matmul.qkv_mlp_run full --kernel mlp_reduce
To reproduce the TK ablation analysis used in this report (produces
research/results/tk_analysis_results.json and prints the concluding
summary), run:
python3 -m research.analysis.analysis_tk
The script prefers the tabulate package for prettier tables; install it with
pip install tabulate if desired. The script falls back to plain ASCII tables
when tabulate is not available.
# general syntax
python3 -m research.workloads.bert.matmul.qkv_mlp_run \
<variant|--all-variants|--all> [--kernel <kernel>|--all-kernels] [--iterations <n>]
# all variants across all kernels (alias: --all)
python3 -m research.workloads.bert.matmul.qkv_mlp_run --all-variants
python3 -m research.workloads.bert.matmul.qkv_mlp_run --all
# all kernels for one selected variant
python3 -m research.workloads.bert.matmul.qkv_mlp_run baseline --all-kernels
python3 -m research.workloads.bert.matmul.qkv_mlp_run k8 --all-kernels
# repeat runs N times for stability / averaging studies
python3 -m research.workloads.bert.matmul.qkv_mlp_run baseline --kernel qkv --iterations 3
python3 -m research.workloads.bert.matmul.qkv_mlp_run full --all-kernels --iterations 5
python3 -m research.workloads.bert.matmul.qkv_mlp_run --all-variants --iterations 2
# general syntax
python3 -m research.workloads.bert.metaschedule.metaschedule_tune [--all] [--kernel <kernel>] [--iterations <n>]
python3 -m research.workloads.bert.metaschedule.metaschedule_tune --all --iterations 3
python3 -m research.workloads.bert.metaschedule.metaschedule_tune --kernel qkv
python3 -m research.workloads.bert.metaschedule.metaschedule_tune --kernel mlp_expand
python3 -m research.workloads.bert.metaschedule.metaschedule_tune --kernel mlp_reduce
Results are recorded directly from tuning logs into the unified results file:
research/results/bert_matmul_results.json
MetaSchedule best-trace snapshots are also written to:
research/results/metaschedule/best_schedules.json
Use the parser below to extract schedule transformations (and chosen values)
from best_schedules.json, print a tabular view, and overwrite a JSON summary.
# run from repository root (recommended)
python3 research/analysis/parse_best_schedule_transformations.py --verbose
# explicit paths (same defaults, shown for clarity)
python3 research/analysis/parse_best_schedule_transformations.py \
--input-json research/results/metaschedule/best_schedules.json \
--output-json research/results/metaschedule/best_schedule_transformations.json
# terminal-friendly compact table (default)
python3 research/analysis/parse_best_schedule_transformations.py --view compact --max-transform-cols 8
# long vertical view (one transformation per row)
python3 research/analysis/parse_best_schedule_transformations.py --view long
# full wide table + horizontal scroll via pager (requires 'less')
python3 research/analysis/parse_best_schedule_transformations.py --view wide --pager
# if your cwd is research/analysis
python3 parse_best_schedule_transformations.py --verbose
Output JSON (overwritten on each run):
research/results/metaschedule/best_schedule_transformations.json
Manual vs MetaSchedule performance comparison completed.
The rule-based schedule detects each operator's (M, K, N) shape and kernel type and automatically selects tiling, parallelism, vectorisation, and unrolling strategies tuned for multi-core CPUs. The heuristics are calibrated for 12-thread topologies, specifically verified on:
- Intel i5-1235U (Alder Lake native, 2 P-cores + 8 E-cores, AVX2)
- Intel i7-13700 (Raptor Lake VirtualBox VM, 12 isolated vCPUs, 16 GB RAM, AVX2)
# Run for each kernel (sweeps all M values in M_LIST automatically)
python3 -m research.workloads.bert.matmul.qkv_mlp_run rule_based --kernel qkv
python3 -m research.workloads.bert.matmul.qkv_mlp_run rule_based --kernel mlp_expand
python3 -m research.workloads.bert.matmul.qkv_mlp_run rule_based --kernel mlp_reduce
Tile-size decisions are printed during the run for transparency.
Results are appended to the same unified results file and appear as the
rule_based variant in print / plot outputs.
This phase uses a lightweight LightGBM pipeline to predict strong initial
schedule knobs (vector_width, unroll_factor, cache_write_used,
reduction_decompose_used) from historical best MetaSchedule traces.
The ml_guided variant remains defensive: if model artifacts are missing or
prediction fails, it falls back to rule_based.
python3 research/workloads/bert/ml_schedule_predictor/extract_training_data.py --verbose
Output:
research/results/ml_schedule_predictor/training_dataset.csv
python3 research/workloads/bert/ml_schedule_predictor/train_lightgbm_knob_models.py --verbose
Outputs:
- research/results/ml_schedule_predictor/models/vector_width_model.pkl
- research/results/ml_schedule_predictor/models/unroll_model.pkl
- research/results/ml_schedule_predictor/models/cache_write_model.pkl
- research/results/ml_schedule_predictor/models/decompose_model.pkl
# print all predictions and save JSON under research/results/ml_schedule_predictor/
python3 research/workloads/bert/ml_schedule_predictor/predict_knobs.py --all-shapes --verbose
# optional: override JSON output path
python3 research/workloads/bert/ml_schedule_predictor/predict_knobs.py --all-shapes --output-json research/results/ml_schedule_predictor/predicted_knobs_all_shapes.json
Default output:
research/results/ml_schedule_predictor/predicted_knobs_all_shapes.json
Note: the JSON file is overwritten on every run.
By default, --all-shapes also uploads the same snapshot to the data aggregator endpoint
/api/upload/best_schedule_predictions, using your resolved CPU profile so each CPU
lands in a separate profile-scoped table.
Useful flags:
- --no-upload to keep the run local-only
- --profile <name> to force a target profile
- --upload-url <url> to override DATA_AGGREGATOR_BEST_SCHEDULE_PREDICTIONS_URL
- --upload-timeout <seconds> to increase request timeout
# one kernel
python3 -m research.workloads.bert.matmul.qkv_mlp_run ml_guided --kernel qkv
python3 -m research.workloads.bert.matmul.qkv_mlp_run ml_guided --kernel mlp_expand
python3 -m research.workloads.bert.matmul.qkv_mlp_run ml_guided --kernel mlp_reduce
# all kernels in one run
python3 -m research.workloads.bert.matmul.qkv_mlp_run ml_guided --all-kernels
python3 research/workloads/bert/ml_schedule_predictor/extract_training_data.py
python3 research/workloads/bert/ml_schedule_predictor/train_lightgbm_knob_models.py
- ✔ Environment validated
- ✔ BERT MatMul shapes extracted
- ✔ Canonical kernels created
- ✔ Manual schedules benchmarked (9 variants × 3 kernels × 8 M values)
- ✔ MetaSchedule comparison completed (256 trials × 3 iterations per shape)
- ✔ Rule-based v1 schedule implemented & data-driven rules derived
- ✔ MetaSchedule trace analysis (structural transforms identified)
- ✔ Rule-based v2 refactored (cache_write + decompose_reduction + auto-unroll + TK=8)
- ✔ MetaSchedule-inspired 4× j-pack adopted (j_vec = 32)
- ✔ Write-back vectorization aligned to j_pack=32 (strict ABBA geomean 0.9605× vs previous write-back)
- ✔ Performance gap improved to ~1.05× of MetaSchedule (latest full re-run)
- ✔ Further enhancement investigation (TK=4, cache_read, TN=128 — documented)
- ✔ ML-guided LightGBM warm-start pipeline integrated (ml_guided variant + predictor scripts)
Next step: Phase 7 — Generalization to additional Transformer workloads