You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This is the follow-up to the partial closure of #5 in PR #12. That PR landed:
the ksuid_string_batch public API (libksuid/ksuid.h),
the scalar reference (libksuid/encode_batch.c::ksuid_string_batch_scalar),
the libsodium-style atomic-pointer dispatch trampoline (race-free, allocation-free),
the contract-pinning test suite (tests/test_string_batch.c).
The AVX2 8-wide kernel itself is the slot the dispatcher resolves into when the host advertises AVX2 + OS-saves-YMM. That kernel does not yet exist. Until it does, every host resolves to the scalar path with one acquire-load + one indirect call of overhead per call.
This issue tracks the kernel's implementation as its own architect + critic + reviewer cycle. The dispatch infrastructure is already in place, so this is a self-contained "plug in the kernel" change — no public API churn, no rework of the trampoline, no re-litigation of #5's design decisions.
Acceptance criteria
Functional
libksuid/encode_avx2.c (NEW) defines ksuid_string_batch_avx2(const ksuid_t *, char *, size_t) matching the prototype in libksuid/encode_batch.h.
meson.build adds the file to core_sources under host_cpu == 'x86_64' AND the new option('avx2_batch', type: 'feature', value: 'auto') is enabled (default auto).
meson.build defines KSUID_HAVE_AVX2_BATCH=1 so the dispatcher in encode_batch.c resolves to the AVX2 kernel when CPUID + XGETBV pass.
The kernel processes 8 KSUIDs per outer-loop iteration via SoA-internal lane layout (5 × __m256i limbs, each holding 8 lanes of u32).
The tail (n % 8) falls through to the scalar path inside the kernel.
Performance
Measured throughput on an AVX2-capable x86_64 host is >= 4x the scalar path for n >= 256, with 4-6x being the design target.
The AVX2 TU contributes <= 4 KB to the stripped library's .text section (Critic risk register R8). Enforced post-build by a CI step that runs size --format=sysv build/libksuid.so.* | awk before and after, fails if delta > 4096 bytes. If exceeded, default avx2_batch to disabled and document the decision.
Parity
tests/test_string_batch.c gains a differential test that explicitly invokes ksuid_string_batch_scalar AND ksuid_string_batch_avx2 (when present at runtime) on the same input and asserts byte-for-byte equality across:
n ∈ {0, 1, 7, 8, 9, 16, 17, 64, 257, 1024} for tail-boundary coverage,
the pinned corners KSUID_NIL, KSUID_MAX, all-0xff, all-0x00,
= 2^20 random KSUIDs to catch off-by-one in the divide-by-62 reciprocal-multiplication magic constant,
eight distinct KSUIDs with non-trivial high limbs to catch lane-swap bugs in the transpose step,
ksuid_sequence_t-generated runs (consecutive payloads at one timestamp).
The reviewer + meta-reviewers in PR #12 explicitly carried these forward as the implementer's pre-commit checklist:
R3 — AVX-SSE transition penalty
Failure mode: 100+ cycle stall on Sandy/Ivy Bridge whenever AVX2 code returns to surrounding SSE2 code (e.g. base62_sse2.c, compare_sse2.c). Mitigation: AVX2 kernel lives in its OWN translation unit compiled with -mavx2 only (mirror the existing base62_sse2.c / compare_sse2.c registration). End the kernel with _mm256_zeroupper() before returning. Verify with objdump -d build/libksuid.so.* | grep -c "vex" that no VEX encoding leaks into non-AVX2 TUs.
R4 — Tail handling overrun
Failure mode: 8-wide loop reads ids[n..n+7] past caller's array; SEGV on guard page. Mitigation: explicit size_t bulk = n & ~(size_t)7; for (i=0; i<bulk; i+=8) avx2_kernel(...); for (; i<n; ++i) ksuid_format(...);. Never use masked loads for the tail — use the scalar path. Add a fuzzer test for n in 0..23.
Failure mode: SIMD encode disagrees with scalar at one specific 5-limb input. AVX2 has no 64-bit divide; the SIMD long-division-by-62 must use _mm256_mul_epu32 + _mm256_srli_epi64 to compute quotient = (value * M) >> N for divisor 62. Mitigation: derive M ≈ 0x4ec4ec4ec4ec4ec5 from Hacker's Delight 10-9 OR libdivide. Document the derivation in a comment block at the top of encode_avx2.c. Pin parity against the scalar reference over >= 2^20 random KSUIDs plus the corner cases listed in acceptance criterion 8.
R6 — Lane transpose / output-byte mapping
Failure mode: AVX2 encode produces correct base62 digits but writes them to wrong output offsets across the 8 KSUIDs. Scalar writes right-to-left out[--n] = digit; the SIMD version emits 8 digits per outer iteration for 8 distinct KSUIDs, requiring a transpose at store time. Mitigation: structure the kernel as SoA-internal (5 limbs × 8 IDs in 5 __m256i), emit one __m256i of 8 digits per iteration, and store with strided per-lane writes (out[i*27 + col]). The 8-distinct-KSUID parity test (acceptance criterion 8) is the regression detector.
R8 — Footprint budget
Failure mode: AVX2 TU bloats libksuid.so past the +4 KB cap silently — nobody notices for two releases. Mitigation: meson postbuild step described in acceptance criterion 7. CI gate must fail the build if the AVX2 TU's contribution exceeds 4 KB. Local builds print a summary.
R11 — KSUID_FORCE_SCALAR env var
Failure mode: env-var read on every ksuid_string_batch call → getenv is not lock-free under glibc. Mitigation: read KSUID_FORCE_SCALAR once inside the trampoline before resolving the pointer. If set (any non-empty value), the resolved pointer is &ksuid_string_batch_scalar. Document as "captured at first call, not reread." Useful for benchmarking + debugging without rebuilding.
Implementation plan (architect carry-over from #5)
Single PR with a 3-commit split:
Commit
Purpose
1
encode: AVX2 8-wide bulk-encode kernel + first-call atomic dispatch — new libksuid/encode_avx2.c, meson avx2_batch option, __attribute__((target("avx2"))) annotation OR per-TU c_args : ['-mavx2']. The dispatch resolver in encode_batch.c swaps &ksuid_string_batch_scalar for &ksuid_string_batch_avx2 when CPUID + XGETBV pass.
2
tests: SIMD parity matrix covering bulk-encode tail boundaries + degenerate inputs — extends tests/test_string_batch.c with a KSUID_TESTING-gated ksuid_string_batch_scalar / _avx2 direct extern, drives both implementations on the inputs in acceptance criterion 8.
3
ci: footprint gate for AVX2 bulk encode + KSUID_FORCE_SCALAR env override — meson postbuild check + R11 env var.
KSUID_FORCE_SCALAR could land in commit 1 alongside the dispatcher swap; the split keeps each commit independently reviewable.
Cross-platform considerations
Linux GCC + Clang on x86_64: primary target, full CI coverage.
macOS Clang on x86_64: AVX2 paths are present in older Intel Macs; Apple Silicon runners do not exercise this code (host_cpu == aarch64 → KSUID_HAVE_AVX2_BATCH undefined → scalar path).
Hosts without AVX2 (older Atom, Steam Deck running x86_64 baseline, etc.): the dispatch resolver simply keeps &ksuid_string_batch_scalar and the AVX2 TU's symbols stay unreferenced → linker discards via --gc-sections if PIC. Verify the AVX2 TU is referenced ONLY through the function-pointer assignment, never through a direct call from another TU, so non-AVX2 hosts incur zero runtime cost.
Verification
The dispatch scaffolding from PR #12 already satisfies R1 (race-free dispatch), R2 (__builtin_cpu_init + _xgetbv), R7 (n/a — compare-side), R9 (NOLINT pattern available), R10 (parity test infrastructure exists), R12 (n==0 early-out). This issue inherits those guarantees without re-litigation.
The wipe-fallback CI lane (PR #10) and the auto-build disasm gate (PR #10 + PR #11) must continue to pass; they are orthogonal to the AVX2 path.
Out of scope
AVX-512 / VBMI bulk encode (separate follow-up if ever justified).
AVX2 bulk parse (the issue body originally noted "same API for parse if a benchmark justifies it"; benchmark first, then file).
Compare path AVX2 acceleration (compare is fixed at 20 bytes; SSE2 already handles the head, AVX2 buys nothing).
Status
This is the follow-up to the partial closure of #5 in PR #12. That PR landed:
ksuid_string_batchpublic API (libksuid/ksuid.h),libksuid/encode_batch.c::ksuid_string_batch_scalar),tests/test_string_batch.c).The AVX2 8-wide kernel itself is the slot the dispatcher resolves into when the host advertises AVX2 + OS-saves-YMM. That kernel does not yet exist. Until it does, every host resolves to the scalar path with one acquire-load + one indirect call of overhead per call.
This issue tracks the kernel's implementation as its own architect + critic + reviewer cycle. The dispatch infrastructure is already in place, so this is a self-contained "plug in the kernel" change — no public API churn, no rework of the trampoline, no re-litigation of #5's design decisions.
Acceptance criteria
Functional
libksuid/encode_avx2.c(NEW) definesksuid_string_batch_avx2(const ksuid_t *, char *, size_t)matching the prototype inlibksuid/encode_batch.h.meson.buildadds the file tocore_sourcesunderhost_cpu == 'x86_64'AND the newoption('avx2_batch', type: 'feature', value: 'auto')isenabled(defaultauto).meson.builddefinesKSUID_HAVE_AVX2_BATCH=1so the dispatcher inencode_batch.cresolves to the AVX2 kernel when CPUID + XGETBV pass.__m256ilimbs, each holding 8 lanes of u32).Performance
n >= 256, with 4-6x being the design target..textsection (Critic risk register R8). Enforced post-build by a CI step that runssize --format=sysv build/libksuid.so.* | awkbefore and after, fails if delta > 4096 bytes. If exceeded, defaultavx2_batchtodisabledand document the decision.Parity
tests/test_string_batch.cgains a differential test that explicitly invokesksuid_string_batch_scalarANDksuid_string_batch_avx2(when present at runtime) on the same input and asserts byte-for-byte equality across:n∈ {0, 1, 7, 8, 9, 16, 17, 64, 257, 1024} for tail-boundary coverage,KSUID_NIL,KSUID_MAX, all-0xff, all-0x00,ksuid_sequence_t-generated runs (consecutive payloads at one timestamp).Critic risk register (carry-over from #5)
The reviewer + meta-reviewers in PR #12 explicitly carried these forward as the implementer's pre-commit checklist:
R3 — AVX-SSE transition penalty
Failure mode: 100+ cycle stall on Sandy/Ivy Bridge whenever AVX2 code returns to surrounding SSE2 code (e.g.
base62_sse2.c,compare_sse2.c).Mitigation: AVX2 kernel lives in its OWN translation unit compiled with
-mavx2only (mirror the existingbase62_sse2.c/compare_sse2.cregistration). End the kernel with_mm256_zeroupper()before returning. Verify withobjdump -d build/libksuid.so.* | grep -c "vex"that no VEX encoding leaks into non-AVX2 TUs.R4 — Tail handling overrun
Failure mode: 8-wide loop reads
ids[n..n+7]past caller's array; SEGV on guard page.Mitigation: explicit
size_t bulk = n & ~(size_t)7; for (i=0; i<bulk; i+=8) avx2_kernel(...); for (; i<n; ++i) ksuid_format(...);. Never use masked loads for the tail — use the scalar path. Add a fuzzer test forn in 0..23.R5 — Divide-by-62 reciprocal-multiplication magic constant
Failure mode: SIMD encode disagrees with scalar at one specific 5-limb input. AVX2 has no 64-bit divide; the SIMD long-division-by-62 must use
_mm256_mul_epu32+_mm256_srli_epi64to computequotient = (value * M) >> Nfor divisor 62.Mitigation: derive
M ≈ 0x4ec4ec4ec4ec4ec5from Hacker's Delight 10-9 ORlibdivide. Document the derivation in a comment block at the top ofencode_avx2.c. Pin parity against the scalar reference over >= 2^20 random KSUIDs plus the corner cases listed in acceptance criterion 8.R6 — Lane transpose / output-byte mapping
Failure mode: AVX2 encode produces correct base62 digits but writes them to wrong output offsets across the 8 KSUIDs. Scalar writes right-to-left
out[--n] = digit; the SIMD version emits 8 digits per outer iteration for 8 distinct KSUIDs, requiring a transpose at store time.Mitigation: structure the kernel as SoA-internal (5 limbs × 8 IDs in 5
__m256i), emit one__m256iof 8 digits per iteration, and store with strided per-lane writes (out[i*27 + col]). The 8-distinct-KSUID parity test (acceptance criterion 8) is the regression detector.R8 — Footprint budget
Failure mode: AVX2 TU bloats
libksuid.sopast the +4 KB cap silently — nobody notices for two releases.Mitigation: meson postbuild step described in acceptance criterion 7. CI gate must fail the build if the AVX2 TU's contribution exceeds 4 KB. Local builds print a summary.
R11 —
KSUID_FORCE_SCALARenv varFailure mode: env-var read on every
ksuid_string_batchcall →getenvis not lock-free under glibc.Mitigation: read
KSUID_FORCE_SCALARonce inside the trampoline before resolving the pointer. If set (any non-empty value), the resolved pointer is&ksuid_string_batch_scalar. Document as "captured at first call, not reread." Useful for benchmarking + debugging without rebuilding.Implementation plan (architect carry-over from #5)
Single PR with a 3-commit split:
encode: AVX2 8-wide bulk-encode kernel + first-call atomic dispatch— newlibksuid/encode_avx2.c, mesonavx2_batchoption,__attribute__((target("avx2")))annotation OR per-TUc_args : ['-mavx2']. The dispatch resolver inencode_batch.cswaps&ksuid_string_batch_scalarfor&ksuid_string_batch_avx2when CPUID + XGETBV pass.tests: SIMD parity matrix covering bulk-encode tail boundaries + degenerate inputs— extendstests/test_string_batch.cwith aKSUID_TESTING-gatedksuid_string_batch_scalar/_avx2direct extern, drives both implementations on the inputs in acceptance criterion 8.ci: footprint gate for AVX2 bulk encode + KSUID_FORCE_SCALAR env override— meson postbuild check + R11 env var.KSUID_FORCE_SCALARcould land in commit 1 alongside the dispatcher swap; the split keeps each commit independently reviewable.Cross-platform considerations
KSUID_HAVE_AVX2_BATCHundefined → scalar path).__attribute__((target("avx2")))is not available; use/arch:AVX2only on the AVX2 TU via per-targetc_argsin meson. The CPUID + XGETBV path inencode_batch.calready has the MSVC branch (__cpuidex+_xgetbv) from PR compare: SSE2/NEON 20-byte compare + bulk-encode public API (AVX2 kernel deferred) #12.&ksuid_string_batch_scalarand the AVX2 TU's symbols stay unreferenced → linker discards via--gc-sectionsif PIC. Verify the AVX2 TU is referenced ONLY through the function-pointer assignment, never through a direct call from another TU, so non-AVX2 hosts incur zero runtime cost.Verification
The dispatch scaffolding from PR #12 already satisfies R1 (race-free dispatch), R2 (
__builtin_cpu_init+_xgetbv), R7 (n/a — compare-side), R9 (NOLINT pattern available), R10 (parity test infrastructure exists), R12 (n==0 early-out). This issue inherits those guarantees without re-litigation.The
wipe-fallbackCI lane (PR #10) and the auto-build disasm gate (PR #10 + PR #11) must continue to pass; they are orthogonal to the AVX2 path.Out of scope
Labels
enhancement,simd,x86_64,performance,follow-up