Skip to content
This repository was archived by the owner on May 1, 2026. It is now read-only.
This repository was archived by the owner on May 1, 2026. It is now read-only.

AVX2 8-wide ksuid_string_batch kernel (follow-up to #5) #13

Description

@justinjoy

Status

This is the follow-up to the partial closure of #5 in PR #12. That PR landed:

  • the ksuid_string_batch public API (libksuid/ksuid.h),
  • the scalar reference (libksuid/encode_batch.c::ksuid_string_batch_scalar),
  • the libsodium-style atomic-pointer dispatch trampoline (race-free, allocation-free),
  • the contract-pinning test suite (tests/test_string_batch.c).

The AVX2 8-wide kernel itself is the slot the dispatcher resolves into when the host advertises AVX2 + OS-saves-YMM. That kernel does not yet exist. Until it does, every host resolves to the scalar path with one acquire-load + one indirect call of overhead per call.

This issue tracks the kernel's implementation as its own architect + critic + reviewer cycle. The dispatch infrastructure is already in place, so this is a self-contained "plug in the kernel" change — no public API churn, no rework of the trampoline, no re-litigation of #5's design decisions.

Acceptance criteria

Functional

  1. libksuid/encode_avx2.c (NEW) defines ksuid_string_batch_avx2(const ksuid_t *, char *, size_t) matching the prototype in libksuid/encode_batch.h.
  2. meson.build adds the file to core_sources under host_cpu == 'x86_64' AND the new option('avx2_batch', type: 'feature', value: 'auto') is enabled (default auto).
  3. meson.build defines KSUID_HAVE_AVX2_BATCH=1 so the dispatcher in encode_batch.c resolves to the AVX2 kernel when CPUID + XGETBV pass.
  4. The kernel processes 8 KSUIDs per outer-loop iteration via SoA-internal lane layout (5 × __m256i limbs, each holding 8 lanes of u32).
  5. The tail (n % 8) falls through to the scalar path inside the kernel.

Performance

  1. Measured throughput on an AVX2-capable x86_64 host is >= 4x the scalar path for n >= 256, with 4-6x being the design target.
  2. The AVX2 TU contributes <= 4 KB to the stripped library's .text section (Critic risk register R8). Enforced post-build by a CI step that runs size --format=sysv build/libksuid.so.* | awk before and after, fails if delta > 4096 bytes. If exceeded, default avx2_batch to disabled and document the decision.

Parity

  1. tests/test_string_batch.c gains a differential test that explicitly invokes ksuid_string_batch_scalar AND ksuid_string_batch_avx2 (when present at runtime) on the same input and asserts byte-for-byte equality across:
    • n ∈ {0, 1, 7, 8, 9, 16, 17, 64, 257, 1024} for tail-boundary coverage,
    • the pinned corners KSUID_NIL, KSUID_MAX, all-0xff, all-0x00,
    • = 2^20 random KSUIDs to catch off-by-one in the divide-by-62 reciprocal-multiplication magic constant,

    • eight distinct KSUIDs with non-trivial high limbs to catch lane-swap bugs in the transpose step,
    • ksuid_sequence_t-generated runs (consecutive payloads at one timestamp).

Critic risk register (carry-over from #5)

The reviewer + meta-reviewers in PR #12 explicitly carried these forward as the implementer's pre-commit checklist:

R3 — AVX-SSE transition penalty

Failure mode: 100+ cycle stall on Sandy/Ivy Bridge whenever AVX2 code returns to surrounding SSE2 code (e.g. base62_sse2.c, compare_sse2.c).
Mitigation: AVX2 kernel lives in its OWN translation unit compiled with -mavx2 only (mirror the existing base62_sse2.c / compare_sse2.c registration). End the kernel with _mm256_zeroupper() before returning. Verify with objdump -d build/libksuid.so.* | grep -c "vex" that no VEX encoding leaks into non-AVX2 TUs.

R4 — Tail handling overrun

Failure mode: 8-wide loop reads ids[n..n+7] past caller's array; SEGV on guard page.
Mitigation: explicit size_t bulk = n & ~(size_t)7; for (i=0; i<bulk; i+=8) avx2_kernel(...); for (; i<n; ++i) ksuid_format(...);. Never use masked loads for the tail — use the scalar path. Add a fuzzer test for n in 0..23.

R5 — Divide-by-62 reciprocal-multiplication magic constant

Failure mode: SIMD encode disagrees with scalar at one specific 5-limb input. AVX2 has no 64-bit divide; the SIMD long-division-by-62 must use _mm256_mul_epu32 + _mm256_srli_epi64 to compute quotient = (value * M) >> N for divisor 62.
Mitigation: derive M ≈ 0x4ec4ec4ec4ec4ec5 from Hacker's Delight 10-9 OR libdivide. Document the derivation in a comment block at the top of encode_avx2.c. Pin parity against the scalar reference over >= 2^20 random KSUIDs plus the corner cases listed in acceptance criterion 8.

R6 — Lane transpose / output-byte mapping

Failure mode: AVX2 encode produces correct base62 digits but writes them to wrong output offsets across the 8 KSUIDs. Scalar writes right-to-left out[--n] = digit; the SIMD version emits 8 digits per outer iteration for 8 distinct KSUIDs, requiring a transpose at store time.
Mitigation: structure the kernel as SoA-internal (5 limbs × 8 IDs in 5 __m256i), emit one __m256i of 8 digits per iteration, and store with strided per-lane writes (out[i*27 + col]). The 8-distinct-KSUID parity test (acceptance criterion 8) is the regression detector.

R8 — Footprint budget

Failure mode: AVX2 TU bloats libksuid.so past the +4 KB cap silently — nobody notices for two releases.
Mitigation: meson postbuild step described in acceptance criterion 7. CI gate must fail the build if the AVX2 TU's contribution exceeds 4 KB. Local builds print a summary.

R11 — KSUID_FORCE_SCALAR env var

Failure mode: env-var read on every ksuid_string_batch call → getenv is not lock-free under glibc.
Mitigation: read KSUID_FORCE_SCALAR once inside the trampoline before resolving the pointer. If set (any non-empty value), the resolved pointer is &ksuid_string_batch_scalar. Document as "captured at first call, not reread." Useful for benchmarking + debugging without rebuilding.

Implementation plan (architect carry-over from #5)

Single PR with a 3-commit split:

Commit Purpose
1 encode: AVX2 8-wide bulk-encode kernel + first-call atomic dispatch — new libksuid/encode_avx2.c, meson avx2_batch option, __attribute__((target("avx2"))) annotation OR per-TU c_args : ['-mavx2']. The dispatch resolver in encode_batch.c swaps &ksuid_string_batch_scalar for &ksuid_string_batch_avx2 when CPUID + XGETBV pass.
2 tests: SIMD parity matrix covering bulk-encode tail boundaries + degenerate inputs — extends tests/test_string_batch.c with a KSUID_TESTING-gated ksuid_string_batch_scalar / _avx2 direct extern, drives both implementations on the inputs in acceptance criterion 8.
3 ci: footprint gate for AVX2 bulk encode + KSUID_FORCE_SCALAR env override — meson postbuild check + R11 env var.

KSUID_FORCE_SCALAR could land in commit 1 alongside the dispatcher swap; the split keeps each commit independently reviewable.

Cross-platform considerations

  • Linux GCC + Clang on x86_64: primary target, full CI coverage.
  • macOS Clang on x86_64: AVX2 paths are present in older Intel Macs; Apple Silicon runners do not exercise this code (host_cpu == aarch64 → KSUID_HAVE_AVX2_BATCH undefined → scalar path).
  • Windows MSVC: __attribute__((target("avx2"))) is not available; use /arch:AVX2 only on the AVX2 TU via per-target c_args in meson. The CPUID + XGETBV path in encode_batch.c already has the MSVC branch (__cpuidex + _xgetbv) from PR compare: SSE2/NEON 20-byte compare + bulk-encode public API (AVX2 kernel deferred) #12.
  • Hosts without AVX2 (older Atom, Steam Deck running x86_64 baseline, etc.): the dispatch resolver simply keeps &ksuid_string_batch_scalar and the AVX2 TU's symbols stay unreferenced → linker discards via --gc-sections if PIC. Verify the AVX2 TU is referenced ONLY through the function-pointer assignment, never through a direct call from another TU, so non-AVX2 hosts incur zero runtime cost.

Verification

The dispatch scaffolding from PR #12 already satisfies R1 (race-free dispatch), R2 (__builtin_cpu_init + _xgetbv), R7 (n/a — compare-side), R9 (NOLINT pattern available), R10 (parity test infrastructure exists), R12 (n==0 early-out). This issue inherits those guarantees without re-litigation.

The wipe-fallback CI lane (PR #10) and the auto-build disasm gate (PR #10 + PR #11) must continue to pass; they are orthogonal to the AVX2 path.

Out of scope

  • AVX-512 / VBMI bulk encode (separate follow-up if ever justified).
  • AVX2 bulk parse (the issue body originally noted "same API for parse if a benchmark justifies it"; benchmark first, then file).
  • Compare path AVX2 acceleration (compare is fixed at 20 bytes; SSE2 already handles the head, AVX2 buys nothing).

Labels

enhancement, simd, x86_64, performance, follow-up

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions