AVX2 8-wide ksuid_string_batch kernel (follow-up to #5)

## Status

This is the follow-up to the partial closure of [#5](https://github.com/semantic-reasoning/libksuid/issues/5) in PR #12. That PR landed:

- the `ksuid_string_batch` public API (`libksuid/ksuid.h`),
- the scalar reference (`libksuid/encode_batch.c::ksuid_string_batch_scalar`),
- the libsodium-style atomic-pointer dispatch trampoline (race-free, allocation-free),
- the contract-pinning test suite (`tests/test_string_batch.c`).

The AVX2 8-wide kernel itself is the slot the dispatcher resolves into when the host advertises AVX2 + OS-saves-YMM. **That kernel does not yet exist.** Until it does, every host resolves to the scalar path with one acquire-load + one indirect call of overhead per call.

This issue tracks the kernel's implementation as its own architect + critic + reviewer cycle. The dispatch infrastructure is already in place, so this is a self-contained "plug in the kernel" change — no public API churn, no rework of the trampoline, no re-litigation of #5's design decisions.

## Acceptance criteria

### Functional
1. `libksuid/encode_avx2.c` (NEW) defines `ksuid_string_batch_avx2(const ksuid_t *, char *, size_t)` matching the prototype in `libksuid/encode_batch.h`.
2. `meson.build` adds the file to `core_sources` under `host_cpu == 'x86_64'` AND the new `option('avx2_batch', type: 'feature', value: 'auto')` is `enabled` (default `auto`).
3. `meson.build` defines `KSUID_HAVE_AVX2_BATCH=1` so the dispatcher in `encode_batch.c` resolves to the AVX2 kernel when CPUID + XGETBV pass.
4. The kernel processes 8 KSUIDs per outer-loop iteration via SoA-internal lane layout (5 × `__m256i` limbs, each holding 8 lanes of u32).
5. The tail (n % 8) falls through to the scalar path inside the kernel.

### Performance
6. Measured throughput on an AVX2-capable x86_64 host is **>= 4x** the scalar path for `n >= 256`, with 4-6x being the design target.
7. The AVX2 TU contributes **<= 4 KB** to the stripped library's `.text` section (Critic risk register R8). Enforced post-build by a CI step that runs `size --format=sysv build/libksuid.so.* | awk` before and after, fails if delta > 4096 bytes. If exceeded, default `avx2_batch` to `disabled` and document the decision.

### Parity
8. `tests/test_string_batch.c` gains a differential test that explicitly invokes `ksuid_string_batch_scalar` AND `ksuid_string_batch_avx2` (when present at runtime) on the same input and asserts byte-for-byte equality across:
   - `n` ∈ {0, 1, 7, 8, 9, 16, 17, 64, 257, 1024} for tail-boundary coverage,
   - the pinned corners `KSUID_NIL`, `KSUID_MAX`, all-`0xff`, all-`0x00`,
   - >= 2^20 random KSUIDs to catch off-by-one in the divide-by-62 reciprocal-multiplication magic constant,
   - **eight distinct KSUIDs with non-trivial high limbs** to catch lane-swap bugs in the transpose step,
   - `ksuid_sequence_t`-generated runs (consecutive payloads at one timestamp).

## Critic risk register (carry-over from #5)

The reviewer + meta-reviewers in PR #12 explicitly carried these forward as the implementer's pre-commit checklist:

### R3 — AVX-SSE transition penalty
**Failure mode**: 100+ cycle stall on Sandy/Ivy Bridge whenever AVX2 code returns to surrounding SSE2 code (e.g. `base62_sse2.c`, `compare_sse2.c`).
**Mitigation**: AVX2 kernel lives in its OWN translation unit compiled with `-mavx2` only (mirror the existing `base62_sse2.c` / `compare_sse2.c` registration). End the kernel with `_mm256_zeroupper()` before returning. Verify with `objdump -d build/libksuid.so.* | grep -c "vex"` that no VEX encoding leaks into non-AVX2 TUs.

### R4 — Tail handling overrun
**Failure mode**: 8-wide loop reads `ids[n..n+7]` past caller's array; SEGV on guard page.
**Mitigation**: explicit `size_t bulk = n & ~(size_t)7; for (i=0; i<bulk; i+=8) avx2_kernel(...); for (; i<n; ++i) ksuid_format(...);`. Never use masked loads for the tail — use the scalar path. Add a fuzzer test for `n in 0..23`.

### R5 — Divide-by-62 reciprocal-multiplication magic constant
**Failure mode**: SIMD encode disagrees with scalar at one specific 5-limb input. AVX2 has no 64-bit divide; the SIMD long-division-by-62 must use `_mm256_mul_epu32` + `_mm256_srli_epi64` to compute `quotient = (value * M) >> N` for divisor 62.
**Mitigation**: derive `M ≈ 0x4ec4ec4ec4ec4ec5` from Hacker's Delight 10-9 OR `libdivide`. Document the derivation in a comment block at the top of `encode_avx2.c`. Pin parity against the scalar reference over **>= 2^20 random KSUIDs** plus the corner cases listed in acceptance criterion 8.

### R6 — Lane transpose / output-byte mapping
**Failure mode**: AVX2 encode produces correct base62 digits but writes them to wrong output offsets across the 8 KSUIDs. Scalar writes right-to-left `out[--n] = digit`; the SIMD version emits 8 digits per outer iteration *for 8 distinct KSUIDs*, requiring a transpose at store time.
**Mitigation**: structure the kernel as SoA-internal (5 limbs × 8 IDs in 5 `__m256i`), emit one `__m256i` of 8 digits per iteration, and store with strided per-lane writes (`out[i*27 + col]`). The 8-distinct-KSUID parity test (acceptance criterion 8) is the regression detector.

### R8 — Footprint budget
**Failure mode**: AVX2 TU bloats `libksuid.so` past the +4 KB cap silently — nobody notices for two releases.
**Mitigation**: meson postbuild step described in acceptance criterion 7. CI gate must fail the build if the AVX2 TU's contribution exceeds 4 KB. Local builds print a summary.

### R11 — `KSUID_FORCE_SCALAR` env var
**Failure mode**: env-var read on every `ksuid_string_batch` call → `getenv` is not lock-free under glibc.
**Mitigation**: read `KSUID_FORCE_SCALAR` once inside the trampoline before resolving the pointer. If set (any non-empty value), the resolved pointer is `&ksuid_string_batch_scalar`. Document as "captured at first call, not reread." Useful for benchmarking + debugging without rebuilding.

## Implementation plan (architect carry-over from #5)

Single PR with a 3-commit split:

| Commit | Purpose |
|---|---|
| 1 | `encode: AVX2 8-wide bulk-encode kernel + first-call atomic dispatch` — new `libksuid/encode_avx2.c`, meson `avx2_batch` option, `__attribute__((target("avx2")))` annotation OR per-TU `c_args : ['-mavx2']`. The dispatch resolver in `encode_batch.c` swaps `&ksuid_string_batch_scalar` for `&ksuid_string_batch_avx2` when CPUID + XGETBV pass. |
| 2 | `tests: SIMD parity matrix covering bulk-encode tail boundaries + degenerate inputs` — extends `tests/test_string_batch.c` with a `KSUID_TESTING`-gated `ksuid_string_batch_scalar` / `_avx2` direct extern, drives both implementations on the inputs in acceptance criterion 8. |
| 3 | `ci: footprint gate for AVX2 bulk encode + KSUID_FORCE_SCALAR env override` — meson postbuild check + R11 env var. |

`KSUID_FORCE_SCALAR` could land in commit 1 alongside the dispatcher swap; the split keeps each commit independently reviewable.

## Cross-platform considerations

- **Linux GCC + Clang on x86_64**: primary target, full CI coverage.
- **macOS Clang on x86_64**: AVX2 paths are present in older Intel Macs; Apple Silicon runners do not exercise this code (host_cpu == aarch64 → `KSUID_HAVE_AVX2_BATCH` undefined → scalar path).
- **Windows MSVC**: `__attribute__((target("avx2")))` is not available; use `/arch:AVX2` only on the AVX2 TU via per-target `c_args` in meson. The CPUID + XGETBV path in `encode_batch.c` already has the MSVC branch (`__cpuidex` + `_xgetbv`) from PR #12.
- **Hosts without AVX2** (older Atom, Steam Deck running x86_64 baseline, etc.): the dispatch resolver simply keeps `&ksuid_string_batch_scalar` and the AVX2 TU's symbols stay unreferenced → linker discards via `--gc-sections` if PIC. Verify the AVX2 TU is referenced ONLY through the function-pointer assignment, never through a direct call from another TU, so non-AVX2 hosts incur zero runtime cost.

## Verification

The dispatch scaffolding from PR #12 already satisfies R1 (race-free dispatch), R2 (`__builtin_cpu_init` + `_xgetbv`), R7 (n/a — compare-side), R9 (NOLINT pattern available), R10 (parity test infrastructure exists), R12 (n==0 early-out). This issue inherits those guarantees without re-litigation.

The `wipe-fallback` CI lane (PR #10) and the auto-build disasm gate (PR #10 + PR #11) must continue to pass; they are orthogonal to the AVX2 path.

## Out of scope

- AVX-512 / VBMI bulk encode (separate follow-up if ever justified).
- AVX2 bulk **parse** (the issue body originally noted "same API for parse if a benchmark justifies it"; benchmark first, then file).
- Compare path AVX2 acceleration (compare is fixed at 20 bytes; SSE2 already handles the head, AVX2 buys nothing).

## Labels

`enhancement`, `simd`, `x86_64`, `performance`, `follow-up`


Commit	Purpose
1	`encode: AVX2 8-wide bulk-encode kernel + first-call atomic dispatch` — new `libksuid/encode_avx2.c`, meson `avx2_batch` option, `__attribute__((target("avx2")))` annotation OR per-TU `c_args : ['-mavx2']`. The dispatch resolver in `encode_batch.c` swaps `&ksuid_string_batch_scalar` for `&ksuid_string_batch_avx2` when CPUID + XGETBV pass.
2	`tests: SIMD parity matrix covering bulk-encode tail boundaries + degenerate inputs` — extends `tests/test_string_batch.c` with a `KSUID_TESTING`-gated `ksuid_string_batch_scalar` / `_avx2` direct extern, drives both implementations on the inputs in acceptance criterion 8.
3	`ci: footprint gate for AVX2 bulk encode + KSUID_FORCE_SCALAR env override` — meson postbuild check + R11 env var.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

AVX2 8-wide ksuid_string_batch kernel (follow-up to #5) #13

Status

Acceptance criteria

Functional

Performance

Parity

Critic risk register (carry-over from #5)

R3 — AVX-SSE transition penalty

R4 — Tail handling overrun

R5 — Divide-by-62 reciprocal-multiplication magic constant

R6 — Lane transpose / output-byte mapping

R8 — Footprint budget

R11 — `KSUID_FORCE_SCALAR` env var

Implementation plan (architect carry-over from #5)

Cross-platform considerations

Verification

Out of scope

Labels

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

AVX2 8-wide ksuid_string_batch kernel (follow-up to #5) #13

Description

Status

Acceptance criteria

Functional

Performance

Parity

Critic risk register (carry-over from #5)

R3 — AVX-SSE transition penalty

R4 — Tail handling overrun

R5 — Divide-by-62 reciprocal-multiplication magic constant

R6 — Lane transpose / output-byte mapping

R8 — Footprint budget

R11 — KSUID_FORCE_SCALAR env var

Implementation plan (architect carry-over from #5)

Cross-platform considerations

Verification

Out of scope

Labels

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

R11 — `KSUID_FORCE_SCALAR` env var