semantic-reasoning · justinjoy · Apr 30, 2026 · Apr 30, 2026 · Apr 30, 2026 · Apr 30, 2026
diff --git a/README.md b/README.md
@@ -127,17 +127,25 @@ only the throughput differs.
 
 The implementation dispatches at first call to a kernel selected
 from CPU features via an atomic function pointer (libsodium-style
-trampoline; race-free without locks or allocation). Today that
-kernel is the per-ID scalar path on every supported host. The AVX2
-8-wide kernel that the issue tracker proposed
-([#5](https://github.com/semantic-reasoning/libksuid/issues/5)) is
-**not yet shipped** -- the implementation requires SIMD long-
-division-by-62 with reciprocal-multiplication magic constants
-verified against a parity corpus, and the dispatch infrastructure
-landed in this PR is the foundation that will be added in a
-follow-up. Until then, `ksuid_string_batch` is functionally
-equivalent to a `ksuid_format` loop with one acquire-load + one
-indirect call of overhead per call.
+trampoline; race-free without locks or allocation):
+
+- **x86_64 + AVX2**: an 8-wide AVX2 kernel that processes eight
+  KSUIDs per outer iteration via a Granlund-Möller floor-reciprocal
+  multiply divide-by-62 ([#13](https://github.com/semantic-reasoning/libksuid/issues/13)).
+  The magic constant is auto-generated by `tools/derive-magic.py`,
+  pinned in `libksuid/divisor_magic.h`, and verified against
+  `__uint128_t` integer division on every CI run.
+- **Other hosts** (non-AVX2 x86_64, aarch64, arm, ...): a per-ID
+  scalar loop equivalent to calling `ksuid_format` N times.
+
+Output is byte-identical across kernels; the differential parity
+test in `tests/test_string_batch.c` cross-checks the AVX2 kernel
+against the scalar reference over ≥ 2²⁰ pseudo-random KSUIDs and a
+lane-swap detection corpus.
+
+Setting the environment variable `KSUID_FORCE_SCALAR=1` pins the
+dispatcher to the scalar path at first call (runtime kill switch
+without rebuilding the library).
 
 ## Layout
 

diff --git a/libksuid/divisor_magic.h b/libksuid/divisor_magic.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: LGPL-3.0-or-later
+ *
+ * AUTO-GENERATED by tools/derive-magic.py 64 62.
+ * DO NOT HAND-EDIT. Re-run the deriver and commit the new output if
+ * the divisor or bit width changes.
+ *
+ * Granlund-Moeller magic constant for unsigned divide-by-62 via
+ * the high N bits of an N x N -> 2N multiply.
+ *
+ * For every value < 2^64 the reciprocal computation
+ *
+ *     q' = (value * KSUID_DIV62_M) >> 64
+ *     r' = value - q' * 62
+ *     if (r' >= 62) { ++q'; r' -= 62; }
+ *
+ * yields the same q', r' as the straight integer division
+ * value / 62, value % 62. mulhi(value, M) may
+ * underestimate the true quotient by exactly 1 for some inputs and
+ * the correction step recovers it; mulhi never overestimates,
+ * which is why M is the FLOOR of 2^N / d (not the ceiling -- the
+ * ceiling form would overestimate for some inputs and the standard
+ * correction would not catch it).
+ *
+ * Property pinned at compile time in the consuming TU:
+ *     2^64 - M * 62 = 16   (must satisfy 0 <= ... <= d - 1)
+ *
+ * This file is regenerated by `tools/derive-magic.py` and verified
+ * by `tests/test_divisor_magic.c` on every CI run.
+ */
+#ifndef KSUID_DIVISOR_MAGIC_H
+#define KSUID_DIVISOR_MAGIC_H
+
+#include <stdint.h>
+
+#define KSUID_DIV62_M_BITS 64
+#define KSUID_DIV62_M ((uint64_t) UINT64_C (0x421084210842108))
+#define KSUID_DIV62_M_2N_MINUS_M_TIMES_D 16
+
+#endif /* KSUID_DIVISOR_MAGIC_H */
diff --git a/libksuid/encode_avx2.c b/libksuid/encode_avx2.c
@@ -0,0 +1,216 @@
+/* SPDX-License-Identifier: LGPL-3.0-or-later
+ *
+ * AVX2 8-wide ksuid_string_batch kernel.
+ *
+ * Replaces the per-ID scalar `value / 62` and `value % 62` with a
+ * Granlund-Moeller multiply-high + correction step run in parallel
+ * across 8 KSUID lanes. The magic constant KSUID_DIV62_M = floor(2^64
+ * / 62) lives in libksuid/divisor_magic.h (auto-generated by
+ * tools/derive-magic.py and verified against integer division by
+ * tests/test_divisor_magic.c on every CI run; FLOOR not ceiling --
+ * see the deriver script for the reasoning).
+ *
+ * Lane layout (SoA). 8 KSUIDs x 5 base-2^32 limbs = 40 limb values.
+ * AVX2 _mm256_mul_epu32 takes the LOW 32 bits of each 64-bit lane
+ * (4 lanes per __m256i) and produces a 64-bit product per lane, so
+ * we carry each limb across two __m256i registers: the "lo" pack
+ * holds lanes 0..3 (one limb value per 64-bit lane) and the "hi"
+ * pack holds lanes 4..7. State: 5 limbs * 2 packs = 10 ymm
+ * registers. Haswell exposes 16 architectural ymm registers, so
+ * everything plus M, k62, and a few temporaries stays in registers.
+ *
+ * Outer iteration count is fixed at KSUID_STRING_LEN = 27. The
+ * scalar reference (libksuid/base62.c ksuid_base62_encode) suppresses
+ * leading zeros and stops when bp_len reaches 0, then '0'-pads the
+ * head. Lanes finish at different rates in SIMD, so we run the full
+ * 27 iterations unconditionally; once all 5 limbs of a lane are
+ * zero, every subsequent iteration produces (q=0, r=0) and writes
+ * '0' to that lane's output column -- which is exactly the head
+ * padding the scalar would have done.
+ *
+ * Tail (n & 7): falls through to ksuid_string_batch_scalar after
+ * the bulk loop. _mm256_zeroupper() is issued before any non-VEX
+ * code path is entered (Intel SDM Vol 1 sec 14.1.2 -- a stale
+ * upper-ymm causes a transition penalty when subsequent SSE-encoded
+ * instructions execute on the same lane).
+ *
+ * Issue #13 risk register cross-references:
+ *   C2 (mulhi64 carry propagation) -- handled in
+ *     ksuid_mulhi64_avx2 below; mid_low is the sum of three u32
+ *     values (fits in 34 bits, no u64 overflow) and the upper
+ *     accumulator carries are added explicitly.
+ *   C3 (mul_epu32 lane positioning) -- value is constructed so
+ *     limb is in the LOW 32 of each 64-bit lane and remainder is
+ *     in the HIGH 32; we issue four mul_epu32 invocations
+ *     (lo*lo, lo*hi, hi*lo, hi*hi) to recover the full 64x64
+ *     product since each mul_epu32 only reads the low 32.
+ *   C9 (zeroupper placement) -- emitted before scalar tail call
+ *     AND before function return.
+ *   R11 (force-scalar override) -- handled in encode_batch.c.
+ */
+#if defined(__x86_64__) || defined(_M_X64)
+#  include <immintrin.h>
+#  include <stddef.h>
+#  include <stdint.h>
+
+#  include <libksuid/byteorder.h>
+#  include <libksuid/divisor_magic.h>
+#  include <libksuid/encode_batch.h>
+#  include <libksuid/ksuid.h>
+
+/* Same alphabet as libksuid/base62.c; redeclared here to keep this
+ * TU self-contained (the scalar table is `static` to that file). */
+static const char k_avx2_b62_alphabet[64]
+    = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+/* Multiply-high u64 x u64 -> high u64, 4 lanes wide.
+ *
+ * Schoolbook 64x64 -> 128 with the four 32x32 -> 64 cross-products.
+ * Let a = a_hi << 32 | a_lo, b = b_hi << 32 | b_lo.
+ *
+ *   a*b = a_hi*b_hi << 64
+ *       + (a_lo*b_hi + a_hi*b_lo) << 32
+ *       + a_lo*b_lo
+ *
+ * The high 64 bits accumulate (a_hi*b_hi) plus the carry from the
+ * middle terms. mid_low sums three 32-bit values (lh_lo + hl_lo +
+ * ll_hi) which fits in 34 bits and so cannot overflow u64;
+ * mid_low >> 32 is the low-half-to-high-half carry; lh_hi and hl_hi
+ * are the upper 32 of the middle products that propagate directly
+ * into the high accumulator.
+ *
+ * Saturation check: with a_hi = b_hi = 2^32-1 and a_lo = b_lo =
+ * 2^32-1, the high accumulator equals (2^64-1) -- the maximum legal
+ * u64 -- so no internal step overflows. */
+static inline __m256i
+ksuid_mulhi64_avx2 (__m256i a, __m256i b)
+{
+  __m256i mask32 = _mm256_set1_epi64x ((int64_t) 0xFFFFFFFFLL);
+  __m256i a_hi = _mm256_srli_epi64 (a, 32);
+  __m256i b_hi = _mm256_srli_epi64 (b, 32);
+
+  __m256i ll = _mm256_mul_epu32 (a, b);
+  __m256i lh = _mm256_mul_epu32 (a, b_hi);
+  __m256i hl = _mm256_mul_epu32 (a_hi, b);
+  __m256i hh = _mm256_mul_epu32 (a_hi, b_hi);
+
+  __m256i ll_hi = _mm256_srli_epi64 (ll, 32);
+  __m256i lh_lo = _mm256_and_si256 (lh, mask32);
+  __m256i hl_lo = _mm256_and_si256 (hl, mask32);
+  __m256i lh_hi = _mm256_srli_epi64 (lh, 32);
+  __m256i hl_hi = _mm256_srli_epi64 (hl, 32);
+
+  __m256i mid_low = _mm256_add_epi64 (_mm256_add_epi64 (lh_lo, hl_lo), ll_hi);
+  __m256i mid_carry = _mm256_srli_epi64 (mid_low, 32);
+
+  return _mm256_add_epi64 (_mm256_add_epi64 (hh, lh_hi),
+      _mm256_add_epi64 (hl_hi, mid_carry));
+}
+
+/* (q, r) = (value/62, value%62) per 64-bit lane via Granlund-Moeller:
+ *   q = mulhi(value, M)        -- may underestimate by 1
+ *   r = value - q*62           -- always in [0, 2*62-1]
+ *   if (r >= 62) { ++q; r -= 62; }
+ *
+ * Domain in this kernel: value < 63 * 2^32 < 2^38 (limb in low 32,
+ * remainder in [0,61] in high 32). Therefore q < 2^33 and q*62 fits
+ * in 39 bits with room to spare. We compute q*62 = (q<<6) - (q<<1)
+ * with shifts -- AVX2 has no native u64*u64-to-u64 truncating
+ * multiply that would not require another mul_epu32 carry chain. */
+static inline void
+ksuid_divmod62_avx2 (__m256i value, __m256i mag, __m256i *q_out, __m256i *r_out)
+{
+  __m256i q = ksuid_mulhi64_avx2 (value, mag);
+  __m256i q62 =
+      _mm256_sub_epi64 (_mm256_slli_epi64 (q, 6), _mm256_slli_epi64 (q, 1));
+  __m256i r = _mm256_sub_epi64 (value, q62);
+
+  /* _mm256_cmpgt_epi64 is signed; r and 61 are both small positive
+   * integers (< 2^7) so signed comparison agrees with unsigned. */
+  __m256i ge62 = _mm256_cmpgt_epi64 (r, _mm256_set1_epi64x (61));
+  __m256i k62 = _mm256_set1_epi64x (62);
+  __m256i one = _mm256_set1_epi64x (1);
+  *r_out = _mm256_sub_epi64 (r, _mm256_and_si256 (ge62, k62));
+  *q_out = _mm256_add_epi64 (q, _mm256_and_si256 (ge62, one));
+}
+
+void
+ksuid_string_batch_avx2 (const ksuid_t *ids, char *out_27n, size_t n)
+{
+  size_t bulk = n & ~(size_t) 7;
+  __m256i mag = _mm256_set1_epi64x ((int64_t) KSUID_DIV62_M);
+
+  for (size_t base = 0; base < bulk; base += 8) {
+    /* SoA pack: 5 limbs x (lo-4-lanes, hi-4-lanes). The limb value
+     * lives in the low 32 of each 64-bit lane; the high 32 is zero
+     * initially and carries the running remainder during the inner
+     * loop. */
+    __m256i Llo[5], Lhi[5];
+    {
+      uint64_t plo[4], phi[4];
+      const uint8_t *base_p = (const uint8_t *) &ids[base];
+      for (size_t j = 0; j < 5; ++j) {
+        for (size_t lane = 0; lane < 4; ++lane) {
+          plo[lane] =
+              (uint64_t) ksuid_be32_load (base_p + lane * KSUID_BYTES + j * 4);
+          phi[lane] =
+              (uint64_t) ksuid_be32_load (base_p + (lane + 4) * KSUID_BYTES +
+              j * 4);
+        }
+        /* _mm256_loadu_si256 is the AVX2 unaligned-load intrinsic;
+         * the (__m256i *) cast is documented and does not require
+         * 32-byte alignment. */
+        /* NOLINTNEXTLINE(clang-diagnostic-cast-align) */
+        Llo[j] = _mm256_loadu_si256 ((const __m256i *) plo);
+        /* NOLINTNEXTLINE(clang-diagnostic-cast-align) */
+        Lhi[j] = _mm256_loadu_si256 ((const __m256i *) phi);
+      }
+    }
+
+    /* 27 outer iterations -- one per output character. The output
+     * column index walks right-to-left because long division by
+     * the radix produces digits LSB-first (the rightmost output
+     * character is the lowest-order base62 digit of the integer). */
+    for (int col = KSUID_STRING_LEN - 1; col >= 0; --col) {
+      __m256i rem_lo = _mm256_setzero_si256 ();
+      __m256i rem_hi = _mm256_setzero_si256 ();
+
+      for (size_t j = 0; j < 5; ++j) {
+        /* value = limb | (rem << 32). Both halves fit in u32 so
+         * the OR composes them without overlap. */
+        __m256i v_lo = _mm256_or_si256 (Llo[j], _mm256_slli_epi64 (rem_lo, 32));
+        __m256i v_hi = _mm256_or_si256 (Lhi[j], _mm256_slli_epi64 (rem_hi, 32));
+
+        __m256i q_lo, r_lo, q_hi, r_hi;
+        ksuid_divmod62_avx2 (v_lo, mag, &q_lo, &r_lo);
+        ksuid_divmod62_avx2 (v_hi, mag, &q_hi, &r_hi);
+
+        Llo[j] = q_lo;
+        Lhi[j] = q_hi;
+        rem_lo = r_lo;
+        rem_hi = r_hi;
+      }
+
+      /* Final remainder is the digit; alphabet-lookup, scatter. */
+      uint64_t rems[8];
+      /* NOLINTNEXTLINE(clang-diagnostic-cast-align) */
+      _mm256_storeu_si256 ((__m256i *) & rems[0], rem_lo);
+      /* NOLINTNEXTLINE(clang-diagnostic-cast-align) */
+      _mm256_storeu_si256 ((__m256i *) & rems[4], rem_hi);
+      for (size_t lane = 0; lane < 8; ++lane) {
+        out_27n[(base + lane) * KSUID_STRING_LEN + (size_t) col]
+            = k_avx2_b62_alphabet[rems[lane]];
+      }
+    }
+  }
+
+  /* Emit zeroupper before the scalar tail call AND before
+   * returning; both transitions can land us in non-VEX code. */
+  _mm256_zeroupper ();
+
+  if (bulk < n)
+    ksuid_string_batch_scalar (ids + bulk,
+        out_27n + bulk * KSUID_STRING_LEN, n - bulk);
+}
+
+#endif /* x86_64 */
diff --git a/libksuid/encode_batch.c b/libksuid/encode_batch.c
@@ -31,6 +31,8 @@
 
 #include <stdatomic.h>
 #include <stddef.h>
+#include <stdlib.h>
+#include <string.h>
 
 #if defined(__x86_64__) || defined(_M_X64)
 #  if defined(__GNUC__) || defined(__clang__)
@@ -99,13 +101,37 @@ ksuid_string_batch_init_trampoline (const ksuid_t * ids, char *out_27n,
 static _Atomic ksuid_string_batch_fn g_batch_impl =
     &ksuid_string_batch_init_trampoline;
 
+/* KSUID_FORCE_SCALAR override (Critic R11). Reading getenv on the
+ * first dispatch only is safe -- the resolved pointer is cached for
+ * the lifetime of the process and the env var is consulted exactly
+ * once. The override exists so production deployments can pin the
+ * scalar path at startup if a future regression in the AVX2 kernel
+ * is discovered after rollout, without rebuilding the library.
+ *
+ * Recognised values: any non-empty, non-"0", non-"false" string
+ * disables the AVX2 kernel. NULL or unset = use the best kernel
+ * available on the host. */
+static int
+ksuid_force_scalar_env (void)
+{
+  const char *v = getenv ("KSUID_FORCE_SCALAR");
+  if (v == NULL || v[0] == '\0')
+    return 0;
+  if (strcmp (v, "0") == 0 || strcmp (v, "false") == 0
+      || strcmp (v, "FALSE") == 0)
+    return 0;
+  return 1;
+}
+
 static void
 ksuid_string_batch_init_trampoline (const ksuid_t *ids, char *out_27n, size_t n)
 {
   ksuid_string_batch_fn resolved = &ksuid_string_batch_scalar;
 #if defined(KSUID_HAVE_AVX2_BATCH)
-  if (ksuid_cpu_supports_avx2 ())
+  if (!ksuid_force_scalar_env () && ksuid_cpu_supports_avx2 ())
     resolved = &ksuid_string_batch_avx2;
+#else
+  (void) ksuid_force_scalar_env;        /* silence unused-static warning */
 #endif
   atomic_store_explicit (&g_batch_impl, resolved, memory_order_release);
   resolved (ids, out_27n, n);

diff --git a/libksuid/ksuid.h b/libksuid/ksuid.h
@@ -166,14 +166,20 @@ extern "C"
  *
  * The dispatch is resolved lazily on the first call (atomic, thread-
  * safe) and the resolved pointer is reused for the lifetime of the
- * process. The eventual AVX2 8-wide kernel will plug into this
- * dispatcher and deliver ~4-6x throughput on x86_64 hosts that
- * support it; in the current release every host resolves to the
- * per-ID scalar path (a ksuid_format loop), and the dispatch is
- * one acquire-load + one indirect call of overhead per call. The
- * AVX2 kernel itself is tracked as a follow-up to libksuid issue
- * #5; until it ships, ksuid_string_batch is functionally
- * equivalent to a ksuid_format loop with no measurable overhead.
+ * process. On x86_64 hosts that advertise AVX2, the dispatcher
+ * resolves to an 8-wide AVX2 kernel that processes 8 KSUIDs per
+ * outer iteration via a Granlund-Moeller reciprocal-multiply
+ * divide-by-62; on other hosts (including non-AVX2 x86_64) the
+ * dispatcher resolves to a per-ID scalar loop equivalent to
+ * ksuid_format() N times. The dispatch overhead -- one acquire-
+ * load + one indirect call per ksuid_string_batch invocation -- is
+ * lost in the noise compared to even the per-ID scalar work.
+ *
+ * Output is byte-identical across kernels. The KSUID_FORCE_SCALAR
+ * environment variable, if set to a non-empty non-"0"/"false"
+ * value, pins the dispatcher to the scalar path at first dispatch
+ * (a runtime kill switch for the AVX2 kernel without rebuilding
+ * the library).
  *
  * No error path: every 20-byte ksuid_t encodes by construction. n == 0
  * is a no-op. The call is thread-safe for concurrent invocations on