Skip to content
This repository was archived by the owner on May 1, 2026. It is now read-only.
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 19 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -127,17 +127,25 @@ only the throughput differs.

The implementation dispatches at first call to a kernel selected
from CPU features via an atomic function pointer (libsodium-style
trampoline; race-free without locks or allocation). Today that
kernel is the per-ID scalar path on every supported host. The AVX2
8-wide kernel that the issue tracker proposed
([#5](https://github.com/semantic-reasoning/libksuid/issues/5)) is
**not yet shipped** -- the implementation requires SIMD long-
division-by-62 with reciprocal-multiplication magic constants
verified against a parity corpus, and the dispatch infrastructure
landed in this PR is the foundation that will be added in a
follow-up. Until then, `ksuid_string_batch` is functionally
equivalent to a `ksuid_format` loop with one acquire-load + one
indirect call of overhead per call.
trampoline; race-free without locks or allocation):

- **x86_64 + AVX2**: an 8-wide AVX2 kernel that processes eight
KSUIDs per outer iteration via a Granlund-Möller floor-reciprocal
multiply divide-by-62 ([#13](https://github.com/semantic-reasoning/libksuid/issues/13)).
The magic constant is auto-generated by `tools/derive-magic.py`,
pinned in `libksuid/divisor_magic.h`, and verified against
`__uint128_t` integer division on every CI run.
- **Other hosts** (non-AVX2 x86_64, aarch64, arm, ...): a per-ID
scalar loop equivalent to calling `ksuid_format` N times.

Output is byte-identical across kernels; the differential parity
test in `tests/test_string_batch.c` cross-checks the AVX2 kernel
against the scalar reference over ≥ 2²⁰ pseudo-random KSUIDs and a
lane-swap detection corpus.

Setting the environment variable `KSUID_FORCE_SCALAR=1` pins the
dispatcher to the scalar path at first call (runtime kill switch
without rebuilding the library).

## Layout

Expand Down
39 changes: 39 additions & 0 deletions libksuid/divisor_magic.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
/* SPDX-License-Identifier: LGPL-3.0-or-later
*
* AUTO-GENERATED by tools/derive-magic.py 64 62.
* DO NOT HAND-EDIT. Re-run the deriver and commit the new output if
* the divisor or bit width changes.
*
* Granlund-Moeller magic constant for unsigned divide-by-62 via
* the high N bits of an N x N -> 2N multiply.
*
* For every value < 2^64 the reciprocal computation
*
* q' = (value * KSUID_DIV62_M) >> 64
* r' = value - q' * 62
* if (r' >= 62) { ++q'; r' -= 62; }
*
* yields the same q', r' as the straight integer division
* value / 62, value % 62. mulhi(value, M) may
* underestimate the true quotient by exactly 1 for some inputs and
* the correction step recovers it; mulhi never overestimates,
* which is why M is the FLOOR of 2^N / d (not the ceiling -- the
* ceiling form would overestimate for some inputs and the standard
* correction would not catch it).
*
* Property pinned at compile time in the consuming TU:
* 2^64 - M * 62 = 16 (must satisfy 0 <= ... <= d - 1)
*
* This file is regenerated by `tools/derive-magic.py` and verified
* by `tests/test_divisor_magic.c` on every CI run.
*/
#ifndef KSUID_DIVISOR_MAGIC_H
#define KSUID_DIVISOR_MAGIC_H

#include <stdint.h>

#define KSUID_DIV62_M_BITS 64
#define KSUID_DIV62_M ((uint64_t) UINT64_C (0x421084210842108))
#define KSUID_DIV62_M_2N_MINUS_M_TIMES_D 16

#endif /* KSUID_DIVISOR_MAGIC_H */
216 changes: 216 additions & 0 deletions libksuid/encode_avx2.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,216 @@
/* SPDX-License-Identifier: LGPL-3.0-or-later
*
* AVX2 8-wide ksuid_string_batch kernel.
*
* Replaces the per-ID scalar `value / 62` and `value % 62` with a
* Granlund-Moeller multiply-high + correction step run in parallel
* across 8 KSUID lanes. The magic constant KSUID_DIV62_M = floor(2^64
* / 62) lives in libksuid/divisor_magic.h (auto-generated by
* tools/derive-magic.py and verified against integer division by
* tests/test_divisor_magic.c on every CI run; FLOOR not ceiling --
* see the deriver script for the reasoning).
*
* Lane layout (SoA). 8 KSUIDs x 5 base-2^32 limbs = 40 limb values.
* AVX2 _mm256_mul_epu32 takes the LOW 32 bits of each 64-bit lane
* (4 lanes per __m256i) and produces a 64-bit product per lane, so
* we carry each limb across two __m256i registers: the "lo" pack
* holds lanes 0..3 (one limb value per 64-bit lane) and the "hi"
* pack holds lanes 4..7. State: 5 limbs * 2 packs = 10 ymm
* registers. Haswell exposes 16 architectural ymm registers, so
* everything plus M, k62, and a few temporaries stays in registers.
*
* Outer iteration count is fixed at KSUID_STRING_LEN = 27. The
* scalar reference (libksuid/base62.c ksuid_base62_encode) suppresses
* leading zeros and stops when bp_len reaches 0, then '0'-pads the
* head. Lanes finish at different rates in SIMD, so we run the full
* 27 iterations unconditionally; once all 5 limbs of a lane are
* zero, every subsequent iteration produces (q=0, r=0) and writes
* '0' to that lane's output column -- which is exactly the head
* padding the scalar would have done.
*
* Tail (n & 7): falls through to ksuid_string_batch_scalar after
* the bulk loop. _mm256_zeroupper() is issued before any non-VEX
* code path is entered (Intel SDM Vol 1 sec 14.1.2 -- a stale
* upper-ymm causes a transition penalty when subsequent SSE-encoded
* instructions execute on the same lane).
*
* Issue #13 risk register cross-references:
* C2 (mulhi64 carry propagation) -- handled in
* ksuid_mulhi64_avx2 below; mid_low is the sum of three u32
* values (fits in 34 bits, no u64 overflow) and the upper
* accumulator carries are added explicitly.
* C3 (mul_epu32 lane positioning) -- value is constructed so
* limb is in the LOW 32 of each 64-bit lane and remainder is
* in the HIGH 32; we issue four mul_epu32 invocations
* (lo*lo, lo*hi, hi*lo, hi*hi) to recover the full 64x64
* product since each mul_epu32 only reads the low 32.
* C9 (zeroupper placement) -- emitted before scalar tail call
* AND before function return.
* R11 (force-scalar override) -- handled in encode_batch.c.
*/
#if defined(__x86_64__) || defined(_M_X64)
# include <immintrin.h>
# include <stddef.h>
# include <stdint.h>

# include <libksuid/byteorder.h>
# include <libksuid/divisor_magic.h>
# include <libksuid/encode_batch.h>
# include <libksuid/ksuid.h>

/* Same alphabet as libksuid/base62.c; redeclared here to keep this
* TU self-contained (the scalar table is `static` to that file). */
static const char k_avx2_b62_alphabet[64]
= "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

/* Multiply-high u64 x u64 -> high u64, 4 lanes wide.
*
* Schoolbook 64x64 -> 128 with the four 32x32 -> 64 cross-products.
* Let a = a_hi << 32 | a_lo, b = b_hi << 32 | b_lo.
*
* a*b = a_hi*b_hi << 64
* + (a_lo*b_hi + a_hi*b_lo) << 32
* + a_lo*b_lo
*
* The high 64 bits accumulate (a_hi*b_hi) plus the carry from the
* middle terms. mid_low sums three 32-bit values (lh_lo + hl_lo +
* ll_hi) which fits in 34 bits and so cannot overflow u64;
* mid_low >> 32 is the low-half-to-high-half carry; lh_hi and hl_hi
* are the upper 32 of the middle products that propagate directly
* into the high accumulator.
*
* Saturation check: with a_hi = b_hi = 2^32-1 and a_lo = b_lo =
* 2^32-1, the high accumulator equals (2^64-1) -- the maximum legal
* u64 -- so no internal step overflows. */
static inline __m256i
ksuid_mulhi64_avx2 (__m256i a, __m256i b)
{
__m256i mask32 = _mm256_set1_epi64x ((int64_t) 0xFFFFFFFFLL);
__m256i a_hi = _mm256_srli_epi64 (a, 32);
__m256i b_hi = _mm256_srli_epi64 (b, 32);

__m256i ll = _mm256_mul_epu32 (a, b);
__m256i lh = _mm256_mul_epu32 (a, b_hi);
__m256i hl = _mm256_mul_epu32 (a_hi, b);
__m256i hh = _mm256_mul_epu32 (a_hi, b_hi);

__m256i ll_hi = _mm256_srli_epi64 (ll, 32);
__m256i lh_lo = _mm256_and_si256 (lh, mask32);
__m256i hl_lo = _mm256_and_si256 (hl, mask32);
__m256i lh_hi = _mm256_srli_epi64 (lh, 32);
__m256i hl_hi = _mm256_srli_epi64 (hl, 32);

__m256i mid_low = _mm256_add_epi64 (_mm256_add_epi64 (lh_lo, hl_lo), ll_hi);
__m256i mid_carry = _mm256_srli_epi64 (mid_low, 32);

return _mm256_add_epi64 (_mm256_add_epi64 (hh, lh_hi),
_mm256_add_epi64 (hl_hi, mid_carry));
}

/* (q, r) = (value/62, value%62) per 64-bit lane via Granlund-Moeller:
* q = mulhi(value, M) -- may underestimate by 1
* r = value - q*62 -- always in [0, 2*62-1]
* if (r >= 62) { ++q; r -= 62; }
*
* Domain in this kernel: value < 63 * 2^32 < 2^38 (limb in low 32,
* remainder in [0,61] in high 32). Therefore q < 2^33 and q*62 fits
* in 39 bits with room to spare. We compute q*62 = (q<<6) - (q<<1)
* with shifts -- AVX2 has no native u64*u64-to-u64 truncating
* multiply that would not require another mul_epu32 carry chain. */
static inline void
ksuid_divmod62_avx2 (__m256i value, __m256i mag, __m256i *q_out, __m256i *r_out)
{
__m256i q = ksuid_mulhi64_avx2 (value, mag);
__m256i q62 =
_mm256_sub_epi64 (_mm256_slli_epi64 (q, 6), _mm256_slli_epi64 (q, 1));
__m256i r = _mm256_sub_epi64 (value, q62);

/* _mm256_cmpgt_epi64 is signed; r and 61 are both small positive
* integers (< 2^7) so signed comparison agrees with unsigned. */
__m256i ge62 = _mm256_cmpgt_epi64 (r, _mm256_set1_epi64x (61));
__m256i k62 = _mm256_set1_epi64x (62);
__m256i one = _mm256_set1_epi64x (1);
*r_out = _mm256_sub_epi64 (r, _mm256_and_si256 (ge62, k62));
*q_out = _mm256_add_epi64 (q, _mm256_and_si256 (ge62, one));
}

void
ksuid_string_batch_avx2 (const ksuid_t *ids, char *out_27n, size_t n)
{
size_t bulk = n & ~(size_t) 7;
__m256i mag = _mm256_set1_epi64x ((int64_t) KSUID_DIV62_M);

for (size_t base = 0; base < bulk; base += 8) {
/* SoA pack: 5 limbs x (lo-4-lanes, hi-4-lanes). The limb value
* lives in the low 32 of each 64-bit lane; the high 32 is zero
* initially and carries the running remainder during the inner
* loop. */
__m256i Llo[5], Lhi[5];
{
uint64_t plo[4], phi[4];
const uint8_t *base_p = (const uint8_t *) &ids[base];
for (size_t j = 0; j < 5; ++j) {
for (size_t lane = 0; lane < 4; ++lane) {
plo[lane] =
(uint64_t) ksuid_be32_load (base_p + lane * KSUID_BYTES + j * 4);
phi[lane] =
(uint64_t) ksuid_be32_load (base_p + (lane + 4) * KSUID_BYTES +
j * 4);
}
/* _mm256_loadu_si256 is the AVX2 unaligned-load intrinsic;
* the (__m256i *) cast is documented and does not require
* 32-byte alignment. */
/* NOLINTNEXTLINE(clang-diagnostic-cast-align) */
Llo[j] = _mm256_loadu_si256 ((const __m256i *) plo);
/* NOLINTNEXTLINE(clang-diagnostic-cast-align) */
Lhi[j] = _mm256_loadu_si256 ((const __m256i *) phi);
}
}

/* 27 outer iterations -- one per output character. The output
* column index walks right-to-left because long division by
* the radix produces digits LSB-first (the rightmost output
* character is the lowest-order base62 digit of the integer). */
for (int col = KSUID_STRING_LEN - 1; col >= 0; --col) {
__m256i rem_lo = _mm256_setzero_si256 ();
__m256i rem_hi = _mm256_setzero_si256 ();

for (size_t j = 0; j < 5; ++j) {
/* value = limb | (rem << 32). Both halves fit in u32 so
* the OR composes them without overlap. */
__m256i v_lo = _mm256_or_si256 (Llo[j], _mm256_slli_epi64 (rem_lo, 32));
__m256i v_hi = _mm256_or_si256 (Lhi[j], _mm256_slli_epi64 (rem_hi, 32));

__m256i q_lo, r_lo, q_hi, r_hi;
ksuid_divmod62_avx2 (v_lo, mag, &q_lo, &r_lo);
ksuid_divmod62_avx2 (v_hi, mag, &q_hi, &r_hi);

Llo[j] = q_lo;
Lhi[j] = q_hi;
rem_lo = r_lo;
rem_hi = r_hi;
}

/* Final remainder is the digit; alphabet-lookup, scatter. */
uint64_t rems[8];
/* NOLINTNEXTLINE(clang-diagnostic-cast-align) */
_mm256_storeu_si256 ((__m256i *) & rems[0], rem_lo);
/* NOLINTNEXTLINE(clang-diagnostic-cast-align) */
_mm256_storeu_si256 ((__m256i *) & rems[4], rem_hi);
for (size_t lane = 0; lane < 8; ++lane) {
out_27n[(base + lane) * KSUID_STRING_LEN + (size_t) col]
= k_avx2_b62_alphabet[rems[lane]];
}
}
}

/* Emit zeroupper before the scalar tail call AND before
* returning; both transitions can land us in non-VEX code. */
_mm256_zeroupper ();

if (bulk < n)
ksuid_string_batch_scalar (ids + bulk,
out_27n + bulk * KSUID_STRING_LEN, n - bulk);
}

#endif /* x86_64 */
28 changes: 27 additions & 1 deletion libksuid/encode_batch.c
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,8 @@

#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#if defined(__x86_64__) || defined(_M_X64)
# if defined(__GNUC__) || defined(__clang__)
Expand Down Expand Up @@ -99,13 +101,37 @@ ksuid_string_batch_init_trampoline (const ksuid_t * ids, char *out_27n,
static _Atomic ksuid_string_batch_fn g_batch_impl =
&ksuid_string_batch_init_trampoline;

/* KSUID_FORCE_SCALAR override (Critic R11). Reading getenv on the
* first dispatch only is safe -- the resolved pointer is cached for
* the lifetime of the process and the env var is consulted exactly
* once. The override exists so production deployments can pin the
* scalar path at startup if a future regression in the AVX2 kernel
* is discovered after rollout, without rebuilding the library.
*
* Recognised values: any non-empty, non-"0", non-"false" string
* disables the AVX2 kernel. NULL or unset = use the best kernel
* available on the host. */
static int
ksuid_force_scalar_env (void)
{
const char *v = getenv ("KSUID_FORCE_SCALAR");
if (v == NULL || v[0] == '\0')
return 0;
if (strcmp (v, "0") == 0 || strcmp (v, "false") == 0
|| strcmp (v, "FALSE") == 0)
return 0;
return 1;
}

static void
ksuid_string_batch_init_trampoline (const ksuid_t *ids, char *out_27n, size_t n)
{
ksuid_string_batch_fn resolved = &ksuid_string_batch_scalar;
#if defined(KSUID_HAVE_AVX2_BATCH)
if (ksuid_cpu_supports_avx2 ())
if (!ksuid_force_scalar_env () && ksuid_cpu_supports_avx2 ())
resolved = &ksuid_string_batch_avx2;
#else
(void) ksuid_force_scalar_env; /* silence unused-static warning */
#endif
atomic_store_explicit (&g_batch_impl, resolved, memory_order_release);
resolved (ids, out_27n, n);
Expand Down
22 changes: 14 additions & 8 deletions libksuid/ksuid.h
Original file line number Diff line number Diff line change
Expand Up @@ -166,14 +166,20 @@ extern "C"
*
* The dispatch is resolved lazily on the first call (atomic, thread-
* safe) and the resolved pointer is reused for the lifetime of the
* process. The eventual AVX2 8-wide kernel will plug into this
* dispatcher and deliver ~4-6x throughput on x86_64 hosts that
* support it; in the current release every host resolves to the
* per-ID scalar path (a ksuid_format loop), and the dispatch is
* one acquire-load + one indirect call of overhead per call. The
* AVX2 kernel itself is tracked as a follow-up to libksuid issue
* #5; until it ships, ksuid_string_batch is functionally
* equivalent to a ksuid_format loop with no measurable overhead.
* process. On x86_64 hosts that advertise AVX2, the dispatcher
* resolves to an 8-wide AVX2 kernel that processes 8 KSUIDs per
* outer iteration via a Granlund-Moeller reciprocal-multiply
* divide-by-62; on other hosts (including non-AVX2 x86_64) the
* dispatcher resolves to a per-ID scalar loop equivalent to
* ksuid_format() N times. The dispatch overhead -- one acquire-
* load + one indirect call per ksuid_string_batch invocation -- is
* lost in the noise compared to even the per-ID scalar work.
*
* Output is byte-identical across kernels. The KSUID_FORCE_SCALAR
* environment variable, if set to a non-empty non-"0"/"false"
* value, pins the dispatcher to the scalar path at first dispatch
* (a runtime kill switch for the AVX2 kernel without rebuilding
* the library).
*
* No error path: every 20-byte ksuid_t encodes by construction. n == 0
* is a no-op. The call is thread-safe for concurrent invocations on
Expand Down
Loading
Loading