x86_64 + HOL-Light: Replace rej_uniform intrinsics with assembly and HOL-Light CORRECT and MEMSAFE proofs by jakemas · Pull Request #1014 · pq-code-package/mldsa-native

jakemas · 2026-04-03T04:11:25Z

Summary

Resolves #926 and #418 (?)

The HOL-Light proof needs instructions from awslabs/s2n-bignum#387

Replace AVX2 intrinsics implementation of rej_uniform with hand-written x86_64 assembly
Table passed as parameter (consistent with aarch64 approach), avoiding external symbol references for simpasm compatibility
All constants constructed from immediates (no .rodata section), enabling future HOL-Light formal verification
Register name #defines with #undef cleanup for SCU builds (following mlkem-native pattern)
Adds poly_uniform to component benchmark
HOL-Light proof infrastructure included (bytecode, table definition, proof skeleton, Makefile)
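
To illustrate the "table passed as parameter" point above, here is a minimal C sketch of a mask-indexed compaction table. The names `build_idx_table` and `compact_lanes`, the flat 256 x 8 byte layout, and the scalar loop standing in for a vector lane permute are all assumptions made for illustration; the table definition this PR actually ships is not shown in this thread and may differ.

```c
#include <stdint.h>

/* Hypothetical table builder (not the PR's table): for each 8-bit
 * acceptance mask, store the positions of the set bits in ascending
 * order, padded with zeros. Row `mask` starts at table[8 * mask]. */
static void build_idx_table(uint8_t *table)
{
  for (unsigned mask = 0; mask < 256; mask++)
  {
    unsigned n = 0;
    for (unsigned bit = 0; bit < 8; bit++)
    {
      if (mask & (1u << bit))
      {
        table[8 * mask + n++] = (uint8_t)bit;
      }
    }
    while (n < 8)
    {
      table[8 * mask + n++] = 0;
    }
  }
}

/* Scalar stand-in for a lane permute: moves the accepted lanes to the
 * front of `out` and returns how many lanes were accepted. The table
 * arrives as an argument, so the routine references no external symbol. */
static unsigned compact_lanes(int32_t out[8], const int32_t in[8],
                              unsigned mask, const uint8_t *table)
{
  unsigned count = 0;
  mask &= 0xFFu;
  for (unsigned i = 0; i < 8; i++)
  {
    out[i] = in[table[8 * mask + i]];
    count += (mask >> i) & 1u;
  }
  return count;
}
```

Only the first `count` entries of `out` are meaningful; the padded entries repeat lane 0 and would be overwritten by the next iteration.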

ML-DSA's 23-bit coefficients require 32-bit lanes, so a 256-bit YMM register holds exactly 8 elements per iteration.
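
For readers unfamiliar with the routine being replaced, the following is a scalar C sketch of uniform rejection sampling for ML-DSA: each candidate is read from 3 bytes, masked to 23 bits, and kept only if it is below the modulus q = 8380417. The function name `rej_uniform_sketch` and its signature are illustrative assumptions, not the interface of the assembly in this PR, which handles 8 candidates per iteration.

```c
#include <stddef.h>
#include <stdint.h>

#define MLDSA_Q 8380417 /* ML-DSA modulus: 2^23 - 2^13 + 1 */

/* Scalar sketch: fills r with up to len accepted coefficients taken
 * from buf and returns the number of coefficients written. */
static size_t rej_uniform_sketch(int32_t *r, size_t len,
                                 const uint8_t *buf, size_t buflen)
{
  size_t ctr = 0;
  size_t pos = 0;
  while (ctr < len && pos + 3 <= buflen)
  {
    /* Assemble a little-endian 24-bit value and drop the top bit. */
    uint32_t t = (uint32_t)buf[pos] | ((uint32_t)buf[pos + 1] << 8) |
                 ((uint32_t)buf[pos + 2] << 16);
    t &= 0x7FFFFF;
    pos += 3;
    /* Accept only candidates in [0, q). */
    if (t < MLDSA_Q)
    {
      r[ctr++] = (int32_t)t;
    }
  }
  return ctr;
}
```

Because each candidate occupies 23 bits, it does not fit in a 16-bit lane, which is why the vectorized version works on 32-bit lanes.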

Performance

AMD EPYC 3rd gen (c6a) — opt

Benchmark	Before (cycles)	After (cycles)	Change
ML-DSA-44 keypair	68,874	66,828	-3%
ML-DSA-44 sign	187,594	184,181	-2%
ML-DSA-44 verify	68,993	65,665	-5%
ML-DSA-65 keypair	119,089	112,640	-5%
ML-DSA-65 sign	299,488	294,836	-2%
ML-DSA-65 verify	115,385	108,494	-6%
ML-DSA-87 keypair	203,754	185,518	-9%
ML-DSA-87 sign	396,462	378,579	-5%
ML-DSA-87 verify	196,231	177,157	-10%

Proof

Includes HOL-Light and CBMC proofs, written by Claude Opus 4.7.

CORRECT
HOL-Light / x86_64 HOL Light proof for mldsa_rej_uniform.S (pull_request) Successful in 12m

CORRECT + MEMSAFE
HOL-Light / x86_64 HOL Light proof for rej_uniform_avx2_asm.S (pull_request) Successful in 21m

It took Claude Opus 4.7 1m around 3 weeks to get the MEMSAFE proof. Since it records all token usage per prompt, I had it go back to the prompt in which I asked it to create the MEMSAFE part of the proof and gather statistics:

~10.35 billion tokens
20,257 turns (i.e., distinct API calls)
~30 days of development
~100-125 hours of active user engagement (based on clustering user message timestamps)
185 attempted build iterations

github-actions

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 keypair`	`113118` cycles	`113013` cycles	`1.00`
`ML-DSA-44 sign`	`355649` cycles	`355605` cycles	`1.00`
`ML-DSA-44 verify`	`117801` cycles	`117682` cycles	`1.00`
`ML-DSA-65 keypair`	`196381` cycles	`196214` cycles	`1.00`
`ML-DSA-65 sign`	`589557` cycles	`588943` cycles	`1.00`
`ML-DSA-65 verify`	`194604` cycles	`194375` cycles	`1.00`
`ML-DSA-87 keypair`	`322210` cycles	`322148` cycles	`1.00`
`ML-DSA-87 sign`	`752493` cycles	`752763` cycles	`1.00`
`ML-DSA-87 verify`	`320055` cycles	`319900` cycles	`1.00`

This comment was automatically generated by workflow using github-action-benchmark.

github-actions

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 keypair`	`212361` cycles	`212622` cycles	`1.00`
`ML-DSA-44 sign`	`760716` cycles	`760066` cycles	`1.00`
`ML-DSA-44 verify`	`228743` cycles	`228987` cycles	`1.00`
`ML-DSA-65 keypair`	`379384` cycles	`379665` cycles	`1.00`
`ML-DSA-65 sign`	`1250617` cycles	`1249827` cycles	`1.00`
`ML-DSA-65 verify`	`371531` cycles	`372045` cycles	`1.00`
`ML-DSA-87 keypair`	`604335` cycles	`605426` cycles	`1.00`
`ML-DSA-87 sign`	`1593243` cycles	`1591413` cycles	`1.00`
`ML-DSA-87 verify`	`618270` cycles	`617375` cycles	`1.00`

oqs-bot

AMD EPYC 3rd gen (c6a)

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 keypair`	`66830` cycles	`68874` cycles	`0.97`
`ML-DSA-44 sign`	`184077` cycles	`187594` cycles	`0.98`
`ML-DSA-44 verify`	`65562` cycles	`68993` cycles	`0.95`
`ML-DSA-65 keypair`	`111959` cycles	`119089` cycles	`0.94`
`ML-DSA-65 sign`	`292002` cycles	`299488` cycles	`0.98`
`ML-DSA-65 verify`	`108472` cycles	`115385` cycles	`0.94`
`ML-DSA-87 keypair`	`185520` cycles	`203754` cycles	`0.91`
`ML-DSA-87 sign`	`379630` cycles	`396462` cycles	`0.96`
`ML-DSA-87 verify`	`177291` cycles	`196231` cycles	`0.90`

oqs-bot

Graviton4

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 keypair`	`68316` cycles	`68121` cycles	`1.00`
`ML-DSA-44 sign`	`202487` cycles	`202429` cycles	`1.00`
`ML-DSA-44 verify`	`70722` cycles	`70691` cycles	`1.00`
`ML-DSA-65 keypair`	`121061` cycles	`121050` cycles	`1.00`
`ML-DSA-65 sign`	`331574` cycles	`332242` cycles	`1.00`
`ML-DSA-65 verify`	`117810` cycles	`118169` cycles	`1.00`
`ML-DSA-87 keypair`	`198140` cycles	`198283` cycles	`1.00`
`ML-DSA-87 sign`	`427941` cycles	`428124` cycles	`1.00`
`ML-DSA-87 verify`	`194637` cycles	`194645` cycles	`1.00`

oqs-bot

AMD EPYC 3rd gen (c6a) (no-opt)

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 keypair`	`134578` cycles	`135123` cycles	`1.00`
`ML-DSA-44 sign`	`523923` cycles	`523989` cycles	`1.00`
`ML-DSA-44 verify`	`147640` cycles	`147421` cycles	`1.00`
`ML-DSA-65 keypair`	`228634` cycles	`227032` cycles	`1.01`
`ML-DSA-65 sign`	`864042` cycles	`860343` cycles	`1.00`
`ML-DSA-65 verify`	`236700` cycles	`234883` cycles	`1.01`
`ML-DSA-87 keypair`	`371955` cycles	`371568` cycles	`1.00`
`ML-DSA-87 sign`	`1080535` cycles	`1079389` cycles	`1.00`
`ML-DSA-87 verify`	`383811` cycles	`383403` cycles	`1.00`

oqs-bot

Intel Xeon 3rd gen (c6i)

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 keypair`	`56863` cycles	`56287` cycles	`1.01`
`ML-DSA-44 sign`	`181063` cycles	`181562` cycles	`1.00`
`ML-DSA-44 verify`	`61140` cycles	`61061` cycles	`1.00`
`ML-DSA-65 keypair`	`98291` cycles	`98770` cycles	`1.00`
`ML-DSA-65 sign`	`298368` cycles	`299116` cycles	`1.00`
`ML-DSA-65 verify`	`100343` cycles	`100251` cycles	`1.00`
`ML-DSA-87 keypair`	`152430` cycles	`153265` cycles	`0.99`
`ML-DSA-87 sign`	`354719` cycles	`355417` cycles	`1.00`
`ML-DSA-87 verify`	`153124` cycles	`153884` cycles	`1.00`

oqs-bot

Graviton4 (no-opt)

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 keypair`	`128315` cycles	`128272` cycles	`1.00`
`ML-DSA-44 sign`	`447513` cycles	`447600` cycles	`1.00`
`ML-DSA-44 verify`	`138123` cycles	`144678` cycles	`0.95`
`ML-DSA-65 keypair`	`220541` cycles	`220481` cycles	`1.00`
`ML-DSA-65 sign`	`726484` cycles	`726951` cycles	`1.00`
`ML-DSA-65 verify`	`222926` cycles	`223461` cycles	`1.00`
`ML-DSA-87 keypair`	`366142` cycles	`366604` cycles	`1.00`
`ML-DSA-87 sign`	`927541` cycles	`927414` cycles	`1.00`
`ML-DSA-87 verify`	`374016` cycles	`373875` cycles	`1.00`

oqs-bot

Graviton3

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 keypair`	`72353` cycles	`72235` cycles	`1.00`
`ML-DSA-44 sign`	`212424` cycles	`212375` cycles	`1.00`
`ML-DSA-44 verify`	`75754` cycles	`75714` cycles	`1.00`
`ML-DSA-65 keypair`	`127646` cycles	`127612` cycles	`1.00`
`ML-DSA-65 sign`	`351030` cycles	`350845` cycles	`1.00`
`ML-DSA-65 verify`	`125627` cycles	`125755` cycles	`1.00`
`ML-DSA-87 keypair`	`205980` cycles	`208476` cycles	`0.99`
`ML-DSA-87 sign`	`444778` cycles	`450018` cycles	`0.99`
`ML-DSA-87 verify`	`205601` cycles	`205843` cycles	`1.00`

oqs-bot

Intel Xeon 3rd gen (c6i) (no-opt)

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 keypair`	`157499` cycles	`157541` cycles	`1.00`
`ML-DSA-44 sign`	`549244` cycles	`549413` cycles	`1.00`
`ML-DSA-44 verify`	`169448` cycles	`168865` cycles	`1.00`
`ML-DSA-65 keypair`	`268437` cycles	`268818` cycles	`1.00`
`ML-DSA-65 sign`	`903422` cycles	`903672` cycles	`1.00`
`ML-DSA-65 verify`	`275283` cycles	`274680` cycles	`1.00`
`ML-DSA-87 keypair`	`448241` cycles	`448464` cycles	`1.00`
`ML-DSA-87 sign`	`1158654` cycles	`1157970` cycles	`1.00`
`ML-DSA-87 verify`	`458704` cycles	`458043` cycles	`1.00`

oqs-bot

AMD EPYC 4th gen (c7a)

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 keypair`	`42142` cycles	`40662` cycles	`1.04`
`ML-DSA-44 sign`	`134317` cycles	`132808` cycles	`1.01`
`ML-DSA-44 verify`	`44844` cycles	`43607` cycles	`1.03`
`ML-DSA-65 keypair`	`72940` cycles	`71859` cycles	`1.02`
`ML-DSA-65 sign`	`213861` cycles	`213367` cycles	`1.00`
`ML-DSA-65 verify`	`73729` cycles	`72847` cycles	`1.01`
`ML-DSA-87 keypair`	`107003` cycles	`109237` cycles	`0.98`
`ML-DSA-87 sign`	`250851` cycles	`254550` cycles	`0.99`
`ML-DSA-87 verify`	`107681` cycles	`109371` cycles	`0.98`

oqs-bot

AMD EPYC 4th gen (c7a) (no-opt)

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 keypair`	`120754` cycles	`120325` cycles	`1.00`
`ML-DSA-44 sign`	`447570` cycles	`447576` cycles	`1.00`
`ML-DSA-44 verify`	`130511` cycles	`130561` cycles	`1.00`
`ML-DSA-65 keypair`	`205040` cycles	`205018` cycles	`1.00`
`ML-DSA-65 sign`	`728790` cycles	`729474` cycles	`1.00`
`ML-DSA-65 verify`	`210029` cycles	`209605` cycles	`1.00`
`ML-DSA-87 keypair`	`337610` cycles	`336678` cycles	`1.00`
`ML-DSA-87 sign`	`925517` cycles	`924223` cycles	`1.00`
`ML-DSA-87 verify`	`347563` cycles	`347399` cycles	`1.00`

oqs-bot

Graviton3 (no-opt)

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 keypair`	`138744` cycles	`138561` cycles	`1.00`
`ML-DSA-44 sign`	`483982` cycles	`484140` cycles	`1.00`
`ML-DSA-44 verify`	`148574` cycles	`162388` cycles	`0.91`
`ML-DSA-65 keypair`	`241921` cycles	`241950` cycles	`1.00`
`ML-DSA-65 sign`	`792702` cycles	`792591` cycles	`1.00`
`ML-DSA-65 verify`	`240763` cycles	`241288` cycles	`1.00`
`ML-DSA-87 keypair`	`396106` cycles	`397138` cycles	`1.00`
`ML-DSA-87 sign`	`1013453` cycles	`1013569` cycles	`1.00`
`ML-DSA-87 verify`	`403446` cycles	`403178` cycles	`1.00`

oqs-bot

Graviton2

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 keypair`	`113189` cycles	`113255` cycles	`1.00`
`ML-DSA-44 sign`	`355791` cycles	`356042` cycles	`1.00`
`ML-DSA-44 verify`	`117978` cycles	`117969` cycles	`1.00`
`ML-DSA-65 keypair`	`196342` cycles	`196623` cycles	`1.00`
`ML-DSA-65 sign`	`589183` cycles	`589242` cycles	`1.00`
`ML-DSA-65 verify`	`194553` cycles	`194559` cycles	`1.00`
`ML-DSA-87 keypair`	`322537` cycles	`322281` cycles	`1.00`
`ML-DSA-87 sign`	`753613` cycles	`753546` cycles	`1.00`
`ML-DSA-87 verify`	`320115` cycles	`320070` cycles	`1.00`

oqs-bot

Graviton2 (no-opt)

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 keypair`	`213219` cycles	`212521` cycles	`1.00`
`ML-DSA-44 sign`	`761553` cycles	`760970` cycles	`1.00`
`ML-DSA-44 verify`	`241351` cycles	`234237` cycles	`1.03`
`ML-DSA-65 keypair`	`380573` cycles	`379762` cycles	`1.00`
`ML-DSA-65 sign`	`1252452` cycles	`1252199` cycles	`1.00`
`ML-DSA-65 verify`	`372839` cycles	`371797` cycles	`1.00`
`ML-DSA-87 keypair`	`607341` cycles	`604584` cycles	`1.00`
`ML-DSA-87 sign`	`1596680` cycles	`1595561` cycles	`1.00`
`ML-DSA-87 verify`	`619175` cycles	`618927` cycles	`1.00`

oqs-bot

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Graviton2 (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 verify`	`241351` cycles	`234237` cycles	`1.03`
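
The ratio in the alert is displayed as `1.03`, the same value as the threshold, which can look like a tie. A minimal C sketch of the arithmetic follows, assuming the alert compares the unrounded current/previous ratio against the threshold. That comparison rule and the helper names `bench_ratio` and `exceeds_threshold` are assumptions for illustration; the thread does not show how github-action-benchmark evaluates it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Ratio of current to previous cycle counts; values above 1.0 mean the
 * current commit is slower. */
static double bench_ratio(uint64_t current, uint64_t previous)
{
  return (double)current / (double)previous;
}

/* Assumed alert rule: flag when the unrounded ratio exceeds the
 * threshold. */
static bool exceeds_threshold(uint64_t current, uint64_t previous,
                              double threshold)
{
  return bench_ratio(current, previous) > threshold;
}
```

With the figures above, 241351 / 234237 is roughly 1.0304, which is just over 1.03 before rounding to two decimals.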

oqs-bot · 2026-04-03T04:29:59Z

CBMC Results (ML-DSA-44)

Full Results (206 proofs)

Proof	Status	Current	Previous	Change
`TOTAL`	✅	1799s	1701s	+5.8%
`mld_invntt_layer`	✅	280s	265s	+6%
`rej_uniform_native`	✅	150s	148s	+1%
`polyvecl_pointwise_acc_montgomery_c`	✅	118s	113s	+4%
`poly_pointwise_montgomery_c`	✅	101s	91s	+11%
`mld_ct_memcmp`	✅	71s	64s	+11%
`mld_attempt_signature_generation`	✅	62s	60s	+3%
`mld_ntt_layer`	✅	43s	43s	+0%
`fqmul`	✅	42s	39s	+8%
`rej_uniform_native_x86_64`	✅	37s	-	new
`sign_verify_internal`	✅	28s	24s	+17%
`polyvec_matrix_expand`	✅	25s	25s	+0%
`keccakf1600x4_permute_native`	✅	24s	22s	+9%
`mld_ntt_butterfly_block`	✅	23s	22s	+5%
`poly_chknorm_c`	✅	23s	19s	+21%
`rej_uniform`	✅	20s	23s	-13%
`sign_signature_internal`	✅	18s	17s	+6%
`poly_add`	✅	16s	10s	+60%
`mld_check_pct`	✅	15s	14s	+7%
`polyt0_unpack`	✅	15s	17s	-12%
`compute_pack_t0_t1`	✅	14s	13s	+8%
`polyeta_unpack`	✅	14s	15s	-7%
`polyveck_chknorm`	✅	13s	13s	+0%
`rej_uniform_c`	✅	13s	13s	+0%
`poly_uniform_eta_4x`	✅	11s	13s	-15%
`polyz_unpack_c`	✅	11s	11s	+0%
`keccak_absorb_once_x4`	✅	10s	8s	+25%
`poly_uniform_4x`	✅	10s	12s	-17%
`polyvec_matrix_expand_serial`	✅	9s	8s	+12%
`polyvec_matrix_pointwise_montgomery_yvec`	✅	9s	9s	+0%
`mld_keccakf1600_permute_c`	✅	8s	7s	+14%
`poly_invntt_tomont_c`	✅	8s	7s	+14%
`polyveck_decompose`	✅	8s	6s	+33%
`sign`	✅	8s	7s	+14%
`mld_compute_pack_z`	✅	7s	7s	+0%
`mld_sample_s1_s2`	✅	7s	3s	+133%
`sign_signature_pre_hash_shake256`	✅	7s	6s	+17%
`keccak_absorb`	✅	6s	8s	-25%
`pointwise_acc_native_aarch64`	✅	6s	4s	+50%
`poly_caddq_c`	✅	6s	6s	+0%
`poly_use_hint_c`	✅	6s	6s	+0%
`polyveck_unpack_eta`	✅	6s	4s	+50%
`polyvecl_chknorm`	✅	6s	4s	+50%
`mld_ct_abs_i32`	✅	5s	3s	+67%
`pack_sig_h`	✅	5s	1s	+400%
`pointwise_acc_native_x86_64`	✅	5s	6s	-17%
`pointwise_native_aarch64`	✅	5s	1s	+400%
`poly_challenge`	✅	5s	5s	+0%
`poly_ntt`	✅	5s	5s	+0%
`poly_pointwise_montgomery_native`	✅	5s	2s	+150%
`poly_power2round`	✅	5s	6s	-17%
`polyveck_invntt_tomont`	✅	5s	3s	+67%
`polyvecl_pointwise_acc_montgomery_native`	✅	5s	2s	+150%
`polyw1_pack_88`	✅	5s	3s	+67%
`sign_open`	✅	5s	4s	+25%
`unpack_sk_s1hat`	✅	5s	4s	+25%
`keccak_squeezeblocks_x4`	✅	4s	4s	+0%
`keccakf1600_permute`	✅	4s	2s	+100%
`keccakf1600x4_xor_bytes_native`	✅	4s	4s	+0%
`make_hint`	✅	4s	3s	+33%
`mld_ct_cmask_nonzero_u32`	✅	4s	1s	+300%
`mld_polymat_expand_entry`	✅	4s	3s	+33%
`mld_value_barrier_i64`	✅	4s	3s	+33%
`mld_value_barrier_u32`	✅	4s	3s	+33%
`montgomery_reduce`	✅	4s	3s	+33%
`ntt_native_aarch64`	✅	4s	3s	+33%
`pack_sk_rho_key_tr_s2`	✅	4s	3s	+33%
`poly_chknorm_native`	✅	4s	2s	+100%
`poly_chknorm_native_aarch64`	✅	4s	2s	+100%
`poly_decompose_native`	✅	4s	1s	+300%
`poly_invntt_tomont`	✅	4s	4s	+0%
`poly_ntt_native`	✅	4s	2s	+100%
`poly_reduce`	✅	4s	2s	+100%
`poly_sub`	✅	4s	3s	+33%
`poly_uniform`	✅	4s	3s	+33%
`poly_uniform_gamma1`	✅	4s	2s	+100%
`poly_uniform_gamma1_4x`	✅	4s	3s	+33%
`poly_use_hint_native`	✅	4s	2s	+100%
`poly_use_hint_native_aarch64`	✅	4s	2s	+100%
`polyt0_pack`	✅	4s	5s	-20%
`polyvec_matrix_pointwise_montgomery_row`	✅	4s	3s	+33%
`polyveck_pack_eta`	✅	4s	3s	+33%
`polyvecl_ntt`	✅	4s	7s	-43%
`polyvecl_pack_eta`	✅	4s	4s	+0%
`polyvecl_uniform_gamma1`	✅	4s	2s	+100%
`polyw1_pack_32`	✅	4s	1s	+300%
`power2round`	✅	4s	4s	+0%
`shake128_absorb`	✅	4s	2s	+100%
`shake128_finalize`	✅	4s	3s	+33%
`shake128_release`	✅	4s	2s	+100%
`shake128x4_absorb_once`	✅	4s	4s	+0%
`shake256x4_absorb_once`	✅	4s	2s	+100%
`sign_keypair`	✅	4s	5s	-20%
`sign_pk_from_sk`	✅	4s	5s	-20%
`sign_verify_extmu`	✅	4s	4s	+0%
`sign_verify_pre_hash_shake256`	✅	4s	6s	-33%
`sk_s2hat_get_poly`	✅	4s	2s	+100%
`caddq`	✅	3s	3s	+0%
`intt_native_x86_64`	✅	3s	5s	-40%
`keccak_f1600_x4_native_aarch64_v84a`	✅	3s	2s	+50%
`keccak_f1600_x4_native_avx2`	✅	3s	4s	-25%
`keccak_squeeze`	✅	3s	3s	+0%
`keccakf1600_xor_bytes`	✅	3s	2s	+50%
`keccakf1600x4_extract_bytes_native`	✅	3s	3s	+0%
`keccakf1600x4_xor_bytes`	✅	3s	3s	+0%
`mld_ct_cmask_neg_i32`	✅	3s	1s	+200%
`mld_ct_get_optblocker_i64`	✅	3s	1s	+200%
`mld_h`	✅	3s	2s	+50%
`mld_keccakf1600x4_xor_bytes_c`	✅	3s	3s	+0%
`mld_prepare_domain_separation_prefix`	✅	3s	2s	+50%
`mld_sample_s1_s2_serial`	✅	3s	3s	+0%
`nttunpack_native_x86_64`	✅	3s	3s	+0%
`pack_sig_c`	✅	3s	2s	+50%
`pack_sig_z`	✅	3s	2s	+50%
`pointwise_native_x86_64`	✅	3s	5s	-40%
`poly_caddq`	✅	3s	3s	+0%
`poly_chknorm`	✅	3s	6s	-50%
`poly_decompose`	✅	3s	3s	+0%
`poly_decompose_88_native_aarch64`	✅	3s	3s	+0%
`poly_decompose_c`	✅	3s	2s	+50%
`poly_invntt_tomont_native`	✅	3s	2s	+50%
`poly_pointwise_montgomery`	✅	3s	3s	+0%
`poly_uniform_eta`	✅	3s	4s	-25%
`polyt1_pack`	✅	3s	4s	-25%
`polyveck_caddq`	✅	3s	3s	+0%
`polyveck_ntt`	✅	3s	5s	-40%
`polyveck_reduce`	✅	3s	4s	-25%
`polyw1_pack`	✅	3s	3s	+0%
`polyz_unpack_17_native_aarch64`	✅	3s	3s	+0%
`polyz_unpack_19_native_aarch64`	✅	3s	5s	-40%
`reduce32`	✅	3s	3s	+0%
`rej_eta_c`	✅	3s	3s	+0%
`rej_uniform_eta_native_aarch64`	✅	3s	5s	-40%
`rej_uniform_native_aarch64`	✅	3s	5s	-40%
`shake128_squeeze`	✅	3s	4s	-25%
`shake128x4_squeezeblocks`	✅	3s	2s	+50%
`shake256_absorb`	✅	3s	3s	+0%
`shake256_finalize`	✅	3s	2s	+50%
`shake256_release`	✅	3s	4s	-25%
`shake256_squeeze`	✅	3s	3s	+0%
`shake256x4_squeezeblocks`	✅	3s	3s	+0%
`sign_signature`	✅	3s	3s	+0%
`sign_signature_extmu`	✅	3s	5s	-40%
`sign_signature_pre_hash_internal`	✅	3s	4s	-25%
`sign_verify`	✅	3s	3s	+0%
`sign_verify_pre_hash_internal`	✅	3s	3s	+0%
`unpack_sk`	✅	3s	3s	+0%
`unpack_sk_t0hat`	✅	3s	4s	-25%
`use_hint`	✅	3s	3s	+0%
`intt_native_aarch64`	✅	2s	5s	-60%
`keccak_f1600_x1_native_aarch64`	✅	2s	3s	-33%
`keccak_finalize`	✅	2s	2s	+0%
`keccak_init`	✅	2s	2s	+0%
`keccakf1600_extract_bytes (big endian)`	✅	2s	3s	-33%
`keccakf1600_permute_native`	✅	2s	4s	-50%
`keccakf1600_xor_bytes (big endian)`	✅	2s	2s	+0%
`mld_ct_cmask_nonzero_u8`	✅	2s	4s	-50%
`mld_ct_get_optblocker_u8`	✅	2s	3s	-33%
`mld_keccakf1600_extract_bytes`	✅	2s	3s	-33%
`mld_keccakf1600x4_extract_bytes_c`	✅	2s	3s	-33%
`ntt_native_x86_64`	✅	2s	3s	-33%
`pack_sk_s1`	✅	2s	3s	-33%
`poly_caddq_native`	✅	2s	4s	-50%
`poly_caddq_native_aarch64`	✅	2s	3s	-33%
`poly_caddq_native_x86_64`	✅	2s	5s	-60%
`poly_chknorm_native_x86_64`	✅	2s	3s	-33%
`poly_decompose_32_native_aarch64`	✅	2s	2s	+0%
`poly_ntt_c`	✅	2s	4s	-50%
`poly_permute_bitrev_to_custom_optional`	✅	2s	4s	-50%
`poly_permute_bitrev_to_custom_optional_native`	✅	2s	3s	-33%
`poly_shiftl`	✅	2s	5s	-60%
`poly_use_hint`	✅	2s	4s	-50%
`polyeta_pack`	✅	2s	3s	-33%
`polyt1_unpack`	✅	2s	2s	+0%
`polyveck_pack_w1`	✅	2s	3s	-33%
`polyvecl_pointwise_acc_montgomery`	✅	2s	3s	-33%
`polyvecl_uniform_gamma1_serial`	✅	2s	4s	-50%
`polyvecl_unpack_eta`	✅	2s	2s	+0%
`polyvecl_unpack_z`	✅	2s	2s	+0%
`polyz_unpack_native_x86_64`	✅	2s	1s	+100%
`rej_eta_native`	✅	2s	3s	-33%
`shake128_init`	✅	2s	2s	+0%
`shake256_init`	✅	2s	1s	+100%
`sig_unpack_hints`	✅	2s	2s	+0%
`sign_keypair_internal`	✅	2s	2s	+0%
`sk_s1hat_get_poly`	✅	2s	2s	+0%
`sk_t0hat_get_poly`	✅	2s	5s	-60%
`unpack_pk_t1`	✅	2s	4s	-50%
`unpack_sk_s2hat`	✅	2s	4s	-50%
`yvec_get_poly`	✅	2s	2s	+0%
`yvec_init`	✅	2s	1s	+100%
`decompose`	✅	1s	2s	-50%
`fqscale`	✅	1s	2s	-50%
`keccak_f1600_x1_native_aarch64_v84a`	✅	1s	2s	-50%
`keccak_f1600_x4_native_aarch64_v8a_scalar_hybrid`	✅	1s	2s	-50%
`keccak_f1600_x4_native_aarch64_v8a_v84a_scalar_hybrid`	✅	1s	2s	-50%
`keccakf1600x4_extract_bytes`	✅	1s	2s	-50%
`keccakf1600x4_permute`	✅	1s	2s	-50%
`mld_ct_get_optblocker_u32`	✅	1s	2s	-50%
`mld_ct_sel_int32`	✅	1s	2s	-50%
`mld_value_barrier_u8`	✅	1s	2s	-50%
`polyz_pack`	✅	1s	2s	-50%
`polyz_unpack`	✅	1s	4s	-75%
`polyz_unpack_native`	✅	1s	1s	+0%
`rej_eta`	✅	1s	3s	-67%
`shake256`	✅	1s	3s	-67%
`sys_check_capability`	✅	1s	1s	+0%

oqs-bot · 2026-04-03T04:30:34Z

CBMC Results (ML-DSA-87)

Full Results (206 proofs)

Proof	Status	Current	Previous	Change
`TOTAL`	✅	2406s	2603s	-7.6%
`polyvecl_pointwise_acc_montgomery_c`	✅	314s	356s	-12%
`mld_invntt_layer`	✅	282s	318s	-11%
`polyvec_matrix_expand`	✅	219s	249s	-12%
`rej_uniform_native`	✅	149s	158s	-6%
`mld_attempt_signature_generation`	✅	104s	111s	-6%
`poly_pointwise_montgomery_c`	✅	101s	110s	-8%
`mld_ct_memcmp`	✅	65s	78s	-17%
`sign_signature_internal`	✅	63s	66s	-5%
`sign_verify_internal`	✅	63s	60s	+5%
`polyvec_matrix_expand_serial`	✅	48s	48s	+0%
`mld_ntt_layer`	✅	44s	45s	-2%
`fqmul`	✅	43s	43s	+0%
`rej_uniform_native_x86_64`	✅	38s	-	new
`compute_pack_t0_t1`	✅	32s	35s	-9%
`polyvec_matrix_pointwise_montgomery_yvec`	✅	30s	32s	-6%
`keccakf1600x4_permute_native`	✅	25s	23s	+9%
`rej_uniform`	✅	23s	23s	+0%
`mld_ntt_butterfly_block`	✅	22s	22s	+0%
`poly_chknorm_c`	✅	21s	21s	+0%
`mld_check_pct`	✅	16s	16s	+0%
`polyeta_unpack`	✅	14s	18s	-22%
`polyt0_unpack`	✅	14s	19s	-26%
`poly_uniform_eta_4x`	✅	12s	15s	-20%
`rej_uniform_c`	✅	12s	15s	-20%
`polyveck_decompose`	✅	11s	10s	+10%
`pointwise_acc_native_aarch64`	✅	10s	7s	+43%
`poly_add`	✅	10s	11s	-9%
`poly_uniform_4x`	✅	10s	12s	-17%
`keccak_absorb_once_x4`	✅	9s	10s	-10%
`polyveck_invntt_tomont`	✅	9s	8s	+12%
`keccak_absorb`	✅	8s	7s	+14%
`pointwise_acc_native_x86_64`	✅	8s	8s	+0%
`poly_decompose_c`	✅	8s	8s	+0%
`poly_invntt_tomont_c`	✅	8s	9s	-11%
`polyveck_caddq`	✅	8s	7s	+14%
`polyvecl_ntt`	✅	8s	10s	-20%
`mld_sample_s1_s2_serial`	✅	7s	9s	-22%
`polyveck_ntt`	✅	7s	7s	+0%
`sign`	✅	7s	8s	-12%
`sign_verify_pre_hash_internal`	✅	7s	6s	+17%
`unpack_sk_t0hat`	✅	7s	6s	+17%
`mld_compute_pack_z`	✅	6s	7s	-14%
`mld_keccakf1600_permute_c`	✅	6s	5s	+20%
`montgomery_reduce`	✅	6s	2s	+200%
`pack_sig_h`	✅	6s	4s	+50%
`poly_caddq_c`	✅	6s	5s	+20%
`poly_power2round`	✅	6s	5s	+20%
`polyeta_pack`	✅	6s	3s	+100%
`polyveck_pack_eta`	✅	6s	2s	+200%
`polyz_unpack_c`	✅	6s	5s	+20%
`sign_keypair_internal`	✅	6s	5s	+20%
`sign_verify_pre_hash_shake256`	✅	6s	7s	-14%
`mld_sample_s1_s2`	✅	5s	6s	-17%
`poly_caddq`	✅	5s	5s	+0%
`poly_challenge`	✅	5s	4s	+25%
`poly_permute_bitrev_to_custom_optional_native`	✅	5s	4s	+25%
`poly_pointwise_montgomery`	✅	5s	4s	+25%
`polyt0_pack`	✅	5s	2s	+150%
`polyt1_pack`	✅	5s	3s	+67%
`polyt1_unpack`	✅	5s	7s	-29%
`polyveck_chknorm`	✅	5s	8s	-38%
`polyveck_pack_w1`	✅	5s	3s	+67%
`polyveck_unpack_eta`	✅	5s	5s	+0%
`polyvecl_chknorm`	✅	5s	8s	-38%
`polyz_pack`	✅	5s	3s	+67%
`sign_open`	✅	5s	5s	+0%
`sign_pk_from_sk`	✅	5s	7s	-29%
`yvec_init`	✅	5s	2s	+150%
`intt_native_aarch64`	✅	4s	4s	+0%
`keccak_squeezeblocks_x4`	✅	4s	3s	+33%
`keccakf1600_extract_bytes (big endian)`	✅	4s	5s	-20%
`keccakf1600_xor_bytes (big endian)`	✅	4s	3s	+33%
`keccakf1600x4_extract_bytes_native`	✅	4s	4s	+0%
`pack_sk_rho_key_tr_s2`	✅	4s	6s	-33%
`poly_caddq_native_aarch64`	✅	4s	3s	+33%
`poly_chknorm`	✅	4s	4s	+0%
`poly_decompose_native`	✅	4s	5s	-20%
`poly_shiftl`	✅	4s	5s	-20%
`poly_sub`	✅	4s	5s	-20%
`poly_use_hint_c`	✅	4s	3s	+33%
`poly_use_hint_native_aarch64`	✅	4s	1s	+300%
`polyvecl_pointwise_acc_montgomery_native`	✅	4s	3s	+33%
`polyvecl_uniform_gamma1_serial`	✅	4s	3s	+33%
`polyw1_pack_32`	✅	4s	3s	+33%
`polyz_unpack_17_native_aarch64`	✅	4s	4s	+0%
`polyz_unpack_native_x86_64`	✅	4s	5s	-20%
`rej_eta`	✅	4s	2s	+100%
`rej_eta_native`	✅	4s	4s	+0%
`rej_uniform_native_aarch64`	✅	4s	5s	-20%
`shake128x4_absorb_once`	✅	4s	3s	+33%
`shake256_release`	✅	4s	3s	+33%
`shake256_squeeze`	✅	4s	2s	+100%
`sign_signature`	✅	4s	4s	+0%
`sign_signature_extmu`	✅	4s	4s	+0%
`sign_signature_pre_hash_internal`	✅	4s	5s	-20%
`sign_signature_pre_hash_shake256`	✅	4s	5s	-20%
`unpack_sk_s1hat`	✅	4s	4s	+0%
`use_hint`	✅	4s	3s	+33%
`caddq`	✅	3s	3s	+0%
`decompose`	✅	3s	2s	+50%
`intt_native_x86_64`	✅	3s	3s	+0%
`keccak_f1600_x4_native_aarch64_v8a_scalar_hybrid`	✅	3s	1s	+200%
`keccak_finalize`	✅	3s	1s	+200%
`keccak_squeeze`	✅	3s	2s	+50%
`keccakf1600_permute`	✅	3s	2s	+50%
`keccakf1600x4_permute`	✅	3s	3s	+0%
`keccakf1600x4_xor_bytes_native`	✅	3s	2s	+50%
`make_hint`	✅	3s	4s	-25%
`mld_h`	✅	3s	3s	+0%
`mld_keccakf1600_extract_bytes`	✅	3s	2s	+50%
`mld_keccakf1600x4_extract_bytes_c`	✅	3s	1s	+200%
`mld_keccakf1600x4_xor_bytes_c`	✅	3s	4s	-25%
`mld_polymat_expand_entry`	✅	3s	4s	-25%
`mld_prepare_domain_separation_prefix`	✅	3s	2s	+50%
`mld_value_barrier_u8`	✅	3s	3s	+0%
`ntt_native_aarch64`	✅	3s	4s	-25%
`pack_sig_c`	✅	3s	3s	+0%
`pack_sk_s1`	✅	3s	4s	-25%
`pointwise_native_aarch64`	✅	3s	5s	-40%
`poly_caddq_native_x86_64`	✅	3s	3s	+0%
`poly_decompose_88_native_aarch64`	✅	3s	5s	-40%
`poly_ntt_native`	✅	3s	4s	-25%
`poly_permute_bitrev_to_custom_optional`	✅	3s	4s	-25%
`poly_pointwise_montgomery_native`	✅	3s	3s	+0%
`poly_reduce`	✅	3s	2s	+50%
`poly_uniform_eta`	✅	3s	2s	+50%
`poly_uniform_gamma1`	✅	3s	3s	+0%
`poly_use_hint`	✅	3s	2s	+50%
`polyveck_reduce`	✅	3s	2s	+50%
`polyw1_pack`	✅	3s	4s	-25%
`polyz_unpack`	✅	3s	4s	-25%
`rej_eta_c`	✅	3s	4s	-25%
`shake128_squeeze`	✅	3s	2s	+50%
`shake256_finalize`	✅	3s	1s	+200%
`shake256x4_squeezeblocks`	✅	3s	2s	+50%
`sig_unpack_hints`	✅	3s	5s	-40%
`sign_keypair`	✅	3s	4s	-25%
`sign_verify`	✅	3s	6s	-50%
`sign_verify_extmu`	✅	3s	5s	-40%
`unpack_pk_t1`	✅	3s	5s	-40%
`unpack_sk`	✅	3s	4s	-25%
`fqscale`	✅	2s	4s	-50%
`keccak_f1600_x1_native_aarch64`	✅	2s	1s	+100%
`keccak_f1600_x1_native_aarch64_v84a`	✅	2s	2s	+0%
`keccak_f1600_x4_native_aarch64_v8a_v84a_scalar_hybrid`	✅	2s	4s	-50%
`keccak_f1600_x4_native_avx2`	✅	2s	4s	-50%
`keccak_init`	✅	2s	5s	-60%
`keccakf1600_permute_native`	✅	2s	5s	-60%
`mld_ct_abs_i32`	✅	2s	4s	-50%
`mld_ct_cmask_neg_i32`	✅	2s	4s	-50%
`mld_ct_cmask_nonzero_u32`	✅	2s	2s	+0%
`mld_ct_cmask_nonzero_u8`	✅	2s	4s	-50%
`mld_ct_get_optblocker_i64`	✅	2s	4s	-50%
`mld_ct_get_optblocker_u32`	✅	2s	4s	-50%
`mld_ct_get_optblocker_u8`	✅	2s	5s	-60%
`mld_value_barrier_i64`	✅	2s	3s	-33%
`ntt_native_x86_64`	✅	2s	3s	-33%
`nttunpack_native_x86_64`	✅	2s	2s	+0%
`pack_sig_z`	✅	2s	3s	-33%
`pointwise_native_x86_64`	✅	2s	2s	+0%
`poly_caddq_native`	✅	2s	3s	-33%
`poly_chknorm_native`	✅	2s	4s	-50%
`poly_chknorm_native_aarch64`	✅	2s	4s	-50%
`poly_chknorm_native_x86_64`	✅	2s	6s	-67%
`poly_decompose_32_native_aarch64`	✅	2s	5s	-60%
`poly_invntt_tomont`	✅	2s	2s	+0%
`poly_invntt_tomont_native`	✅	2s	2s	+0%
`poly_uniform`	✅	2s	5s	-60%
`poly_uniform_gamma1_4x`	✅	2s	4s	-50%
`poly_use_hint_native`	✅	2s	3s	-33%
`polyvec_matrix_pointwise_montgomery_row`	✅	2s	2s	+0%
`polyvecl_pack_eta`	✅	2s	4s	-50%
`polyvecl_pointwise_acc_montgomery`	✅	2s	2s	+0%
`polyvecl_uniform_gamma1`	✅	2s	5s	-60%
`polyvecl_unpack_eta`	✅	2s	4s	-50%
`polyvecl_unpack_z`	✅	2s	3s	-33%
`polyz_unpack_19_native_aarch64`	✅	2s	2s	+0%
`polyz_unpack_native`	✅	2s	3s	-33%
`reduce32`	✅	2s	2s	+0%
`rej_uniform_eta_native_aarch64`	✅	2s	3s	-33%
`shake128_absorb`	✅	2s	3s	-33%
`shake128_finalize`	✅	2s	1s	+100%
`shake128x4_squeezeblocks`	✅	2s	2s	+0%
`shake256`	✅	2s	2s	+0%
`shake256x4_absorb_once`	✅	2s	4s	-50%
`sk_s1hat_get_poly`	✅	2s	2s	+0%
`sk_s2hat_get_poly`	✅	2s	4s	-50%
`sk_t0hat_get_poly`	✅	2s	2s	+0%
`sys_check_capability`	✅	2s	4s	-50%
`unpack_sk_s2hat`	✅	2s	2s	+0%
`yvec_get_poly`	✅	2s	3s	-33%
`keccak_f1600_x4_native_aarch64_v84a`	✅	1s	4s	-75%
`keccakf1600_xor_bytes`	✅	1s	4s	-75%
`keccakf1600x4_extract_bytes`	✅	1s	2s	-50%
`keccakf1600x4_xor_bytes`	✅	1s	3s	-67%
`mld_ct_sel_int32`	✅	1s	4s	-75%
`mld_value_barrier_u32`	✅	1s	3s	-67%
`poly_decompose`	✅	1s	2s	-50%
`poly_ntt`	✅	1s	3s	-67%
`poly_ntt_c`	✅	1s	2s	-50%
`polyw1_pack_88`	✅	1s	3s	-67%
`power2round`	✅	1s	5s	-80%
`shake128_init`	✅	1s	3s	-67%
`shake128_release`	✅	1s	2s	-50%
`shake256_absorb`	✅	1s	2s	-50%
`shake256_init`	✅	1s	2s	-50%

oqs-bot · 2026-04-03T04:31:33Z

CBMC Results (ML-DSA-65)

Full Results (206 proofs)

Proof	Status	Current	Previous	Change
`TOTAL`	✅	2300s	2198s	+4.6%
`mld_invntt_layer`	✅	324s	314s	+3%
`polyvecl_pointwise_acc_montgomery_c`	✅	246s	233s	+6%
`rej_uniform_native`	✅	158s	159s	-1%
`polyvec_matrix_expand`	✅	141s	134s	+5%
`poly_pointwise_montgomery_c`	✅	116s	111s	+5%
`mld_ct_memcmp`	✅	76s	71s	+7%
`mld_attempt_signature_generation`	✅	70s	69s	+1%
`sign_verify_internal`	✅	67s	64s	+5%
`sign_signature_internal`	✅	52s	50s	+4%
`mld_ntt_layer`	✅	46s	45s	+2%
`fqmul`	✅	44s	43s	+2%
`rej_uniform_native_x86_64`	✅	40s	-	new
`polyvec_matrix_expand_serial`	✅	27s	26s	+4%
`mld_ntt_butterfly_block`	✅	25s	22s	+14%
`keccakf1600x4_permute_native`	✅	24s	23s	+4%
`rej_uniform`	✅	24s	24s	+0%
`poly_chknorm_c`	✅	23s	22s	+5%
`mld_check_pct`	✅	19s	16s	+19%
`polyvecl_chknorm`	✅	19s	17s	+12%
`polyt0_unpack`	✅	18s	18s	+0%
`compute_pack_t0_t1`	✅	16s	17s	-6%
`rej_uniform_c`	✅	15s	13s	+15%
`polyveck_decompose`	✅	14s	15s	-7%
`keccak_absorb_once_x4`	✅	12s	11s	+9%
`poly_uniform_4x`	✅	12s	10s	+20%
`poly_uniform_eta_4x`	✅	12s	12s	+0%
`poly_add`	✅	11s	10s	+10%
`mld_keccakf1600_permute_c`	✅	10s	6s	+67%
`polyvec_matrix_pointwise_montgomery_yvec`	✅	10s	8s	+25%
`polyveck_chknorm`	✅	10s	11s	-9%
`poly_caddq_c`	✅	9s	7s	+29%
`poly_invntt_tomont_c`	✅	9s	10s	-10%
`polyveck_ntt`	✅	9s	11s	-18%
`polyvecl_ntt`	✅	9s	10s	-10%
`keccak_absorb`	✅	8s	7s	+14%
`mld_compute_pack_z`	✅	8s	10s	-20%
`sign`	✅	8s	7s	+14%
`mld_sample_s1_s2`	✅	7s	7s	+0%
`polyveck_caddq`	✅	7s	5s	+40%
`polyveck_invntt_tomont`	✅	7s	8s	-12%
`polyz_unpack_c`	✅	7s	7s	+0%
`sign_keypair_internal`	✅	7s	5s	+40%
`keccak_squeeze`	✅	6s	2s	+200%
`pointwise_acc_native_aarch64`	✅	6s	7s	-14%
`pointwise_acc_native_x86_64`	✅	6s	7s	-14%
`poly_challenge`	✅	6s	5s	+20%
`poly_pointwise_montgomery_native`	✅	6s	2s	+200%
`poly_power2round`	✅	6s	6s	+0%
`poly_reduce`	✅	6s	2s	+200%
`rej_uniform_eta_native_aarch64`	✅	6s	5s	+20%
`sign_open`	✅	6s	6s	+0%
`sign_pk_from_sk`	✅	6s	7s	-14%
`sign_verify_pre_hash_shake256`	✅	6s	2s	+200%
`unpack_sk_t0hat`	✅	6s	7s	-14%
`fqscale`	✅	5s	3s	+67%
`mld_sample_s1_s2_serial`	✅	5s	5s	+0%
`poly_decompose_c`	✅	5s	6s	-17%
`poly_uniform_gamma1`	✅	5s	2s	+150%
`poly_use_hint_native_aarch64`	✅	5s	2s	+150%
`polyvecl_unpack_eta`	✅	5s	4s	+25%
`polyw1_pack_32`	✅	5s	3s	+67%
`power2round`	✅	5s	3s	+67%
`sign_verify`	✅	5s	3s	+67%
`sk_t0hat_get_poly`	✅	5s	2s	+150%
`unpack_pk_t1`	✅	5s	4s	+25%
`unpack_sk_s1hat`	✅	5s	3s	+67%
`keccak_f1600_x1_native_aarch64_v84a`	✅	4s	2s	+100%
`keccak_squeezeblocks_x4`	✅	4s	3s	+33%
`keccakf1600x4_permute`	✅	4s	4s	+0%
`mld_ct_abs_i32`	✅	4s	1s	+300%
`mld_ct_cmask_nonzero_u32`	✅	4s	3s	+33%
`pack_sk_s1`	✅	4s	3s	+33%
`pointwise_native_aarch64`	✅	4s	3s	+33%
`poly_chknorm`	✅	4s	3s	+33%
`poly_chknorm_native`	✅	4s	6s	-33%
`poly_decompose_native`	✅	4s	5s	-20%
`poly_invntt_tomont_native`	✅	4s	2s	+100%
`poly_permute_bitrev_to_custom_optional`	✅	4s	2s	+100%
`poly_permute_bitrev_to_custom_optional_native`	✅	4s	4s	+0%
`poly_use_hint`	✅	4s	3s	+33%
`poly_use_hint_c`	✅	4s	4s	+0%
`poly_use_hint_native`	✅	4s	3s	+33%
`polyt1_unpack`	✅	4s	1s	+300%
`polyveck_pack_eta`	✅	4s	2s	+100%
`polyveck_unpack_eta`	✅	4s	4s	+0%
`polyvecl_pointwise_acc_montgomery`	✅	4s	4s	+0%
`polyw1_pack_88`	✅	4s	1s	+300%
`shake128_finalize`	✅	4s	2s	+100%
`shake128_squeeze`	✅	4s	2s	+100%
`shake256_finalize`	✅	4s	2s	+100%
`shake256_init`	✅	4s	2s	+100%
`shake256_release`	✅	4s	2s	+100%
`sign_keypair`	✅	4s	4s	+0%
`sign_signature`	✅	4s	5s	-20%
`sign_signature_pre_hash_shake256`	✅	4s	6s	-33%
`sign_verify_extmu`	✅	4s	4s	+0%
`sk_s2hat_get_poly`	✅	4s	3s	+33%
`unpack_sk`	✅	4s	4s	+0%
`yvec_init`	✅	4s	3s	+33%
`intt_native_aarch64`	✅	3s	4s	-25%
`intt_native_x86_64`	✅	3s	3s	+0%
`keccak_f1600_x4_native_aarch64_v84a`	✅	3s	2s	+50%
`keccak_f1600_x4_native_aarch64_v8a_scalar_hybrid`	✅	3s	2s	+50%
`keccakf1600_extract_bytes (big endian)`	✅	3s	3s	+0%
`keccakf1600_permute_native`	✅	3s	4s	-25%
`keccakf1600_xor_bytes`	✅	3s	1s	+200%
`keccakf1600x4_extract_bytes_native`	✅	3s	2s	+50%
`make_hint`	✅	3s	3s	+0%
`mld_ct_cmask_nonzero_u8`	✅	3s	5s	-40%
`mld_ct_get_optblocker_u32`	✅	3s	2s	+50%
`mld_ct_sel_int32`	✅	3s	1s	+200%
`mld_prepare_domain_separation_prefix`	✅	3s	4s	-25%
`mld_value_barrier_u8`	✅	3s	3s	+0%
`ntt_native_aarch64`	✅	3s	2s	+50%
`ntt_native_x86_64`	✅	3s	4s	-25%
`nttunpack_native_x86_64`	✅	3s	7s	-57%
`pack_sig_h`	✅	3s	3s	+0%
`pack_sig_z`	✅	3s	5s	-40%
`pack_sk_rho_key_tr_s2`	✅	3s	1s	+200%
`pointwise_native_x86_64`	✅	3s	3s	+0%
`poly_caddq`	✅	3s	2s	+50%
`poly_caddq_native_aarch64`	✅	3s	3s	+0%
`poly_caddq_native_x86_64`	✅	3s	4s	-25%
`poly_chknorm_native_aarch64`	✅	3s	2s	+50%
`poly_chknorm_native_x86_64`	✅	3s	4s	-25%
`poly_invntt_tomont`	✅	3s	2s	+50%
`poly_ntt`	✅	3s	2s	+50%
`poly_ntt_c`	✅	3s	3s	+0%
`poly_ntt_native`	✅	3s	3s	+0%
`poly_shiftl`	✅	3s	4s	-25%
`poly_uniform`	✅	3s	4s	-25%
`poly_uniform_eta`	✅	3s	4s	-25%
`polyeta_pack`	✅	3s	3s	+0%
`polyt1_pack`	✅	3s	2s	+50%
`polyveck_pack_w1`	✅	3s	2s	+50%
`polyvecl_pack_eta`	✅	3s	3s	+0%
`polyvecl_pointwise_acc_montgomery_native`	✅	3s	3s	+0%
`polyvecl_unpack_z`	✅	3s	5s	-40%
`polyw1_pack`	✅	3s	5s	-40%
`polyz_unpack_19_native_aarch64`	✅	3s	4s	-25%
`reduce32`	✅	3s	3s	+0%
`rej_eta`	✅	3s	3s	+0%
`rej_eta_c`	✅	3s	1s	+200%
`rej_eta_native`	✅	3s	4s	-25%
`rej_uniform_native_aarch64`	✅	3s	3s	+0%
`shake128x4_absorb_once`	✅	3s	3s	+0%
`shake256_squeeze`	✅	3s	2s	+50%
`shake256x4_absorb_once`	✅	3s	1s	+200%
`sig_unpack_hints`	✅	3s	3s	+0%
`sign_signature_extmu`	✅	3s	5s	-40%
`sign_signature_pre_hash_internal`	✅	3s	4s	-25%
`sign_verify_pre_hash_internal`	✅	3s	4s	-25%
`sk_s1hat_get_poly`	✅	3s	1s	+200%
`unpack_sk_s2hat`	✅	3s	5s	-40%
`use_hint`	✅	3s	5s	-40%
`yvec_get_poly`	✅	3s	2s	+50%
`caddq`	✅	2s	4s	-50%
`decompose`	✅	2s	2s	+0%
`keccak_f1600_x4_native_aarch64_v8a_v84a_scalar_hybrid`	✅	2s	1s	+100%
`keccak_finalize`	✅	2s	3s	-33%
`keccak_init`	✅	2s	3s	-33%
`keccakf1600_permute`	✅	2s	1s	+100%
`keccakf1600_xor_bytes (big endian)`	✅	2s	3s	-33%
`keccakf1600x4_extract_bytes`	✅	2s	2s	+0%
`mld_ct_get_optblocker_i64`	✅	2s	1s	+100%
`mld_ct_get_optblocker_u8`	✅	2s	3s	-33%
`mld_h`	✅	2s	4s	-50%
`mld_keccakf1600_extract_bytes`	✅	2s	3s	-33%
`mld_keccakf1600x4_extract_bytes_c`	✅	2s	2s	+0%
`mld_keccakf1600x4_xor_bytes_c`	✅	2s	3s	-33%
`mld_polymat_expand_entry`	✅	2s	3s	-33%
`montgomery_reduce`	✅	2s	3s	-33%
`pack_sig_c`	✅	2s	5s	-60%
`poly_caddq_native`	✅	2s	3s	-33%
`poly_decompose_32_native_aarch64`	✅	2s	6s	-67%
`poly_decompose_88_native_aarch64`	✅	2s	2s	+0%
`poly_pointwise_montgomery`	✅	2s	4s	-50%
`poly_sub`	✅	2s	4s	-50%
`poly_uniform_gamma1_4x`	✅	2s	7s	-71%
`polyeta_unpack`	✅	2s	4s	-50%
`polyt0_pack`	✅	2s	5s	-60%
`polyvec_matrix_pointwise_montgomery_row`	✅	2s	5s	-60%
`polyveck_reduce`	✅	2s	4s	-50%
`polyvecl_uniform_gamma1`	✅	2s	2s	+0%
`polyvecl_uniform_gamma1_serial`	✅	2s	2s	+0%
`polyz_pack`	✅	2s	3s	-33%
`polyz_unpack`	✅	2s	2s	+0%
`polyz_unpack_17_native_aarch64`	✅	2s	2s	+0%
`polyz_unpack_native_x86_64`	✅	2s	4s	-50%
`shake128_absorb`	✅	2s	1s	+100%
`shake128_init`	✅	2s	2s	+0%
`shake128x4_squeezeblocks`	✅	2s	4s	-50%
`shake256`	✅	2s	4s	-50%
`shake256_absorb`	✅	2s	4s	-50%
`shake256x4_squeezeblocks`	✅	2s	2s	+0%
`keccak_f1600_x1_native_aarch64`	✅	1s	2s	-50%
`keccak_f1600_x4_native_avx2`	✅	1s	2s	-50%
`keccakf1600x4_xor_bytes`	✅	1s	3s	-67%
`keccakf1600x4_xor_bytes_native`	✅	1s	2s	-50%
`mld_ct_cmask_neg_i32`	✅	1s	2s	-50%
`mld_value_barrier_i64`	✅	1s	2s	-50%
`mld_value_barrier_u32`	✅	1s	2s	-50%
`poly_decompose`	✅	1s	4s	-75%
`polyz_unpack_native`	✅	1s	5s	-80%
`shake128_release`	✅	1s	2s	-50%
`sys_check_capability`	✅	1s	5s	-80%

oqs-bot

Intel Xeon 4th gen (c7i)

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 keypair`	`34764` cycles	`34374` cycles	`1.01`
`ML-DSA-44 sign`	`120113` cycles	`120132` cycles	`1.00`
`ML-DSA-44 verify`	`38092` cycles	`38166` cycles	`1.00`
`ML-DSA-65 keypair`	`61138` cycles	`60500` cycles	`1.01`
`ML-DSA-65 sign`	`201844` cycles	`199945` cycles	`1.01`
`ML-DSA-65 verify`	`62783` cycles	`62429` cycles	`1.01`
`ML-DSA-87 keypair`	`93501` cycles	`94486` cycles	`0.99`
`ML-DSA-87 sign`	`236815` cycles	`239500` cycles	`0.99`
`ML-DSA-87 verify`	`95619` cycles	`96894` cycles	`0.99`

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot

Intel Xeon 4th gen (c7i) (no-opt)

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 keypair`	`93930` cycles	`93842` cycles	`1.00`
`ML-DSA-44 sign`	`333310` cycles	`333119` cycles	`1.00`
`ML-DSA-44 verify`	`100022` cycles	`100025` cycles	`1.00`
`ML-DSA-65 keypair`	`159902` cycles	`160115` cycles	`1.00`
`ML-DSA-65 sign`	`543114` cycles	`543227` cycles	`1.00`
`ML-DSA-65 verify`	`160989` cycles	`161060` cycles	`1.00`
`ML-DSA-87 keypair`	`266666` cycles	`266874` cycles	`1.00`
`ML-DSA-87 sign`	`704974` cycles	`706010` cycles	`1.00`
`ML-DSA-87 verify`	`270510` cycles	`269779` cycles	`1.00`

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'AMD EPYC 4th gen (c7a)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite	Current: `6539a79`	Previous: `9ee2f35`	Ratio
`ML-DSA-44 keypair`	`42142` cycles	`40662` cycles	`1.04`

This comment was automatically generated by workflow using github-action-benchmark.
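For reference, the alert arithmetic can be reproduced with a few lines of Python. This is only a sketch: it assumes the Ratio column is current cycles divided by previous cycles (which matches the rows shown) and that the alert fires when that ratio exceeds the stated threshold of 1.03. The names `ratio` and `ALERT_THRESHOLD` are illustrative and are not taken from github-action-benchmark.

```python
ALERT_THRESHOLD = 1.03  # threshold quoted in the alert text


def ratio(current_cycles, previous_cycles):
    """Assumed definition of the Ratio column: current / previous."""
    return current_cycles / previous_cycles


# ML-DSA-44 keypair on AMD EPYC 4th gen (c7a), from the alert table.
c7a_keypair = ratio(42142, 40662)
c7a_alert = c7a_keypair > ALERT_THRESHOLD

# ML-DSA-44 keypair on Intel Xeon 4th gen (c7i), from the opt table.
c7i_keypair = ratio(34764, 34374)
c7i_alert = c7i_keypair > ALERT_THRESHOLD
```

Under these assumptions the c7a row lands just above the threshold, while the c7i row stays below it.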

jakemas · 2026-06-11T18:58:25Z

Updated to follow format introduced in d417b6b

mkannwischer

Thanks @jakemas and sorry for the long silence.
The core of the proof looks good to me - I have a few more comments/questions below.

mkannwischer · 2026-06-14T09:44:15Z

+def gen_avx2_hol_light_rej_uniform_table():
+    """Emit the HOL Light byte-list form of the AVX2 rej_uniform lookup table.
+    Mirrors mlkem-native's gen_aarch64_hol_light_rej_uniform_table pattern."""
+
+    def get_set_bits_idxs(i):
+        bits = list(map(int, format(i, "08b")))
+        bits.reverse()
+        return [bit_idx for bit_idx in range(8) if bits[bit_idx] == 1]
+
+    def gen_rows():
+        for i in range(256):
+            idxs = get_set_bits_idxs(i)
+            idxs = idxs + [0] * (8 - len(idxs))
+            yield idxs


This function should just call the one used to generate the actual constants (gen_avx2_rej_uniform_table_rows) so we can be sure they are the same.
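A minimal sketch of what this could look like, under stated assumptions: the real gen_avx2_rej_uniform_table_rows lives in the autogen script and its signature is not shown in this thread, so the version below is a hypothetical stand-in whose body mirrors the rows in the quoted diff. The emitters `format_c_rows` and `format_hol_light_bytes`, and the exact output syntax they produce, are also illustrative only.

```python
def gen_avx2_rej_uniform_table_rows():
    """Hypothetical stand-in for the shared row generator (assumed shape)."""
    for i in range(256):
        # Indices of the set bits of i, least significant bit first.
        idxs = [bit for bit in range(8) if (i >> bit) & 1]
        # Pad with zeros so every row has exactly 8 byte entries.
        yield idxs + [0] * (8 - len(idxs))


def format_c_rows(rows):
    """Hypothetical C emitter: one brace initialiser per table row."""
    return ["{" + ", ".join(str(b) for b in row) + "}," for row in rows]


def format_hol_light_bytes(rows):
    """Hypothetical HOL Light emitter: one flat list of byte words."""
    flat = [b for row in rows for b in row]
    return "[" + "; ".join(f"word 0x{b:02x}" for b in flat) + "]"


# Both artefacts are derived from the same generator, so they cannot drift.
c_text = format_c_rows(gen_avx2_rej_uniform_table_rows())
hol_text = format_hol_light_bytes(gen_avx2_rej_uniform_table_rows())
```

With a single source of rows, any change to the table shows up in both the C constants and the proof-side byte list at the same time.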

mkannwischer · 2026-06-14T09:48:06Z

  /* Safety: outlen is at most MLDSA_N and, hence, this cast is safe. */
-  return (int)mld_rej_uniform_avx2(r, buf);
+  return (int)mld_rej_uniform_avx2_asm(r, buf,
+                                       (const uint8_t *)mld_rej_uniform_table);


Can we declare mld_rej_uniform_table as a flat array so that the cast here is not needed?

mkannwischer · 2026-06-14T09:49:48Z

+__contract__(
+  requires(memory_no_alias(r, sizeof(int32_t) * MLDSA_N))
+  requires(memory_no_alias(buf, 840))
+  requires(table == (const uint8_t *)mld_rej_uniform_table)


Can we declare mld_rej_uniform_table as a flat array so that the cast in this contract is not needed?

mkannwischer · 2026-06-14T09:51:20Z

+             [ARITH_TAC; ALL_TAC] THEN
+            SUBGOAL_THEN `4 * (curlen + 1) = 4 * curlen + 4` SUBST1_TAC THENL
+             [ARITH_TAC; ALL_TAC] THEN
+            FIRST_ASSUM(fun th -> 


Can we remove the debug Printf.printf statements or do they still serve any purpose?

mkannwischer · 2026-06-14T10:03:12Z

+             [ARITH_TAC; ALL_TAC] THEN
+            SUBGOAL_THEN `4 * (curlen + 1) = 4 * curlen + 4` SUBST1_TAC THENL
+             [ARITH_TAC; ALL_TAC] THEN
+            FIRST_ASSUM(fun th -> 


I'm confused. Why are there separate _MEMSAFE and _SAFE proofs? Shouldn't this proof only have the _MEMSAFE theorems?

hanno-becker · 2026-06-15T03:43:17Z

+ *   Low  128 bits: bytes [0..15]  (original 64-bit lanes 0, 1)
+ *   High 128 bits: bytes [8..23]  (original 64-bit lanes 1, 2)


Nit: The comment is slightly confusing by itself, because the semantic granularity of the input is in bytes, not 64-bit. 64-bit granularity comes from vpermq, but this isn't mentioned here.

Suggestion:

 * vpermq with 0x94 (= 0b 10 01 01 00) permutes 64-bit lanes via (0,1,2,3) -> (0,1,1,2).
 * The loaded 32 bytes are thus rearranged as:
 *   Low  128 bits: bytes [0..15] of original 32-byte
 *   High 128 bits: bytes [8..23] of original 32-byte
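The lane selection can be checked with a small Python model. This is a sketch, not the proof's instruction model: `vpermq` and `selectors` are illustrative helpers that implement the usual semantics (output lane i takes the source lane named by bits [2i+1:2i] of the immediate), and `buf` is a stand-in for the 32 loaded bytes.

```python
def selectors(imm8):
    """Decode the four 2-bit lane selectors of a vpermq immediate."""
    return [(imm8 >> (2 * i)) & 3 for i in range(4)]


def vpermq(lanes, imm8):
    """Model vpermq on four 64-bit lanes given as 8-byte chunks."""
    return [lanes[s] for s in selectors(imm8)]


buf = bytes(range(32))  # byte i holds the value i, to make tracking easy
lanes = [buf[8 * i: 8 * i + 8] for i in range(4)]
out = b"".join(vpermq(lanes, 0x94))

low, high = out[:16], out[16:]  # low and high 128-bit halves
```

Here `selectors(0x94)` decodes to lanes 0, 1, 1, 2, so `low` holds bytes 0..15 and `high` holds bytes 8..23 of the original load.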

hanno-becker · 2026-06-15T03:44:59Z

+// Construct broadcast constants
+        movl    $0x7FFFFF, good
+        vmovd   good, %xmm1
+        vpbroadcastd %xmm1, mask              // mask: 23-bit extraction
+
+        movl    $8380417, good                 // MLDSA_Q
+        vmovd   good, %xmm2
+        vpbroadcastd %xmm2, bound             // bound: rejection threshold
+
+// Initialize counters
+        xorl    ctr, ctr                       // ctr = 0
+        xorl    pos, pos                       // pos = 0


Nit nit: indentation of comments. Also, can we use /* .. */ style? I know this is wrong in many other ASM files, but when we touch/add code we may as well adjust it.

hanno-becker · 2026-06-15T03:46:07Z

+
+/*
+ * Main SIMD loop: process 24 input bytes into up to 8 coefficients
+ * per iteration. Loops while ctr <= MLDSA_N - 8 and pos <= BUFLEN - 32.


It's noteworthy that this approach differs from the mlkem-native rejection sampling, which buffers into a temporary slightly oversized stack buffer, to avoid the scalar tail loop.

I don't know which is better, nor is it necessary that we align this now, but just noting.

My gut feeling is that it's probably better the way it's done here.

hanno-becker · 2026-06-15T03:47:40Z

+rej_uniform_avx2_asm_loop:
+        cmpl    $248, ctr                      // MLDSA_N - 8
+        ja      rej_uniform_avx2_asm_scalar
+        cmpl    $808, pos                      // MLD_AVX2_REJ_UNIFORM_BUFLEN - 32


Nit: Can you use the macro constant instead of a comment here?

hanno-becker · 2026-06-15T03:48:56Z

+        popcntl good, cnt                      // count valid coefficients
+
+        vmovq   (tab, %r8, 8), %xmm4          // load permutation from table[good]


Nit: mix of symbolic good vs raw r8. Can we get the notation consistent (ideally symbolics only)? If you need variants of the same register, you can do something like name_x, name_y, ...

hanno-becker · 2026-06-15T04:03:54Z

+        vpmovzxbd %xmm4, cmp_result            // zero-extend to 8 dword indices
+        vpermd  data, cmp_result, data          // compact valid coefficients to front
+
+        vmovdqu data, (out, %rax, 4)           // store at r[ctr]


Same comment as above: Let's stick to symbolic register names.
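For readers following the quoted instructions, the compaction step (popcnt, table lookup, vpmovzxbd, vpermd) can be modelled in Python. This is a sketch under assumptions: how `good` is computed is not shown in the quoted lines, so the model assumes bit i of `good` is set exactly when lane i holds a coefficient below MLDSA_Q; `table_row` and `compact` are illustrative names.

```python
MLDSA_Q = 8380417


def table_row(good):
    """Row of the lookup table: set-bit indices of good, zero padded."""
    idxs = [b for b in range(8) if (good >> b) & 1]
    return idxs + [0] * (8 - len(idxs))


def compact(data):
    """Move accepted lanes of an 8-lane vector to the front."""
    # Assumed mask: bit i set iff lane i is a valid coefficient.
    good = sum(1 << i for i, c in enumerate(data) if c < MLDSA_Q)
    cnt = bin(good).count("1")          # popcnt: number of accepted lanes
    idx = table_row(good)               # table[good], one index per lane
    permuted = [data[j] for j in idx]   # vpermd: out[i] = data[idx[i]]
    return permuted, cnt


lanes = [1, MLDSA_Q, 2, MLDSA_Q + 5, 3, 4, 0x7FFFFF, 5]
permuted, cnt = compact(lanes)
```

Only the first `cnt` entries of `permuted` are meaningful. In this model the padding indices point at lane 0, so the remaining entries repeat that lane and would be overwritten by later stores.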

hanno-becker · 2026-06-15T04:04:46Z

+rej_uniform_avx2_asm_scalar:
+        cmpl    $256, ctr                      // MLDSA_N
+        jae     rej_uniform_avx2_asm_done
+        cmpl    $837, pos                      // MLD_AVX2_REJ_UNIFORM_BUFLEN - 3


This can never fire, can it? If so, can we remove it?

Here's the reasoning:

The scalar loop bumps pos += 3 on every sample, but ctr only increments on an accepted one (the pos advance happens before the >= Q reject check). So in a rejection-heavy tail, pos keeps climbing while ctr stalls.

Concretely: the buffer is 840 bytes, i.e. at most 280 three-byte samples, and we need 256 acceptances, so we can absorb at most 24 rejections (280 - 256) before running dry with ctr still < 256. We can also enter the scalar loop with pos as high as 832 (the main loop exits after pos += 24 with pos having been <= 808). From there, if the tail keeps rejecting, pos walks 832 → 835 → 838 and at 838 the next 3-byte read would touch buf[838..840], one past the end of the 840-byte buffer. The cmpl $SCALAR_POS_BOUND, pos; ja done is what stops that.

It's also load-bearing in the proof: the loop exit is disjunctive (hit 256 coefficients or exhaust the buffer), and the memory-safety argument for the 3-byte load depends on pos <= 837, which this branch establishes. Removing it reintroduces the over-read on adversarial input and breaks the MEMSAFE proof.

Worth noting the removed intrinsics had the identical guard as its scalar loop condition:

while (ctr < MLDSA_N && pos <= MLD_AVX2_REJ_UNIFORM_BUFLEN - 3)

which is exactly the pos <= 837 that cmpl $SCALAR_POS_BOUND, pos; ja done implements. The asm didn't add the check; it preserves what was already there.

The header comment on that file also spells out why it has to be there:

The pqcrystals implementation assumes a buffer that is 8 bytes larger as the first loop overreads by 8 bytes that are then discarded. We instead do not pad the buffer and do not overread.

So the bound is load-bearing by design: it's what lets us run on the unpadded 840-byte buffer without over-reading.
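The argument above can be replayed with a small Python model of the scalar tail. This is a sketch, not the assembly: `scalar_tail` and its `guard` switch are illustrative, the loop body follows the description in this thread (read 3 bytes, mask to 23 bits, advance pos, then accept or reject), and the all-0xFF buffer is a worst case chosen so that every sample is rejected.

```python
MLDSA_N = 256
MLDSA_Q = 8380417
BUFLEN = 840                    # MLD_AVX2_REJ_UNIFORM_BUFLEN per the thread
SCALAR_POS_BOUND = BUFLEN - 3   # 837


def scalar_tail(buf, ctr, pos, guard=True):
    """Return (ctr, pos, highest byte index touched) for the scalar loop."""
    max_read = -1
    while ctr < MLDSA_N and (not guard or pos <= SCALAR_POS_BOUND):
        if pos + 2 >= len(buf):
            # Only reachable without the guard: this read would leave buf.
            return ctr, pos, pos + 2
        t = buf[pos] | (buf[pos + 1] << 8) | (buf[pos + 2] << 16)
        t &= 0x7FFFFF               # keep 23 bits
        max_read = max(max_read, pos + 2)
        pos += 3                    # advance before the reject check
        if t < MLDSA_Q:
            ctr += 1
    return ctr, pos, max_read


rejecting = bytes([0xFF]) * BUFLEN  # every sample is 0x7FFFFF >= MLDSA_Q
guarded = scalar_tail(rejecting, ctr=200, pos=832)
unguarded = scalar_tail(rejecting, ctr=200, pos=832, guard=False)
```

With the guard the loop stops at pos = 838 having touched nothing beyond index 837; without it the next read would reach index 840. The entry position 832 is the bound used in the discussion. If pos only ever advances from 0 in steps of 24 in the main loop, the reachable entry positions are multiples of 24, and the model shows the guard is equally needed there: starting from 816 the guarded loop ends at pos = 840 with index 839 as the last byte read.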

hanno-becker

Thanks @jakemas, and sorry this is again taking so long.

In addition to @mkannwischer's important comment that the constant tables should come from the same source (that's our only argument for why they are consistent across proof and code -- we have no other check!), I believe there's a dead branch in the rejection sampling that can be removed.

…sembly Add hand-written x86_64 AVX2 assembly for rej_uniform_eta2 and rej_uniform_eta4, following the rej_uniform approach in #1014: the table is passed as a parameter and all constants are built from immediates (no .rodata), enabling future HOL-Light verification. Wire eta4 to the new asm in meta.h, add the asm entry points and contracts in arith_native_x86_64.h, register the bytecode dump targets in autogen and the Makefile, and add a poly_uniform_eta_4x component benchmark. Signed-off-by: jake massimo <jakemas@amazon.com>

…sembly Add hand-written x86_64 AVX2 assembly for rej_uniform_eta2 and rej_uniform_eta4, following the rej_uniform approach in #1014: the table is passed as a parameter and all constants are built from immediates (no .rodata), enabling future HOL-Light verification. Wire both eta2 and eta4 to the new asm in meta.h, add the asm entry points and contracts in arith_native_x86_64.h, register the bytecode dump targets in autogen and the Makefile, and add a poly_uniform_eta_4x component benchmark. The eta2 vector path applies the centered mod-5 reduction to (2 - nibble) directly (matching the reference), rather than reducing the raw nibble and subtracting afterwards; the two are not equivalent because vpmulhrsw rounds to nearest. Verified against the ACVP keyGen vectors for all parameter sets. Signed-off-by: jake massimo <jakemas@amazon.com>

…sembly Add hand-written x86_64 AVX2 assembly for rej_uniform_eta2 and rej_uniform_eta4 and remove the AVX2 intrinsics implementations they replace, following the rej_uniform approach in #1014: the table is passed as a parameter and all constants are built from immediates (no .rodata), enabling future HOL-Light verification. Both eta2 and eta4 are wired to the new asm in meta.h, with contracts in arith_native_x86_64.h, bytecode dump targets in autogen and the Makefile, and a poly_uniform_eta_4x component benchmark. The eta2 vector path applies the centered mod-5 reduction to (2 - nibble) directly (matching the reference), rather than reducing the raw nibble and subtracting afterwards; the two are not equivalent because vpmulhrsw rounds to nearest. Verified against the ACVP keyGen vectors for all parameter sets. Signed-off-by: jake massimo <jakemas@amazon.com>

…sembly Add hand-written x86_64 AVX2 assembly for rej_uniform_eta2 and rej_uniform_eta4 and remove the AVX2 intrinsics implementations they replace, following the rej_uniform approach in #1014: the table is passed as a parameter and all constants are built from immediates (no .rodata), enabling future HOL-Light verification. Both eta2 and eta4 are wired to the new asm in meta.h, with contracts in arith_native_x86_64.h, bytecode dump targets in autogen and the Makefile, and a poly_uniform_eta_4x component benchmark. The asm entry points are declared MLD_SYSV_ABI (like the other x86_64 asm routines) so they are called with the System V register convention on all platforms, including Windows/MinGW. The eta2 vector path applies the centered mod-5 reduction to (2 - nibble) directly (matching the reference), rather than reducing the raw nibble and subtracting afterwards; the two are not equivalent because vpmulhrsw rounds to nearest. Verified against the ACVP keyGen vectors for all parameter sets. Signed-off-by: jake massimo <jakemas@amazon.com>

…sembly Add hand-written x86_64 AVX2 assembly for rej_uniform_eta2 and rej_uniform_eta4 and remove the AVX2 intrinsics implementations they replace, following the rej_uniform approach in #1014: the table is passed as a parameter and all constants are built from immediates (no .rodata), enabling future HOL-Light verification. Both eta2 and eta4 are wired to the new asm in meta.h, with contracts in arith_native_x86_64.h, bytecode dump targets in autogen and the Makefile, and a poly_uniform_eta_4x component benchmark. The asm entry points are declared MLD_SYSV_ABI (like the other x86_64 asm routines) so they are called with the System V register convention on all platforms, including Windows/MinGW. The endbr64 is emitted via MLD_ASM_FN_SYMBOL (CET-gated) rather than a raw mnemonic, so older assemblers (e.g. clang-6) build cleanly. The eta2 vector path applies the centered mod-5 reduction to (2 - nibble) directly (matching the reference), rather than reducing the raw nibble and subtracting afterwards; the two are not equivalent because vpmulhrsw rounds to nearest. Verified against the ACVP keyGen vectors for all parameter sets. Signed-off-by: jake massimo <jakemas@amazon.com>

… proof Replaces the intrinsics-based rej_uniform_avx2.c with a hand-written assembly routine (mld_rej_uniform_avx2_asm) and adds HOL-Light functional correctness and memory-safety proofs on top of s2n-bignum, plus the CBMC contract proof. Signed-off-by: Jake Massimo <jakemas@amazon.com>

jakemas requested a review from a team as a code owner April 3, 2026 04:11

jakemas marked this pull request as draft April 3, 2026 04:11

jakemas added the benchmark label Apr 3, 2026

github-actions Bot reviewed Apr 3, 2026

oqs-bot reviewed Apr 3, 2026

jakemas added benchmark and removed benchmark labels Apr 3, 2026

oqs-bot reviewed Apr 3, 2026

jakemas added benchmark and removed benchmark labels Apr 3, 2026

jakemas mentioned this pull request Apr 8, 2026

x86: Add VMOVMSKPS, VPMOVZXBD, VZEROUPPER instruction models awslabs/s2n-bignum#387

Merged

mkannwischer self-assigned this Apr 22, 2026

jakemas force-pushed the jakemas/rej-uniform-asm branch 6 times, most recently from 97ef0aa to b6e6f76 Compare June 11, 2026 18:57

jakemas force-pushed the jakemas/rej-uniform-asm branch 2 times, most recently from 24cfdd8 to a50739c Compare June 12, 2026 20:37

mkannwischer requested changes Jun 14, 2026

hanno-becker reviewed Jun 15, 2026

hanno-becker requested changes Jun 15, 2026

jakemas force-pushed the jakemas/rej-uniform-asm branch 2 times, most recently from ed868d0 to fb75556 Compare June 15, 2026 21:07

This was referenced Jun 16, 2026

x86_64: Replace rej_uniform_eta2/eta4 intrinsics with hand-written assembly #1187

Closed

x86_64: Replace rej_uniform_eta2/eta4 intrinsics with hand-written assembly #1188

Draft

jakemas force-pushed the jakemas/rej-uniform-asm branch from fb75556 to 4c41cac Compare June 19, 2026 00:22
