x86_64 + HOL-Light: Replace rej_uniform intrinsics with assembly and HOL-Light CORRECT and MEMSAFE proofs #1014

Open
jakemas wants to merge 1 commit into main from jakemas/rej-uniform-asm

@jakemas jakemas commented Apr 3, 2026

Summary

Resolves #926 and #418 (?)

The HOL-Light proof needs instructions from awslabs/s2n-bignum#387.

  • Replace AVX2 intrinsics implementation of rej_uniform with hand-written x86_64 assembly
  • Table passed as parameter (consistent with aarch64 approach), avoiding external symbol references for simpasm compatibility
  • All constants constructed from immediates (no .rodata section), enabling future HOL-Light formal verification
  • Register name #defines with #undef cleanup for SCU builds (following mlkem-native pattern)
  • Adds poly_uniform to component benchmark
  • HOL-Light proof infrastructure included (bytecode, table definition, proof skeleton, Makefile)

ML-DSA's 23-bit coefficients require 32-bit lanes, so 8 coefficients naturally fill one 256-bit YMM register per iteration.
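As a rough illustration of the sampling step described above, here is a small Python model, not the PR's assembly. The 3-byte load, the 23-bit mask and the modulus follow the ML-DSA specification. The function names and the layout of `TABLE` (one list of accepted lane indices per 8-bit mask) are assumptions made for this sketch; the actual table format used by the assembly is not shown in this conversation.

```python
Q = 8380417  # ML-DSA modulus: 2**23 - 2**13 + 1


def candidate(buf, i):
    """Little-endian 3-byte load at offset i, masked down to 23 bits."""
    return (buf[i] | (buf[i + 1] << 8) | (buf[i + 2] << 16)) & 0x7FFFFF


def rej_uniform_scalar(buf, want):
    """Scalar reference: keep each 23-bit candidate that is below Q."""
    out = []
    for i in range(0, len(buf) - 2, 3):
        if len(out) == want:
            break
        t = candidate(buf, i)
        if t < Q:
            out.append(t)
    return out


# Hypothetical compaction table: entry `mask` lists the lanes whose bit is set.
TABLE = [[lane for lane in range(8) if (mask >> lane) & 1] for mask in range(256)]


def rej_uniform_lanes(buf, want, table=TABLE):
    """Lane-wise model: 24 bytes -> 8 lanes -> accept mask -> table-driven compaction."""
    out, pos = [], 0
    # Only take a full block while all 8 lanes could still fit in the output.
    while pos + 24 <= len(buf) and len(out) + 8 <= want:
        lanes = [candidate(buf, pos + 3 * k) for k in range(8)]
        mask = sum(1 << k for k, v in enumerate(lanes) if v < Q)
        out.extend(lanes[k] for k in table[mask])
        pos += 24
    # Finish the remaining bytes with the scalar routine.
    out.extend(rej_uniform_scalar(buf[pos:], want - len(out)))
    return out
```

Passing the table as an argument mirrors the "table passed as parameter" point in the summary: the routine itself then needs no external data symbol.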

Performance

AMD EPYC 3rd gen (c6a) — opt

| Benchmark | Before (cycles) | After (cycles) | Change |
|---|---|---|---|
| ML-DSA-44 keypair | 68,874 | 66,828 | -3% |
| ML-DSA-44 sign | 187,594 | 184,181 | -2% |
| ML-DSA-44 verify | 68,993 | 65,665 | -5% |
| ML-DSA-65 keypair | 119,089 | 112,640 | -5% |
| ML-DSA-65 sign | 299,488 | 294,836 | -2% |
| ML-DSA-65 verify | 115,385 | 108,494 | -6% |
| ML-DSA-87 keypair | 203,754 | 185,518 | -9% |
| ML-DSA-87 sign | 396,462 | 378,579 | -5% |
| ML-DSA-87 verify | 196,231 | 177,157 | -10% |
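For readers who want to re-derive the Change column, the following sketch recomputes it from the Before and After values in the table above. The helper name `pct_change` is made up for this example, and rounding to the nearest whole percent is an assumption that happens to reproduce every row.

```python
def pct_change(before, after):
    """Relative change in percent, rounded to the nearest integer."""
    return round((after - before) / before * 100)


# (before, after) cycle counts copied from the table above.
rows = {
    "ML-DSA-44 keypair": (68874, 66828),
    "ML-DSA-44 sign": (187594, 184181),
    "ML-DSA-44 verify": (68993, 65665),
    "ML-DSA-65 keypair": (119089, 112640),
    "ML-DSA-65 sign": (299488, 294836),
    "ML-DSA-65 verify": (115385, 108494),
    "ML-DSA-87 keypair": (203754, 185518),
    "ML-DSA-87 sign": (396462, 378579),
    "ML-DSA-87 verify": (196231, 177157),
}

for name, (before, after) in rows.items():
    print(f"{name}: {pct_change(before, after):+d}%")
```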

Proof

Includes HOL-Light and CBMC proofs, written by Claude Opus 4.7.

CORRECT
HOL-Light / x86_64 HOL Light proof for mldsa_rej_uniform.S (pull_request) Successful in 12m

CORRECT + MEMSAFE
HOL-Light / x86_64 HOL Light proof for rej_uniform_avx2_asm.S (pull_request) Successful in 21m

It took Claude Opus 4.7 1m around 3 weeks to get the MEMSAFE proof. Since it records all token usage per prompt, I had it go back to the prompt in which I asked it to create the MEMSAFE part of the proof and gather statistics from there:

  • ~10.35 billion tokens
  • 20,257 turns (i.e., distinct API calls)
  • ~30 days of development
  • ~100-125 hours of active user engagement (based on clustering user message timestamps)
  • 185 attempted build iterations

@jakemas jakemas requested a review from a team as a code owner April 3, 2026 04:11
@jakemas jakemas marked this pull request as draft April 3, 2026 04:11

@github-actions github-actions Bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)

| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 113118 cycles | 113013 cycles | 1.00 |
| ML-DSA-44 sign | 355649 cycles | 355605 cycles | 1.00 |
| ML-DSA-44 verify | 117801 cycles | 117682 cycles | 1.00 |
| ML-DSA-65 keypair | 196381 cycles | 196214 cycles | 1.00 |
| ML-DSA-65 sign | 589557 cycles | 588943 cycles | 1.00 |
| ML-DSA-65 verify | 194604 cycles | 194375 cycles | 1.00 |
| ML-DSA-87 keypair | 322210 cycles | 322148 cycles | 1.00 |
| ML-DSA-87 sign | 752493 cycles | 752763 cycles | 1.00 |
| ML-DSA-87 verify | 320055 cycles | 319900 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions Bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)

| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 212361 cycles | 212622 cycles | 1.00 |
| ML-DSA-44 sign | 760716 cycles | 760066 cycles | 1.00 |
| ML-DSA-44 verify | 228743 cycles | 228987 cycles | 1.00 |
| ML-DSA-65 keypair | 379384 cycles | 379665 cycles | 1.00 |
| ML-DSA-65 sign | 1250617 cycles | 1249827 cycles | 1.00 |
| ML-DSA-65 verify | 371531 cycles | 372045 cycles | 1.00 |
| ML-DSA-87 keypair | 604335 cycles | 605426 cycles | 1.00 |
| ML-DSA-87 sign | 1593243 cycles | 1591413 cycles | 1.00 |
| ML-DSA-87 verify | 618270 cycles | 617375 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a)

| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 66830 cycles | 68874 cycles | 0.97 |
| ML-DSA-44 sign | 184077 cycles | 187594 cycles | 0.98 |
| ML-DSA-44 verify | 65562 cycles | 68993 cycles | 0.95 |
| ML-DSA-65 keypair | 111959 cycles | 119089 cycles | 0.94 |
| ML-DSA-65 sign | 292002 cycles | 299488 cycles | 0.98 |
| ML-DSA-65 verify | 108472 cycles | 115385 cycles | 0.94 |
| ML-DSA-87 keypair | 185520 cycles | 203754 cycles | 0.91 |
| ML-DSA-87 sign | 379630 cycles | 396462 cycles | 0.96 |
| ML-DSA-87 verify | 177291 cycles | 196231 cycles | 0.90 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton4

| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 68316 cycles | 68121 cycles | 1.00 |
| ML-DSA-44 sign | 202487 cycles | 202429 cycles | 1.00 |
| ML-DSA-44 verify | 70722 cycles | 70691 cycles | 1.00 |
| ML-DSA-65 keypair | 121061 cycles | 121050 cycles | 1.00 |
| ML-DSA-65 sign | 331574 cycles | 332242 cycles | 1.00 |
| ML-DSA-65 verify | 117810 cycles | 118169 cycles | 1.00 |
| ML-DSA-87 keypair | 198140 cycles | 198283 cycles | 1.00 |
| ML-DSA-87 sign | 427941 cycles | 428124 cycles | 1.00 |
| ML-DSA-87 verify | 194637 cycles | 194645 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 134578 cycles | 135123 cycles | 1.00 |
| ML-DSA-44 sign | 523923 cycles | 523989 cycles | 1.00 |
| ML-DSA-44 verify | 147640 cycles | 147421 cycles | 1.00 |
| ML-DSA-65 keypair | 228634 cycles | 227032 cycles | 1.01 |
| ML-DSA-65 sign | 864042 cycles | 860343 cycles | 1.00 |
| ML-DSA-65 verify | 236700 cycles | 234883 cycles | 1.01 |
| ML-DSA-87 keypair | 371955 cycles | 371568 cycles | 1.00 |
| ML-DSA-87 sign | 1080535 cycles | 1079389 cycles | 1.00 |
| ML-DSA-87 verify | 383811 cycles | 383403 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i)

| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 56863 cycles | 56287 cycles | 1.01 |
| ML-DSA-44 sign | 181063 cycles | 181562 cycles | 1.00 |
| ML-DSA-44 verify | 61140 cycles | 61061 cycles | 1.00 |
| ML-DSA-65 keypair | 98291 cycles | 98770 cycles | 1.00 |
| ML-DSA-65 sign | 298368 cycles | 299116 cycles | 1.00 |
| ML-DSA-65 verify | 100343 cycles | 100251 cycles | 1.00 |
| ML-DSA-87 keypair | 152430 cycles | 153265 cycles | 0.99 |
| ML-DSA-87 sign | 354719 cycles | 355417 cycles | 1.00 |
| ML-DSA-87 verify | 153124 cycles | 153884 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton4 (no-opt)

| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 128315 cycles | 128272 cycles | 1.00 |
| ML-DSA-44 sign | 447513 cycles | 447600 cycles | 1.00 |
| ML-DSA-44 verify | 138123 cycles | 144678 cycles | 0.95 |
| ML-DSA-65 keypair | 220541 cycles | 220481 cycles | 1.00 |
| ML-DSA-65 sign | 726484 cycles | 726951 cycles | 1.00 |
| ML-DSA-65 verify | 222926 cycles | 223461 cycles | 1.00 |
| ML-DSA-87 keypair | 366142 cycles | 366604 cycles | 1.00 |
| ML-DSA-87 sign | 927541 cycles | 927414 cycles | 1.00 |
| ML-DSA-87 verify | 374016 cycles | 373875 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton3

| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 72353 cycles | 72235 cycles | 1.00 |
| ML-DSA-44 sign | 212424 cycles | 212375 cycles | 1.00 |
| ML-DSA-44 verify | 75754 cycles | 75714 cycles | 1.00 |
| ML-DSA-65 keypair | 127646 cycles | 127612 cycles | 1.00 |
| ML-DSA-65 sign | 351030 cycles | 350845 cycles | 1.00 |
| ML-DSA-65 verify | 125627 cycles | 125755 cycles | 1.00 |
| ML-DSA-87 keypair | 205980 cycles | 208476 cycles | 0.99 |
| ML-DSA-87 sign | 444778 cycles | 450018 cycles | 0.99 |
| ML-DSA-87 verify | 205601 cycles | 205843 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 157499 cycles | 157541 cycles | 1.00 |
| ML-DSA-44 sign | 549244 cycles | 549413 cycles | 1.00 |
| ML-DSA-44 verify | 169448 cycles | 168865 cycles | 1.00 |
| ML-DSA-65 keypair | 268437 cycles | 268818 cycles | 1.00 |
| ML-DSA-65 sign | 903422 cycles | 903672 cycles | 1.00 |
| ML-DSA-65 verify | 275283 cycles | 274680 cycles | 1.00 |
| ML-DSA-87 keypair | 448241 cycles | 448464 cycles | 1.00 |
| ML-DSA-87 sign | 1158654 cycles | 1157970 cycles | 1.00 |
| ML-DSA-87 verify | 458704 cycles | 458043 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a)

| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 42142 cycles | 40662 cycles | 1.04 |
| ML-DSA-44 sign | 134317 cycles | 132808 cycles | 1.01 |
| ML-DSA-44 verify | 44844 cycles | 43607 cycles | 1.03 |
| ML-DSA-65 keypair | 72940 cycles | 71859 cycles | 1.02 |
| ML-DSA-65 sign | 213861 cycles | 213367 cycles | 1.00 |
| ML-DSA-65 verify | 73729 cycles | 72847 cycles | 1.01 |
| ML-DSA-87 keypair | 107003 cycles | 109237 cycles | 0.98 |
| ML-DSA-87 sign | 250851 cycles | 254550 cycles | 0.99 |
| ML-DSA-87 verify | 107681 cycles | 109371 cycles | 0.98 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 120754 cycles | 120325 cycles | 1.00 |
| ML-DSA-44 sign | 447570 cycles | 447576 cycles | 1.00 |
| ML-DSA-44 verify | 130511 cycles | 130561 cycles | 1.00 |
| ML-DSA-65 keypair | 205040 cycles | 205018 cycles | 1.00 |
| ML-DSA-65 sign | 728790 cycles | 729474 cycles | 1.00 |
| ML-DSA-65 verify | 210029 cycles | 209605 cycles | 1.00 |
| ML-DSA-87 keypair | 337610 cycles | 336678 cycles | 1.00 |
| ML-DSA-87 sign | 925517 cycles | 924223 cycles | 1.00 |
| ML-DSA-87 verify | 347563 cycles | 347399 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton3 (no-opt)

| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 138744 cycles | 138561 cycles | 1.00 |
| ML-DSA-44 sign | 483982 cycles | 484140 cycles | 1.00 |
| ML-DSA-44 verify | 148574 cycles | 162388 cycles | 0.91 |
| ML-DSA-65 keypair | 241921 cycles | 241950 cycles | 1.00 |
| ML-DSA-65 sign | 792702 cycles | 792591 cycles | 1.00 |
| ML-DSA-65 verify | 240763 cycles | 241288 cycles | 1.00 |
| ML-DSA-87 keypair | 396106 cycles | 397138 cycles | 1.00 |
| ML-DSA-87 sign | 1013453 cycles | 1013569 cycles | 1.00 |
| ML-DSA-87 verify | 403446 cycles | 403178 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton2

| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 113189 cycles | 113255 cycles | 1.00 |
| ML-DSA-44 sign | 355791 cycles | 356042 cycles | 1.00 |
| ML-DSA-44 verify | 117978 cycles | 117969 cycles | 1.00 |
| ML-DSA-65 keypair | 196342 cycles | 196623 cycles | 1.00 |
| ML-DSA-65 sign | 589183 cycles | 589242 cycles | 1.00 |
| ML-DSA-65 verify | 194553 cycles | 194559 cycles | 1.00 |
| ML-DSA-87 keypair | 322537 cycles | 322281 cycles | 1.00 |
| ML-DSA-87 sign | 753613 cycles | 753546 cycles | 1.00 |
| ML-DSA-87 verify | 320115 cycles | 320070 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton2 (no-opt)

| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 213219 cycles | 212521 cycles | 1.00 |
| ML-DSA-44 sign | 761553 cycles | 760970 cycles | 1.00 |
| ML-DSA-44 verify | 241351 cycles | 234237 cycles | 1.03 |
| ML-DSA-65 keypair | 380573 cycles | 379762 cycles | 1.00 |
| ML-DSA-65 sign | 1252452 cycles | 1252199 cycles | 1.00 |
| ML-DSA-65 verify | 372839 cycles | 371797 cycles | 1.00 |
| ML-DSA-87 keypair | 607341 cycles | 604584 cycles | 1.00 |
| ML-DSA-87 sign | 1596680 cycles | 1595561 cycles | 1.00 |
| ML-DSA-87 verify | 619175 cycles | 618927 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Graviton2 (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 verify | 241351 cycles | 234237 cycles | 1.03 |

This comment was automatically generated by workflow using github-action-benchmark.
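The ratio in the alert is shown rounded to two decimals, which makes it look equal to the threshold. A quick check of the reported row, assuming the ratio is simply current divided by previous and that the alert compares the unrounded value against 1.03 (how the workflow actually evaluates this is not shown here):

```python
THRESHOLD = 1.03  # threshold quoted in the alert text


def ratio(current, previous):
    """Cycle-count ratio; values above 1.0 mean the current commit is slower."""
    return current / previous


r = ratio(241351, 234237)  # ML-DSA-44 verify, Graviton2 (no-opt)
print(f"ratio = {r:.4f}, exceeds threshold: {r > THRESHOLD}")
```

Under these assumptions the unrounded ratio sits only a few hundredths of a percent above the threshold, which is worth keeping in mind when judging whether the alert reflects a real regression or run-to-run noise.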

oqs-bot commented Apr 3, 2026

CBMC Results (ML-DSA-44)

Full Results (206 proofs)
Proof Status Current Previous Change
**TOTAL** 1799s 1701s +5.8%
mld_invntt_layer 280s 265s +6%
rej_uniform_native 150s 148s +1%
polyvecl_pointwise_acc_montgomery_c 118s 113s +4%
poly_pointwise_montgomery_c 101s 91s +11%
mld_ct_memcmp 71s 64s +11%
mld_attempt_signature_generation 62s 60s +3%
mld_ntt_layer 43s 43s +0%
fqmul 42s 39s +8%
rej_uniform_native_x86_64 37s - new
sign_verify_internal 28s 24s +17%
polyvec_matrix_expand 25s 25s +0%
keccakf1600x4_permute_native 24s 22s +9%
mld_ntt_butterfly_block 23s 22s +5%
poly_chknorm_c 23s 19s +21%
rej_uniform 20s 23s -13%
sign_signature_internal 18s 17s +6%
poly_add 16s 10s +60%
mld_check_pct 15s 14s +7%
polyt0_unpack 15s 17s -12%
compute_pack_t0_t1 14s 13s +8%
polyeta_unpack 14s 15s -7%
polyveck_chknorm 13s 13s +0%
rej_uniform_c 13s 13s +0%
poly_uniform_eta_4x 11s 13s -15%
polyz_unpack_c 11s 11s +0%
keccak_absorb_once_x4 10s 8s +25%
poly_uniform_4x 10s 12s -17%
polyvec_matrix_expand_serial 9s 8s +12%
polyvec_matrix_pointwise_montgomery_yvec 9s 9s +0%
mld_keccakf1600_permute_c 8s 7s +14%
poly_invntt_tomont_c 8s 7s +14%
polyveck_decompose 8s 6s +33%
sign 8s 7s +14%
mld_compute_pack_z 7s 7s +0%
mld_sample_s1_s2 7s 3s +133%
sign_signature_pre_hash_shake256 7s 6s +17%
keccak_absorb 6s 8s -25%
pointwise_acc_native_aarch64 6s 4s +50%
poly_caddq_c 6s 6s +0%
poly_use_hint_c 6s 6s +0%
polyveck_unpack_eta 6s 4s +50%
polyvecl_chknorm 6s 4s +50%
mld_ct_abs_i32 5s 3s +67%
pack_sig_h 5s 1s +400%
pointwise_acc_native_x86_64 5s 6s -17%
pointwise_native_aarch64 5s 1s +400%
poly_challenge 5s 5s +0%
poly_ntt 5s 5s +0%
poly_pointwise_montgomery_native 5s 2s +150%
poly_power2round 5s 6s -17%
polyveck_invntt_tomont 5s 3s +67%
polyvecl_pointwise_acc_montgomery_native 5s 2s +150%
polyw1_pack_88 5s 3s +67%
sign_open 5s 4s +25%
unpack_sk_s1hat 5s 4s +25%
keccak_squeezeblocks_x4 4s 4s +0%
keccakf1600_permute 4s 2s +100%
keccakf1600x4_xor_bytes_native 4s 4s +0%
make_hint 4s 3s +33%
mld_ct_cmask_nonzero_u32 4s 1s +300%
mld_polymat_expand_entry 4s 3s +33%
mld_value_barrier_i64 4s 3s +33%
mld_value_barrier_u32 4s 3s +33%
montgomery_reduce 4s 3s +33%
ntt_native_aarch64 4s 3s +33%
pack_sk_rho_key_tr_s2 4s 3s +33%
poly_chknorm_native 4s 2s +100%
poly_chknorm_native_aarch64 4s 2s +100%
poly_decompose_native 4s 1s +300%
poly_invntt_tomont 4s 4s +0%
poly_ntt_native 4s 2s +100%
poly_reduce 4s 2s +100%
poly_sub 4s 3s +33%
poly_uniform 4s 3s +33%
poly_uniform_gamma1 4s 2s +100%
poly_uniform_gamma1_4x 4s 3s +33%
poly_use_hint_native 4s 2s +100%
poly_use_hint_native_aarch64 4s 2s +100%
polyt0_pack 4s 5s -20%
polyvec_matrix_pointwise_montgomery_row 4s 3s +33%
polyveck_pack_eta 4s 3s +33%
polyvecl_ntt 4s 7s -43%
polyvecl_pack_eta 4s 4s +0%
polyvecl_uniform_gamma1 4s 2s +100%
polyw1_pack_32 4s 1s +300%
power2round 4s 4s +0%
shake128_absorb 4s 2s +100%
shake128_finalize 4s 3s +33%
shake128_release 4s 2s +100%
shake128x4_absorb_once 4s 4s +0%
shake256x4_absorb_once 4s 2s +100%
sign_keypair 4s 5s -20%
sign_pk_from_sk 4s 5s -20%
sign_verify_extmu 4s 4s +0%
sign_verify_pre_hash_shake256 4s 6s -33%
sk_s2hat_get_poly 4s 2s +100%
caddq 3s 3s +0%
intt_native_x86_64 3s 5s -40%
keccak_f1600_x4_native_aarch64_v84a 3s 2s +50%
keccak_f1600_x4_native_avx2 3s 4s -25%
keccak_squeeze 3s 3s +0%
keccakf1600_xor_bytes 3s 2s +50%
keccakf1600x4_extract_bytes_native 3s 3s +0%
keccakf1600x4_xor_bytes 3s 3s +0%
mld_ct_cmask_neg_i32 3s 1s +200%
mld_ct_get_optblocker_i64 3s 1s +200%
mld_h 3s 2s +50%
mld_keccakf1600x4_xor_bytes_c 3s 3s +0%
mld_prepare_domain_separation_prefix 3s 2s +50%
mld_sample_s1_s2_serial 3s 3s +0%
nttunpack_native_x86_64 3s 3s +0%
pack_sig_c 3s 2s +50%
pack_sig_z 3s 2s +50%
pointwise_native_x86_64 3s 5s -40%
poly_caddq 3s 3s +0%
poly_chknorm 3s 6s -50%
poly_decompose 3s 3s +0%
poly_decompose_88_native_aarch64 3s 3s +0%
poly_decompose_c 3s 2s +50%
poly_invntt_tomont_native 3s 2s +50%
poly_pointwise_montgomery 3s 3s +0%
poly_uniform_eta 3s 4s -25%
polyt1_pack 3s 4s -25%
polyveck_caddq 3s 3s +0%
polyveck_ntt 3s 5s -40%
polyveck_reduce 3s 4s -25%
polyw1_pack 3s 3s +0%
polyz_unpack_17_native_aarch64 3s 3s +0%
polyz_unpack_19_native_aarch64 3s 5s -40%
reduce32 3s 3s +0%
rej_eta_c 3s 3s +0%
rej_uniform_eta_native_aarch64 3s 5s -40%
rej_uniform_native_aarch64 3s 5s -40%
shake128_squeeze 3s 4s -25%
shake128x4_squeezeblocks 3s 2s +50%
shake256_absorb 3s 3s +0%
shake256_finalize 3s 2s +50%
shake256_release 3s 4s -25%
shake256_squeeze 3s 3s +0%
shake256x4_squeezeblocks 3s 3s +0%
sign_signature 3s 3s +0%
sign_signature_extmu 3s 5s -40%
sign_signature_pre_hash_internal 3s 4s -25%
sign_verify 3s 3s +0%
sign_verify_pre_hash_internal 3s 3s +0%
unpack_sk 3s 3s +0%
unpack_sk_t0hat 3s 4s -25%
use_hint 3s 3s +0%
intt_native_aarch64 2s 5s -60%
keccak_f1600_x1_native_aarch64 2s 3s -33%
keccak_finalize 2s 2s +0%
keccak_init 2s 2s +0%
keccakf1600_extract_bytes (big endian) 2s 3s -33%
keccakf1600_permute_native 2s 4s -50%
keccakf1600_xor_bytes (big endian) 2s 2s +0%
mld_ct_cmask_nonzero_u8 2s 4s -50%
mld_ct_get_optblocker_u8 2s 3s -33%
mld_keccakf1600_extract_bytes 2s 3s -33%
mld_keccakf1600x4_extract_bytes_c 2s 3s -33%
ntt_native_x86_64 2s 3s -33%
pack_sk_s1 2s 3s -33%
poly_caddq_native 2s 4s -50%
poly_caddq_native_aarch64 2s 3s -33%
poly_caddq_native_x86_64 2s 5s -60%
poly_chknorm_native_x86_64 2s 3s -33%
poly_decompose_32_native_aarch64 2s 2s +0%
poly_ntt_c 2s 4s -50%
poly_permute_bitrev_to_custom_optional 2s 4s -50%
poly_permute_bitrev_to_custom_optional_native 2s 3s -33%
poly_shiftl 2s 5s -60%
poly_use_hint 2s 4s -50%
polyeta_pack 2s 3s -33%
polyt1_unpack 2s 2s +0%
polyveck_pack_w1 2s 3s -33%
polyvecl_pointwise_acc_montgomery 2s 3s -33%
polyvecl_uniform_gamma1_serial 2s 4s -50%
polyvecl_unpack_eta 2s 2s +0%
polyvecl_unpack_z 2s 2s +0%
polyz_unpack_native_x86_64 2s 1s +100%
rej_eta_native 2s 3s -33%
shake128_init 2s 2s +0%
shake256_init 2s 1s +100%
sig_unpack_hints 2s 2s +0%
sign_keypair_internal 2s 2s +0%
sk_s1hat_get_poly 2s 2s +0%
sk_t0hat_get_poly 2s 5s -60%
unpack_pk_t1 2s 4s -50%
unpack_sk_s2hat 2s 4s -50%
yvec_get_poly 2s 2s +0%
yvec_init 2s 1s +100%
decompose 1s 2s -50%
fqscale 1s 2s -50%
keccak_f1600_x1_native_aarch64_v84a 1s 2s -50%
keccak_f1600_x4_native_aarch64_v8a_scalar_hybrid 1s 2s -50%
keccak_f1600_x4_native_aarch64_v8a_v84a_scalar_hybrid 1s 2s -50%
keccakf1600x4_extract_bytes 1s 2s -50%
keccakf1600x4_permute 1s 2s -50%
mld_ct_get_optblocker_u32 1s 2s -50%
mld_ct_sel_int32 1s 2s -50%
mld_value_barrier_u8 1s 2s -50%
polyz_pack 1s 2s -50%
polyz_unpack 1s 4s -75%
polyz_unpack_native 1s 1s +0%
rej_eta 1s 3s -67%
shake256 1s 3s -67%
sys_check_capability 1s 1s +0%
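Because the times above are reported in whole seconds, percentage changes on proofs that take only a few seconds can be large without meaning much (for example, 1s to 5s reads as +400%). The sketch below parses rows in the format shown above and lists only those whose absolute change reaches a cut-off. The helper names and the 5-second cut-off are arbitrary choices for illustration, and the rows are a small sample copied from the table.

```python
def parse_row(line):
    """Split 'name 5s 1s +400%' into (name, current_s, previous_s or None)."""
    name, current, previous, _change = line.rsplit(" ", 3)
    cur = int(current.rstrip("s"))
    prev = None if previous == "-" else int(previous.rstrip("s"))
    return name, cur, prev


def pct(cur, prev):
    """Relative change in percent, rounded to one decimal place."""
    return round((cur - prev) / prev * 100, 1)


rows = [
    "**TOTAL** 1799s 1701s +5.8%",
    "mld_invntt_layer 280s 265s +6%",
    "rej_uniform_native_x86_64 37s - new",
    "poly_add 16s 10s +60%",
    "pack_sig_h 5s 1s +400%",
    "keccakf1600_extract_bytes (big endian) 2s 3s -33%",
]

MIN_DELTA_S = 5  # hypothetical cut-off in seconds

for name, cur, prev in map(parse_row, rows):
    if prev is None:
        print(f"{name}: new proof, {cur}s")
    elif abs(cur - prev) >= MIN_DELTA_S:
        print(f"{name}: {prev}s -> {cur}s ({pct(cur, prev):+}%)")
```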

oqs-bot commented Apr 3, 2026

CBMC Results (ML-DSA-87)

Full Results (206 proofs)
Proof Status Current Previous Change
**TOTAL** 2406s 2603s -7.6%
polyvecl_pointwise_acc_montgomery_c 314s 356s -12%
mld_invntt_layer 282s 318s -11%
polyvec_matrix_expand 219s 249s -12%
rej_uniform_native 149s 158s -6%
mld_attempt_signature_generation 104s 111s -6%
poly_pointwise_montgomery_c 101s 110s -8%
mld_ct_memcmp 65s 78s -17%
sign_signature_internal 63s 66s -5%
sign_verify_internal 63s 60s +5%
polyvec_matrix_expand_serial 48s 48s +0%
mld_ntt_layer 44s 45s -2%
fqmul 43s 43s +0%
rej_uniform_native_x86_64 38s - new
compute_pack_t0_t1 32s 35s -9%
polyvec_matrix_pointwise_montgomery_yvec 30s 32s -6%
keccakf1600x4_permute_native 25s 23s +9%
rej_uniform 23s 23s +0%
mld_ntt_butterfly_block 22s 22s +0%
poly_chknorm_c 21s 21s +0%
mld_check_pct 16s 16s +0%
polyeta_unpack 14s 18s -22%
polyt0_unpack 14s 19s -26%
poly_uniform_eta_4x 12s 15s -20%
rej_uniform_c 12s 15s -20%
polyveck_decompose 11s 10s +10%
pointwise_acc_native_aarch64 10s 7s +43%
poly_add 10s 11s -9%
poly_uniform_4x 10s 12s -17%
keccak_absorb_once_x4 9s 10s -10%
polyveck_invntt_tomont 9s 8s +12%
keccak_absorb 8s 7s +14%
pointwise_acc_native_x86_64 8s 8s +0%
poly_decompose_c 8s 8s +0%
poly_invntt_tomont_c 8s 9s -11%
polyveck_caddq 8s 7s +14%
polyvecl_ntt 8s 10s -20%
mld_sample_s1_s2_serial 7s 9s -22%
polyveck_ntt 7s 7s +0%
sign 7s 8s -12%
sign_verify_pre_hash_internal 7s 6s +17%
unpack_sk_t0hat 7s 6s +17%
mld_compute_pack_z 6s 7s -14%
mld_keccakf1600_permute_c 6s 5s +20%
montgomery_reduce 6s 2s +200%
pack_sig_h 6s 4s +50%
poly_caddq_c 6s 5s +20%
poly_power2round 6s 5s +20%
polyeta_pack 6s 3s +100%
polyveck_pack_eta 6s 2s +200%
polyz_unpack_c 6s 5s +20%
sign_keypair_internal 6s 5s +20%
sign_verify_pre_hash_shake256 6s 7s -14%
mld_sample_s1_s2 5s 6s -17%
poly_caddq 5s 5s +0%
poly_challenge 5s 4s +25%
poly_permute_bitrev_to_custom_optional_native 5s 4s +25%
poly_pointwise_montgomery 5s 4s +25%
polyt0_pack 5s 2s +150%
polyt1_pack 5s 3s +67%
polyt1_unpack 5s 7s -29%
polyveck_chknorm 5s 8s -38%
polyveck_pack_w1 5s 3s +67%
polyveck_unpack_eta 5s 5s +0%
polyvecl_chknorm 5s 8s -38%
polyz_pack 5s 3s +67%
sign_open 5s 5s +0%
sign_pk_from_sk 5s 7s -29%
yvec_init 5s 2s +150%
intt_native_aarch64 4s 4s +0%
keccak_squeezeblocks_x4 4s 3s +33%
keccakf1600_extract_bytes (big endian) 4s 5s -20%
keccakf1600_xor_bytes (big endian) 4s 3s +33%
keccakf1600x4_extract_bytes_native 4s 4s +0%
pack_sk_rho_key_tr_s2 4s 6s -33%
poly_caddq_native_aarch64 4s 3s +33%
poly_chknorm 4s 4s +0%
poly_decompose_native 4s 5s -20%
poly_shiftl 4s 5s -20%
poly_sub 4s 5s -20%
poly_use_hint_c 4s 3s +33%
poly_use_hint_native_aarch64 4s 1s +300%
polyvecl_pointwise_acc_montgomery_native 4s 3s +33%
polyvecl_uniform_gamma1_serial 4s 3s +33%
polyw1_pack_32 4s 3s +33%
polyz_unpack_17_native_aarch64 4s 4s +0%
polyz_unpack_native_x86_64 4s 5s -20%
rej_eta 4s 2s +100%
rej_eta_native 4s 4s +0%
rej_uniform_native_aarch64 4s 5s -20%
shake128x4_absorb_once 4s 3s +33%
shake256_release 4s 3s +33%
shake256_squeeze 4s 2s +100%
sign_signature 4s 4s +0%
sign_signature_extmu 4s 4s +0%
sign_signature_pre_hash_internal 4s 5s -20%
sign_signature_pre_hash_shake256 4s 5s -20%
unpack_sk_s1hat 4s 4s +0%
use_hint 4s 3s +33%
caddq 3s 3s +0%
decompose 3s 2s +50%
intt_native_x86_64 3s 3s +0%
keccak_f1600_x4_native_aarch64_v8a_scalar_hybrid 3s 1s +200%
keccak_finalize 3s 1s +200%
keccak_squeeze 3s 2s +50%
keccakf1600_permute 3s 2s +50%
keccakf1600x4_permute 3s 3s +0%
keccakf1600x4_xor_bytes_native 3s 2s +50%
make_hint 3s 4s -25%
mld_h 3s 3s +0%
mld_keccakf1600_extract_bytes 3s 2s +50%
mld_keccakf1600x4_extract_bytes_c 3s 1s +200%
mld_keccakf1600x4_xor_bytes_c 3s 4s -25%
mld_polymat_expand_entry 3s 4s -25%
mld_prepare_domain_separation_prefix 3s 2s +50%
mld_value_barrier_u8 3s 3s +0%
ntt_native_aarch64 3s 4s -25%
pack_sig_c 3s 3s +0%
pack_sk_s1 3s 4s -25%
pointwise_native_aarch64 3s 5s -40%
poly_caddq_native_x86_64 3s 3s +0%
poly_decompose_88_native_aarch64 3s 5s -40%
poly_ntt_native 3s 4s -25%
poly_permute_bitrev_to_custom_optional 3s 4s -25%
poly_pointwise_montgomery_native 3s 3s +0%
poly_reduce 3s 2s +50%
poly_uniform_eta 3s 2s +50%
poly_uniform_gamma1 3s 3s +0%
poly_use_hint 3s 2s +50%
polyveck_reduce 3s 2s +50%
polyw1_pack 3s 4s -25%
polyz_unpack 3s 4s -25%
rej_eta_c 3s 4s -25%
shake128_squeeze 3s 2s +50%
shake256_finalize 3s 1s +200%
shake256x4_squeezeblocks 3s 2s +50%
sig_unpack_hints 3s 5s -40%
sign_keypair 3s 4s -25%
sign_verify 3s 6s -50%
sign_verify_extmu 3s 5s -40%
unpack_pk_t1 3s 5s -40%
unpack_sk 3s 4s -25%
fqscale 2s 4s -50%
keccak_f1600_x1_native_aarch64 2s 1s +100%
keccak_f1600_x1_native_aarch64_v84a 2s 2s +0%
keccak_f1600_x4_native_aarch64_v8a_v84a_scalar_hybrid 2s 4s -50%
keccak_f1600_x4_native_avx2 2s 4s -50%
keccak_init 2s 5s -60%
keccakf1600_permute_native 2s 5s -60%
mld_ct_abs_i32 2s 4s -50%
mld_ct_cmask_neg_i32 2s 4s -50%
mld_ct_cmask_nonzero_u32 2s 2s +0%
mld_ct_cmask_nonzero_u8 2s 4s -50%
mld_ct_get_optblocker_i64 2s 4s -50%
mld_ct_get_optblocker_u32 2s 4s -50%
mld_ct_get_optblocker_u8 2s 5s -60%
mld_value_barrier_i64 2s 3s -33%
ntt_native_x86_64 2s 3s -33%
nttunpack_native_x86_64 2s 2s +0%
pack_sig_z 2s 3s -33%
pointwise_native_x86_64 2s 2s +0%
poly_caddq_native 2s 3s -33%
poly_chknorm_native 2s 4s -50%
poly_chknorm_native_aarch64 2s 4s -50%
poly_chknorm_native_x86_64 2s 6s -67%
poly_decompose_32_native_aarch64 2s 5s -60%
poly_invntt_tomont 2s 2s +0%
poly_invntt_tomont_native 2s 2s +0%
poly_uniform 2s 5s -60%
poly_uniform_gamma1_4x 2s 4s -50%
poly_use_hint_native 2s 3s -33%
polyvec_matrix_pointwise_montgomery_row 2s 2s +0%
polyvecl_pack_eta 2s 4s -50%
polyvecl_pointwise_acc_montgomery 2s 2s +0%
polyvecl_uniform_gamma1 2s 5s -60%
polyvecl_unpack_eta 2s 4s -50%
polyvecl_unpack_z 2s 3s -33%
polyz_unpack_19_native_aarch64 2s 2s +0%
polyz_unpack_native 2s 3s -33%
reduce32 2s 2s +0%
rej_uniform_eta_native_aarch64 2s 3s -33%
shake128_absorb 2s 3s -33%
shake128_finalize 2s 1s +100%
shake128x4_squeezeblocks 2s 2s +0%
shake256 2s 2s +0%
shake256x4_absorb_once 2s 4s -50%
sk_s1hat_get_poly 2s 2s +0%
sk_s2hat_get_poly 2s 4s -50%
sk_t0hat_get_poly 2s 2s +0%
sys_check_capability 2s 4s -50%
unpack_sk_s2hat 2s 2s +0%
yvec_get_poly 2s 3s -33%
keccak_f1600_x4_native_aarch64_v84a 1s 4s -75%
keccakf1600_xor_bytes 1s 4s -75%
keccakf1600x4_extract_bytes 1s 2s -50%
keccakf1600x4_xor_bytes 1s 3s -67%
mld_ct_sel_int32 1s 4s -75%
mld_value_barrier_u32 1s 3s -67%
poly_decompose 1s 2s -50%
poly_ntt 1s 3s -67%
poly_ntt_c 1s 2s -50%
polyw1_pack_88 1s 3s -67%
power2round 1s 5s -80%
shake128_init 1s 3s -67%
shake128_release 1s 2s -50%
shake256_absorb 1s 2s -50%
shake256_init 1s 2s -50%

oqs-bot commented Apr 3, 2026

CBMC Results (ML-DSA-65)

Full Results (206 proofs)
Proof Status Current Previous Change
**TOTAL** 2300s 2198s +4.6%
mld_invntt_layer 324s 314s +3%
polyvecl_pointwise_acc_montgomery_c 246s 233s +6%
rej_uniform_native 158s 159s -1%
polyvec_matrix_expand 141s 134s +5%
poly_pointwise_montgomery_c 116s 111s +5%
mld_ct_memcmp 76s 71s +7%
mld_attempt_signature_generation 70s 69s +1%
sign_verify_internal 67s 64s +5%
sign_signature_internal 52s 50s +4%
mld_ntt_layer 46s 45s +2%
fqmul 44s 43s +2%
rej_uniform_native_x86_64 40s - new
polyvec_matrix_expand_serial 27s 26s +4%
mld_ntt_butterfly_block 25s 22s +14%
keccakf1600x4_permute_native 24s 23s +4%
rej_uniform 24s 24s +0%
poly_chknorm_c 23s 22s +5%
mld_check_pct 19s 16s +19%
polyvecl_chknorm 19s 17s +12%
polyt0_unpack 18s 18s +0%
compute_pack_t0_t1 16s 17s -6%
rej_uniform_c 15s 13s +15%
polyveck_decompose 14s 15s -7%
keccak_absorb_once_x4 12s 11s +9%
poly_uniform_4x 12s 10s +20%
poly_uniform_eta_4x 12s 12s +0%
poly_add 11s 10s +10%
mld_keccakf1600_permute_c 10s 6s +67%
polyvec_matrix_pointwise_montgomery_yvec 10s 8s +25%
polyveck_chknorm 10s 11s -9%
poly_caddq_c 9s 7s +29%
poly_invntt_tomont_c 9s 10s -10%
polyveck_ntt 9s 11s -18%
polyvecl_ntt 9s 10s -10%
keccak_absorb 8s 7s +14%
mld_compute_pack_z 8s 10s -20%
sign 8s 7s +14%
mld_sample_s1_s2 7s 7s +0%
polyveck_caddq 7s 5s +40%
polyveck_invntt_tomont 7s 8s -12%
polyz_unpack_c 7s 7s +0%
sign_keypair_internal 7s 5s +40%
keccak_squeeze 6s 2s +200%
pointwise_acc_native_aarch64 6s 7s -14%
pointwise_acc_native_x86_64 6s 7s -14%
poly_challenge 6s 5s +20%
poly_pointwise_montgomery_native 6s 2s +200%
poly_power2round 6s 6s +0%
poly_reduce 6s 2s +200%
rej_uniform_eta_native_aarch64 6s 5s +20%
sign_open 6s 6s +0%
sign_pk_from_sk 6s 7s -14%
sign_verify_pre_hash_shake256 6s 2s +200%
unpack_sk_t0hat 6s 7s -14%
fqscale 5s 3s +67%
mld_sample_s1_s2_serial 5s 5s +0%
poly_decompose_c 5s 6s -17%
poly_uniform_gamma1 5s 2s +150%
poly_use_hint_native_aarch64 5s 2s +150%
polyvecl_unpack_eta 5s 4s +25%
polyw1_pack_32 5s 3s +67%
power2round 5s 3s +67%
sign_verify 5s 3s +67%
sk_t0hat_get_poly 5s 2s +150%
unpack_pk_t1 5s 4s +25%
unpack_sk_s1hat 5s 3s +67%
keccak_f1600_x1_native_aarch64_v84a 4s 2s +100%
keccak_squeezeblocks_x4 4s 3s +33%
keccakf1600x4_permute 4s 4s +0%
mld_ct_abs_i32 4s 1s +300%
mld_ct_cmask_nonzero_u32 4s 3s +33%
pack_sk_s1 4s 3s +33%
pointwise_native_aarch64 4s 3s +33%
poly_chknorm 4s 3s +33%
poly_chknorm_native 4s 6s -33%
poly_decompose_native 4s 5s -20%
poly_invntt_tomont_native 4s 2s +100%
poly_permute_bitrev_to_custom_optional 4s 2s +100%
poly_permute_bitrev_to_custom_optional_native 4s 4s +0%
poly_use_hint 4s 3s +33%
poly_use_hint_c 4s 4s +0%
poly_use_hint_native 4s 3s +33%
polyt1_unpack 4s 1s +300%
polyveck_pack_eta 4s 2s +100%
polyveck_unpack_eta 4s 4s +0%
polyvecl_pointwise_acc_montgomery 4s 4s +0%
polyw1_pack_88 4s 1s +300%
shake128_finalize 4s 2s +100%
shake128_squeeze 4s 2s +100%
shake256_finalize 4s 2s +100%
shake256_init 4s 2s +100%
shake256_release 4s 2s +100%
sign_keypair 4s 4s +0%
sign_signature 4s 5s -20%
sign_signature_pre_hash_shake256 4s 6s -33%
sign_verify_extmu 4s 4s +0%
sk_s2hat_get_poly 4s 3s +33%
unpack_sk 4s 4s +0%
yvec_init 4s 3s +33%
intt_native_aarch64 3s 4s -25%
intt_native_x86_64 3s 3s +0%
keccak_f1600_x4_native_aarch64_v84a 3s 2s +50%
keccak_f1600_x4_native_aarch64_v8a_scalar_hybrid 3s 2s +50%
keccakf1600_extract_bytes (big endian) 3s 3s +0%
keccakf1600_permute_native 3s 4s -25%
keccakf1600_xor_bytes 3s 1s +200%
keccakf1600x4_extract_bytes_native 3s 2s +50%
make_hint 3s 3s +0%
mld_ct_cmask_nonzero_u8 3s 5s -40%
mld_ct_get_optblocker_u32 3s 2s +50%
mld_ct_sel_int32 3s 1s +200%
mld_prepare_domain_separation_prefix 3s 4s -25%
mld_value_barrier_u8 3s 3s +0%
ntt_native_aarch64 3s 2s +50%
ntt_native_x86_64 3s 4s -25%
nttunpack_native_x86_64 3s 7s -57%
pack_sig_h 3s 3s +0%
pack_sig_z 3s 5s -40%
pack_sk_rho_key_tr_s2 3s 1s +200%
pointwise_native_x86_64 3s 3s +0%
poly_caddq 3s 2s +50%
poly_caddq_native_aarch64 3s 3s +0%
poly_caddq_native_x86_64 3s 4s -25%
poly_chknorm_native_aarch64 3s 2s +50%
poly_chknorm_native_x86_64 3s 4s -25%
poly_invntt_tomont 3s 2s +50%
poly_ntt 3s 2s +50%
poly_ntt_c 3s 3s +0%
poly_ntt_native 3s 3s +0%
poly_shiftl 3s 4s -25%
poly_uniform 3s 4s -25%
poly_uniform_eta 3s 4s -25%
polyeta_pack 3s 3s +0%
polyt1_pack 3s 2s +50%
polyveck_pack_w1 3s 2s +50%
polyvecl_pack_eta 3s 3s +0%
polyvecl_pointwise_acc_montgomery_native 3s 3s +0%
polyvecl_unpack_z 3s 5s -40%
polyw1_pack 3s 5s -40%
polyz_unpack_19_native_aarch64 3s 4s -25%
reduce32 3s 3s +0%
rej_eta 3s 3s +0%
rej_eta_c 3s 1s +200%
rej_eta_native 3s 4s -25%
rej_uniform_native_aarch64 3s 3s +0%
shake128x4_absorb_once 3s 3s +0%
shake256_squeeze 3s 2s +50%
shake256x4_absorb_once 3s 1s +200%
sig_unpack_hints 3s 3s +0%
sign_signature_extmu 3s 5s -40%
sign_signature_pre_hash_internal 3s 4s -25%
sign_verify_pre_hash_internal 3s 4s -25%
sk_s1hat_get_poly 3s 1s +200%
unpack_sk_s2hat 3s 5s -40%
use_hint 3s 5s -40%
yvec_get_poly 3s 2s +50%
caddq 2s 4s -50%
decompose 2s 2s +0%
keccak_f1600_x4_native_aarch64_v8a_v84a_scalar_hybrid 2s 1s +100%
keccak_finalize 2s 3s -33%
keccak_init 2s 3s -33%
keccakf1600_permute 2s 1s +100%
keccakf1600_xor_bytes (big endian) 2s 3s -33%
keccakf1600x4_extract_bytes 2s 2s +0%
mld_ct_get_optblocker_i64 2s 1s +100%
mld_ct_get_optblocker_u8 2s 3s -33%
mld_h 2s 4s -50%
mld_keccakf1600_extract_bytes 2s 3s -33%
mld_keccakf1600x4_extract_bytes_c 2s 2s +0%
mld_keccakf1600x4_xor_bytes_c 2s 3s -33%
mld_polymat_expand_entry 2s 3s -33%
montgomery_reduce 2s 3s -33%
pack_sig_c 2s 5s -60%
poly_caddq_native 2s 3s -33%
poly_decompose_32_native_aarch64 2s 6s -67%
poly_decompose_88_native_aarch64 2s 2s +0%
poly_pointwise_montgomery 2s 4s -50%
poly_sub 2s 4s -50%
poly_uniform_gamma1_4x 2s 7s -71%
polyeta_unpack 2s 4s -50%
polyt0_pack 2s 5s -60%
polyvec_matrix_pointwise_montgomery_row 2s 5s -60%
polyveck_reduce 2s 4s -50%
polyvecl_uniform_gamma1 2s 2s +0%
polyvecl_uniform_gamma1_serial 2s 2s +0%
polyz_pack 2s 3s -33%
polyz_unpack 2s 2s +0%
polyz_unpack_17_native_aarch64 2s 2s +0%
polyz_unpack_native_x86_64 2s 4s -50%
shake128_absorb 2s 1s +100%
shake128_init 2s 2s +0%
shake128x4_squeezeblocks 2s 4s -50%
shake256 2s 4s -50%
shake256_absorb 2s 4s -50%
shake256x4_squeezeblocks 2s 2s +0%
keccak_f1600_x1_native_aarch64 1s 2s -50%
keccak_f1600_x4_native_avx2 1s 2s -50%
keccakf1600x4_xor_bytes 1s 3s -67%
keccakf1600x4_xor_bytes_native 1s 2s -50%
mld_ct_cmask_neg_i32 1s 2s -50%
mld_value_barrier_i64 1s 2s -50%
mld_value_barrier_u32 1s 2s -50%
poly_decompose 1s 4s -75%
polyz_unpack_native 1s 5s -80%
shake128_release 1s 2s -50%
sys_check_capability 1s 5s -80%

@oqs-bot oqs-bot left a comment

Contributor

Intel Xeon 4th gen (c7i)

Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 34764 cycles | 34374 cycles | 1.01 |
| ML-DSA-44 sign | 120113 cycles | 120132 cycles | 1.00 |
| ML-DSA-44 verify | 38092 cycles | 38166 cycles | 1.00 |
| ML-DSA-65 keypair | 61138 cycles | 60500 cycles | 1.01 |
| ML-DSA-65 sign | 201844 cycles | 199945 cycles | 1.01 |
| ML-DSA-65 verify | 62783 cycles | 62429 cycles | 1.01 |
| ML-DSA-87 keypair | 93501 cycles | 94486 cycles | 0.99 |
| ML-DSA-87 sign | 236815 cycles | 239500 cycles | 0.99 |
| ML-DSA-87 verify | 95619 cycles | 96894 cycles | 0.99 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Contributor

Intel Xeon 4th gen (c7i) (no-opt)

Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 93930 cycles | 93842 cycles | 1.00 |
| ML-DSA-44 sign | 333310 cycles | 333119 cycles | 1.00 |
| ML-DSA-44 verify | 100022 cycles | 100025 cycles | 1.00 |
| ML-DSA-65 keypair | 159902 cycles | 160115 cycles | 1.00 |
| ML-DSA-65 sign | 543114 cycles | 543227 cycles | 1.00 |
| ML-DSA-65 verify | 160989 cycles | 161060 cycles | 1.00 |
| ML-DSA-87 keypair | 266666 cycles | 266874 cycles | 1.00 |
| ML-DSA-87 sign | 704974 cycles | 706010 cycles | 1.00 |
| ML-DSA-87 verify | 270510 cycles | 269779 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Contributor

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'AMD EPYC 4th gen (c7a)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 42142 cycles | 40662 cycles | 1.04 |

This comment was automatically generated by workflow using github-action-benchmark.

@jakemas jakemas force-pushed the jakemas/rej-uniform-asm branch 6 times, most recently from 97ef0aa to b6e6f76 Compare June 11, 2026 18:57
@jakemas

jakemas commented Jun 11, 2026

Contributor Author

Updated to follow the format introduced in d417b6b

@jakemas jakemas force-pushed the jakemas/rej-uniform-asm branch 2 times, most recently from 24cfdd8 to a50739c Compare June 12, 2026 20:37

@mkannwischer mkannwischer left a comment

Contributor

Thanks @jakemas and sorry for the long silence.
The core of the proof looks good to me - I have a few more comments/questions below.

Comment thread scripts/autogen Outdated
Comment on lines +1331 to +1344
def gen_avx2_hol_light_rej_uniform_table():
    """Emit the HOL Light byte-list form of the AVX2 rej_uniform lookup table.
    Mirrors mlkem-native's gen_aarch64_hol_light_rej_uniform_table pattern."""

    def get_set_bits_idxs(i):
        bits = list(map(int, format(i, "08b")))
        bits.reverse()
        return [bit_idx for bit_idx in range(8) if bits[bit_idx] == 1]

    def gen_rows():
        for i in range(256):
            idxs = get_set_bits_idxs(i)
            idxs = idxs + [0] * (8 - len(idxs))
            yield idxs

Contributor

This function should just call the one used to generate the actual constants (gen_avx2_rej_uniform_table_rows) so we can be sure they are the same.

Contributor Author

done
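
The point of the request above is that the proof-side table and the code-side table are only known to agree if both are emitted from the same rows. A minimal sketch of that shape follows. The row logic is taken from the snippet under review; the body given to `gen_avx2_rej_uniform_table_rows` and the two `emit_*` helpers with their output formats are assumptions for illustration, not the actual scripts/autogen code.

```python
def get_set_bits_idxs(i):
    """Indices (least significant bit first) of the bits set in the 8-bit value i."""
    return [bit_idx for bit_idx in range(8) if (i >> bit_idx) & 1]


def gen_avx2_rej_uniform_table_rows():
    """Single source of truth: one zero-padded 8-entry row per 8-bit mask.

    Assumed body; the real generator lives in scripts/autogen.
    """
    for i in range(256):
        idxs = get_set_bits_idxs(i)
        yield idxs + [0] * (8 - len(idxs))


def emit_c_rows():
    """Hypothetical C emitter: one brace-initialized row per mask."""
    return ["{" + ", ".join(map(str, row)) + "}"
            for row in gen_avx2_rej_uniform_table_rows()]


def emit_hol_light_bytes():
    """Hypothetical HOL Light emitter: flat list of byte literals."""
    return ["word %d" % b
            for row in gen_avx2_rej_uniform_table_rows() for b in row]
```

Because both emitters iterate over the same generator, any change to the row logic shows up in the C constants and in the HOL Light definition at once.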

Comment thread mldsa/src/native/x86_64/meta.h Outdated
  /* Safety: outlen is at most MLDSA_N and, hence, this cast is safe. */
- return (int)mld_rej_uniform_avx2(r, buf);
+ return (int)mld_rej_uniform_avx2_asm(r, buf,
+                                      (const uint8_t *)mld_rej_uniform_table);

Contributor

can we declare mld_rej_uniform_table as a flat array so this cast here is not needed?

Contributor Author

done

__contract__(
requires(memory_no_alias(r, sizeof(int32_t) * MLDSA_N))
requires(memory_no_alias(buf, 840))
requires(table == (const uint8_t *)mld_rej_uniform_table)

Contributor

can we declare mld_rej_uniform_table as a flat array so this cast here is not needed?

Contributor Author

done

[ARITH_TAC; ALL_TAC] THEN
SUBGOAL_THEN `4 * (curlen + 1) = 4 * curlen + 4` SUBST1_TAC THENL
[ARITH_TAC; ALL_TAC] THEN
FIRST_ASSUM(fun th ->

Contributor

Can we remove the debug Printf.printf statements or do they still serve any purpose?

Contributor Author

yes, done

[ARITH_TAC; ALL_TAC] THEN
SUBGOAL_THEN `4 * (curlen + 1) = 4 * curlen + 4` SUBST1_TAC THENL
[ARITH_TAC; ALL_TAC] THEN
FIRST_ASSUM(fun th ->

Contributor

I'm confused. Why are there separate _MEMSAFE and _SAFE proofs? Shouldn't this proof only have the _MEMSAFE theorems?

Contributor Author

changed

Comment thread dev/x86_64/src/rej_uniform_avx2_asm.S Outdated
Comment on lines +65 to +66
* Low 128 bits: bytes [0..15] (original 64-bit lanes 0, 1)
* High 128 bits: bytes [8..23] (original 64-bit lanes 1, 2)

Contributor

Nit: The comment is slightly confusing by itself, because the semantic granularity of the input is in bytes, not 64-bit. 64-bit granularity comes from vpermq, but this isn't mentioned here.

Suggestion:

* vpermq with 0x94(=0b 10 01 01 00) permutes 64-bit lanes via (0,1,2,3) -> (0,1,1,2).
* The loaded 32 bytes are thus rearranged as:
*   Low  128 bits: bytes [0..15] of original 32-byte
*   High 128 bits: bytes [8..23] of original 32-byte

Contributor Author

done
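
The lane permutation in the suggested comment can be checked with a small model of the instruction. This is a sketch in Python rather than assembly; `vpermq` here is a hypothetical helper that mimics the immediate form of VPERMQ on a 32-byte value, with lane 0 holding the lowest-addressed bytes.

```python
def vpermq(src32: bytes, imm8: int) -> bytes:
    """Model of VPERMQ with an 8-bit immediate on a 32-byte value.

    Destination 64-bit lane i takes source lane (imm8 >> 2*i) & 3.
    """
    assert len(src32) == 32
    lanes = [src32[8 * k: 8 * k + 8] for k in range(4)]
    return b"".join(lanes[(imm8 >> (2 * i)) & 3] for i in range(4))


loaded = bytes(range(32))        # each byte's value equals its offset in the load
permuted = vpermq(loaded, 0x94)  # 0x94 = 0b10_01_01_00 -> lanes (0, 1, 1, 2)
low128, high128 = permuted[:16], permuted[16:]
```

With this input, `low128` holds offsets 0..15 and `high128` holds offsets 8..23, so each 128-bit half contains the 12 bytes (four 3-byte samples) it is responsible for: bytes 0..11 in the low half and bytes 12..23 in the high half.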

Comment thread dev/x86_64/src/rej_uniform_avx2_asm.S Outdated
Comment on lines +84 to +95
// Construct broadcast constants
movl $0x7FFFFF, good
vmovd good, %xmm1
vpbroadcastd %xmm1, mask // mask: 23-bit extraction

movl $8380417, good // MLDSA_Q
vmovd good, %xmm2
vpbroadcastd %xmm2, bound // bound: rejection threshold

// Initialize counters
xorl ctr, ctr // ctr = 0
xorl pos, pos // pos = 0

Contributor

Nit nit: indentation of comments. Also, can we use /* .. */ style? I know this is wrong in many other ASM files, but when we touch/add code we may as well adjust it.

Contributor Author

done


/*
* Main SIMD loop: process 24 input bytes into up to 8 coefficients
* per iteration. Loops while ctr <= MLDSA_N - 8 and pos <= BUFLEN - 32.

@hanno-becker hanno-becker Jun 15, 2026

Contributor

It's noteworthy that this approach differs from the mlkem-native rejection sampling, which buffers into a temporary slightly oversized stack buffer, to avoid the scalar tail loop.

I don't know which is better, nor is it necessary that we align this now, but just noting.

My gut feeling is that it's probably better the way it's done here.

Comment thread dev/x86_64/src/rej_uniform_avx2_asm.S Outdated
rej_uniform_avx2_asm_loop:
cmpl $248, ctr // MLDSA_N - 8
ja rej_uniform_avx2_asm_scalar
cmpl $808, pos // MLD_AVX2_REJ_UNIFORM_BUFLEN - 32

Contributor

Nit: Can you use the macro constant instead of a comment here?

Contributor Author

done

Comment thread dev/x86_64/src/rej_uniform_avx2_asm.S Outdated
Comment on lines +116 to +118
popcntl good, cnt // count valid coefficients

vmovq (tab, %r8, 8), %xmm4 // load permutation from table[good]

@hanno-becker hanno-becker Jun 15, 2026

Contributor

Nit: this mixes the symbolic good with the raw r8. Can we make the notation consistent (ideally symbolic names only)? If you need variants of the same register, you can do something like name_x, name_y, ...

Contributor Author

done

Comment thread dev/x86_64/src/rej_uniform_avx2_asm.S Outdated
vpmovzxbd %xmm4, cmp_result // zero-extend to 8 dword indices
vpermd data, cmp_result, data // compact valid coefficients to front

vmovdqu data, (out, %rax, 4) // store at r[ctr]

Contributor

Same comment as above: Let's stick to symbolic register names.

Contributor Author

done
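
For readers following the snippets in these threads (mask, bound, popcnt, table lookup, vpermd, store), one iteration of the vector loop can be modelled roughly as below. This is a Python sketch of the data flow only, not the assembly; `table_row` and `simd_iteration` are hypothetical names, and the table row layout is assumed to match the generator discussed earlier in the review.

```python
MLDSA_Q = 8380417


def table_row(good):
    """Assumed table layout: indices of set bits of `good`, zero-padded to 8."""
    idxs = [k for k in range(8) if (good >> k) & 1]
    return idxs + [0] * (8 - len(idxs))


def simd_iteration(buf, pos):
    """Model one iteration: 24 input bytes -> (8 output lanes, number of valid lanes)."""
    # Eight little-endian 3-byte candidates, masked to 23 bits.
    cand = [int.from_bytes(buf[pos + 3 * k: pos + 3 * k + 3], "little") & 0x7FFFFF
            for k in range(8)]
    # Bit k of `good` is set when candidate k is below the rejection bound Q.
    good = sum(1 << k for k, c in enumerate(cand) if c < MLDSA_Q)
    cnt = bin(good).count("1")         # popcnt
    perm = table_row(good)             # table[good], zero-extended to dword indices
    lanes = [cand[j] for j in perm]    # vpermd: output lane i = cand[perm[i]]
    return lanes, cnt
```

Only the first `cnt` lanes are meaningful. All 8 lanes are still stored at r[ctr] and ctr then advances by `cnt`, which is why the loop needs ctr <= MLDSA_N - 8 for the full 32-byte store to stay inside r.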

Comment thread dev/x86_64/src/rej_uniform_avx2_asm.S Outdated
rej_uniform_avx2_asm_scalar:
cmpl $256, ctr // MLDSA_N
jae rej_uniform_avx2_asm_done
cmpl $837, pos // MLD_AVX2_REJ_UNIFORM_BUFLEN - 3

Contributor

This can never fire, can it? If so, can we remove it?

@jakemas jakemas Jun 15, 2026

Contributor Author

Here's the reasoning:

The scalar loop bumps pos += 3 on every sample, but ctr only increments on an accepted one (the pos advance happens before the >= Q reject check). So in a rejection-heavy tail, pos keeps climbing while ctr stalls.

Concretely: the buffer is 840 bytes, i.e. exactly 280 three-byte samples, and we need 256 acceptances, so we can absorb at most 24 rejections before running dry with ctr still < 256. pos advances in steps of 24 from 0 and the main loop only runs while pos <= 808, so its last iteration starts at pos = 792 and we can enter the scalar loop with pos as high as 816. From there, if the tail keeps rejecting, pos walks 816 → 819 → ... → 837 → 840, and at 840 the next 3-byte read would touch buf[840..842], past the end of the 840-byte buffer. The cmpl $SCALAR_POS_BOUND, pos; ja done is what stops that.

It's also load-bearing in the proof: the loop exit is disjunctive (hit 256 coefficients or exhaust the buffer), and the memory-safety argument for the 3-byte load depends on pos <= 837, which this branch establishes. Removing it reintroduces the over-read on adversarial input and breaks the MEMSAFE proof.

Contributor Author

Worth noting that the removed intrinsics implementation had the identical guard as its scalar loop condition:

while (ctr < MLDSA_N && pos <= MLD_AVX2_REJ_UNIFORM_BUFLEN - 3)

which is exactly the pos <= 837 that cmpl $SCALAR_POS_BOUND, pos; ja done implements. The asm didn't add the check, it preserves what was already there.

The header comment on that file also spells out why it has to be there:

The pqcrystals implementation assumes a buffer that is 8 bytes larger as the first loop overreads by 8 bytes that are then discarded. We instead do not pad the buffer and do not overread.

So the bound is load-bearing by design: it's what lets us run on the unpadded 840 byte buffer without over-reading.
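
The bound argument can be sanity-checked with a small model that tracks only offsets. This is a sketch under stated assumptions: `simulate` and its `accept` predicate are hypothetical, the main loop is modelled as one 32-byte load consuming 24 bytes per iteration, and the scalar loop as one 3-byte load per sample, as described in this thread.

```python
MLDSA_N = 256
BUFLEN = 840  # MLD_AVX2_REJ_UNIFORM_BUFLEN in the discussion above


def simulate(accept, scalar_guard=True):
    """Return (ctr, max_end), where max_end is the highest exclusive end offset read.

    accept(offset) -> bool says whether the 3-byte sample at `offset` is accepted.
    """
    ctr = pos = max_end = 0
    # Main loop: 32-byte load, 8 samples (24 bytes) consumed per iteration.
    while ctr <= MLDSA_N - 8 and pos <= BUFLEN - 32:
        max_end = max(max_end, pos + 32)
        ctr += sum(1 for k in range(8) if accept(pos + 3 * k))
        pos += 24
    # Scalar tail: pos advances on every sample, ctr only on acceptance.
    while ctr < MLDSA_N:
        if scalar_guard and pos > BUFLEN - 3:
            break
        max_end = max(max_end, pos + 3)
        if max_end > BUFLEN:
            break  # model only: stop at the first over-read
        ctr += 1 if accept(pos) else 0
        pos += 3
    return ctr, max_end


guarded = simulate(lambda off: False)                        # all-rejecting input
unguarded = simulate(lambda off: False, scalar_guard=False)  # same input, no pos check
```

In this model the guarded run's last load ends exactly at byte 840, while the unguarded run reads past the buffer on the same input, which is the over-read the pos check prevents.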

@hanno-becker hanno-becker left a comment

Contributor

Thanks @jakemas, and sorry this is again taking so long.

In addition to @mkannwischer's important comment that the constant tables should come from the same source (that's our only argument for why they are consistent across proof and code -- we have no other check!), I believe there's a dead branch in the rejection sampling that can be removed.

@jakemas jakemas force-pushed the jakemas/rej-uniform-asm branch 2 times, most recently from ed868d0 to fb75556 Compare June 15, 2026 21:07
jakemas added a commit that referenced this pull request Jun 16, 2026
…sembly

Add hand-written x86_64 AVX2 assembly for rej_uniform_eta2 and
rej_uniform_eta4, following the rej_uniform approach in #1014: the table
is passed as a parameter and all constants are built from immediates (no
.rodata), enabling future HOL-Light verification. Wire eta4 to the new
asm in meta.h, add the asm entry points and contracts in
arith_native_x86_64.h, register the bytecode dump targets in autogen and
the Makefile, and add a poly_uniform_eta_4x component benchmark.

Signed-off-by: jake massimo <jakemas@amazon.com>
jakemas added a commit that referenced this pull request Jun 16, 2026
…sembly

Add hand-written x86_64 AVX2 assembly for rej_uniform_eta2 and
rej_uniform_eta4, following the rej_uniform approach in #1014: the table
is passed as a parameter and all constants are built from immediates (no
.rodata), enabling future HOL-Light verification. Wire both eta2 and
eta4 to the new asm in meta.h, add the asm entry points and contracts in
arith_native_x86_64.h, register the bytecode dump targets in autogen and
the Makefile, and add a poly_uniform_eta_4x component benchmark.

The eta2 vector path applies the centered mod-5 reduction to (2 - nibble)
directly (matching the reference), rather than reducing the raw nibble and
subtracting afterwards; the two are not equivalent because vpmulhrsw rounds
to nearest. Verified against the ACVP keyGen vectors for all parameter sets.

Signed-off-by: jake massimo <jakemas@amazon.com>
jakemas added a commit that referenced this pull request Jun 16, 2026
…sembly

Add hand-written x86_64 AVX2 assembly for rej_uniform_eta2 and
rej_uniform_eta4 and remove the AVX2 intrinsics implementations they
replace, following the rej_uniform approach in #1014: the table is
passed as a parameter and all constants are built from immediates (no
.rodata), enabling future HOL-Light verification. Both eta2 and eta4 are
wired to the new asm in meta.h, with contracts in arith_native_x86_64.h,
bytecode dump targets in autogen and the Makefile, and a
poly_uniform_eta_4x component benchmark.

The eta2 vector path applies the centered mod-5 reduction to (2 - nibble)
directly (matching the reference), rather than reducing the raw nibble and
subtracting afterwards; the two are not equivalent because vpmulhrsw rounds
to nearest. Verified against the ACVP keyGen vectors for all parameter sets.

Signed-off-by: jake massimo <jakemas@amazon.com>
jakemas added a commit that referenced this pull request Jun 16, 2026
…sembly

Add hand-written x86_64 AVX2 assembly for rej_uniform_eta2 and
rej_uniform_eta4 and remove the AVX2 intrinsics implementations they
replace, following the rej_uniform approach in #1014: the table is
passed as a parameter and all constants are built from immediates (no
.rodata), enabling future HOL-Light verification. Both eta2 and eta4 are
wired to the new asm in meta.h, with contracts in arith_native_x86_64.h,
bytecode dump targets in autogen and the Makefile, and a
poly_uniform_eta_4x component benchmark.

The asm entry points are declared MLD_SYSV_ABI (like the other x86_64 asm
routines) so they are called with the System V register convention on all
platforms, including Windows/MinGW.

The eta2 vector path applies the centered mod-5 reduction to (2 - nibble)
directly (matching the reference), rather than reducing the raw nibble and
subtracting afterwards; the two are not equivalent because vpmulhrsw rounds
to nearest. Verified against the ACVP keyGen vectors for all parameter sets.

Signed-off-by: jake massimo <jakemas@amazon.com>
jakemas added a commit that referenced this pull request Jun 17, 2026
…sembly

Add hand-written x86_64 AVX2 assembly for rej_uniform_eta2 and
rej_uniform_eta4 and remove the AVX2 intrinsics implementations they
replace, following the rej_uniform approach in #1014: the table is
passed as a parameter and all constants are built from immediates (no
.rodata), enabling future HOL-Light verification. Both eta2 and eta4 are
wired to the new asm in meta.h, with contracts in arith_native_x86_64.h,
bytecode dump targets in autogen and the Makefile, and a
poly_uniform_eta_4x component benchmark.

The asm entry points are declared MLD_SYSV_ABI (like the other x86_64 asm
routines) so they are called with the System V register convention on all
platforms, including Windows/MinGW. The endbr64 is emitted via
MLD_ASM_FN_SYMBOL (CET-gated) rather than a raw mnemonic, so older
assemblers (e.g. clang-6) build cleanly.

The eta2 vector path applies the centered mod-5 reduction to (2 - nibble)
directly (matching the reference), rather than reducing the raw nibble and
subtracting afterwards; the two are not equivalent because vpmulhrsw rounds
to nearest. Verified against the ACVP keyGen vectors for all parameter sets.

Signed-off-by: jake massimo <jakemas@amazon.com>
… proof

Replaces the intrinsics-based rej_uniform_avx2.c with a hand-written
assembly routine (mld_rej_uniform_avx2_asm) and adds HOL-Light functional
correctness and memory-safety proofs on top of s2n-bignum, plus the CBMC
contract proof.

Signed-off-by: Jake Massimo <jakemas@amazon.com>
@jakemas jakemas force-pushed the jakemas/rej-uniform-asm branch from fb75556 to 4c41cac Compare June 19, 2026 00:22

Development

Successfully merging this pull request may close these issues.

AVX2: Replace intrinsics implementation of rej_uniform with assembly

4 participants