x86_64 + HOL-Light: Replace rej_uniform intrinsics with assembly and HOL-Light CORRECT and MEMSAFE proofs#1014
x86_64 + HOL-Light: Replace rej_uniform intrinsics with assembly and HOL-Light CORRECT and MEMSAFE proofs#1014jakemas wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
113118 cycles |
113013 cycles |
1.00 |
ML-DSA-44 sign |
355649 cycles |
355605 cycles |
1.00 |
ML-DSA-44 verify |
117801 cycles |
117682 cycles |
1.00 |
ML-DSA-65 keypair |
196381 cycles |
196214 cycles |
1.00 |
ML-DSA-65 sign |
589557 cycles |
588943 cycles |
1.00 |
ML-DSA-65 verify |
194604 cycles |
194375 cycles |
1.00 |
ML-DSA-87 keypair |
322210 cycles |
322148 cycles |
1.00 |
ML-DSA-87 sign |
752493 cycles |
752763 cycles |
1.00 |
ML-DSA-87 verify |
320055 cycles |
319900 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
212361 cycles |
212622 cycles |
1.00 |
ML-DSA-44 sign |
760716 cycles |
760066 cycles |
1.00 |
ML-DSA-44 verify |
228743 cycles |
228987 cycles |
1.00 |
ML-DSA-65 keypair |
379384 cycles |
379665 cycles |
1.00 |
ML-DSA-65 sign |
1250617 cycles |
1249827 cycles |
1.00 |
ML-DSA-65 verify |
371531 cycles |
372045 cycles |
1.00 |
ML-DSA-87 keypair |
604335 cycles |
605426 cycles |
1.00 |
ML-DSA-87 sign |
1593243 cycles |
1591413 cycles |
1.00 |
ML-DSA-87 verify |
618270 cycles |
617375 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
AMD EPYC 3rd gen (c6a)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
66830 cycles |
68874 cycles |
0.97 |
ML-DSA-44 sign |
184077 cycles |
187594 cycles |
0.98 |
ML-DSA-44 verify |
65562 cycles |
68993 cycles |
0.95 |
ML-DSA-65 keypair |
111959 cycles |
119089 cycles |
0.94 |
ML-DSA-65 sign |
292002 cycles |
299488 cycles |
0.98 |
ML-DSA-65 verify |
108472 cycles |
115385 cycles |
0.94 |
ML-DSA-87 keypair |
185520 cycles |
203754 cycles |
0.91 |
ML-DSA-87 sign |
379630 cycles |
396462 cycles |
0.96 |
ML-DSA-87 verify |
177291 cycles |
196231 cycles |
0.90 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton4
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
68316 cycles |
68121 cycles |
1.00 |
ML-DSA-44 sign |
202487 cycles |
202429 cycles |
1.00 |
ML-DSA-44 verify |
70722 cycles |
70691 cycles |
1.00 |
ML-DSA-65 keypair |
121061 cycles |
121050 cycles |
1.00 |
ML-DSA-65 sign |
331574 cycles |
332242 cycles |
1.00 |
ML-DSA-65 verify |
117810 cycles |
118169 cycles |
1.00 |
ML-DSA-87 keypair |
198140 cycles |
198283 cycles |
1.00 |
ML-DSA-87 sign |
427941 cycles |
428124 cycles |
1.00 |
ML-DSA-87 verify |
194637 cycles |
194645 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
AMD EPYC 3rd gen (c6a) (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
134578 cycles |
135123 cycles |
1.00 |
ML-DSA-44 sign |
523923 cycles |
523989 cycles |
1.00 |
ML-DSA-44 verify |
147640 cycles |
147421 cycles |
1.00 |
ML-DSA-65 keypair |
228634 cycles |
227032 cycles |
1.01 |
ML-DSA-65 sign |
864042 cycles |
860343 cycles |
1.00 |
ML-DSA-65 verify |
236700 cycles |
234883 cycles |
1.01 |
ML-DSA-87 keypair |
371955 cycles |
371568 cycles |
1.00 |
ML-DSA-87 sign |
1080535 cycles |
1079389 cycles |
1.00 |
ML-DSA-87 verify |
383811 cycles |
383403 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Intel Xeon 3rd gen (c6i)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
56863 cycles |
56287 cycles |
1.01 |
ML-DSA-44 sign |
181063 cycles |
181562 cycles |
1.00 |
ML-DSA-44 verify |
61140 cycles |
61061 cycles |
1.00 |
ML-DSA-65 keypair |
98291 cycles |
98770 cycles |
1.00 |
ML-DSA-65 sign |
298368 cycles |
299116 cycles |
1.00 |
ML-DSA-65 verify |
100343 cycles |
100251 cycles |
1.00 |
ML-DSA-87 keypair |
152430 cycles |
153265 cycles |
0.99 |
ML-DSA-87 sign |
354719 cycles |
355417 cycles |
1.00 |
ML-DSA-87 verify |
153124 cycles |
153884 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton4 (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
128315 cycles |
128272 cycles |
1.00 |
ML-DSA-44 sign |
447513 cycles |
447600 cycles |
1.00 |
ML-DSA-44 verify |
138123 cycles |
144678 cycles |
0.95 |
ML-DSA-65 keypair |
220541 cycles |
220481 cycles |
1.00 |
ML-DSA-65 sign |
726484 cycles |
726951 cycles |
1.00 |
ML-DSA-65 verify |
222926 cycles |
223461 cycles |
1.00 |
ML-DSA-87 keypair |
366142 cycles |
366604 cycles |
1.00 |
ML-DSA-87 sign |
927541 cycles |
927414 cycles |
1.00 |
ML-DSA-87 verify |
374016 cycles |
373875 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton3
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
72353 cycles |
72235 cycles |
1.00 |
ML-DSA-44 sign |
212424 cycles |
212375 cycles |
1.00 |
ML-DSA-44 verify |
75754 cycles |
75714 cycles |
1.00 |
ML-DSA-65 keypair |
127646 cycles |
127612 cycles |
1.00 |
ML-DSA-65 sign |
351030 cycles |
350845 cycles |
1.00 |
ML-DSA-65 verify |
125627 cycles |
125755 cycles |
1.00 |
ML-DSA-87 keypair |
205980 cycles |
208476 cycles |
0.99 |
ML-DSA-87 sign |
444778 cycles |
450018 cycles |
0.99 |
ML-DSA-87 verify |
205601 cycles |
205843 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Intel Xeon 3rd gen (c6i) (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
157499 cycles |
157541 cycles |
1.00 |
ML-DSA-44 sign |
549244 cycles |
549413 cycles |
1.00 |
ML-DSA-44 verify |
169448 cycles |
168865 cycles |
1.00 |
ML-DSA-65 keypair |
268437 cycles |
268818 cycles |
1.00 |
ML-DSA-65 sign |
903422 cycles |
903672 cycles |
1.00 |
ML-DSA-65 verify |
275283 cycles |
274680 cycles |
1.00 |
ML-DSA-87 keypair |
448241 cycles |
448464 cycles |
1.00 |
ML-DSA-87 sign |
1158654 cycles |
1157970 cycles |
1.00 |
ML-DSA-87 verify |
458704 cycles |
458043 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
AMD EPYC 4th gen (c7a)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
42142 cycles |
40662 cycles |
1.04 |
ML-DSA-44 sign |
134317 cycles |
132808 cycles |
1.01 |
ML-DSA-44 verify |
44844 cycles |
43607 cycles |
1.03 |
ML-DSA-65 keypair |
72940 cycles |
71859 cycles |
1.02 |
ML-DSA-65 sign |
213861 cycles |
213367 cycles |
1.00 |
ML-DSA-65 verify |
73729 cycles |
72847 cycles |
1.01 |
ML-DSA-87 keypair |
107003 cycles |
109237 cycles |
0.98 |
ML-DSA-87 sign |
250851 cycles |
254550 cycles |
0.99 |
ML-DSA-87 verify |
107681 cycles |
109371 cycles |
0.98 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
AMD EPYC 4th gen (c7a) (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
120754 cycles |
120325 cycles |
1.00 |
ML-DSA-44 sign |
447570 cycles |
447576 cycles |
1.00 |
ML-DSA-44 verify |
130511 cycles |
130561 cycles |
1.00 |
ML-DSA-65 keypair |
205040 cycles |
205018 cycles |
1.00 |
ML-DSA-65 sign |
728790 cycles |
729474 cycles |
1.00 |
ML-DSA-65 verify |
210029 cycles |
209605 cycles |
1.00 |
ML-DSA-87 keypair |
337610 cycles |
336678 cycles |
1.00 |
ML-DSA-87 sign |
925517 cycles |
924223 cycles |
1.00 |
ML-DSA-87 verify |
347563 cycles |
347399 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton3 (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
138744 cycles |
138561 cycles |
1.00 |
ML-DSA-44 sign |
483982 cycles |
484140 cycles |
1.00 |
ML-DSA-44 verify |
148574 cycles |
162388 cycles |
0.91 |
ML-DSA-65 keypair |
241921 cycles |
241950 cycles |
1.00 |
ML-DSA-65 sign |
792702 cycles |
792591 cycles |
1.00 |
ML-DSA-65 verify |
240763 cycles |
241288 cycles |
1.00 |
ML-DSA-87 keypair |
396106 cycles |
397138 cycles |
1.00 |
ML-DSA-87 sign |
1013453 cycles |
1013569 cycles |
1.00 |
ML-DSA-87 verify |
403446 cycles |
403178 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton2
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
113189 cycles |
113255 cycles |
1.00 |
ML-DSA-44 sign |
355791 cycles |
356042 cycles |
1.00 |
ML-DSA-44 verify |
117978 cycles |
117969 cycles |
1.00 |
ML-DSA-65 keypair |
196342 cycles |
196623 cycles |
1.00 |
ML-DSA-65 sign |
589183 cycles |
589242 cycles |
1.00 |
ML-DSA-65 verify |
194553 cycles |
194559 cycles |
1.00 |
ML-DSA-87 keypair |
322537 cycles |
322281 cycles |
1.00 |
ML-DSA-87 sign |
753613 cycles |
753546 cycles |
1.00 |
ML-DSA-87 verify |
320115 cycles |
320070 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Graviton2 (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
213219 cycles |
212521 cycles |
1.00 |
ML-DSA-44 sign |
761553 cycles |
760970 cycles |
1.00 |
ML-DSA-44 verify |
241351 cycles |
234237 cycles |
1.03 |
ML-DSA-65 keypair |
380573 cycles |
379762 cycles |
1.00 |
ML-DSA-65 sign |
1252452 cycles |
1252199 cycles |
1.00 |
ML-DSA-65 verify |
372839 cycles |
371797 cycles |
1.00 |
ML-DSA-87 keypair |
607341 cycles |
604584 cycles |
1.00 |
ML-DSA-87 sign |
1596680 cycles |
1595561 cycles |
1.00 |
ML-DSA-87 verify |
619175 cycles |
618927 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Graviton2 (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 verify |
241351 cycles |
234237 cycles |
1.03 |
This comment was automatically generated by workflow using github-action-benchmark.
CBMC Results (ML-DSA-44)Full Results (206 proofs)
|
CBMC Results (ML-DSA-87)Full Results (206 proofs)
|
CBMC Results (ML-DSA-65)Full Results (206 proofs)
|
There was a problem hiding this comment.
Intel Xeon 4th gen (c7i)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
34764 cycles |
34374 cycles |
1.01 |
ML-DSA-44 sign |
120113 cycles |
120132 cycles |
1.00 |
ML-DSA-44 verify |
38092 cycles |
38166 cycles |
1.00 |
ML-DSA-65 keypair |
61138 cycles |
60500 cycles |
1.01 |
ML-DSA-65 sign |
201844 cycles |
199945 cycles |
1.01 |
ML-DSA-65 verify |
62783 cycles |
62429 cycles |
1.01 |
ML-DSA-87 keypair |
93501 cycles |
94486 cycles |
0.99 |
ML-DSA-87 sign |
236815 cycles |
239500 cycles |
0.99 |
ML-DSA-87 verify |
95619 cycles |
96894 cycles |
0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Intel Xeon 4th gen (c7i) (no-opt)
Details
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
93930 cycles |
93842 cycles |
1.00 |
ML-DSA-44 sign |
333310 cycles |
333119 cycles |
1.00 |
ML-DSA-44 verify |
100022 cycles |
100025 cycles |
1.00 |
ML-DSA-65 keypair |
159902 cycles |
160115 cycles |
1.00 |
ML-DSA-65 sign |
543114 cycles |
543227 cycles |
1.00 |
ML-DSA-65 verify |
160989 cycles |
161060 cycles |
1.00 |
ML-DSA-87 keypair |
266666 cycles |
266874 cycles |
1.00 |
ML-DSA-87 sign |
704974 cycles |
706010 cycles |
1.00 |
ML-DSA-87 verify |
270510 cycles |
269779 cycles |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'AMD EPYC 4th gen (c7a)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: 6539a79 | Previous: 9ee2f35 | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
42142 cycles |
40662 cycles |
1.04 |
This comment was automatically generated by workflow using github-action-benchmark.
97ef0aa to
b6e6f76
Compare
|
Updated to follow format introduced in d417b6b |
24cfdd8 to
a50739c
Compare
mkannwischer
left a comment
There was a problem hiding this comment.
Thanks @jakemas and sorry for the long silence.
The core of the proof looks good to me - I have a few more comments/questions below.
| def gen_avx2_hol_light_rej_uniform_table(): | ||
| """Emit the HOL Light byte-list form of the AVX2 rej_uniform lookup table. | ||
| Mirrors mlkem-native's gen_aarch64_hol_light_rej_uniform_table pattern.""" | ||
|
|
||
| def get_set_bits_idxs(i): | ||
| bits = list(map(int, format(i, "08b"))) | ||
| bits.reverse() | ||
| return [bit_idx for bit_idx in range(8) if bits[bit_idx] == 1] | ||
|
|
||
| def gen_rows(): | ||
| for i in range(256): | ||
| idxs = get_set_bits_idxs(i) | ||
| idxs = idxs + [0] * (8 - len(idxs)) | ||
| yield idxs |
There was a problem hiding this comment.
This function should just call the one used to generate the actual constants (gen_avx2_rej_uniform_table_rows) so we can be sure they are the same.
| /* Safety: outlen is at most MLDSA_N and, hence, this cast is safe. */ | ||
| return (int)mld_rej_uniform_avx2(r, buf); | ||
| return (int)mld_rej_uniform_avx2_asm(r, buf, | ||
| (const uint8_t *)mld_rej_uniform_table); |
There was a problem hiding this comment.
can we declare mld_rej_uniform_table as a flat array so this cast here is not needed?
| __contract__( | ||
| requires(memory_no_alias(r, sizeof(int32_t) * MLDSA_N)) | ||
| requires(memory_no_alias(buf, 840)) | ||
| requires(table == (const uint8_t *)mld_rej_uniform_table) |
There was a problem hiding this comment.
can we declare mld_rej_uniform_table as a flat array so this cast here is not needed?
| [ARITH_TAC; ALL_TAC] THEN | ||
| SUBGOAL_THEN `4 * (curlen + 1) = 4 * curlen + 4` SUBST1_TAC THENL | ||
| [ARITH_TAC; ALL_TAC] THEN | ||
| FIRST_ASSUM(fun th -> |
There was a problem hiding this comment.
Can we remove the debug Printf.printf statements or do they still serve any purpose?
| [ARITH_TAC; ALL_TAC] THEN | ||
| SUBGOAL_THEN `4 * (curlen + 1) = 4 * curlen + 4` SUBST1_TAC THENL | ||
| [ARITH_TAC; ALL_TAC] THEN | ||
| FIRST_ASSUM(fun th -> |
There was a problem hiding this comment.
I'm confused. Why are there separate _MEMSAFE and _SAFE proofs? Shouldn't this proof only have the _MEMSAFE theorems?
| * Low 128 bits: bytes [0..15] (original 64-bit lanes 0, 1) | ||
| * High 128 bits: bytes [8..23] (original 64-bit lanes 1, 2) |
There was a problem hiding this comment.
Nit: The comment is slightly confusing by itself, because the semantic granularity of the input is in bytes, not 64-bit. 64-bit granularity comes from vpermq, but this isn't mentioned here.
Suggestion:
* vpermq with 0x94(=0b 10 01 01 00) permutes 64-bit lanes via (0,1,2,3) -> (0,1,1,2).
* The loaded 32 bytes are thus rearranged as:
* Low 128 bits: bytes [0..15] of original 32-byte
* High 128 bits: bytes [8..23] of original 32-byte
| // Construct broadcast constants | ||
| movl $0x7FFFFF, good | ||
| vmovd good, %xmm1 | ||
| vpbroadcastd %xmm1, mask // mask: 23-bit extraction | ||
|
|
||
| movl $8380417, good // MLDSA_Q | ||
| vmovd good, %xmm2 | ||
| vpbroadcastd %xmm2, bound // bound: rejection threshold | ||
|
|
||
| // Initialize counters | ||
| xorl ctr, ctr // ctr = 0 | ||
| xorl pos, pos // pos = 0 |
There was a problem hiding this comment.
Nit nit: indentation of comments. Also, can we use /* .. */ style? I know this is wrong in many other ASM files, but when we touch/add code we may as well adjust it.
|
|
||
| /* | ||
| * Main SIMD loop: process 24 input bytes into up to 8 coefficients | ||
| * per iteration. Loops while ctr <= MLDSA_N - 8 and pos <= BUFLEN - 32. |
There was a problem hiding this comment.
It's noteworthy that this approach differs from the mlkem-native rejection sampling, which buffers into a temporary slightly oversized stack buffer, to avoid the scalar tail loop.
I don't know which is better, nor is it necessary that we align this now, but just noting.
My gut feeling is that it's probably better the way it's done here.
| rej_uniform_avx2_asm_loop: | ||
| cmpl $248, ctr // MLDSA_N - 8 | ||
| ja rej_uniform_avx2_asm_scalar | ||
| cmpl $808, pos // MLD_AVX2_REJ_UNIFORM_BUFLEN - 32 |
There was a problem hiding this comment.
Nit: Can you use the macro constant instead of a comment here?
| popcntl good, cnt // count valid coefficients | ||
|
|
||
| vmovq (tab, %r8, 8), %xmm4 // load permutation from table[good] |
There was a problem hiding this comment.
Nit: mix of symbolic good vs raw r8. Can we get the notation consistent (ideally symbolics only)? (If you need variants of the same register, you can do something like name_x,name_y,..
| vpmovzxbd %xmm4, cmp_result // zero-extend to 8 dword indices | ||
| vpermd data, cmp_result, data // compact valid coefficients to front | ||
|
|
||
| vmovdqu data, (out, %rax, 4) // store at r[ctr] |
There was a problem hiding this comment.
Same comment as above: Let's stick to symbolic register names.
| rej_uniform_avx2_asm_scalar: | ||
| cmpl $256, ctr // MLDSA_N | ||
| jae rej_uniform_avx2_asm_done | ||
| cmpl $837, pos // MLD_AVX2_REJ_UNIFORM_BUFLEN - 3 |
There was a problem hiding this comment.
This can never fire, can it? If so, can we remove it?
There was a problem hiding this comment.
Here's the reasoning:
The scalar loop bumps pos += 3 on every sample, but ctr only increments on an accepted one (the pos advance happens before the >= Q reject check). So in a rejection-heavy tail, pos keeps climbing while ctr stalls.
Concretely: the buffer is 840 bytes = at most 280 three-byte samples, and we need 256 acceptances — so we can only absorb ~24 rejections before running dry with ctr still < 256. We can also enter the scalar loop with pos as high as 832 (the main loop exits after pos += 24 with pos having been <= 808). From there, if the tail keeps rejecting, pos walks 832 → 835 → 838 and at 838 the next 3-byte read would touch buf[838..840], one past the end of the 840-byte buffer. The cmpl $SCALAR_POS_BOUND, pos; ja done is what stops that.
It's also load-bearing in the proof: the loop exit is disjunctive (hit 256 coefficients or exhaust the buffer), and the memory-safety argument for the 3-byte load depends on pos <= 837, which this branch establishes. Removing it reintroduces the over-read on adversarial input and breaks the MEMSAFE proof.
There was a problem hiding this comment.
Worth noting the removed intrinsics had the identical guard as its scalar loop condition:
while (ctr < MLDSA_N && pos <= MLD_AVX2_REJ_UNIFORM_BUFLEN - 3)which is exactly the pos <= 837 that cmpl $SCALAR_POS_BOUND, pos; ja done implements. The asm didn't add the check, it preserves what was already there.
The header comment on that file also spells out why it has to be there:
The pqcrystals implementation assumes a buffer that is 8 bytes larger as the first loop overreads by 8 bytes that are then discarded. We instead do not pad the buffer and do not overread.
So the bound is load-bearing by design: it's what lets us run on the unpadded 840 byte buffer without over-reading.
hanno-becker
left a comment
There was a problem hiding this comment.
Thanks @jakemas, and sorry this is again taking so long.
In addition to @mkannwischer's important comment that the constant-tables should come from the same source (that's our only argument for why they are consistent across proof and code -- we have no other check!), there's a dead branch in the rejection sampling I believe that can be removed.
ed868d0 to
fb75556
Compare
…sembly Add hand-written x86_64 AVX2 assembly for rej_uniform_eta2 and rej_uniform_eta4, following the rej_uniform approach in #1014: the table is passed as a parameter and all constants are built from immediates (no .rodata), enabling future HOL-Light verification. Wire eta4 to the new asm in meta.h, add the asm entry points and contracts in arith_native_x86_64.h, register the bytecode dump targets in autogen and the Makefile, and add a poly_uniform_eta_4x component benchmark. Signed-off-by: jake massimo <jakemas@amazon.com>
…sembly Add hand-written x86_64 AVX2 assembly for rej_uniform_eta2 and rej_uniform_eta4, following the rej_uniform approach in #1014: the table is passed as a parameter and all constants are built from immediates (no .rodata), enabling future HOL-Light verification. Wire both eta2 and eta4 to the new asm in meta.h, add the asm entry points and contracts in arith_native_x86_64.h, register the bytecode dump targets in autogen and the Makefile, and add a poly_uniform_eta_4x component benchmark. The eta2 vector path applies the centered mod-5 reduction to (2 - nibble) directly (matching the reference), rather than reducing the raw nibble and subtracting afterwards; the two are not equivalent because vpmulhrsw rounds to nearest. Verified against the ACVP keyGen vectors for all parameter sets. Signed-off-by: jake massimo <jakemas@amazon.com>
…sembly Add hand-written x86_64 AVX2 assembly for rej_uniform_eta2 and rej_uniform_eta4, following the rej_uniform approach in #1014: the table is passed as a parameter and all constants are built from immediates (no .rodata), enabling future HOL-Light verification. Wire both eta2 and eta4 to the new asm in meta.h, add the asm entry points and contracts in arith_native_x86_64.h, register the bytecode dump targets in autogen and the Makefile, and add a poly_uniform_eta_4x component benchmark. The eta2 vector path applies the centered mod-5 reduction to (2 - nibble) directly (matching the reference), rather than reducing the raw nibble and subtracting afterwards; the two are not equivalent because vpmulhrsw rounds to nearest. Verified against the ACVP keyGen vectors for all parameter sets. Signed-off-by: jake massimo <jakemas@amazon.com>
…sembly Add hand-written x86_64 AVX2 assembly for rej_uniform_eta2 and rej_uniform_eta4 and remove the AVX2 intrinsics implementations they replace, following the rej_uniform approach in #1014: the table is passed as a parameter and all constants are built from immediates (no .rodata), enabling future HOL-Light verification. Both eta2 and eta4 are wired to the new asm in meta.h, with contracts in arith_native_x86_64.h, bytecode dump targets in autogen and the Makefile, and a poly_uniform_eta_4x component benchmark. The eta2 vector path applies the centered mod-5 reduction to (2 - nibble) directly (matching the reference), rather than reducing the raw nibble and subtracting afterwards; the two are not equivalent because vpmulhrsw rounds to nearest. Verified against the ACVP keyGen vectors for all parameter sets. Signed-off-by: jake massimo <jakemas@amazon.com>
…sembly Add hand-written x86_64 AVX2 assembly for rej_uniform_eta2 and rej_uniform_eta4 and remove the AVX2 intrinsics implementations they replace, following the rej_uniform approach in #1014: the table is passed as a parameter and all constants are built from immediates (no .rodata), enabling future HOL-Light verification. Both eta2 and eta4 are wired to the new asm in meta.h, with contracts in arith_native_x86_64.h, bytecode dump targets in autogen and the Makefile, and a poly_uniform_eta_4x component benchmark. The asm entry points are declared MLD_SYSV_ABI (like the other x86_64 asm routines) so they are called with the System V register convention on all platforms, including Windows/MinGW. The eta2 vector path applies the centered mod-5 reduction to (2 - nibble) directly (matching the reference), rather than reducing the raw nibble and subtracting afterwards; the two are not equivalent because vpmulhrsw rounds to nearest. Verified against the ACVP keyGen vectors for all parameter sets. Signed-off-by: jake massimo <jakemas@amazon.com>
…sembly Add hand-written x86_64 AVX2 assembly for rej_uniform_eta2 and rej_uniform_eta4 and remove the AVX2 intrinsics implementations they replace, following the rej_uniform approach in #1014: the table is passed as a parameter and all constants are built from immediates (no .rodata), enabling future HOL-Light verification. Both eta2 and eta4 are wired to the new asm in meta.h, with contracts in arith_native_x86_64.h, bytecode dump targets in autogen and the Makefile, and a poly_uniform_eta_4x component benchmark. The asm entry points are declared MLD_SYSV_ABI (like the other x86_64 asm routines) so they are called with the System V register convention on all platforms, including Windows/MinGW. The endbr64 is emitted via MLD_ASM_FN_SYMBOL (CET-gated) rather than a raw mnemonic, so older assemblers (e.g. clang-6) build cleanly. The eta2 vector path applies the centered mod-5 reduction to (2 - nibble) directly (matching the reference), rather than reducing the raw nibble and subtracting afterwards; the two are not equivalent because vpmulhrsw rounds to nearest. Verified against the ACVP keyGen vectors for all parameter sets. Signed-off-by: jake massimo <jakemas@amazon.com>
… proof Replaces the intrinsics-based rej_uniform_avx2.c with a hand-written assembly routine (mld_rej_uniform_avx2_asm) and adds HOL-Light functional correctness and memory-safety proofs on top of s2n-bignum, plus the CBMC contract proof. Signed-off-by: Jake Massimo <jakemas@amazon.com>
fb75556 to
4c41cac
Compare
Summary
Resolves #926 and #418 (?)
Hol-light proof needs instructions from awslabs/s2n-bignum#387
rej_uniformwith hand-written x86_64 assembly.rodatasection), enabling future HOL-Light formal verification#defines with#undefcleanup for SCU builds (following mlkem-native pattern)poly_uniformto component benchmarkML-DSA's 23-bit coefficients require 32-bit lanes, which naturally fills a 256-bit YMM register for 8 elements per iteration.
Performance
AMD EPYC 3rd gen (c6a) — opt
Proof
Includes HOL-Light and CBMC proofs, written by claude opus 4.7.
CORRECT
HOL-Light / x86_64 HOL Light proof for mldsa_rej_uniform.S (pull_request) Successful in 12m
CORRECT + MEMSAFE
HOL-Light / x86_64 HOL Light proof for rej_uniform_avx2_asm.S (pull_request)Successful in 21m
It took Claude Opus 4.7 1m around 3weeks to get the MEMSAFE proof. Since it records all token usage per prompt, I got it to go back to the prompt in which I asked it to create the MEMSAFE part of the proof to gather statistics: