Skip to content

Conversation

@briansmith
Copy link
Owner

No description provided.

@briansmith briansmith self-assigned this Mar 5, 2025
src/cpu/intel.rs Outdated
// Intel: "15.3 DETECTION OF 512-BIT INSTRUCTION GROUPS OF THE INTEL
// AVX-512 FAMILY".
// `OPENSSL_cpuid_setup` clears these bits when XCR0[7:5] isn't 0b111.
// doesn't AVX-512 state.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming PR #2439 is merged before this, then this will need to be updated.

@codecov
Copy link

codecov bot commented Mar 5, 2025

Codecov Report

❌ Patch coverage is 30.50847% with 82 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.14%. Comparing base (f4836b0) to head (eac2860).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/aead/aes_gcm/vaesclmulavx512.rs 0.00% 36 Missing ⚠️
src/cpu/x86_64.rs 54.76% 12 Missing and 7 partials ⚠️
src/aead/gcm/vclmulavx512.rs 0.00% 16 Missing ⚠️
src/aead/aes_gcm.rs 56.25% 6 Missing and 1 partial ⚠️
src/aead/gcm.rs 0.00% 3 Missing ⚠️
src/cpu.rs 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2444      +/-   ##
==========================================
- Coverage   96.52%   96.14%   -0.38%     
==========================================
  Files         190      192       +2     
  Lines       20119    20217      +98     
  Branches      513      523      +10     
==========================================
+ Hits        19419    19437      +18     
- Misses        560      633      +73     
- Partials      140      147       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@briansmith
Copy link
Owner Author

This is blocked on code coverage testing.

See also these pending changes upstream:
https://boringssl-review.googlesource.com/c/boringssl/+/77168
https://boringssl-review.googlesource.com/c/boringssl/+/77167

See also issue #2469 regarding developing a workaround for this code to work in ancient binutils.

@briansmith
Copy link
Owner Author

The code coverage aspect of this comprises two parts:

  • Ensure that the aes-gcm-avx2 implementation is tested in the coverage job regardless of whether the GItHub Actions runner would normally use the avx512 version.
  • Ensure that the aes-gcm-avx512 implementation is tested in the coverage job regardless of whether the GItHub Actions runner would normally use the avx512 version.

I think GitHub is experimenting with some AVX-512-enabled actions runners so the tests might be flaky in the interim without explicitly using QEMU to target specific CPUs that would choose each implementation.

In PR #2464 I am experimenting with QEMU 9.2.2, which adds newer CPUs than are available in QEMU 8.2.2 used in GitHub Actions Ubuntu 24.04 runners.

@briansmith briansmith added this to the 0.17.15 milestone Mar 11, 2025

# GHASH the remaining data 16 bytes at a time, using xmm registers only.
.Laad_blockbyblock$local_label_suffix:
test $AADLEN, $AADLEN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you're updating this function to support only len=16 again? It might be a good idea to remove this check for len==0, and update the comment "|len| must be a multiple of 16" which this change makes outdated.

But, please note that if someone actually passes in a large amount of AAD (which can happen if someone uses the AES-GCM API to compute GMAC for an authentication-only use case), breaking it into 16-byte chunks is very bad for performance.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, please note that if someone actually passes in a large amount of AAD (which can happen if someone uses the AES-GCM API to compute GMAC for an authentication-only use case), breaking it into 16-byte chunks is very bad for performance.

Yes, I'm aware, but I don't know of any use cases for that at all that would be relevant to ring users. I only know that Google does it for some unknown reason.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you're updating this function to support only len=16 again? It might be a good idea to remove this check for len==0, and update the comment "|len| must be a multiple of 16" which this change makes outdated.

Thanks. I will make those changes and also rebase this on top of the BoringSSL changes from upstream.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't expect any users of large amounts of AAD either, but it turns out that with enough users there will be someone doing something unusual :(

If you're only doing 16 bytes at a time anyway, did you also consider just using gcm_gmult_vpclmulqdq_avx10()? If you XOR the 16 bytes of data into the GHASH accumulator ("Xi") and call gcm_gmult_vpclmulqdq_avx10(), that is equivalent to gcm_ghash_vpclmulqdq_avx10_512() with len=16.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're only doing 16 bytes at a time anyway, did you also consider just using gcm_gmult_vpclmulqdq_avx10()? If you XOR the 16 bytes of data into the GHASH accumulator ("Xi") and call gcm_gmult_vpclmulqdq_avx10(), that is equivalent to gcm_ghash_vpclmulqdq_avx10_512() with len=16.

That is how we were doing things before with pre-VAES implementations, but we've been switching over to the (tweaked) ghash implementations because it's less fighting the rustc optimizer on the Rust side.

We had trouble, for example, getting rustc to always use SSE XOR instead of byte-by-byte XOR, in some cases, thought that might be resolved now. Also, we had trouble getting rustc to assume that the partial/single-block case is more likely than the multi-block case. In later rustc versions it will matter less once we can use likely/unlikely.

Regardless, in PR #2478 I tweaked the AVX2 version of this function to be based on the gmult implementation instead of the ghash implementation. (See also PR #2477, which attempts the same tweaks still based on the ghash implementation).

I didn't expect any users of large amounts of AAD either, but it turns out that with enough users there will be someone doing something unusual :(

I don't think ring has any unusual users. We rely on people telling us what they need and we try to optimize for what people are actually using.

@briansmith briansmith force-pushed the b/aes-gcm-avx512 branch 3 times, most recently from 34889f5 to 7ca1e37 Compare March 20, 2025 00:43
@georglauterbach georglauterbach mentioned this pull request Jun 4, 2025
@briansmith briansmith force-pushed the b/aes-gcm-avx512 branch 7 times, most recently from 282f193 to 429a9be Compare July 17, 2025 19:23
@briansmith briansmith force-pushed the b/aes-gcm-avx512 branch 7 times, most recently from e4d1361 to 7ff364b Compare July 26, 2025 23:46
@briansmith briansmith force-pushed the b/aes-gcm-avx512 branch 3 times, most recently from 925ac5f to bd0587d Compare July 27, 2025 08:17
@Scottmitch
Copy link

Hello and thanks so much for supporting Ring. I’ve run benchmarks with this PR and the AVX-512 instructions provide a meaningful performance boost. IIUC most of the PRs/issues reference by this PR are closed/resolved. Are there any remaining issues or blockers before merging this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants