feat(gpu): integrate zk-cuda-backend with tfhe-zk-pok #3275
Hey @nsarlin-zama @IceTDrinker this PR is finally ready for review. I did my best to make your review easier, but it is still a large PR. Hope you don't have to spend time on obvious issues.
IceTDrinker
left a comment
thanks @pdroalves we have some other topics to finish first on our end, but we'll take a look when we get a chance !
@IceTDrinker made 1 comment.
Reviewable status: 0 of 25 files reviewed, all discussions resolved (waiting on agnesLeroy, andrei-stoian-zama, nsarlin-zama, soonum, SouchonTheo, and tmontaigu).
andrei-stoian-zama
left a comment
I think this PR looks very good!
I have some minor comments, and one design question about rayon thread pools / gpu_indexes.
-p tfhe-zk-pok --features experimental,gpu-experimental -- gpu

.PHONY: test_integer_zk_gpu # Run tfhe-zk-pok tests
test_integer_zk_gpu: install_rs_check_toolchain
this target and the above one do not use the gpu-experimental feature. what do these targets test ? old code ?
Yes, test_integer_zk_gpu executes verify_and_expand with GPU-accelerated expand (but CPU verify).
test_zk_pok_gpu uses gpu-experimental, so the target should really be named test_zk_pok_experimental_gpu, matching test_integer_zk_experimental_gpu.
.github/workflows/gpu_zk_tests.yml
- name: Run zk-cuda-backend integration tests
  run: |
    make test_zk_cuda_backend
    make test_zk_pok_gpu
how much CI time is added ?
.PHONY: test_gpu # Run the tests of the core_crypto module including experimental on the gpu backend
-test_gpu: test_core_crypto_gpu test_integer_gpu test_cuda_backend
+test_gpu: test_core_crypto_gpu test_integer_gpu test_cuda_backend test_zk_cuda_backend
should we run the zk tests on approved ? only when the zk backend changes ?
Given it's a 15min overhead, I would run them every time some zk-related thing changes.
ok, could you add a workflow filter for that ?
That's already what's happening.
In the should-run job the current file patterns are:
- tfhe/Cargo.toml
- tfhe/build.rs
- backends/zk-cuda-backend/**
- tfhe/src/integer/gpu/zk/**
- tfhe-zk-pok/**
- tfhe/docs/**/**.md
- .github/workflows/gpu_zk_tests.yml
- ci/slab.toml
For completeness I will add a few more items to that list:
- tfhe/src/zk/**
- backends/tfhe-cuda-backend/**
- tfhe/src/core_crypto/gpu/** (tfhe-zk-pok depends on CudaStreams)
- tfhe/src/integer/gpu/** (ZK depends on ciphertext::compact_list and key_switching_key, we want to catch anything that could break there)
- tfhe/src/shortint/parameters/** (in case something changes in the parameters)
tfhe-zk-pok/src/gpu/mod.rs
/// - If `bases` and `scalars` have different lengths (checked inside the backend).
/// - If the GPU MSM call fails.
#[must_use]
pub fn g1_msm_gpu(bases: &[G1Affine], scalars: &[Zp], gpu_index: Option<u32>) -> G1 {
I don't see a good reason to use Optional gpu_index.
That was useful when we were sharing implementation code with the CPU. Not anymore. I will remove the optional modifier.
tfhe-zk-pok/src/gpu/mod.rs
/// - If `bases` and `scalars` have different lengths (checked inside the backend).
/// - If the GPU MSM call fails.
#[must_use]
pub fn g2_msm_gpu(bases: &[G2Affine], scalars: &[Zp], gpu_index: Option<u32>) -> G2 {
again, no need for optional gpu_index
I will remove this one as well.
tfhe-zk-pok/src/gpu/mod.rs
/// across all available GPUs. Returns `None` (meaning GPU 0) when only one GPU
/// is present.
#[inline]
pub fn select_gpu_for_msm() -> Option<u32> {
I don't understand why we have to rely on rayon here. is our multi-gpu logic necessarily related to rayon use ? in the cuda backend we use rayon in benchmarks but we don't assume rayon is needed.
Quick answer: yes, our multi-GPU logic depends on rayon. We link one GPU per rayon thread.
Longer answer: The original version of this PR had verify_impl and prove_impl shared between CPU and GPU. This version splits those, so we now have our own. This opens the possibility to rewrite and reorganize those methods freely, but that's not what I'm doing in this PR; it contains the minimum changes necessary to integrate the zk backend.
That's the work I want to tackle next.
makes sense! can you create an issue detailing the future work please ?
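The one-GPU-per-rayon-thread linking described above boils down to mapping a worker-thread index onto the available devices. A minimal sketch, with `select_gpu` as a hypothetical stand-in (the real code would take the index from `rayon::current_thread_index()` rather than a parameter):

```rust
// Hedged sketch: distribute MSM work across GPUs by worker-thread index,
// as described for select_gpu_for_msm(). The thread index is passed in
// explicitly here instead of being queried from rayon.
fn select_gpu(thread_index: usize, num_gpus: u32) -> u32 {
    // Round-robin over the available devices; guard against num_gpus == 0.
    (thread_index as u32) % num_gpus.max(1)
}

fn main() {
    // With 4 GPUs, threads 0..8 wrap around the devices.
    for t in 0..8 {
        println!("thread {t} -> gpu {}", select_gpu(t, 4));
    }
}
```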
///
/// The result is cached after the first call since GPU count cannot change
/// during execution.
pub(crate) fn get_num_gpus() -> u32 {
can't we have a common file with the tfhe backend and use the function we used traditionally ?
We could, just need to think where.
We can't use the version in tfhe/, otherwise it would create a circular dependency with tfhe-zk-pok. We could maybe convert the current method we have in tfhe/src/core_crypto/gpu/mod.rs into a safe Rust wrapper over the tfhe-cuda-backend bindings. I suppose that's feasible.
should it be done in this PR or a new one ?
I would do it in a new PR. We already have to work on the common compilation artifacts between both CUDA backends. I think this would be part of that. I added it to the issue we have about that: https://github.com/zama-ai/tfhe-rs-internal/issues/1325
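The cache-after-first-call behavior in the doc comment above can be sketched with `std::sync::OnceLock`; `query_device_count` is a made-up stub standing in for the real CUDA driver query:

```rust
use std::sync::OnceLock;

// Cached device count; GPU count cannot change during execution.
static NUM_GPUS: OnceLock<u32> = OnceLock::new();

// Hypothetical stand-in for the actual CUDA runtime query
// (e.g. a cudaGetDeviceCount binding).
fn query_device_count() -> u32 {
    1
}

// The closure runs at most once; later calls return the cached value.
fn get_num_gpus() -> u32 {
    *NUM_GPUS.get_or_init(query_device_count)
}

fn main() {
    println!("gpus: {}", get_num_gpus());
}
```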
tfhe-zk-pok/src/gpu/pke.rs
some comments:
- // FIXME: div_round -> anything to do for it ? remove it ?
- is there a reference for the proof - a link to some PDF ? CPU doesn't have one either..
if there is a publicly accessible reference we should add, if possible, comments in the code such as: see "Fancy ZK paper" by Joye et al. 2007, algorithm A. If there isn't a publicly available paper then nothing to be done, we'll handle docs internally.
Not sure if we have a public reference for that. @nsarlin-zama do you?
why do we use the run_in_pool mechanism ? CPU uses it, is it the best for us ?
Possibly not. As before, this integration is done with the minimum code change to verify_impl and prove_impl. I plan to approach this kind of thing in a future PR.
nsarlin-zama
left a comment
@nsarlin-zama partially reviewed 15 files and made 4 comments.
Reviewable status: 12 of 26 files reviewed, 13 unresolved discussions (waiting on agnesLeroy, andrei-stoian-zama, pdroalves, soonum, SouchonTheo, and tmontaigu).
tfhe-zk-pok/src/gpu/pke_v2.rs line 75 at r5 (raw file):
#[allow(clippy::too_many_arguments)] fn prove_impl(
just checking, we went for the "duplicate" solution so that you could later have more freedom in how you optimize the proof for the gpu. Is it still in the plan?
And right now, is it just a copy paste of the cpu code with msm replaced by gpu msm?
tfhe-zk-pok/src/gpu/tests/prove_verify_stress.rs line 214 at r3 (raw file):
#[test] fn test_pke_v2_gpu_cpu_equivalence() { let params = crate::proofs::pke_v2::tests::PKEV2_TEST_PARAMS;
you should print the seed so we can replay if a bug happens
tfhe-zk-pok/src/gpu/pke.rs line 1 at r5 (raw file):
//! GPU-accelerated prove/verify for PKE v1.
I'm wondering if it's worth including the zkv1, we might deprecate it on the cpu at some point, it's not used in prod, will never be and has been removed from the NIST doc.
And it adds a lot of code, so it's not "free".
tfhe-benchmark/benches/integer/zk_pke.rs line 948 at r5 (raw file):
// batch size let elements = (rayon::current_num_threads() / num_block).max(1) + 1;
This might not be a good metric for the gpu ?
…dd PTX carry chains for fp_add/sub and branchless reduction: replace software carry detection (carry = (sum < old) ? 1 : 0) in fp_add_raw/fp_sub_raw with inline PTX add.cc.u64/addc.cc.u64 and sub.cc.u64/subc.cc.u64 chains; now we always compute both reduced and unreduced results and select via bitmask.
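For context, the software carry pattern that commit replaces corresponds to multi-limb addition like the following Rust sketch; the PTX version performs the same chain with hardware carry flags (add.cc.u64/addc.cc.u64) instead of deriving the carry from comparisons. `add_limbs` is an illustrative stand-in, not the kernel code:

```rust
// Hedged sketch of software carry propagation for 4-limb addition,
// the pattern the commit replaces with PTX carry-chain instructions.
fn add_limbs(a: &[u64; 4], b: &[u64; 4]) -> ([u64; 4], u64) {
    let mut out = [0u64; 4];
    let mut carry = 0u64;
    for i in 0..4 {
        let (s1, c1) = a[i].overflowing_add(b[i]);
        let (s2, c2) = s1.overflowing_add(carry);
        out[i] = s2;
        // Software carry detection: derived from overflow checks rather
        // than read from the hardware carry flag.
        carry = (c1 | c2) as u64;
    }
    (out, carry)
}

fn main() {
    let (sum, carry) = add_limbs(&[u64::MAX, 0, 0, 0], &[1, 0, 0, 0]);
    println!("{sum:?} carry={carry}");
}
```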
Callers always pass a concrete `u32` via `select_gpu_for_msm()`, so the `Option<u32>` indirection added no value. Simplifies the API per review feedback on PR #3275.
…enchmark Replace the CPU thread count heuristic (rayon::current_num_threads() / num_block) with gpu_zk_throughput_elements() scaled by GPU count, matching the approach already used by gpu_pke_zk_verify. Also use compute_load_config() to respect the __TFHE_RS_BENCH_OP_FLAVOR env var for fast mode.
Separate gpu_zk_throughput_elements into two functions: - gpu_zk_proof_throughput_elements: ~2x higher values since proof MSM uses ~0.5-1.8 MB/element (compute-bound, not memory-bound) - gpu_zk_verify_throughput_elements: unchanged values tuned for expansion OOM avoidance
pdroalves
left a comment
@pdroalves made 4 comments.
Reviewable status: 12 of 41 files reviewed, 13 unresolved discussions (waiting on agnesLeroy, andrei-stoian-zama, nsarlin-zama, soonum, SouchonTheo, and tmontaigu).
tfhe-benchmark/benches/integer/zk_pke.rs line 948 at r5 (raw file):
Previously, nsarlin-zama (Nicolas Sarlin) wrote…
This might not be a good metric for the gpu ?
That's a terrible metric. Our throughput benchmark was broken when I started this, and honestly I forgot to give it some love. I just changed it to use a GPU-specific function to calculate the number of elements. Since proof and verify have different memory footprints, I have a different method for each. I will run a few benchmarks on an 8xH100 instance to see what the optimal set of values is.
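The per-GPU sizing discussed here can be sketched as follows. The function names are hypothetical and the per-GPU constants are the figures from the later tuning commit (250 proof, 60 verify elements per GPU), not necessarily what shipped:

```rust
// Hedged sketch: size the throughput batch by GPU count rather than CPU
// thread count. Proof and verify get separate constants because their
// per-element memory footprints differ (verify is bounded by expansion
// memory, proof by the CPU/GPU pipeline).
const PER_GPU_PROOF_ELEMENTS: usize = 250;
const PER_GPU_VERIFY_ELEMENTS: usize = 60;

fn proof_throughput_elements(num_gpus: usize) -> usize {
    PER_GPU_PROOF_ELEMENTS * num_gpus.max(1)
}

fn verify_throughput_elements(num_gpus: usize) -> usize {
    PER_GPU_VERIFY_ELEMENTS * num_gpus.max(1)
}

fn main() {
    println!("proof batch on 8 GPUs: {}", proof_throughput_elements(8));
    println!("verify batch on 8 GPUs: {}", verify_throughput_elements(8));
}
```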
tfhe-zk-pok/src/gpu/tests/prove_verify_stress.rs line 214 at r3 (raw file):
Previously, nsarlin-zama (Nicolas Sarlin) wrote…
you should print the seed so we can replay if a bug happens
Good point. I added a print to test_pke_v1_gpu_cpu_equivalence and test_pke_v2_gpu_cpu_equivalence. I'm printing as hex, since this is what is done in the other zk tests.
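Printing the seed in hex before a randomized run, so a failure can be replayed, can be as simple as the sketch below (`log_seed` is a hypothetical helper, not the actual test code):

```rust
// Hedged sketch: log the RNG seed in hex at the start of a randomized
// test so a failing run can be reproduced deterministically.
fn log_seed(seed: u128) -> String {
    format!("seed: {seed:x}")
}

fn main() {
    let seed: u128 = 0xdead_beef;
    println!("{}", log_seed(seed));
}
```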
tfhe-zk-pok/src/gpu/pke.rs line 1 at r5 (raw file):
Previously, nsarlin-zama (Nicolas Sarlin) wrote…
I'm wondering if it's worth including the zkv1, we might deprecate it on the cpu at some point, it's not used in prod, will never be and has been removed from the NIST doc.
And it adds a lots of code so it's not "free"
If we are sure it's never going to be used, then yeah, we should remove it. It's adding more code to maintain and increasing CI costs. Can I remove it?
tfhe-zk-pok/src/gpu/pke_v2.rs line 75 at r5 (raw file):
Previously, nsarlin-zama (Nicolas Sarlin) wrote…
just checking, we went for the "duplicate" solution so that you could later have more freedom in how you optimize the proof for the gpu. Is it still in the plan?
And right now, is it just a copy paste of the cpu code with msm replaced by gpu msm?
Yes, that's the idea! For now this is just a line-by-line copy, only replacing the MSM as you noticed. I plan to start playing with those implementations in a follow-up PR once we merge this.
Also, as we agreed, I added the test to exhaustively compare CPU and GPU results, did you see it?
nsarlin-zama
left a comment
@nsarlin-zama made 2 comments and resolved 2 discussions.
Reviewable status: 9 of 42 files reviewed, 11 unresolved discussions (waiting on agnesLeroy, andrei-stoian-zama, IceTDrinker, pdroalves, soonum, SouchonTheo, and tmontaigu).
tfhe-zk-pok/src/gpu/pke.rs line 1 at r5 (raw file):
Previously, pdroalves (Pedro Alves) wrote…
If we are sure it's never going to be used, so yeah we should remove it. It's adding more code to maintain and increasing CI costs. Can I remove it?
cc @IceTDrinker
tfhe-zk-pok/src/gpu/pke_v2.rs line 75 at r5 (raw file):
Previously, pdroalves (Pedro Alves) wrote…
Yes, that's the idea! For now this is just a copy line by line, just replacing the MSM as you noticed. I plan to start playing with those implementation in a following PR when we merge this.
Also, as we agreed, I added the test to exhaustively compare CPU and GPU results, did you see it?
yes that's great!
IceTDrinker
left a comment
@IceTDrinker made 1 comment.
Reviewable status: 9 of 42 files reviewed, 11 unresolved discussions (waiting on agnesLeroy, andrei-stoian-zama, nsarlin-zama, pdroalves, soonum, SouchonTheo, and tmontaigu).
tfhe-zk-pok/src/gpu/pke.rs line 1 at r5 (raw file):
Previously, nsarlin-zama (Nicolas Sarlin) wrote…
cc @IceTDrinker
my opinion is you can remove the zkv1 code, we don't use it and as indicated by Nicolas it's not in the NIST submission anymore
if you are worried you will need it in the future : just keep a branch around with the code, this way it won't be completely lost
but yeah, I think you are safe removing it
Verify: 15 → 60 elem/GPU (+22% throughput, limited by expansion memory). Proof: 30 → 250 elem/GPU (5x improvement, limited by CPU/GPU pipeline).
pdroalves
left a comment
@pdroalves made 5 comments and resolved 1 discussion.
Reviewable status: 9 of 42 files reviewed, 10 unresolved discussions (waiting on agnesLeroy, andrei-stoian-zama, IceTDrinker, nsarlin-zama, soonum, SouchonTheo, and tmontaigu).
tfhe-zk-pok/src/gpu/pke.rs line 1 at r5 (raw file):
Previously, IceTDrinker wrote…
my opinion is you can remove the zkv1 code, we don't use it and as indicated by Nicolas it's not in the NIST submission anymore
if you are worried you will need it in the future : just keep a branch around with the code, this way it won't be completely lost
but yeah, I think you are safe removing it
Ok, removed.
v1 is deprecated and the GPU code path is dead weight. Only v2 GPU support remains. The v1 prove now unconditionally uses the CPU path.
Add missing paths that affect GPU ZK test results: tfhe-cuda-backend, core_crypto GPU primitives, shortint parameters, and CPU-side ZK types. Broaden integer/gpu/zk to integer/gpu to catch shared GPU modules.
Wire the zk-cuda-backend into tfhe-zk-pok and tfhe.
GPU acceleration module (`tfhe-zk-pok/src/gpu/`)
New `gpu` module gated behind the `gpu-experimental` feature flag, containing:
- Conversions between `tfhe-zk-pok` (arkworks) and `zk-cuda-backend` types: `G1Affine`, `G2Affine`, `Zp`/`Scalar`, `Fq`/`Fp`, `Fq2`/`Fp2`. Includes compile-time `size_of`/`align_of` assertions to guarantee transmute safety between wrapper types and inner arkworks types.
- MSM entry points (`g1_msm_gpu`, `g2_msm_gpu`): create a CUDA stream, dispatch MSM to the backend, and convert results from Montgomery form back to arkworks projective coordinates. Multi-GPU support via `select_gpu_for_msm()`, which distributes work across GPUs by rayon thread index.

GPU prove/verify for PKE v1 and v2
GPU variants of prove and verify are structured as submodules of each proof module, mirroring the CPU API:
- `pke::prove(...)` -> `pke::gpu::prove(...)`
- `pke::verify(...)` -> `pke::gpu::verify(...)`
- `pke_v2::prove(...)` -> `pke_v2::gpu::prove(...)`
- `pke_v2::verify(...)` -> `pke_v2::gpu::verify(...)`
- `proofs/pke/gpu.rs`: duplicates the v1 `prove_impl` logic, replacing every `multi_mul_scalar` with `g1_msm_gpu`/`g2_msm_gpu`. Verification delegates to the CPU verifier since v1 verify has no MSM calls.
- `proofs/pke_v2/gpu.rs`: duplicates the v2 `prove_impl` and `verify_impl`, replacing MSM sites. Also provides GPU variants of `pairing_check_two_steps` and `pairing_check_batched` for verification.

Visibility changes in `proofs/`
Several items that were previously `pub(super)` or private are now `pub(crate)` so the GPU submodules can access them:
- `proofs/mod.rs`: `OneBased` inner field, `assert_pke_proof_preconditions`, `decode_q`, `compute_r1`, `compute_r2`, `Sid::to_le_bytes`, `SidBytes`, `run_in_pool`, test utilities (`PkeTestParameters`, `PkeTestcase`, etc.)
- `proofs/pke/mod.rs`: `bit_iter`, `compute_a_theta`, `PublicCommit`/`PrivateCommit` fields, test constants
- `proofs/pke_v2/mod.rs`: `bit_iter`, `compute_a_theta`, `compute_crs_params`, `inf_norm_bound_to_euclidean_squared`, `ComputeLoadProofFields::to_le_bytes`, `GeneratedScalars`, `EvaluationPoints`, `PublicCommit`/`PrivateCommit` fields, test constants
- `proofs/pke_v2/hashes.rs`: `RHash`, `PhiHash`, `XiHash`, `YHash`, `THash`, `ThetaHash`, `OmegaHash`, `DeltaHash`, `ZHash` and their chained `gen_*` methods

Tests
- `gpu/tests/zk_cuda_backend.rs`: low-level integration tests for the GPU backend: MSM correctness for G1/G2, type round-trip conversions, multi-GPU dispatch.
- `gpu/tests/prove_verify_stress.rs`: exhaustive GPU-vs-CPU equivalence tests for both v1 and v2. Each test iterates over 3 CRS variants x 32 invalid-witness combinations x 2 compute loads (v2 also sweeps both pairing modes), asserting byte-identical serialized proofs and matching accept/reject decisions.

Benchmarks
- `tfhe-zk-pok/benches/pke_v1.rs` and `pke_v2.rs`: GPU benchmark groups added behind `gpu-experimental`, covering prove and verify for both protocol versions.
- `tfhe-benchmark/benches/zk/msm.rs`: standalone MSM benchmarks (CPU and GPU) for G1 and G2 at various sizes, useful for profiling the MSM kernel in isolation.
- `tfhe-benchmark/benches/integer/zk_pke.rs`: GPU throughput benchmark tuning: replaces hardcoded element counts with a `gpu_zk_throughput_elements` function sized per CRS/bit configuration to avoid OOM. Creates per-GPU server keys and streams for proper multi-GPU scaling.

Build system and CI
- `tfhe-zk-pok/Cargo.toml`: new `gpu-experimental` feature depending on `zk-cuda-backend` and `tfhe-cuda-backend`. `ark-ec`/`ark-ff` switched to workspace dependencies. Added `itertools` dependency.
- `tfhe/Cargo.toml`: new `gpu-experimental-zk` feature forwarding to `tfhe-zk-pok/gpu-experimental`.
- `tfhe-benchmark/Cargo.toml`: new `gpu-experimental-zk` feature; `zk-pok` now brings in `dep:tfhe-zk-pok` directly. New `zk-msm` bench target.
- `Makefile`: new targets `test_zk_pok_gpu`, `test_integer_zk_gpu`, `test_zk_cuda`, `bench_msm_zk`, `bench_msm_zk_gpu`, `bench_tfhe_zk_pok_gpu`. Existing GPU clippy/check/build targets updated to include `gpu-experimental-zk`. `bench_integer_zk_gpu` switched from the `release_lto_off` to the `release` profile. Minor trailing-comma fixes in `bench_integer_aes*_gpu`.
- `.github/workflows/`: `benchmark_gpu.yml` adds `tfhe_zk_pok` and `msm_zk` commands. `benchmark_cpu.yml` adds `msm_zk`. `gpu_zk_tests.yml` broadens file change triggers and runs `test_zk_pok_gpu` + `test_integer_zk_gpu`.
- `.gitignore`: ignores `backends/tfhe-cuda-backend/cuda/build/`.

New Makefile targets
- `test_zk_cuda_backend`: builds the zk-cuda-backend C++ tests via CMake and runs both the C++ and Rust test suites.
- `test_zk_pok_gpu`: runs `tfhe-zk-pok` tests with the `gpu-experimental` feature, filtering to GPU tests.
- `test_integer_zk_gpu`: runs tfhe integer ZK tests with expand accelerated on the GPU.
- `test_integer_zk_experimental_gpu`: same as above but accelerating the entire verify/proof circuit using `zk-cuda-backend`.
- `bench_msm_zk`: runs CPU MSM benchmarks for G1 and G2 at various sizes.
- `bench_msm_zk_gpu`: runs GPU MSM benchmarks for G1 and G2 at various sizes.
- `bench_tfhe_zk_pok_gpu`: runs `tfhe-zk-pok` prove/verify benchmarks with GPU acceleration.

closes: https://github.com/zama-ai/tfhe-rs-internal/issues/1336
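The compile-time transmute-safety checks mentioned in the module description can be sketched with const assertions; `Wrapper`/`Inner` below are illustrative stand-ins for the arkworks types and their zk-cuda-backend counterparts:

```rust
use core::mem::{align_of, size_of};

// Illustrative inner/wrapper pair; in the PR these are arkworks field or
// point types and their transparent GPU-side wrappers.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Inner([u64; 4]);

#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Debug)]
struct Wrapper(Inner);

// Compile-time layout assertions: if either fails, the crate does not
// build, so a transmute between the two types can never silently become
// unsound after a type change.
const _: () = assert!(size_of::<Wrapper>() == size_of::<Inner>());
const _: () = assert!(align_of::<Wrapper>() == align_of::<Inner>());

fn main() {
    let w = Wrapper(Inner([1, 2, 3, 4]));
    // Sound only because of #[repr(transparent)] plus the checks above.
    let i: Inner = unsafe { core::mem::transmute(w) };
    println!("{i:?}");
}
```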