
feat(gpu): integrate zk-cuda-backend with tfhe-zk-pok #3275

Open
pdroalves wants to merge 14 commits into main from pa/feat/rust-zk-cuda-backend

Conversation

@pdroalves (Contributor) commented Feb 6, 2026

Wire the zk-cuda-backend into tfhe-zk-pok and tfhe.

GPU acceleration module (tfhe-zk-pok/src/gpu/)

New gpu module gated behind the gpu-experimental feature flag, containing:

  • Type conversions between tfhe-zk-pok (arkworks) and zk-cuda-backend types: G1Affine, G2Affine, Zp/Scalar, Fq/Fp, Fq2/Fp2. Includes compile-time size_of/align_of assertions to guarantee transmute safety between wrapper types and inner arkworks types.
  • GPU MSM helpers (g1_msm_gpu, g2_msm_gpu): create a CUDA stream, dispatch MSM to the backend, convert results from Montgomery form back to arkworks projective coordinates. Multi-GPU support via select_gpu_for_msm() which distributes work across GPUs by rayon thread index.
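The two mechanisms above can be pictured with a std-only sketch (the newtype layout and the `select_gpu` signature are simplified stand-ins; the real code wraps arkworks types, and `select_gpu_for_msm()` reads the rayon thread index and the actual GPU count):

```rust
use core::mem::{align_of, size_of};

// Hypothetical repr(transparent) wrapper mirroring the PR's pattern; the real
// code wraps arkworks types. repr(transparent) guarantees identical layout.
#[repr(transparent)]
pub struct G1AffineWrapper(pub [u64; 6]); // stand-in for the inner type

// Compile-time size/align assertions that make transmuting between the
// wrapper and the inner type sound, as described in the PR.
const _: () = assert!(size_of::<G1AffineWrapper>() == size_of::<[u64; 6]>());
const _: () = assert!(align_of::<G1AffineWrapper>() == align_of::<[u64; 6]>());

/// Round-robin GPU selection by worker-thread index; the PR's
/// `select_gpu_for_msm()` does this with the rayon thread index.
pub fn select_gpu(thread_index: usize, num_gpus: u32) -> u32 {
    (thread_index as u32) % num_gpus.max(1)
}

fn main() {
    // Four worker threads on a 2-GPU machine alternate between GPU 0 and 1.
    let assignment: Vec<u32> = (0..4).map(|t| select_gpu(t, 2)).collect();
    assert_eq!(assignment, vec![0, 1, 0, 1]);
}
```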

GPU prove/verify for PKE v1 and v2

GPU variants of prove and verify are structured as submodules of each proof module, mirroring the CPU API:

| CPU | GPU |
| --- | --- |
| pke::prove(...) | pke::gpu::prove(...) |
| pke::verify(...) | pke::gpu::verify(...) |
| pke_v2::prove(...) | pke_v2::gpu::prove(...) |
| pke_v2::verify(...) | pke_v2::gpu::verify(...) |
  • proofs/pke/gpu.rs: Duplicates v1 prove_impl logic, replacing every multi_mul_scalar with g1_msm_gpu/g2_msm_gpu. Verification delegates to the CPU verifier since v1 verify has no MSM calls.
  • proofs/pke_v2/gpu.rs: Duplicates v2 prove_impl and verify_impl, replacing MSM sites. Also provides GPU variants of pairing_check_two_steps and pairing_check_batched for verification.
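The mirrored layout can be sketched as follows (stub bodies only; in the PR the `gpu` submodule is gated behind `gpu-experimental` and duplicates `prove_impl` with the MSM calls swapped for `g1_msm_gpu`/`g2_msm_gpu`):

```rust
pub mod pke_v2 {
    // CPU path (placeholder body; the real one runs the full prover).
    pub fn prove(input: &[u8]) -> Vec<u8> {
        input.to_vec()
    }

    // In the PR this module sits behind #[cfg(feature = "gpu-experimental")].
    pub mod gpu {
        // GPU path (placeholder); it must produce byte-identical proofs.
        pub fn prove(input: &[u8]) -> Vec<u8> {
            input.to_vec()
        }
    }
}

fn main() {
    let input = [1u8, 2, 3];
    // Callers pick the path explicitly, e.g. pke_v2::gpu::prove(...).
    assert_eq!(pke_v2::prove(&input), pke_v2::gpu::prove(&input));
}
```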

Visibility changes in proofs/

Several items that were previously pub(super) or private are now pub(crate) so the GPU submodules can access them:

  • proofs/mod.rs: OneBased inner field, assert_pke_proof_preconditions, decode_q, compute_r1, compute_r2, Sid::to_le_bytes, SidBytes, run_in_pool, test utilities (PkeTestParameters, PkeTestcase, etc.)
  • proofs/pke/mod.rs: bit_iter, compute_a_theta, PublicCommit/PrivateCommit fields, test constants
  • proofs/pke_v2/mod.rs: bit_iter, compute_a_theta, compute_crs_params, inf_norm_bound_to_euclidean_squared, ComputeLoadProofFields::to_le_bytes, GeneratedScalars, EvaluationPoints, PublicCommit/PrivateCommit fields, test constants
  • proofs/pke_v2/hashes.rs: RHash, PhiHash, XiHash, YHash, THash, ThetaHash, OmegaHash, DeltaHash, ZHash and their chained gen_* methods

Tests

  • gpu/tests/zk_cuda_backend.rs: Low-level integration tests for the GPU backend — MSM correctness for G1/G2, type round-trip conversions, multi-GPU dispatch.
  • gpu/tests/prove_verify_stress.rs: Exhaustive GPU-vs-CPU equivalence tests for both v1 and v2. Each test iterates over 3 CRS variants x 32 invalid-witness combinations x 2 compute loads (v2 also sweeps both pairing modes), asserting byte-identical serialized proofs and matching accept/reject decisions.
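The sweep in prove_verify_stress.rs has roughly this shape (the prover functions here are trivial stand-ins; the real tests serialize actual CPU and GPU proofs and compare them):

```rust
// Stand-ins for the CPU and GPU provers, keyed on the swept parameters.
fn prove_cpu(crs: usize, witness: u32, load: bool) -> Vec<u8> {
    vec![crs as u8, witness as u8, u8::from(load)]
}
fn prove_gpu(crs: usize, witness: u32, load: bool) -> Vec<u8> {
    vec![crs as u8, witness as u8, u8::from(load)]
}

fn main() {
    // 3 CRS variants x 32 invalid-witness combinations x 2 compute loads,
    // asserting byte-identical serialized proofs for every combination.
    let mut checked = 0;
    for crs in 0..3 {
        for witness in 0..32u32 {
            for load in [false, true] {
                assert_eq!(prove_cpu(crs, witness, load), prove_gpu(crs, witness, load));
                checked += 1;
            }
        }
    }
    assert_eq!(checked, 192); // 3 * 32 * 2
}
```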

Benchmarks

  • tfhe-zk-pok/benches/pke_v1.rs and pke_v2.rs: GPU benchmark groups added behind gpu-experimental, covering prove and verify for both protocol versions.
  • tfhe-benchmark/benches/zk/msm.rs: Standalone MSM benchmarks (CPU and GPU) for G1 and G2 at various sizes, useful for profiling the MSM kernel in isolation.
  • tfhe-benchmark/benches/integer/zk_pke.rs: GPU throughput benchmark tuning — replaces hardcoded element counts with a gpu_zk_throughput_elements function sized per CRS/bit configuration to avoid OOM. Creates per-GPU server keys and streams for proper multi-GPU scaling.
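The sizing helper might look like the following (the function name matches the PR description, but the thresholds are invented for illustration; the tuned values live in the benchmark itself):

```rust
/// Illustration only: choose a per-GPU element count small enough to avoid
/// OOM on large CRS/bit configurations, then scale by the GPU count.
fn gpu_zk_throughput_elements(crs_bits: usize, num_gpus: u32) -> usize {
    let per_gpu = if crs_bits >= 2048 { 15 } else { 60 }; // invented thresholds
    per_gpu * num_gpus.max(1) as usize
}

fn main() {
    assert_eq!(gpu_zk_throughput_elements(4096, 8), 120); // 15 per GPU x 8 GPUs
    assert_eq!(gpu_zk_throughput_elements(1024, 1), 60);
}
```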

Build system and CI

  • tfhe-zk-pok/Cargo.toml: New gpu-experimental feature depending on zk-cuda-backend and tfhe-cuda-backend. ark-ec/ark-ff switched to workspace dependencies. Added itertools dependency.
  • tfhe/Cargo.toml: New gpu-experimental-zk feature forwarding to tfhe-zk-pok/gpu-experimental.
  • tfhe-benchmark/Cargo.toml: New gpu-experimental-zk feature; zk-pok now brings in dep:tfhe-zk-pok directly. New zk-msm bench target.
  • Makefile: New targets test_zk_pok_gpu, test_integer_zk_gpu, test_zk_cuda, bench_msm_zk, bench_msm_zk_gpu, bench_tfhe_zk_pok_gpu. Existing GPU clippy/check/build targets updated to include gpu-experimental-zk. bench_integer_zk_gpu switched from release_lto_off to release profile. Minor trailing-comma fixes in bench_integer_aes*_gpu.
  • .github/workflows/: benchmark_gpu.yml adds tfhe_zk_pok and msm_zk commands. benchmark_cpu.yml adds msm_zk. gpu_zk_tests.yml broadens file change triggers and runs test_zk_pok_gpu + test_integer_zk_gpu.
  • .gitignore: Ignores backends/tfhe-cuda-backend/cuda/build/.

New Makefile targets

  • test_zk_cuda_backend: Builds the zk-cuda-backend C++ tests via CMake and runs both the C++ and Rust test suites.
  • test_zk_pok_gpu: Runs tfhe-zk-pok tests with the gpu-experimental feature, filtering to GPU tests.
  • test_integer_zk_gpu: Runs tfhe integer ZK tests with expand accelerated on the GPU.
  • test_integer_zk_experimental_gpu: Same as above but accelerating the entire verify/proof circuit using zk-cuda-backend.
  • bench_msm_zk: Runs CPU MSM benchmarks for G1 and G2 at various sizes.
  • bench_msm_zk_gpu: Runs GPU MSM benchmarks for G1 and G2 at various sizes.
  • bench_tfhe_zk_pok_gpu: Runs tfhe-zk-pok prove/verify benchmarks with GPU acceleration.

closes: https://github.com/zama-ai/tfhe-rs-internal/issues/1336

PR content/description

Check-list:

  • Tests for the changes have been added (for bug fixes / features)
  • Docs have been added / updated (for bug fixes / features)
  • Relevant issues are marked as resolved/closed, related issues are linked in the description
  • Check for breaking changes (including serialization changes) and add them to commit message following the conventional commit specification


@cla-bot cla-bot bot added the cla-signed label Feb 6, 2026
@pdroalves force-pushed the pa/feat/zk-cuda-backend branch from acd4e17 to 15638fd on February 6, 2026 19:56
@pdroalves force-pushed the pa/feat/rust-zk-cuda-backend branch 2 times, most recently from c55fde3 to ae99e9d on February 6, 2026 20:04
@pdroalves force-pushed the pa/feat/zk-cuda-backend branch from 15638fd to e500003 on February 6, 2026 20:05
@pdroalves marked this pull request as ready for review on February 6, 2026 20:06
@pdroalves marked this pull request as draft on February 6, 2026 20:07
@pdroalves force-pushed the pa/feat/zk-cuda-backend branch from e500003 to 7ff06f9 on February 9, 2026 12:46
@pdroalves force-pushed the pa/feat/rust-zk-cuda-backend branch from ae99e9d to 01362e8 on February 9, 2026 13:10
@pdroalves force-pushed the pa/feat/zk-cuda-backend branch 5 times, most recently from 849e110 to d4b3c54 on February 9, 2026 17:37
@pdroalves force-pushed the pa/feat/rust-zk-cuda-backend branch 2 times, most recently from 0e83c12 to 757d5d0 on February 10, 2026 15:02
@pdroalves force-pushed the pa/feat/zk-cuda-backend branch 4 times, most recently from d349076 to 54f63ae on February 12, 2026 11:20
@pdroalves force-pushed the pa/feat/rust-zk-cuda-backend branch 2 times, most recently from 61bfe14 to 98753c0 on February 12, 2026 12:15
@pdroalves force-pushed the pa/feat/zk-cuda-backend branch 2 times, most recently from 78a08f4 to 19443ef on February 13, 2026 14:56
@pdroalves force-pushed the pa/feat/rust-zk-cuda-backend branch from 98753c0 to 76c6bb4 on February 13, 2026 17:13
Base automatically changed from pa/feat/zk-cuda-backend to main on February 14, 2026 01:30
@pdroalves force-pushed the pa/feat/rust-zk-cuda-backend branch 5 times, most recently from c73ab78 to 7ef51c0 on March 5, 2026 00:09
@pdroalves marked this pull request as ready for review on March 5, 2026 00:13
@pdroalves requested review from a team and SouchonTheo as code owners on March 5, 2026 00:13
@pdroalves (Contributor, Author) commented:

Hey @nsarlin-zama @IceTDrinker this PR is finally ready for review. I did my best to make your review easier, but it is still a large PR. I hope you don't have to spend time on obvious issues.

@IceTDrinker (Member) left a comment:

thanks @pdroalves we have some other topics to finish first on our end, but we'll take a look when we get a chance !

@IceTDrinker made 1 comment.
Reviewable status: 0 of 25 files reviewed, all discussions resolved (waiting on agnesLeroy, andrei-stoian-zama, nsarlin-zama, soonum, SouchonTheo, and tmontaigu).

@andrei-stoian-zama (Contributor) left a comment:

I think this PR looks very good!

I have some minor comments, and one design question about rayon thread pools / gpu_indexes.

-p tfhe-zk-pok --features experimental,gpu-experimental -- gpu

.PHONY: test_integer_zk_gpu # Run tfhe-zk-pok tests
test_integer_zk_gpu: install_rs_check_toolchain
Contributor commented:

this target and the above one do not use the gpu-experimental feature. what do these targets test ? old code ?

Contributor Author replied:

Yes, test_integer_zk_gpu executes verify_and_expand with GPU-accelerated expand (but CPU verify).

test_zk_pok_gpu has gpu-experimental, so the target name should actually be test_zk_pok_experimental_gpu, matching test_integer_zk_experimental_gpu.

- name: Run zk-cuda-backend integration tests
run: |
make test_zk_cuda_backend
make test_zk_pok_gpu
Contributor commented:

how much CI time is added ?

Contributor Author replied:


.PHONY: test_gpu # Run the tests of the core_crypto module including experimental on the gpu backend
test_gpu: test_core_crypto_gpu test_integer_gpu test_cuda_backend
test_gpu: test_core_crypto_gpu test_integer_gpu test_cuda_backend test_zk_cuda_backend
Contributor commented:

should we run the zk tests on approved ? only when the zk backend changes ?

Contributor Author replied:

Given it's a 15min overhead, I would run them every time some zk-related thing changes.

Contributor commented:

ok, could you add a workflow filter for that ?

Contributor Author replied:

That's already what's happening.

In the should-run job the current file patterns are:

- tfhe/Cargo.toml
- tfhe/build.rs
- backends/zk-cuda-backend/**
- tfhe/src/integer/gpu/zk/**
- tfhe-zk-pok/**
- tfhe/docs/**/**.md
- .github/workflows/gpu_zk_tests.yml
- ci/slab.toml

For completeness I will add a few more items to that list:

- tfhe/src/zk/**
- backends/tfhe-cuda-backend/**
- tfhe/src/core_crypto/gpu/** (tfhe-zk-pok depends on CudaStreams)
- tfhe/src/integer/gpu/** (ZK depends on ciphertext::compact_list and key_switching_key, we want to catch anything that could break there)
- tfhe/src/shortint/parameters/** (in case something changes in the parameters)

/// - If `bases` and `scalars` have different lengths (checked inside the backend).
/// - If the GPU MSM call fails.
#[must_use]
pub fn g1_msm_gpu(bases: &[G1Affine], scalars: &[Zp], gpu_index: Option<u32>) -> G1 {
Contributor commented:

I don't see a good reason to use Optional gpu_index.

Contributor Author replied:

That was useful when we were sharing implementation code with the CPU. Not anymore. I will remove the optional modifier.

/// - If `bases` and `scalars` have different lengths (checked inside the backend).
/// - If the GPU MSM call fails.
#[must_use]
pub fn g2_msm_gpu(bases: &[G2Affine], scalars: &[Zp], gpu_index: Option<u32>) -> G2 {
Contributor commented:

again, no need for optional gpu_index

Contributor Author replied:

I will remove this one as well.

/// across all available GPUs. Returns `None` (meaning GPU 0) when only one GPU
/// is present.
#[inline]
pub fn select_gpu_for_msm() -> Option<u32> {
Contributor commented:

I don't understand why we have to rely on rayon here. is our multi-gpu logic necessarily related to rayon use ? in the cuda backend we use rayon in benchmarks but we don't assume rayon is needed.

Contributor Author replied:

Quick answer: yes, our multi-GPU logic depends on rayon. We link one GPU per rayon thread.

Longer answer: The original version of this PR had verify_impl and prove_impl shared between CPU and GPU. This version splits those, so we now have our own. This opens the possibility to rewrite and reorganize those methods freely, but that's not what I'm doing in this PR. It has the minimum necessary changes to integrate the zk backend.

That's the kind of work I plan to tackle next.

Contributor commented:

makes sense! can you create an issue detailing the future work please ?

Contributor Author replied:

///
/// The result is cached after the first call since GPU count cannot change
/// during execution.
pub(crate) fn get_num_gpus() -> u32 {
Contributor commented:

can't we have a common file with the tfhe backend and use the function we used traditionally ?

Contributor Author replied:

We could, just need to think where.

We can't use the version on tfhe/ otherwise it would create a circular dependency with tfhe-zk-pok. We could maybe convert the current method we have in tfhe/src/core_crypto/gpu/mod.rs to tfhe-cuda-backend bindings as a safe rust wrapper. I suppose that's feasible.

Contributor commented:

should it be done in this PR or a new one ?

Contributor Author replied:

I would do it in a new PR. We already have to work on the common compilation artifacts between both CUDA backends. I think this would be part of that. I added it to the issue we have about that: https://github.com/zama-ai/tfhe-rs-internal/issues/1325

Contributor commented:

some comments:

  • // FIXME: div_round -> anything to do for it ? remove it ? 
    
  • is there a reference for the proof - a link to some PDF ? CPU doesn't have one either..

Contributor Author replied:

Contributor commented:

if there is a publicly accessible reference we should add, if possible, comments in the code such as: see "Fancy ZK paper" by Joye et al. 2007 - algorithm A. If there isn't a publicly available paper then nothing to be done, we'll handle docs internally.

Contributor Author replied:

Not sure if we have a public reference for that. @nsarlin-zama do you?

Contributor commented:

why do we use the run_in_pool mechanism ? CPU uses it, is it the best for us ?

Contributor Author replied:

Possibly not. As before, this integration is done with the minimum code change to verify_impl and prove_impl. I plan to approach this type of thing in a future PR.

@nsarlin-zama (Contributor) left a comment:

@nsarlin-zama partially reviewed 15 files and made 4 comments.
Reviewable status: 12 of 26 files reviewed, 13 unresolved discussions (waiting on agnesLeroy, andrei-stoian-zama, pdroalves, soonum, SouchonTheo, and tmontaigu).


tfhe-zk-pok/src/gpu/pke_v2.rs line 75 at r5 (raw file):


#[allow(clippy::too_many_arguments)]
fn prove_impl(

just checking, we went for the "duplicate" solution so that you could later have more freedom in how you optimize the proof for the gpu. Is it still in the plan?

And right now, is it just a copy paste of the cpu code with msm replaced by gpu msm?


tfhe-zk-pok/src/gpu/tests/prove_verify_stress.rs line 214 at r3 (raw file):

#[test]
fn test_pke_v2_gpu_cpu_equivalence() {
    let params = crate::proofs::pke_v2::tests::PKEV2_TEST_PARAMS;

you should print the seed so we can replay if a bug happens


tfhe-zk-pok/src/gpu/pke.rs line 1 at r5 (raw file):

//! GPU-accelerated prove/verify for PKE v1.

I'm wondering if it's worth including the zkv1, we might deprecate it on the cpu at some point, it's not used in prod, will never be and has been removed from the NIST doc.
And it adds a lot of code, so it's not "free".


tfhe-benchmark/benches/integer/zk_pke.rs line 948 at r5 (raw file):

                                // batch size
                                let elements =
                                    (rayon::current_num_threads() / num_block).max(1) + 1;

This might not be a good metric for the gpu ?

…dd PTX carry chains for fp_add/sub and branchless reduction

- replace software carry detection (carry = (sum < old) ? 1 : 0) with
  inline PTX hardware carry flags (add.cc.u64/addc.u64)
- replace software carry detection in fp_add_raw/fp_sub_raw with inline
  PTX add.cc.u64/addc.cc.u64 and sub.cc.u64/subc.cc.u64 chains
- now we always compute both reduced and unreduced result and select via bitmask

Callers always pass a concrete `u32` via `select_gpu_for_msm()`, so the
`Option<u32>` indirection added no value. Simplifies the API per review
feedback on PR #3275.

…enchmark

Replace the CPU thread count heuristic (rayon::current_num_threads() /
num_block) with gpu_zk_throughput_elements() scaled by GPU count, matching
the approach already used by gpu_pke_zk_verify. Also use compute_load_config()
to respect the __TFHE_RS_BENCH_OP_FLAVOR env var for fast mode.

Separate gpu_zk_throughput_elements into two functions:
- gpu_zk_proof_throughput_elements: ~2x higher values since proof MSM
  uses ~0.5-1.8 MB/element (compute-bound, not memory-bound)
- gpu_zk_verify_throughput_elements: unchanged values tuned for
  expansion OOM avoidance
@pdroalves (Contributor, Author) left a comment:

@pdroalves made 4 comments.
Reviewable status: 12 of 41 files reviewed, 13 unresolved discussions (waiting on agnesLeroy, andrei-stoian-zama, nsarlin-zama, soonum, SouchonTheo, and tmontaigu).


tfhe-benchmark/benches/integer/zk_pke.rs line 948 at r5 (raw file):

Previously, nsarlin-zama (Nicolas Sarlin) wrote…

This might not be a good metric for the gpu ?

That's a terrible metric. Our throughput benchmark was broken when I started this and honestly I forgot to give it some love. I just changed it to use a GPU-specific function to calculate the number of elements. Since proof and verify have different memory footprints, I have a different method for each. I will run a few benchmarks on an 8xH100 instance to see what the optimal set of values is for them.


tfhe-zk-pok/src/gpu/tests/prove_verify_stress.rs line 214 at r3 (raw file):

Previously, nsarlin-zama (Nicolas Sarlin) wrote…

you should print the seed so we can replay if a bug happens

Good point. I added a print to test_pke_v1_gpu_cpu_equivalence and test_pke_v2_gpu_cpu_equivalence. I'm printing as hex, since this is what is done in the other zk tests.
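For reference, printing a replayable seed as hex might look like this (the variable and width here are hypothetical; the real tests print whatever seed type they draw):

```rust
fn main() {
    // Zero-padded hex: width 34 = "0x" prefix + 32 hex digits of a u128.
    let seed: u128 = 0x0123_4567_89ab_cdef_0011_2233_4455_6677;
    println!("seed: {seed:#034x}"); // paste this value back to replay a failure
    assert_eq!(format!("{seed:#034x}"), "0x0123456789abcdef0011223344556677");
}
```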


tfhe-zk-pok/src/gpu/pke.rs line 1 at r5 (raw file):

Previously, nsarlin-zama (Nicolas Sarlin) wrote…

I'm wondering if it's worth including the zkv1, we might deprecate it on the cpu at some point, it's not used in prod, will never be and has been removed from the NIST doc.
And it adds a lots of code so it's not "free"

If we are sure it's never going to be used, then yeah, we should remove it. It's adding more code to maintain and increasing CI costs. Can I remove it?


tfhe-zk-pok/src/gpu/pke_v2.rs line 75 at r5 (raw file):

Previously, nsarlin-zama (Nicolas Sarlin) wrote…

just checking, we went for the "duplicate" solution so that you could later have more freedom in how you optimize the proof for the gpu. Is it still in the plan?

And right now, is it just a copy paste of the cpu code with msm replaced by gpu msm?

Yes, that's the idea! For now this is just a line-by-line copy, only replacing the MSM as you noticed. I plan to start playing with those implementations in a follow-up PR once we merge this.

Also, as we agreed, I added the test to exhaustively compare CPU and GPU results, did you see it?

@nsarlin-zama (Contributor) left a comment:

@nsarlin-zama made 2 comments and resolved 2 discussions.
Reviewable status: 9 of 42 files reviewed, 11 unresolved discussions (waiting on agnesLeroy, andrei-stoian-zama, IceTDrinker, pdroalves, soonum, SouchonTheo, and tmontaigu).


tfhe-zk-pok/src/gpu/pke.rs line 1 at r5 (raw file):

Previously, pdroalves (Pedro Alves) wrote…

If we are sure it's never going to be used, so yeah we should remove it. It's adding more code to maintain and increasing CI costs. Can I remove it?

cc @IceTDrinker


tfhe-zk-pok/src/gpu/pke_v2.rs line 75 at r5 (raw file):

Previously, pdroalves (Pedro Alves) wrote…

Yes, that's the idea! For now this is just a copy line by line, just replacing the MSM as you noticed. I plan to start playing with those implementation in a following PR when we merge this.

Also, as we agreed, I added the test to exhaustively compare CPU and GPU results, did you see it?

yes that's great!

@IceTDrinker (Member) left a comment:

@IceTDrinker made 1 comment.
Reviewable status: 9 of 42 files reviewed, 11 unresolved discussions (waiting on agnesLeroy, andrei-stoian-zama, nsarlin-zama, pdroalves, soonum, SouchonTheo, and tmontaigu).


tfhe-zk-pok/src/gpu/pke.rs line 1 at r5 (raw file):

Previously, nsarlin-zama (Nicolas Sarlin) wrote…

cc @IceTDrinker

my opinion is you can remove the zkv1 code, we don't use it and as indicated by Nicolas it's not in the NIST submission anymore

if you are worried you will need it in the future : just keep a branch around with the code, this way it won't be completely lost

but yeah, I think you are safe removing it

Verify: 15 → 60 elem/GPU (+22% throughput, limited by expansion memory).
Proof: 30 → 250 elem/GPU (5x improvement, limited by CPU/GPU pipeline).
@pdroalves (Contributor, Author) left a comment:

@pdroalves made 5 comments and resolved 1 discussion.
Reviewable status: 9 of 42 files reviewed, 10 unresolved discussions (waiting on agnesLeroy, andrei-stoian-zama, IceTDrinker, nsarlin-zama, soonum, SouchonTheo, and tmontaigu).


tfhe-zk-pok/src/gpu/pke.rs line 1 at r5 (raw file):

Previously, IceTDrinker wrote…

my opinion is you can remove the zkv1 code, we don't use it and as indicated by Nicolas it's not in the NIST submission anymore

if you are worried you will need it in the future : just keep a branch around with the code, this way it won't be completely lost

but yeah, I think you are safe removing it

Ok, removed.



v1 is deprecated and the GPU code path is dead weight. Only v2 GPU
support remains. The v1 prove now unconditionally uses the CPU path.
Add missing paths that affect GPU ZK test results: tfhe-cuda-backend,
core_crypto GPU primitives, shortint parameters, and CPU-side ZK types.
Broaden integer/gpu/zk to integer/gpu to catch shared GPU modules.
4 participants