feat(gpu): integrate zk-cuda-backend with tfhe-zk-pok #3275
Hey @nsarlin-zama @IceTDrinker this PR is finally ready for review. I did my best to make your review easier, but it is still a large PR. Hope you don't have to spend time on obvious issues.
IceTDrinker
left a comment
thanks @pdroalves we have some other topics to finish first on our end, but we'll take a look when we get a chance !
@IceTDrinker made 1 comment.
Reviewable status: 0 of 25 files reviewed, all discussions resolved (waiting on agnesLeroy, andrei-stoian-zama, nsarlin-zama, soonum, SouchonTheo, and tmontaigu).
andrei-stoian-zama
left a comment
I think this PR looks very good!
I have some minor comments, and one design question about rayon thread pools / gpu_indexes.
-p tfhe-zk-pok --features experimental,gpu-experimental -- gpu

.PHONY: test_integer_zk_gpu # Run tfhe-zk-pok tests
test_integer_zk_gpu: install_rs_check_toolchain
this target and the above one do not use the gpu-experimental feature. what do these targets test ? old code ?
Yes, test_integer_zk_gpu executes verify_and_expand with GPU-accelerated expand (but CPU verify).
test_zk_pok_gpu uses gpu-experimental, so the target should really be named test_zk_pok_experimental_gpu, matching test_integer_zk_experimental_gpu.
.github/workflows/gpu_zk_tests.yml
- name: Run zk-cuda-backend integration tests
  run: |
    make test_zk_cuda_backend
    make test_zk_pok_gpu
how much CI time is added ?
.PHONY: test_gpu # Run the tests of the core_crypto module including experimental on the gpu backend
-test_gpu: test_core_crypto_gpu test_integer_gpu test_cuda_backend
+test_gpu: test_core_crypto_gpu test_integer_gpu test_cuda_backend test_zk_cuda_backend
should we run the zk tests on approved ? only when the zk backend changes ?
Given it's a 15min overhead, I would run them every time some zk-related thing changes.
ok, could you add a workflow filter for that ?
That's already what's happening.
In the should-run job the current file patterns are:
- tfhe/Cargo.toml
- tfhe/build.rs
- backends/zk-cuda-backend/**
- tfhe/src/integer/gpu/zk/**
- tfhe-zk-pok/**
- tfhe/docs/**/**.md
- .github/workflows/gpu_zk_tests.yml
- ci/slab.toml
For completeness I will add a few more items to that list:
- tfhe/src/zk/**
- backends/tfhe-cuda-backend/**
- tfhe/src/core_crypto/gpu/** (tfhe-zk-pok depends on CudaStreams)
- tfhe/src/integer/gpu/** (ZK depends on ciphertext::compact_list and key_switching_key, we want to catch anything that could break there)
- tfhe/src/shortint/parameters/** (in case something changes in the parameters)
tfhe-zk-pok/src/gpu/mod.rs
/// - If `bases` and `scalars` have different lengths (checked inside the backend).
/// - If the GPU MSM call fails.
#[must_use]
pub fn g1_msm_gpu(bases: &[G1Affine], scalars: &[Zp], gpu_index: Option<u32>) -> G1 {
I don't see a good reason to use Optional gpu_index.
That was useful when we were sharing implementation code with the CPU. Not anymore. I will remove the optional modifier.
tfhe-zk-pok/src/gpu/mod.rs
/// - If `bases` and `scalars` have different lengths (checked inside the backend).
/// - If the GPU MSM call fails.
#[must_use]
pub fn g2_msm_gpu(bases: &[G2Affine], scalars: &[Zp], gpu_index: Option<u32>) -> G2 {
again, no need for optional gpu_index
I will remove this one as well.
tfhe-zk-pok/src/gpu/mod.rs
/// across all available GPUs. Returns `None` (meaning GPU 0) when only one GPU
/// is present.
#[inline]
pub fn select_gpu_for_msm() -> Option<u32> {
I don't understand why we have to rely on rayon here. is our multi-gpu logic necessarily related to rayon use ? in the cuda backend we use rayon in benchmarks but we don't assume rayon is needed.
Quick answer: yes, our multi-GPU logic depends on rayon. We link one GPU per rayon thread.
Longer answer: The original version of this PR had verify_impl and prove_impl shared between CPU and GPU. This version splits those, so we now have our own. This opens the possibility to rewrite and reorganize those methods freely, but that's not what I'm doing in this PR; it contains the minimum changes necessary to integrate the zk backend.
That's the work I want to tackle next.
makes sense! can you create an issue detailing the future work please ?
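The one-GPU-per-rayon-thread linking described above boils down to mapping a worker-thread index onto the available devices. A minimal sketch, with `select_gpu` as a hypothetical stand-in (the real code would take the index from `rayon::current_thread_index()` rather than a parameter):

```rust
// Hedged sketch: distribute MSM work across GPUs by worker-thread index,
// as described for select_gpu_for_msm(). The thread index is passed in
// explicitly here instead of being queried from rayon.
fn select_gpu(thread_index: usize, num_gpus: u32) -> u32 {
    // Round-robin over the available devices; guard against num_gpus == 0.
    (thread_index as u32) % num_gpus.max(1)
}

fn main() {
    // With 4 GPUs, threads 0..8 wrap around the devices.
    for t in 0..8 {
        println!("thread {t} -> gpu {}", select_gpu(t, 4));
    }
}
```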
///
/// The result is cached after the first call since GPU count cannot change
/// during execution.
pub(crate) fn get_num_gpus() -> u32 {
can't we have a common file with the tfhe backend and use the function we used traditionally ?
We could, just need to think where.
We can't use the version in tfhe/, otherwise it would create a circular dependency with tfhe-zk-pok. We could maybe convert the current method we have in tfhe/src/core_crypto/gpu/mod.rs into a safe Rust wrapper over the tfhe-cuda-backend bindings. I suppose that's feasible.
should it be done in this PR or a new one ?
I would do it in a new PR. We already have to work on the common compilation artifacts between both CUDA backends. I think this would be part of that. I added it to the issue we have about that: https://github.com/zama-ai/tfhe-rs-internal/issues/1325
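The cache-after-first-call behavior in the doc comment above can be sketched with `std::sync::OnceLock`; `query_device_count` is a made-up stub standing in for the real CUDA driver query:

```rust
use std::sync::OnceLock;

// Cached device count; GPU count cannot change during execution.
static NUM_GPUS: OnceLock<u32> = OnceLock::new();

// Hypothetical stand-in for the actual CUDA runtime query
// (e.g. a cudaGetDeviceCount binding).
fn query_device_count() -> u32 {
    1
}

// The closure runs at most once; later calls return the cached value.
fn get_num_gpus() -> u32 {
    *NUM_GPUS.get_or_init(query_device_count)
}

fn main() {
    println!("gpus: {}", get_num_gpus());
}
```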
tfhe-zk-pok/src/gpu/pke.rs
some comments:
- // FIXME: div_round -> anything to do for it ? remove it ?
- is there a reference for the proof - a link to some PDF ? CPU doesn't have one either..
if there is a publicly accessible reference we should add, if possible, comments in the code such as: see "Fancy ZK paper" by Joye et al. 2007, algorithm A. If there isn't a publicly available paper then nothing to be done, we'll handle docs internally.
Not sure if we have a public reference for that. @nsarlin-zama do you?
why do we use the run_in_pool mechanism ? CPU uses it, is it the best for us ?
Possibly not. As before, this integration is done with the minimum code change to verify_impl and prove_impl. I plan to approach this kind of thing in a future PR.
nsarlin-zama
left a comment
@nsarlin-zama partially reviewed 15 files and made 4 comments.
Reviewable status: 12 of 26 files reviewed, 13 unresolved discussions (waiting on agnesLeroy, andrei-stoian-zama, pdroalves, soonum, SouchonTheo, and tmontaigu).
tfhe-zk-pok/src/gpu/pke_v2.rs line 75 at r5 (raw file):
#[allow(clippy::too_many_arguments)] fn prove_impl(
just checking, we went for the "duplicate" solution so that you could later have more freedom in how you optimize the proof for the gpu. Is it still in the plan?
And right now, is it just a copy paste of the cpu code with msm replaced by gpu msm?
tfhe-zk-pok/src/gpu/tests/prove_verify_stress.rs line 214 at r3 (raw file):
#[test] fn test_pke_v2_gpu_cpu_equivalence() { let params = crate::proofs::pke_v2::tests::PKEV2_TEST_PARAMS;
you should print the seed so we can replay if a bug happens
tfhe-zk-pok/src/gpu/pke.rs line 1 at r5 (raw file):
//! GPU-accelerated prove/verify for PKE v1.
I'm wondering if it's worth including the zkv1, we might deprecate it on the cpu at some point, it's not used in prod, will never be and has been removed from the NIST doc.
And it adds a lot of code, so it's not "free".
tfhe-benchmark/benches/integer/zk_pke.rs line 948 at r5 (raw file):
// batch size let elements = (rayon::current_num_threads() / num_block).max(1) + 1;
This might not be a good metric for the gpu ?
…dd PTX carry chains for fp_add/sub and branchless reduction: replace software carry detection (carry = (sum < old) ? 1 : 0) in fp_add_raw/fp_sub_raw with inline PTX add.cc.u64/addc.cc.u64 and sub.cc.u64/subc.cc.u64 chains; now we always compute both reduced and unreduced results and select via bitmask.
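For context, the software carry pattern that commit replaces corresponds to multi-limb addition like the following Rust sketch; the PTX version performs the same chain with hardware carry flags (add.cc.u64/addc.cc.u64) instead of deriving the carry from comparisons. `add_limbs` is an illustrative stand-in, not the kernel code:

```rust
// Hedged sketch of software carry propagation for 4-limb addition,
// the pattern the commit replaces with PTX carry-chain instructions.
fn add_limbs(a: &[u64; 4], b: &[u64; 4]) -> ([u64; 4], u64) {
    let mut out = [0u64; 4];
    let mut carry = 0u64;
    for i in 0..4 {
        let (s1, c1) = a[i].overflowing_add(b[i]);
        let (s2, c2) = s1.overflowing_add(carry);
        out[i] = s2;
        // Software carry detection: derived from overflow checks rather
        // than read from the hardware carry flag.
        carry = (c1 | c2) as u64;
    }
    (out, carry)
}

fn main() {
    let (sum, carry) = add_limbs(&[u64::MAX, 0, 0, 0], &[1, 0, 0, 0]);
    println!("{sum:?} carry={carry}");
}
```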
Callers always pass a concrete `u32` via `select_gpu_for_msm()`, so the `Option<u32>` indirection added no value. Simplifies the API per review feedback on PR #3275.
…enchmark Replace the CPU thread count heuristic (rayon::current_num_threads() / num_block) with gpu_zk_throughput_elements() scaled by GPU count, matching the approach already used by gpu_pke_zk_verify. Also use compute_load_config() to respect the __TFHE_RS_BENCH_OP_FLAVOR env var for fast mode.
Separate gpu_zk_throughput_elements into two functions: - gpu_zk_proof_throughput_elements: ~2x higher values since proof MSM uses ~0.5-1.8 MB/element (compute-bound, not memory-bound) - gpu_zk_verify_throughput_elements: unchanged values tuned for expansion OOM avoidance
pdroalves
left a comment
@pdroalves made 4 comments.
Reviewable status: 12 of 41 files reviewed, 13 unresolved discussions (waiting on agnesLeroy, andrei-stoian-zama, nsarlin-zama, soonum, SouchonTheo, and tmontaigu).
tfhe-benchmark/benches/integer/zk_pke.rs line 948 at r5 (raw file):
Previously, nsarlin-zama (Nicolas Sarlin) wrote…
This might not be a good metric for the gpu ?
That's a terrible metric. Our throughput benchmark was broken when I started this, and honestly I forgot to give it some love. I just changed it to use a GPU-specific function to calculate the number of elements. Since proof and verify have different memory footprints, I have a different method for each. I will run a few benchmarks on an 8xH100 instance to see what the optimal set of values is.
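The per-GPU sizing discussed here can be sketched as follows. The function names are hypothetical and the per-GPU constants are the figures from the later tuning commit (250 proof, 60 verify elements per GPU), not necessarily what shipped:

```rust
// Hedged sketch: size the throughput batch by GPU count rather than CPU
// thread count. Proof and verify get separate constants because their
// per-element memory footprints differ (verify is bounded by expansion
// memory, proof by the CPU/GPU pipeline).
const PER_GPU_PROOF_ELEMENTS: usize = 250;
const PER_GPU_VERIFY_ELEMENTS: usize = 60;

fn proof_throughput_elements(num_gpus: usize) -> usize {
    PER_GPU_PROOF_ELEMENTS * num_gpus.max(1)
}

fn verify_throughput_elements(num_gpus: usize) -> usize {
    PER_GPU_VERIFY_ELEMENTS * num_gpus.max(1)
}

fn main() {
    println!("proof batch on 8 GPUs: {}", proof_throughput_elements(8));
    println!("verify batch on 8 GPUs: {}", verify_throughput_elements(8));
}
```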
tfhe-zk-pok/src/gpu/tests/prove_verify_stress.rs line 214 at r3 (raw file):
Previously, nsarlin-zama (Nicolas Sarlin) wrote…
you should print the seed so we can replay if a bug happens
Good point. I added a print to test_pke_v1_gpu_cpu_equivalence and test_pke_v2_gpu_cpu_equivalence. I'm printing as hex, since this is what is done in the other zk tests.
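Printing the seed in hex before a randomized run, so a failure can be replayed, can be as simple as the sketch below (`log_seed` is a hypothetical helper, not the actual test code):

```rust
// Hedged sketch: log the RNG seed in hex at the start of a randomized
// test so a failing run can be reproduced deterministically.
fn log_seed(seed: u128) -> String {
    format!("seed: {seed:x}")
}

fn main() {
    let seed: u128 = 0xdead_beef;
    println!("{}", log_seed(seed));
}
```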
tfhe-zk-pok/src/gpu/pke.rs line 1 at r5 (raw file):
Previously, nsarlin-zama (Nicolas Sarlin) wrote…
I'm wondering if it's worth including the zkv1, we might deprecate it on the cpu at some point, it's not used in prod, will never be and has been removed from the NIST doc.
And it adds a lots of code so it's not "free"
If we are sure it's never going to be used, then yeah, we should remove it. It's adding more code to maintain and increasing CI costs. Can I remove it?
tfhe-zk-pok/src/gpu/pke_v2.rs line 75 at r5 (raw file):
Previously, nsarlin-zama (Nicolas Sarlin) wrote…
just checking, we went for the "duplicate" solution so that you could later have more freedom in how you optimize the proof for the gpu. Is it still in the plan?
And right now, is it just a copy paste of the cpu code with msm replaced by gpu msm?
Yes, that's the idea! For now this is just a line-by-line copy, only replacing the MSM as you noticed. I plan to start playing with those implementations in a follow-up PR once we merge this.
Also, as we agreed, I added the test to exhaustively compare CPU and GPU results, did you see it?
nsarlin-zama
left a comment
@nsarlin-zama made 2 comments and resolved 2 discussions.
Reviewable status: 9 of 42 files reviewed, 11 unresolved discussions (waiting on agnesLeroy, andrei-stoian-zama, IceTDrinker, pdroalves, soonum, SouchonTheo, and tmontaigu).
tfhe-zk-pok/src/gpu/pke.rs line 1 at r5 (raw file):
Previously, pdroalves (Pedro Alves) wrote…
If we are sure it's never going to be used, so yeah we should remove it. It's adding more code to maintain and increasing CI costs. Can I remove it?
cc @IceTDrinker
tfhe-zk-pok/src/gpu/pke_v2.rs line 75 at r5 (raw file):
Previously, pdroalves (Pedro Alves) wrote…
Yes, that's the idea! For now this is just a copy line by line, just replacing the MSM as you noticed. I plan to start playing with those implementation in a following PR when we merge this.
Also, as we agreed, I added the test to exhaustively compare CPU and GPU results, did you see it?
yes that's great!
IceTDrinker
left a comment
@IceTDrinker made 1 comment.
Reviewable status: 9 of 42 files reviewed, 11 unresolved discussions (waiting on agnesLeroy, andrei-stoian-zama, nsarlin-zama, pdroalves, soonum, SouchonTheo, and tmontaigu).
tfhe-zk-pok/src/gpu/pke.rs line 1 at r5 (raw file):
Previously, nsarlin-zama (Nicolas Sarlin) wrote…
cc @IceTDrinker
my opinion is you can remove the zkv1 code, we don't use it and as indicated by Nicolas it's not in the NIST submission anymore
if you are worried you will need it in the future : just keep a branch around with the code, this way it won't be completely lost
but yeah, I think you are safe removing it
Verify: 15 → 60 elem/GPU (+22% throughput, limited by expansion memory). Proof: 30 → 250 elem/GPU (5x improvement, limited by CPU/GPU pipeline).
pdroalves
left a comment
@pdroalves made 5 comments and resolved 1 discussion.
Reviewable status: 9 of 42 files reviewed, 10 unresolved discussions (waiting on agnesLeroy, andrei-stoian-zama, IceTDrinker, nsarlin-zama, soonum, SouchonTheo, and tmontaigu).
tfhe-zk-pok/src/gpu/pke.rs line 1 at r5 (raw file):
Previously, IceTDrinker wrote…
my opinion is you can remove the zkv1 code, we don't use it and as indicated by Nicolas it's not in the NIST submission anymore
if you are worried you will need it in the future : just keep a branch around with the code, this way it won't be completely lost
but yeah, I think you are safe removing it
Ok, removed.
v1 is deprecated and the GPU code path is dead weight. Only v2 GPU support remains. The v1 prove now unconditionally uses the CPU path.
Add missing paths that affect GPU ZK test results: tfhe-cuda-backend, core_crypto GPU primitives, shortint parameters, and CPU-side ZK types. Broaden integer/gpu/zk to integer/gpu to catch shared GPU modules.
Wire the zk-cuda-backend into tfhe-zk-pok and tfhe.
GPU acceleration module (`tfhe-zk-pok/src/gpu/`)
New `gpu` module gated behind the `gpu-experimental` feature flag, containing:
- Conversions between `tfhe-zk-pok` (arkworks) and `zk-cuda-backend` types: `G1Affine`, `G2Affine`, `Zp`/`Scalar`, `Fq`/`Fp`, `Fq2`/`Fp2`. Includes compile-time `size_of`/`align_of` assertions to guarantee transmute safety between wrapper types and inner arkworks types.
- MSM entry points (`g1_msm_gpu`, `g2_msm_gpu`): create a CUDA stream, dispatch MSM to the backend, and convert results from Montgomery form back to arkworks projective coordinates. Multi-GPU support via `select_gpu_for_msm()`, which distributes work across GPUs by rayon thread index.

GPU prove/verify for PKE v1 and v2
GPU variants of prove and verify are structured as submodules of each proof module, mirroring the CPU API:
- `pke::prove(...)` -> `pke::gpu::prove(...)`
- `pke::verify(...)` -> `pke::gpu::verify(...)`
- `pke_v2::prove(...)` -> `pke_v2::gpu::prove(...)`
- `pke_v2::verify(...)` -> `pke_v2::gpu::verify(...)`
- `proofs/pke/gpu.rs`: duplicates the v1 `prove_impl` logic, replacing every `multi_mul_scalar` with `g1_msm_gpu`/`g2_msm_gpu`. Verification delegates to the CPU verifier since v1 verify has no MSM calls.
- `proofs/pke_v2/gpu.rs`: duplicates the v2 `prove_impl` and `verify_impl`, replacing MSM sites. Also provides GPU variants of `pairing_check_two_steps` and `pairing_check_batched` for verification.

Visibility changes in `proofs/`
Several items that were previously `pub(super)` or private are now `pub(crate)` so the GPU submodules can access them:
- `proofs/mod.rs`: `OneBased` inner field, `assert_pke_proof_preconditions`, `decode_q`, `compute_r1`, `compute_r2`, `Sid::to_le_bytes`, `SidBytes`, `run_in_pool`, test utilities (`PkeTestParameters`, `PkeTestcase`, etc.)
- `proofs/pke/mod.rs`: `bit_iter`, `compute_a_theta`, `PublicCommit`/`PrivateCommit` fields, test constants
- `proofs/pke_v2/mod.rs`: `bit_iter`, `compute_a_theta`, `compute_crs_params`, `inf_norm_bound_to_euclidean_squared`, `ComputeLoadProofFields::to_le_bytes`, `GeneratedScalars`, `EvaluationPoints`, `PublicCommit`/`PrivateCommit` fields, test constants
- `proofs/pke_v2/hashes.rs`: `RHash`, `PhiHash`, `XiHash`, `YHash`, `THash`, `ThetaHash`, `OmegaHash`, `DeltaHash`, `ZHash` and their chained `gen_*` methods

Tests
- `gpu/tests/zk_cuda_backend.rs`: low-level integration tests for the GPU backend: MSM correctness for G1/G2, type round-trip conversions, multi-GPU dispatch.
- `gpu/tests/prove_verify_stress.rs`: exhaustive GPU-vs-CPU equivalence tests for both v1 and v2. Each test iterates over 3 CRS variants x 32 invalid-witness combinations x 2 compute loads (v2 also sweeps both pairing modes), asserting byte-identical serialized proofs and matching accept/reject decisions.

Benchmarks
- `tfhe-zk-pok/benches/pke_v1.rs` and `pke_v2.rs`: GPU benchmark groups added behind `gpu-experimental`, covering prove and verify for both protocol versions.
- `tfhe-benchmark/benches/zk/msm.rs`: standalone MSM benchmarks (CPU and GPU) for G1 and G2 at various sizes, useful for profiling the MSM kernel in isolation.
- `tfhe-benchmark/benches/integer/zk_pke.rs`: GPU throughput benchmark tuning: replaces hardcoded element counts with a `gpu_zk_throughput_elements` function sized per CRS/bit configuration to avoid OOM. Creates per-GPU server keys and streams for proper multi-GPU scaling.

Build system and CI
- `tfhe-zk-pok/Cargo.toml`: new `gpu-experimental` feature depending on `zk-cuda-backend` and `tfhe-cuda-backend`. `ark-ec`/`ark-ff` switched to workspace dependencies. Added `itertools` dependency.
- `tfhe/Cargo.toml`: new `gpu-experimental-zk` feature forwarding to `tfhe-zk-pok/gpu-experimental`.
- `tfhe-benchmark/Cargo.toml`: new `gpu-experimental-zk` feature; `zk-pok` now brings in `dep:tfhe-zk-pok` directly. New `zk-msm` bench target.
- `Makefile`: new targets `test_zk_pok_gpu`, `test_integer_zk_gpu`, `test_zk_cuda`, `bench_msm_zk`, `bench_msm_zk_gpu`, `bench_tfhe_zk_pok_gpu`. Existing GPU clippy/check/build targets updated to include `gpu-experimental-zk`. `bench_integer_zk_gpu` switched from the `release_lto_off` to the `release` profile. Minor trailing-comma fixes in `bench_integer_aes*_gpu`.
- `.github/workflows/`: `benchmark_gpu.yml` adds `tfhe_zk_pok` and `msm_zk` commands. `benchmark_cpu.yml` adds `msm_zk`. `gpu_zk_tests.yml` broadens file change triggers and runs `test_zk_pok_gpu` + `test_integer_zk_gpu`.
- `.gitignore`: ignores `backends/tfhe-cuda-backend/cuda/build/`.

New Makefile targets
- `test_zk_cuda_backend`: builds the zk-cuda-backend C++ tests via CMake and runs both the C++ and Rust test suites.
- `test_zk_pok_gpu`: runs `tfhe-zk-pok` tests with the `gpu-experimental` feature, filtering to GPU tests.
- `test_integer_zk_gpu`: runs tfhe integer ZK tests with expand accelerated on the GPU.
- `test_integer_zk_experimental_gpu`: same as above but accelerating the entire verify/proof circuit using `zk-cuda-backend`.
- `bench_msm_zk`: runs CPU MSM benchmarks for G1 and G2 at various sizes.
- `bench_msm_zk_gpu`: runs GPU MSM benchmarks for G1 and G2 at various sizes.
- `bench_tfhe_zk_pok_gpu`: runs `tfhe-zk-pok` prove/verify benchmarks with GPU acceleration.

closes: https://github.com/zama-ai/tfhe-rs-internal/issues/1336
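The compile-time transmute-safety checks mentioned in the module description can be sketched with const assertions; `Wrapper`/`Inner` below are illustrative stand-ins for the arkworks types and their zk-cuda-backend counterparts:

```rust
use core::mem::{align_of, size_of};

// Illustrative inner/wrapper pair; in the PR these are arkworks field or
// point types and their transparent GPU-side wrappers.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Inner([u64; 4]);

#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Debug)]
struct Wrapper(Inner);

// Compile-time layout assertions: if either fails, the crate does not
// build, so a transmute between the two types can never silently become
// unsound after a type change.
const _: () = assert!(size_of::<Wrapper>() == size_of::<Inner>());
const _: () = assert!(align_of::<Wrapper>() == align_of::<Inner>());

fn main() {
    let w = Wrapper(Inner([1, 2, 3, 4]));
    // Sound only because of #[repr(transparent)] plus the checks above.
    let i: Inner = unsafe { core::mem::transmute(w) };
    println!("{i:?}");
}
```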