
OOM errors for 16gb GPUs #1646

@beiuhori07

My machine often runs into out-of-memory (OOM) errors when I believe it really shouldn't.

Here is some of the data I grabbed today:

GPUs: 5x 5080 with 16303MiB (many more soon :))
Segment Size: 19

0x371479ca8a23b662f5b08d137bb21a4850861aca4a465bd3 - segment size 19 / 6.4B cycles -> failed with OOM (only after 30min-ish) -> input size 51.3MB
0x371479ca8a23b662f5b08d137bb21a4850861aca32885cd3 - segment size 19 / 3.1B cycles -> failed with OOM (a few mins in) -> input size 20.8MB

gpu_prove_agent2-1 |
gpu_prove_agent2-1 | thread 'main' panicked at /home/ubuntu/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/risc0-zkp-3.0.3/src/hal/cuda.rs:170:13:
gpu_prove_agent2-1 | Failure during hash_rows: cudaGetLastError()@kernels/zkp/cuda/supra/api.cu:47 failed: "out of memory"
gpu_prove_agent2-1 | stack backtrace:
gpu_prove_agent2-1 | 0: __rustc::rust_begin_unwind
gpu_prove_agent2-1 | at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panicking.rs:697:5
gpu_prove_agent2-1 | 1: core::panicking::panic_fmt
gpu_prove_agent2-1 | at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/panicking.rs:75:14
gpu_prove_agent2-1 | 2: <risc0_zkp::hal::cuda::CudaHashPoseidon2 as risc0_zkp::hal::cuda::CudaHash>::hash_rows
gpu_prove_agent2-1 | at /home/ubuntu/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/risc0-zkp-3.0.3/src/hal/cuda.rs:170:13
gpu_prove_agent2-1 | 3: <risc0_zkp::hal::cuda::CudaHal as risc0_zkp::hal::Hal>::hash_rows
gpu_prove_agent2-1 | at /home/ubuntu/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/risc0-zkp-3.0.3/src/hal/cuda.rs:971:37
gpu_prove_agent2-1 | 4: risc0_zkp::prove::merkle::MerkleTreeProver::new
gpu_prove_agent2-1 | at /home/ubuntu/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/risc0-zkp-3.0.3/src/prove/merkle.rs:66:13
gpu_prove_agent2-1 | 5: risc0_zkp::prove::poly_group::PolyGroup::new
gpu_prove_agent2-1 | at /home/ubuntu/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/risc0-zkp-3.0.3/src/prove/poly_group.rs:76:22
gpu_prove_agent2-1 | 6: risc0_zkp::prove::prover::Prover::finalize
gpu_prove_agent2-1 | at /home/ubuntu/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/risc0-zkp-3.0.3/src/prove/prover.rs:178:27
gpu_prove_agent2-1 | 7: <risc0_circuit_keccak::prove::KeccakProverImpl<H,C> as risc0_circuit_keccak::prove::KeccakProver>::prove
gpu_prove_agent2-1 | at /home/ubuntu/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/risc0-circuit-keccak-4.0.3/src/prove/mod.rs:179:27
gpu_prove_agent2-1 | 8: risc0_zkvm::host::server::prove::keccak::prove_keccak
gpu_prove_agent2-1 | at /home/ubuntu/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/risc0-zkvm-3.0.4/src/host/server/prove/keccak.rs:36:27
gpu_prove_agent2-1 | 9: <risc0_zkvm::host::server::prove::prover_impl::ProverImpl as risc0_zkvm::host::server::prove::ProverServer>::prove_keccak
gpu_prove_agent2-1 | at /home/ubuntu/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/risc0-zkvm-3.0.4/src/host/server/prove/prover_impl.rs:363:9
gpu_prove_agent2-1 | 10: workflow::tasks::keccak::keccak::{{closure}}
gpu_prove_agent2-1 | at /opt/actions-runner/_work/boundless/boundless/bento/crates/workflow/src/tasks/keccak.rs:67:10
gpu_prove_agent2-1 | 11: workflow::Agent::process_work::{{closure}}
gpu_prove_agent2-1 | at /opt/actions-runner/_work/boundless/boundless/bento/crates/workflow/src/lib.rs:488:22
gpu_prove_agent2-1 | 12: workflow::Agent::poll_work::{{closure}}
gpu_prove_agent2-1 | at /opt/actions-runner/_work/boundless/boundless/bento/crates/workflow/src/lib.rs:356:56
gpu_prove_agent2-1 | 13: agent::main::{{closure}}
gpu_prove_agent2-1 | at /opt/actions-runner/_work/boundless/boundless/bento/crates/workflow/src/bin/agent.rs:33:23
gpu_prove_agent2-1 | 14: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
gpu_prove_agent2-1 | at /home/ubuntu/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/park.rs:285:71
gpu_prove_agent2-1 | 15: tokio::task::coop::with_budget
gpu_prove_agent2-1 | at /home/ubuntu/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/task/coop/mod.rs:167:5
gpu_prove_agent2-1 | 16: tokio::task::coop::budget
gpu_prove_agent2-1 | at /home/ubuntu/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/task/coop/mod.rs:133:5
gpu_prove_agent2-1 | 17: tokio::runtime::park::CachedParkThread::block_on
gpu_prove_agent2-1 | at /home/ubuntu/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/park.rs:285:31
gpu_prove_agent2-1 | 18: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
gpu_prove_agent2-1 | at /home/ubuntu/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/context/blocking.rs:66:14
gpu_prove_agent2-1 | 19: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}

clean db

0x7276fa3c23c1783bdcb97258717f8f32b75919c9979bcfc5 - segment size 19 / 177M cycles - 26kb input - proved successfully
0x7276fa3c23c1783bdcb97258717f8f32b75919c9f089d31c - segment size 19 / 361M cycles - 91kb input - proved successfully

-> running boundless rewards prepare-mining -> with segment size env var 19 -> does work on just these two proofs above

0x7276fa3c23c1783bdcb97258717f8f32b75919c9b57cbdf4 - segment size 19 / 1.5B cycles - 9.4MB input - hits OOM error
0x7276fa3c23c1783bdcb97258717f8f32b75919c9f1c05477 - segment size 19 / 1.2B cycles - 7.8MB input - hits OOM error
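
To put the cycle counts above in perspective, the sketch below estimates how many segments each job splits into. It assumes that "segment size 19" means a power-of-two limit of 2^19 cycles per segment (the usual RISC Zero po2 convention) and ignores any per-segment overhead. The cycle counts are the rounded figures quoted in this report, and `segment_count` is a hypothetical helper, not part of the boundless or risc0 APIs.

```python
SEGMENT_PO2 = 19  # segment size used for every run in this report

# (total cycles as quoted in the report, outcome)
RUNS = [
    (177_000_000, "proved"),
    (361_000_000, "proved"),
    (1_200_000_000, "OOM"),
    (1_500_000_000, "OOM"),
    (3_100_000_000, "OOM"),
    (6_400_000_000, "OOM"),
]


def segment_count(total_cycles: int, po2: int) -> int:
    """Rough number of segments, assuming 2**po2 cycles fit in each one."""
    cycles_per_segment = 1 << po2
    # Ceiling division using integers only.
    return -(-total_cycles // cycles_per_segment)


for cycles, outcome in RUNS:
    print(f"{cycles:>13,} cycles -> ~{segment_count(cycles, SEGMENT_PO2):>6} segments ({outcome})")
```

Under that assumption each segment stays the same size and only the number of segments grows with the job, so the fact that the small jobs pass and the large ones fail may be a useful data point when narrowing this down.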

clean db

-> running boundless rewards prepare-mining -> with segment size env var 19 -> on 2 orders + a few hours of mining -> works! (this used to break as well, but I could not reproduce it quickly this time)
-> after that -> boundless rewards submit-mining -> runs into OOM error (the state file does include previous proofs done on segment size 20, if that matters)

same with claim

  • until now
    • worked around these by running the rewards commands with the --use-default-prover option, but it takes a lot of time :\
    • ran the boundless orders that also run successfully on segment size 20 and therefore reach better performance.

No other special logs appear in the proximity of the error.

If more data is needed or you need me to test some other edge cases, I'm happy to help.
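
One kind of extra data that could be collected is per-GPU memory usage in the run-up to the panic. The following is a minimal sketch, assuming `nvidia-smi` is available on the host; `parse_gpu_memory` and `read_gpu_memory` are hypothetical helper names, and nothing here is part of bento or boundless.

```python
import subprocess

# Standard nvidia-smi query options: one CSV line per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int, int]]:
    """Turn 'index, used, total' lines into (index, used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def read_gpu_memory() -> list[tuple[int, int, int]]:
    """Query nvidia-smi once and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `read_gpu_memory()` every few seconds while one of the failing orders is being proved, and logging the result with a timestamp, should show which GPU fills up and how quickly before the OOM.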
