
Recurring Bento Proof Failures Causing Collateral Lockup #1698

@wep-v

Description

Boundless Version: v1.2.2 (commit 70f1ef27)
Date: February 27, 2026
Cluster: 24 GPUs across 3 servers, broker on dedicated node
Redis: valkey, noeviction policy, no maxmemory set


Summary

Four orders failed to prove today due to Bento errors BENTO-WF-115 and BENTO-WF-117, both manifesting as "io error: unexpected end of file" when reading segment data from Redis. Each failure caused the broker to abandon a locked order, leaving 100 ZKC of collateral locked on-chain per order until expiry. With four consecutive failures across a ~4 hour window, ~400 ZKC of collateral became inaccessible. This eventually prevented the broker from locking new orders entirely, because its available collateral dropped below the 100 ZKC required per lock.


Error Occurrences

Incident 1 — 17:49 UTC
Order 69265, proof_id 22251d66-2b9d-4579-bdee-71a07e724168

[BENTO-WF-117] POVW join failed: io error: unexpected end of file

Incident 2 — 19:23 UTC
Order 69274, proof_id ede0ab07-7ade-4b68-9de0-b5b902a0a6c0

[BENTO-WF-115] Prove failed: Failed to deserialize segment data from redis: io error: unexpected end of file

Incident 3 — 21:39 UTC
Order 69287, proof_id 9b093b57-9c50-45a6-9e63-ad7de33646b3

[BENTO-WF-115] Prove failed: Failed to deserialize segment data from redis: io error: unexpected end of file

Incident 4 — 21:41 UTC
Order 928b, proof_id 14d9412e-373c-46a2-a46f-36c504429daa

[BENTO-WF-117] POVW join failed: io error: unexpected end of file
[BENTO-WF-115] Prove failed: Failed to deserialize segment data from redis: io error: unexpected end of file

Redis State at Time of Report

used_memory_human: 2.23G (current)
used_memory_peak_human: 14.19G (peak today)
maxmemory: 0 (no limit set)
maxmemory_policy: noeviction

Redis peaked at 14.19GB during the proving session. The EOF errors may be related to memory pressure during peak load causing truncated writes to Redis, though the exact trigger is unclear. No evictions occurred (noeviction policy confirmed).
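To illustrate the suspected failure mode: if segment payloads are stored with some length prefix or framing (an assumption — Bento's actual wire format is not documented here), a write truncated under memory pressure leaves fewer bytes in Redis than the framing promises, and the reader surfaces that as an EOF during deserialization. A minimal Python sketch, with a hypothetical length-prefixed format:

```python
import io
import struct

def read_segment(blob: bytes) -> bytes:
    """Deserialize a hypothetical length-prefixed segment payload.

    Illustration only: Bento's real serialization format may differ.
    A truncated write leaves fewer bytes than the header declares,
    which the reader can only report as an unexpected end of file.
    """
    buf = io.BytesIO(blob)
    header = buf.read(8)
    if len(header) < 8:
        # Even the length header is missing or cut short
        raise EOFError("unexpected end of file")
    (length,) = struct.unpack("<Q", header)
    body = buf.read(length)
    if len(body) < length:
        # Payload shorter than declared: the symptom in BENTO-WF-115/117
        raise EOFError("unexpected end of file")
    return body
```

The point of the sketch: once the blob in Redis is short, every re-read of the same key reproduces the identical EOF, which matters for the retry question below.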

Downstream Impact

When a proof fails, the broker abandons the order, which is the current expected behavior. However, the locked collateral (100 ZKC per order) remains held on-chain until the lock window expires (~3.75 hours). With multiple consecutive failures, locked-but-abandoned collateral accumulated to the point where the broker could not lock new orders:

WARN broker::order_monitor: No longer have enough collateral deposited to market
to lock order 928c, skipping [need: 100.000000000000000000 ZKC,
have: 76.848500000000000000 ZKC]

This caused a complete proving outage until locks expired and collateral was released.
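The arithmetic behind the outage, as a sketch (the real accounting lives in broker::order_monitor and may differ; the total deposit below is a hypothetical figure implied by the WARN line, not taken from config):

```python
ZKC_PER_LOCK = 100.0        # collateral locked per order (from this report)
deposit = 476.8485          # hypothetical total deposit implied by the WARN line
abandoned_locks = 4         # orders 69265, 69274, 69287, 928b

locked = abandoned_locks * ZKC_PER_LOCK   # 400 ZKC held until lock expiry
available = deposit - locked              # ~76.8485 ZKC, matching the log

# Broker cannot lock the next order: available < ZKC_PER_LOCK
assert available < ZKC_PER_LOCK
```

Under these numbers, every additional abandoned lock subtracts a full 100 ZKC from usable balance for ~3.75 hours, so four failures in one window are enough to stall all new locking.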

Requested Fixes / Questions

  1. BENTO-WF-115 / BENTO-WF-117: What causes unexpected end of file when deserializing segment data from Redis? Is this a known race condition or memory pressure issue? Is there a fix or workaround?

  2. Broker collateral release: When a proof fails and the broker abandons an order, is there a mechanism to slash/reclaim the locked collateral early, or does it always wait for on-chain lock expiry? If it must wait, could the broker factor current locked-but-abandoned collateral into its available balance calculation to avoid over-committing?

  3. Retry behavior: Currently max_retries = 1 for proof monitoring. Would increasing retries help in the case of transient Redis errors, or does the segment data need to be re-generated from scratch?
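On question 3, a hedged sketch of why raising max_retries alone may not help: if the blob stored in Redis is itself truncated, a re-read returns the same bad bytes, and only regenerating the segment replaces the data. All names here (fetch, regenerate, deserialize) are hypothetical stand-ins, not broker APIs:

```python
def prove_with_retry(fetch, regenerate, deserialize, max_retries=1):
    """Retry the Redis read a bounded number of times, then fall back
    to regenerating the segment. Illustrative only, not broker code."""
    blob = fetch()
    for _ in range(max_retries + 1):
        try:
            return deserialize(blob)
        except EOFError:
            # A re-read returns the same truncated bytes if the *write*
            # was truncated; retries only help for transient read errors.
            blob = fetch()
    # Last resort: regenerate the segment data from scratch
    return deserialize(regenerate())
```

If the root cause is a truncated write rather than a flaky read, this suggests the useful knob is regeneration (or write verification), not a higher retry count.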

Environment Notes

  • noeviction Redis policy confirmed (previously allkeys-lru, changed prior to these incidents)
  • Redis peaked at 14.19GB during 24-GPU full load
  • Orders affected were ~40-42K mcycles each
  • Cluster was upgraded to v1.2.2 earlier today from v1.2.1
