
compress_and_verify operation fails a lot #1648

@beiuhori07

Description

GPUs: RTX 5080.

At the end of a proof, the compress_and_verify operation fails many times on my machine. Sometimes it succeeds on the first try, sometimes on the 20th, and sometimes only on the 100th. I could not find any pattern in when this happens or in which kinds of orders trigger it.

One valuable insight I did find: whenever I start my prover, the first order ALWAYS succeeds on the first try, with no failures. As the prover keeps running, the failure rate tends to increase noticeably.
If I stop it and start it again, the first proof again succeeds on the first try. Always.

This led me to try a very crude work-around: a script that restarts a random gpu-agent container every X minutes. Two months and thousands of delivered orders later, this work-around seems to be working well. Whenever an agent is assigned the compress_and_verify operation and was restarted during the current proof, it ALWAYS succeeds on the first try. If the agent was last restarted during the previous proof, the failure rate increases (a first-try success is still possible), seemingly linearly with the age of the agent.
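
For anyone who wants to reproduce the work-around, here is a minimal sketch of what such a restart script could look like in Python. It is not the exact script from my setup: the container name filter `gpu-agent`, the 30-minute interval standing in for "X", and the helper names `list_agents`, `restart_random_agent` and `run_forever` are all assumptions for illustration. It only relies on the standard `docker ps` and `docker restart` commands.

```python
import random
import subprocess
import time

# Assumption: the GPU agent containers have "gpu-agent" in their name.
NAME_FILTER = "gpu-agent"
# Assumption: stands in for the unspecified "X minutes".
INTERVAL_MINUTES = 30


def list_agents(run=subprocess.run):
    """Return the names of running containers matching NAME_FILTER."""
    result = run(
        ["docker", "ps", "--filter", f"name={NAME_FILTER}",
         "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def restart_random_agent(run=subprocess.run, choose=random.choice):
    """Restart one randomly chosen agent; return its name, or None if none run."""
    agents = list_agents(run)
    if not agents:
        return None
    target = choose(agents)
    run(["docker", "restart", target], check=True)
    return target


def run_forever():
    """Restart one random agent every INTERVAL_MINUTES, indefinitely."""
    while True:
        restarted = restart_random_agent()
        print(f"restarted: {restarted}")
        time.sleep(INTERVAL_MINUTES * 60)
```

Calling `run_forever()` starts the loop. The `run` and `choose` parameters exist only so the logic can be exercised without Docker; restarting one agent at a time keeps the other agents available while the restarted one comes back up.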

So there have to be some insights for you in this. Whenever the agent gets its memory cleared, it works like magic, so my guess is that some invalid data or a memory leak builds up over time, but I don't know.

I hope my explanation is clear. If you need more info from my end for a fix, let me know.
