
Proposed improvements: evolve-loop efficiency and attempt-budget waste #72

@ZhuangDingyi


Summary

Across recent runs, a meaningful share of token spend goes to attempts that produce little or no signal — multi-agent stalls counted as fails, hyperparameter tuning that consumes attempt slots, evolution that re-discovers via blind sampling what credit assignment could surface faster, and model-combination gains we don't currently search for properly.

This issue groups four related improvement directions. They share a theme (recover wasted budget and close capability gaps in the evolve loop) and compete for the same engineering time, so it's useful to consider them together.


1. Stalled attempts shouldn't consume the attempt budget

Problem. Multi-agent stalls are currently counted as failed attempts. When several agents stall in a window, we lose both the slots and the tokens already spent on them. This is the single largest source of wasted tokens we've observed.

Proposed directions.

  • Distinguish "stalled / infra-killed" from "ran and failed" at the budget layer. Stalls should not draw from the same pool as real fails (a minimal sketch follows this list).
  • Detect stalls earlier — per-step heartbeat or token-rate floor — instead of waiting on the long idle threshold. By the time we hit that threshold, most of the tokens are already gone.
  • Re-queue or quarantine stall-recovered attempts instead of marking them dead.
  • Surface per-agent stall rate so pathological configs become visible.
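
To make the first two bullets concrete, here is a minimal Python sketch of separating stall accounting from the attempt pool and catching stalls with a per-step heartbeat. All names here (`AttemptOutcome`, `AttemptBudget`, `HeartbeatMonitor`, the 60 s gap) are hypothetical illustrations, not existing code in the repo.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
import time


class AttemptOutcome(Enum):
    SUCCEEDED = auto()
    FAILED = auto()    # ran to completion and failed on its own merits
    STALLED = auto()   # no heartbeat / token flow; killed by infra


@dataclass
class AttemptBudget:
    max_attempts: int
    used: int = 0
    stalled: int = 0   # tracked separately so pathological configs stay visible

    def record(self, outcome: AttemptOutcome) -> None:
        if outcome is AttemptOutcome.STALLED:
            self.stalled += 1    # re-queue or quarantine; does not consume a slot
        else:
            self.used += 1       # only attempts that actually ran draw from the pool

    def exhausted(self) -> bool:
        return self.used >= self.max_attempts


@dataclass
class HeartbeatMonitor:
    """Flag a stall from a per-step heartbeat instead of a long idle timeout."""
    max_gap_s: float = 60.0   # assumed threshold; would need tuning per workload
    last_beat: float = field(default_factory=time.monotonic)

    def beat(self) -> None:
        self.last_beat = time.monotonic()

    def stalled(self) -> bool:
        return time.monotonic() - self.last_beat > self.max_gap_s
```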

2. Local tuning should not draw from the attempt budget

Problem. Hyperparameter and config tuning is necessary work, but every tuning trial currently counts as a full attempt. The result is that we can hit the attempt cap before an agent has even ironed out obvious config issues.

Proposed directions.

  • Add an explicit tune mode at the grader / budget layer, decoupled from the attempt budget (see the sketch after this list).
  • Tune against a cheaper local target — smaller eval slice, dev split, or smoke harness — and only promote the final tuned config to a real attempt.
  • Make the mode boundary explicit in the agent contract: agents should know whether they're in tune or submit mode and behave differently (e.g. skip expensive ensembles during tune).
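
A rough sketch of what the mode boundary could look like at the budget layer. `RunMode`, `agent.evaluate`, and `budget.record_attempt` are illustrative names assumed for this example, not our current API.

```python
from enum import Enum


class RunMode(Enum):
    TUNE = "tune"      # cheap local target; does not draw from the attempt budget
    SUBMIT = "submit"  # full eval; counts as a real attempt


def run_trial(agent, config, mode: RunMode, budget):
    """Hypothetical dispatch showing the tune/submit boundary in the agent contract."""
    if mode is RunMode.TUNE:
        # Cheaper local target: smaller eval slice, no expensive ensembles.
        score = agent.evaluate(config, split="dev_slice", ensembles=False)
        return score                 # budget untouched
    # Submit mode: full evaluation, and only this path consumes an attempt slot.
    score = agent.evaluate(config, split="full", ensembles=True)
    budget.record_attempt()
    return score
```

The tuned config that wins in TUNE mode is then promoted to a single SUBMIT attempt, so only one slot is spent per tuning round.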

3. Add gradient-style meta-evolve

Problem. The current meta loop is population-based search with no gradient signal at the meta level. We spend many attempts re-discovering, via random sampling, things that credit assignment over operators / mutations would surface much earlier. Over a full run this compounds into a large token cost.

Proposed directions.

  • Track per-operator and per-mutation score deltas and use them to bias future operator sampling (sketched below).
  • Treat the meta-config (mutation weights, exploration/exploit balance, etc.) as something we evolve with gradient signal rather than sample blindly.
  • Roll out behind a flag and A/B against the current evolve loop on a fixed budget.
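
A minimal sketch of the first bullet, using an exponentially weighted per-operator lift as a bandit-style stand-in for meta-level gradient signal. `OperatorBandit` and its parameters are hypothetical, chosen only to show the shape of the idea.

```python
import math
import random
from collections import defaultdict


class OperatorBandit:
    """Credit assignment over mutation operators: bias sampling toward
    operators whose recent score deltas have been positive."""

    def __init__(self, operators, decay=0.9, temperature=0.5):
        self.operators = list(operators)
        self.decay = decay                 # how quickly old evidence fades
        self.temperature = temperature     # lower = greedier operator choice
        self.lift = defaultdict(float)     # EWMA of per-operator score delta

    def update(self, operator, score_delta):
        # Blend the new delta into the running estimate for this operator.
        self.lift[operator] = (
            self.decay * self.lift[operator] + (1 - self.decay) * score_delta
        )

    def sample(self):
        # Softmax over estimated lift; uninformed operators still get sampled.
        weights = [math.exp(self.lift[op] / self.temperature) for op in self.operators]
        return random.choices(self.operators, weights=weights, k=1)[0]
```

This is the "per-operator lift" end of the open question below; fuller credit assignment would attribute deltas across the whole mutation chain rather than to the last operator applied.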

Open questions.

  • Cheapest gradient proxy — per-operator lift vs. fuller credit assignment.
  • How to handle the discrete / non-differentiable parts of the operator space.

4. Layered evolution for model combination

Problem. Model combination is a real source of score gains, but evolving it inside the same flat loop as base models doesn't work — the search space blows up and combo-search has a very different cost/value profile from base-search. Combo gains are mostly unrealized today.

Proposed directions.

  • Two-stage / hierarchical evolution: evolve base models first, then run a separate evolution stage on combination strategies over a frozen or curated pool of bases (sketched after this list).
  • Give combination its own population, budget, and selection criteria — don't mix it into the per-model attempt budget.
  • Acknowledged as a larger architectural change; tracking as a longer-horizon direction rather than a quick fix.
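
For concreteness, a sketch of the two-stage shape. `evolve`, the pool size, and the strategy interface are placeholders assumed for illustration, not a design commitment.

```python
def layered_evolve(base_population, evolve, combine_strategies, combo_budget):
    """Two-stage sketch: evolve bases first, then evolve combination
    strategies over a frozen pool with their own budget and selection."""
    # Stage 1: evolve base models under the normal per-model budget.
    bases = evolve(base_population)

    # Freeze / curate a pool of the strongest bases for the combination stage.
    pool = sorted(bases, key=lambda m: m.score, reverse=True)[:8]

    # Stage 2: a separate evolution over combination strategies only,
    # drawing on its own budget rather than the per-model attempt budget.
    combos = evolve(
        [strategy(pool) for strategy in combine_strategies],
        budget=combo_budget,
    )
    return max(combos, key=lambda c: c.score)
```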

Suggested sequencing

A reasonable order, but open to discussion:

  1. (1) Stall accounting and (2) tune-mode separation — pure budget-recovery wins, smallest scope, biggest near-term token savings.
  2. (3) Gradient meta-evolve — next-biggest leverage; unlocks more efficient search per attempt.
  3. (4) Layered combination evolution — largest change, best taken on once 1–3 are stable.

Note on evidence

I'm holding back specific run logs / IDs / score numbers from this issue on purpose; happy to share those privately for triage. The directions above are the part that's safe to discuss in the open.
