feat(slime): support arbitrary agent payload shapes in the training backend #61
Merged
Collapses _sample_to_payload to return a shallow copy of
Sample.metadata. Previously it synthesized a hybrid payload shape
(sample.prompt -> payload["prompt"], sample.label -> payload["answer"],
sample.metadata nested under payload["metadata"], plus a fall-through
copy of Sample fields), which locked the slime backend into the math
agent's shape and forced other agents (appworld, migration,
officebench) into workarounds.
After this change, the JSONL row's metadata dict is the agent payload
exactly, so each agent declares whatever payload shape it wants by
choosing what keys to put in metadata. The JSONL top-level prompt
field still drives slime's tokenizer and length filter.
Breaking change for existing math JSONLs: rows using {prompt, label}
now produce an empty payload. Regenerate with the updated SETUP.md
data-prep snippet which emits {prompt, metadata: {prompt, answer}}.
Also drops --label-key from train.sh (nothing reads sample.label
under the new rule).
Verified end-to-end on Qwen2.5-3B-Instruct + GSM8K with NUM_ROLLOUT=10:
raw_reward climbed 0.27 -> 0.63, train/loss and grad_norm move as
expected, no rollout failures.
Plan: docs/roadmap/committed/slime-data-contract.md (committed on
docs/core-api-rename-roadmap in PR #59).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Seven tests collapse to three, one per distinct invariant:
- metadata-is-payload-verbatim (also covers that prompt/label are ignored)
- returned dict is a shallow copy (guards against _process_one_episode
mutation leaking back)
- missing/None/non-dict metadata defensively returns {}
The four former "empty / missing / None / non-dict" tests all exercised
the same fallback branch; merged into one parameterized assertion.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
lliquid
pushed a commit
that referenced
this pull request
May 5, 2026
Implemented in #61. Moves the plan out of committed/ (in-flight) into done/ (shipped), alongside the other completed plans. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
lyzustc
approved these changes
May 5, 2026
lliquid
added a commit
that referenced
this pull request
May 6, 2026
Implemented in #61. Moves the plan out of committed/ (in-flight) into done/ (shipped), alongside the other completed plans. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Goal
Let the slime training backend carry any agent payload shape from the JSONL dataset to the agent, so agents with different input schemas (GSM8K, AppWorld, migration, OfficeBench, future ones) can train through slime without bespoke integration changes.
Before this PR, `_sample_to_payload` hardcoded the GSM8K shape (`prompt` + `answer` top-level plus a nested `metadata` dict), which forced every other agent into workarounds. After this PR, the JSONL row's `metadata` dict is the agent payload verbatim — the data author owns the schema, the framework holds no opinion.
User-facing effect
Each agent declares its payload shape by choosing what keys go in `metadata`:
The top-level `prompt` field stays — it drives slime's tokenizer / length filter. The agent-visible payload comes entirely from `metadata`.
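As an illustration of the contract, two JSONL rows and the payload each would yield might look like the sketch below. The math row follows the `{prompt, metadata: {prompt, answer}}` shape stated in this PR. The second row is invented for illustration: `task_id` and `max_steps` are hypothetical keys, not the schema of any real agent. `agent_payload` is a hypothetical helper that mimics the rule, not a function in the codebase.

```python
import json

# Math row: shape taken from this PR, {prompt, metadata: {prompt, answer}}.
math_row = {
    "prompt": "What is 6 * 7?",
    "metadata": {"prompt": "What is 6 * 7?", "answer": "42"},
}

# Hypothetical row for a task-driven agent; the metadata keys are invented.
task_row = {
    "prompt": "Complete task t-001.",
    "metadata": {"task_id": "t-001", "max_steps": 20},
}


def agent_payload(jsonl_line: str) -> dict:
    """Return what the agent would see for one JSONL line: a copy of `metadata`."""
    row = json.loads(jsonl_line)
    metadata = row.get("metadata")
    return dict(metadata) if isinstance(metadata, dict) else {}


lines = [json.dumps(math_row), json.dumps(task_row)]
payloads = [agent_payload(line) for line in lines]
```

In this sketch the second agent never sees a `prompt` key, because its row does not put one in `metadata`; the top-level `prompt` is only there for the tokenizer and length filter.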
How
One-line replacement in `backends/slime/integration/rollout.py::_sample_to_payload`: return a shallow copy of `Sample.metadata`. All the old hardcoded-shape logic is deleted.
Plan: `docs/roadmap/committed/slime-data-contract.md` (on the PR #59 branch).
Breaking change
JSONLs in the old `{prompt, label}` shape produce empty agent payloads under the new rule. Regenerate with the updated SETUP.md data-prep snippet, which emits `{prompt, metadata: {prompt, answer}}`. The math agent's `rl_app.py` already reads `payload["prompt"]` / `payload["answer"]` and works without changes once the JSONL is regenerated.
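For existing files, an in-place conversion could be sketched as below. This is not the SETUP.md snippet, which is not reproduced on this page; `convert_row` and `convert_jsonl` are hypothetical helpers. It assumes each old row has `prompt` and `label` keys and that `label` holds the answer, matching the old `sample.label -> payload["answer"]` mapping.

```python
import json


def convert_row(old_row: dict) -> dict:
    """Map an old {prompt, label} row to {prompt, metadata: {prompt, answer}}."""
    prompt = old_row["prompt"]
    return {
        "prompt": prompt,  # still read by the tokenizer and length filter
        "metadata": {"prompt": prompt, "answer": old_row["label"]},
    }


def convert_jsonl(src_path: str, dst_path: str) -> int:
    """Convert every non-blank line of src_path; return the number of rows written."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            new_row = convert_row(json.loads(line))
            dst.write(json.dumps(new_row, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Regenerating from the source dataset with the documented snippet remains the path this PR recommends; a converter like this only helps when the original `{prompt, label}` file is all that is at hand.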
End-to-end smoke test
Ran `bash train.sh` with `NUM_ROLLOUT=10` on Qwen2.5-3B-Instruct + 64-row GSM8K (`slimerl/slime:latest` container, 8 × H100):
Reward climbs from 0.27 to 0.63 over 10 rollouts, so GRPO is learning under the new contract. All 10 train steps logged `train/loss` and `train/grad_norm`, and `train/step` advanced monotonically. There were no tracebacks during training and no FAILED session statuses: one transient ACR ThrottlingException was caught and retried, and the two atexit tracebacks from Ray teardown at job exit are cosmetic.
Test plan
🤖 Generated with Claude Code