
feat(slime): support arbitrary agent payload shapes in the training backend#61

Merged
lliquid merged 2 commits into main from feat/slime-data-contract on May 5, 2026

@lliquid commented May 5, 2026

Goal

Let the slime training backend carry any agent payload shape from the JSONL dataset to the agent, so agents with different input schemas (GSM8K, AppWorld, migration, OfficeBench, future ones) can train through slime without bespoke integration changes.

Before this PR, `_sample_to_payload` hardcoded the GSM8K shape (`prompt` + `answer` top-level plus a nested `metadata` dict), which forced every other agent into workarounds. After this PR, the JSONL row's `metadata` dict is the agent payload verbatim — the data author owns the schema, the framework holds no opinion.

User-facing effect

Each agent declares its payload shape by choosing what keys go in `metadata`:

| Agent | JSONL row shape | Agent reads |
| --- | --- | --- |
| Math (GSM8K) | `{prompt, metadata: {prompt, answer}}` | `payload["prompt"]`, `payload["answer"]` |
| AppWorld | `{prompt, metadata: {task_id}}` | `payload["task_id"]` |
| Migration | `{prompt, metadata: {repo_uri, metadata_uri, ...}}` | `InvocationRequest(**payload)` |
| OfficeBench | `{prompt, metadata: {task_uri, testbed_uri}}` | `InvocationRequest(**payload)` |

The top-level `prompt` field stays — it drives slime's tokenizer / length filter. The agent-visible payload comes entirely from `metadata`.
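
As an illustration of the table above, here is a minimal sketch of how three of the row shapes could turn into agent payloads. The row values, the URIs, and the `InvocationRequest` dataclass are hypothetical stand-ins; the real request type and the agents' code are not shown in this PR.

```python
import json
from dataclasses import dataclass


# Hypothetical stand-in for the OfficeBench request type; the real
# InvocationRequest is not shown in this PR.
@dataclass
class InvocationRequest:
    task_uri: str
    testbed_uri: str


# One made-up JSONL row per agent, following the shapes in the table.
rows = [
    '{"prompt": "What is 2 + 3?", "metadata": {"prompt": "What is 2 + 3?", "answer": "5"}}',
    '{"prompt": "Complete the task.", "metadata": {"task_id": "task-001"}}',
    '{"prompt": "Edit the file.", "metadata": {"task_uri": "s3://example/task", "testbed_uri": "s3://example/testbed"}}',
]

# Under the new contract the agent payload is exactly the row's metadata dict.
payloads = [json.loads(line)["metadata"] for line in rows]
math_payload, appworld_payload, officebench_payload = payloads

# Each agent reads only the keys it declared.
question, answer = math_payload["prompt"], math_payload["answer"]
task_id = appworld_payload["task_id"]
request = InvocationRequest(**officebench_payload)
```

Because the payload is unpacked with `**payload`, any extra or missing key in `metadata` would surface as a `TypeError` at construction time in this sketch, which is one reason the data author needs to own the schema.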

How

One-line replacement in `backends/slime/integration/rollout.py::_sample_to_payload`: return a shallow copy of `Sample.metadata`. All the old hardcoded-shape logic is deleted.

Plan: `docs/roadmap/committed/slime-data-contract.md` (on the PR #59 branch).

Breaking change

JSONLs in the old `{prompt, label}` shape produce empty agent payloads under the new rule. Regenerate with the updated SETUP.md data-prep snippet, which emits `{prompt, metadata: {prompt, answer}}`. The math agent's `rl_app.py` already reads `payload["prompt"]` / `payload["answer"]` and works without changes once the JSONL is regenerated.
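
For an existing file, a conversion along these lines might be an alternative to regenerating from scratch. This is a hypothetical sketch, not the SETUP.md snippet: `convert_row` and `convert_jsonl` are illustrative names, and it assumes each old row stores the answer under `label`.

```python
import json


def convert_row(old_row: dict) -> dict:
    """Map an old {prompt, label} row to {prompt, metadata: {prompt, answer}}.

    Assumes the old row keeps the reference answer under "label".
    """
    prompt = old_row["prompt"]
    return {
        "prompt": prompt,  # still drives slime's tokenizer / length filter
        "metadata": {"prompt": prompt, "answer": old_row["label"]},
    }


def convert_jsonl(src_path: str, dst_path: str) -> int:
    """Convert a JSONL file row by row; returns the number of rows written."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            dst.write(json.dumps(convert_row(json.loads(line))) + "\n")
            count += 1
    return count
```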

End-to-end smoke test

Ran `bash train.sh` with `NUM_ROLLOUT=10` on Qwen2.5-3B-Instruct + 64-row GSM8K (`slimerl/slime:latest` container, 8 × H100):

| Rollout | raw_reward |
| --- | --- |
| 0 | 0.271 |
| 1 | 0.308 |
| 2 | 0.346 |
| 3 | 0.540 |
| 4 | 0.441 |
| 5 | 0.633 |
| 6 | 0.552 |
| 7 | 0.597 |
| 8 | 0.458 |
| 9 | 0.556 |

Reward climbs from 0.27 at rollout 0 to a peak of 0.63 at rollout 5 and ends at 0.56 at rollout 9. The trend is noisy but upward, which indicates GRPO is learning under the new contract. All 10 train steps logged `train/loss` and `train/grad_norm`, and `train/step` progressed monotonically. There were no Tracebacks during training and no FAILED session statuses. One transient ACR ThrottlingException was caught and retried; two atexit tracebacks from Ray teardown at job exit are cosmetic.

Test plan

  • Unit tests (`tests/test_slime_rollout_payload.py`) pin the contract: metadata-in-is-payload-out, empty/missing/None/non-dict metadata → `{}`, returned dict is a shallow copy, `sample.prompt` / `sample.label` are ignored.
  • Existing test suite passes (79 tests).
  • End-to-end smoke test (10-rollout training on Qwen2.5-3B, rewards climb, loss moves).

🤖 Generated with Claude Code

Collapses _sample_to_payload to return a shallow copy of
Sample.metadata. Previously it synthesized a hybrid payload shape
(sample.prompt -> payload["prompt"], sample.label -> payload["answer"],
sample.metadata nested under payload["metadata"], plus a fall-through
copy of Sample fields), which locked the slime backend into the math
agent's shape and forced other agents (appworld, migration,
officebench) into workarounds.

After this change, the JSONL row's metadata dict is the agent payload
exactly, so each agent declares whatever payload shape it wants by
choosing what keys to put in metadata. The JSONL top-level prompt
field still drives slime's tokenizer and length filter.

Breaking change for existing math JSONLs: rows using {prompt, label}
now produce an empty payload. Regenerate with the updated SETUP.md
data-prep snippet which emits {prompt, metadata: {prompt, answer}}.

Also drops --label-key from train.sh (nothing reads sample.label
under the new rule).

Verified end-to-end on Qwen2.5-3B-Instruct + GSM8K with NUM_ROLLOUT=10:
raw_reward climbed 0.27 -> 0.63, train/loss and grad_norm move as
expected, no rollout failures.

Plan: docs/roadmap/committed/slime-data-contract.md (committed on
docs/core-api-rename-roadmap in PR #59).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@lliquid changed the title from "feat(slime): make Sample.metadata the agent payload verbatim" to "feat(slime): support arbitrary agent payload shapes in the training backend" on May 5, 2026
Seven tests collapse to three, one per distinct invariant:
- metadata-is-payload-verbatim (also covers that prompt/label are ignored)
- returned dict is a shallow copy (guards against _process_one_episode
  mutation leaking back)
- missing/None/non-dict metadata defensively returns {}

The four former "empty / missing / None / non-dict" tests all exercised
the same fallback branch; merged into one parameterized assertion.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
lliquid pushed a commit that referenced this pull request May 5, 2026
Implemented in #61. Moves the plan out of committed/ (in-flight) into
done/ (shipped), alongside the other completed plans.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@lliquid merged commit 051d7df into main on May 5, 2026
5 checks passed
@lliquid deleted the feat/slime-data-contract branch on May 5, 2026 21:51
lliquid added a commit that referenced this pull request May 6, 2026
Implemented in #61. Moves the plan out of committed/ (in-flight) into
done/ (shipped), alongside the other completed plans.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>