feat(slime): support arbitrary agent payload shapes in the training backend by lliquid · Pull Request #61 · awslabs/agentcore-rl-toolkit

lliquid · 2026-05-05T20:08:29Z

Goal

Let the slime training backend carry any agent payload shape from the JSONL dataset to the agent, so agents with different input schemas (GSM8K, AppWorld, migration, OfficeBench, future ones) can train through slime without bespoke integration changes.

Before this PR, `_sample_to_payload` hardcoded the GSM8K shape (`prompt` + `answer` top-level plus a nested `metadata` dict), which forced every other agent into workarounds. After this PR, the JSONL row's `metadata` dict is the agent payload verbatim: the data author owns the schema and the framework holds no opinion.

User-facing effect

Each agent declares its payload shape by choosing what keys go in `metadata`:

| Agent | JSONL row shape | Agent reads |
| --- | --- | --- |
| Math (GSM8K) | `{prompt, metadata: {prompt, answer}}` | `payload["prompt"]`, `payload["answer"]` |
| AppWorld | `{prompt, metadata: {task_id}}` | `payload["task_id"]` |
| Migration | `{prompt, metadata: {repo_uri, metadata_uri, ...}}` | `InvocationRequest(**payload)` |
| OfficeBench | `{prompt, metadata: {task_uri, testbed_uri}}` | `InvocationRequest(**payload)` |

The top-level `prompt` field stays — it drives slime's tokenizer / length filter. The agent-visible payload comes entirely from `metadata`.
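To make the table concrete, here is a minimal sketch of how two of these agents might consume their payloads. The `InvocationRequest` dataclass is a hypothetical stand-in (the real request model is not shown in this PR), and the row values and URIs are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class InvocationRequest:
    """Hypothetical stand-in for the OfficeBench request model."""
    task_uri: str
    testbed_uri: str


def math_agent_inputs(payload: dict) -> tuple:
    # The math agent reads its two keys directly from the payload.
    return payload["prompt"], payload["answer"]


def officebench_request(payload: dict) -> InvocationRequest:
    # The metadata keys must match the request fields exactly;
    # a missing or unexpected key raises TypeError here.
    return InvocationRequest(**payload)


# Illustrative JSONL rows (values are invented).
math_row = {
    "prompt": "What is 2 + 3?",
    "metadata": {"prompt": "What is 2 + 3?", "answer": "5"},
}
office_row = {
    "prompt": "Complete the task.",
    "metadata": {
        "task_uri": "s3://example-bucket/task.json",
        "testbed_uri": "s3://example-bucket/testbed/",
    },
}

# Under the new contract, the payload is just the row's metadata dict.
question, answer = math_agent_inputs(dict(math_row["metadata"]))
request = officebench_request(dict(office_row["metadata"]))
```

Because the payload is splatted straight into the request model, a typo in a `metadata` key surfaces as an error in the agent rather than being silently dropped by the framework.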

How

One-line replacement in `backends/slime/integration/rollout.py::_sample_to_payload`: return a shallow copy of `Sample.metadata`. All the old hardcoded-shape logic is deleted.

Plan: `docs/roadmap/committed/slime-data-contract.md` (on the PR #59 branch).

Breaking change

JSONLs in the old `{prompt, label}` shape produce empty agent payloads under the new rule. Regenerate with the updated SETUP.md data-prep snippet, which emits `{prompt, metadata: {prompt, answer}}`. The math agent's `rl_app.py` already reads `payload["prompt"]` / `payload["answer"]` and works without changes once the JSONL is regenerated.
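For existing files, a conversion along these lines may be enough. This is an illustrative sketch, not the SETUP.md snippet: `convert_row` and `convert_jsonl` are hypothetical helpers, and it assumes the old `label` field held the answer, which matches the old `sample.label -> payload["answer"]` mapping.

```python
import json


def convert_row(old_row: dict) -> dict:
    """Map an old {prompt, label} row to {prompt, metadata: {prompt, answer}}."""
    prompt = old_row["prompt"]
    return {
        # Top-level prompt is kept for slime's tokenizer / length filter.
        "prompt": prompt,
        # metadata becomes the agent payload verbatim.
        "metadata": {"prompt": prompt, "answer": old_row["label"]},
    }


def convert_jsonl(lines):
    """Convert an iterable of JSONL lines, skipping blank lines."""
    return [
        json.dumps(convert_row(json.loads(line)))
        for line in lines
        if line.strip()
    ]


old_lines = ['{"prompt": "What is 2 + 3?", "label": "5"}', ""]
new_lines = convert_jsonl(old_lines)
```

In practice you would read the old file line by line and write `new_lines` to a fresh JSONL rather than overwriting the original in place.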

End-to-end smoke test

Ran `bash train.sh` with `NUM_ROLLOUT=10` on Qwen2.5-3B-Instruct + 64-row GSM8K (`slimerl/slime:latest` container, 8 × H100):

| Rollout | raw_reward |
| --- | --- |
| 0 | 0.271 |
| 1 | 0.308 |
| 2 | 0.346 |
| 3 | 0.540 |
| 4 | 0.441 |
| 5 | 0.633 |
| 6 | 0.552 |
| 7 | 0.597 |
| 8 | 0.458 |
| 9 | 0.556 |

Reward rises from 0.27 at rollout 0 to a peak of 0.63 at rollout 5 and ends at 0.56, so GRPO is learning under the new contract, though per-rollout values are noisy. All 10 train steps logged `train/loss` and `train/grad_norm`, and `train/step` advanced monotonically. There were no tracebacks during training and no FAILED session statuses: one transient ACR ThrottlingException was caught and retried, and the two atexit tracebacks from Ray teardown at job exit are cosmetic.
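Since the per-rollout rewards are noisy, one quick way to sanity-check the trend is to compare the mean of the first five rollouts against the last five. The sketch below uses only the `raw_reward` values reported in the table; the half-split is an arbitrary choice for illustration, not part of the PR's verification.

```python
from statistics import mean

# raw_reward values for rollouts 0..9, copied from the smoke-test table.
raw_reward = [0.271, 0.308, 0.346, 0.540, 0.441,
              0.633, 0.552, 0.597, 0.458, 0.556]

first_half = mean(raw_reward[:5])    # rollouts 0-4
second_half = mean(raw_reward[5:])   # rollouts 5-9

# Index of the highest reward.
peak_rollout = max(range(len(raw_reward)), key=raw_reward.__getitem__)

print(round(first_half, 3), round(second_half, 3), peak_rollout)
# → 0.381 0.559 5
```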

Test plan

- Unit tests (`tests/test_slime_rollout_payload.py`) pin the contract: metadata-in-is-payload-out, empty/missing/None/non-dict metadata → `{}`, returned dict is a shallow copy, `sample.prompt` / `sample.label` are ignored.
- Existing test suite passes (79 tests).
- End-to-end smoke test (10-rollout training on Qwen2.5-3B, rewards climb, loss moves).

🤖 Generated with Claude Code

Collapses _sample_to_payload to return a shallow copy of Sample.metadata.

Previously it synthesized a hybrid payload shape (sample.prompt -> payload["prompt"], sample.label -> payload["answer"], sample.metadata nested under payload["metadata"], plus a fall-through copy of Sample fields), which locked the slime backend into the math agent's shape and forced other agents (appworld, migration, officebench) into workarounds.

After this change, the JSONL row's metadata dict is the agent payload exactly, so each agent declares whatever payload shape it wants by choosing what keys to put in metadata. The JSONL top-level prompt field still drives slime's tokenizer and length filter.

Breaking change for existing math JSONLs: rows using {prompt, label} now produce an empty payload. Regenerate with the updated SETUP.md data-prep snippet which emits {prompt, metadata: {prompt, answer}}.

Also drops --label-key from train.sh (nothing reads sample.label under the new rule).

Verified end-to-end on Qwen2.5-3B-Instruct + GSM8K with NUM_ROLLOUT=10: raw_reward climbed 0.27 -> 0.63, train/loss and grad_norm move as expected, no rollout failures.

Plan: docs/roadmap/committed/slime-data-contract.md (committed on docs/core-api-rename-roadmap in PR #59).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Seven tests collapse to three, one per distinct invariant:

- metadata-is-payload-verbatim (also covers that prompt/label are ignored)
- returned dict is a shallow copy (guards against _process_one_episode mutation leaking back)
- missing/None/non-dict metadata defensively returns {}

The four former "empty / missing / None / non-dict" tests all exercised the same fallback branch; merged into one parameterized assertion.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Implemented in #61. Moves the plan out of committed/ (in-flight) into done/ (shipped), alongside the other completed plans. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

lliquid changed the title ~~feat(slime): make Sample.metadata the agent payload verbatim~~ feat(slime): support arbitrary agent payload shapes in the training backend May 5, 2026

lliquid mentioned this pull request May 5, 2026: docs(roadmap): core-api rename, SlimeRunner plans committed; slime data-contract done #59 (Open, 2 tasks)

lyzustc approved these changes May 5, 2026


lliquid merged commit 051d7df into main May 5, 2026
5 checks passed

lliquid deleted the feat/slime-data-contract branch May 5, 2026 21:51

lliquid mentioned this pull request May 5, 2026: feat(slime): add SlimeRunner as the primary Python entry point #62 (Merged, 3 tasks)

lliquid merged 2 commits into main from feat/slime-data-contract
