Finetuning SmolAgents with Reinforcement Learning #1798

lamthuy · 2025-10-08T04:59:29Z

When running SmolAgents CodeAct for tool calling, we often observe that smaller open-source models struggle with complex tool-use tasks — and sometimes even fail at simple ones. While careful prompt engineering can mitigate this problem, it’s not a sustainable solution, especially in dynamic agentic systems where any workflow change can disrupt tool-calling accuracy.

To address this issue at its core, the ideal approach is to train/fine-tune models to use tools effectively. However, this is a non-trivial task that requires setting up complex machine learning pipelines tightly integrated with the agentic system — something that can be challenging for most developers.

To make this process easier, we’ve developed ToolBrain, a lightweight open-source library (MIT license) that removes the need to build these pipelines from scratch. For more information, see https://github.com/ToolBrain/ToolBrain

✨ Key Features

🤖 Learning algorithms: Supports GRPO, DPO, and supervised learning.
🎯 Flexible rewards: Define your own reward functions or use LLM-as-judge.
🔧 Tool management: Scalable retrieval for managing large tool collections.
📊 Knowledge distillation: Distill large teacher models into smaller student models for efficiency.
🚀 Zero-learn: Automatically generate training tasks.
⚡ Efficient training: Supports FP16 finetuning, LoRA, Unsloth, and BitsAndBytes for resource-efficient training.
🧠 Multiple agent frameworks: Supports SmolAgents and LangChain, with more coming soon.

from smolagents import tool, TransformersModel, CodeAgent
from toolbrain import Brain
from toolbrain.rewards import reward_exact_match

# --- 1. Define Tools (user-defined; the reward function is imported above) ---
@tool
def add(a: int, b: int) -> int:
    """
    Add two integers.

    Args:
        a (int): First addend.
        b (int): Second addend.

    Returns:
        int: Sum of a and b.
    """
    return a + b


# --- 2. Prepare Training Data ---
training_dataset = [
    {
        "query": "Use the add tool to calculate 5 + 7",
        "gold_answer": "12"
    }
]


# --- 3. Create Agent ---
model = TransformersModel(
    model_id="Qwen/Qwen2.5-0.5B-Instruct",  # use a bigger model for better results
    max_new_tokens=128
)

agent = CodeAgent(
    model=model,
    tools=[add],
    max_steps=1
)

# --- 4. Create Brain ---
brain = Brain(
    agent,                          # Agent instance
    algorithm="GRPO",               # Algorithm choice
    reward_func=reward_exact_match  # Reward function; any Python function can serve as a custom reward
)

# --- 5. Train the Agent with GRPO ---
brain.train(training_dataset, num_iterations=10)
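
Since the reward can be any Python function, a custom one might give partial credit instead of requiring a strict match. The sketch below is framework-agnostic: `numeric_match_reward` and its `(final_answer, gold_answer)` arguments are hypothetical and are not ToolBrain's actual reward signature, which this post does not show. Check the repository before wiring something like it into `Brain`.

```python
import re


def numeric_match_reward(final_answer: str, gold_answer: str) -> float:
    """Return 1.0 for an exact match, 0.5 if the gold number appears in the answer, else 0.0.

    Hypothetical helper for illustration; not part of ToolBrain.
    """
    answer = final_answer.strip()
    gold = gold_answer.strip()
    if answer == gold:
        return 1.0
    # Partial credit: the gold value shows up as a standalone number in a longer answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer)
    return 0.5 if gold in numbers else 0.0


print(numeric_match_reward("12", "12"))              # exact match
print(numeric_match_reward("The sum is 12.", "12"))  # gold number embedded in text
print(numeric_match_reward("13", "12"))              # wrong answer
```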

KeepALifeUS · 2026-02-12T20:33:07Z

Great work on ToolBrain! This addresses a real gap in the ecosystem.

Some thoughts from my experience with RL for agents:

GRPO is a solid choice — I've worked extensively with PPO and DQN for agent systems (ml-ppo, ml-dqn), and GRPO's group-relative approach handles the reward scaling issues that plague vanilla policy gradient methods.
Reward function design is the hard part
- Sparse rewards (task success/failure) → slow learning
- Dense rewards risk reward hacking
- LLM-as-judge adds latency but is often necessary for nuanced evaluation

Consider state-based rewards
Instead of only rewarding final outcomes:

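def intermediate_reward(state):
    # Reward progress through shared state: fraction of steps completed
    total = state["total_steps"]
    # Guard against division by zero when no steps are defined
    return len(state["completed_steps"]) / total if total else 0.0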

This gives a denser signal without manual reward shaping.

Token efficiency as implicit reward
In my multi-agent work, I found that penalizing token usage naturally pushes agents toward more efficient tool use:
```
reward = task_success - 0.001 * tokens_used
```
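
The two ideas above can be combined into one scalar reward. This is a minimal sketch, assuming the same `state` dictionary as the earlier snippet; the name `shaped_reward` and the weights `progress_weight` and `token_penalty` are illustrative values, not tuned recommendations.

```python
def shaped_reward(
    task_success: bool,
    state: dict,
    tokens_used: int,
    progress_weight: float = 0.5,
    token_penalty: float = 0.001,
) -> float:
    """Combine outcome, progress, and token cost into one scalar reward."""
    total_steps = state["total_steps"]
    # Fraction of planned steps completed; 0.0 when no steps are defined.
    progress = len(state["completed_steps"]) / total_steps if total_steps else 0.0
    return float(task_success) + progress_weight * progress - token_penalty * tokens_used


state = {"completed_steps": ["parse", "call_tool"], "total_steps": 4}
print(shaped_reward(task_success=False, state=state, tokens_used=200))
print(shaped_reward(task_success=True, state=state, tokens_used=200))
```

Because the penalty is a fixed cost per token, a long failed episode can score below a short failed one, so the weights are worth checking against typical episode lengths.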

Question: Does ToolBrain support custom state representations for the RL loop? The standard observation space (conversation history) can be noisy — I've had better results with structured state vectors.

Happy to share more patterns from my implementations!

quyk67uet Feb 13, 2026

Hi @KeepALifeUS, thanks for the thoughtful feedback. It's great to connect with someone working deeply in RL for agents.

On GRPO and Reward Design

You’re right, GRPO was a deliberate choice to address reward scaling issues we ran into with vanilla policy gradients.
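
For readers unfamiliar with the group-relative idea, here is a minimal sketch of the advantage computation commonly described for GRPO: several completions are sampled for the same query, and each reward is normalized by the group's mean and standard deviation. It illustrates the general technique only and is not ToolBrain's implementation; the function name and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of rollouts for the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards on very different scales give nearly identical advantages after normalization.
print(group_relative_advantages([0.0, 1.0, 1.0, 0.0]))
print(group_relative_advantages([0.0, 100.0, 100.0, 0.0]))
```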

For the harder side of reward design, ToolBrain uses a hybrid setup. The API makes it easy to plug in patterns like:

- Dense progress rewards
- Token-efficiency penalties

We also support ranking-based LLM-as-a-judge signals for more nuanced evaluation when heuristic rewards aren’t sufficient.
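
One way a ranking-based judge signal can be turned into scalar rewards is to map each rollout's rank to a score in [0, 1]. The sketch below assumes the judge returns candidate indices ordered best to worst; `ranking_to_rewards` is a hypothetical helper, and how ToolBrain converts rankings internally is not described in this thread.

```python
def ranking_to_rewards(ranking: list[int]) -> list[float]:
    """Map a best-to-worst ranking of candidate indices to rewards in [0, 1].

    ranking[0] is the index of the best candidate. The returned list is
    indexed by candidate, so rewards[i] belongs to candidate i.
    """
    n = len(ranking)
    if n == 1:
        return [1.0]
    rewards = [0.0] * n
    for position, candidate in enumerate(ranking):
        # Best position gets 1.0, worst gets 0.0, evenly spaced in between.
        rewards[candidate] = 1.0 - position / (n - 1)
    return rewards


# The judge preferred candidate 2, then 0, then 1.
print(ranking_to_rewards([2, 0, 1]))  # → [0.5, 0.0, 1.0]
```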

On Custom State Representations

Great question on structured state vectors; this is exactly where the Adapter pattern comes in.

By default, the SmolAgentAdapter uses conversation history as the observation space. But the framework is modular by design. If you want structured vectors instead of text, you can:

- Subclass BaseAgentAdapter or SmolAgentAdapter
- Override _build_input_for_rl_from_memory
- Extract structured signals directly from memory, such as:
  - Step counts
  - Tool states
  - Success flags
- Format them as state vectors for the RL loop

This keeps the separation clean: the training logic and agent runtime remain unchanged, and only the representation layer is modified.
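
As a rough illustration of the extraction step, the sketch below turns a list of step records into a fixed-length state vector. The record format (`tool` and `success` keys), the `TOOL_NAMES` vocabulary, and `memory_to_state_vector` are all assumptions made for this example; the real memory objects and the signature of `_build_input_for_rl_from_memory` should be taken from the ToolBrain source.

```python
TOOL_NAMES = ["add", "search"]  # hypothetical tool vocabulary


def memory_to_state_vector(steps: list[dict], max_steps: int = 10) -> list[float]:
    """Encode step count, per-tool usage, and last-step success as floats."""
    step_fraction = min(len(steps), max_steps) / max_steps
    # One slot per known tool: 1.0 if the tool was called at least once.
    used = {step.get("tool") for step in steps}
    tool_flags = [1.0 if name in used else 0.0 for name in TOOL_NAMES]
    last_success = 1.0 if steps and steps[-1].get("success") else 0.0
    return [step_fraction, *tool_flags, last_success]


steps = [
    {"tool": "add", "success": True},
    {"tool": "add", "success": False},
]
print(memory_to_state_vector(steps))  # → [0.2, 1.0, 0.0, 0.0]
```

An overridden adapter method could call a helper like this and return the vector in place of the conversation text.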

Curious to hear what state/reward patterns you’ve been experimenting with. Appreciate the discussion!
