Add Toucan-1.5M dataset implementation #154

neubig · 2025-10-29T18:45:21Z

Summary

This PR adds support for the Toucan-1.5M dataset to the agent-data-protocol repository, following the existing patterns from other tool-use datasets like orca_agentinstruct.

Fixes #153

Changes Made

extract_raw.py: Loads data from the HuggingFace dataset "Agent-Ark/Toucan-1.5M" with config "Kimi-K2"
schema_raw.py: Defines Pydantic models for the raw data structure including Message and SchemaRaw classes
raw_to_standardized.py: Converts raw messages to standardized format with proper handling of:
- System messages as text observations
- User messages as text observations
- Assistant messages with function calls as API actions
- Function responses as text observations
- Function name conversion from hyphenated to underscore format for Python compatibility
api.py: Defines the exa_search_web_search_exa function used in the dataset
Sample files: Created sample_raw.json, sample_std.json, and sample_sft.json following the established format
requirements.txt: Added necessary dependencies (datasets, pydantic)

Dataset Details

The Toucan-1.5M dataset contains over 1.5 million tool-agent conversations with:

Messages in conversational format (system, user, assistant, function)
Tool declarations and function calls
Quality assessments and metadata
Focus on web search and information retrieval tasks

Testing

All tests pass successfully:

✅ Dataset structure validation
✅ Raw schema validation
✅ Standardized schema validation
✅ SFT conversion validation

Implementation Notes

Function names with hyphens (e.g., "exa-search-web_search_exa") are converted to valid Python identifiers (e.g., "exa_search_web_search_exa") in the standardized format
Follows the same patterns as existing tool-use datasets in the repository
Maintains compatibility with the existing schema and conversion pipeline

@neubig can click here to continue refining the PR

openhands-ai · 2025-10-29T18:46:37Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Check Docstrings

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #154 at branch `openhands/add-toucan-1-5m-dataset`

Feel free to include any additional details that might help me get this PR into a better state.

_{^{You can manage your notification settings}}

…SFT files - Updated browser.py fill() signature to match browsergym-core 0.14.2 API - Generated sample_sft_openhands.json with --is_web=no flag - Generated sample_sft_sweagent.json for sweagent agent - All tests passing for toucan_1_5m dataset

Fix formatting issues from pre-commit hooks

ae3df21

openhands-ai bot mentioned this pull request Oct 29, 2025

Add Toucan-1.5M dataset #153

Open

openhands-agent added 2 commits October 29, 2025 19:29

Fix docstring formatting and add sample_sft agent-specific files

494eaf2

neubig requested a review from yueqis October 29, 2025 19:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add Toucan-1.5M dataset implementation #154

Add Toucan-1.5M dataset implementation #154

Uh oh!

neubig commented Oct 29, 2025

Uh oh!

openhands-ai bot commented Oct 29, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Add Toucan-1.5M dataset implementation #154

Are you sure you want to change the base?

Add Toucan-1.5M dataset implementation #154

Uh oh!

Conversation

neubig commented Oct 29, 2025

Summary

Changes Made

Dataset Details

Testing

Implementation Notes

Uh oh!

openhands-ai bot commented Oct 29, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants