neubig (Contributor) commented Oct 29, 2025

Summary

This PR adds support for the Toucan-1.5M dataset to the agent-data-protocol repository, following the existing patterns from other tool-use datasets like orca_agentinstruct.

Fixes #153

Changes Made

  • extract_raw.py: Loads data from the HuggingFace dataset "Agent-Ark/Toucan-1.5M" with config "Kimi-K2"
  • schema_raw.py: Defines Pydantic models for the raw data structure including Message and SchemaRaw classes
  • raw_to_standardized.py: Converts raw messages to standardized format with proper handling of:
    • System messages as text observations
    • User messages as text observations
    • Assistant messages with function calls as API actions
    • Function responses as text observations
    • Function name conversion from hyphenated to underscore format for Python compatibility
  • api.py: Defines the exa_search_web_search_exa function used in the dataset
  • Sample files: Created sample_raw.json, sample_std.json, and sample_sft.json following the established format
  • requirements.txt: Added necessary dependencies (datasets, pydantic)
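
The role handling listed for raw_to_standardized.py can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the `TextObservation` and `ApiAction` dataclasses are simplified stand-ins for the repository's standardized schema classes, and the raw message layout (a `function_call` dict with a `name` and JSON-encoded `arguments`) is an assumption about the dataset format.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import Any, Union


# Simplified stand-ins for the standardized schema (hypothetical field names).
@dataclass
class TextObservation:
    content: str
    source: str


@dataclass
class ApiAction:
    function: str
    kwargs: dict[str, Any] = field(default_factory=dict)


def convert_message(message: dict[str, Any]) -> Union[TextObservation, ApiAction]:
    """Map one raw message to a standardized step, following the PR's list."""
    role = message["role"]
    function_call = message.get("function_call")  # assumed raw field name

    # Assistant messages with function calls become API actions.
    if role == "assistant" and function_call:
        arguments = function_call.get("arguments") or "{}"
        # Assumed: arguments arrive as a JSON string; accept a dict as well.
        kwargs = json.loads(arguments) if isinstance(arguments, str) else dict(arguments)
        # Hyphens are replaced so the name is a valid Python identifier.
        name = function_call["name"].replace("-", "_")
        return ApiAction(function=name, kwargs=kwargs)

    # System, user, and function messages become text observations.
    if role in {"system", "user", "function"}:
        return TextObservation(content=message.get("content") or "", source=role)

    # The PR description does not cover other cases (e.g. plain assistant text),
    # so this sketch refuses them rather than guessing.
    raise ValueError(f"unhandled message role: {role!r}")
```

Raising on unlisted cases keeps the sketch limited to what the description states; the actual converter may handle plain assistant text differently.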

Dataset Details

The Toucan-1.5M dataset contains over 1.5 million tool-agent conversations with:

  • Messages in conversational format (system, user, assistant, function)
  • Tool declarations and function calls
  • Quality assessments and metadata
  • Focus on web search and information retrieval tasks

Testing

All tests pass:

  • ✅ Dataset structure validation
  • ✅ Raw schema validation
  • ✅ Standardized schema validation
  • ✅ SFT conversion validation

Implementation Notes

  • Function names with hyphens (e.g., "exa-search-web_search_exa") are converted to valid Python identifiers (e.g., "exa_search_web_search_exa") in the standardized format
  • Follows the same patterns as existing tool-use datasets in the repository
  • Maintains compatibility with the existing schema and conversion pipeline
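
As a rough illustration of the name conversion in the first note, the helper below replaces hyphens with underscores and then checks that the result is usable as a Python function name. The helper name `to_python_identifier` and the validation step are assumptions for this sketch; the PR description only states that hyphens are converted to underscores.

```python
import keyword


def to_python_identifier(name: str) -> str:
    """Replace hyphens with underscores and verify the result is a valid name."""
    candidate = name.replace("-", "_")
    # str.isidentifier() accepts keywords such as "class", so check both.
    if not candidate.isidentifier() or keyword.iskeyword(candidate):
        raise ValueError(f"cannot convert {name!r} to a Python identifier")
    return candidate


print(to_python_identifier("exa-search-web_search_exa"))
```

With the example from the note, this prints `exa_search_web_search_exa`. Names that are still invalid after the replacement (for instance, ones starting with a digit) raise an error instead of being silently passed through.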

openhands-ai bot commented Oct 29, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Check Docstrings

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #154 at branch `openhands/add-toucan-1-5m-dataset`

Feel free to include any additional details that might help me get this PR into a better state.

…SFT files

- Updated browser.py fill() signature to match browsergym-core 0.14.2 API
- Generated sample_sft_openhands.json with --is_web=no flag
- Generated sample_sft_sweagent.json for sweagent agent
- All tests passing for toucan_1_5m dataset

neubig requested a review from yueqis on October 29, 2025, 19:48

Development

Successfully merging this pull request may close these issues.

Add Toucan-1.5M dataset