Add trajectory condensation utility for OpenHands SFT data #157

neubig · 2025-10-30T02:57:14Z

Summary

This PR adds a utility script that applies OpenHands SDK context condensation to SFT trajectories, splitting them when condensation occurs.

Motivation

Long agent trajectories can exceed context limits during training. The OpenHands SDK includes a context condenser that summarizes and removes older conversation turns once a size threshold is exceeded. This utility applies the same condensation to stored trajectories, so the training data reflects the condensed context the model would see at inference time.

Implementation

Main Components

condense_trajectories.py: Main utility script that:
- Loads SFT trajectories from JSON
- Converts conversations to MessageEvent objects
- Applies context condensation using the OpenHands SDK
- Splits trajectories when condensation occurs (since the conversation prefix changes at that point)
- Outputs condensed trajectories in the same SFT format
mock_condenser.py: A mock condenser for testing that:
- Implements the same interface as LLMSummarizingCondenser
- Triggers condensation based on a simple threshold
- Creates mock summaries without requiring LLM calls
README_condense_trajectories.md: Comprehensive documentation
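
The condense-and-split flow described above can be sketched without the SDK. This is a minimal illustration, not the code in condense_trajectories.py: the `from`/`value` message keys, the "keep head, summarize middle, keep recent tail" policy, the target size of `max_size // 2`, and the function names `condense` and `split_trajectory` are all assumptions made for the example. It also assumes `max_size` is comfortably larger than `keep_first`.

```python
from typing import Dict, List

Message = Dict[str, str]


def condense(view: List[Message], max_size: int, keep_first: int) -> List[Message]:
    """Assumed policy: keep the first keep_first messages, replace the
    middle with one mock summary message, keep the most recent tail."""
    target = max_size // 2
    tail_len = max(target - keep_first - 1, 1)
    head = view[:keep_first]
    tail = view[-tail_len:]
    forgotten = view[keep_first:len(view) - tail_len]
    summary = {
        "from": "system",
        "value": f"[summary of {len(forgotten)} earlier messages]",
    }
    return head + [summary] + tail


def split_trajectory(
    conversations: List[Message], max_size: int, keep_first: int
) -> List[List[Message]]:
    """Replay a trajectory message by message. Whenever the view is full,
    emit it as a finished segment and continue from the condensed view."""
    segments: List[List[Message]] = []
    view: List[Message] = []
    for msg in conversations:
        if len(view) >= max_size:
            segments.append(view)  # the prefix changes after this point
            view = condense(view, max_size, keep_first)
        view.append(msg)
    if view:
        segments.append(view)
    return segments
```

Each emitted segment is a self-consistent context: later segments start with the preserved head, followed by a summary message in place of the forgotten turns.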

Key Features

Flexible condenser support: Can use either LLM-based or mock condenser
Configurable parameters: max_size and keep_first control condensation behavior
Trajectory splitting: Creates new trajectory segments when condensation occurs
Comprehensive logging: Tracks condensation events and segment creation
Format preservation: Output maintains the same SFT format as input

Input/Output

Input: sample_sft_openhands.json with N trajectories
Output: Same format with N*M trajectory segments, where M is the average number of condensations per trajectory plus 1
Each segment gets a unique ID: {original_id}_seg{index}
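
As a worked example of the naming scheme and the N*M count: a trajectory condensed k times yields k + 1 segments. The sketch below assumes zero-based segment indices, which the description does not state; `segment_ids` and `total_segments` are hypothetical helper names.

```python
from typing import List


def segment_ids(original_id: str, num_condensations: int) -> List[str]:
    """IDs for one trajectory, following {original_id}_seg{index}.
    Zero-based indexing is an assumption."""
    return [f"{original_id}_seg{i}" for i in range(num_condensations + 1)]


def total_segments(condensations_per_trajectory: List[int]) -> int:
    """Sum of (condensations + 1) over all trajectories, which equals
    N * (average condensations + 1)."""
    return sum(c + 1 for c in condensations_per_trajectory)
```

For instance, three trajectories condensed 2, 0, and 4 times give 3 + 1 + 5 = 9 segments, which matches N*M with N = 3 and M = 2 + 1.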

Testing

The script has been tested on five-trajectory samples from three datasets:

agenttuning_alfworld: 5 trajectories → 23 segments (avg 4.6x split)
swe-smith: 5 trajectories → 29 segments (avg 5.8x split)
swe-gym: 5 trajectories → 21 segments (avg 4.2x split)

Example verification with jq:

jq '[.[] | {id, conversations: (.conversations | length)}]' output.json
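
For anyone without jq, a rough Python equivalent of the check above, plus a per-source segment count. It relies only on what the jq expression implies (a top-level JSON array whose items have `id` and `conversations`) and on the `_seg{index}` suffix; the function names are illustrative.

```python
import json
from collections import Counter
from typing import Any, Dict, List


def load(path: str) -> List[Dict[str, Any]]:
    """Read the condensed output file (a JSON array of trajectories)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize(trajectories: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Same shape as the jq output: id plus number of conversation turns."""
    return [
        {"id": t["id"], "conversations": len(t["conversations"])}
        for t in trajectories
    ]


def segments_per_source(trajectories: List[Dict[str, Any]]) -> Counter:
    """Count segments per original trajectory by stripping the _seg suffix."""
    return Counter(t["id"].rsplit("_seg", 1)[0] for t in trajectories)
```

Calling `summarize(load("output.json"))` should list every segment with its length, and `segments_per_source` gives the split factor for each original trajectory.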

Usage

With mock condenser (for testing):

python scripts/condense_trajectories.py \
  datasets/swe-smith/sample_sft/sample_sft_openhands.json \
  output_condensed.json \
  --max-size 12 \
  --keep-first 2 \
  --use-mock-condenser

With LLM condenser:

export LLM_API_KEY="your-api-key"
python scripts/condense_trajectories.py \
  datasets/swe-smith/sample_sft/sample_sft_openhands.json \
  output_condensed.json \
  --max-size 120 \
  --keep-first 4

Dependencies

Requires openhands-sdk to be installed
No changes to existing code or dependencies

Future Work

Potential enhancements:

Batch processing of entire datasets
Configuration files for different condensation strategies
Integration with training pipeline
Analysis tools for condensation patterns


This utility applies OpenHands SDK context condensation to SFT trajectories.

Features:
- Imports openhands-sdk as a library
- Initializes context condenser (LLMSummarizingCondenser or MockCondenser)
- Applies condensation at appropriate timing based on max_size threshold
- Splits trajectories when condensation occurs (prefix changes)
- Outputs condensed trajectories in same SFT format

Input: N trajectories
Output: N*M trajectories (M = average condensations + 1)

Co-authored-by: openhands <[email protected]>

openhands-ai · 2025-10-30T02:58:30Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Pre-commit Checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #157 at branch `add-trajectory-condenser-utility`

Feel free to include any additional details that might help me get this PR into a better state.
