@neubig neubig commented Oct 30, 2025

Summary

This PR adds a utility script that applies OpenHands SDK context condensation to SFT trajectories, splitting them when condensation occurs.

Motivation

Long agent trajectories can exceed context limits during training. The OpenHands SDK includes a context condenser that summarizes and removes older conversation turns when a threshold is exceeded. This utility applies that condensation to stored trajectories, creating training data that reflects the condensed context that would be seen during inference.

Implementation

Main Components

  1. condense_trajectories.py: Main utility script that:

    • Loads SFT trajectories from JSON
    • Converts conversations to MessageEvent objects
    • Applies context condensation using OpenHands SDK
    • Splits trajectories when condensation occurs (since the prefix changes)
    • Outputs condensed trajectories in the same SFT format
  2. mock_condenser.py: A mock condenser for testing that:

    • Implements the same interface as LLMSummarizingCondenser
    • Triggers condensation based on a simple threshold
    • Creates mock summaries without requiring LLM calls
  3. README_condense_trajectories.md: Comprehensive documentation
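
The splitting step can be illustrated with a minimal sketch. This is not the code in condense_trajectories.py and it does not call the OpenHands SDK. It assumes each trajectory is a dict with `id` and `conversations` keys, that each turn is a dict with `from` and `value` keys, that segment indices start at 0, and that condensation keeps the first `keep_first` turns, inserts one placeholder summary, and keeps a short recent tail. The helper names `condense` and `split_on_condensation` and the tail-size rule are hypothetical.

```python
from typing import Any, Dict, List

Turn = Dict[str, str]


def condense(view: List[Turn], max_size: int, keep_first: int) -> List[Turn]:
    """Assumed policy: keep the head, add a placeholder summary, keep a recent tail."""
    keep_last = max_size // 2 - keep_first - 1  # hypothetical tail size
    if keep_last < 1:
        raise ValueError("max_size is too small for the given keep_first")
    head, tail = view[:keep_first], view[-keep_last:]
    dropped = len(view) - len(head) - len(tail)
    summary = {"from": "system", "value": f"[summary of {dropped} earlier turns]"}
    return head + [summary] + tail


def split_on_condensation(
    trajectory: Dict[str, Any], max_size: int, keep_first: int
) -> List[Dict[str, Any]]:
    """Replay a trajectory and emit one segment per distinct context prefix."""
    segments: List[List[Turn]] = []
    view: List[Turn] = []
    for turn in trajectory["conversations"]:
        if len(view) >= max_size:
            # The prefix is about to change, so close the current segment first.
            segments.append(list(view))
            view = condense(view, max_size, keep_first)
        view.append(turn)
    segments.append(view)
    return [
        {"id": f"{trajectory['id']}_seg{i}", "conversations": seg}
        for i, seg in enumerate(segments)
    ]
```

Each segment ends where a condensation would fire, and the next segment starts from the condensed view, so every training example has a prefix the model could actually see at inference time.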

Key Features

  • Flexible condenser support: Can use either LLM-based or mock condenser
  • Configurable parameters: max_size and keep_first control condensation behavior
  • Trajectory splitting: Creates new trajectory segments when condensation occurs
  • Comprehensive logging: Tracks condensation events and segment creation
  • Format preservation: Output maintains the same SFT format as input

Input/Output

  • Input: sample_sft_openhands.json with N trajectories
  • Output: Same format with N*M trajectories, where M is the average number of condensations per trajectory plus 1
  • Each segment gets a unique ID: {original_id}_seg{index}
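
As a rough illustration of the counting above, the sketch below derives segment IDs and the expected output size from per-trajectory condensation counts. The function names are hypothetical, and starting the index at 0 is an assumption; the PR only states the `{original_id}_seg{index}` pattern.

```python
from typing import List


def segment_ids(original_id: str, num_condensations: int) -> List[str]:
    """One segment per condensation, plus the final segment (index assumed 0-based)."""
    return [f"{original_id}_seg{i}" for i in range(num_condensations + 1)]


def expected_output_count(condensations_per_trajectory: List[int]) -> int:
    """Total segments = sum over trajectories of (condensations + 1), i.e. N*M."""
    return sum(c + 1 for c in condensations_per_trajectory)
```

For example, three trajectories with 3, 0, and 5 condensations give M = 8/3 + 1 and N*M = 11 output trajectories.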

Testing

The script has been tested on multiple datasets:

  • agenttuning_alfworld: 5 trajectories → 23 segments (avg 4.6x split)
  • swe-smith: 5 trajectories → 29 segments (avg 5.8x split)
  • swe-gym: 5 trajectories → 21 segments (avg 4.2x split)

Example verification with jq:

jq '[.[] | {id, conversations: (.conversations | length)}]' output.json
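
For environments without jq, the same check can be sketched in Python. This assumes the output file is a JSON array of objects with `id` and `conversations` keys, as the jq filter implies; the function names are hypothetical.

```python
import json
from typing import Any, Dict, List


def summarize(trajectories: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return each trajectory's id and its number of conversation turns."""
    return [
        {"id": t["id"], "conversations": len(t["conversations"])}
        for t in trajectories
    ]


def summarize_file(path: str) -> List[Dict[str, Any]]:
    """Load a trajectory file and summarize it."""
    with open(path, encoding="utf-8") as f:
        return summarize(json.load(f))
```

Calling `summarize_file("output.json")` should list one entry per segment, which makes it easy to confirm that no segment is much longer than the configured max_size.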

Usage

With mock condenser (for testing):

python scripts/condense_trajectories.py \
  datasets/swe-smith/sample_sft/sample_sft_openhands.json \
  output_condensed.json \
  --max-size 12 \
  --keep-first 2 \
  --use-mock-condenser

With LLM condenser:

export LLM_API_KEY="your-api-key"
python scripts/condense_trajectories.py \
  datasets/swe-smith/sample_sft/sample_sft_openhands.json \
  output_condensed.json \
  --max-size 120 \
  --keep-first 4
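
The command lines above suggest an interface with two positional paths and three options. The sketch below shows one way such an interface could be declared with argparse. It is not taken from the script itself: the argument names `input_path` and `output_path`, the help strings, and the defaults (borrowed from the LLM example) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Condense SFT trajectories (illustrative sketch)."
    )
    parser.add_argument("input_path", help="SFT trajectories in JSON")
    parser.add_argument("output_path", help="where to write condensed trajectories")
    # Defaults below are assumed, not confirmed by the PR.
    parser.add_argument("--max-size", type=int, default=120)
    parser.add_argument("--keep-first", type=int, default=4)
    parser.add_argument("--use-mock-condenser", action="store_true")
    return parser
```

With this layout, `build_parser().parse_args()` exposes the options as `args.max_size`, `args.keep_first`, and `args.use_mock_condenser`.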

Dependencies

  • Requires openhands-sdk to be installed
  • No changes to existing code or dependencies

Future Work

Potential enhancements:

  • Batch processing of entire datasets
  • Configuration files for different condensation strategies
  • Integration with training pipeline
  • Analysis tools for condensation patterns


This utility applies OpenHands SDK context condensation to SFT trajectories.

Features:
- Imports openhands-sdk as a library
- Initializes context condenser (LLMSummarizingCondenser or MockCondenser)
- Applies condensation at appropriate timing based on max_size threshold
- Splits trajectories when condensation occurs (prefix changes)
- Outputs condensed trajectories in same SFT format

Input: N trajectories
Output: N*M trajectories (M = average condensations + 1)

Co-authored-by: openhands <[email protected]>

openhands-ai bot commented Oct 30, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit Checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #157 at branch `add-trajectory-condenser-utility`

Feel free to include any additional details that might help me get this PR into a better state.
