OASIS: A Comprehensive Framework for Evaluating AI Agent Safety and Security in Multi-Turn Interactions
OASIS is an evaluation framework for assessing the safety and security of AI agents across multiple dimensions. It provides automated testing with configurable test levels, difficulties, and execution modes, and supports a range of AI models and deployment scenarios.
- Multi-Model Support: Compatible with various AI models, including GPT-5, Qwen, and others
- Comprehensive Safety Assessment: Evaluates agent behavior across multiple safety dimensions
- Configurable Test Parameters: Flexible test levels (L0-L3), difficulties (Low/Medium/High), and test counts
- Real-World Tool Integration: Tests agents with realistic tool-calling scenarios
- Resume Functionality: Ability to resume from existing test trajectories
- Concurrent Execution: Support for parallel test execution with configurable worker threads
- Detailed Analytics: Comprehensive metrics and trajectory analysis
- API Health Checks: Automatic verification of API connectivity before testing
- Enhanced Logging: Detailed logging system with multiple output formats
- Test Runner (`evaluation/test_runner.py`): Main orchestration engine
- Multi-Turn Agent (`evaluation/multi_turn_agent.py`): Agent interaction handler
- Safety Analyzer (`evaluation/safety_analyzer.py`): Safety assessment engine
- Trajectory Saver (`evaluation/trajectory_saver.py`): Test result persistence
- Tool Registry (`tools/tool_registry.py`): Tool management system
- Real-World Mode: Tests agents with realistic tool interactions
- Direct Evaluation Mode: Direct assessment of agent responses
Install the Python dependencies:

```
pip install openai aiohttp pandas numpy matplotlib seaborn
```

Required files and credentials:

- `run_simple_test.py`: Main Python test runner script
- `merged_tasks_updated.jsonl`: Test dataset file
- Tool configuration files in the `tools/` directory
- Valid API credentials for target models
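To sanity-check the dataset file before a run, the minimal sketch below can be used. It assumes only that `merged_tasks_updated.jsonl` contains one JSON object per line; the exact field names depend on the dataset itself.

```python
import json

# Count tasks and peek at the fields present in the dataset file.
# Assumes one JSON object per line; field names depend on the dataset.
path = "merged_tasks_updated.jsonl"
tasks = []
with open(path, "r", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            tasks.append(json.loads(line))
        except json.JSONDecodeError as exc:
            print(f"Line {line_no} is not valid JSON: {exc}")

print(f"Loaded {len(tasks)} tasks")
if tasks:
    print("Fields in the first task:", sorted(tasks[0].keys()))
```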
- Clone the repository:

```
git clone https://github.com/open-compass/OASIS.git
cd OASIS
```

- Install dependencies:

```
pip install -r requirements.txt
```

- Configure API credentials:

```
export OPENAI_API_KEY="your-api-key-here"
export OPENAI_BASE_URL="optional-custom-api-url"
```

Basic configuration variables:

```
MODEL_NAME="gpt-5"             # Model identifier
API_KEY="your-api-key"         # API authentication
API_URL="optional-custom-url"  # Custom API endpoint
BASE_OUTPUT_DIR="./results"    # Output directory
DATASET="dataset.jsonl"        # Test dataset path
```

Runtime parameters and typical ranges:

- MAX_ITERATIONS: 10-30 (Maximum iterations per test)
- TEMPERATURE: 0.0-2.0 (Model creativity parameter)
- TIMEOUT: 300-1200 seconds (Request timeout)
- MAX_WORKERS: 1-8 (Concurrent execution workers)
- RETRY_COUNT: 5-20 (Failure retry attempts)
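OASIS runs its own API health check before testing; the snippet below is only an illustrative standalone pre-flight check you can run manually. It assumes the `openai` package from the dependency list and the `OPENAI_API_KEY` / `OPENAI_BASE_URL` variables configured above; the model identifier is an example, not a requirement.

```python
import os
from openai import OpenAI

# Standalone connectivity check using the credentials configured above.
# This mirrors the idea of the framework's API health check but is not its code.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_BASE_URL") or None,  # None falls back to the default endpoint
)

response = client.chat.completions.create(
    model="gpt-5",  # example identifier; use the model you intend to test
    messages=[{"role": "user", "content": "ping"}],
)
print("API reachable; sample reply:", response.choices[0].message.content)
```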
- L0: Basic functionality tests
- L1: Standard safety assessments
- L2: Advanced security evaluations
- L3: Complex multi-turn scenarios
- Low: Simple, straightforward test cases
- Medium: Moderate complexity scenarios
- High: Complex, multi-faceted challenges
Usage:

```
python run_simple_test.py [OPTIONS]
```

Model options:

```
--model "gpt-5"          # Specify AI model
--api-key "your-key"     # API authentication
--base-url "custom-url"  # Custom API endpoint
```

Test configuration:

```
--mode "real_world"        # Test mode (real_world/direct_eval)
--dataset "dataset.jsonl"  # Test dataset path
--tools-dir "tools"        # Tools directory path
--output-dir "./results"   # Results output directory
```

Execution parameters:

```
--max-tests 100      # Maximum number of tests
--max-iterations 15  # Iterations per test
--max-workers 4      # Concurrent workers
--temperature 0.6    # Model temperature
--timeout 600        # Request timeout (seconds)
--retry-count 10     # Retry attempts
```

Filtering and reasoning options:

```
--test-level "All"  # Test levels (0/1/2/3/All/[0,2])
--difficulty "All"  # Difficulty (Low/Medium/High/All/[Low,High])
--thinking          # Enable thinking mode
```

Example run:

```
python run_simple_test.py \
--model "claude-3-opus" \
--api-key "your-key" \
--mode "real_world" \
--max-tests 50 \
--test-level "[1,2,3]" \
--difficulty "Medium" \
--max-workers 8 \
--temperature 0.8 \
--thinking
```

Example: resuming from an existing trajectory directory:

```
python run_simple_test.py \
--model "gpt-4" \
--api-key "your-key" \
--resume-from "./results/trajectories/20240101_120000"
```

Results are written to the output directory with the following structure:

```
results/
├── trajectories/
│   └── {session_id}/
│       ├── {task_id}.json              # Individual test trajectory
│       └── ...
├── reports/
│   ├── summary_{session_id}.json       # Test summary
│   ├── metrics_{session_id}.json       # Detailed metrics
│   └── analysis_{session_id}.html      # Visual analysis
└── logs/
    ├── test_runner_{session_id}.log    # Main execution log
    └── errors_{session_id}.log         # Error details
```
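A minimal sketch for inspecting a finished session's output, assuming only the directory layout shown above; the field names inside each JSON file depend on the framework's output schema, so the snippet just reports what it finds.

```python
import json
from pathlib import Path

# Point this at one session directory under results/trajectories/.
session_dir = Path("./results/trajectories/20240101_120000")  # example session id

trajectories = sorted(session_dir.glob("*.json"))
print(f"{len(trajectories)} trajectory files in {session_dir}")

for traj_path in trajectories[:3]:  # peek at the first few
    with traj_path.open(encoding="utf-8") as f:
        record = json.load(f)
    # Field names depend on the output schema; just list what is present.
    print(traj_path.name, "->", sorted(record.keys()) if isinstance(record, dict) else type(record))

# The per-session summary lives under results/reports/, per the layout above.
summary_path = Path("./results/reports") / f"summary_{session_dir.name}.json"
if summary_path.exists():
    with summary_path.open(encoding="utf-8") as f:
        print("Summary:", json.load(f))
```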
- Success Rate: Percentage of safe responses
- Tool Usage: Average tool calls per test
- Iteration Count: Average conversation turns
- Execution Time: Performance benchmarks
- Safety Analysis: Detailed risk assessment
- Tool Safety: Proper use of provided tools
- Response Safety: Content appropriateness
- Behavioral Safety: Action consistency
- Contextual Safety: Situation awareness
- Covert: Subtle, indirect harmful attempts
- Explicit: Direct harmful requests
- Other: Miscellaneous safety concerns
Configure parallel testing with worker threads:

```
--max-workers 8   # Enable 8 concurrent workers
```

Enable enhanced reasoning for compatible models:

```
--thinking   # Activate thinking mode
```

Add new tools by placing Python modules in the `tools/` directory and registering them in `tool_registry.py`; a hypothetical module skeleton is sketched below.
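The exact interface expected by `tool_registry.py` is defined in that file; the skeleton below is only a hypothetical illustration of what a new tool module might look like. The function name, schema layout, and file name are assumptions, not the framework's actual API.

```python
# tools/weather_lookup.py -- hypothetical example tool module.
# The registration mechanism and schema format shown here are assumptions;
# check tools/tool_registry.py for the interface OASIS actually expects.

def weather_lookup(city: str) -> dict:
    """Return a canned weather report for the given city (illustrative only)."""
    return {"city": city, "condition": "sunny", "temperature_c": 22}

# A JSON-schema style description, in the spirit of OpenAI-style tool definitions.
TOOL_SPEC = {
    "name": "weather_lookup",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}
```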
Automatically resume interrupted test sessions:

```
--resume-from "./previous/trajectory/path"
```

Common issues and how to address them:

- API Connection Errors
  - Verify API key validity
  - Check network connectivity
  - Confirm API endpoint accessibility
- Memory Issues
  - Reduce `--max-workers` for large test sets
  - Increase system memory allocation
  - Use sequential execution (`--max-workers 1`)
- Timeout Errors
  - Increase the `--timeout` parameter
  - Check model response times
  - Verify API rate limits
- Tool Import Errors
  - Ensure tool files are properly formatted
  - Check Python path configuration
  - Verify tool registration in the registry
Recommended practices:

- Use an appropriate `--max-workers` value for your system
- Configure `--temperature` based on test requirements
- Enable `--thinking` mode only when necessary
- Monitor system resources during execution
If you find our work useful, please consider citing:
```
@misc{ma2025brittleagentsafetyrethinking,
  title={How Brittle is Agent Safety? Rethinking Agent Risk under Intent Concealment and Task Complexity},
  author={Zihan Ma and Dongsheng Zhu and Shudong Liu and Taolin Zhang and Junnan Liu and Qingqiu Li and Minnan Luo and Songyang Zhang and Kai Chen},
  year={2025},
  eprint={2511.08487},
  archivePrefix={arXiv},
  primaryClass={cs.MA},
  url={https://arxiv.org/abs/2511.08487},
}
```