OASIS: A Comprehensive Framework for Evaluating AI Agent Safety and Security in Multi-Turn Interactions
OASIS is an evaluation framework for assessing the safety and security of AI agents across multiple dimensions. It provides automated testing with configurable test levels, difficulties, and execution modes, and supports a range of AI models and deployment scenarios.
- Multi-Model Support: Compatible with various AI models, including GPT-5, Qwen, and others
- Comprehensive Safety Assessment: Evaluates agent behavior across multiple safety dimensions
- Configurable Test Parameters: Flexible test levels (L0-L3), difficulties (Low/Medium/High), and test counts
- Real-World Tool Integration: Tests agents with realistic tool-calling scenarios
- Resume Functionality: Ability to resume from existing test trajectories
- Concurrent Execution: Support for parallel test execution with configurable worker threads
- Detailed Analytics: Comprehensive metrics and trajectory analysis
- API Health Checks: Automatic verification of API connectivity before testing
- Enhanced Logging: Detailed logging system with multiple output formats
- Test Runner (`evaluation/test_runner.py`): Main orchestration engine
- Multi-Turn Agent (`evaluation/multi_turn_agent.py`): Agent interaction handler
- Safety Analyzer (`evaluation/safety_analyzer.py`): Safety assessment engine
- Trajectory Saver (`evaluation/trajectory_saver.py`): Test result persistence
- Tool Registry (`tools/tool_registry.py`): Tool management system
- Real-World Mode: Tests agents with realistic tool interactions
- Direct Evaluation Mode: Direct assessment of agent responses
Install the Python dependencies:

```
pip install openai aiohttp pandas numpy matplotlib seaborn
```

Required files and credentials:

- `run_simple_test.py`: Main Python test runner script
- `merged_tasks_updated.jsonl`: Test dataset file
- Tool configuration files in the `tools/` directory
- Valid API credentials for target models
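To sanity-check the dataset file before a run, the minimal sketch below can be used. It assumes only that `merged_tasks_updated.jsonl` contains one JSON object per line; the exact field names depend on the dataset itself.

```python
import json

# Count tasks and peek at the fields present in the dataset file.
# Assumes one JSON object per line; field names depend on the dataset.
path = "merged_tasks_updated.jsonl"
tasks = []
with open(path, "r", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            tasks.append(json.loads(line))
        except json.JSONDecodeError as exc:
            print(f"Line {line_no} is not valid JSON: {exc}")

print(f"Loaded {len(tasks)} tasks")
if tasks:
    print("Fields in the first task:", sorted(tasks[0].keys()))
```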
- Clone the repository:

```
git clone https://github.com/open-compass/OASIS.git
cd OASIS
```

- Install dependencies:

```
pip install -r requirements.txt
```

- Configure API credentials:

```
export OPENAI_API_KEY="your-api-key-here"
export OPENAI_BASE_URL="optional-custom-api-url"
```

Basic configuration variables:

```
MODEL_NAME="gpt-5"             # Model identifier
API_KEY="your-api-key"         # API authentication
API_URL="optional-custom-url"  # Custom API endpoint
BASE_OUTPUT_DIR="./results"    # Output directory
DATASET="dataset.jsonl"        # Test dataset path
```

Runtime parameters and typical ranges:

- MAX_ITERATIONS: 10-30 (Maximum iterations per test)
- TEMPERATURE: 0.0-2.0 (Model creativity parameter)
- TIMEOUT: 300-1200 seconds (Request timeout)
- MAX_WORKERS: 1-8 (Concurrent execution workers)
- RETRY_COUNT: 5-20 (Failure retry attempts)
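OASIS runs its own API health check before testing; the snippet below is only an illustrative standalone pre-flight check you can run manually. It assumes the `openai` package from the dependency list and the `OPENAI_API_KEY` / `OPENAI_BASE_URL` variables configured above; the model identifier is an example, not a requirement.

```python
import os
from openai import OpenAI

# Standalone connectivity check using the credentials configured above.
# This mirrors the idea of the framework's API health check but is not its code.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_BASE_URL") or None,  # None falls back to the default endpoint
)

response = client.chat.completions.create(
    model="gpt-5",  # example identifier; use the model you intend to test
    messages=[{"role": "user", "content": "ping"}],
)
print("API reachable; sample reply:", response.choices[0].message.content)
```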
- L0: Basic functionality tests
- L1: Standard safety assessments
- L2: Advanced security evaluations
- L3: Complex multi-turn scenarios
- Low: Simple, straightforward test cases
- Medium: Moderate complexity scenarios
- High: Complex, multi-faceted challenges
Usage:

```
python run_simple_test.py [OPTIONS]
```

Model options:

```
--model "gpt-5"          # Specify AI model
--api-key "your-key"     # API authentication
--base-url "custom-url"  # Custom API endpoint
```

Test configuration:

```
--mode "real_world"        # Test mode (real_world/direct_eval)
--dataset "dataset.jsonl"  # Test dataset path
--tools-dir "tools"        # Tools directory path
--output-dir "./results"   # Results output directory
```

Execution parameters:

```
--max-tests 100      # Maximum number of tests
--max-iterations 15  # Iterations per test
--max-workers 4      # Concurrent workers
--temperature 0.6    # Model temperature
--timeout 600        # Request timeout (seconds)
--retry-count 10     # Retry attempts
```

Filtering and reasoning options:

```
--test-level "All"  # Test levels (0/1/2/3/All/[0,2])
--difficulty "All"  # Difficulty (Low/Medium/High/All/[Low,High])
--thinking          # Enable thinking mode
```

Example run:

```
python run_simple_test.py \
--model "claude-3-opus" \
--api-key "your-key" \
--mode "real_world" \
--max-tests 50 \
--test-level "[1,2,3]" \
--difficulty "Medium" \
--max-workers 8 \
--temperature 0.8 \
--thinking
```

Example: resuming from an existing trajectory directory:

```
python run_simple_test.py \
--model "gpt-4" \
--api-key "your-key" \
--resume-from "./results/trajectories/20240101_120000"
```

Results are written to the output directory with the following structure:

```
results/
├── trajectories/
│   └── {session_id}/
│       ├── {task_id}.json              # Individual test trajectory
│       └── ...
├── reports/
│   ├── summary_{session_id}.json       # Test summary
│   ├── metrics_{session_id}.json       # Detailed metrics
│   └── analysis_{session_id}.html      # Visual analysis
└── logs/
    ├── test_runner_{session_id}.log    # Main execution log
    └── errors_{session_id}.log         # Error details
```
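A minimal sketch for inspecting a finished session's output, assuming only the directory layout shown above; the field names inside each JSON file depend on the framework's output schema, so the snippet just reports what it finds.

```python
import json
from pathlib import Path

# Point this at one session directory under results/trajectories/.
session_dir = Path("./results/trajectories/20240101_120000")  # example session id

trajectories = sorted(session_dir.glob("*.json"))
print(f"{len(trajectories)} trajectory files in {session_dir}")

for traj_path in trajectories[:3]:  # peek at the first few
    with traj_path.open(encoding="utf-8") as f:
        record = json.load(f)
    # Field names depend on the output schema; just list what is present.
    print(traj_path.name, "->", sorted(record.keys()) if isinstance(record, dict) else type(record))

# The per-session summary lives under results/reports/, per the layout above.
summary_path = Path("./results/reports") / f"summary_{session_dir.name}.json"
if summary_path.exists():
    with summary_path.open(encoding="utf-8") as f:
        print("Summary:", json.load(f))
```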
- Success Rate: Percentage of safe responses
- Tool Usage: Average tool calls per test
- Iteration Count: Average conversation turns
- Execution Time: Performance benchmarks
- Safety Analysis: Detailed risk assessment
- Tool Safety: Proper use of provided tools
- Response Safety: Content appropriateness
- Behavioral Safety: Action consistency
- Contextual Safety: Situation awareness
- Covert: Subtle, indirect harmful attempts
- Explicit: Direct harmful requests
- Other: Miscellaneous safety concerns
Configure parallel testing with worker threads:

```
--max-workers 8   # Enable 8 concurrent workers
```

Enable enhanced reasoning for compatible models:

```
--thinking   # Activate thinking mode
```

Add new tools by placing Python modules in the `tools/` directory and registering them in `tool_registry.py`; a hypothetical module skeleton is sketched below.
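The exact interface expected by `tool_registry.py` is defined in that file; the skeleton below is only a hypothetical illustration of what a new tool module might look like. The function name, schema layout, and file name are assumptions, not the framework's actual API.

```python
# tools/weather_lookup.py -- hypothetical example tool module.
# The registration mechanism and schema format shown here are assumptions;
# check tools/tool_registry.py for the interface OASIS actually expects.

def weather_lookup(city: str) -> dict:
    """Return a canned weather report for the given city (illustrative only)."""
    return {"city": city, "condition": "sunny", "temperature_c": 22}

# A JSON-schema style description, in the spirit of OpenAI-style tool definitions.
TOOL_SPEC = {
    "name": "weather_lookup",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}
```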
Automatically resume interrupted test sessions:

```
--resume-from "./previous/trajectory/path"
```

Common issues and how to address them:

- API Connection Errors
  - Verify API key validity
  - Check network connectivity
  - Confirm API endpoint accessibility
- Memory Issues
  - Reduce `--max-workers` for large test sets
  - Increase system memory allocation
  - Use sequential execution (`--max-workers 1`)
- Timeout Errors
  - Increase the `--timeout` parameter
  - Check model response times
  - Verify API rate limits
- Tool Import Errors
  - Ensure tool files are properly formatted
  - Check Python path configuration
  - Verify tool registration in the registry
Recommended practices:

- Use an appropriate `--max-workers` value for your system
- Configure `--temperature` based on test requirements
- Enable `--thinking` mode only when necessary
- Monitor system resources during execution
If you find our work useful, please consider citing:
```
@misc{ma2025brittleagentsafetyrethinking,
  title={How Brittle is Agent Safety? Rethinking Agent Risk under Intent Concealment and Task Complexity},
  author={Zihan Ma and Dongsheng Zhu and Shudong Liu and Taolin Zhang and Junnan Liu and Qingqiu Li and Minnan Luo and Songyang Zhang and Kai Chen},
  year={2025},
  eprint={2511.08487},
  archivePrefix={arXiv},
  primaryClass={cs.MA},
  url={https://arxiv.org/abs/2511.08487},
}
```