@Julianblock commented Aug 26, 2025

PR: feat(monitoring): Hallucination Monitor (SC+NLI+Numeric, optional RAG) with API, CLI, HTML, tests, CI

🎯 What/Why

This PR implements a Hallucination Monitor for GPT-OSS that detects and quantifies hallucination risk in LLM outputs using multiple detection signals. The system provides both a programmatic API and a CLI, generates interactive HTML reports, and ships with extensive tests and CI integration.

Key Features

  • 5 Detection Signals: Self-Consistency, NLI Faithfulness, Numeric Sanity, Retrieval Support, and Jailbreak Heuristics
  • Configurable: Customizable weights, thresholds, and detection parameters
  • Lightweight: Runs on CPU (GPU optional), with fallback heuristics for all components
  • Deterministic: Seeded RNG for reproducible results
  • Beautiful Reports: Interactive HTML reports with highlighted spans
  • CLI Interface: Easy command-line usage with file inputs
  • Production Ready: Comprehensive testing, CI integration, and documentation

Screenshots:

Generated report (analytics and insights):

[Screenshot: generated report, 2025-08-26 3:30 PM] [Screenshot: generated report, 2025-08-26 3:31 PM]

Web interface for testing:

[Screenshot: web interface, 2025-08-26 3:37 PM]

🏗️ Architecture

gpt_oss/monitoring/
├── __init__.py                 # Main exports
├── halluci_monitor.py          # Main API and orchestration
├── config.py                   # Configuration dataclasses
├── detectors/                  # Detection modules
│   ├── self_consistency.py     # SC: k-resampling + semantic agreement
│   ├── nli_faithfulness.py     # NLI: sentence-level entailment
│   ├── numeric_sanity.py       # NS: arithmetic + unit consistency
│   ├── retrieval_support.py    # RS: context document verification
│   └── jailbreak_heuristics.py # JB: safety risk patterns
├── highlight/                  # Span highlighting utilities
│   ├── span_align.py           # Character/token span mapping
│   └── html_report.py          # HTML report generation
├── metrics/                    # Scoring and aggregation
│   └── scoring.py              # Risk score computation
└── examples/                   # Usage examples and test data

Signal Flow

Input: (prompt, completion, context_docs)
    ↓
┌─────────────────────────────────────────────────────────┐
│                    Detectors                            │
├─────────────────────────────────────────────────────────┤
│ Self-Consistency  │ NLI Faithfulness │ Numeric Sanity   │
│ (k samples)       │ (entailment)     │ (arithmetic)     │
├─────────────────────────────────────────────────────────┤
│ Retrieval Support │ Jailbreak Heuristics               │
│ (context match)   │ (safety patterns)                  │
└─────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────┐
│                    Aggregation                          │
├─────────────────────────────────────────────────────────┤
│ Weighted Risk Score = 1 - (w₁×NLI + w₂×SC + w₃×RS +   │
│                         w₄×NS + w₅×(1-JB))            │
└─────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────┐
│                    Output                               │
├─────────────────────────────────────────────────────────┤
│ • Risk Score [0,1]                                      │
│ • Risk Level (low/medium/high)                          │
│ • Individual Signal Scores                              │
│ • Highlighted Spans                                     │
│ • HTML Report (optional)                                │
└─────────────────────────────────────────────────────────┘

🔧 Detection Signals

1. Self-Consistency (SC) - Weight: 0.25

  • Purpose: Assess semantic consistency across multiple generations
  • Method: Generate k samples, compute pairwise cosine similarity using sentence embeddings
  • Implementation: Uses sentence-transformers/all-MiniLM-L6-v2 with character-based fallback
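
For illustration, a minimal sketch of this agreement computation, assuming sentence-transformers is installed (the function name and signature are hypothetical, not the module's actual API):

# Hypothetical sketch of the SC agreement score, not the PR's actual code.
from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer  # optional dependency

_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def self_consistency_score(samples: list[str]) -> float:
    """Mean pairwise cosine similarity across k sampled completions."""
    if len(samples) < 2:
        return 1.0  # a single sample is trivially self-consistent
    emb = _model.encode(samples, normalize_embeddings=True)
    return float(np.mean([np.dot(emb[i], emb[j])
                          for i, j in combinations(range(len(emb)), 2)]))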

2. NLI Faithfulness (NLI) - Weight: 0.35

  • Purpose: Check if completion sentences are entailed by prompt + context
  • Method: Split completion into sentences, use NLI model for entailment probability
  • Implementation: Uses lightweight NLI model with word overlap fallback
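
A rough sketch of that word-overlap fallback (names are illustrative, not the module's actual code):

import re

def entailment_fallback(premise: str, hypothesis: str) -> float:
    """Crude proxy for entailment: fraction of hypothesis words found in the premise."""
    words = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    hyp = words(hypothesis)
    if not hyp:
        return 1.0  # an empty hypothesis is vacuously supported
    return len(hyp & words(premise)) / len(hyp)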

3. Numeric Sanity (NS) - Weight: 0.15

  • Purpose: Detect arithmetic and unit conversion errors
  • Method: Extract numbers/units, check conversions (km↔mi, kg↔lb, °C↔°F), verify arithmetic
  • Implementation: Comprehensive unit conversion library with tolerance-based checking
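
As an example, a km↔mi claim can be checked against a tolerance like this sketch (the constant and function names are assumptions, not the PR's actual code):

KM_PER_MILE = 1.609344  # exact by international definition

def km_mi_conversion_ok(km: float, miles: float, rel_tol: float = 0.05) -> bool:
    """True if the stated mile value is within rel_tol of the exact conversion."""
    expected_miles = km / KM_PER_MILE
    return abs(miles - expected_miles) <= rel_tol * expected_miles

assert km_mi_conversion_ok(100, 62.1)       # ~62.14 mi: passes
assert not km_mi_conversion_ok(100, 70.0)   # off by >5%: flagged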

4. Retrieval Support (RS) - Weight: 0.20

  • Purpose: Verify claims against provided context documents
  • Method: Chunk context documents, compute similarity with completion sentences
  • Implementation: Sentence transformers for semantic similarity with lexical fallback
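
A minimal sketch of the lexical fallback path, assuming each completion sentence is scored against its best-matching context document (names are hypothetical):

import re

def _words(s: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def retrieval_support_score(sentences: list[str], context_docs: list[str]) -> float:
    """Mean best-match word overlap between each sentence and any context doc."""
    if not sentences or not context_docs:
        return 0.0
    doc_words = [_words(d) for d in context_docs]
    per_sentence = []
    for sent in sentences:
        sw = _words(sent)
        best = max((len(sw & dw) / len(sw) for dw in doc_words), default=0.0) if sw else 0.0
        per_sentence.append(best)
    return sum(per_sentence) / len(per_sentence)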

5. Jailbreak Heuristics (JB) - Weight: 0.05

  • Purpose: Detect potential safety risks and jailbreak attempts
  • Method: Pattern matching, keyword analysis, formatting heuristics
  • Implementation: Regex patterns and keyword lists for risk assessment
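
A few illustrative patterns of the kind this detector matches (the PR's actual pattern lists are more extensive; this sketch is not the module's code):

import re

JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all |any )?previous instructions", re.I),
    re.compile(r"\bdo anything now\b|\bDAN mode\b", re.I),
    re.compile(r"pretend (you are|to be) .* without (any )?restrictions", re.I),
]

def jailbreak_score(text: str) -> float:
    """Fraction of risk patterns that fire on the text (0.0 = no hits)."""
    hits = sum(bool(p.search(text)) for p in JAILBREAK_PATTERNS)
    return hits / len(JAILBREAK_PATTERNS)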

📊 Scoring Algorithm

risk_score = 1 - (w₁×NLI + w₂×SC + w₃×RS + w₄×NS + w₅×(1-JB))

Risk Level Classification:

  • Low: risk_score < 0.4
  • Medium: 0.4 ≤ risk_score < 0.7
  • High: risk_score ≥ 0.7
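
The aggregation is a direct transcription of the formula above; the weight keys below mirror the default weights listed in this PR, though the actual config field names may differ:

DEFAULT_WEIGHTS = {"nli": 0.35, "sc": 0.25, "rs": 0.20, "ns": 0.15, "jb": 0.05}

def risk_score(nli: float, sc: float, rs: float, ns: float, jb: float,
               w: dict = DEFAULT_WEIGHTS) -> float:
    """All inputs are in [0,1]; jb is a risk signal, so it enters as (1 - jb)."""
    support = (w["nli"] * nli + w["sc"] * sc + w["rs"] * rs
               + w["ns"] * ns + w["jb"] * (1 - jb))
    return 1.0 - support

def risk_level(score: float, high: float = 0.7, medium: float = 0.4) -> str:
    return "high" if score >= high else ("medium" if score >= medium else "low")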

🚀 Usage Examples

CLI Usage

# Basic usage
gpt-oss-monitor --prompt prompt.txt --completion output.txt

# With context documents and HTML report
gpt-oss-monitor --prompt prompt.txt --completion output.txt --contexts ctx1.txt ctx2.txt --html

# Custom configuration
gpt-oss-monitor --prompt prompt.txt --completion output.txt --k 10 --temperature 0.8 --html

Python API

from gpt_oss.monitoring import HallucinationMonitor

# Initialize monitor
monitor = HallucinationMonitor()

# Evaluate a completion
results = monitor.evaluate(
    prompt="What is the capital of France?",
    completion="Paris is the capital of France with 2.2 million people.",
    context_docs=["Paris is the capital and most populous city of France."]
)

print(f"Risk Level: {results['risk_level']}")
print(f"Risk Score: {results['risk_score']:.3f}")

Custom Configuration

from gpt_oss.monitoring import HallucinationMonitor, MonitorConfig, MonitorThresholds

config = MonitorConfig(
    k_samples=10,
    temperature=0.8,
    enable_retrieval_support=True,
    enable_jailbreak_heuristics=True,
    thresholds=MonitorThresholds(high=0.8, medium=0.5),
    html_report=True,
    report_dir="my_reports"
)

monitor = HallucinationMonitor(config)

📁 Files Added/Modified

New Files

  • gpt_oss/monitoring/__init__.py - Main exports
  • gpt_oss/monitoring/halluci_monitor.py - Main API and orchestration
  • gpt_oss/monitoring/config.py - Configuration dataclasses
  • gpt_oss/monitoring/detectors/__init__.py - Detector exports
  • gpt_oss/monitoring/detectors/self_consistency.py - Self-consistency detector
  • gpt_oss/monitoring/detectors/nli_faithfulness.py - NLI faithfulness detector
  • gpt_oss/monitoring/detectors/numeric_sanity.py - Numeric sanity detector
  • gpt_oss/monitoring/detectors/retrieval_support.py - Retrieval support detector
  • gpt_oss/monitoring/detectors/jailbreak_heuristics.py - Jailbreak heuristics detector
  • gpt_oss/monitoring/highlight/__init__.py - Highlight utilities exports
  • gpt_oss/monitoring/highlight/span_align.py - Span alignment utilities
  • gpt_oss/monitoring/highlight/html_report.py - HTML report generator
  • gpt_oss/monitoring/metrics/__init__.py - Metrics exports
  • gpt_oss/monitoring/metrics/scoring.py - Risk score computation
  • gpt_oss/monitoring/__main__.py - CLI entry point
  • gpt_oss/monitoring/requirements-monitor.txt - Monitoring dependencies
  • gpt_oss/monitoring/examples/README.md - Usage examples
  • gpt_oss/monitoring/examples/truthfulqa_mini.jsonl - Test data
  • gpt_oss/monitoring/examples/fever_mini.jsonl - Test data
  • tests/monitoring/test_monitor_basic.py - Basic tests
  • tests/monitoring/test_numeric_sanity.py - Numeric sanity tests
  • tests/monitoring/test_nli_faithfulness.py - NLI faithfulness tests
  • tests/monitoring/test_self_consistency.py - Self-consistency tests
  • .github/workflows/ci-monitoring.yml - CI workflow
  • docs/monitoring_design.md - Design documentation

Modified Files

  • pyproject.toml - Added monitoring dependencies and CLI entry point
  • README.md - Added Hallucination Monitor section

🧪 Testing

Test Coverage

  • Unit Tests: Individual detector functionality
  • Integration Tests: End-to-end monitoring pipeline
  • Configuration Tests: Config validation and serialization
  • CLI Tests: Command-line interface functionality

Test Categories

  • test_monitor_basic.py - Core functionality and configuration
  • test_numeric_sanity.py - Numeric detection and unit conversions
  • test_nli_faithfulness.py - NLI detection and entailment
  • test_self_consistency.py - Self-consistency and similarity

CI Integration

  • GitHub Actions workflow for monitoring tests
  • Python 3.10, 3.11, 3.12 matrix testing
  • Dependency caching for faster builds
  • Fast test suite for CI, full suite for development

📦 Dependencies

Core Dependencies

  • numpy>=1.21.0 - Numerical computations
  • scipy>=1.7.0 - Scientific computing
  • regex>=2021.0.0 - Regular expressions
  • tqdm>=4.62.0 - Progress bars

NLP/ML Dependencies (Optional)

  • sentence-transformers>=2.2.0 - Semantic similarity
  • transformers>=4.20.0 - NLI models
  • torch>=1.12.0 - PyTorch backend
  • jinja2>=3.0.0 - HTML template rendering

Installation

# Install with monitoring dependencies; this also registers the
# 'gpt-oss-monitor' CLI entry point
pip install -e ".[monitoring]"

🎨 HTML Reports

The system generates interactive, self-contained HTML reports with:

  • Risk Assessment Summary: Visual risk score and level
  • Signal Breakdown: Individual detector scores with descriptions
  • Highlighted Text: Color-coded spans showing issues
  • Detailed Analysis: Collapsible sections for each detector
  • Configuration Display: All settings used for the evaluation
  • Responsive Design: Works on desktop and mobile

Report Features

  • Color Coding: Green (safe), Yellow (medium risk), Red (high risk), Purple (critical)
  • Interactive Elements: Collapsible sections, hover tooltips
  • Auto-expand: High-risk sections automatically expanded
  • Export Ready: Self-contained HTML files

🔧 Configuration

MonitorConfig Options

  • k_samples: Number of samples for self-consistency (default: 5)
  • temperature: Generation temperature (default: 0.7)
  • max_new_tokens: Maximum tokens for generation (default: 512)
  • enable_retrieval_support: Enable retrieval support detection (default: True)
  • enable_jailbreak_heuristics: Enable jailbreak detection (default: True)
  • thresholds: Risk level thresholds (default: high=0.7, medium=0.4)
  • weights: Signal weights (configurable)
  • html_report: Generate HTML reports (default: True)
  • report_dir: Output directory for reports (default: "runs")

Customization Points

  1. Weights: Adjust signal importance for your use case
  2. Thresholds: Modify risk level boundaries
  3. Models: Replace default models with custom ones
  4. Features: Enable/disable specific detectors

🚀 Performance

Optimization Strategies

  • Lazy Loading: Models loaded only when needed
  • Fallback Mechanisms: Graceful degradation when models unavailable
  • Caching: Embedding and similarity caching
  • Parallelization: Independent detector execution

Resource Requirements

  • CPU: Lightweight fallback mode available
  • GPU: Optional for faster inference
  • Memory: ~2GB for full model loading
  • Storage: ~500MB for model downloads

🔮 Future Enhancements

Planned Features

  1. Temporal Consistency: Check temporal claim validity
  2. Entropy-based Uncertainty: Model confidence estimation
  3. LLM-as-Judge: Use LLM to evaluate other LLM outputs
  4. Multi-language Support: Extend to other languages
  5. Custom Detectors: Plugin architecture for custom signals

Extension Points

  • Custom Detectors: Implement new detection signals
  • Model Adapters: Support additional model types
  • Report Formats: Add new output formats
  • Integration: Connect with existing systems

📋 Checklist

Implementation

  • All 5 detection signals implemented
  • Configurable weights and thresholds
  • Fallback mechanisms for all components
  • HTML report generation
  • CLI interface with comprehensive options
  • Python API with full type hints
  • Span highlighting and alignment
  • Deterministic results with seeded RNG

Testing

  • Unit tests for all detectors
  • Integration tests for full pipeline
  • Configuration validation tests
  • CLI functionality tests
  • CI workflow integration
  • Test data and examples

Documentation

  • Comprehensive docstrings
  • Design documentation
  • Usage examples and tutorials
  • README integration
  • API documentation
  • CLI help and examples

Production Readiness

  • Error handling and graceful degradation
  • Logging and debugging support
  • Performance optimization
  • Security considerations
  • Dependency management
  • Version compatibility

🎯 Impact

This Hallucination Monitor provides:

  1. Immediate Value: Production-ready hallucination detection for GPT-OSS
  2. Research Foundation: Extensible framework for hallucination research
  3. Developer Experience: Easy-to-use API and CLI for integration
  4. Quality Assurance: Automated detection of potential issues
  5. Transparency: Detailed reports and explanations for all detections

The system is designed to be:

  • Lightweight: Minimal dependencies with fallbacks
  • Configurable: Adaptable to different use cases
  • Extensible: Easy to add new detection methods
  • Production-Ready: Comprehensive testing and documentation

🔗 Related

  • Design Document: docs/monitoring_design.md
  • Examples: gpt_oss/monitoring/examples/README.md
  • Tests: tests/monitoring/
  • CI: .github/workflows/ci-monitoring.yml

Ready for Review

This PR implements a complete, production-ready Hallucination Monitor for GPT-OSS with comprehensive testing, documentation, and CI integration. The system provides both programmatic and command-line interfaces, generates interactive HTML reports, and includes extensive customization options.

@Julianblock changed the title from "Feature/add hallucination monitor" to "Feature - Add hallucination monitor to gpt-oss" on Aug 26, 2025
…ations

- Beautiful gauge charts, radar plots, and bar charts
- Interactive configuration sliders for all parameters
- Quick example buttons for instant testing
- Real-time analysis with color-coded risk highlighting
- HTML report generation and download
- Responsive design with gradient styling
- Dependency checking and graceful fallbacks
- Remove emojis and casual styling for professional appearance
- Add GPT-OSS logo and OpenAI-inspired color scheme
- Clean, minimal design with proper typography and spacing
- Professional risk indicators and signal cards
- Improved mobile responsiveness
- Better completion highlighting with proper spans display
- Fix HTML report generation issues
- Add comprehensive demo script in gpt_oss/monitoring/demo/
- Demo showcases all features: basic usage, detection signals, reports, web interface, CLI, advanced features
- Professional HTML reports with beautiful design
- Complete documentation and examples
- Add actual GPT-OSS SVG logo to header
- Update color scheme to a blue gradient (#667eea to #764ba2)
- Beautiful gradient headers in web interface and HTML reports
- Enhanced visual styling with shadows and hover effects
- Professional branding throughout the interface
- Update Python version requirement to support 3.13
- Consistent blue gradient theme across all components
- Remove temporary demo scripts from root directory
- Add comprehensive .gitignore for demo files, pycache, and virtual environments
- Enhance README with detailed demo instructions and web interface guide
- Add professional demo notes and usage examples
- Ensure clean, production-ready codebase for PR