@Julianblock commented Aug 26, 2025

PR: feat(monitoring): Hallucination Monitor (SC+NLI+Numeric, optional RAG) with API, CLI, HTML, tests, CI

🎯 What/Why

This PR implements a Hallucination Monitor for GPT-OSS that detects and quantifies hallucination risk in LLM outputs using multiple detection signals. The system provides both a programmatic API and a CLI, generates interactive HTML reports, and ships with extensive tests and CI integration.

Key Features

  • 5 Detection Signals: Self-Consistency, NLI Faithfulness, Numeric Sanity, Retrieval Support, and Jailbreak Heuristics
  • Configurable: Customizable weights, thresholds, and detection parameters
  • Lightweight: Runs on CPU (GPU optional), with fallback heuristics for all components
  • Deterministic: Seeded RNG for reproducible results
  • Beautiful Reports: Interactive HTML reports with highlighted spans
  • CLI Interface: Easy command-line usage with file inputs
  • Production Ready: Comprehensive testing, CI integration, and documentation

Screenshots:

Generated report (analytics and insights):

[Screenshot: generated report, 2025-08-26 3:30 PM] [Screenshot: generated report, 2025-08-26 3:31 PM]

Web interface for testing:

[Screenshot: web interface, 2025-08-26 3:37 PM]

🏗️ Architecture

gpt_oss/monitoring/
├── __init__.py                 # Main exports
├── halluci_monitor.py          # Main API and orchestration
├── config.py                   # Configuration dataclasses
├── detectors/                  # Detection modules
│   ├── self_consistency.py     # SC: k-resampling + semantic agreement
│   ├── nli_faithfulness.py     # NLI: sentence-level entailment
│   ├── numeric_sanity.py       # NS: arithmetic + unit consistency
│   ├── retrieval_support.py    # RS: context document verification
│   └── jailbreak_heuristics.py # JB: safety risk patterns
├── highlight/                  # Span highlighting utilities
│   ├── span_align.py           # Character/token span mapping
│   └── html_report.py          # HTML report generation
├── metrics/                    # Scoring and aggregation
│   └── scoring.py              # Risk score computation
└── examples/                   # Usage examples and test data

Signal Flow

Input: (prompt, completion, context_docs)
    ↓
┌─────────────────────────────────────────────────────────┐
│                    Detectors                            │
├─────────────────────────────────────────────────────────┤
│ Self-Consistency  │ NLI Faithfulness │ Numeric Sanity   │
│ (k samples)       │ (entailment)     │ (arithmetic)     │
├─────────────────────────────────────────────────────────┤
│ Retrieval Support │ Jailbreak Heuristics               │
│ (context match)   │ (safety patterns)                  │
└─────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────┐
│                    Aggregation                          │
├─────────────────────────────────────────────────────────┤
│ Weighted Risk Score = 1 - (w₁×NLI + w₂×SC + w₃×RS +   │
│                         w₄×NS + w₅×(1-JB))            │
└─────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────┐
│                    Output                               │
├─────────────────────────────────────────────────────────┤
│ • Risk Score [0,1]                                      │
│ • Risk Level (low/medium/high)                          │
│ • Individual Signal Scores                              │
│ • Highlighted Spans                                     │
│ • HTML Report (optional)                                │
└─────────────────────────────────────────────────────────┘

🔧 Detection Signals

1. Self-Consistency (SC) - Weight: 0.25

  • Purpose: Assess semantic consistency across multiple generations
  • Method: Generate k samples, compute pairwise cosine similarity using sentence embeddings
  • Implementation: Uses sentence-transformers/all-MiniLM-L6-v2 with character-based fallback
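
For illustration, a minimal sketch of this agreement computation, assuming sentence-transformers is installed (the function name and signature are hypothetical, not the module's actual API):

# Hypothetical sketch of the SC agreement score, not the PR's actual code.
from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer  # optional dependency

_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def self_consistency_score(samples: list[str]) -> float:
    """Mean pairwise cosine similarity across k sampled completions."""
    if len(samples) < 2:
        return 1.0  # a single sample is trivially self-consistent
    emb = _model.encode(samples, normalize_embeddings=True)
    return float(np.mean([np.dot(emb[i], emb[j])
                          for i, j in combinations(range(len(emb)), 2)]))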

2. NLI Faithfulness (NLI) - Weight: 0.35

  • Purpose: Check if completion sentences are entailed by prompt + context
  • Method: Split completion into sentences, use NLI model for entailment probability
  • Implementation: Uses lightweight NLI model with word overlap fallback
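
A rough sketch of that word-overlap fallback (names are illustrative, not the module's actual code):

import re

def entailment_fallback(premise: str, hypothesis: str) -> float:
    """Crude proxy for entailment: fraction of hypothesis words found in the premise."""
    words = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    hyp = words(hypothesis)
    if not hyp:
        return 1.0  # an empty hypothesis is vacuously supported
    return len(hyp & words(premise)) / len(hyp)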

3. Numeric Sanity (NS) - Weight: 0.15

  • Purpose: Detect arithmetic and unit conversion errors
  • Method: Extract numbers/units, check conversions (km↔mi, kg↔lb, °C↔°F), verify arithmetic
  • Implementation: Comprehensive unit conversion library with tolerance-based checking
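
As an example, a km↔mi claim can be checked against a tolerance like this sketch (the constant and function names are assumptions, not the PR's actual code):

KM_PER_MILE = 1.609344  # exact by international definition

def km_mi_conversion_ok(km: float, miles: float, rel_tol: float = 0.05) -> bool:
    """True if the stated mile value is within rel_tol of the exact conversion."""
    expected_miles = km / KM_PER_MILE
    return abs(miles - expected_miles) <= rel_tol * expected_miles

assert km_mi_conversion_ok(100, 62.1)       # ~62.14 mi: passes
assert not km_mi_conversion_ok(100, 70.0)   # off by >5%: flagged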

4. Retrieval Support (RS) - Weight: 0.20

  • Purpose: Verify claims against provided context documents
  • Method: Chunk context documents, compute similarity with completion sentences
  • Implementation: Sentence transformers for semantic similarity with lexical fallback
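
A minimal sketch of the lexical fallback path, assuming each completion sentence is scored against its best-matching context document (names are hypothetical):

import re

def _words(s: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def retrieval_support_score(sentences: list[str], context_docs: list[str]) -> float:
    """Mean best-match word overlap between each sentence and any context doc."""
    if not sentences or not context_docs:
        return 0.0
    doc_words = [_words(d) for d in context_docs]
    per_sentence = []
    for sent in sentences:
        sw = _words(sent)
        best = max((len(sw & dw) / len(sw) for dw in doc_words), default=0.0) if sw else 0.0
        per_sentence.append(best)
    return sum(per_sentence) / len(per_sentence)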

5. Jailbreak Heuristics (JB) - Weight: 0.05

  • Purpose: Detect potential safety risks and jailbreak attempts
  • Method: Pattern matching, keyword analysis, formatting heuristics
  • Implementation: Regex patterns and keyword lists for risk assessment
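
A few illustrative patterns of the kind this detector matches (the PR's actual pattern lists are more extensive; this sketch is not the module's code):

import re

JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all |any )?previous instructions", re.I),
    re.compile(r"\bdo anything now\b|\bDAN mode\b", re.I),
    re.compile(r"pretend (you are|to be) .* without (any )?restrictions", re.I),
]

def jailbreak_score(text: str) -> float:
    """Fraction of risk patterns that fire on the text (0.0 = no hits)."""
    hits = sum(bool(p.search(text)) for p in JAILBREAK_PATTERNS)
    return hits / len(JAILBREAK_PATTERNS)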

📊 Scoring Algorithm

risk_score = 1 - (w₁×NLI + w₂×SC + w₃×RS + w₄×NS + w₅×(1-JB))

Risk Level Classification:

  • Low: risk_score < 0.4
  • Medium: 0.4 ≤ risk_score < 0.7
  • High: risk_score ≥ 0.7
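
The aggregation is a direct transcription of the formula above; the weight keys below mirror the default weights listed in this PR, though the actual config field names may differ:

DEFAULT_WEIGHTS = {"nli": 0.35, "sc": 0.25, "rs": 0.20, "ns": 0.15, "jb": 0.05}

def risk_score(nli: float, sc: float, rs: float, ns: float, jb: float,
               w: dict = DEFAULT_WEIGHTS) -> float:
    """All inputs are in [0,1]; jb is a risk signal, so it enters as (1 - jb)."""
    support = (w["nli"] * nli + w["sc"] * sc + w["rs"] * rs
               + w["ns"] * ns + w["jb"] * (1 - jb))
    return 1.0 - support

def risk_level(score: float, high: float = 0.7, medium: float = 0.4) -> str:
    return "high" if score >= high else ("medium" if score >= medium else "low")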

🚀 Usage Examples

CLI Usage

# Basic usage
gpt-oss-monitor --prompt prompt.txt --completion output.txt

# With context documents and HTML report
gpt-oss-monitor --prompt prompt.txt --completion output.txt --contexts ctx1.txt ctx2.txt --html

# Custom configuration
gpt-oss-monitor --prompt prompt.txt --completion output.txt --k 10 --temperature 0.8 --html

Python API

from gpt_oss.monitoring import HallucinationMonitor

# Initialize monitor
monitor = HallucinationMonitor()

# Evaluate a completion
results = monitor.evaluate(
    prompt="What is the capital of France?",
    completion="Paris is the capital of France with 2.2 million people.",
    context_docs=["Paris is the capital and most populous city of France."]
)

print(f"Risk Level: {results['risk_level']}")
print(f"Risk Score: {results['risk_score']:.3f}")

Custom Configuration

from gpt_oss.monitoring import HallucinationMonitor, MonitorConfig, MonitorThresholds

config = MonitorConfig(
    k_samples=10,
    temperature=0.8,
    enable_retrieval_support=True,
    enable_jailbreak_heuristics=True,
    thresholds=MonitorThresholds(high=0.8, medium=0.5),
    html_report=True,
    report_dir="my_reports"
)

monitor = HallucinationMonitor(config)

📁 Files Added/Modified

New Files

  • gpt_oss/monitoring/__init__.py - Main exports
  • gpt_oss/monitoring/halluci_monitor.py - Main API and orchestration
  • gpt_oss/monitoring/config.py - Configuration dataclasses
  • gpt_oss/monitoring/detectors/__init__.py - Detector exports
  • gpt_oss/monitoring/detectors/self_consistency.py - Self-consistency detector
  • gpt_oss/monitoring/detectors/nli_faithfulness.py - NLI faithfulness detector
  • gpt_oss/monitoring/detectors/numeric_sanity.py - Numeric sanity detector
  • gpt_oss/monitoring/detectors/retrieval_support.py - Retrieval support detector
  • gpt_oss/monitoring/detectors/jailbreak_heuristics.py - Jailbreak heuristics detector
  • gpt_oss/monitoring/highlight/__init__.py - Highlight utilities exports
  • gpt_oss/monitoring/highlight/span_align.py - Span alignment utilities
  • gpt_oss/monitoring/highlight/html_report.py - HTML report generator
  • gpt_oss/monitoring/metrics/__init__.py - Metrics exports
  • gpt_oss/monitoring/metrics/scoring.py - Risk score computation
  • gpt_oss/monitoring/__main__.py - CLI entry point
  • gpt_oss/monitoring/requirements-monitor.txt - Monitoring dependencies
  • gpt_oss/monitoring/examples/README.md - Usage examples
  • gpt_oss/monitoring/examples/truthfulqa_mini.jsonl - Test data
  • gpt_oss/monitoring/examples/fever_mini.jsonl - Test data
  • tests/monitoring/test_monitor_basic.py - Basic tests
  • tests/monitoring/test_numeric_sanity.py - Numeric sanity tests
  • tests/monitoring/test_nli_faithfulness.py - NLI faithfulness tests
  • tests/monitoring/test_self_consistency.py - Self-consistency tests
  • .github/workflows/ci-monitoring.yml - CI workflow
  • docs/monitoring_design.md - Design documentation

Modified Files

  • pyproject.toml - Added monitoring dependencies and CLI entry point
  • README.md - Added Hallucination Monitor section

🧪 Testing

Test Coverage

  • Unit Tests: Individual detector functionality
  • Integration Tests: End-to-end monitoring pipeline
  • Configuration Tests: Config validation and serialization
  • CLI Tests: Command-line interface functionality

Test Categories

  • test_monitor_basic.py - Core functionality and configuration
  • test_numeric_sanity.py - Numeric detection and unit conversions
  • test_nli_faithfulness.py - NLI detection and entailment
  • test_self_consistency.py - Self-consistency and similarity

CI Integration

  • GitHub Actions workflow for monitoring tests
  • Python 3.10, 3.11, 3.12 matrix testing
  • Dependency caching for faster builds
  • Fast test suite for CI, full suite for development

📦 Dependencies

Core Dependencies

  • numpy>=1.21.0 - Numerical computations
  • scipy>=1.7.0 - Scientific computing
  • regex>=2021.0.0 - Regular expressions
  • tqdm>=4.62.0 - Progress bars

NLP/ML Dependencies (Optional)

  • sentence-transformers>=2.2.0 - Semantic similarity
  • transformers>=4.20.0 - NLI models
  • torch>=1.12.0 - PyTorch backend
  • jinja2>=3.0.0 - HTML template rendering

Installation

# Install with monitoring dependencies; this also registers the
# 'gpt-oss-monitor' CLI entry point
pip install -e ".[monitoring]"

🎨 HTML Reports

The system generates interactive, self-contained HTML reports with:

  • Risk Assessment Summary: Visual risk score and level
  • Signal Breakdown: Individual detector scores with descriptions
  • Highlighted Text: Color-coded spans showing issues
  • Detailed Analysis: Collapsible sections for each detector
  • Configuration Display: All settings used for the evaluation
  • Responsive Design: Works on desktop and mobile

Report Features

  • Color Coding: Green (safe), Yellow (medium risk), Red (high risk), Purple (critical)
  • Interactive Elements: Collapsible sections, hover tooltips
  • Auto-expand: High-risk sections automatically expanded
  • Export Ready: Self-contained HTML files

🔧 Configuration

MonitorConfig Options

  • k_samples: Number of samples for self-consistency (default: 5)
  • temperature: Generation temperature (default: 0.7)
  • max_new_tokens: Maximum tokens for generation (default: 512)
  • enable_retrieval_support: Enable retrieval support detection (default: True)
  • enable_jailbreak_heuristics: Enable jailbreak detection (default: True)
  • thresholds: Risk level thresholds (default: high=0.7, medium=0.4)
  • weights: Signal weights (configurable)
  • html_report: Generate HTML reports (default: True)
  • report_dir: Output directory for reports (default: "runs")

Customization Points

  1. Weights: Adjust signal importance for your use case
  2. Thresholds: Modify risk level boundaries
  3. Models: Replace default models with custom ones
  4. Features: Enable/disable specific detectors

🚀 Performance

Optimization Strategies

  • Lazy Loading: Models loaded only when needed
  • Fallback Mechanisms: Graceful degradation when models unavailable
  • Caching: Embedding and similarity caching
  • Parallelization: Independent detector execution

Resource Requirements

  • CPU: Lightweight fallback mode available
  • GPU: Optional for faster inference
  • Memory: ~2GB for full model loading
  • Storage: ~500MB for model downloads

🔮 Future Enhancements

Planned Features

  1. Temporal Consistency: Check temporal claim validity
  2. Entropy-based Uncertainty: Model confidence estimation
  3. LLM-as-Judge: Use LLM to evaluate other LLM outputs
  4. Multi-language Support: Extend to other languages
  5. Custom Detectors: Plugin architecture for custom signals

Extension Points

  • Custom Detectors: Implement new detection signals
  • Model Adapters: Support additional model types
  • Report Formats: Add new output formats
  • Integration: Connect with existing systems

📋 Checklist

Implementation

  • All 5 detection signals implemented
  • Configurable weights and thresholds
  • Fallback mechanisms for all components
  • HTML report generation
  • CLI interface with comprehensive options
  • Python API with full type hints
  • Span highlighting and alignment
  • Deterministic results with seeded RNG

Testing

  • Unit tests for all detectors
  • Integration tests for full pipeline
  • Configuration validation tests
  • CLI functionality tests
  • CI workflow integration
  • Test data and examples

Documentation

  • Comprehensive docstrings
  • Design documentation
  • Usage examples and tutorials
  • README integration
  • API documentation
  • CLI help and examples

Production Readiness

  • Error handling and graceful degradation
  • Logging and debugging support
  • Performance optimization
  • Security considerations
  • Dependency management
  • Version compatibility

🎯 Impact

This Hallucination Monitor provides:

  1. Immediate Value: Production-ready hallucination detection for GPT-OSS
  2. Research Foundation: Extensible framework for hallucination research
  3. Developer Experience: Easy-to-use API and CLI for integration
  4. Quality Assurance: Automated detection of potential issues
  5. Transparency: Detailed reports and explanations for all detections

The system is designed to be:

  • Lightweight: Minimal dependencies with fallbacks
  • Configurable: Adaptable to different use cases
  • Extensible: Easy to add new detection methods
  • Production-Ready: Comprehensive testing and documentation

🔗 Related

  • Design Document: docs/monitoring_design.md
  • Examples: gpt_oss/monitoring/examples/README.md
  • Tests: tests/monitoring/
  • CI: .github/workflows/ci-monitoring.yml

Ready for Review

This PR implements a complete, production-ready Hallucination Monitor for GPT-OSS with comprehensive testing, documentation, and CI integration. The system provides both programmatic and command-line interfaces, generates interactive HTML reports, and includes extensive customization options.

@Julianblock changed the title from "Feature/add hallucination monitor" to "Feature - Add hallucination monitor to gpt-oss" on Aug 26, 2025
…ations

- Beautiful gauge charts, radar plots, and bar charts
- Interactive configuration sliders for all parameters
- Quick example buttons for instant testing
- Real-time analysis with color-coded risk highlighting
- HTML report generation and download
- Responsive design with gradient styling
- Dependency checking and graceful fallbacks
- Remove emojis and casual styling for professional appearance
- Add GPT-OSS logo and OpenAI-inspired color scheme
- Clean, minimal design with proper typography and spacing
- Professional risk indicators and signal cards
- Improved mobile responsiveness
- Better completion highlighting with proper spans display
- Fix HTML report generation issues
- Add comprehensive demo script in gpt_oss/monitoring/demo/
- Demo showcases all features: basic usage, detection signals, reports, web interface, CLI, advanced features
- Professional HTML reports with beautiful design
- Complete documentation and examples
- Add actual GPT-OSS SVG logo to header
- Update color scheme to a blue gradient (#667eea to #764ba2)
- Beautiful gradient headers in web interface and HTML reports
- Enhanced visual styling with shadows and hover effects
- Professional branding throughout the interface
- Update Python version requirement to support 3.13
- Consistent blue gradient theme across all components
- Remove temporary demo scripts from root directory
- Add comprehensive .gitignore for demo files, pycache, and virtual environments
- Enhance README with detailed demo instructions and web interface guide
- Add professional demo notes and usage examples
- Ensure clean, production-ready codebase for PR