
Add local models via vLLM #72

@MHindermann

Description


Plan: Add Local Model Support for vLLM Integration

Overview

Add support for running benchmarks on local models hosted via vLLM (or other OpenAI-compatible servers such as Ollama and LM Studio). This will enable testing local models alongside cloud providers using the same benchmark infrastructure.

Implementation Strategy

Use the OpenAI-compatible API pattern (same approach as sciCORE) to minimize code changes and maximize compatibility.


Step 1: Add 'local' to Supported APIs

File: scripts/simple_ai_clients.py

Action: Update SUPPORTED_APIS list (lines 18-23)

SUPPORTED_APIS = ['openai',
                  'genai',
                  'anthropic',
                  'mistral',
                  'openrouter',
                  'scicore',
                  'local']  # Add this

Step 2: Implement Client Initialization

File: scripts/simple_ai_clients.py, init_client() method (after line 83)

Action: Add initialization block for local provider

if self.api == 'local':
    import os
    # Base URL of the local OpenAI-compatible server (defaults to vLLM's standard port)
    base_url = os.getenv('LOCAL_API_URL', 'http://localhost:8000/v1')
    # Most local servers accept any placeholder string as the API key
    api_key = self.api_key if self.api_key else 'not-needed'
    self.api_client = OpenAI(
        base_url=base_url,
        api_key=api_key,
    )

Notes:

  • Reads base URL from environment variable (default: vLLM's standard port)
  • API key optional (many local servers don't require authentication)
  • Works with vLLM, Ollama, LM Studio, etc.
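
For reference, a minimal standalone check that the configured base URL and key handling work, independent of the client class. This is a sketch that assumes the openai v1 SDK and a running vLLM server; it is not part of the plan's code:

    # Sketch: verify a local OpenAI-compatible server is reachable.
    # Assumes the LOCAL_API_URL / LOCAL_API_KEY variables from Step 6.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url=os.getenv("LOCAL_API_URL", "http://localhost:8000/v1"),
        api_key=os.getenv("LOCAL_API_KEY") or "not-needed",
    )

    # vLLM exposes the models it serves under /v1/models.
    print([m.id for m in client.models.list().data])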

Step 3: Implement Prompt Method

File: scripts/simple_ai_clients.py, prompt() method (after line 273)

Action: Add request handling for local provider

if self.api == 'local':
    workload_json = [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
        ]
    },
    {
        "role": "system",
        "content": self.gpt_role_description
    }]

    for img_path in self.image_resources:
        with open(img_path, "rb") as image_file:
            base64_image = base64.b64encode(image_file.read()).decode("utf-8")
        workload_json[0]['content'].append(
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
        )

    kwargs = {
        "messages": workload_json,
        "model": model,
        "temperature": self.temperature,
        "max_tokens": 16000,
    }

    if self.dataclass:
        # Try strict mode first, fall back to json_object mode if that fails
        schema = self.dataclass.model_json_schema()
        try:
            kwargs_strict = kwargs.copy()
            kwargs_strict["response_format"] = {
                "type": "json_schema",
                "json_schema": {
                    "name": self.dataclass.__name__,
                    "strict": True,
                    "schema": schema
                }
            }
            chat_completion = self.api_client.chat.completions.create(**kwargs_strict)
            answer = chat_completion
        except Exception as e:
            logging.warning(f"Local model strict mode failed, falling back to json_object mode: {e}")
            kwargs["response_format"] = {"type": "json_object"}
            schema_prompt = f"\n\nYou MUST respond with valid JSON matching this exact schema: {json.dumps(schema)}"
            kwargs["messages"][1]["content"] = self.gpt_role_description + schema_prompt
            chat_completion = self.api_client.chat.completions.create(**kwargs)
            answer = chat_completion
    else:
        chat_completion = self.api_client.chat.completions.create(**kwargs)
        answer = chat_completion

Notes:

  • Uses OpenAI-compatible format (same as OpenRouter/sciCORE)
  • Supports images via base64 encoding
  • Handles structured output with fallback strategy
  • Sets reasonable max_tokens for local models

Step 4: Implement Response Parsing

File: scripts/simple_ai_clients.py, create_answer() method (after line 626)

Action: Add response parsing for local provider

elif self.api == 'local':
    # Local models return OpenAI-compatible responses
    if hasattr(response, 'usage') and response.usage:
        answer['usage'] = {
            'input_tokens': response.usage.prompt_tokens,
            'output_tokens': response.usage.completion_tokens,
            'total_tokens': response.usage.total_tokens,
        }

    # Convert response to JSON-serializable format
    raw_data = {
        'id': response.id,
        'model': response.model,
        'choices': [{
            'finish_reason': choice.finish_reason,
            'index': choice.index,
            'message': {
                'content': choice.message.content,
                'role': choice.message.role,
            }
        } for choice in response.choices],
        'usage': {
            'prompt_tokens': response.usage.prompt_tokens,
            'completion_tokens': response.usage.completion_tokens,
            'total_tokens': response.usage.total_tokens,
        } if hasattr(response, 'usage') and response.usage else {}
    }

    answer['raw'] = raw_data

    if self.dataclass:
        # Parse JSON response and validate with Pydantic
        try:
            content = response.choices[0].message.content
            # Try to extract JSON if it's wrapped in markdown code blocks
            if "```json" in content:
                import re
                json_match = re.search(r'```json\s*([\s\S]*?)\s*```', content)
                if json_match:
                    content = json_match.group(1)
            elif "```" in content:
                import re
                json_match = re.search(r'```\s*([\s\S]*?)\s*```', content)
                if json_match:
                    content = json_match.group(1)

            json_response = json.loads(content)
            pydantic_response = self.dataclass(**json_response)
            answer['response_text'] = pydantic_response.model_dump()
        except json.JSONDecodeError as e:
            logging.error(f"Failed to parse local model JSON response: {e}")
            answer['response_text'] = ""
            answer['error'] = 'JSONDecodeError'
            answer['error_message'] = str(e)
        except Exception as e:
            logging.warning(f"Failed to validate local model structured response: {e}")
            answer['response_text'] = response.choices[0].message.content
    else:
        answer['response_text'] = response.choices[0].message.content

Notes:

  • Extracts token usage for cost tracking
  • Handles JSON extraction from markdown code blocks
  • Validates structured output with Pydantic
  • Graceful error handling

Step 5: Update API Key Handling

File: scripts/run_benchmarks.py, get_api_key() function (lines 32-37)

Action: Make API key optional for local provider

def get_api_key(provider):
    """Get the API key for the provider."""
    api_key = os.getenv(f'{provider.upper()}_API_KEY')
    if not api_key and provider != 'local':  # Local provider doesn't require key
        raise ValueError(f"No API key found for {provider.upper()}")
    return api_key if api_key else 'not-needed'

File: scripts/dhbm.py (a similar update is needed at lines 20-27; a sketch follows)
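
A hedged sketch of the analogous change, assuming dhbm.py resolves keys the same way as run_benchmarks.py; the actual code at lines 20-27 may be structured differently:

    # Sketch only: mirrors the run_benchmarks.py change above.
    import os

    def get_api_key(provider):
        """Get the API key for the provider; 'local' does not require one."""
        api_key = os.getenv(f'{provider.upper()}_API_KEY')
        if not api_key and provider != 'local':
            raise ValueError(f"No API key found for {provider.upper()}")
        return api_key if api_key else 'not-needed'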


Step 6: Add Environment Configuration

File: .env

Action: Add local model configuration variables

# Local Model Configuration (vLLM, Ollama, etc.)
LOCAL_API_URL=http://localhost:8000/v1
LOCAL_API_KEY=  # Optional, leave empty if not needed

Notes:

  • LOCAL_API_URL can be changed for different local servers:
    • vLLM: http://localhost:8000/v1
    • Ollama: http://localhost:11434/v1
    • LM Studio: http://localhost:1234/v1
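
A quick way to confirm which server the benchmark will hit is to print the resolved URL. This sketch assumes the project loads .env via python-dotenv; if it does not, export the variables in your shell instead:

    # Sketch: make the .env values visible to init_client() and print the target server.
    import os
    from dotenv import load_dotenv

    load_dotenv()  # reads .env from the project root
    print(os.getenv("LOCAL_API_URL", "http://localhost:8000/v1"))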

Step 7: Add Test Configurations

File: benchmarks/benchmarks_tests.csv

Action: Add test rows for local models

Example entries:

T9001,metadata_extraction,local,llava-v1.6-34b,Document,0.0,You are a historian with keyword knowledge...,prompt.txt,,false
T9002,fraktur,local,qwen2-vl-7b-instruct,Document,0.0,You are a historian with keyword knowledge,prompt_optimized.txt,,false
T9003,bibliographic_data,local,mistral-7b-instruct,Document,0.0,You are a Historian,,,false

Notes:

  • Use actual model names from your vLLM server
  • Local and cloud provider tests can be mixed in the same benchmark run

Step 8: Update Pricing Data (Optional)

File: scripts/data/pricing.json

Action: Add local model pricing (if you want cost tracking)

"local": {
  "llava-v1.6-34b": {
    "input": 0.0,
    "output": 0.0
  },
  "qwen2-vl-7b-instruct": {
    "input": 0.0,
    "output": 0.0
  }
}

Testing Plan

  1. Start a vLLM server with a multimodal model:

     vllm serve llava-hf/llava-v1.6-vicuna-7b-hf

  2. Test the connection (adapt scripts/test_scicore.py for the local endpoint, or use the sketch after this list):

     python scripts/test_scicore.py

  3. Run a single benchmark:

     python scripts/dhbm.py --name "bibliographic_data" --provider "local" --model "llava-v1.6-vicuna-7b-hf"

  4. Run batch benchmarks (with T9xxx entries in the CSV):

     python scripts/run_benchmarks.py
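
If adapting test_scicore.py is inconvenient, a minimal smoke test along these lines should work. It is a sketch, not the existing test script, and assumes a server started as in step 1 plus the environment variables from Step 6:

    # Minimal smoke test for the local endpoint.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url=os.getenv("LOCAL_API_URL", "http://localhost:8000/v1"),
        api_key=os.getenv("LOCAL_API_KEY") or "not-needed",
    )

    model = client.models.list().data[0].id  # use whatever the server is serving
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Reply with the single word: ok"}],
        max_tokens=10,
    )
    print(model, "->", completion.choices[0].message.content)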

Expected Outcomes

After implementation, you will be able to:

✅ Run benchmarks on local vLLM-hosted models
✅ Test multiple local models by specifying different model names
✅ Get accurate token usage tracking
✅ Use structured output (Pydantic schemas) with local models
✅ Process images with multimodal models
✅ Mix local and cloud provider tests in the same benchmark run
✅ Switch between vLLM, Ollama, LM Studio by changing LOCAL_API_URL


Compatibility Note

This implementation supports any OpenAI-compatible local server, including:

  • vLLM (recommended for performance)
  • Ollama (easiest setup)
  • LM Studio (best GUI)
  • Text Generation Inference (HuggingFace)
  • LocalAI (universal compatibility)

Simply change the LOCAL_API_URL environment variable to switch between them.
