
Add local models via vLLM #72

@MHindermann

Description


Plan: Add Local Model Support for vLLM Integration

Overview

Add support for running benchmarks on local models hosted via vLLM (or other OpenAI-compatible servers such as Ollama and LM Studio). This will enable testing local models alongside cloud providers using the same benchmark infrastructure.

Implementation Strategy

Use the OpenAI-compatible API pattern (same approach as sciCORE) to minimize code changes and maximize compatibility.


Step 1: Add 'local' to Supported APIs

File: scripts/simple_ai_clients.py

Action: Update SUPPORTED_APIS list (lines 18-23)

SUPPORTED_APIS = ['openai',
                  'genai',
                  'anthropic',
                  'mistral',
                  'openrouter',
                  'scicore',
                  'local']  # Add this

Step 2: Implement Client Initialization

File: scripts/simple_ai_clients.py, init_client() method (after line 83)

Action: Add initialization block for local provider

if self.api == 'local':
    import os
    # Base URL of the local OpenAI-compatible server (defaults to vLLM's standard port)
    base_url = os.getenv('LOCAL_API_URL', 'http://localhost:8000/v1')
    # Most local servers accept any placeholder string as the API key
    api_key = self.api_key if self.api_key else 'not-needed'
    self.api_client = OpenAI(
        base_url=base_url,
        api_key=api_key,
    )

Notes:

  • Reads base URL from environment variable (default: vLLM's standard port)
  • API key optional (many local servers don't require authentication)
  • Works with vLLM, Ollama, LM Studio, etc.
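
For reference, a minimal standalone check that the configured base URL and key handling work, independent of the client class. This is a sketch that assumes the openai v1 SDK and a running vLLM server; it is not part of the plan's code:

    # Sketch: verify a local OpenAI-compatible server is reachable.
    # Assumes the LOCAL_API_URL / LOCAL_API_KEY variables from Step 6.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url=os.getenv("LOCAL_API_URL", "http://localhost:8000/v1"),
        api_key=os.getenv("LOCAL_API_KEY") or "not-needed",
    )

    # vLLM exposes the models it serves under /v1/models.
    print([m.id for m in client.models.list().data])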

Step 3: Implement Prompt Method

File: scripts/simple_ai_clients.py, prompt() method (after line 273)

Action: Add request handling for local provider

if self.api == 'local':
    workload_json = [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
        ]
    },
    {
        "role": "system",
        "content": self.gpt_role_description
    }]

    for img_path in self.image_resources:
        with open(img_path, "rb") as image_file:
            base64_image = base64.b64encode(image_file.read()).decode("utf-8")
        workload_json[0]['content'].append(
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
        )

    kwargs = {
        "messages": workload_json,
        "model": model,
        "temperature": self.temperature,
        "max_tokens": 16000,
    }

    if self.dataclass:
        # Try strict mode first, fall back to json_object mode if that fails
        schema = self.dataclass.model_json_schema()
        try:
            kwargs_strict = kwargs.copy()
            kwargs_strict["response_format"] = {
                "type": "json_schema",
                "json_schema": {
                    "name": self.dataclass.__name__,
                    "strict": True,
                    "schema": schema
                }
            }
            chat_completion = self.api_client.chat.completions.create(**kwargs_strict)
            answer = chat_completion
        except Exception as e:
            logging.warning(f"Local model strict mode failed, falling back to json_object mode: {e}")
            kwargs["response_format"] = {"type": "json_object"}
            schema_prompt = f"\n\nYou MUST respond with valid JSON matching this exact schema: {json.dumps(schema)}"
            kwargs["messages"][1]["content"] = self.gpt_role_description + schema_prompt
            chat_completion = self.api_client.chat.completions.create(**kwargs)
            answer = chat_completion
    else:
        chat_completion = self.api_client.chat.completions.create(**kwargs)
        answer = chat_completion

Notes:

  • Uses OpenAI-compatible format (same as OpenRouter/sciCORE)
  • Supports images via base64 encoding
  • Handles structured output with fallback strategy
  • Sets reasonable max_tokens for local models

Step 4: Implement Response Parsing

File: scripts/simple_ai_clients.py, create_answer() method (after line 626)

Action: Add response parsing for local provider

elif self.api == 'local':
    # Local models return OpenAI-compatible responses
    if hasattr(response, 'usage') and response.usage:
        answer['usage'] = {
            'input_tokens': response.usage.prompt_tokens,
            'output_tokens': response.usage.completion_tokens,
            'total_tokens': response.usage.total_tokens,
        }

    # Convert response to JSON-serializable format
    raw_data = {
        'id': response.id,
        'model': response.model,
        'choices': [{
            'finish_reason': choice.finish_reason,
            'index': choice.index,
            'message': {
                'content': choice.message.content,
                'role': choice.message.role,
            }
        } for choice in response.choices],
        'usage': {
            'prompt_tokens': response.usage.prompt_tokens,
            'completion_tokens': response.usage.completion_tokens,
            'total_tokens': response.usage.total_tokens,
        } if hasattr(response, 'usage') and response.usage else {}
    }

    answer['raw'] = raw_data

    if self.dataclass:
        # Parse JSON response and validate with Pydantic
        try:
            content = response.choices[0].message.content
            # Try to extract JSON if it's wrapped in markdown code blocks
            if "```json" in content:
                import re
                json_match = re.search(r'```json\s*([\s\S]*?)\s*```', content)
                if json_match:
                    content = json_match.group(1)
            elif "```" in content:
                import re
                json_match = re.search(r'```\s*([\s\S]*?)\s*```', content)
                if json_match:
                    content = json_match.group(1)

            json_response = json.loads(content)
            pydantic_response = self.dataclass(**json_response)
            answer['response_text'] = pydantic_response.model_dump()
        except json.JSONDecodeError as e:
            logging.error(f"Failed to parse local model JSON response: {e}")
            answer['response_text'] = ""
            answer['error'] = 'JSONDecodeError'
            answer['error_message'] = str(e)
        except Exception as e:
            logging.warning(f"Failed to validate local model structured response: {e}")
            answer['response_text'] = response.choices[0].message.content
    else:
        answer['response_text'] = response.choices[0].message.content

Notes:

  • Extracts token usage for cost tracking
  • Handles JSON extraction from markdown code blocks
  • Validates structured output with Pydantic
  • Graceful error handling

Step 5: Update API Key Handling

File: scripts/run_benchmarks.py, get_api_key() function (lines 32-37)

Action: Make API key optional for local provider

def get_api_key(provider):
    """Get the API key for the provider."""
    api_key = os.getenv(f'{provider.upper()}_API_KEY')
    if not api_key and provider != 'local':  # Local provider doesn't require key
        raise ValueError(f"No API key found for {provider.upper()}")
    return api_key if api_key else 'not-needed'

File: scripts/dhbm.py (a similar update is needed at lines 20-27; a sketch follows)
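
A hedged sketch of the analogous change, assuming dhbm.py resolves keys the same way as run_benchmarks.py; the actual code at lines 20-27 may be structured differently:

    # Sketch only: mirrors the run_benchmarks.py change above.
    import os

    def get_api_key(provider):
        """Get the API key for the provider; 'local' does not require one."""
        api_key = os.getenv(f'{provider.upper()}_API_KEY')
        if not api_key and provider != 'local':
            raise ValueError(f"No API key found for {provider.upper()}")
        return api_key if api_key else 'not-needed'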


Step 6: Add Environment Configuration

File: .env

Action: Add local model configuration variables

# Local Model Configuration (vLLM, Ollama, etc.)
LOCAL_API_URL=http://localhost:8000/v1
LOCAL_API_KEY=  # Optional, leave empty if not needed

Notes:

  • LOCAL_API_URL can be changed for different local servers:
    • vLLM: http://localhost:8000/v1
    • Ollama: http://localhost:11434/v1
    • LM Studio: http://localhost:1234/v1
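
A quick way to confirm which server the benchmark will hit is to print the resolved URL. This sketch assumes the project loads .env via python-dotenv; if it does not, export the variables in your shell instead:

    # Sketch: make the .env values visible to init_client() and print the target server.
    import os
    from dotenv import load_dotenv

    load_dotenv()  # reads .env from the project root
    print(os.getenv("LOCAL_API_URL", "http://localhost:8000/v1"))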

Step 7: Add Test Configurations

File: benchmarks/benchmarks_tests.csv

Action: Add test rows for local models

Example entries:

T9001,metadata_extraction,local,llava-v1.6-34b,Document,0.0,You are a historian with keyword knowledge...,prompt.txt,,false
T9002,fraktur,local,qwen2-vl-7b-instruct,Document,0.0,You are a historian with keyword knowledge,prompt_optimized.txt,,false
T9003,bibliographic_data,local,mistral-7b-instruct,Document,0.0,You are a Historian,,,false

Notes:

  • Use actual model names from your vLLM server
  • Local and cloud provider tests can be mixed in the same benchmark run

Step 8: Update Pricing Data (Optional)

File: scripts/data/pricing.json

Action: Add local model pricing (if you want cost tracking)

"local": {
  "llava-v1.6-34b": {
    "input": 0.0,
    "output": 0.0
  },
  "qwen2-vl-7b-instruct": {
    "input": 0.0,
    "output": 0.0
  }
}

Testing Plan

  1. Start a vLLM server with a multimodal model:

     vllm serve llava-hf/llava-v1.6-vicuna-7b-hf

  2. Test the connection (adapt scripts/test_scicore.py for the local endpoint, or use the sketch after this list):

     python scripts/test_scicore.py

  3. Run a single benchmark:

     python scripts/dhbm.py --name "bibliographic_data" --provider "local" --model "llava-v1.6-vicuna-7b-hf"

  4. Run batch benchmarks (with T9xxx entries in the CSV):

     python scripts/run_benchmarks.py
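
If adapting test_scicore.py is inconvenient, a minimal smoke test along these lines should work. It is a sketch, not the existing test script, and assumes a server started as in step 1 plus the environment variables from Step 6:

    # Minimal smoke test for the local endpoint.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url=os.getenv("LOCAL_API_URL", "http://localhost:8000/v1"),
        api_key=os.getenv("LOCAL_API_KEY") or "not-needed",
    )

    model = client.models.list().data[0].id  # use whatever the server is serving
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Reply with the single word: ok"}],
        max_tokens=10,
    )
    print(model, "->", completion.choices[0].message.content)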

Expected Outcomes

After implementation, you will be able to:

✅ Run benchmarks on local vLLM-hosted models
✅ Test multiple local models by specifying different model names
✅ Get accurate token usage tracking
✅ Use structured output (Pydantic schemas) with local models
✅ Process images with multimodal models
✅ Mix local and cloud provider tests in the same benchmark run
✅ Switch between vLLM, Ollama, LM Studio by changing LOCAL_API_URL


Compatibility Note

This implementation supports any OpenAI-compatible local server, including:

  • vLLM (recommended for performance)
  • Ollama (easiest setup)
  • LM Studio (best GUI)
  • Text Generation Inference (HuggingFace)
  • LocalAI (universal compatibility)

Simply change the LOCAL_API_URL environment variable to switch between them.
