Plan: Add Local Model Support for vLLM Integration
Overview
Add support for running benchmarks on local models hosted via vLLM (or other OpenAI-compatible servers like Ollama, LM Studio). This will enable testing local models alongside cloud providers using the same benchmark infrastructure.
Implementation Strategy
Use the OpenAI-compatible API pattern (same approach as sciCORE) to minimize code changes and maximize compatibility.
Step 1: Add 'local' to Supported APIs
File: scripts/simple_ai_clients.py
Action: Update SUPPORTED_APIS list (lines 18-23)
```python
SUPPORTED_APIS = ['openai',
                  'genai',
                  'anthropic',
                  'mistral',
                  'openrouter',
                  'scicore',
                  'local']  # Add this
```
Step 2: Implement Client Initialization
File: scripts/simple_ai_clients.py → init_client() method (after line 83)
Action: Add initialization block for local provider
```python
if self.api == 'local':
    import os
    base_url = os.getenv('LOCAL_API_URL', 'http://localhost:8000/v1')
    api_key = self.api_key if self.api_key else 'not-needed'
    self.api_client = OpenAI(
        base_url=base_url,
        api_key=api_key,
    )
```
Notes:
- Reads base URL from environment variable (default: vLLM's standard port)
- API key optional (many local servers don't require authentication)
- Works with vLLM, Ollama, LM Studio, etc. (see the smoke-test sketch below)
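Before wiring this into the benchmark runner, it can help to confirm that the endpoint and the (optional) key actually work. Below is a minimal standalone smoke test, assuming the `openai` v1 Python package and a model already loaded on the server; the model name matches the one used in the Testing Plan further down and should be replaced with whatever your server serves.

```python
# Minimal smoke test for a local OpenAI-compatible server (sketch, not part of the codebase).
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.getenv("LOCAL_API_URL", "http://localhost:8000/v1"),
    api_key=os.getenv("LOCAL_API_KEY") or "not-needed",
)

completion = client.chat.completions.create(
    model="llava-hf/llava-v1.6-vicuna-7b-hf",  # replace with the model your server serves
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=10,
)
print(completion.choices[0].message.content)
```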
Step 3: Implement Prompt Method
File: scripts/simple_ai_clients.py → prompt() method (after line 273)
Action: Add request handling for local provider
```python
if self.api == 'local':
    workload_json = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
            ]
        },
        {
            "role": "system",
            "content": self.gpt_role_description
        }
    ]
    for img_path in self.image_resources:
        with open(img_path, "rb") as image_file:
            base64_image = base64.b64encode(image_file.read()).decode("utf-8")
        workload_json[0]['content'].append(
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
        )
    kwargs = {
        "messages": workload_json,
        "model": model,
        "temperature": self.temperature,
        "max_tokens": 16000,
    }
    if self.dataclass:
        # Try strict mode first, fall back to json_object mode if that fails
        schema = self.dataclass.model_json_schema()
        try:
            kwargs_strict = kwargs.copy()
            kwargs_strict["response_format"] = {
                "type": "json_schema",
                "json_schema": {
                    "name": self.dataclass.__name__,
                    "strict": True,
                    "schema": schema
                }
            }
            chat_completion = self.api_client.chat.completions.create(**kwargs_strict)
            answer = chat_completion
        except Exception as e:
            logging.warning(f"Local model strict mode failed, falling back to json_object mode: {e}")
            kwargs["response_format"] = {"type": "json_object"}
            schema_prompt = f"\n\nYou MUST respond with valid JSON matching this exact schema: {json.dumps(schema)}"
            kwargs["messages"][1]["content"] = self.gpt_role_description + schema_prompt
            chat_completion = self.api_client.chat.completions.create(**kwargs)
            answer = chat_completion
    else:
        chat_completion = self.api_client.chat.completions.create(**kwargs)
        answer = chat_completion
```
Notes:
- Uses OpenAI-compatible format (same as OpenRouter/sciCORE)
- Supports images via base64 encoding
- Handles structured output with a fallback strategy (see the example dataclass below)
- Sets reasonable max_tokens for local models
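To make the fallback concrete: `self.dataclass` is a Pydantic model, and its JSON schema is what goes either into the strict `response_format` or, on fallback, into the system prompt. A purely hypothetical example (the model and its fields are invented for illustration, not part of the codebase):

```python
# Hypothetical Pydantic model of the kind self.dataclass may hold.
import json
from pydantic import BaseModel

class BibliographicRecord(BaseModel):
    title: str
    author: str
    year: int

# This schema is sent as the strict response_format, or appended to the
# system prompt when the server rejects strict structured output.
schema = BibliographicRecord.model_json_schema()
print(json.dumps(schema, indent=2))
```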
Step 4: Implement Response Parsing
File: scripts/simple_ai_clients.py → create_answer() method (after line 626)
Action: Add response parsing for local provider
````python
elif self.api == 'local':
    # Local models return OpenAI-compatible responses
    if hasattr(response, 'usage') and response.usage:
        answer['usage'] = {
            'input_tokens': response.usage.prompt_tokens,
            'output_tokens': response.usage.completion_tokens,
            'total_tokens': response.usage.total_tokens,
        }
    # Convert response to JSON-serializable format
    raw_data = {
        'id': response.id,
        'model': response.model,
        'choices': [{
            'finish_reason': choice.finish_reason,
            'index': choice.index,
            'message': {
                'content': choice.message.content,
                'role': choice.message.role,
            }
        } for choice in response.choices],
        'usage': {
            'prompt_tokens': response.usage.prompt_tokens,
            'completion_tokens': response.usage.completion_tokens,
            'total_tokens': response.usage.total_tokens,
        } if hasattr(response, 'usage') and response.usage else {}
    }
    answer['raw'] = raw_data
    if self.dataclass:
        # Parse JSON response and validate with Pydantic
        try:
            content = response.choices[0].message.content
            # Try to extract JSON if it's wrapped in markdown code blocks
            if "```json" in content:
                import re
                json_match = re.search(r'```json\s*([\s\S]*?)\s*```', content)
                if json_match:
                    content = json_match.group(1)
            elif "```" in content:
                import re
                json_match = re.search(r'```\s*([\s\S]*?)\s*```', content)
                if json_match:
                    content = json_match.group(1)
            json_response = json.loads(content)
            pydantic_response = self.dataclass(**json_response)
            answer['response_text'] = pydantic_response.model_dump()
        except json.JSONDecodeError as e:
            logging.error(f"Failed to parse local model JSON response: {e}")
            answer['response_text'] = ""
            answer['error'] = 'JSONDecodeError'
            answer['error_message'] = str(e)
        except Exception as e:
            logging.warning(f"Failed to validate local model structured response: {e}")
            answer['response_text'] = response.choices[0].message.content
    else:
        answer['response_text'] = response.choices[0].message.content
````
Notes:
- Extracts token usage for cost tracking
- Handles JSON extraction from markdown code blocks (sketched standalone below)
- Validates structured output with Pydantic
- Graceful error handling
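The fence-stripping logic can also be exercised in isolation. The sketch below is a combined version of the two regexes above; the helper name and sample data are illustrative, not part of the codebase.

````python
# Standalone sketch of the markdown-fence extraction used above.
import json
import re

def extract_json(content: str) -> dict:
    """Strip ```json ... ``` or ``` ... ``` fences, then parse the JSON."""
    match = re.search(r'```(?:json)?\s*([\s\S]*?)\s*```', content)
    if match:
        content = match.group(1)
    return json.loads(content)

raw = '```json\n{"title": "Chronik", "year": 1523}\n```'
print(extract_json(raw))   # {'title': 'Chronik', 'year': 1523}
````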
Step 5: Update API Key Handling
File: scripts/run_benchmarks.py → get_api_key() function (lines 32-37)
Action: Make API key optional for local provider
```python
def get_api_key(provider):
    """Get the API key for the provider."""
    api_key = os.getenv(f'{provider.upper()}_API_KEY')
    if not api_key and provider != 'local':  # Local provider doesn't require a key
        raise ValueError(f"No API key found for {provider.upper()}")
    return api_key if api_key else 'not-needed'
```
File: scripts/dhbm.py (similar update needed at lines 20-27)
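To illustrate the changed behaviour, a quick self-contained check (the environment manipulation here is only for demonstration):

```python
import os

def get_api_key(provider):
    """Get the API key for the provider; 'local' does not require one."""
    api_key = os.getenv(f'{provider.upper()}_API_KEY')
    if not api_key and provider != 'local':
        raise ValueError(f"No API key found for {provider.upper()}")
    return api_key if api_key else 'not-needed'

# With no LOCAL_API_KEY set, the local provider falls back to a placeholder
os.environ.pop('LOCAL_API_KEY', None)
print(get_api_key('local'))      # -> 'not-needed'

# Cloud providers still fail fast when their key is missing
os.environ.pop('OPENAI_API_KEY', None)
try:
    get_api_key('openai')
except ValueError as e:
    print(e)                     # -> No API key found for OPENAI
```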
Step 6: Add Environment Configuration
File: .env
Action: Add local model configuration variables
```
# Local Model Configuration (vLLM, Ollama, etc.)
LOCAL_API_URL=http://localhost:8000/v1
LOCAL_API_KEY=  # Optional, leave empty if not needed
```
Notes:
LOCAL_API_URL can be changed for different local servers (see the reachability sketch below):
- vLLM: http://localhost:8000/v1
- Ollama: http://localhost:11434/v1
- LM Studio: http://localhost:1234/v1
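To verify which server LOCAL_API_URL currently points at, a minimal reachability check can hit the `/models` endpoint, which vLLM exposes and which OpenAI-compatible wrappers such as Ollama and LM Studio generally implement. This sketch assumes the `requests` package is available:

```python
# Reachability check for whichever server LOCAL_API_URL points at (sketch).
import os
import requests

base_url = os.getenv("LOCAL_API_URL", "http://localhost:8000/v1")
resp = requests.get(f"{base_url}/models", timeout=5)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # model ids the server will accept
```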
Step 7: Add Test Configurations
File: benchmarks/benchmarks_tests.csv
Action: Add test rows for local models
Example entries:
```
T9001,metadata_extraction,local,llava-v1.6-34b,Document,0.0,You are a historian with keyword knowledge...,prompt.txt,,false
T9002,fraktur,local,qwen2-vl-7b-instruct,Document,0.0,You are a historian with keyword knowledge,prompt_optimized.txt,,false
T9003,bibliographic_data,local,mistral-7b-instruct,Document,0.0,You are a Historian,,,false
```
Notes:
- Use the exact model names served by your vLLM server (the cross-check sketch below lists them)
- Local and cloud provider tests can be mixed in the same benchmark run
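As a sanity check before a run, the local rows in the CSV can be cross-checked against what the server actually serves. A sketch, assuming the column order shown in the example rows above (provider in the third column, model in the fourth) and no special header handling:

```python
# Cross-check: are the 'local' models listed in the benchmark CSV actually served?
import csv
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.getenv("LOCAL_API_URL", "http://localhost:8000/v1"),
    api_key=os.getenv("LOCAL_API_KEY") or "not-needed",
)
served = {m.id for m in client.models.list().data}

with open("benchmarks/benchmarks_tests.csv", newline="") as f:
    for row in csv.reader(f):
        if len(row) > 3 and row[2] == "local" and row[3] not in served:
            print(f"{row[0]}: model '{row[3]}' is not served at the local endpoint")
```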
Step 8: Update Pricing Data (Optional)
File: scripts/data/pricing.json
Action: Add local model pricing (if you want cost tracking)
"local": {
"llava-v1.6-34b": {
"input": 0.0,
"output": 0.0
},
"qwen2-vl-7b-instruct": {
"input": 0.0,
"output": 0.0
}
}Testing Plan
- Start the vLLM server with a multimodal model:
  ```bash
  vllm serve llava-hf/llava-v1.6-vicuna-7b-hf
  ```
- Test the connection using the test script:
  ```bash
  python scripts/test_scicore.py  # Adapt for the local endpoint
  ```
- Run a single benchmark:
  ```bash
  python scripts/dhbm.py --name "bibliographic_data" --provider "local" --model "llava-v1.6-vicuna-7b-hf"
  ```
- Run batch benchmarks:
  ```bash
  python scripts/run_benchmarks.py  # With T9xxx entries in the CSV
  ```
Expected Outcomes
After implementation, you will be able to:
✅ Run benchmarks on local vLLM-hosted models
✅ Test multiple local models by specifying different model names
✅ Get accurate token usage tracking
✅ Use structured output (Pydantic schemas) with local models
✅ Process images with multimodal models
✅ Mix local and cloud provider tests in the same benchmark run
✅ Switch between vLLM, Ollama, LM Studio by changing LOCAL_API_URL
Compatibility Note
This implementation supports any OpenAI-compatible local server, including:
- vLLM (recommended for performance)
- Ollama (easiest setup)
- LM Studio (best GUI)
- Text Generation Inference (HuggingFace)
- LocalAI (universal compatibility)
Simply change the LOCAL_API_URL environment variable to switch between them.