`POST /v1/vision/ocr` exposes Apple Vision OCR over HTTP in the same local server as the OpenAI-compatible chat endpoints.

It is intended for:

- direct OCR over HTTP without shelling out to `afm vision`
- clients that already produce OpenAI-style `image_url` content parts
- local document extraction workflows that need structured text, table output, and page-level metadata
Requirements:

- macOS 26.0 or later
- Apple Silicon
- Apple Vision available on the host machine
The endpoint accepts one or more OCR inputs in a single request, in any of the following forms.

A local file path:

```json
{
  "file": "/tmp/invoice.pdf"
}
```

Raw base64 data with an explicit media type:

```json
{
  "data": "iVBORw0KGgoAAAANSUhEUgAA...",
  "filename": "scan.png",
  "media_type": "image/png"
}
```

A base64 data URL:

```json
{
  "data": "data:application/pdf;base64,JVBERi0xLjcK..."
}
```

An OpenAI-style `image_url` content part:

```json
{
  "image_url": {
    "url": "data:image/png;base64,..."
  }
}
```

A full OpenAI-style `messages` array:

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Read this document" },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:application/pdf;base64,..."
          }
        }
      ]
    }
  ]
}
```
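As a minimal sketch of the file-path form, assuming the server is listening on `localhost:9999` (matching the curl example below) and using only the Python standard library:

```python
# Minimal sketch: POST the file-path form as a JSON body.
# localhost:9999 and /tmp/invoice.pdf are illustrative.
import json
import urllib.request

body = json.dumps({"file": "/tmp/invoice.pdf"}).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:9999/v1/vision/ocr",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# combined_text joins the OCR output across all documents
# (see the response fields documented below).
print(result["combined_text"])
```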
For uploads, use a multipart `file` field. Optional OCR controls can be supplied as form fields alongside the upload:

```bash
curl http://localhost:9999/v1/vision/ocr \
  -F "file=@/tmp/invoice.pdf" \
  -F "recognition_level=accurate" \
  -F "languages=en-US"
```

The full set of OCR controls, shown in JSON form:

```json
{
  "recognition_level": "accurate",
  "uses_language_correction": true,
  "languages": ["en-US"],
  "max_pages": 10,
  "table": false,
  "debug": false
}
```

Supported options:
- `recognition_level`: `accurate` or `fast`
- `uses_language_correction`: enables Vision language correction
- `languages`: preferred OCR language tags
- `max_pages`: page cap for multi-page PDFs
- `table`: returns structured table extraction in the document payload
- `debug`: returns raw Vision detection output instead of OCR documents
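As a sketch combining the base64 input form with these controls, assuming the controls sit at the top level of the JSON body alongside the input fields:

```python
# Sketch: base64-encode a local image and send it with OCR controls.
# Assumes controls are top-level JSON fields next to the input;
# scan.png is an illustrative file name.
import base64
import json
import urllib.request

with open("scan.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

body = json.dumps({
    "data": encoded,
    "filename": "scan.png",
    "media_type": "image/png",
    "recognition_level": "accurate",
    "languages": ["en-US"],
    "table": False,
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:9999/v1/vision/ocr",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["combined_text"])
```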
Current guardrails:
- max input size: 25 MB
- max pages per document: 50 by default
- max image dimension: 4096 px on either side
- supported formats: `png`, `jpg`, `jpeg`, `heic`, `pdf`
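Since these limits are enforced server-side, a client can fail fast before uploading. A pre-flight sketch mirroring the defaults listed above (dimension and page checks are omitted, since they need an image/PDF library):

```python
# Pre-flight sketch: reject inputs the server would refuse anyway.
# Limits mirror the documented defaults; adjust if the server differs.
import os

MAX_BYTES = 25 * 1024 * 1024  # 25 MB input cap
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".heic", ".pdf"}

def check_ocr_input(path: str) -> None:
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext or '(none)'}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("file exceeds the 25 MB OCR input limit")
```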
Successful responses return a `vision.ocr` object:
```json
{
  "object": "vision.ocr",
  "mode": "text",
  "documents": [
    {
      "file": "/tmp/invoice.pdf",
      "source_type": "file",
      "text": "Page 1...\n\nPage 2...",
      "full_text": "Page 1...\n\nPage 2...",
      "page_count": 2,
      "document_hints": ["invoice", "multi_page", "table_like"],
      "pages": [
        {
          "page_number": 1,
          "text": "Page 1...",
          "width": 1024,
          "height": 768,
          "text_blocks": [],
          "tables": []
        }
      ],
      "text_blocks": [],
      "tables": []
    }
  ],
  "combined_text": "Page 1...\n\nPage 2...",
  "document_hints": ["invoice", "multi_page", "table_like"]
}
```

Notable fields:
- `mode`: `text`, `table`, or `debug`
- `documents`: one entry per resolved OCR input
- `pages`: per-page text, dimensions, blocks, and tables
- `text_blocks`: flattened OCR text blocks with confidence and bounding boxes
- `tables`: structured tables with headers, rows, row objects, CSV, and bounding boxes
- `combined_text`: all OCR text joined across all documents
- `document_hints`: inferred hints such as `invoice`, `multi_page`, or `table_like`
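A small sketch of walking these fields in a parsed response (all names follow the payload above; `result` is the decoded JSON dict):

```python
# Sketch: summarize a parsed vision.ocr response dict.
# Field names follow the documented payload above.
def summarize(result: dict) -> None:
    print("mode:", result["mode"])
    for doc in result["documents"]:
        hints = ", ".join(doc["document_hints"])
        print(f"{doc['source_type']} input, {doc['page_count']} page(s), hints: {hints}")
        for page in doc["pages"]:
            print(
                f"  page {page['page_number']}: "
                f"{page['width']}x{page['height']}, "
                f"{len(page['text_blocks'])} block(s), "
                f"{len(page['tables'])} table(s)"
            )
    print(result["combined_text"])
```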
The endpoint maps OCR failures to HTTP statuses:
- `400 Bad Request`: missing input, unsupported format, invalid base64 or data URL
- `404 Not Found`: local file path does not exist
- `413 Payload Too Large`: request exceeds the OCR file-size limit
- `422 Unprocessable Entity`: page limit exceeded, image too large, unreadable image, no text found, no tables found, segmentation failure
- `503 Service Unavailable`: Apple Vision OCR is unavailable on the current platform
Errors use the same JSON envelope style as the rest of the OpenAI-compatible API:
```json
{
  "error": {
    "message": "The specified file was not found",
    "type": "invalid_request_error"
  }
}
```
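A client-side sketch of mapping these statuses and surfacing the envelope, using the standard library (`urllib` raises `HTTPError` for non-2xx responses, and the error object is file-like, so the envelope can be read from it):

```python
# Sketch: send an OCR request and translate documented error
# statuses into client-side exceptions, surfacing the envelope.
import json
import urllib.error
import urllib.request

def post_ocr(body: bytes) -> dict:
    req = urllib.request.Request(
        "http://localhost:9999/v1/vision/ocr",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as e:
        envelope = json.load(e)  # HTTPError is file-like
        message = envelope["error"]["message"]
        if e.code == 413:
            raise ValueError(f"input too large: {message}") from e
        if e.code == 503:
            raise RuntimeError(f"Vision OCR unavailable: {message}") from e
        raise RuntimeError(f"OCR failed ({e.code}): {message}") from e
```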
Foundation chat requests can auto-run Apple Vision OCR before prompting the model. This only happens when all of the following are true:

- the request includes image content in `messages[].content[]`
- the request includes the built-in tool named `apple_vision_ocr`
- `tool_choice` is omitted, `auto`, `required`, or explicitly selects `apple_vision_ocr`
When that path is taken, OCR text is injected into the message content before the Foundation model sees the prompt.
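A hypothetical sketch of such a request. The chat endpoint path, the model name, and the tool-declaration shape are all assumptions here, not confirmed by this section:

```python
# Hypothetical sketch: an OpenAI-style chat request that should take
# the auto-OCR path. The endpoint path, model name, and the tool
# declaration shape are assumptions, not documented in this section.
import json
import urllib.request

body = json.dumps({
    "model": "foundation",  # assumed model name
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Read this document"},
            {
                "type": "image_url",
                "image_url": {"url": "data:image/png;base64,..."},
            },
        ],
    }],
    "tools": [{"type": "apple_vision_ocr"}],  # assumed declaration shape
    "tool_choice": "auto",
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:9999/v1/chat/completions",  # assumed chat path
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))
```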
Validated locally with:
```bash
swift test --disable-sandbox
```

Result at the time of implementation: 293 tests in 19 suites passed.