
Enhancement: Add on-device text embeddings via Apple NaturalLanguage #118

@jesserobbins

Description

Continuing the theme of exposing Apple's on-device ML stack through the afm API surface, I'd like to propose adding a /v1/embeddings endpoint. Starting scope is intentionally narrow: Apple's NLContextualEmbedding only — no MLX embedding backend, no third-party models. Once the surface is stable we can add more backends behind the same API.

What this adds

Embeddings (/v1/embeddings)

OpenAI-compatible embeddings endpoint backed by NLContextualEmbedding from the NaturalLanguage framework. Accepts a single string or an array of strings and returns float vectors plus usage counts. Empty or whitespace-only inputs are rejected with a 400.
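
To make the wire format concrete: requests and responses follow the standard OpenAI embeddings shape. A minimal client-side sketch of building a payload, with validation mirroring the 400 behavior described above (the helper name is hypothetical, not afm code):

```python
def build_embeddings_request(model: str, inputs) -> dict:
    """Build an OpenAI-style /v1/embeddings payload from a string
    or a list of strings, mirroring the server-side validation:
    empty/whitespace-only inputs would be rejected with a 400."""
    items = [inputs] if isinstance(inputs, str) else list(inputs)
    if not items or any(not s.strip() for s in items):
        raise ValueError("input must be non-empty, non-whitespace text")
    return {"model": model, "input": items}

# A successful response follows the OpenAI embeddings shape:
# {
#   "object": "list",
#   "data": [{"object": "embedding", "index": 0, "embedding": [...]}],
#   "model": "apple-nl-contextual-en",
#   "usage": {"prompt_tokens": ..., "total_tokens": ...}
# }
```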

Shipped model IDs:

  • apple-nl-contextual-en — English
  • apple-nl-contextual-multi — Latin-script multilingual (Apple's NL contextual multilingual model is Latin-script only; non-Latin scripts are out of scope for this backend)

Native dimension and max input length come from the framework at load time (NLContextualEmbedding.dimension / maximumSequenceLength) rather than being hard-coded, so the values stay correct across OS updates.

CLI: afm embeddings -m apple-nl-contextual-en -i "hello world"
CLI: afm embeddings --list (enumerates shipped model IDs)

Model listing

Embedding models show up in GET /v1/models alongside chat models so existing OpenAI clients can discover them.

Why this matters

Embeddings are the missing piece for a lot of local-first workflows — RAG, semantic search, clustering, dedup — and everything else in afm is already on-device. Adding embeddings through the same OpenAI-compatible interface means existing tooling (LangChain, LlamaIndex, custom scripts) works against afm without code changes.
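
As a concrete example of the semantic-search workflow this unlocks: once vectors come back from /v1/embeddings, ranking is plain cosine similarity. A generic sketch (not afm code; function names are illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: list[float], corpus: dict[str, list[float]], k: int = 3) -> list[str]:
    """Rank corpus entries by similarity to the query vector."""
    return sorted(corpus, key=lambda key: cosine(query, corpus[key]), reverse=True)[:k]
```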

NLContextualEmbedding runs on the ANE where available, requires no model download beyond what the OS ships, and has no network dependency. That's the right default for a local server.

Implementation

One feature commit plus review-hardening follow-ups:

  1. EmbeddingsController + EmbeddingBackend protocol + NLContextualEmbeddingBackend + EmbeddingModelRegistry
  2. Request/response types (EmbeddingRequest, EmbeddingResponse, usage accounting)
  3. CLI (afm embeddings) wired through ArgumentParser
  4. Route registration and /v1/models integration
  5. Unit tests for the controller and registry, plus Scripts/test-embeddings.sh and an openai Python client round-trip
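
For reference, the client round-trip mentioned in step 5 could look roughly like the stdlib sketch below. The base URL is an assumption (whatever afm serves on locally), and fetch_embeddings / parse_embeddings_response are hypothetical helpers, not part of the branch:

```python
import json
import urllib.request

def parse_embeddings_response(body: dict) -> list[list[float]]:
    """Pull vectors out of an OpenAI-style embeddings response,
    ordered by each item's index field."""
    data = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in data]

def fetch_embeddings(base_url: str, model: str, inputs: list[str]) -> list[list[float]]:
    """POST to {base_url}/embeddings and return the vectors.
    base_url is an assumption, e.g. "http://127.0.0.1:9999/v1"."""
    req = urllib.request.Request(
        f"{base_url}/embeddings",
        data=json.dumps({"model": model, "input": inputs}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_embeddings_response(json.load(resp))

# Example (requires a running afm server):
# vectors = fetch_embeddings("http://127.0.0.1:9999/v1",
#                            "apple-nl-contextual-en", ["hello world"])
```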

Deliberately out of scope for this first cut:

  • MLX embedding backend (sentence-transformers, BGE, etc.)
  • Non-Latin script coverage
  • Matryoshka / dimension truncation (the NL backend doesn't support it)
  • Batch endpoint parity

Happy to open the PR against main once there's interest — branch is ready locally.
