Switch embeddings to AI Gateway Qwen3-Embedding-4B; keep Together reranking #666

@BjornMelin

Description

Summary

Switch all text embeddings to Vercel AI Gateway using alibaba/qwen3-embedding-4b as the canonical embedding model, while keeping Together.ai for reranking via AI SDK. This finalizes the decision to migrate off Together embeddings for higher quality and consistent gateway routing.

Decision (final)

  • Embeddings: Use AI Gateway model ID alibaba/qwen3-embedding-4b for all embedding generation (API, RAG, memory). The model page shows embed usage with this exact ID.
  • Reranking: Keep @ai-sdk/togetherai for reranking (Mixedbread Mxbai-Rerank-Large-V2) as today.
  • Dimensions: Update all embedding dimensionality expectations and storage to 2560 (Qwen3-Embedding-4B).

Background / current state

  • Embedding model selection lives in src/lib/ai/embeddings/text-embedding-model.ts and currently resolves to openai/text-embedding-3-small (1536-d) when AI Gateway/OpenAI keys are present; otherwise a deterministic 1536-d fallback is used.
  • Embeddings are used in /api/embeddings, RAG indexing and retrieval, and memory search (src/lib/rag/*, src/lib/memory/*).
  • Reranking is already implemented via AI SDK rerank() with Together.ai in src/lib/rag/reranker.ts.

Rationale

  • Vercel AI Gateway lists alibaba/qwen3-embedding-4b and provides direct embed usage with that model ID.
  • AI SDK embed / embedMany are the canonical embedding APIs; AI SDK rerank is the canonical reranking API.
  • User-provided research summary (UNVERIFIED) indicates Qwen3-Embedding-4B outperforms older 2023–2024 embedding models on MTEB (English and multilingual) with stronger retrieval, and offers instruction-tuned multilingual/code performance.

Implementation spec

1) Embedding model + dimensions

  • Set TEXT_EMBEDDING_MODEL_ID = "alibaba/qwen3-embedding-4b" in src/lib/ai/embeddings/text-embedding-model.ts.
  • Set TEXT_EMBEDDING_DIMENSIONS = 2560 and align deterministic fallback to 2560-d (or remove deterministic fallback if it conflicts with production expectations).
  • Preferred: When AI_GATEWAY_API_KEY is set, always use the gateway string ID. Decide whether to remove OpenAI fallback to avoid silent model divergence; if kept, it must be explicit and documented.
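The config change above can be sketched as follows. This is a minimal sketch, not the actual file contents: the constant names match the issue, but `deterministicFallbackEmbedding` and its hashing scheme are assumptions standing in for whatever deterministic fallback the repo currently implements, realigned to 2560-d and unit-normalized.

```typescript
// Sketch of src/lib/ai/embeddings/text-embedding-model.ts after the switch.
// Constant names are from this issue; the fallback implementation is assumed.
export const TEXT_EMBEDDING_MODEL_ID = "alibaba/qwen3-embedding-4b";
export const TEXT_EMBEDDING_DIMENSIONS = 2560;

// Deterministic fallback: seed a small PRNG from the input text so the same
// string always maps to the same unit-length 2560-d vector (dev/test only).
export function deterministicFallbackEmbedding(text: string): number[] {
  // FNV-1a hash of the input text seeds the generator.
  let seed = 2166136261;
  for (let i = 0; i < text.length; i++) {
    seed = Math.imul(seed ^ text.charCodeAt(i), 16777619);
  }
  const out = new Array<number>(TEXT_EMBEDDING_DIMENSIONS);
  let norm = 0;
  for (let i = 0; i < TEXT_EMBEDDING_DIMENSIONS; i++) {
    // xorshift32 step produces the next pseudo-random 32-bit value.
    seed ^= seed << 13;
    seed ^= seed >>> 17;
    seed ^= seed << 5;
    const v = (seed >>> 0) / 0xffffffff - 0.5;
    out[i] = v;
    norm += v * v;
  }
  norm = Math.sqrt(norm);
  return out.map((v) => v / norm);
}
```

If the fallback is kept, pinning it to TEXT_EMBEDDING_DIMENSIONS like this keeps dev/test vectors shape-compatible with the 2560-d production columns.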

2) Update embedding call sites

  • Ensure all usages of getTextEmbeddingModel() / TEXT_EMBEDDING_DIMENSIONS are updated and validated:
    • src/app/api/embeddings/route.ts
    • src/lib/rag/indexer.ts (embedMany)
    • src/lib/rag/retriever.ts
    • src/lib/memory/supabase-adapter.ts
    • Any other embedding usage surfaced by rg "embed|embedMany|TEXT_EMBEDDING_DIMENSIONS".
  • Verify cache keys incorporating model ID (e.g., RAG search) include the new ID to prevent collisions.
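The cache-key point can be illustrated with a small helper. The function name and key layout are hypothetical (the repo's actual cache-key code may differ); the point is that hashing the model ID together with the query means stale 1536-d results cached under the old model can never be returned for Qwen3 queries.

```typescript
import { createHash } from "node:crypto";

// Hypothetical cache-key builder for RAG search results. Including the model
// ID in the hashed material prevents collisions between embeddings produced
// by different models (e.g. the old 1536-d model vs. Qwen3-Embedding-4B).
export function ragSearchCacheKey(modelId: string, query: string): string {
  // A NUL separator avoids ambiguity between (modelId, query) pairs.
  return createHash("sha256").update(`${modelId}\u0000${query}`).digest("hex");
}
```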

3) Database + schema migration

  • Update pgvector columns from vector(1536) → vector(2560):
    • public.accommodation_embeddings.embedding
    • public.rag_documents.embedding
    • memories.turn_embeddings.embedding
    • RPCs: match_accommodation_embeddings, match_rag_documents, hybrid_rag_search, match_turn_embeddings.
  • Rebuild HNSW/IVFFlat indexes and revisit ef_search defaults for 2560-d vectors.
  • Update Supabase types (src/lib/supabase/database.types.ts) and schema sources (src/domain/schemas/supabase.ts).
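A migration for one of the tables above might look like the sketch below (index name and nullability are assumptions). Note one caveat worth verifying before executing: pgvector's HNSW index supports at most 2,000 dimensions on the plain vector type, so a 2560-d column typically needs an expression index over a halfvec(2560) cast (or a halfvec column) rather than a direct vector HNSW index.

```sql
-- Hypothetical migration sketch for public.rag_documents (index name assumed).
-- Old 1536-d vectors are invalid at the new size, so they are nulled out here
-- and regenerated by the backfill step; assumes the column is nullable.
DROP INDEX IF EXISTS rag_documents_embedding_idx;

ALTER TABLE public.rag_documents
  ALTER COLUMN embedding TYPE vector(2560) USING NULL;

-- HNSW over a halfvec(2560) cast, since plain vector HNSW caps at 2,000 dims.
CREATE INDEX rag_documents_embedding_idx
  ON public.rag_documents
  USING hnsw ((embedding::halfvec(2560)) halfvec_cosine_ops);
```

The RPCs listed above (match_rag_documents, hybrid_rag_search, etc.) would need their vector(1536) parameter and cast sites updated in the same migration so queries hit the new index.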

4) Backfill / reindex plan

  • Provide a migration/backfill step to re-embed existing rows and rebuild indexes.
  • Reindex:
    • RAG documents
    • accommodation embeddings
    • memory turn embeddings
  • Document batching, timeouts, and rate limits to avoid runaway costs.
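The batching concern can be factored into a small pure helper like the one below. The function name is hypothetical and the batch size is a tuning assumption; the driver loop that calls embedMany per batch (with delays sized to provider rate limits) would sit around it.

```typescript
// Hypothetical backfill batching helper: splits rows into fixed-size batches
// so each embedMany call stays under provider payload and rate limits.
export function toBatches<T>(rows: T[], size: number): T[][] {
  if (size <= 0) throw new Error("batch size must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < rows.length; i += size) {
    batches.push(rows.slice(i, i + size));
  }
  return batches;
}
```

Keeping batching pure like this also makes the cost controls (batch size, pause between batches) easy to unit test independently of the embedding provider.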

5) Docs & configuration

  • Update runbooks + quick-start docs that reference 1536 dims and text-embedding-3-small.
  • Update architecture docs that mention vector(1536) or OpenAI embedding defaults.

6) Tests & telemetry

  • Update unit tests that assert 1536 dimensions and any mocks to 2560.
  • Ensure telemetry captures model ID and any provider metadata for gateway usage.
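A shared shape guard, sketched below, is one way to make the 1536 → 2560 expectation a single point of change for both tests and ingest paths. The names are assumptions, not existing code.

```typescript
// Hypothetical shared guard: unit tests and ingest code assert against one
// constant, so a future dimension change is again a one-line edit.
export const EXPECTED_EMBEDDING_DIMENSIONS = 2560;

export function assertEmbeddingShape(embedding: number[]): void {
  if (embedding.length !== EXPECTED_EMBEDDING_DIMENSIONS) {
    throw new Error(
      `expected ${EXPECTED_EMBEDDING_DIMENSIONS}-d embedding, got ${embedding.length}`,
    );
  }
  if (embedding.some((v) => !Number.isFinite(v))) {
    throw new Error("embedding contains non-finite values");
  }
}
```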

Acceptance criteria

  • All production embeddings use AI Gateway alibaba/qwen3-embedding-4b.
  • All vector columns and RPCs accept 2560-d vectors and queries pass.
  • Reranking remains Together-based and optional when TOGETHER_AI_API_KEY is absent.
  • Backfill plan documented and executed for existing data.
  • Tests and docs updated.

Risks / mitigations

  • Larger vectors increase storage/index costs and query latency; mitigate by tuning HNSW params and monitoring latency/recall.
  • Backfill time/cost; mitigate with controlled batch sizes and off-peak execution.

References

User research summary (UNVERIFIED)

  • Qwen3-Embedding-4B reportedly leads recent MTEB (English + multilingual) with strong retrieval gains over older models.
  • Qwen3 models are instruction-aware, multilingual, and support long context; 4B variant chosen for highest quality.
  • Together m2-bert-80M-32k-retrieval is considered older and uncompetitive vs 2025+ models.
