| sidebar_label | Embeddings |
|---|
This guide demonstrates how to use the built-in embedding functionality provided by langchain-oceanbase. The built-in embedding uses the all-MiniLM-L6-v2 model (384 dimensions) and requires no external API keys, making it perfect for quick prototyping and testing.
The built-in embedding functionality requires the optional pyseekdb extra, which uses ONNX runtime for local inference:
pip install -qU "langchain-oceanbase[pyseekdb]"DefaultEmbeddingFunction is the default embedding function provided by pyseekdb and can be used directly:
from langchain_oceanbase import DefaultEmbeddingFunction
# Create embedding function
ef = DefaultEmbeddingFunction()
print(f"Embedding dimension: {ef.dimension}") # 384
# Generate embedding for a single text
single_text = "Hello world"
single_embedding = ef(single_text)
print(f"Single text embedding count: {len(single_embedding)}")
print(f"Single text embedding dimension: {len(single_embedding[0]) if single_embedding else 0}")
# Generate embeddings for multiple texts (batch processing)
texts = ["Hello world", "How are you?", "Python programming", "Machine learning"]
embeddings = ef(texts)
print(f"\nBatch text count: {len(texts)}")
print(f"Generated embedding count: {len(embeddings)}")
print(f"Each embedding dimension: {len(embeddings[0]) if embeddings else 0}")DefaultEmbeddingFunctionAdapter implements LangChain's Embeddings interface, allowing seamless integration into the LangChain ecosystem:
from langchain_oceanbase.embedding_utils import DefaultEmbeddingFunctionAdapter
# Create adapter
adapter = DefaultEmbeddingFunctionAdapter()
print(f"Embedding dimension: {adapter.dimension}")
# Generate embeddings for multiple documents (batch)
documents = ["Hello world", "How are you?", "Python programming"]
doc_embeddings = adapter.embed_documents(documents)
print(f"\nDocument count: {len(documents)}")
print(f"Generated document embedding count: {len(doc_embeddings)}")
print(f"Each document embedding dimension: {len(doc_embeddings[0]) if doc_embeddings else 0}")
# Generate embedding for a query (single text)
query = "Hello"
query_embedding = adapter.embed_query(query)
print(f"\nQuery text: '{query}'")
print(f"Query embedding dimension: {len(query_embedding)}")
print(f"Query embedding type: {type(query_embedding)}") # Should be a 1D listThe simplest way is to set embedding_function=None when creating OceanbaseVectorStore, and the system will automatically use the default embedding function:
from langchain_oceanbase.vectorstores import OceanbaseVectorStore
from langchain_core.documents import Document
# Connection configuration
connection_args = {
"host": "127.0.0.1",
"port": "2881",
"user": "root@test",
"password": "",
"db_name": "test",
}
# Use default embedding (set embedding_function=None)
vector_store = OceanbaseVectorStore(
embedding_function=None, # Automatically uses DefaultEmbeddingFunction
table_name="embedding_demo",
connection_args=connection_args,
vidx_metric_type="l2",
drop_old=True,
embedding_dim=384, # all-MiniLM-L6-v2 dimension
)
print("✓ Vector store created successfully, using default embedding")Add documents and perform similarity search using the default embedding:
# Add documents
documents = [
Document(page_content="Machine learning is a subset of artificial intelligence"),
Document(page_content="Python is a popular programming language"),
Document(page_content="OceanBase is a distributed relational database"),
Document(page_content="Deep learning uses neural networks for pattern recognition"),
]
ids = vector_store.add_documents(documents)
print(f"✓ Successfully added {len(ids)} documents")
# Perform similarity search
results = vector_store.similarity_search("artificial intelligence", k=2)
print(f"\nSimilarity search for 'artificial intelligence' returned {len(results)} results:")
for i, doc in enumerate(results, 1):
print(f"{i}. {doc.page_content}")Use embeddings to compute similarity between texts:
from langchain_oceanbase.embedding_utils import DefaultEmbeddingFunctionAdapter
import math
# Create adapter
adapter = DefaultEmbeddingFunctionAdapter()
# Define similar and dissimilar texts
text1 = "Machine learning is a subset of artificial intelligence"
text2 = "AI and machine learning are related technologies"
text3 = "The weather today is sunny and warm"
# Generate embeddings
emb1 = adapter.embed_query(text1)
emb2 = adapter.embed_query(text2)
emb3 = adapter.embed_query(text3)
# Compute cosine similarity
def cosine_similarity(vec1, vec2):
"""Compute cosine similarity between two vectors"""
dot_product = sum(a * b for a, b in zip(vec1, vec2))
norm1 = math.sqrt(sum(a * a for a in vec1))
norm2 = math.sqrt(sum(b * b for b in vec2))
return dot_product / (norm1 * norm2) if norm1 * norm2 > 0 else 0.0
sim_12 = cosine_similarity(emb1, emb2)
sim_13 = cosine_similarity(emb1, emb3)
print(f"Text 1: '{text1[:50]}...'")
print(f"Text 2: '{text2[:50]}...'")
print(f"Text 3: '{text3[:50]}...'")
print(f"\nSimilarity between Text 1 and Text 2: {sim_12:.4f}")
print(f"Similarity between Text 1 and Text 3: {sim_13:.4f}")
print(f"\n✓ Related text similarity ({sim_12:.4f}) > Unrelated text similarity ({sim_13:.4f})")DefaultEmbeddingFunctionAdapter supports custom model names and ONNX runtime providers:
# Use default configuration
adapter_default = DefaultEmbeddingFunctionAdapter()
print(f"Default configuration - Dimension: {adapter_default.dimension}")
# Custom model name (currently only 'all-MiniLM-L6-v2' is supported)
adapter_custom = DefaultEmbeddingFunctionAdapter(
model_name="all-MiniLM-L6-v2",
preferred_providers=["CPUExecutionProvider"] # Specify CPU usage
)
print(f"Custom configuration - Dimension: {adapter_custom.dimension}")
# Test embedding generation
texts = ["Test embedding generation"]
embeddings_default = adapter_default.embed_documents(texts)
embeddings_custom = adapter_custom.embed_documents(texts)
print(f"\nDefault configuration embedding dimension: {len(embeddings_default[0])}")
print(f"Custom configuration embedding dimension: {len(embeddings_custom[0])}")Demonstrate the performance advantage of batch processing over single processing:
import time
from langchain_oceanbase.embedding_utils import DefaultEmbeddingFunctionAdapter
adapter = DefaultEmbeddingFunctionAdapter()
texts = ["Text " + str(i) for i in range(100)] # 100 texts
# Method 1: Batch processing
start_time = time.time()
batch_embeddings = adapter.embed_documents(texts)
batch_time = time.time() - start_time
print(f"Batch processing {len(texts)} texts took: {batch_time:.4f} seconds")
print(f"Average per text: {batch_time/len(texts)*1000:.2f} milliseconds")
# Method 2: Single processing (not recommended)
start_time = time.time()
single_embeddings = [adapter.embed_query(text) for text in texts]
single_time = time.time() - start_time
print(f"\nSingle processing {len(texts)} texts took: {single_time:.4f} seconds")
print(f"Average per text: {single_time/len(texts)*1000:.2f} milliseconds")
print(f"\n✓ Batch processing is {single_time/batch_time:.2f}x faster than single processing")Build a complete RAG (Retrieval Augmented Generation) application using the default embedding:
from langchain_oceanbase.vectorstores import OceanbaseVectorStore
from langchain_core.documents import Document
# Create vector store using default embedding
rag_vector_store = OceanbaseVectorStore(
embedding_function=None, # Use default embedding
table_name="rag_knowledge_base",
connection_args=connection_args,
vidx_metric_type="l2",
drop_old=True,
embedding_dim=384,
)
# Add knowledge base documents
knowledge_docs = [
Document(
page_content="LangChain is a framework for developing applications powered by language models",
metadata={"source": "langchain_docs", "topic": "framework"}
),
Document(
page_content="OceanBase is a distributed relational database developed by Ant Group",
metadata={"source": "oceanbase_docs", "topic": "database"}
),
Document(
page_content="Vector databases are specialized databases for storing and searching vector embeddings",
metadata={"source": "vector_db_docs", "topic": "database"}
),
Document(
page_content="Hybrid search combines multiple search modalities for better results",
metadata={"source": "search_docs", "topic": "search"}
),
]
ids = rag_vector_store.add_documents(knowledge_docs)
print(f"✓ Knowledge base created with {len(ids)} documents")
# Create retriever
retriever = rag_vector_store.as_retriever(search_kwargs={"k": 2})
# Perform retrieval
query = "What is LangChain?"
results = retriever.invoke(query)
print(f"\nQuery: '{query}'")
print(f"Retrieved {len(results)} relevant documents:")
for i, doc in enumerate(results, 1):
print(f"\n{i}. {doc.page_content}")
print(f" Source: {doc.metadata.get('source', 'N/A')}")
print(f" Topic: {doc.metadata.get('topic', 'N/A')}")- No API Keys Required: Uses local ONNX models, no external API calls needed
- Quick Start: Perfect for rapid prototyping and testing
- LangChain Compatible: Fully compatible with LangChain's
Embeddingsinterface - Batch Processing: Supports efficient batch embedding generation
- Automatic Integration: Can be automatically used in
OceanbaseVectorStore
- Model: all-MiniLM-L6-v2
- Dimension: 384
- Inference Engine: ONNX Runtime
- Supported Platforms: CPU (default), optional GPU
- Rapid prototyping
- Local development and testing
- Scenarios that don't require high-precision embeddings
- Applications that need to run offline
- Cost-sensitive applications
- Requires
pyseekdbpackage to be installed - Model files will be downloaded on first use (approximately 80MB)
- For production environments, consider using more powerful embedding models (e.g., OpenAI, DashScope, etc.)