Document Chunking and Embedding Example

This example demonstrates how to chunk a document, generate embeddings, and store them in Chroma Cloud for semantic search and retrieval.

Overview

The example performs the following operations:

Ingest Mode: Chunks a document (document.txt) into smaller pieces, generates embeddings using Jina AI, and stores them in Chroma Cloud.
Query Mode: Runs a semantic search over the stored chunks using natural language queries.

Prerequisites

PHP 8.1 or higher
Chroma Cloud account with API key
Jina AI API key (for embeddings)
Composer dependencies installed (composer install)

Setup

Set your API keys as environment variables:

export CHROMA_API_KEY="your-chroma-cloud-api-key"
export JINA_API_KEY="your-jina-api-key"

Or pass them via CLI arguments (see Usage below).
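
The fallback from a CLI argument to an environment variable could be resolved along the following lines. This is a minimal sketch rather than the code in index.php: `resolveSetting()` and the `$cliOptions` array are hypothetical names, and the precedence (CLI value first, then environment) is an assumption based on the argument table in this README.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: prefer an explicit CLI value, fall back to an
 * environment variable, and fail loudly when neither is set.
 */
function resolveSetting(array $cliOptions, string $option, string $envVar): string
{
    // getenv() returns false when the variable is not defined.
    $value = $cliOptions[$option] ?? getenv($envVar);

    if ($value === false || $value === '') {
        throw new RuntimeException("Set {$envVar} or pass --{$option}.");
    }

    return (string) $value;
}
```

For example, `resolveSetting($cliOptions, 'api-key', 'CHROMA_API_KEY')` would return the flag value when one was passed and the exported variable otherwise.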

Usage

Ingest Mode

Chunk the document and store it in Chroma Cloud:

php index.php -mode ingest

With custom options:

php index.php -mode ingest \
  --api-key "your-chroma-api-key" \
  --jina-key "your-jina-api-key" \
  --tenant "my-tenant" \
  --database "my-database"

Query Mode

Search the stored documents:

php index.php -mode query --query "What happened at the Dartmouth Workshop?"

With custom options:

php index.php -mode query \
  --query "Who proposed the Turing Test?" \
  --api-key "your-chroma-api-key" \
  --jina-key "your-jina-api-key" \
  --tenant "my-tenant" \
  --database "my-database"

CLI Arguments

Argument	Description	Default	Required
`-mode`	Operation mode: `ingest` or `query`	-	Yes
`--query`	Query text for search (query mode only)	"Which event marked the birth of symbolic AI?"	No
`--api-key`	Chroma Cloud API key	`CHROMA_API_KEY` env var	Yes (flag or env var)
`--jina-key`	Jina AI API key for embeddings	`JINA_API_KEY` env var	Yes (flag or env var)
`--tenant`	Chroma Cloud tenant name	`default_tenant`	No
`--database`	Chroma Cloud database name	`default_database`	No
`--collection-name`	Collection name to use	`history_of_ai`	No

Example Queries

Try these example queries to test the semantic search:

# Historical events
php index.php -mode query --query "What happened at the Dartmouth Workshop?"

# People and contributions
php index.php -mode query --query "Who proposed the Turing Test?"

# Technical breakthroughs
php index.php -mode query --query "What was the significance of AlexNet in 2012?"

# Concepts and explanations
php index.php -mode query --query "How do Large Language Models and Generative AI work?"

# Historical figures
php index.php -mode query --query "Who is considered the first computer programmer?"

How It Works

Document Chunking

The document is chunked based on:

CHAPTER markers: New chapters create new chunks
PAGE markers: New pages create new chunks
Text accumulation: Text between markers is accumulated into chunks

Each chunk includes:

Unique ID
Document text
Metadata (chapter and page information)
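
The marker-based approach described above can be sketched as follows. This is an illustration, not the `chunkDocument()` function shipped with the example: `chunkByMarkers()` is a hypothetical name, and it assumes each marker sits on its own line beginning with `CHAPTER` or `PAGE`, which may not match the exact layout of document.txt.

```php
<?php
declare(strict_types=1);

/**
 * Sketch of marker-based chunking (assumed marker format, see above).
 * Returns a list of chunks, each with an id, document text, and metadata.
 */
function chunkByMarkers(string $text): array
{
    $chunks = [];
    $chapter = '';
    $page = '';
    $buffer = [];

    // Emit the accumulated text as one chunk, then reset the buffer.
    $flush = function () use (&$chunks, &$buffer, &$chapter, &$page): void {
        $body = trim(implode("\n", $buffer));
        if ($body !== '') {
            $chunks[] = [
                'id' => 'chunk_' . count($chunks),
                'document' => $body,
                'metadata' => ['chapter' => $chapter, 'page' => $page],
            ];
        }
        $buffer = [];
    };

    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if (str_starts_with($line, 'CHAPTER')) {
            $flush();          // a new chapter closes the current chunk
            $chapter = $line;
        } elseif (str_starts_with($line, 'PAGE')) {
            $flush();          // a new page closes the current chunk
            $page = $line;
        } else {
            $buffer[] = $line; // plain text accumulates between markers
        }
    }
    $flush(); // emit whatever follows the final marker

    return $chunks;
}
```

Empty buffers are skipped, so a `CHAPTER` line followed immediately by a `PAGE` line does not produce an empty chunk.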

Embedding Generation

Uses Jina AI's embedding function to convert text chunks into vector embeddings
Embeddings are generated in batch for efficiency
All chunks are embedded before storage
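
To illustrate what batching means here, the sketch below collects every chunk's text and makes a single embedding call for the whole list. It does not call Jina AI: `embedChunks()` is a hypothetical helper, and `$embed` is a stand-in callable assumed to take a list of texts and return one vector per text in the same order.

```php
<?php
declare(strict_types=1);

/**
 * Embed every chunk with a single call to $embed and key each vector by
 * its chunk ID. $embed is a stand-in for a real embedding function.
 */
function embedChunks(array $chunks, callable $embed): array
{
    $texts = array_column($chunks, 'document');
    $vectors = $embed($texts); // one batched call instead of one per chunk

    if (count($vectors) !== count($texts)) {
        throw new LengthException('Expected one embedding per chunk.');
    }

    return array_combine(array_column($chunks, 'id'), $vectors);
}

// Toy embedder for demonstration only: [character count, word count].
$toyEmbed = fn (array $texts): array => array_map(
    fn (string $t): array => [(float) strlen($t), (float) str_word_count($t)],
    $texts
);
```

With a real provider, a single batched request usually means fewer network round trips than embedding chunks one at a time, though providers may cap the batch size.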

Storage

Chunks are stored in a Chroma Cloud collection
The collection is recreated on each ingestion (previous data is deleted)
Each chunk maintains its metadata for filtering and context

Querying

Natural language queries are converted to embeddings using the same Jina AI function
Vector similarity search finds the most relevant chunks
Results include distance scores, documents, and metadata
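
Chroma performs the similarity search on the server, so the example never computes distances itself. For intuition only, the ranking step can be sketched locally as below. Cosine distance is used here as an illustrative assumption; the metric Chroma actually applies depends on how the collection is configured. `cosineDistance()` and `nearest()` are hypothetical helpers.

```php
<?php
declare(strict_types=1);

/** Cosine distance: 0.0 for identical directions, up to 2.0 for opposite ones. */
function cosineDistance(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $x) {
        $dot += $x * $b[$i];
        $normA += $x * $x;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        throw new InvalidArgumentException('Zero-length vectors have no direction.');
    }

    return 1.0 - $dot / (sqrt($normA) * sqrt($normB));
}

/** Return the $k stored IDs closest to the query vector, nearest first. */
function nearest(array $queryVector, array $stored, int $k): array
{
    $distances = [];
    foreach ($stored as $id => $vector) {
        $distances[$id] = cosineDistance($queryVector, $vector);
    }
    asort($distances); // ascending distance, keys preserved

    return array_slice($distances, 0, $k, true);
}
```

A smaller distance means a closer match, which is why the lowest-distance chunk appears first in the query output.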

Output

Ingest Mode

--- Chroma Cloud Example: ingest Mode ---
Tenant: default_tenant, Database: default_database
Connected to Chroma Cloud version: 0.1.0
Starting Ingestion...
Parsed 9 chunks from document.
Embedding and adding 9 items...
Ingestion Complete!

Query Mode

--- Chroma Cloud Example: query Mode ---
Tenant: default_tenant, Database: default_database
Connected to Chroma Cloud version: 0.1.0
Querying: "What happened at the Dartmouth Workshop?"

--- Results ---
[0] (Distance: 0.123)
Location: CHAPTER 1: The Dawn of Thinking Machines, PAGE 3
Content: The 1956 Dartmouth Workshop is widely considered the founding event of AI as a field. John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon brought together...
---------------------------

Customization

Using a Different Document

Replace document.txt with your own document. Chunk boundaries come from CHAPTER and PAGE markers, so include those markers wherever you want the text to be split.

Using a Different Embedding Function

Modify index.php to use a different embedding function:

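use Codewithkyrian\ChromaDB\Embeddings\OpenAIEmbeddingFunction;

// Requires an OpenAI API key. 'openai_key' is not one of the CLI arguments
// listed above, so add it to $config yourself before using this.
$ef = new OpenAIEmbeddingFunction($config['openai_key']);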

Custom Chunking Strategy

Modify the chunkDocument() function to implement your own chunking logic (e.g., by sentence, by paragraph, fixed-size chunks, etc.).
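
As one possible alternative strategy, the sketch below splits text into fixed-size word windows with an optional overlap between neighbours. `chunkByWords()` is a hypothetical function, not part of the example, and counting words rather than tokens or characters is a simplifying assumption.

```php
<?php
declare(strict_types=1);

/**
 * Split text into windows of at most $size words, where each window repeats
 * the last $overlap words of the previous one.
 */
function chunkByWords(string $text, int $size, int $overlap = 0): array
{
    if ($size < 1 || $overlap < 0 || $overlap >= $size) {
        throw new InvalidArgumentException('Require size >= 1 and 0 <= overlap < size.');
    }

    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    $step = $size - $overlap; // how far each window advances
    $chunks = [];

    for ($start = 0; $start < $total; $start += $step) {
        $chunks[] = implode(' ', array_slice($words, $start, $size));
        if ($start + $size >= $total) {
            break; // this window already reached the end of the text
        }
    }

    return $chunks;
}
```

Overlap keeps a sentence that straddles a boundary retrievable from either neighbouring chunk, at the cost of storing some words twice.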

Troubleshooting

Error: Chroma Cloud API Key is required

Set the CHROMA_API_KEY environment variable or pass the --api-key argument.

Error: Jina API Key is required

Set the JINA_API_KEY environment variable or pass the --jina-key argument.

Error: Collection not found

Run ingest mode first to create and populate the collection.

No results returned

Ensure the collection was successfully ingested
Try different query phrasings
Check that the query is related to the document content
