feat: Update embedding logic to bulk #1037
Conversation
Pull request overview
This PR refactors markdown chunk handling and embedding generation to support bulk embedding and a staging-table swap workflow intended to keep the live vector collection available during rebuilds.
Changes:
- Introduces a `MarkdownChunk` model and updates chunking/output/tests to use `Heading` + `ChunkText`.
- Adjusts markdown preprocessing to preserve paragraph separators (blank lines) for paragraph-aware chunking.
- Updates embedding upload to batch-generate embeddings and load into a staging collection before swapping it into place in PostgreSQL.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| EssentialCSharp.Chat/Program.cs | Updates chunk stats/output to use MarkdownChunk fields. |
| EssentialCSharp.Chat.Tests/MarkdownChunkingServiceTests.cs | Updates assertions for the new chunk model (ChunkText). |
| EssentialCSharp.Chat.Shared/Services/MarkdownChunkingService.cs | Preserves blank lines for paragraph-aware splitting and emits MarkdownChunk instances. |
| EssentialCSharp.Chat.Shared/Services/FileChunkingResult.cs | Adds MarkdownChunk record and changes FileChunkingResult.Chunks type accordingly. |
| EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs | Implements batch embedding + staging-then-swap upload strategy using Npgsql. |
| EssentialCSharp.Chat.Shared/Services/ChunkingResultExtensions.cs | Updates conversion to BookContentChunk using MarkdownChunk and adds deterministic IDs + ChunkIndex. |
| EssentialCSharp.Chat.Shared/Services/AISearchService.cs | Adds heading-based deduplication of vector search results. |
| EssentialCSharp.Chat.Shared/Models/BookContentChunk.cs | Adds ChunkIndex as stored metadata for chunks. |
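The heading-based deduplication noted for AISearchService could be expressed as a group-then-pick-best pass. A minimal sketch (member names such as `Record.Heading` and `Score` are assumptions about the result shape, not confirmed by this PR):

```csharp
// Hypothetical sketch: keep only the highest-scoring chunk per heading,
// then order the surviving results by descending score.
var deduplicated = searchResults
    .GroupBy(result => result.Record.Heading)
    .Select(group => group.OrderByDescending(r => r.Score).First())
    .OrderByDescending(r => r.Score)
    .ToList();
```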
BenjaminMichaelis
left a comment
Good overall approach — bulk embedding, deterministic IDs, and the staging-swap pattern are all solid improvements. A few things worth addressing before merging:
Must fix:
1. `candidates_list` (AISearchService.cs) violates the C# camelCase convention — should be `candidatesList`.
2. `NpgsqlDataSource?` is nullable/optional in the constructor, but `GenerateBookContentEmbeddingsAndUploadToVectorStore` throws immediately if it is null. This is a DI anti-pattern — either require it in the constructor or split into two classes. As-is, the service compiles and resolves but explodes only when the method is called.
3. The Azure OpenAI batch limit (~2048 inputs) is mentioned in the summary comment but not enforced. A large book could silently exceed this and fail at runtime — consider batching internally.
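The internal batching suggested in point 3 could be sketched like this (the constant name mirrors the fix applied later in this thread; the `GenerateAsync` shape is assumed, not verified against the actual embedding client):

```csharp
private const int EmbeddingBatchSize = 2048; // Azure OpenAI max inputs per request

// Hypothetical sketch: split the input texts into batches of at most
// EmbeddingBatchSize so a large book never exceeds the service limit
// in a single call.
foreach (var batch in texts.Chunk(EmbeddingBatchSize))
{
    var embeddings = await embeddingGenerator.GenerateAsync(
        batch, cancellationToken: cancellationToken);
    allEmbeddings.AddRange(embeddings);
}
```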
Should fix:
4. SQL RENAME statements use raw string interpolation with collectionName, which is caller-controlled. Even though it currently comes from a constant, consider asserting/validating the name only contains safe characters (e.g., alphanumeric + underscore) to prevent accidental SQL issues.
5. If staging.UpsertAsync(chunkList, cancellationToken) throws, the staging table is left behind. Consider wrapping in try/catch and deleting staging on failure.
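Points 4 and 5 could be addressed along these lines (a sketch only — the regex, the `_staging` suffix, and `EnsureCollectionDeletedAsync` are illustrative names, not the PR's actual API):

```csharp
// Reject any collection name that is not a safe SQL identifier before it
// is ever interpolated into RENAME statements.
if (!Regex.IsMatch(collectionName, "^[A-Za-z0-9_]+$"))
    throw new ArgumentException($"Unsafe collection name: '{collectionName}'");

try
{
    await staging.UpsertAsync(chunkList, cancellationToken);
}
catch
{
    // Best-effort cleanup so a failed upload does not leave the staging
    // table behind; the nested try ensures a cleanup failure cannot mask
    // the original exception.
    try { await staging.EnsureCollectionDeletedAsync(cancellationToken); }
    catch { }
    throw;
}
```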
Nit:
6. ExtractChapterNumber silently returning null for non-chapter files is a meaningful behavioral change from the previous InvalidOperationException — worth a code comment noting this is intentional and that callers handle null.
- Rename `candidates_list` → `candidatesList` (C# camelCase convention)
- Make `NpgsqlDataSource` required in the `EmbeddingService` constructor (always registered in DI; optional+throw was a misleading anti-pattern)
- Add an `EmbeddingBatchSize = 2048` constant and batch the `GenerateAsync` call to respect the Azure OpenAI input limit
- Validate `collectionName` against a safe-identifier regex before SQL use
- Add best-effort staging cleanup on `UpsertAsync` failure (nested try so a cleanup exception cannot mask the original)
- Document `ChapterNumber` nullability on the `BookContentChunk` property and the `ToBookContentChunks` public method

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ment
- Replace `catch { }` with `catch (Exception cleanupEx) when (cleanupEx is
not OperationCanceledException)` + Console.Error.WriteLine so cleanup
failures are visible without masking the original exception
- Correct method summary: swap uses two SQL RENAMEs (live→old, staging→live)
plus DROP TABLE statements, not "three SQL RENAMEs"
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
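The corrected swap described above — two RENAMEs (live→old, staging→live) plus DROP TABLE cleanup — amounts to roughly the following, sketched with Npgsql against an already-validated `collectionName` (table names and the `_old`/`_staging` suffixes are assumptions):

```csharp
// Illustrative swap: move the live table out of the way, promote the
// staging table, then drop the old copy, all in one transaction so the
// collection is never missing mid-swap.
await using var cmd = dataSource.CreateCommand($"""
    BEGIN;
    ALTER TABLE IF EXISTS "{collectionName}" RENAME TO "{collectionName}_old";
    ALTER TABLE "{collectionName}_staging" RENAME TO "{collectionName}";
    DROP TABLE IF EXISTS "{collectionName}_old";
    COMMIT;
    """);
await cmd.ExecuteNonQueryAsync(cancellationToken);
```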
Force-pushed from 078bc93 to d25e8b8
EmbeddingService constructor now requires NpgsqlDataSource (no longer optional). Tests that construct EmbeddingService for AISearchService scenarios must supply a mock, even though upload functionality is not exercised.
… isolation

NpgsqlDataSource is an abstract class with no default constructor, so Moq/Castle.DynamicProxy cannot create a proxy for it. Since the upload path (staging-swap) is the only code that needs the data source, and AISearchServiceTests only exercise the search path, making the parameter optional (nullable) with a lazy null check fixes test isolation without weakening production safety.
- EmbeddingService: embed and assign in a single batch pass instead of collecting all embeddings then assigning; avoids holding the texts list and all embeddings in memory simultaneously for large uploads
- Program.cs WriteChunkingResult: guard against an empty Chunks list before calling Average/Max/Min (these throw InvalidOperationException on an empty sequence)
- AISearchServiceTests: add a dedup-by-heading test verifying that only the highest-scoring chunk per heading is kept and results are ordered by descending score
@copilot resolve the merge conflicts in this pull request
Addressed in
Merge conflicts are resolved in
- appsettings.json: remove AllowedMcpTools (dead config — the MCP client is commented out; wire it up with an AIOptions property when MCP is activated)
- EmbeddingService: merge embed+upsert into a single batch loop so peak memory is bounded to one batch instead of the full dataset; update doc comments to reflect the 4-step flow (was 5)
- Program.cs WriteChunkingResult: remove the redundant [Heading] line since ChunkText already starts with the heading prefix from TextChunker
Force-pushed from 9cc6ada to 74ed475
Keep detailed SystemPrompt from PR branch; drop AllowedMcpTools (dead config - MCP client is commented out, removed in prior commit).
…mment

- appsettings.json: revert SystemPrompt to main's value (the prompt change is out of scope for this PR); restore the AllowedMcpTools list, which IS consumed by AIChatService — removing it silently denied all MCP tools
- EmbeddingService: correct the batch-loop comment to accurately state that all chunks are materialized upfront via ToList(), while only embedding vectors are bounded per batch
…ndex schema migration

- EmbeddingService: replace bookContents.ToList() + Chunk() with a streaming buffer pattern (List&lt;BookContentChunk&gt;(EmbeddingBatchSize)). Each batch is filled from the IEnumerable, embedded, and upserted before moving on, so only one batch of chunks + vectors lives in memory at a time. Track the total count with a running int.
- BookContentChunk: add &lt;remarks&gt; to ChunkIndex documenting that it is a new column requiring a collection rebuild on existing deployments.
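The streaming buffer pattern described in this commit could look roughly like the following (a sketch; `EmbedAndUpsertBatchAsync` is a hypothetical helper standing in for the embed+upsert step):

```csharp
var buffer = new List<BookContentChunk>(EmbeddingBatchSize);
var totalCount = 0;

// bookContents is an IEnumerable<BookContentChunk> and is never fully
// materialized; only one batch of chunks + vectors is alive at a time.
foreach (var chunk in bookContents)
{
    buffer.Add(chunk);
    if (buffer.Count == EmbeddingBatchSize)
    {
        await EmbedAndUpsertBatchAsync(buffer, cancellationToken);
        totalCount += buffer.Count;
        buffer.Clear();
    }
}

if (buffer.Count > 0) // flush the final partial batch
{
    await EmbedAndUpsertBatchAsync(buffer, cancellationToken);
    totalCount += buffer.Count;
}
```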