feat: Update embedding logic to bulk #1037
Conversation
Pull request overview
This PR refactors markdown chunk handling and embedding generation to support bulk embedding and a staging-table swap workflow intended to keep the live vector collection available during rebuilds.
Changes:
- Introduces a `MarkdownChunk` model and updates chunking/output/tests to use `Heading` + `ChunkText`.
- Adjusts markdown preprocessing to preserve paragraph separators (blank lines) for paragraph-aware chunking.
- Updates embedding upload to batch-generate embeddings and load into a staging collection before swapping it into place in PostgreSQL.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| EssentialCSharp.Chat/Program.cs | Updates chunk stats/output to use MarkdownChunk fields. |
| EssentialCSharp.Chat.Tests/MarkdownChunkingServiceTests.cs | Updates assertions for the new chunk model (ChunkText). |
| EssentialCSharp.Chat.Shared/Services/MarkdownChunkingService.cs | Preserves blank lines for paragraph-aware splitting and emits MarkdownChunk instances. |
| EssentialCSharp.Chat.Shared/Services/FileChunkingResult.cs | Adds MarkdownChunk record and changes FileChunkingResult.Chunks type accordingly. |
| EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs | Implements batch embedding + staging-then-swap upload strategy using Npgsql. |
| EssentialCSharp.Chat.Shared/Services/ChunkingResultExtensions.cs | Updates conversion to BookContentChunk using MarkdownChunk and adds deterministic IDs + ChunkIndex. |
| EssentialCSharp.Chat.Shared/Services/AISearchService.cs | Adds heading-based deduplication of vector search results. |
| EssentialCSharp.Chat.Shared/Models/BookContentChunk.cs | Adds ChunkIndex as stored metadata for chunks. |
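The heading-based deduplication noted for AISearchService could be expressed as a group-then-pick-best pass. A minimal sketch (member names such as `Record.Heading` and `Score` are assumptions about the result shape, not confirmed by this PR):

```csharp
// Hypothetical sketch: keep only the highest-scoring chunk per heading,
// then order the surviving results by descending score.
var deduplicated = searchResults
    .GroupBy(result => result.Record.Heading)
    .Select(group => group.OrderByDescending(r => r.Score).First())
    .OrderByDescending(r => r.Score)
    .ToList();
```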
BenjaminMichaelis
left a comment
Good overall approach — bulk embedding, deterministic IDs, and the staging-swap pattern are all solid improvements. A few things worth addressing before merging:
Must fix:
1. `candidates_list` (AISearchService.cs) violates the C# camelCase convention — should be `candidatesList`.
2. `NpgsqlDataSource?` is nullable/optional in the constructor, but `GenerateBookContentEmbeddingsAndUploadToVectorStore` throws immediately if it is null. This is a DI anti-pattern — either require it in the constructor or split into two classes. As-is, the service compiles and resolves but explodes only when the method is called.
3. The Azure OpenAI batch limit (~2048 inputs) is mentioned in the summary comment but not enforced. A large book could silently exceed this and fail at runtime — consider batching internally.
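The internal batching suggested in point 3 could be sketched like this (the constant name mirrors the fix applied later in this thread; the `GenerateAsync` shape is assumed, not verified against the actual embedding client):

```csharp
private const int EmbeddingBatchSize = 2048; // Azure OpenAI max inputs per request

// Hypothetical sketch: split the input texts into batches of at most
// EmbeddingBatchSize so a large book never exceeds the service limit
// in a single call.
foreach (var batch in texts.Chunk(EmbeddingBatchSize))
{
    var embeddings = await embeddingGenerator.GenerateAsync(
        batch, cancellationToken: cancellationToken);
    allEmbeddings.AddRange(embeddings);
}
```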
Should fix:
4. SQL RENAME statements use raw string interpolation with collectionName, which is caller-controlled. Even though it currently comes from a constant, consider asserting/validating the name only contains safe characters (e.g., alphanumeric + underscore) to prevent accidental SQL issues.
5. If staging.UpsertAsync(chunkList, cancellationToken) throws, the staging table is left behind. Consider wrapping in try/catch and deleting staging on failure.
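Points 4 and 5 could be addressed along these lines (a sketch only — the regex, the `_staging` suffix, and `EnsureCollectionDeletedAsync` are illustrative names, not the PR's actual API):

```csharp
// Reject any collection name that is not a safe SQL identifier before it
// is ever interpolated into RENAME statements.
if (!Regex.IsMatch(collectionName, "^[A-Za-z0-9_]+$"))
    throw new ArgumentException($"Unsafe collection name: '{collectionName}'");

try
{
    await staging.UpsertAsync(chunkList, cancellationToken);
}
catch
{
    // Best-effort cleanup so a failed upload does not leave the staging
    // table behind; the nested try ensures a cleanup failure cannot mask
    // the original exception.
    try { await staging.EnsureCollectionDeletedAsync(cancellationToken); }
    catch { }
    throw;
}
```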
Nit:
6. ExtractChapterNumber silently returning null for non-chapter files is a meaningful behavioral change from the previous InvalidOperationException — worth a code comment noting this is intentional and that callers handle null.
- Rename `candidates_list` → `candidatesList` (C# camelCase convention)
- Make `NpgsqlDataSource` required in the `EmbeddingService` constructor (always registered in DI; optional+throw was a misleading anti-pattern)
- Add an `EmbeddingBatchSize = 2048` constant and batch the `GenerateAsync` call to respect the Azure OpenAI input limit
- Validate `collectionName` against a safe-identifier regex before SQL use
- Add best-effort staging cleanup on `UpsertAsync` failure (nested try so a cleanup exception cannot mask the original)
- Document `ChapterNumber` nullability on the `BookContentChunk` property and the `ToBookContentChunks` public method

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ment
- Replace `catch { }` with `catch (Exception cleanupEx) when (cleanupEx is
not OperationCanceledException)` + Console.Error.WriteLine so cleanup
failures are visible without masking the original exception
- Correct method summary: swap uses two SQL RENAMEs (live→old, staging→live)
plus DROP TABLE statements, not "three SQL RENAMEs"
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
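The corrected swap described above — two RENAMEs (live→old, staging→live) plus DROP TABLE cleanup — amounts to roughly the following, sketched with Npgsql against an already-validated `collectionName` (table names and the `_old`/`_staging` suffixes are assumptions):

```csharp
// Illustrative swap: move the live table out of the way, promote the
// staging table, then drop the old copy, all in one transaction so the
// collection is never missing mid-swap.
await using var cmd = dataSource.CreateCommand($"""
    BEGIN;
    ALTER TABLE IF EXISTS "{collectionName}" RENAME TO "{collectionName}_old";
    ALTER TABLE "{collectionName}_staging" RENAME TO "{collectionName}";
    DROP TABLE IF EXISTS "{collectionName}_old";
    COMMIT;
    """);
await cmd.ExecuteNonQueryAsync(cancellationToken);
```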
Force-pushed from 078bc93 to d25e8b8
EmbeddingService constructor now requires NpgsqlDataSource (no longer optional). Tests that construct EmbeddingService for AISearchService scenarios must supply a mock, even though upload functionality is not exercised.
… isolation

NpgsqlDataSource is an abstract class with no default constructor, so Moq/Castle.DynamicProxy cannot create a proxy for it. Since the upload path (staging-swap) is the only code that needs the data source, and AISearchServiceTests only exercise the search path, making the parameter optional (nullable) with a lazy null check fixes test isolation without weakening production safety.
- EmbeddingService: embed and assign in a single batch pass instead of collecting all embeddings then assigning; avoids holding the texts list and all embeddings in memory simultaneously for large uploads
- Program.cs WriteChunkingResult: guard against an empty Chunks list before calling Average/Max/Min (these throw InvalidOperationException on an empty sequence)
- AISearchServiceTests: add a dedup-by-heading test verifying that only the highest-scoring chunk per heading is kept and results are ordered by descending score
@copilot resolve the merge conflicts in this pull request
Addressed in
Merge conflicts are resolved in
- appsettings.json: remove AllowedMcpTools (dead config — the MCP client is commented out; wire it up with an AIOptions property when MCP is activated)
- EmbeddingService: merge embed+upsert into a single batch loop so peak memory is bounded to one batch instead of the full dataset; update doc comments to reflect the 4-step flow (was 5)
- Program.cs WriteChunkingResult: remove the redundant [Heading] line since ChunkText already starts with the heading prefix from TextChunker
Force-pushed from 9cc6ada to 74ed475
Keep detailed SystemPrompt from PR branch; drop AllowedMcpTools (dead config - MCP client is commented out, removed in prior commit).
…mment

- appsettings.json: revert SystemPrompt to main's value (the prompt change is out of scope for this PR); restore the AllowedMcpTools list, which IS consumed by AIChatService — removing it silently denied all MCP tools
- EmbeddingService: correct the batch-loop comment to accurately state that all chunks are materialized upfront via ToList(), while only embedding vectors are bounded per batch
…ndex schema migration

- EmbeddingService: replace bookContents.ToList() + Chunk() with a streaming buffer pattern (List&lt;BookContentChunk&gt;(EmbeddingBatchSize)). Each batch is filled from the IEnumerable, embedded, and upserted before moving on, so only one batch of chunks + vectors lives in memory at a time. Track the total count with a running int.
- BookContentChunk: add &lt;remarks&gt; to ChunkIndex documenting that it is a new column requiring a collection rebuild on existing deployments.
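The streaming buffer pattern described in this commit could look roughly like the following (a sketch; `EmbedAndUpsertBatchAsync` is a hypothetical helper standing in for the embed+upsert step):

```csharp
var buffer = new List<BookContentChunk>(EmbeddingBatchSize);
var totalCount = 0;

// bookContents is an IEnumerable<BookContentChunk> and is never fully
// materialized; only one batch of chunks + vectors is alive at a time.
foreach (var chunk in bookContents)
{
    buffer.Add(chunk);
    if (buffer.Count == EmbeddingBatchSize)
    {
        await EmbedAndUpsertBatchAsync(buffer, cancellationToken);
        totalCount += buffer.Count;
        buffer.Clear();
    }
}

if (buffer.Count > 0) // flush the final partial batch
{
    await EmbedAndUpsertBatchAsync(buffer, cancellationToken);
    totalCount += buffer.Count;
}
```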