feat: Update embedding logic to bulk #1037

Merged
BenjaminMichaelis merged 11 commits into main from bmichaelis/EmbeddingUpdates on May 10, 2026
@BenjaminMichaelis
Member

Description

Describe your changes here.

Fixes #Issue_Number (if available)

Ensure that your pull request has followed all the steps below:

  • Code compilation
  • Created tests which fail without the change (if possible)
  • All tests passing
  • Extended the README / documentation, if necessary

Copilot AI review requested due to automatic review settings April 26, 2026 15:21
Comment thread EssentialCSharp.Chat.Shared/Services/MarkdownChunkingService.cs Fixed
Contributor

Copilot AI left a comment

Pull request overview

This PR refactors markdown chunk handling and embedding generation to support bulk embedding and a staging-table swap workflow intended to keep the live vector collection available during rebuilds.

Changes:

  • Introduces a MarkdownChunk model and updates chunking/output/tests to use Heading + ChunkText.
  • Adjusts markdown preprocessing to preserve paragraph separators (blank lines) for paragraph-aware chunking.
  • Updates embedding upload to batch-generate embeddings and load into a staging collection before swapping it into place in PostgreSQL.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| EssentialCSharp.Chat/Program.cs | Updates chunk stats/output to use MarkdownChunk fields. |
| EssentialCSharp.Chat.Tests/MarkdownChunkingServiceTests.cs | Updates assertions for the new chunk model (ChunkText). |
| EssentialCSharp.Chat.Shared/Services/MarkdownChunkingService.cs | Preserves blank lines for paragraph-aware splitting and emits MarkdownChunk instances. |
| EssentialCSharp.Chat.Shared/Services/FileChunkingResult.cs | Adds MarkdownChunk record and changes FileChunkingResult.Chunks type accordingly. |
| EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs | Implements batch embedding + staging-then-swap upload strategy using Npgsql. |
| EssentialCSharp.Chat.Shared/Services/ChunkingResultExtensions.cs | Updates conversion to BookContentChunk using MarkdownChunk and adds deterministic IDs + ChunkIndex. |
| EssentialCSharp.Chat.Shared/Services/AISearchService.cs | Adds heading-based deduplication of vector search results. |
| EssentialCSharp.Chat.Shared/Models/BookContentChunk.cs | Adds ChunkIndex as stored metadata for chunks. |

Comment thread EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs Outdated
Comment thread EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs
Comment thread EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs Outdated
Comment thread EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs Outdated
Comment thread EssentialCSharp.Chat.Shared/Services/AISearchService.cs Outdated
Member Author

@BenjaminMichaelis BenjaminMichaelis left a comment

Good overall approach — bulk embedding, deterministic IDs, and the staging-swap pattern are all solid improvements. A few things worth addressing before merging:

Must fix:

  1. candidates_list (AISearchService.cs) violates C# camelCase convention — should be candidatesList.
  2. NpgsqlDataSource? is nullable/optional in the constructor but GenerateBookContentEmbeddingsAndUploadToVectorStore throws immediately if null. This is a DI anti-pattern — either require it in the constructor or split into two classes. As-is, the service compiles and resolves but explodes only when the method is called.
  3. Azure OpenAI batch limit (~2048 inputs) is mentioned in the summary comment but not enforced. A large book could silently exceed this and fail at runtime — consider batching internally.

Should fix:

  4. SQL RENAME statements use raw string interpolation with collectionName, which is caller-controlled. Even though it currently comes from a constant, consider asserting/validating the name only contains safe characters (e.g., alphanumeric + underscore) to prevent accidental SQL issues.
  5. If staging.UpsertAsync(chunkList, cancellationToken) throws, the staging table is left behind. Consider wrapping in try/catch and deleting staging on failure.

Nit:

  6. ExtractChapterNumber silently returning null for non-chapter files is a meaningful behavioral change from the previous InvalidOperationException — worth a code comment noting this is intentional and that callers handle null.
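
For illustration, the identifier check suggested in item 4 could look roughly like the sketch below. This is not the code in the PR: `IsSafeIdentifier` and `BuildRenameSql` are hypothetical names, the table names are made up, and the 63-character cap assumes PostgreSQL's default identifier length limit.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: accept only ASCII letters, digits, and underscores,
// not starting with a digit, and at most 63 characters long.
static bool IsSafeIdentifier(string name) =>
    Regex.IsMatch(name, @"\A[A-Za-z_][A-Za-z0-9_]{0,62}\z");

// Hypothetical helper: builds the rename statement only after validation.
// Identifiers cannot be sent as SQL parameters, so validation is the guard.
static string BuildRenameSql(string oldName, string newName)
{
    if (!IsSafeIdentifier(oldName) || !IsSafeIdentifier(newName))
    {
        throw new ArgumentException("Table name contains unsafe characters.");
    }
    return $"ALTER TABLE \"{oldName}\" RENAME TO \"{newName}\";";
}

Console.WriteLine(BuildRenameSql("book_chunks_staging", "book_chunks"));
// prints: ALTER TABLE "book_chunks_staging" RENAME TO "book_chunks";
```

Rejecting the name outright, rather than trying to escape it, keeps the rule easy to audit when the value only ever comes from a small set of known collection names.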

Comment thread EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs Fixed
Comment thread EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs Fixed
BenjaminMichaelis and others added 3 commits May 6, 2026 20:52
- Rename candidates_list → candidatesList (C# camelCase convention)
- Make NpgsqlDataSource required in EmbeddingService constructor (always
  registered in DI; optional+throw was misleading anti-pattern)
- Add EmbeddingBatchSize = 2048 constant and batch the GenerateAsync call
  to respect Azure OpenAI input limit
- Validate collectionName against safe identifier regex before SQL use
- Add best-effort staging cleanup on UpsertAsync failure (nested try so
  cleanup exception cannot mask the original)
- Document ChapterNumber nullability on BookContentChunk property and
  ToBookContentChunks public method
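
A minimal sketch of the batching idea from the commit message above, assuming the inputs are already in memory. `FakeEmbed` is a hypothetical stand-in for the real embedding call, and the constant simply reuses the 2048 value mentioned in the commit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

const int EmbeddingBatchSize = 2048; // value taken from the commit message

// Hypothetical stand-in for the embedding call: one fake vector per input.
static float[][] FakeEmbed(IReadOnlyList<string> inputs) =>
    inputs.Select(text => new[] { (float)text.Length }).ToArray();

var texts = Enumerable.Range(0, 5000).Select(i => $"chunk {i}").ToList();
var vectors = new List<float[]>(texts.Count);
var requestCount = 0;

// Enumerable.Chunk (.NET 6+) yields arrays of at most EmbeddingBatchSize items,
// so no single request exceeds the limit.
foreach (string[] batch in texts.Chunk(EmbeddingBatchSize))
{
    vectors.AddRange(FakeEmbed(batch));
    requestCount++;
}

Console.WriteLine($"{vectors.Count} vectors in {requestCount} requests");
// prints: 5000 vectors in 3 requests
```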

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ment

- Replace `catch { }` with `catch (Exception cleanupEx) when (cleanupEx is
  not OperationCanceledException)` + Console.Error.WriteLine so cleanup
  failures are visible without masking the original exception
- Correct method summary: swap uses two SQL RENAMEs (live→old, staging→live)
  plus DROP TABLE statements, not "three SQL RENAMEs"
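
The cleanup pattern described in these two commits (a nested try whose failure is logged but never replaces the original exception) can be sketched as follows. `UpsertAsync`, `DropStagingAsync`, and `UploadAsync` are hypothetical stand-ins that always fail, used only to show which exception reaches the caller.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-ins for the upsert and the staging-table drop; both fail.
static Task UpsertAsync() => throw new InvalidOperationException("upsert failed");
static Task DropStagingAsync() => throw new TimeoutException("drop failed");

static async Task UploadAsync()
{
    try
    {
        await UpsertAsync();
    }
    catch
    {
        try
        {
            await DropStagingAsync(); // best-effort cleanup of the staging table
        }
        catch (Exception cleanupEx) when (cleanupEx is not OperationCanceledException)
        {
            // Make the cleanup failure visible without replacing the original error.
            Console.Error.WriteLine($"Staging cleanup failed: {cleanupEx.Message}");
        }
        throw; // rethrows the original upsert exception
    }
}

string surfaced = "";
try { await UploadAsync(); }
catch (Exception ex) { surfaced = ex.Message; }

Console.WriteLine(surfaced);
// prints: upsert failed
```

The `when` filter lets a cancellation raised during cleanup propagate instead of being swallowed along with other cleanup errors.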

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@BenjaminMichaelis BenjaminMichaelis force-pushed the bmichaelis/EmbeddingUpdates branch from 078bc93 to d25e8b8 Compare May 7, 2026 03:53
EmbeddingService constructor now requires NpgsqlDataSource (no longer
optional). Tests that construct EmbeddingService for AISearchService
scenarios must supply a mock, even though upload functionality is not
exercised.
… isolation

NpgsqlDataSource is an abstract class with no default constructor, so
Moq/Castle.DynamicProxy cannot create a proxy for it. Since the upload
path (staging-swap) is the only code that needs the data source, and
AISearchServiceTests only exercise the search path, making the parameter
optional (nullable) with a lazy null check fixes test isolation without
weakening production safety.
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Comment thread EssentialCSharp.Chat/Program.cs
Comment thread EssentialCSharp.Chat.Shared/Services/AISearchService.cs
Comment thread EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs Outdated
- EmbeddingService: embed and assign in a single batch pass instead of
  collecting all embeddings then assigning; avoids holding texts list
  and all embeddings in memory simultaneously for large uploads

- Program.cs WriteChunkingResult: guard against empty Chunks list before
  calling Average/Max/Min (throws InvalidOperationException on empty)

- AISearchServiceTests: add dedup-by-heading test verifying that only
  the highest-scoring chunk per heading is kept and results are ordered
  by descending score
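
As a rough illustration of the behavior the new dedup test verifies, the sketch below keeps the highest-scoring hit per heading and orders the survivors by descending score. The tuple shape and sample data are assumptions made for the example (the real result type is not shown in this thread), and it assumes a higher score means a better match.

```csharp
using System;
using System.Linq;

// Hypothetical search hits as (Heading, Text, Score) tuples.
var hits = new[]
{
    (Heading: "Generics", Text: "intro", Score: 0.71),
    (Heading: "Delegates", Text: "lambdas", Score: 0.88),
    (Heading: "Generics", Text: "constraints", Score: 0.93),
};

var deduped = hits
    .GroupBy(hit => hit.Heading)                     // one group per heading
    .Select(group => group.MaxBy(hit => hit.Score))  // best hit in each group (.NET 6+)
    .OrderByDescending(hit => hit.Score)             // best overall first
    .ToList();

foreach (var hit in deduped)
{
    Console.WriteLine($"{hit.Heading}: {hit.Text}");
}
// prints:
// Generics: constraints
// Delegates: lambdas
```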
@BenjaminMichaelis BenjaminMichaelis requested a review from Copilot May 9, 2026 05:47
@BenjaminMichaelis
Member Author

@copilot resolve the merge conflicts in this pull request

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment thread EssentialCSharp.Web/appsettings.json
Comment thread EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs Outdated
Comment thread EssentialCSharp.Chat/Program.cs
Contributor

Copilot AI commented May 9, 2026

Good overall approach — bulk embedding, deterministic IDs, and the staging-swap pattern are all solid improvements. A few things worth addressing before merging:

Must fix:...

Addressed in 9cc6ada: constructor now requires NpgsqlDataSource (no nullable runtime failure path), embedding batching is enforced with provider-safe chunking, collection-name safety validation and staging cleanup are in place, and naming/comment consistency updates are included.

@copilot resolve the merge conflicts in this pull request

Merge conflicts are resolved in 9cc6ada by merging origin/main into this branch and fixing the remaining conflict in EssentialCSharp.Web/appsettings.json.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • api.hcaptcha.com
    • Triggering command: /home/REDACTED/work/EssentialCSharp.Web/EssentialCSharp.Web/EssentialCSharp.Web.Tests/bin/Release/net10.0/EssentialCSharp.Web.Tests /home/REDACTED/work/EssentialCSharp.Web/EssentialCSharp.Web/EssentialCSharp.Web.Tests/bin/Release/net10.0/EssentialCSharp.Web.Tests --server dotnettestcli --dotnet-test-pipe /tmp/aab9542ad3c74703ad2bcad1fd33ebe5 (dns block)
  • api.pwnedpasswords.com
    • Triggering command: /home/REDACTED/work/EssentialCSharp.Web/EssentialCSharp.Web/EssentialCSharp.Web.Tests/bin/Release/net10.0/EssentialCSharp.Web.Tests /home/REDACTED/work/EssentialCSharp.Web/EssentialCSharp.Web/EssentialCSharp.Web.Tests/bin/Release/net10.0/EssentialCSharp.Web.Tests --server dotnettestcli --dotnet-test-pipe /tmp/075ee2601b764926b7db9d7940c29f1d (dns block)
    • Triggering command: /home/REDACTED/work/EssentialCSharp.Web/EssentialCSharp.Web/EssentialCSharp.Web.Tests/bin/Release/net10.0/EssentialCSharp.Web.Tests /home/REDACTED/work/EssentialCSharp.Web/EssentialCSharp.Web/EssentialCSharp.Web.Tests/bin/Release/net10.0/EssentialCSharp.Web.Tests --report-trx --server dotnettestcli --dotnet-test-pipe /tmp/09d4a978b071488eb63d28840a400a98 (dns block)
    • Triggering command: /home/REDACTED/work/EssentialCSharp.Web/EssentialCSharp.Web/EssentialCSharp.Web.Tests/bin/Release/net10.0/EssentialCSharp.Web.Tests /home/REDACTED/work/EssentialCSharp.Web/EssentialCSharp.Web/EssentialCSharp.Web.Tests/bin/Release/net10.0/EssentialCSharp.Web.Tests --server dotnettestcli --dotnet-test-pipe /tmp/aab9542ad3c74703ad2bcad1fd33ebe5 (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

- appsettings.json: remove AllowedMcpTools (dead config — MCP client
  is commented out; wire up with AIOptions property when MCP is activated)

- EmbeddingService: merge embed+upsert into a single batch loop so peak
  memory is bounded to one batch instead of the full dataset; update
  doc comments to reflect 4-step flow (was 5)

- Program.cs WriteChunkingResult: remove redundant [Heading] line since
  ChunkText already starts with the heading prefix from TextChunker
@BenjaminMichaelis BenjaminMichaelis force-pushed the bmichaelis/EmbeddingUpdates branch from 9cc6ada to 74ed475 Compare May 9, 2026 06:02
Keep detailed SystemPrompt from PR branch; drop AllowedMcpTools (dead
config - MCP client is commented out, removed in prior commit).
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment thread EssentialCSharp.Web/appsettings.json Outdated
Comment thread EssentialCSharp.Web/appsettings.json Outdated
Comment thread EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs Outdated
…mment

- appsettings.json: revert SystemPrompt to main's value (prompt change
  is out of scope for this PR); restore AllowedMcpTools list which IS
  consumed by AIChatService — removing it silently denied all MCP tools

- EmbeddingService: correct the batch-loop comment to accurately state
  that all chunks are materialized upfront via ToList(), while only
  embedding vectors are bounded per-batch
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Comment thread EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs Outdated
Comment thread EssentialCSharp.Chat.Shared/Models/BookContentChunk.cs
…ndex schema migration

- EmbeddingService: replace bookContents.ToList() + Chunk() with a
  streaming buffer pattern (List<BookContentChunk>(EmbeddingBatchSize)).
  Each batch is filled from the IEnumerable, embedded, and upserted
  before moving on, so only one batch of chunks + vectors lives in
  memory at a time. Track total count with a running int.

- BookContentChunk: add <remarks> to ChunkIndex documenting that it is
  a new column requiring a collection rebuild on existing deployments.
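
The streaming buffer pattern from the first bullet can be sketched like this. `ProcessBatch` is a hypothetical placeholder for "embed, then upsert", strings stand in for `BookContentChunk`, and the batch size is shrunk to 3 so the flush of the final partial batch is easy to see.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

const int BatchSize = 3; // small for illustration; the commit uses EmbeddingBatchSize

// Hypothetical per-batch work standing in for "embed, then upsert".
var batchSizes = new List<int>();
void ProcessBatch(List<string> batch) => batchSizes.Add(batch.Count);

// A lazy source, so nothing is materialized up front.
IEnumerable<string> source = Enumerable.Range(1, 8).Select(i => $"chunk {i}");

var buffer = new List<string>(BatchSize);
int total = 0; // running count instead of calling Count on a full list

foreach (string item in source)
{
    buffer.Add(item);
    if (buffer.Count == BatchSize)
    {
        ProcessBatch(buffer);
        total += buffer.Count;
        buffer.Clear(); // reuse the same list for the next batch
    }
}

if (buffer.Count > 0) // flush the final partial batch
{
    ProcessBatch(buffer);
    total += buffer.Count;
}

Console.WriteLine($"{total} items in {batchSizes.Count} batches");
// prints: 8 items in 3 batches
```

Because the buffer is cleared and reused, at most one batch of items is held at a time, which matches the memory bound the commit message describes.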
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Comment thread EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs
@BenjaminMichaelis BenjaminMichaelis merged commit bf6fab5 into main May 10, 2026
11 checks passed
@BenjaminMichaelis BenjaminMichaelis deleted the bmichaelis/EmbeddingUpdates branch May 10, 2026 04:00
3 participants