Commit 7918e52

RDoc-3507 Expose OverlapTokens in ChunkingOptions
1 parent 508c9ae commit 7918e52

3 files changed

+68
-36
lines changed

docs/ai-integration/generating-embeddings/content/_embeddings-generation-task-csharp.mdx

Lines changed: 68 additions & 36 deletions
Original file line number | Diff line number | Diff line change
@@ -74,16 +74,27 @@ import CodeBlock from '@theme/CodeBlock';
7474
1. **Collection**
7575
Enter or select the source document collection from the dropdown.
7676
2. **Embeddings source**
77-
Select `Paths` to define the source content by specifying document properties.
78-
3. **Source text path**
79-
Enter the property name from the document that contains the text for embedding generation.
80-
4. **Chunking method**
81-
Select the method for splitting the source text into chunks.
82-
Learn more in [Chunking methods and tokens](../../../ai-integration/generating-embeddings/embeddings-generation-task.mdx#chunking-methods-and-tokens).
83-
5. **Max tokens per chunk**
84-
Enter the maximum number of tokens allowed per chunk (this depends on the service provider).
85-
6. **Add path configuration**
77+
Select `Paths` to define the source content by document properties.
78+
3. **Path configuration**
79+
Specify which document properties to extract text from, and how the text should be chunked into embeddings.
80+
81+
* **Source text path**
82+
Enter the property name from the document that contains the text for embedding generation.
83+
* **Chunking method**
84+
Select the method for splitting the source text into chunks.
85+
Learn more in [Chunking methods and tokens](../../../ai-integration/generating-embeddings/embeddings-generation-task.mdx#chunking-methods-and-tokens).
86+
* **Max tokens per chunk**
87+
Enter the maximum number of tokens allowed per chunk (this depends on the service provider).
88+
* **Overlap tokens**
89+
Enter the number of tokens to repeat at the start of each chunk from the end of the previous one.
90+
This helps preserve context between chunks by carrying over some tokens from one to the next.
91+
Applies only to the _"Plain Text: Split Paragraphs"_ and _"Markdown: Split Paragraphs"_ chunking methods.
92+
93+
4. **Add path configuration**
8694
Click to add the specified path configuration to the list.
95+
5. **List of paths**
96+
Displays the document properties you added for embedding generation.
97+
8798
* **Define the embeddings source - using SCRIPT**:
8899

89100
![Create embeddings generation task - source by script](../assets/add-ai-task-4-script.png)
@@ -92,11 +103,17 @@ import CodeBlock from '@theme/CodeBlock';
92103
Select `Script` to define the source content and chunking methods using a JavaScript script.
93104
2. **Script**
94105
Refer to section [Chunking methods and tokens](../../../ai-integration/generating-embeddings/embeddings-generation-task.mdx#chunking-methods-and-tokens) for available JavaScript methods.
95-
3. **Chunking method**
106+
3. **Default chunking method**
96107
The selected chunking method will be used by default when no method is specified in the script.
97108
e.g., when the script contains: `Name: this.Name`.
98-
4. **Max tokens per chunk**:
99-
Enter the default value to use when no specific value is set for the chunking method in the script.
109+
4. **Default max tokens per chunk**:
110+
Enter the default value to use when no specific value is set for the chunking method in the script.
111+
This is the maximum number of tokens allowed per chunk (depends on the service provider).
112+
5. **Default overlap tokens**
113+
Enter the default value to use when no specific value is set for the chunking method in the script.
114+
This is the number of tokens to repeat at the start of each chunk from the end of the previous one.
115+
Applies only to the _"Plain Text: Split Paragraphs"_ and _"Markdown: Split Paragraphs"_ chunking methods.
116+
100117
* **Define quantization and expiration -
101118
for the generated embeddings from the source documents**:
102119

@@ -191,8 +208,12 @@ var embeddingsTaskConfiguration = new EmbeddingsGenerationConfiguration
191208
Path = "Description",
192209
ChunkingOptions = new()
193210
{
194-
ChunkingMethod = ChunkingMethod.PlainTextSplitLines,
195-
MaxTokensPerChunk = 2048
211+
ChunkingMethod = ChunkingMethod.PlainTextSplitParagraphs,
212+
MaxTokensPerChunk = 2048,
213+
214+
// 'OverlapTokens' is only applicable when ChunkingMethod is
215+
// 'PlainTextSplitParagraphs' or 'MarkDownSplitParagraphs'
216+
OverlapTokens = 128
196217
}
197218
},
198219
],
@@ -213,8 +234,8 @@ var embeddingsTaskConfiguration = new EmbeddingsGenerationConfiguration
213234
EmbeddingsCacheForQueryingExpiration = TimeSpan.FromDays(14)
214235
};
215236

216-
// Deploy the connection string to the server:
217-
// ===========================================
237+
// Deploy the embeddings generation task to the server:
238+
// ====================================================
218239
var addEmbeddingsGenerationTaskOp =
219240
new AddEmbeddingsGenerationOperation(embeddingsTaskConfiguration);
220241
var addAiIntegrationTaskResult = store.Maintenance.Send(addEmbeddingsGenerationTaskOp);
@@ -256,16 +277,17 @@ EmbeddingsTransformation = new EmbeddingsTransformation()
256277
// The text content will be split into chunks of up to 2048 tokens.
257278
Name: text.split(this.Name, 2048),
258279
259-
// Process the document 'Description' field using method text.splitLines().
280+
// Process the document 'Description' field using method text.splitParagraphs().
260281
// The text content will be split into chunks of up to 2048 tokens.
261-
Description: text.splitLines(this.Description, 2048)
282+
// 128 overlapping tokens will be repeated at the start of each chunk
283+
// from the end of the previous one.
284+
Description: text.splitParagraphs(this.Description, 2048, 128)
262285
});"
263286
},
264287
```
265288
</TabItem>
266289

267-
* If no chunking method is provided in the script,
268-
you can set the default chunking method and the maximum tokens per chunk to be used as follows:
290+
* If no chunking method is provided in the script, you can set default values as follows:
269291

270292
<TabItem value="create_embeddings_task_3" label="create_embeddings_task_3">
271293
```csharp
@@ -280,8 +302,7 @@ EmbeddingsTransformation = new EmbeddingsTransformation()
280302
Description: this.Description
281303
});",
282304

283-
// Specify the default chunking method and max tokens per chunk
284-
// to use in the script
305+
// Specify the default chunking options to use in the script
285306
ChunkingOptions = new ChunkingOptions()
286307
{
287308
ChunkingMethod = ChunkingMethod.PlainTextSplit,
@@ -340,7 +361,9 @@ These methods determine how input text is split before being sent to the provide
340361

341362
* `PlainText: Split Paragraphs`
342363
Uses the Semantic Kernel _SplitPlainTextParagraphs_ method.
343-
Combines consecutive lines to form paragraphs while ensuring each paragraph is as complete as possible without exceeding the specified token limit.
364+
Combines consecutive lines to form paragraphs while ensuring each paragraph is as complete as possible without exceeding the specified token limit.
365+
Optionally, set an overlap between chunks using the _overlapTokens_ parameter, which repeats the last _n_ tokens from one chunk at the start of the next.
366+
This helps preserve context continuity across paragraph boundaries.
344367

345368
**Applies to**:
346369
Fields containing an array of plain text strings.
@@ -360,7 +383,10 @@ These methods determine how input text is split before being sent to the provide
360383
* `Markdown: Split Paragraphs`
361384
Uses the Semantic Kernel _SplitMarkdownParagraphs_ method.
362385
Groups lines into coherent paragraphs at designated paragraph breaks while ensuring each paragraph remains within the specified token limit.
363-
Preserves markdown formatting to maintain structure.
386+
Markdown formatting is preserved.
387+
Optionally, set an overlap between chunks using the _overlapTokens_ parameter, which repeats the last _n_ tokens from one chunk at the start of the next.
388+
This helps preserve context continuity across paragraph boundaries.
389+
364390

365391
**Applies to**:
366392
Fields containing an array of strings with markdown content.
@@ -384,25 +410,27 @@ These methods determine how input text is split before being sent to the provide
384410
// =================================
385411

386412
// Plain text methods:
387-
text.split(text, maxTokensPerChunk);
388-
text.splitLines(text, maxTokensPerChunk);
389-
text.splitParagraphs(lines, maxTokensPerChunk);
413+
text.split(text | [text], maxTokensPerLine);
414+
text.splitLines(text | [text], maxTokensPerLine);
415+
text.splitParagraphs(line | [line], maxTokensPerLine, overlapTokens?);
390416

391417
// Markdown methods:
392-
markdown.splitLines(text, maxTokensPerChunk);
393-
markdown.splitParagraphs(lines, maxTokensPerChunk);
418+
markdown.splitLines(text | [text], maxTokensPerLine);
419+
markdown.splitParagraphs(line | [line], maxTokensPerLine, overlapTokens?);
394420

395421
// HTML processing:
396-
html.strip(htmlText, maxTokensPerChunk);
422+
html.strip(htmlText | [htmlText], maxTokensPerChunk);
397423
```
398424
</TabItem>
399425
400-
| Parameter | Type | Description |
401-
|-----------------------|------------|----------------------------------------------|
402-
| **text** | `string` | A plain text or markdown string to split. |
403-
| **lines** | `string[]` | An array of text lines to split into chunks. |
404-
| **htmlText** | `string` | A string containing HTML content to process. |
405-
| **maxTokensPerChunk** | `number` | The maximum tokens allowed per chunk. |
426+
| Parameter | Type | Description |
427+
|------------------------------------------|-----------|------------------------------------------------------------------ |
428+
| **text** | `string` | A plain text or markdown string to split. |
429+
| **line** | `string` | A single line or paragraph of text. |
430+
| **[text] / [line]** | `string[]`| An array of text or lines to split into chunks. |
431+
| **htmlText** | `string` | A string containing HTML content to process. |
432+
| **maxTokensPerChunk / maxTokensPerLine** | `number` | The maximum number of tokens allowed per chunk.<br/>Default is `512`. |
433+
| **overlapTokens** | `number` (optional) | The number of tokens to overlap between consecutive chunks. Helps preserve context continuity across chunks (e.g., between paragraphs).<br/>Default is `0`. |
406434
407435
## Syntax
408436
@@ -451,6 +479,10 @@ public class ChunkingOptions
451479
{
452480
public ChunkingMethod ChunkingMethod { get; set; } // Default is PlainTextSplit
453481
public int MaxTokensPerChunk { get; set; } = 512;
482+
483+
// 'OverlapTokens' is only applicable when ChunkingMethod is
484+
// 'PlainTextSplitParagraphs' or 'MarkDownSplitParagraphs'
485+
public int OverlapTokens { get; set; } = 0;
454486
}
455487

456488
public enum ChunkingMethod
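
The overlap behavior this commit documents ("repeat the last _n_ tokens from one chunk at the start of the next") can be illustrated with a minimal sketch. This is not the Semantic Kernel implementation or the RavenDB `text.splitParagraphs` method: the function name `splitWithOverlap` is hypothetical, tokens are approximated by whitespace-separated words, paragraph boundaries are ignored, and the sketch assumes the repeated tokens count toward the per-chunk limit, which the source does not state.

```javascript
// Illustrative sketch only. Assumptions (not from the source):
//  - a "token" is a whitespace-separated word (real tokenizers differ),
//  - overlapping tokens count toward maxTokensPerChunk,
//  - paragraph boundaries are ignored (a plain sliding window is used).

/**
 * Splits `text` into chunks of at most `maxTokensPerChunk` tokens.
 * Every chunk after the first starts with the last `overlapTokens`
 * tokens of the previous chunk.
 */
function splitWithOverlap(text, maxTokensPerChunk = 512, overlapTokens = 0) {
    if (overlapTokens < 0 || overlapTokens >= maxTokensPerChunk) {
        throw new RangeError(
            "overlapTokens must be >= 0 and smaller than maxTokensPerChunk");
    }

    const tokens = text.split(/\s+/).filter(t => t.length > 0);
    const chunks = [];

    // Each new chunk advances by the number of tokens that are NOT repeated.
    const step = maxTokensPerChunk - overlapTokens;

    for (let start = 0; start < tokens.length; start += step) {
        chunks.push(tokens.slice(start, start + maxTokensPerChunk).join(" "));

        // Stop once the current chunk has reached the end of the text.
        if (start + maxTokensPerChunk >= tokens.length) {
            break;
        }
    }

    return chunks;
}

// Usage: 10 tokens, up to 4 tokens per chunk, 1 token carried over.
const chunks = splitWithOverlap("a b c d e f g h i j", 4, 1);
console.log(chunks); // [ 'a b c d', 'd e f g', 'g h i j' ]
```

With `overlapTokens` left at `0` (the default shown in `ChunkingOptions`), the same function produces disjoint chunks, which corresponds to the behavior before this option was exposed.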
