docs/ai-integration/generating-embeddings/content/_embeddings-generation-task-csharp.mdx
Lines changed: 68 additions & 36 deletions
@@ -74,16 +74,27 @@ import CodeBlock from '@theme/CodeBlock';
     1. **Collection**
        Enter or select the source document collection from the dropdown.
     2. **Embeddings source**
-       Select `Paths` to define the source content by specifying document properties.
-    3. **Source text path**
-       Enter the property name from the document that contains the text for embedding generation.
-    4. **Chunking method**
-       Select the method for splitting the source text into chunks.
-       Learn more in [Chunking methods and tokens](../../../ai-integration/generating-embeddings/embeddings-generation-task.mdx#chunking-methods-and-tokens).
-    5. **Max tokens per chunk**
-       Enter the maximum number of tokens allowed per chunk (this depends on the service provider).
-    6. **Add path configuration**
+       Select `Paths` to define the source content by document properties.
+    3. **Path configuration**
+       Specify which document properties to extract text from, and how the text should be chunked into embeddings.
+
+       * **Source text path**
+         Enter the property name from the document that contains the text for embedding generation.
+       * **Chunking method**
+         Select the method for splitting the source text into chunks.
+         Learn more in [Chunking methods and tokens](../../../ai-integration/generating-embeddings/embeddings-generation-task.mdx#chunking-methods-and-tokens).
+       * **Max tokens per chunk**
+         Enter the maximum number of tokens allowed per chunk (this depends on the service provider).
+       * **Overlap tokens**
+         Enter the number of tokens to repeat at the start of each chunk from the end of the previous one.
+         This helps preserve context between chunks by carrying over some tokens from one to the next.
+         Applies only to the _"Plain Text: Split Paragraphs"_ and _"Markdown: Split Paragraphs"_ chunking methods.
+
+    4. **Add path configuration**
        Click to add the specified path configuration to the list.
+    5. **List of paths**
+       Displays the document properties you added for embedding generation.
+
 * **Define the embeddings source - using SCRIPT**:
 
 
@@ -92,11 +103,17 @@ import CodeBlock from '@theme/CodeBlock';
        Select `Script` to define the source content and chunking methods using a JavaScript script.
     2. **Script**
        Refer to section [Chunking methods and tokens](../../../ai-integration/generating-embeddings/embeddings-generation-task.mdx#chunking-methods-and-tokens) for available JavaScript methods.
-    3. **Chunking method**
+    3. **Default chunking method**
        The selected chunking method will be used by default when no method is specified in the script.
        e.g., when the script contains: `Name: this.Name`.
-    4. **Max tokens per chunk**:
-       Enter the default value to use when no specific value is set for the chunking method in the script.
+    4. **Default max tokens per chunk**:
+       Enter the default value to use when no specific value is set for the chunking method in the script.
+       This is the maximum number of tokens allowed per chunk (depends on the service provider).
+    5. **Default overlap tokens**
+       Enter the default value to use when no specific value is set for the chunking method in the script.
+       This is the number of tokens to repeat at the start of each chunk from the end of the previous one.
+       Applies only to the _"Plain Text: Split Paragraphs"_ and _"Markdown: Split Paragraphs"_ chunking methods.
+
 * **Define quantization and expiration -
   for the generated embeddings from the source documents**:
 
@@ -191,8 +208,12 @@ var embeddingsTaskConfiguration = new EmbeddingsGenerationConfiguration
@@ -280,8 +302,7 @@ EmbeddingsTransformation = new EmbeddingsTransformation()
             Description: this.Description
         });",
 
-    // Specify the default chunking method and max tokens per chunk
-    // to use in the script
+    // Specify the default chunking options to use in the script
     ChunkingOptions = new ChunkingOptions()
     {
         ChunkingMethod = ChunkingMethod.PlainTextSplit,
@@ -340,7 +361,9 @@ These methods determine how input text is split before being sent to the provide
 
 * `PlainText: Split Paragraphs`
   Uses the Semantic Kernel _SplitPlainTextParagraphs_ method.
-  Combines consecutive lines to form paragraphs while ensuring each paragraph is as complete as possible without exceeding the specified token limit.
+  Combines consecutive lines to form paragraphs while ensuring each paragraph is as complete as possible without exceeding the specified token limit.
+  Optionally, set an overlap between chunks using the _overlapTokens_ parameter, which repeats the last _n_ tokens from one chunk at the start of the next.
+  This helps preserve context continuity across paragraph boundaries.
 
   **Applies to**:
   Fields containing an array of plain text strings.
@@ -360,7 +383,10 @@ These methods determine how input text is split before being sent to the provide
 * `Markdown: Split Paragraphs`
   Uses the Semantic Kernel _SplitMarkdownParagraphs_ method.
   Groups lines into coherent paragraphs at designated paragraph breaks while ensuring each paragraph remains within the specified token limit.
-  Preserves markdown formatting to maintain structure.
+  Markdown formatting is preserved.
+  Optionally, set an overlap between chunks using the _overlapTokens_ parameter, which repeats the last _n_ tokens from one chunk at the start of the next.
+  This helps preserve context continuity across paragraph boundaries.
+
 
   **Applies to**:
   Fields containing an array of strings with markdown content.
@@ -384,25 +410,27 @@ These methods determine how input text is split before being sent to the provide
+| **text** | `string` | A plain text or markdown string to split. |
+| **line** | `string` | A single line or paragraph of text. |
+| **[text] / [line]** | `string[]` | An array of text or lines to split into chunks. |
+| **htmlText** | `string` | A string containing HTML content to process. |
+| **maxTokensPerChunk / maxTokensPerLine** | `number` | The maximum number of tokens allowed per chunk.<br/>Default is `512`. |
+| **overlapTokens** | `number` (optional) | The number of tokens to overlap between consecutive chunks. Helps preserve context continuity across chunks (e.g., between paragraphs).<br/>Default is `0`. |
 
 ## Syntax
 
@@ -451,6 +479,10 @@ public class ChunkingOptions
 {
     public ChunkingMethod ChunkingMethod { get; set; } // Default is PlainTextSplit
     public int MaxTokensPerChunk { get; set; } = 512;
+
+    // 'OverlapTokens' is only applicable when ChunkingMethod is
+    // 'PlainTextSplitParagraphs' or 'MarkDownSplitParagraphs'
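
The overlap behaviour documented in this change (repeating the last _n_ tokens of one chunk at the start of the next) can be illustrated with a small JavaScript sketch. This is not the RavenDB or Semantic Kernel implementation. It assumes that whitespace-separated words stand in for tokens (real tokenizers count differently), and `chunkWithOverlap` is a hypothetical helper name used only for illustration.

```javascript
// Hypothetical helper: illustrates the overlapTokens idea only.
// Assumption: a "token" is a whitespace-separated word; real tokenizers differ.
function chunkWithOverlap(text, maxTokensPerChunk = 512, overlapTokens = 0) {
    // The overlap must leave room for at least one new token per chunk,
    // otherwise the loop below would never advance.
    if (overlapTokens < 0 || overlapTokens >= maxTokensPerChunk) {
        throw new RangeError("overlapTokens must be >= 0 and < maxTokensPerChunk");
    }

    const tokens = text.split(/\s+/).filter(t => t.length > 0);
    const chunks = [];

    // Each new chunk starts 'overlapTokens' tokens before the previous chunk ended.
    const step = maxTokensPerChunk - overlapTokens;

    for (let start = 0; start < tokens.length; start += step) {
        chunks.push(tokens.slice(start, start + maxTokensPerChunk).join(" "));

        // Stop once the current chunk has reached the end of the input.
        if (start + maxTokensPerChunk >= tokens.length) break;
    }
    return chunks;
}

const chunks = chunkWithOverlap("a b c d e f g h", 4, 1);
console.log(chunks);
```

With a limit of 4 tokens and an overlap of 1, the eight-word input yields `["a b c d", "d e f g", "g h"]`: every chunk after the first begins with the last token of the previous one. With `overlapTokens` left at `0`, the chunks do not share any tokens.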