From 13595da9f3c682949352a7404fbf8c9a10fd79ae Mon Sep 17 00:00:00 2001 From: Hiroshi Hatake Date: Fri, 11 Oct 2024 15:17:32 +0900 Subject: [PATCH 1/5] in_tail: Add a description and note for Unicode.Encoding parameter Signed-off-by: Hiroshi Hatake --- pipeline/inputs/tail.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/pipeline/inputs/tail.md b/pipeline/inputs/tail.md index 9d56320b7..a9f348309 100644 --- a/pipeline/inputs/tail.md +++ b/pipeline/inputs/tail.md @@ -37,6 +37,7 @@ The plugin supports the following configuration parameters: | `static_batch_size` | Set the maximum number of bytes to process per iteration for the monitored static files (files that already exist upon Fluent Bit start). | `50M` | | `file_cache_advise` | Set the `posix_fadvise` in `POSIX_FADV_DONTNEED` mode. This reduces the usage of the kernel file cache. This option is ignored if not running on Linux. | `on` | | `threaded` | Indicates whether to run this input in its own [thread](../../administration/multithreading.md#inputs). | `false` | +| `Unicode.Encoding` | Set the Unicode character encoding of the file data. This parameter requests two-byte aligned chunk and buffer sizes. If data is not aligned for two bytes, Fluent Bit will use two-byte alignment automatically to avoid character breakages on consuming boundaries. Supported values: `UTF-16LE`, `UTF-16BE`, and `auto`. | `none` | ## Buffers and memory management @@ -77,6 +78,17 @@ If no database file is present, positioning behavior depends on the value of `re The database file essentially stores `inode=offset` so it should be unique per instance of the plugin, for example if you have two tail inputs then use two separate `db` files for each. That way each tail input can independently track its own state. +{% hint style="info" %} +Note that `Unicode.Encoding` depends on simdutf library which is written in C++11 or above. +So, the older platforms are not supported for this feature. 
+In addition, `Unicode.Encoding auto` is not covered for the all of the usages. +This is because sometimes this auto-detecting for character encodings makes a mistake to guess the correct encoding. + +We recommend to use `UTF-16LE` or `UTF-16BE` if the target file encoding is pre-determined or known beforehand. +In details, this parameter requests to use 2-bytes aligned chunk and buffer sizes. +If they are not aligned for 2 bytes, Fluent Bit will use 2-bytes alignments automatically to avoid character breakages on consuming boundaries. +{% endhint %} + ## Monitor a large number of files To monitor a large number of files, you can increase the `inotify` settings in your Linux environment by modifying the following `sysctl` parameters: From cbfe1e919f0b6def4a82940c7bb2585f6be81b90 Mon Sep 17 00:00:00 2001 From: Hiroshi Hatake Date: Tue, 8 Jul 2025 12:06:21 +0900 Subject: [PATCH 2/5] Update pipeline/inputs/tail.md Co-authored-by: Alexa Kreizinger Signed-off-by: Hiroshi Hatake --- pipeline/inputs/tail.md | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-) diff --git a/pipeline/inputs/tail.md b/pipeline/inputs/tail.md index a9f348309..3a39949b6 100644 --- a/pipeline/inputs/tail.md +++ b/pipeline/inputs/tail.md @@ -79,14 +79,9 @@ If no database file is present, positioning behavior depends on the value of `re The database file essentially stores `inode=offset` so it should be unique per instance of the plugin, for example if you have two tail inputs then use two separate `db` files for each. That way each tail input can independently track its own state. {% hint style="info" %} -Note that `Unicode.Encoding` depends on simdutf library which is written in C++11 or above. -So, the older platforms are not supported for this feature. -In addition, `Unicode.Encoding auto` is not covered for the all of the usages. -This is because sometimes this auto-detecting for character encodings makes a mistake to guess the correct encoding. 
- -We recommend to use `UTF-16LE` or `UTF-16BE` if the target file encoding is pre-determined or known beforehand. -In details, this parameter requests to use 2-bytes aligned chunk and buffer sizes. -If they are not aligned for 2 bytes, Fluent Bit will use 2-bytes alignments automatically to avoid character breakages on consuming boundaries. +The `Unicode.Encoding` parameter is dependent on the simdutf library, which is itself dependent on C++ version 11 or later. In environments that use earlier versions of C++, the `Unicode.Encoding` parameter will fail. + +Additionally, the `auto` setting for `Unicode.Encoding` isn't supported in all cases, and can make mistakes when it tries to guess the correct encoding. For best results, use either the `UTF-16LE` or `UTF-16BE` setting if you know the encoding type of the target file. {% endhint %} ## Monitor a large number of files From e54556fba6a0af5ab39f4068f5156d3d99c7efbe Mon Sep 17 00:00:00 2001 From: Hiroshi Hatake Date: Wed, 22 Oct 2025 16:57:18 +0900 Subject: [PATCH 3/5] in_tail: Add generic.encoding parameter descriptions Also I added the reason why we need to support these parameters and how to use them. Signed-off-by: Hiroshi Hatake --- pipeline/inputs/tail.md | 91 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 91 insertions(+) diff --git a/pipeline/inputs/tail.md b/pipeline/inputs/tail.md index 3a39949b6..5b0b5d3bb 100644 --- a/pipeline/inputs/tail.md +++ b/pipeline/inputs/tail.md @@ -38,6 +38,7 @@ The plugin supports the following configuration parameters: | `file_cache_advise` | Set the `posix_fadvise` in `POSIX_FADV_DONTNEED` mode. This reduces the usage of the kernel file cache. This option is ignored if not running on Linux. | `on` | | `threaded` | Indicates whether to run this input in its own [thread](../../administration/multithreading.md#inputs). | `false` | | `Unicode.Encoding` | Set the Unicode character encoding of the file data. 
This parameter requests two-byte aligned chunk and buffer sizes. If data is not aligned for two bytes, Fluent Bit will use two-byte alignment automatically to avoid character breakages on consuming boundaries. Supported values: `UTF-16LE`, `UTF-16BE`, and `auto`. | `none` | +| `Generic.Encoding` | Set the non-Unicode encoding of the file data. Supported values: `ShiftJIS`, `UHC`, `GBK`, `GB18030`, `Big5`, `Win866`, `Win874`, `Win1250`, `Win1251`, `Win1252`, `Win1253`, `Win1254`, `Win1255`, and `Win1256`. | `none` | ## Buffers and memory management @@ -84,6 +85,13 @@ The `Unicode.Encoding` parameter is dependent on the simdutf library, which is i Additionally, the `auto` setting for `Unicode.Encoding` isn't supported in all cases, and can make mistakes when it tries to guess the correct encoding. For best results, use either the `UTF-16LE` or `UTF-16BE` setting if you know the encoding type of the target file. {% endhint %} +{% hint style="info" %} +The `Unicode.Encoding` parameter is dependent on the simdutf library, which is itself dependent on C++ version 11 or later. In environments that use earlier versions of C++, the `Unicode.Encoding` parameter will fail. + +Additionally, the `auto` setting for `Unicode.Encoding` isn't supported in all cases, and can make mistakes when it tries to guess the correct encoding. For best results, use either the `UTF-16LE` or `UTF-16BE` setting if you know the encoding type of the target file. +{% endhint %} + + ## Monitor a large number of files To monitor a large number of files, you can increase the `inotify` settings in your Linux environment by modifying the following `sysctl` parameters: @@ -464,3 +472,86 @@ While file rotation is handled, there are risks of potential log loss when using - Final note: the `Path` patterns can't match the rotated files. Otherwise, the rotated file would be read again and lead to duplicate records. 
{% endhint %} + +## Character Encoding Conversion + +This feature allows Fluent Bit to convert logs from various character encodings into the standard UTF-8 format. +This is crucial for processing logs from systems, especially Windows, that use legacy or non-UTF-8 encodings. +Proper conversion ensures that your log data is correctly parsed, indexed, and searchable. + +### When to Use This Feature + +You should use this feature if your log files or messages are not in UTF-8 and you are seeing garbled or incorrectly rendered characters. +This is common in environments that use: + +* Modern Windows applications that log in UTF-16. + +* Legacy Windows systems with applications that use traditional code pages (e.g., ShiftJIS, GBK, Win1252). + +### Configuration Parameters + +To enable encoding conversion, you will use one of the following two parameters within an input plugin configuration. + +1. `Unicode.Encoding` + +Use this parameter for high-performance conversion of UTF-16 encoded logs to UTF-8. This method utilizes modern processor features (SIMD instructions) to accelerate the conversion process, making it highly efficient. + +* Use Case: Ideal for logs coming from modern Windows environments that default to UTF-16. +* Supported Values: + * UTF-16LE (Little-Endian) + * UTF-16BE (Big-Endian) + +2. `Generic.Encoding` + +Use this parameter to convert from a wide variety of other character encodings, particularly legacy Windows code pages. + +* Use Case: Essential for logs from older systems or applications configured for specific regions, common in East Asia and Eastern Europe. +* Supported Values: You can use any of the names or aliases listed below. 
+ +### East Asian Encodings +* `ShiftJIS` (Aliases: `SJIS`, `CP932`, `Windows-31J`) +* `GB18030` +* `GBK`: (Alias: `CP936`) +* `UHC` (Unified Hangul Code): (Aliases: `CP949` and `Windows-949`) +* `Big5`: (Alias: `CP950`) + +### Windows (ANSI) Encodings +* `Win1250` (Central European): (Alias: `CP1250`) +* `Win1251` (Cyrillic): (Alias: `CP1251`) +* `Win1252` (Western European / Latin): (Alias: `CP1252`) +* `Win1253` (Greek): (Alias: `CP1253`) +* `Win1254` (Turkish): (Alias: `CP1254`) +* `Win1255` (Hebrew): (Alias: `CP1255`) +* `Win1256` (Arabic): (Alias: `CP1256`) + +### DOS (OEM) Encodings +* `Win866` (Cyrillic - DOS): (Alias: `CP866`) +* `Win874` (Thai): (Alias: `CP874`) + +### Configuration Example + +Here is an example of how to use `Generic.Encoding` with the Tail input plugin to read a log file encoded in ShiftJIS. + +{% tabs %} +{% tab title="fluent-bit.yaml" %} + +```yaml +pipeline: + inputs: + - name: tail + path: /var/log/containers/*.log + generic.encoding: ShiftJIS +``` + +{% endtab %} +{% tab title="fluent-bit.conf" %} + +```text +[INPUT] + Name tail + Path C:\path\to\your\sjis.log + Generic.Encoding ShiftJIS +``` + +{% endtab %} +{% endtabs %} \ No newline at end of file From 37e837d2931c586679ac4224e4b6f7fbab13d0ec Mon Sep 17 00:00:00 2001 From: Hiroshi Hatake Date: Wed, 22 Oct 2025 17:00:54 +0900 Subject: [PATCH 4/5] Suppress lint warnings Signed-off-by: Hiroshi Hatake --- pipeline/inputs/tail.md | 49 ++++++++++++++++++++++------------------- 1 file changed, 26 insertions(+), 23 deletions(-) diff --git a/pipeline/inputs/tail.md b/pipeline/inputs/tail.md index 5b0b5d3bb..93adf9663 100644 --- a/pipeline/inputs/tail.md +++ b/pipeline/inputs/tail.md @@ -484,9 +484,9 @@ Proper conversion ensures that your log data is correctly parsed, indexed, and s You should use this feature if your log files or messages are not in UTF-8 and you are seeing garbled or incorrectly rendered characters. 
This is common in environments that use: -* Modern Windows applications that log in UTF-16. +- Modern Windows applications that log in UTF-16. -* Legacy Windows systems with applications that use traditional code pages (e.g., ShiftJIS, GBK, Win1252). +- Legacy Windows systems with applications that use traditional code pages (e.g., ShiftJIS, GBK, Win1252). ### Configuration Parameters @@ -496,37 +496,40 @@ To enable encoding conversion, you will use one of the following two parameters Use this parameter for high-performance conversion of UTF-16 encoded logs to UTF-8. This method utilizes modern processor features (SIMD instructions) to accelerate the conversion process, making it highly efficient. -* Use Case: Ideal for logs coming from modern Windows environments that default to UTF-16. -* Supported Values: - * UTF-16LE (Little-Endian) - * UTF-16BE (Big-Endian) +- Use Case: Ideal for logs coming from modern Windows environments that default to UTF-16. +- Supported Values: + - UTF-16LE (Little-Endian) + - UTF-16BE (Big-Endian) -2. `Generic.Encoding` +1. `Generic.Encoding` Use this parameter to convert from a wide variety of other character encodings, particularly legacy Windows code pages. -* Use Case: Essential for logs from older systems or applications configured for specific regions, common in East Asia and Eastern Europe. -* Supported Values: You can use any of the names or aliases listed below. +- Use Case: Essential for logs from older systems or applications configured for specific regions, common in East Asia and Eastern Europe. +- Supported Values: You can use any of the names or aliases listed below. 
### East Asian Encodings -* `ShiftJIS` (Aliases: `SJIS`, `CP932`, `Windows-31J`) -* `GB18030` -* `GBK`: (Alias: `CP936`) -* `UHC` (Unified Hangul Code): (Aliases: `CP949` and `Windows-949`) -* `Big5`: (Alias: `CP950`) + +- `ShiftJIS` (Aliases: `SJIS`, `CP932`, `Windows-31J`) +- `GB18030` +- `GBK`: (Alias: `CP936`) +- `UHC` (Unified Hangul Code): (Aliases: `CP949` and `Windows-949`) +- `Big5`: (Alias: `CP950`) ### Windows (ANSI) Encodings -* `Win1250` (Central European): (Alias: `CP1250`) -* `Win1251` (Cyrillic): (Alias: `CP1251`) -* `Win1252` (Western European / Latin): (Alias: `CP1252`) -* `Win1253` (Greek): (Alias: `CP1253`) -* `Win1254` (Turkish): (Alias: `CP1254`) -* `Win1255` (Hebrew): (Alias: `CP1255`) -* `Win1256` (Arabic): (Alias: `CP1256`) + +- `Win1250` (Central European): (Alias: `CP1250`) +- `Win1251` (Cyrillic): (Alias: `CP1251`) +- `Win1252` (Western European / Latin): (Alias: `CP1252`) +- `Win1253` (Greek): (Alias: `CP1253`) +- `Win1254` (Turkish): (Alias: `CP1254`) +- `Win1255` (Hebrew): (Alias: `CP1255`) +- `Win1256` (Arabic): (Alias: `CP1256`) ### DOS (OEM) Encodings -* `Win866` (Cyrillic - DOS): (Alias: `CP866`) -* `Win874` (Thai): (Alias: `CP874`) + +- `Win866` (Cyrillic - DOS): (Alias: `CP866`) +- `Win874` (Thai): (Alias: `CP874`) ### Configuration Example From a17048ae4f03a820adcb0cd9f5869b8f4e735657 Mon Sep 17 00:00:00 2001 From: Lynette Miles <6818907+esmerel@users.noreply.github.com> Date: Wed, 22 Oct 2025 14:27:53 -0700 Subject: [PATCH 5/5] Apply suggestions from code review This should correct the severe vale errors and most of the suggestions, as well as matching current style. 
Signed-off-by: Lynette Miles <6818907+esmerel@users.noreply.github.com> --- pipeline/inputs/tail.md | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/pipeline/inputs/tail.md b/pipeline/inputs/tail.md index 93adf9663..ef5493143 100644 --- a/pipeline/inputs/tail.md +++ b/pipeline/inputs/tail.md @@ -86,7 +86,7 @@ Additionally, the `auto` setting for `Unicode.Encoding` isn't supported in all c {% endhint %} {% hint style="info" %} -The `Unicode.Encoding` parameter is dependent on the simdutf library, which is itself dependent on C++ version 11 or later. In environments that use earlier versions of C++, the `Unicode.Encoding` parameter will fail. +The `Unicode.Encoding` parameter is dependent on the `simdutf` library, which is itself dependent on C++ version 11 or later. In environments that use earlier versions of C++, the `Unicode.Encoding` parameter will fail. Additionally, the `auto` setting for `Unicode.Encoding` isn't supported in all cases, and can make mistakes when it tries to guess the correct encoding. For best results, use either the `UTF-16LE` or `UTF-16BE` setting if you know the encoding type of the target file. {% endhint %} @@ -473,40 +473,40 @@ While file rotation is handled, there are risks of potential log loss when using {% endhint %} -## Character Encoding Conversion +## Character encoding conversion This feature allows Fluent Bit to convert logs from various character encodings into the standard UTF-8 format. This is crucial for processing logs from systems, especially Windows, that use legacy or non-UTF-8 encodings. Proper conversion ensures that your log data is correctly parsed, indexed, and searchable. -### When to Use This Feature +### When to use this feature -You should use this feature if your log files or messages are not in UTF-8 and you are seeing garbled or incorrectly rendered characters. 
+You should use this feature if your log files or messages aren't in UTF-8 and you are seeing garbled or incorrectly rendered characters. This is common in environments that use: - Modern Windows applications that log in UTF-16. -- Legacy Windows systems with applications that use traditional code pages (e.g., ShiftJIS, GBK, Win1252). +- Legacy Windows systems with applications that use traditional code pages (for example, ShiftJIS, GBK, Win1252). -### Configuration Parameters +### Configuration parameters To enable encoding conversion, you will use one of the following two parameters within an input plugin configuration. 1. `Unicode.Encoding` -Use this parameter for high-performance conversion of UTF-16 encoded logs to UTF-8. This method utilizes modern processor features (SIMD instructions) to accelerate the conversion process, making it highly efficient. + Use this parameter for high-performance conversion of UTF-16 encoded logs to UTF-8. This method utilizes modern processor features (SIMD instructions) to accelerate the conversion process, making it highly efficient. -- Use Case: Ideal for logs coming from modern Windows environments that default to UTF-16. -- Supported Values: - - UTF-16LE (Little-Endian) - - UTF-16BE (Big-Endian) + - Use Case: Ideal for logs coming from modern Windows environments that default to UTF-16. + - Supported Values: + - UTF-16LE (Little-Endian) + - UTF-16BE (Big-Endian) 1. `Generic.Encoding` -Use this parameter to convert from a wide variety of other character encodings, particularly legacy Windows code pages. + Use this parameter to convert from a wide variety of other character encodings, particularly legacy Windows code pages. -- Use Case: Essential for logs from older systems or applications configured for specific regions, common in East Asia and Eastern Europe. -- Supported Values: You can use any of the names or aliases listed below. 
+ - Use Case: Essential for logs from older systems or applications configured for specific regions, common in East Asia and Eastern Europe. + - Supported Values: You can use any of the names or aliases listed below. ### East Asian Encodings @@ -516,7 +516,7 @@ Use this parameter to convert from a wide variety of other character encodings, - `UHC` (Unified Hangul Code): (Aliases: `CP949` and `Windows-949`) - `Big5`: (Alias: `CP950`) -### Windows (ANSI) Encodings +### Windows (ANSI) encodings - `Win1250` (Central European): (Alias: `CP1250`) - `Win1251` (Cyrillic): (Alias: `CP1251`) @@ -526,12 +526,12 @@ Use this parameter to convert from a wide variety of other character encodings, - `Win1255` (Hebrew): (Alias: `CP1255`) - `Win1256` (Arabic): (Alias: `CP1256`) -### DOS (OEM) Encodings +### DOS (OEM) encodings - `Win866` (Cyrillic - DOS): (Alias: `CP866`) - `Win874` (Thai): (Alias: `CP874`) -### Configuration Example +### Configuration example Here is an example of how to use `Generic.Encoding` with the Tail input plugin to read a log file encoded in ShiftJIS.