5 changes: 4 additions & 1 deletion docs/reference/sql/alter.md
@@ -148,13 +148,14 @@ After dropping the default value, the column will use `NULL` as the default. The

### Alter table options

`ALTER TABLE` statements can also be used to change the options of tables.

Currently, the following options are supported:
- `ttl`: the retention time of data in the table.
- `compaction.twcs.time_window`: the time window parameter of the TWCS compaction strategy. The value should be a [time duration string](/reference/time-durations.md).
- `compaction.twcs.max_output_file_size`: the maximum allowed output file size of the TWCS compaction strategy.
- `compaction.twcs.trigger_file_num`: the number of files in a specific time window that triggers a compaction.
- `sst_format`: the SST format of the table. The value can only be `flat`. A table only supports changing the format from `primary_key` to `flat`.

```sql
ALTER TABLE monitor SET 'ttl'='1d';
ALTER TABLE monitor SET 'compaction.twcs.time_window'='2h';
ALTER TABLE monitor SET 'compaction.twcs.max_output_file_size'='500MB';

ALTER TABLE monitor SET 'compaction.twcs.trigger_file_num'='8';

ALTER TABLE monitor SET 'sst_format'='flat';
```

### Unset table options
32 changes: 25 additions & 7 deletions docs/reference/sql/create.md
@@ -151,6 +151,7 @@ Users can add table options by using `WITH`. The valid options contain the follo
| `memtable.type` | Type of the memtable. | String value, supports `time_series`, `partition_tree`. |
| `append_mode` | Whether the table is append-only | String value. Default is 'false', which removes duplicate rows by primary keys and timestamps according to the `merge_mode`. Set it to 'true' to enable append mode and create an append-only table that keeps duplicate rows. |
| `merge_mode` | The strategy to merge duplicate rows | String value. Only available when `append_mode` is 'false'. Default is `last_row`, which keeps the last row for the same primary key and timestamp. Set it to `last_non_null` to keep the last non-null field for the same primary key and timestamp. |
| `sst_format` | The format of SST files | String value, supports `primary_key`, `flat`. Default is `primary_key`. `flat` is recommended for tables that have a large number of unique primary keys. |
| `comment` | Table level comment | String value. |
| `skip_wal` | Whether to disable Write-Ahead-Log for this table | String type. When set to `'true'`, the data written to the table will not be persisted to the write-ahead log, which can avoid storage wear and improve write throughput. However, when the process restarts, any unflushed data will be lost. Please use this feature only when the data source itself can ensure reliability. |
| `index.type` | Index type | **Only for metric engine** String value, supports `none`, `skipping`. |
@@ -171,15 +172,15 @@ The `ttl` value can be one of the following:
- `forever`, `NULL`, an empty string `''` and `0s` (or any zero-length duration, like `0d`) mean the data will never be deleted.
- `instant`, note that a database's TTL can't be set to `instant`. `instant` means the data will be deleted instantly when inserted, which is useful if you want to send input to a flow task without saving it; see more details in [flow management documents](/user-guide/flow-computation/manage-flow.md#manage-flows).
- Unset, `ttl` can be unset by using `ALTER TABLE <table-name> UNSET 'ttl'`, which means the table will inherit the database's ttl policy (if any).

If a table has its own TTL policy, it will take precedence over the database TTL policy.
Otherwise, the database TTL policy will be applied to the table.

So if the table's TTL is set to `forever`, the data will never be deleted no matter what the database's TTL is. But if you unset the table's TTL using:
```sql
ALTER TABLE <table-name> UNSET 'ttl';
```
Then the database's TTL will be applied to the table.

Note that the default TTL setting for tables and databases is unset, which also means the data will never be deleted.
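
For example, a minimal sketch of how the precedence works (the database, table, and column names below are illustrative, and the table is assumed to be created inside that database):

```sql
-- Database-level TTL: applies to tables that don't set their own TTL.
CREATE DATABASE sensors WITH (ttl = '7d');

-- Table-level TTL overrides the database policy: rows are kept for 30 days.
CREATE TABLE temperature (
  host STRING,
  ts TIMESTAMP TIME INDEX,
  val DOUBLE,
  PRIMARY KEY(host)
) WITH ('ttl' = '30d');

-- After unsetting the table TTL, the 7-day database policy applies again.
ALTER TABLE temperature UNSET 'ttl';
```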

@@ -286,10 +287,10 @@ CREATE TABLE greptime_physical_table (
greptime_timestamp TIMESTAMP(3) NOT NULL,
greptime_value DOUBLE NULL,
TIME INDEX (greptime_timestamp),
)
engine = metric
with (
"physical_metric_table" = "",
"physical_metric_table" = "",
);
```

@@ -304,14 +305,32 @@ CREATE TABLE greptime_physical_table (
greptime_timestamp TIMESTAMP(3) NOT NULL,
greptime_value DOUBLE NULL,
TIME INDEX (greptime_timestamp),
)
engine = metric
with (
"physical_metric_table" = "",
"index.type" = "skipping",
);
```

#### Create a table with SST format

Create a table with the `flat` SST format.

```sql
CREATE TABLE IF NOT EXISTS metrics(
host string,
ts timestamp,
cpu double,
memory double,
TIME INDEX (ts),
PRIMARY KEY(host)
)
with('sst_format'='flat');
```

The `flat` format is a new format optimized for high cardinality primary keys. By default, the SST format of a table is `primary_key` for backward compatibility. The default format will change to `flat` once it is stable.
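
An existing `primary_key` table can also be switched to the `flat` format with `ALTER TABLE` (the reverse change is not supported), for example:

```sql
ALTER TABLE metrics SET 'sst_format'='flat';
```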



### Column options
@@ -480,4 +499,3 @@ For the statement to create or update a view, please read the [view user guide](
## CREATE TRIGGER

Please refer to the [CREATE TRIGGER](/reference/sql/trigger-syntax.md#create-trigger) documentation.

@@ -77,9 +77,9 @@ The `http_logs` table is an example for storing HTTP server logs.
- The table sorts logs by time so it is efficient to search logs by time.


### Primary key design and SST format

You can use a primary key when there are suitable columns and one of the following conditions is met:

- Most queries can benefit from the ordering.
- You need to deduplicate (including delete) rows by the primary key and time index.
@@ -108,18 +108,44 @@ CREATE TABLE http_logs_v2 (
) with ('append_mode'='true');
```

A long primary key negatively affects insert performance and increases the memory footprint. It's recommended to define a primary key with no more than 5 columns.

#### Using flat format table for high cardinality primary keys

To improve sort and deduplication speed under time-series workloads, GreptimeDB buffers and processes rows by time series under the default SST format.
So it doesn't need to compare the primary key of each row repeatedly.
This can be a problem if the tag columns have high cardinality:

1. Performance degradation since the database can't batch rows efficiently.
2. It may increase memory and CPU usage as the database has to maintain the metadata for each time-series.
3. Deduplication may be too expensive.

Currently, with the default format, the recommended number of distinct primary key values is no more than 100 thousand.

Sometimes, users may want to put a high cardinality column in the primary key:

* They have to deduplicate rows by that column, although it isn't efficient.
* Ordering rows by that column can improve query performance significantly.

To use high cardinality columns as the primary key, you could set the SST format to `flat`.
This format has much lower memory usage and better performance under this workload.
Note that deduplication on high cardinality primary keys is always expensive, so it's still recommended to use an append-only table if you can tolerate duplication.

```sql
CREATE TABLE http_logs_flat (
access_time TIMESTAMP TIME INDEX,
application STRING,
remote_addr STRING,
http_status STRING,
http_method STRING,
http_refer STRING,
user_agent STRING,
request_id STRING,
request STRING,
PRIMARY KEY(application, request_id),
) with ('append_mode'='true', 'sst_format'='flat');
```

Recommendations for tags:

@@ -128,9 +154,8 @@
For example, `namespace`, `cluster`, or an AWS `region`.
- No need to set all low cardinality columns as tags since this may impact the performance of ingestion and querying.
- Typically use short strings and integers for tags, avoiding `FLOAT`, `DOUBLE`, `TIMESTAMP`.
- Set `sst_format` to `flat` if tags change frequently.
For example, when tags contain columns like `trace_id`, `span_id`, and `user_id`.


## Index
@@ -59,14 +59,14 @@ staging_size = "10GB"

Some tips:

- Reserve at least 1/10 of the disk space for the write cache. It's recommended to use a large write cache when using object storage.
- Set `page_cache_size` to at least 1/4 of the total memory if the memory usage is under 20%.
- Double the cache size if the cache hit ratio is less than 50%.
- If using full-text index, reserve at least 1/10 of the disk space for the `staging_size`.
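
For example, on a node with a 1 TB disk and 64 GB of memory (illustrative numbers), these tips suggest roughly a 100 GB write cache, a `page_cache_size` of at least 16 GB, and a 100 GB `staging_size` if full-text indexes are used.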

### Using flat format table for high cardinality primary keys

Putting high cardinality columns, such as `trace_id` or `uuid`, into the primary key can negatively impact both write and query performance under the default format. Instead, consider using an [append-only table](/reference/sql/create.md#create-an-append-only-table) and setting the SST format to [`flat` format](/reference/sql/create.md#create-a-table-with-sst-format).
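
As a rough sketch (the table and column names here are illustrative), such a table could look like:

```sql
CREATE TABLE requests (
  ts TIMESTAMP TIME INDEX,
  trace_id STRING,
  user_id STRING,
  latency DOUBLE,
  -- High cardinality columns can stay in the primary key under the flat format.
  PRIMARY KEY(trace_id, user_id)
) with ('append_mode'='true', 'sst_format'='flat');
```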

### Using append-only table if possible

@@ -154,7 +154,7 @@ ALTER TABLE monitor MODIFY COLUMN load_15 DROP DEFAULT;
- `compaction.twcs.time_window`: the time window of the TWCS compaction strategy. The value should be a [time duration string](/reference/time-durations.md).
- `compaction.twcs.max_output_file_size`: the maximum allowed output file size of the TWCS compaction strategy.
- `compaction.twcs.trigger_file_num`: the minimum number of files in a time window that triggers a compaction.
- `sst_format`: the SST format of the table. The value can only be `flat`. A table only supports changing the format from `primary_key` to `flat`.

```sql
ALTER TABLE monitor SET 'ttl'='1d';
ALTER TABLE monitor SET 'compaction.twcs.time_window'='2h';
ALTER TABLE monitor SET 'compaction.twcs.max_output_file_size'='500MB';

ALTER TABLE monitor SET 'compaction.twcs.trigger_file_num'='8';

ALTER TABLE monitor SET 'sst_format'='flat';
```

### Unset table options
@@ -153,6 +153,7 @@ GreptimeDB provides a rich set of index implementations to accelerate queries. See [indexes](/user-
| `memtable.type` | Type of the memtable | String value, supports `time_series`, `partition_tree` |
| `append_mode` | Whether the table is append-only | String value. Default is 'false', which removes duplicate rows by primary keys and timestamps according to the `merge_mode`. Set it to 'true' to enable append mode and create an append-only table that keeps all duplicate rows |
| `merge_mode` | The strategy to merge duplicate rows | String value. Only available when `append_mode` is 'false'. Default is `last_row`, which keeps the last row for the same primary key and timestamp. Set it to `last_non_null` to keep the last non-null field for the same primary key and timestamp. |
| `sst_format` | The format of SST files | String value, supports `primary_key`, `flat`. Default is `primary_key`. `flat` is recommended for tables with high cardinality primary keys. |
| `comment` | Table level comment | String value. |
| `index.type` | Index type | **Only for metric engine** String value, supports `none`, `skipping`. |
| `skip_wal` | Whether to disable Write-Ahead-Log for this table | String value. When set to `'true'`, data written to the table will not be persisted to the write-ahead log, which can avoid storage wear and improve write throughput. However, when the process restarts, any unflushed data will be lost. Please use this feature only when the data source itself can ensure reliability. |
@@ -317,6 +318,24 @@ with (
);
```

#### Create a table with a specified SST format

Create a table that uses the `flat` SST format.

```sql
CREATE TABLE IF NOT EXISTS metrics(
host string,
ts timestamp,
cpu double,
memory double,
TIME INDEX (ts),
PRIMARY KEY(host)
)
with('sst_format'='flat');
```

The `flat` format is a new format optimized for high cardinality primary keys. For backward compatibility, the SST format of a table defaults to `primary_key`. The default format will change to `flat` once it is stable.


### Column options

@@ -485,4 +504,3 @@ AS select_statement
## CREATE TRIGGER

Please refer to the [CREATE TRIGGER](/reference/sql/trigger-syntax.md#create-trigger) documentation.

@@ -75,9 +75,9 @@ CREATE TABLE http_logs (
- The table sorts logs by time, so searching logs by time is efficient.


### Primary key design and SST format

You can use a primary key when there are suitable columns and one of the following conditions is met:

- Most queries can benefit from the ordering.
- You need to deduplicate (including delete) rows by the primary key and time index.
@@ -106,15 +106,44 @@ CREATE TABLE http_logs_v2 (
) with ('append_mode'='true');
```

A long primary key negatively affects insert performance and increases the memory footprint. It's recommended to define a primary key with no more than 5 columns.


#### Using flat format tables for high cardinality primary keys

To improve sort and deduplication speed under time-series workloads, GreptimeDB buffers and processes rows by time series under the default SST format.
So it doesn't need to compare the primary key of each row repeatedly.
This can be a problem if the tag columns have high cardinality:

1. Performance may degrade because the database can't batch rows efficiently.
2. Memory and CPU usage may increase because the database has to maintain metadata for each time series.
3. Deduplication may become too expensive.

Currently, with the default format, the recommended number of distinct primary key values is no more than 100 thousand.

Sometimes, users may want to put a high cardinality column in the primary key:

* They have to deduplicate rows by that column, even though it isn't efficient.
* Ordering rows by that column can significantly improve query performance.

To use high cardinality columns as the primary key, you can set the SST format to `flat`.
This format has much lower memory usage and better performance under this workload.
Note that deduplication on high cardinality primary keys is always expensive, so it's still recommended to use an append-only table if you can tolerate duplication.

```sql
CREATE TABLE http_logs_flat (
access_time TIMESTAMP TIME INDEX,
application STRING,
remote_addr STRING,
http_status STRING,
http_method STRING,
http_refer STRING,
user_agent STRING,
request_id STRING,
request STRING,
PRIMARY KEY(application, request_id),
) with ('append_mode'='true', 'sst_format'='flat');
```

Recommendations for choosing tag columns:

@@ -123,9 +152,8 @@
For example, `namespace`, `cluster`, or an AWS `region`.
- No need to set all low cardinality columns as tags, since this may impact ingestion and query performance.
- Typically use short strings and integers for tags, avoiding `FLOAT`, `DOUBLE`, `TIMESTAMP`.
- Set `sst_format` to `flat` if tags change frequently.
For example, when tags contain columns like `trace_id`, `span_id`, and `user_id`.


## Index
@@ -56,15 +56,15 @@ staging_size = "10GB"


Some tips:
- Set the write cache to at least 1/10 of the disk space. It's recommended to use a large write cache when using object storage.
- If the database's memory usage is under 20%, set `page_cache_size` to at least 1/4 of the total memory.
- Double the cache size if the cache hit ratio is less than 50%.
- If using full-text index, set `staging_size` to at least 1/10 of the disk space.


### Using flat format tables for high cardinality primary keys

Under the default format, setting high cardinality columns such as `trace_id` or `uuid` as the primary key degrades both write and query performance. It's recommended to create the table as an [append-only](/reference/sql/create.md#创建-append-only-表) table and set the SST format to the [`flat` format](/reference/sql/create.md#创建指定-sst-格式的表).


### Use append-only tables whenever possible