
Commit e00145a

Merge pull request #934 from Altinity/s3_hive_style_reads_and_writes_25_6_5
Antalya 25.6.5: Object storage hive reads & writes
2 parents 89a60bd + f1762aa commit e00145a

73 files changed, +2303 -590 lines changed


docs/en/engines/table-engines/integrations/azureBlobStorage.md

Lines changed: 32 additions & 1 deletion
@@ -14,7 +14,7 @@ This engine provides an integration with [Azure Blob Storage](https://azure.micr

 ```sql
 CREATE TABLE azure_blob_storage_table (name String, value UInt32)
-ENGINE = AzureBlobStorage(connection_string|storage_account_url, container_name, blobpath, [account_name, account_key, format, compression])
+ENGINE = AzureBlobStorage(connection_string|storage_account_url, container_name, blobpath, [account_name, account_key, format, compression, partition_strategy, partition_columns_in_data_file])
 [PARTITION BY expr]
 [SETTINGS ...]
 ```
@@ -30,6 +30,8 @@ CREATE TABLE azure_blob_storage_table (name String, value UInt32)
 - `account_key` - if storage_account_url is used, then account key can be specified here
 - `format` — The [format](/interfaces/formats.md) of the file.
 - `compression` — Supported values: `none`, `gzip/gz`, `brotli/br`, `xz/LZMA`, `zstd/zst`. By default, it will autodetect compression by file extension. (same as setting to `auto`).
+- `partition_strategy` — Options: `WILDCARD` or `HIVE`. `WILDCARD` requires a `{_partition_id}` in the path, which is replaced with the partition key. `HIVE` does not allow wildcards, assumes the path is the table root, and generates Hive-style partitioned directories with Snowflake IDs as filenames and the file format as the extension. Defaults to `WILDCARD`.
+- `partition_columns_in_data_file` — Only used with the `HIVE` partition strategy. Tells ClickHouse whether to expect partition columns to be written in the data file. Defaults to `false`.
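As an aside that is not part of the committed docs: a minimal sketch of the two new parameters used together, reusing the Azurite connection values from the Hive example further down. Passing `partition_columns_in_data_file` in the same named-argument form as the committed example is an assumption, as are the table and prefix names.

```sql
-- Hive-style layout under the illustrative 'sales' prefix; with
-- partition_columns_in_data_file = true the partition columns (year, country)
-- are also written into each Parquet file, not only encoded in the directory names.
-- Connection values are the Azurite defaults used in the example below.
CREATE TABLE azure_sales (year UInt16, country String, amount UInt32)
ENGINE = AzureBlobStorage(
    storage_account_url = 'http://localhost:30000/devstoreaccount1',
    account_name = 'devstoreaccount1',
    account_key = 'Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==',
    container = 'cont',
    blob_path = 'sales',
    format = 'Parquet',
    partition_strategy = 'hive',
    partition_columns_in_data_file = true)
PARTITION BY (year, country);
```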

**Example**

@@ -96,6 +98,35 @@ SETTINGS filesystem_cache_name = 'cache_for_azure', enable_filesystem_cache = 1;

 2. reuse cache configuration (and therefore cache storage) from clickhouse `storage_configuration` section, [described here](/operations/storing-data.md/#using-local-cache)

+### PARTITION BY {#partition-by}
+
+`PARTITION BY` — Optional. In most cases you don't need a partition key, and if one is needed you generally don't need a partition key more granular than by month. Partitioning does not speed up queries (in contrast to the ORDER BY expression). You should never use overly granular partitioning. Don't partition your data by client identifiers or names (instead, make the client identifier or name the first column in the ORDER BY expression).
+
+For partitioning by month, use the `toYYYYMM(date_column)` expression, where `date_column` is a column with a date of the type [Date](/sql-reference/data-types/date.md). The partition names here have the `"YYYYMM"` format.
+
+#### Partition strategy {#partition-strategy}
+
+`WILDCARD` (default): Replaces the `{_partition_id}` wildcard in the file path with the actual partition key. Reading is not supported.
+
+`HIVE` implements Hive-style partitioning for reads and writes. Reading is implemented using a recursive glob pattern. Writing generates files using the following format: `<prefix>/<key1=val1/key2=val2...>/<snowflakeid>.<toLower(file_format)>`.
+
+Note: When using the `HIVE` partition strategy, the `use_hive_partitioning` setting has no effect.
+
+Example of the `HIVE` partition strategy:
+
+```sql
+arthur :) CREATE TABLE azure_table (year UInt16, country String, counter UInt8) ENGINE = AzureBlobStorage(account_name='devstoreaccount1', account_key='Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==', storage_account_url = 'http://localhost:30000/devstoreaccount1', container='cont', blob_path='hive_partitioned', format='Parquet', compression='auto', partition_strategy='hive') PARTITION BY (year, country);
+
+arthur :) INSERT INTO azure_table VALUES (2020, 'Russia', 1), (2021, 'Brazil', 2);
+
+arthur :) SELECT _path, * FROM azure_table;
+
+   ┌─_path──────────────────────────────────────────────────────────────────────┬─year─┬─country─┬─counter─┐
+1. │ cont/hive_partitioned/year=2020/country=Russia/7351305360873664512.parquet │ 2020 │ Russia  │       1 │
+2. │ cont/hive_partitioned/year=2021/country=Brazil/7351305360894636032.parquet │ 2021 │ Brazil  │       2 │
+   └────────────────────────────────────────────────────────────────────────────┴──────┴─────────┴─────────┘
+```
+
 ## See also {#see-also}

 [Azure Blob Storage Table Function](/sql-reference/table-functions/azureBlobStorage)
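A hedged companion sketch for the "recursive glob pattern" reading described in the hunk above, not taken from the committed docs: reading the same Hive-partitioned layout through the azureBlobStorage table function. The `**.parquet` glob is an assumption based on that wording; connection values reuse the Azurite defaults from the example.

```sql
-- Read all Parquet files under the Hive-partitioned table root written above,
-- using the documented azureBlobStorage(storage_account_url, container, blobpath,
-- account_name, account_key, format) signature with a recursive glob as the blobpath.
-- Whether year/country show up as columns here depends on use_hive_partitioning
-- and on whether partition_columns_in_data_file was set at write time.
SELECT _path, *
FROM azureBlobStorage(
    'http://localhost:30000/devstoreaccount1',
    'cont',
    'hive_partitioned/**.parquet',
    'devstoreaccount1',
    'Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==',
    'Parquet');
```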

docs/en/engines/table-engines/integrations/s3.md

Lines changed: 49 additions & 1 deletion
@@ -34,7 +34,7 @@ SELECT * FROM s3_engine_table LIMIT 2;

 ```sql
 CREATE TABLE s3_engine_table (name String, value UInt32)
-ENGINE = S3(path [, NOSIGN | aws_access_key_id, aws_secret_access_key,] format, [compression])
+ENGINE = S3(path [, NOSIGN | aws_access_key_id, aws_secret_access_key,] format, [compression], [partition_strategy], [partition_columns_in_data_file])
 [PARTITION BY expr]
 [SETTINGS ...]
 ```
@@ -46,6 +46,8 @@ CREATE TABLE s3_engine_table (name String, value UInt32)
 - `format` — The [format](/sql-reference/formats#formats-overview) of the file.
 - `aws_access_key_id`, `aws_secret_access_key` - Long-term credentials for the [AWS](https://aws.amazon.com/) account user. You can use these to authenticate your requests. Parameter is optional. If credentials are not specified, they are used from the configuration file. For more information see [Using S3 for Data Storage](../mergetree-family/mergetree.md#table_engine-mergetree-s3).
 - `compression` — Compression type. Supported values: `none`, `gzip/gz`, `brotli/br`, `xz/LZMA`, `zstd/zst`. Parameter is optional. By default, it will auto-detect compression by file extension.
+- `partition_strategy` — Options: `WILDCARD` or `HIVE`. `WILDCARD` requires a `{_partition_id}` in the path, which is replaced with the partition key. `HIVE` does not allow wildcards, assumes the path is the table root, and generates Hive-style partitioned directories with Snowflake IDs as filenames and the file format as the extension. Defaults to `WILDCARD`.
+- `partition_columns_in_data_file` - Only used with the `HIVE` partition strategy. Tells ClickHouse whether to expect partition columns to be written in the data file. Defaults to `false`.
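A hedged sketch of the default `WILDCARD` strategy described above. It is not part of the committed docs; the bucket URL, table name, and object names are placeholders.

```sql
-- WILDCARD strategy: the {_partition_id} token in the path is replaced with the
-- partition key when data is written; reading back through a table whose path
-- contains {_partition_id} is not supported.
CREATE TABLE s3_wildcard (year UInt16, country String, counter UInt8)
ENGINE = S3('https://storage.example.com/my-bucket/data_{_partition_id}.parquet', 'Parquet')
PARTITION BY year;

-- Writes one object per partition, e.g. my-bucket/data_2022.parquet and
-- my-bucket/data_2023.parquet. Credentials are taken from the configuration file
-- here since none are passed.
INSERT INTO s3_wildcard VALUES (2022, 'USA', 1), (2023, 'Mexico', 4);
```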

### Data cache {#data-cache}

@@ -84,6 +86,52 @@ There are two ways to define cache in configuration file.

 For partitioning by month, use the `toYYYYMM(date_column)` expression, where `date_column` is a column with a date of the type [Date](/sql-reference/data-types/date.md). The partition names here have the `"YYYYMM"` format.

+#### Partition strategy {#partition-strategy}
+
+`WILDCARD` (default): Replaces the `{_partition_id}` wildcard in the file path with the actual partition key. Reading is not supported.
+
+`HIVE` implements Hive-style partitioning for reads and writes. Reading is implemented using a recursive glob pattern; it is equivalent to `SELECT * FROM s3('table_root/**.parquet')`.
+Writing generates files using the following format: `<prefix>/<key1=val1/key2=val2...>/<snowflakeid>.<toLower(file_format)>`.
+
+Note: When using the `HIVE` partition strategy, the `use_hive_partitioning` setting has no effect.
+
+Example of the `HIVE` partition strategy:
+
+```sql
+arthur :) CREATE TABLE t_03363_parquet (year UInt16, country String, counter UInt8)
+ENGINE = S3(s3_conn, filename = 't_03363_parquet', format = Parquet, partition_strategy='hive')
+PARTITION BY (year, country);
+
+arthur :) INSERT INTO t_03363_parquet VALUES
+(2022, 'USA', 1),
+(2022, 'Canada', 2),
+(2023, 'USA', 3),
+(2023, 'Mexico', 4),
+(2024, 'France', 5),
+(2024, 'Germany', 6),
+(2024, 'Germany', 7),
+(1999, 'Brazil', 8),
+(2100, 'Japan', 9),
+(2024, 'CN', 10),
+(2025, '', 11);
+
+arthur :) SELECT _path, * FROM t_03363_parquet;
+
+    ┌─_path──────────────────────────────────────────────────────────────────────┬─year─┬─country─┬─counter─┐
+ 1. │ test/t_03363_parquet/year=2100/country=Japan/7329604473272971264.parquet   │ 2100 │ Japan   │       9 │
+ 2. │ test/t_03363_parquet/year=2024/country=France/7329604473323302912.parquet  │ 2024 │ France  │       5 │
+ 3. │ test/t_03363_parquet/year=2022/country=Canada/7329604473314914304.parquet  │ 2022 │ Canada  │       2 │
+ 4. │ test/t_03363_parquet/year=1999/country=Brazil/7329604473289748480.parquet  │ 1999 │ Brazil  │       8 │
+ 5. │ test/t_03363_parquet/year=2023/country=Mexico/7329604473293942784.parquet  │ 2023 │ Mexico  │       4 │
+ 6. │ test/t_03363_parquet/year=2023/country=USA/7329604473319108608.parquet     │ 2023 │ USA     │       3 │
+ 7. │ test/t_03363_parquet/year=2025/country=/7329604473327497216.parquet        │ 2025 │         │      11 │
+ 8. │ test/t_03363_parquet/year=2024/country=CN/7329604473310720000.parquet      │ 2024 │ CN      │      10 │
+ 9. │ test/t_03363_parquet/year=2022/country=USA/7329604473298137088.parquet     │ 2022 │ USA     │       1 │
+10. │ test/t_03363_parquet/year=2024/country=Germany/7329604473306525696.parquet │ 2024 │ Germany │       6 │
+11. │ test/t_03363_parquet/year=2024/country=Germany/7329604473306525696.parquet │ 2024 │ Germany │       7 │
+    └────────────────────────────────────────────────────────────────────────────┴──────┴─────────┴─────────┘
+```
+
 ### Querying partitioned data {#querying-partitioned-data}

 This example uses the [docker compose recipe](https://github.com/ClickHouse/examples/tree/5fdc6ff72f4e5137e23ea075c88d3f44b0202490/docker-compose-recipes/recipes/ch-and-minio-S3), which integrates ClickHouse and MinIO. You should be able to reproduce the same queries using S3 by replacing the endpoint and authentication values.
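Expanding on the equivalence stated in the hunk above ("Reading is implemented using a recursive glob pattern"), a hedged sketch of reading the example layout directly with the s3 table function. The endpoint is a placeholder for whatever the `s3_conn` named collection points at and is not part of this commit.

```sql
-- Reads every Parquet file under the Hive-partitioned table root, i.e. the
-- recursive-glob equivalent of selecting from t_03363_parquet itself.
-- Whether year/country appear as columns here depends on use_hive_partitioning
-- and on partition_columns_in_data_file at write time.
SELECT _path, *
FROM s3('https://s3.example.com/test/t_03363_parquet/**.parquet', 'Parquet');
```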
