Merged
2 changes: 1 addition & 1 deletion .translation-init
@@ -1 +1 @@
Translation initialization: 2025-09-26T01:16:02.268240
Translation initialization: 2025-09-26T04:21:21.696179
Expand Up @@ -4,11 +4,26 @@ title: INFER_SCHEMA

Automatically detects the file metadata schema and retrieves the column definitions.

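`infer_schema` currently supports the following file formats:
- **Parquet** - native support for schema inference
- **CSV** - supports custom delimiters and header detection
- **NDJSON** - newline-delimited JSON files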

:::caution
**Compression support**: all formats also accept compressed files with the `.zip`, `.xz`, or `.zst` extension.

`infer_schema` currently supports only the parquet file format.
:::info File size limit
Schema inference is limited to a maximum size of **100MB** per individual file.
:::

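:::info Schema merging
When processing multiple files, `infer_schema` automatically merges differing schemas:

- **Compatible types** are promoted (for example, INT8 + INT16 → INT16)
- **Incompatible types** fall back to **VARCHAR** (for example, INT + FLOAT → VARCHAR)
- **Columns missing** from some files are marked as **nullable**
- **New columns** in later files are added to the final schema

This ensures that all files can be read with a unified schema.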
:::
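
The merging rules listed above can be sketched in Python. This is only an illustration of the four stated rules, not the engine's implementation: the function names, the `INT_RANK` promotion ladder, and the type names in it are assumptions made for the example, and the sketch models only the nullability that comes from a column being absent in some file.

```python
from typing import Dict, List, Tuple

# Hypothetical promotion ladder for integer widths (an assumption for
# illustration; the real set of promotable types may differ).
INT_RANK = {"INT8": 0, "INT16": 1, "INT32": 2, "INT64": 3}

Schema = Dict[str, str]  # column name -> type name, in column order


def merge_types(a: str, b: str) -> str:
    """Merge two column types following the rules stated in the text."""
    if a == b:
        return a
    if a in INT_RANK and b in INT_RANK:
        # Compatible types: promote to the wider one.
        return a if INT_RANK[a] >= INT_RANK[b] else b
    # Incompatible types: fall back to VARCHAR.
    return "VARCHAR"


def merge_schemas(schemas: List[Schema]) -> List[Tuple[str, str, bool]]:
    """Return (column_name, type, nullable) rows for the merged schema."""
    merged: Dict[str, str] = {}
    seen_in: Dict[str, int] = {}
    for schema in schemas:
        for name, typ in schema.items():
            # New columns from later files are appended to the result.
            merged[name] = merge_types(merged[name], typ) if name in merged else typ
            seen_in[name] = seen_in.get(name, 0) + 1
    # A column missing from at least one file is marked nullable.
    return [(name, typ, seen_in[name] < len(schemas)) for name, typ in merged.items()]


files = [
    {"id": "INT", "name": "VARCHAR"},
    {"id": "INT", "name": "VARCHAR", "age": "INT"},
    {"id": "FLOAT", "name": "VARCHAR", "age": "INT"},
]
print(merge_schemas(files))
# [('id', 'VARCHAR', False), ('name', 'VARCHAR', False), ('age', 'INT', True)]
```

The sketch keeps columns in first-seen order, so a column introduced by a later file lands at the end of the merged schema.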

## Syntax
Expand All @@ -17,81 +32,222 @@ title: INFER_SCHEMA
INFER_SCHEMA(
LOCATION => '{ internalStage | externalStage }'
[ PATTERN => '<regex_pattern>']
[ FILE_FORMAT => '<format_name>' ]
[ MAX_RECORDS_PRE_FILE => <number> ]
[ MAX_FILE_COUNT => <number> ]
)
```

Where:
## Parameters

### internalStage
| Parameter | Description | Default | Example |
|-----------|-------------|---------|---------|
| `LOCATION` | Stage location: `@<stage_name>[/<path>]` | Required | `'@my_stage/data/'` |
| `PATTERN` | File name pattern to match | All files | `'*.csv'`, `'*.parquet'` |
| `FILE_FORMAT` | Name of the file format used for parsing | The stage's format | `'csv_format'`, `'NDJSON'` |
| `MAX_RECORDS_PRE_FILE` | Maximum number of records sampled per file | All records | `100`, `1000` |
| `MAX_FILE_COUNT` | Maximum number of files to process | All files | `5`, `10` |
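
As a rough illustration of how the optional parameters combine, the following Python sketch assembles an `INFER_SCHEMA` query string from the parameters in the table. The helper names `build_infer_schema_query` and `sql_literal` are hypothetical and exist only for this example. Rejecting quotes and backslashes instead of escaping them is a deliberately conservative assumption, since the escaping rules are not covered here.

```python
from typing import Optional


def sql_literal(value: str) -> str:
    """Wrap a value in single quotes, refusing characters that would need escaping."""
    if "'" in value or "\\" in value:
        raise ValueError(f"unsupported character in {value!r}")
    return f"'{value}'"


def build_infer_schema_query(
    location: str,
    pattern: Optional[str] = None,
    file_format: Optional[str] = None,
    max_records_pre_file: Optional[int] = None,
    max_file_count: Optional[int] = None,
) -> str:
    """Build a SELECT over INFER_SCHEMA; only LOCATION is required."""
    args = [f"location => {sql_literal(location)}"]
    if pattern is not None:
        args.append(f"pattern => {sql_literal(pattern)}")
    if file_format is not None:
        args.append(f"file_format => {sql_literal(file_format)}")
    if max_records_pre_file is not None:
        args.append(f"max_records_pre_file => {int(max_records_pre_file)}")
    if max_file_count is not None:
        args.append(f"max_file_count => {int(max_file_count)}")
    return "SELECT * FROM INFER_SCHEMA(" + ", ".join(args) + ")"


print(build_infer_schema_query("@my_stage/data/", pattern="*.csv", max_file_count=5))
# SELECT * FROM INFER_SCHEMA(location => '@my_stage/data/', pattern => '*.csv', max_file_count => 5)
```

Omitted parameters are left out of the generated call entirely, so the defaults from the table apply.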

## Examples

### Parquet Files

```sql
internalStage ::= @<internal_stage_name>[/<path>]
-- Create a stage and export data
CREATE STAGE test_parquet;
COPY INTO @test_parquet FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'PARQUET');

-- Infer the schema from Parquet files using a pattern
SELECT * FROM INFER_SCHEMA(
location => '@test_parquet',
pattern => '*.parquet'
);
```

Result:
```
+-------------+-----------------+----------+----------+----------+
| column_name | type | nullable | filenames| order_id |
+-------------+-----------------+----------+----------+----------+
| number | BIGINT UNSIGNED | false | data_... | 0 |
+-------------+-----------------+----------+----------+----------+
```

### externalStage
### CSV Files

```sql
externalStage ::= @<external_stage_name>[/<path>]
-- Create a stage and export CSV data
CREATE STAGE test_csv;
COPY INTO @test_csv FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'CSV');

-- Create a CSV file format
CREATE FILE FORMAT csv_format TYPE = 'CSV';

-- Infer the schema using a pattern and a file format
SELECT * FROM INFER_SCHEMA(
location => '@test_csv',
pattern => '*.csv',
file_format => 'csv_format'
);
```

### PATTERN = 'regex_pattern'
Result:
```
+-------------+---------+----------+----------+----------+
| column_name | type | nullable | filenames| order_id |
+-------------+---------+----------+----------+----------+
| column_1 | BIGINT | true | data_... | 0 |
+-------------+---------+----------+----------+----------+
```

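A [PCRE2](https://www.pcre.org/current/doc/html/)-based regular expression pattern string, enclosed in single quotes, that specifies the file names to match. Click [here](#loading-data-with-pattern-matching) for examples. For PCRE2 syntax, see http://www.pcre.org/current/doc/html/pcre2syntax.html.
CSV files with headers:

## Examples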
```sql
-- Create a CSV file format that supports headers
CREATE FILE FORMAT csv_headers_format
TYPE = 'CSV'
field_delimiter = ','
skip_header = 1;

-- Export data with headers
CREATE STAGE test_csv_headers;
COPY INTO @test_csv_headers FROM (
SELECT number as user_id, 'user_' || number::string as user_name
FROM numbers(5)
) FILE_FORMAT = (TYPE = 'CSV', output_header = true);

-- Infer the schema with headers
SELECT * FROM INFER_SCHEMA(
location => '@test_csv_headers',
file_format => 'csv_headers_format'
);
```

Generate a parquet file in a stage
Limit the number of records to speed up inference

```sql
CREATE STAGE infer_parquet FILE_FORMAT = (TYPE = PARQUET);
COPY INTO @infer_parquet FROM (SELECT * FROM numbers(10)) FILE_FORMAT = (TYPE = PARQUET);
-- Sample only the first 5 records for schema inference
SELECT * FROM INFER_SCHEMA(
location => '@test_csv',
pattern => '*.csv',
file_format => 'csv_format',
max_records_pre_file => 5
);
```

### NDJSON Files

```sql
LIST @infer_parquet;
+-------------------------------------------------------+------+------------------------------------+-------------------------------+---------+
| name | size | md5 | last_modified | creator |
+-------------------------------------------------------+------+------------------------------------+-------------------------------+---------+
| data_e0fd9cba-f45c-4c43-aa07-d6d87d134378_0_0.parquet | 258 | "7DCC9FFE04EA1F6882AED2CF9640D3D4" | 2023-02-09 05:21:52.000 +0000 | NULL |
+-------------------------------------------------------+------+------------------------------------+-------------------------------+---------+
-- Create a stage and export NDJSON data
CREATE STAGE test_ndjson;
COPY INTO @test_ndjson FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'NDJSON');

-- Infer the schema using a pattern and the NDJSON format
SELECT * FROM INFER_SCHEMA(
location => '@test_ndjson',
pattern => '*.ndjson',
file_format => 'NDJSON'
);
```

### `infer_schema`
Result:
```
+-------------+---------+----------+----------+----------+
| column_name | type | nullable | filenames| order_id |
+-------------+---------+----------+----------+----------+
| number | BIGINT | true | data_... | 0 |
+-------------+---------+----------+----------+----------+
```

Limit the number of records to speed up inference:

```sql
SELECT * FROM INFER_SCHEMA(location => '@infer_parquet/data_e0fd9cba-f45c-4c43-aa07-d6d87d134378_0_0.parquet');
+-------------+-----------------+----------+----------+
| column_name | type | nullable | order_id |
+-------------+-----------------+----------+----------+
| number | BIGINT UNSIGNED | 0 | 0 |
+-------------+-----------------+----------+----------+
-- Sample only the first 5 records for schema inference
SELECT * FROM INFER_SCHEMA(
location => '@test_ndjson',
pattern => '*.ndjson',
file_format => 'NDJSON',
max_records_pre_file => 5
);
```

### `infer_schema` with Pattern Matching
### Multi-File Schema Merging

When file schemas differ, `infer_schema` merges them according to the schema merging rules:

```sql
SELECT * FROM infer_schema(location => '@infer_parquet/', pattern => '.*parquet');
+-------------+-----------------+----------+----------+
| column_name | type | nullable | order_id |
+-------------+-----------------+----------+----------+
| number | BIGINT UNSIGNED | 0 | 0 |
+-------------+-----------------+----------+----------+
-- Suppose there are several CSV files with different schemas:
-- file1.csv: id(INT), name(VARCHAR)
-- file2.csv: id(INT), name(VARCHAR), age(INT)
-- file3.csv: id(FLOAT), name(VARCHAR), age(INT)

SELECT * FROM INFER_SCHEMA(
location => '@my_stage/',
pattern => '*.csv',
file_format => 'csv_format'
);
```

### Create a Table from a Parquet File
The result shows the merged schema:
```
+-------------+---------+----------+-----------+----------+
| column_name | type | nullable | filenames | order_id |
+-------------+---------+----------+-----------+----------+
| id | VARCHAR | true | file1,... | 0 | -- INT+FLOAT→VARCHAR
| name | VARCHAR | true | file1,... | 1 |
| age | BIGINT | true | file1,... | 2 | -- missing in file1→nullable
+-------------+---------+----------+-----------+----------+
```

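`infer_schema` can only display the schema of a parquet file; it cannot create a table from it.
### Pattern Matching and File Limits

To create a table from a parquet file
Infer the schema from multiple files using pattern matching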

```sql
CREATE TABLE mytable AS SELECT * FROM @infer_parquet/ (pattern=>'.*parquet') LIMIT 0;

DESC mytable;
+--------+-----------------+------+---------+-------+
| Field | Type | Null | Default | Extra |
+--------+-----------------+------+---------+-------+
| number | BIGINT UNSIGNED | NO | 0 | |
+--------+-----------------+------+---------+-------+
-- Infer the schema from all CSV files in the directory
SELECT * FROM INFER_SCHEMA(
location => '@my_stage/',
pattern => '*.csv'
);
```

Limit the number of files processed to improve performance:

```sql
-- Process only the first 5 matching files
SELECT * FROM INFER_SCHEMA(
location => '@my_stage/',
pattern => '*.csv',
max_file_count => 5
);
```

### Compressed Files

`infer_schema` handles compressed files automatically:

```sql
-- Works with compressed CSV files
SELECT * FROM INFER_SCHEMA(location => '@my_stage/data.csv.zip');

-- Works with compressed NDJSON files
SELECT * FROM INFER_SCHEMA(
location => '@my_stage/data.ndjson.xz',
file_format => 'NDJSON',
max_records_pre_file => 50
);
```

### Create a Table from an Inferred Schema

The `infer_schema` function displays the schema but does not create a table. To create a table from the inferred schema:

```sql
-- Create the table structure from the file schema
CREATE TABLE my_table AS
SELECT * FROM @my_stage/ (pattern=>'*.parquet')
LIMIT 0;

-- Verify the table structure
DESC my_table;
```