improve Parquet querying documentation with comprehensive explanations
Enhanced the Parquet files documentation with detailed explanations of syntax, parameters, and step-by-step tutorial guidance while preserving all existing examples.
---
title: Querying Parquet Files in Stage
sidebar_label: Parquet
---

# Querying Parquet Files in Stage

## Overview

Parquet is a columnar storage file format optimized for analytics workloads. It provides efficient compression and encoding schemes, making it ideal for storing and querying large datasets. Parquet files contain embedded schema information, which allows Databend to understand the structure and data types of your data without additional configuration.
**Why Query Parquet Files in Stages?**
Querying Parquet files directly from stages (external storage locations like S3, Azure Blob, or GCS) offers several advantages:

- **No Data Movement**: Query data where it lives without importing it into Databend tables
- **Cost Efficiency**: Avoid storage duplication and reduce data transfer costs
- **Flexibility**: Analyze data from multiple sources without permanent storage commitment
- **Schema Preservation**: Leverage Parquet's embedded schema for accurate data type handling
- **Performance**: Take advantage of Parquet's columnar format for analytical queries

## Query Parquet Files in Stage

### Syntax

```sql
SELECT [<alias>.]<column> [, <column> ...]
FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
[(
  [<connection_parameters>],
  [PATTERN => '<regex_pattern>'],
  [FILE_FORMAT => 'PARQUET | <custom_format_name>'],
  [FILES => ( '<file_name>' [, '<file_name>' ...] )],
  [CASE_SENSITIVE => true | false]
)]
```

### Parameters Explained

**Data Source Options:**

- `@<stage_name>[/<path>]`: Reference to a named stage with an optional subdirectory path
- `'<uri>'`: Direct URI to the storage location (e.g., `'s3://bucket/path/'`)
- `<table_alias>`: Optional alias for the data source in your query

**Query Options:**

- `<connection_parameters>`: Authentication and connection settings for the storage service
- `PATTERN => '<regex_pattern>'`: Regular expression to filter files by name (e.g., `'.*\.parquet$'`)
- `FILE_FORMAT => 'PARQUET | <custom_format_name>'`: Specify the built-in PARQUET format or a custom format name
- `FILES => ( '<file_name>' [, ...] )`: Explicitly list specific files to query instead of using pattern matching
- `CASE_SENSITIVE => true | false`: Control whether column name matching is case-sensitive (default: `true`)

**When to Use Each Parameter:**

- Use `PATTERN` when you want to query multiple files matching a naming convention
- Use `FILES` when you need to query specific files by name
- Use `CASE_SENSITIVE => false` when your Parquet files have inconsistent column name casing

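As a quick illustration of the two file-selection styles, here is a sketch using a hypothetical stage named `@sales_stage` (the stage name and file names are placeholders, not part of the tutorial below):

```sql
-- PATTERN: query every file under a path that matches a naming convention
SELECT * FROM @sales_stage/2024/
  (FILE_FORMAT => 'PARQUET', PATTERN => '.*\.parquet$');

-- FILES: query specific files by name
SELECT * FROM @sales_stage
  (FILE_FORMAT => 'PARQUET', FILES => ('day1.parquet', 'day2.parquet'));
```
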
### Parquet Query Behavior
**Schema and Data Type Handling:**
Parquet files differ from other formats (like CSV or JSON) because they contain embedded schema information. This provides several advantages when querying:

- **Native Data Types**: Column values are returned in their original data types (INTEGER, VARCHAR, TIMESTAMP, etc.) rather than as generic VARIANT types
- **Direct Column Access**: Reference columns directly by name: `SELECT id, name, age FROM @stage_name`
- **No Type Conversion**: No need for explicit type casting or path expressions like `$1:name::VARCHAR`
- **Schema Validation**: Databend automatically validates that your query columns exist in the Parquet schema

**Example Comparison:**

```sql
-- Parquet (simple and type-safe)
SELECT customer_id, order_date, total_amount FROM @parquet_stage;

-- JSON equivalent (requires path expressions and casting)
SELECT $1:customer_id::INT, $1:order_date::DATE, $1:total_amount::DOUBLE FROM @json_stage;
```

## Tutorial

This tutorial demonstrates how to set up and query Parquet files stored in an external S3 bucket. The process involves creating a stage (connection to external storage), defining a file format, and executing queries.
### Step 1. Create an External Stage
**Purpose**: A stage in Databend acts as a named connection to external storage, allowing you to reference your S3 bucket easily in queries without repeating connection details.
**What this step accomplishes:**

- Creates a reusable connection named `parquet_query_stage`
- Points to the S3 location `s3://load/parquet/` where your Parquet files are stored
- Stores authentication credentials securely within Databend

```sql
CREATE STAGE parquet_query_stage
URL = 's3://load/parquet/'
CONNECTION = (
    ACCESS_KEY_ID = '<your-access-key-id>',
    SECRET_ACCESS_KEY = '<your-secret-access-key>'
);
```

**Configuration Notes:**

- Replace `<your-access-key-id>` and `<your-secret-access-key>` with your actual AWS credentials
- The URL should point to the directory containing your Parquet files
- You can include subdirectories in the URL or specify them later in queries
- The stage name (`parquet_query_stage`) will be referenced in subsequent queries

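To confirm the stage can reach your bucket before moving on, you can list the files it sees:

```sql
-- Shows the files (with sizes and timestamps) visible at the stage location
LIST @parquet_query_stage;
```
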
### Step 2. Create Custom Parquet File Format
**Purpose**: While Databend has built-in Parquet support, creating a named file format allows you to:

- Reuse format settings across multiple queries
- Customize Parquet-specific options if needed
- Make queries more readable and maintainable

**What this step accomplishes:**

- Creates a named file format called `parquet_query_format`
- Explicitly specifies the format type as PARQUET
- Provides a reusable reference for consistent file processing

```sql
CREATE FILE FORMAT parquet_query_format
TYPE = PARQUET;
```

For more Parquet file format options, refer to [Parquet File Format Options](/sql/sql-reference/file-format-options#parquet-options).

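Once created, the named format can be passed wherever the syntax above accepts `FILE_FORMAT`. A sketch using the stage from Step 1:

```sql
-- Query the staged files using the custom format created above
SELECT * FROM @parquet_query_stage (FILE_FORMAT => 'parquet_query_format');
```
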
**Alternative Approach:**
You can also use the built-in format directly in queries without creating a custom format:

```sql
-- Using built-in format (simpler for one-time queries)
SELECT * FROM @parquet_query_stage (FILE_FORMAT => 'PARQUET');
```

When querying multiple Parquet files from a stage, Databend automatically provides metadata columns that help you understand the source and structure of your data. These virtual columns are especially useful when:

- **Data Auditing**: Track which file each row originated from
- **Debugging**: Identify problematic rows by file and line number
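
A sketch of how these metadata columns can be used, assuming the `metadata$filename` and `metadata$file_row_number` virtual columns described in Databend's stage metadata documentation:

```sql
-- Include the source file and row number alongside each record
SELECT metadata$filename, metadata$file_row_number, *
FROM @parquet_query_stage (FILE_FORMAT => 'PARQUET');
```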