
Commit ab47d63

improve Parquet querying documentation with comprehensive explanations
Enhanced the Parquet files documentation with detailed explanations of syntax, parameters, and step-by-step tutorial guidance while preserving all existing examples.

1 file changed: docs/en/guides/40-load-data/04-transform/00-querying-parquet.md (+269 −17)

---
title: Querying Parquet Files in Stage
sidebar_label: Parquet
---

# Querying Parquet Files in Stage

## Overview

Parquet is a columnar storage file format optimized for analytics workloads. It provides efficient compression and encoding schemes, making it ideal for storing and querying large datasets. Parquet files contain embedded schema information, which allows Databend to understand the structure and data types of your data without additional configuration.

**Why Query Parquet Files in Stages?**

Querying Parquet files directly from stages (external storage locations like S3, Azure Blob, or GCS) offers several advantages:

- **No Data Movement**: Query data where it lives without importing it into Databend tables
- **Cost Efficiency**: Avoid storage duplication and reduce data transfer costs
- **Flexibility**: Analyze data from multiple sources without permanent storage commitment
- **Schema Preservation**: Leverage Parquet's embedded schema for accurate data type handling
- **Performance**: Take advantage of Parquet's columnar format for analytical queries

## Syntax

```sql
SELECT [<alias>.]<column> [, <column> ...]
FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
[(
  [<connection_parameters>],
  [ PATTERN => '<regex_pattern>' ],
  [ FILE_FORMAT => 'PARQUET | <custom_format_name>' ],
  [ FILES => ( '<file_name>' [, '<file_name>' ...] ) ],
  [ CASE_SENSITIVE => true | false ]
)]
```

### Parameters Explained

**Data Source Options:**
- `@<stage_name>[/<path>]`: Reference to a named stage with an optional subdirectory path
- `'<uri>'`: Direct URI to the storage location (e.g., `'s3://bucket/path/'`)
- `<table_alias>`: Optional alias for the data source in your query

**Query Options:**
- `<connection_parameters>`: Authentication and connection settings for the storage service
- `PATTERN => '<regex_pattern>'`: Regular expression to filter files by name (e.g., `'.*\.parquet$'`)
- `FILE_FORMAT => 'PARQUET | <custom_format_name>'`: Specify the built-in PARQUET format or a custom format name
- `FILES => ( '<file_name>' [, ...] )`: Explicitly list specific files to query instead of using pattern matching
- `CASE_SENSITIVE => true | false`: Control whether column name matching is case-sensitive (default: true)

**When to Use Each Parameter:**
- Use `PATTERN` when you want to query multiple files matching a naming convention
- Use `FILES` when you need to query specific files by name
- Use `CASE_SENSITIVE => false` when your Parquet files have inconsistent column name casing (see the sketch after this list)
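
A minimal sketch of the three options side by side, assuming a hypothetical stage `@my_stage` that holds files such as `orders_2023.parquet`:

```sql
-- PATTERN: query every file that follows a naming convention
SELECT * FROM @my_stage (FILE_FORMAT => 'PARQUET', PATTERN => 'orders_.*[.]parquet');

-- FILES: query exactly the files you name
SELECT * FROM @my_stage (FILE_FORMAT => 'PARQUET', FILES => ('orders_2023.parquet'));

-- CASE_SENSITIVE => false: tolerate mixed-case column names across files
SELECT order_id FROM @my_stage (FILE_FORMAT => 'PARQUET', CASE_SENSITIVE => false);
```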

### Parquet Query Behavior

**Schema and Data Type Handling:**

Parquet files differ from other formats (like CSV or JSON) because they contain embedded schema information. This provides several advantages when querying:

* **Native Data Types**: Column values are returned in their original data types (INTEGER, VARCHAR, TIMESTAMP, etc.) rather than as generic VARIANT types
* **Direct Column Access**: Reference columns directly by name: `SELECT id, name, age FROM @stage_name`
* **No Type Conversion**: No need for explicit type casting or path expressions like `$1:name::VARCHAR`
* **Schema Validation**: Databend automatically validates that your query columns exist in the Parquet schema

**Example Comparison:**
```sql
-- Parquet (simple and type-safe)
SELECT customer_id, order_date, total_amount FROM @parquet_stage

-- JSON equivalent (requires path expressions and casting)
SELECT $1:customer_id::INTEGER, $1:order_date::DATE, $1:total_amount::DECIMAL FROM @json_stage
```

## Step-by-Step Tutorial

This tutorial demonstrates how to set up and query Parquet files stored in an external S3 bucket. The process involves creating a stage (a connection to external storage), defining a file format, and executing queries.

### Step 1. Create an External Stage

**Purpose**: A stage in Databend acts as a named connection to external storage, allowing you to reference your S3 bucket easily in queries without repeating connection details.

**What this step accomplishes:**
- Creates a reusable connection named `parquet_query_stage`
- Points to the S3 location `s3://load/parquet/` where your Parquet files are stored
- Stores authentication credentials securely within Databend

```sql
CREATE STAGE parquet_query_stage
URL = 's3://load/parquet/'
CONNECTION = (
    ACCESS_KEY_ID = '<your-access-key-id>',
    SECRET_ACCESS_KEY = '<your-secret-access-key>'
);
```

**Configuration Notes:**
- Replace `<your-access-key-id>` and `<your-secret-access-key>` with your actual AWS credentials
- The URL should point to the directory containing your Parquet files
- You can include subdirectories in the URL or specify them later in queries
- The stage name (`parquet_query_stage`) will be referenced in subsequent queries
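
As an optional check before moving on, you can confirm the stage resolves and see which files it exposes:

```sql
-- Verify connectivity and inspect the staged files
LIST @parquet_query_stage;
```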

### Step 2. Create a Custom Parquet File Format

**Purpose**: While Databend has built-in Parquet support, creating a named file format allows you to:
- Reuse format settings across multiple queries
- Customize Parquet-specific options if needed
- Make queries more readable and maintainable

**What this step accomplishes:**
- Creates a named file format called `parquet_query_format`
- Explicitly specifies the format type as PARQUET
- Provides a reusable reference for consistent file processing

```sql
CREATE FILE FORMAT parquet_query_format
    TYPE = PARQUET
;
```

**Alternative Approach:**
You can also use the built-in format directly in queries without creating a custom format:
```sql
-- Using built-in format (simpler for one-time queries)
SELECT * FROM @parquet_query_stage (FILE_FORMAT => 'PARQUET')

-- Using custom format (better for repeated queries)
SELECT * FROM @parquet_query_stage (FILE_FORMAT => 'parquet_query_format')
```

For advanced Parquet file format options, see [Parquet File Format Options](/sql/sql-reference/file-format-options#parquet-options).

### Step 3. Query Parquet Files

**Purpose**: Execute a query against all Parquet files in the stage that match the specified pattern.

**What this query does:**
- Selects all columns (`*`) from Parquet files in the stage
- Uses the custom file format created in Step 2
- Applies a pattern to match files ending with `.parquet`
- Returns data with proper data types preserved from the Parquet schema

```sql
SELECT *
FROM @parquet_query_stage
(
    FILE_FORMAT => 'parquet_query_format',
    PATTERN => '.*[.]parquet'
);
```

**Query Components Explained:**
- `@parquet_query_stage`: References the stage created in Step 1
- `FILE_FORMAT => 'parquet_query_format'`: Uses the custom format from Step 2
- `PATTERN => '.*[.]parquet'`: Regex pattern matching any file ending with `.parquet`
  - `.*` matches any sequence of characters
  - `[.]` matches a literal dot (the character class avoids the need to escape it)
  - `parquet` matches the file extension

**Query Variations:**
```sql
-- Query specific columns
SELECT customer_id, order_date, total_amount FROM @parquet_query_stage (...)

-- Query specific files
SELECT * FROM @parquet_query_stage (FILE_FORMAT => 'parquet_query_format', FILES => ('data1.parquet', 'data2.parquet'))

-- Query with subdirectory
SELECT * FROM @parquet_query_stage/2023/orders (...)
```

## Advanced Querying with Metadata

### Understanding Metadata Columns

When querying multiple Parquet files from a stage, Databend automatically provides metadata columns that help you understand the source and structure of your data. These virtual columns are especially useful for:

- **Data Auditing**: Track which file each row originated from
- **Debugging**: Identify problematic rows by file and line number
- **Data Processing**: Implement file-based processing logic
- **Monitoring**: Understand data distribution across files

### Available Metadata Columns

- `METADATA$FILENAME`: The name of the source Parquet file
- `METADATA$FILE_ROW_NUMBER`: The row number within the source file (1-based)

### Metadata Query Example

**Purpose**: Query Parquet files while capturing source file information for each row.

```sql
SELECT
    METADATA$FILENAME,
    METADATA$FILE_ROW_NUMBER,
    *
FROM @parquet_query_stage
(
    FILE_FORMAT => 'parquet_query_format',
    PATTERN => '.*[.]parquet'
);
```

**Query Results Structure:**
```
METADATA$FILENAME     | METADATA$FILE_ROW_NUMBER | customer_id | order_date | total_amount
--------------------- | ------------------------ | ----------- | ---------- | ------------
sales_2023_q1.parquet | 1                        | 1001        | 2023-01-15 | 299.99
sales_2023_q1.parquet | 2                        | 1002        | 2023-01-16 | 149.50
sales_2023_q2.parquet | 1                        | 1003        | 2023-04-10 | 599.00
```

**Practical Use Cases:**

```sql
-- Find all records from a specific file
SELECT * FROM @parquet_query_stage (...)
WHERE METADATA$FILENAME = 'sales_2023_q1.parquet'

-- Group data by source file
SELECT METADATA$FILENAME, COUNT(*) as record_count
FROM @parquet_query_stage (...)
GROUP BY METADATA$FILENAME

-- Identify the first record from each file
SELECT * FROM @parquet_query_stage (...)
WHERE METADATA$FILE_ROW_NUMBER = 1
```

## Performance Considerations and Best Practices

### Query Optimization

**File Organization:**
- **Partition by Date**: Store files in date-based directories for efficient querying, e.g. `/year=2023/month=01/` (see the sketch after this list)
- **Consistent Schema**: Ensure all Parquet files in a stage have compatible schemas
- **Appropriate File Size**: Aim for files between 128 MB and 1 GB for optimal performance
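
For example, with a hypothetical date-partitioned layout such as `s3://load/parquet/year=2023/month=01/`, a query can target a single partition through the stage path instead of scanning every file:

```sql
-- Scans only the January 2023 subdirectory of the stage
SELECT customer_id, total_amount
FROM @parquet_query_stage/year=2023/month=01
(FILE_FORMAT => 'parquet_query_format', PATTERN => '.*[.]parquet');
```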

**Query Patterns:**
- **Column Selection**: Use specific column names instead of `SELECT *` to reduce data transfer
- **Pattern Optimization**: Use precise regex patterns to avoid scanning unnecessary files
- **File Filtering**: Use the `FILES` parameter when querying specific files rather than pattern matching

**Example Optimized Queries:**
```sql
-- Good: specific columns and a precise pattern
SELECT customer_id, total_amount
FROM @parquet_query_stage
(FILE_FORMAT => 'parquet_query_format', PATTERN => 'sales_2023_.*\.parquet$')

-- Better: query specific files when known
SELECT customer_id, total_amount
FROM @parquet_query_stage
(FILE_FORMAT => 'parquet_query_format', FILES => ('sales_2023_q1.parquet', 'sales_2023_q2.parquet'))
```

### Storage Best Practices

**Parquet File Optimization:**
- Use appropriate compression (SNAPPY for speed, GZIP for size)
- Take advantage of column pruning by reading only the columns you need
- Tune row group size based on your query patterns

**Stage Configuration:**
- Use the same AWS region for both Databend and your S3 bucket
- Configure appropriate IAM permissions for security
- Consider using S3 Transfer Acceleration for cross-region access

## Common Troubleshooting Scenarios

### Schema-Related Issues

**Problem**: "Column not found" errors
```
ERROR: Column 'customer_name' not found in parquet file
```
**Solutions:**
- Verify column names match exactly (check case sensitivity)
- Use `CASE_SENSITIVE => false` if column casing is inconsistent
- Check that all files in the pattern have the same schema

**Problem**: Data type mismatches between files
```
ERROR: Schema mismatch: column 'amount' has type DECIMAL in file1.parquet but INTEGER in file2.parquet
```
**Solutions:**
- Ensure consistent data types across all Parquet files
- Query files with compatible schemas separately
- Use explicit type casting in your queries when necessary (see the sketch after this list)
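
A sketch of that casting workaround, assuming a hypothetical `amount` column that is stored as INTEGER in some files:

```sql
-- Read amount uniformly as DECIMAL regardless of its native type per file
SELECT CAST(amount AS DECIMAL(10, 2)) AS amount
FROM @parquet_query_stage
(FILE_FORMAT => 'parquet_query_format', PATTERN => '.*[.]parquet');
```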

### File Access Issues

**Problem**: "Access denied" or "File not found" errors
```
ERROR: Access denied: s3://load/parquet/sales_2023.parquet
```
**Solutions:**
- Verify AWS credentials have proper S3 permissions
- Check that the bucket and file paths are correct
- Ensure the stage URL matches your actual S3 structure

**Problem**: Pattern matches no files
```
ERROR: No files match pattern '.*\.parquet$'
```
**Solutions:**
- List files in your S3 bucket to verify naming conventions
- Test your regex pattern against actual file names (see the `LIST` sketch after this list)
- Use `FILES => (...)` to specify exact file names for testing
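
You can test a pattern without running a full query by pointing `LIST` at the stage (shown here with the tutorial's stage):

```sql
-- See every file the stage exposes
LIST @parquet_query_stage;

-- See which files a candidate pattern actually matches
LIST @parquet_query_stage PATTERN = '.*[.]parquet';
```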

### Performance Issues

**Problem**: Slow query performance
**Solutions:**
- Reduce the number of files by using more specific patterns
- Select only required columns instead of using `SELECT *`
- Check if files are properly compressed
- Consider the network latency between Databend and your storage

**Problem**: Memory issues with large datasets
**Solutions:**
- Query smaller subsets of data using file patterns or date ranges
- Use `LIMIT` clauses for initial data exploration (see the sketch after this list)
- Consider breaking large queries into smaller chunks
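
A minimal exploration sketch along those lines:

```sql
-- Peek at a small sample before running the full query
SELECT *
FROM @parquet_query_stage
(FILE_FORMAT => 'parquet_query_format', PATTERN => '.*[.]parquet')
LIMIT 100;
```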

### Best Practices for Debugging

1. **Start Simple**: Begin with a single file using the `FILES` parameter
2. **Test Patterns**: Use simple patterns first, then make them more specific
3. **Check Metadata**: Use metadata columns to understand file structure
4. **Verify Credentials**: Test stage connectivity with a simple `LIST @stage_name` command
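
Taken together, a debugging pass might look like the following sketch (using the tutorial's stage and format; `data1.parquet` is a placeholder file name):

```sql
-- 1. Verify connectivity and see what the stage contains
LIST @parquet_query_stage;

-- 2. Start with a single known file before widening to a pattern
SELECT * FROM @parquet_query_stage
(FILE_FORMAT => 'parquet_query_format', FILES => ('data1.parquet'))
LIMIT 10;

-- 3. Confirm which files a pattern actually touches
SELECT DISTINCT METADATA$FILENAME
FROM @parquet_query_stage
(FILE_FORMAT => 'parquet_query_format', PATTERN => '.*[.]parquet');
```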
