Commit 9bd3b5c

improve NDJSON querying documentation with context and explanations
- Add comprehensive introduction explaining what NDJSON is and its advantages
- Provide concrete NDJSON data examples showing realistic book catalog data
- Add detailed explanations for all existing code examples including syntax breakdown
- Explain metadata columns usage and practical use cases
- Add related documentation links for better navigation
- Enhance tutorial flow with clear step-by-step context
1 parent 742aab1 commit 9bd3b5c

1 file changed: +49 -3 lines changed

docs/en/guides/40-load-data/04-transform/03-querying-ndjson.md

Lines changed: 49 additions & 3 deletions
@@ -3,6 +3,24 @@ title: Querying NDJSON Files in Stage
 sidebar_label: NDJSON
 ---

+In Databend, you can directly query NDJSON files stored in stages without first loading the data into tables. This approach is particularly useful for data exploration, ETL processing, and ad-hoc analysis scenarios.
+
+## What is NDJSON?
+
+NDJSON (Newline Delimited JSON) is a JSON-based file format where each line contains a complete and valid JSON object. This format is especially well-suited for streaming data processing and big data analytics.
+
+**Example NDJSON file content:**
+```json
+{"id": 1, "title": "Database Fundamentals", "author": "John Doe", "price": 45.50, "category": "Technology"}
+{"id": 2, "title": "Machine Learning in Practice", "author": "Jane Smith", "price": 68.00, "category": "AI"}
+{"id": 3, "title": "Web Development Guide", "author": "Mike Johnson", "price": 52.30, "category": "Frontend"}
+```
+
+**Advantages of NDJSON:**
+- **Stream-friendly**: Can be parsed line by line without loading the entire file into memory
+- **Big data compatible**: Widely used in log files, data exports, and ETL pipelines
+- **Easy to process**: Each line is an independent JSON object, enabling parallel processing
+
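The stream-friendly property listed above can be illustrated outside Databend with a minimal Python sketch. It reuses the three sample records from the example file; the helper name `iter_ndjson` is hypothetical and introduced only for this illustration.

```python
import io
import json

# The same three records as the sample NDJSON content: one JSON object per line.
NDJSON_TEXT = (
    '{"id": 1, "title": "Database Fundamentals", "author": "John Doe", "price": 45.50, "category": "Technology"}\n'
    '{"id": 2, "title": "Machine Learning in Practice", "author": "Jane Smith", "price": 68.00, "category": "AI"}\n'
    '{"id": 3, "title": "Web Development Guide", "author": "Mike Johnson", "price": 52.30, "category": "Frontend"}\n'
)


def iter_ndjson(stream):
    """Yield one parsed object per non-empty line (hypothetical helper).

    Only the current line is held in memory, which is what makes the
    format convenient for streaming.
    """
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


records = list(iter_ndjson(io.StringIO(NDJSON_TEXT)))
titles = [record["title"] for record in records]
print(titles)
```

Because every line parses on its own, a reader can also split a file at line boundaries and hand the pieces to separate workers.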
 ## Query NDJSON Files in Stage

 Syntax:
@@ -57,6 +75,8 @@ CREATE FILE FORMAT ndjson_query_format

 ### Step 3. Query NDJSON Files

+Now you can query the NDJSON files directly from the stage. This example extracts the `title` and `author` fields from each JSON object:
+
 ```sql
 SELECT $1:title, $1:author
 FROM @ndjson_query_stage
@@ -66,7 +86,15 @@ FROM @ndjson_query_stage
 );
 ```

-If the NDJSON files are compressed with gzip, we can use the following query:
+**Explanation:**
+- `$1:title` and `$1:author`: Extract specific fields from the JSON object. The `$1` represents the entire JSON object as a variant, and `:field_name` accesses individual fields
+- `@ndjson_query_stage`: References the external stage created in Step 1
+- `FILE_FORMAT => 'ndjson_query_format'`: Uses the custom file format defined in Step 2
+- `PATTERN => '.*[.]ndjson'`: Regex pattern that matches all files ending with `.ndjson`
+
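To see what the `PATTERN` value selects, the same regular expression can be tried in Python. This is only a sketch: the file names are hypothetical, and it assumes the pattern has to match the whole file name, which is why `re.fullmatch` is used. Databend's own matching rules may differ in detail.

```python
import re

# Same pattern as in the query. "[.]" is a character class holding a literal dot,
# equivalent to an escaped "\." in most regex dialects.
PATTERN = r".*[.]ndjson"

# Hypothetical file names, made up for illustration.
names = [
    "books.ndjson",
    "2024/books.ndjson",
    "books.ndjson.gz",
    "books.json",
    "booksXndjson",
]

# Assumption: the pattern must cover the entire name, hence fullmatch.
matched = [name for name in names if re.fullmatch(PATTERN, name)]
print(matched)
```

Under that assumption, only the names ending in `.ndjson` are selected. The `.gz` name is skipped, which is why the compressed-file query uses a different pattern.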
+### Querying Compressed Files
+
+If the NDJSON files are compressed with gzip, modify the pattern to match compressed files:

 ```sql
 SELECT $1:title, $1:author
@@ -76,9 +104,11 @@ FROM @ndjson_query_stage
 PATTERN => '.*[.]ndjson[.]gz'
 );
 ```
+
+**Key difference:** The pattern `.*[.]ndjson[.]gz` matches files ending with `.ndjson.gz`. Databend automatically decompresses gzip files during query execution thanks to the `COMPRESSION = AUTO` setting in the file format.
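As a rough illustration of why only the file pattern needs to change, the Python sketch below shows that a gzip-compressed NDJSON payload decompresses back into the same line-oriented text. It does not reproduce Databend's implementation; the two records are abbreviated sample data.

```python
import gzip
import io
import json

# Abbreviated sample records, serialized as NDJSON and then gzip-compressed in memory.
sample = [
    {"id": 1, "title": "Database Fundamentals"},
    {"id": 2, "title": "Machine Learning in Practice"},
]
raw = "".join(json.dumps(obj) + "\n" for obj in sample).encode("utf-8")
compressed = gzip.compress(raw)

# gzip.open accepts a file object; mode "rt" yields decoded text lines
# as the data is decompressed, so records can still be read one at a time.
with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as stream:
    titles = [json.loads(line)["title"] for line in stream if line.strip()]

print(titles)
```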
 ### Query with Metadata

-Query NDJSON files directly from a stage, including metadata columns like `METADATA$FILENAME` and `METADATA$FILE_ROW_NUMBER`:
+You can also include file metadata in your queries, which is useful for tracking data lineage and debugging:

 ```sql
 SELECT
@@ -90,4 +120,20 @@ FROM @ndjson_query_stage
 FILE_FORMAT => 'ndjson_query_format',
 PATTERN => '.*[.]ndjson'
 );
-```
+```
+
+**Metadata columns explained:**
+- `METADATA$FILENAME`: Shows which file each row came from - helpful when querying multiple files
+- `METADATA$FILE_ROW_NUMBER`: Shows the line number within the source file - useful for tracking specific records
+
+**Use cases:**
+- **Data lineage**: Track which source file contributed each record
+- **Debugging**: Identify problematic records by file and line number
+- **Incremental processing**: Process only specific files or ranges within files
+
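The idea behind the two metadata columns can be sketched in Python: each parsed record is paired with the name of the file it came from and its position in that file. The file names, contents, and the helper `rows_with_metadata` are hypothetical. The starting row index (`first_row=0`) is an assumption of this sketch, not a statement about how Databend numbers rows.

```python
import io
import json

# Hypothetical in-memory "files"; names and contents are made up for illustration.
FILES = {
    "books_a.ndjson": '{"title": "Database Fundamentals"}\n{"title": "Web Development Guide"}\n',
    "books_b.ndjson": '{"title": "Machine Learning in Practice"}\n',
}


def rows_with_metadata(files, first_row=0):
    """Yield each record together with its source file name and row position."""
    for filename, text in files.items():
        for row_number, line in enumerate(io.StringIO(text), start=first_row):
            yield {
                "filename": filename,
                "file_row_number": row_number,
                "record": json.loads(line),
            }


rows = list(rows_with_metadata(FILES))

# Debugging use case: locate a specific record by file and position.
hits = [
    (row["filename"], row["file_row_number"])
    for row in rows
    if row["record"]["title"] == "Web Development Guide"
]
print(hits)
```

Filtering on the file name in the same way corresponds to the incremental-processing use case, where only selected files are processed.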
+## Related Documentation
+
+- [Loading NDJSON Files](../03-load-semistructured/03-load-ndjson.md) - How to load NDJSON data into tables
+- [NDJSON File Format Options](/sql/sql-reference/file-format-options#ndjson-options) - Complete NDJSON format configuration
+- [CREATE STAGE](/sql/sql-commands/ddl/stage/ddl-create-stage) - Managing external and internal stages
+- [Querying Metadata](./04-querying-metadata.md) - More details about metadata columns
