improve NDJSON querying documentation with context and explanations
- Add comprehensive introduction explaining what NDJSON is and its advantages
- Provide concrete NDJSON data examples showing realistic book catalog data
- Add detailed explanations for all existing code examples including syntax breakdown
- Explain metadata columns usage and practical use cases
- Add related documentation links for better navigation
- Enhance tutorial flow with clear step-by-step context
docs/en/guides/40-load-data/04-transform/03-querying-ndjson.md
49 additions & 3 deletions
@@ -3,6 +3,24 @@ title: Querying NDJSON Files in Stage
 sidebar_label: NDJSON
 ---
 
+In Databend, you can directly query NDJSON files stored in stages without first loading the data into tables. This approach is particularly useful for data exploration, ETL processing, and ad-hoc analysis scenarios.
+
+## What is NDJSON?
+
+NDJSON (Newline Delimited JSON) is a JSON-based file format where each line contains a complete and valid JSON object. This format is especially well-suited for streaming data processing and big data analytics.
+- **Stream-friendly**: Can be parsed line by line without loading entire file into memory
+- **Big data compatible**: Widely used in log files, data exports, and ETL pipelines
+- **Easy to process**: Each line is an independent JSON object, enabling parallel processing
+
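To make the "one JSON object per line" property concrete, here is a minimal Python sketch of line-by-line parsing. It is an illustration only and not part of the changed file: the sample records and the `iter_ndjson` helper are hypothetical, and the field names `title` and `author` are borrowed from the queries later in this diff.

```python
import io
import json

# Hypothetical sample data: two book records, one JSON object per line.
SAMPLE_NDJSON = (
    '{"title": "Title_0", "author": "Author_0"}\n'
    '{"title": "Title_1", "author": "Author_1"}\n'
)


def iter_ndjson(stream):
    """Yield one parsed object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            # Each line is a complete JSON document, so it parses on its own.
            yield json.loads(line)


# io.StringIO stands in for an open file; lines are consumed one at a time.
records = list(iter_ndjson(io.StringIO(SAMPLE_NDJSON)))
titles = [record["title"] for record in records]
```

Because each line parses independently, a reader never needs more than one line in memory, and separate lines can be handed to separate workers.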
 ## Query NDJSON Files in Stage
 
 Syntax:
@@ -57,6 +75,8 @@ CREATE FILE FORMAT ndjson_query_format
 
 ### Step 3. Query NDJSON Files
 
+Now you can query the NDJSON files directly from the stage. This example extracts the `title` and `author` fields from each JSON object:
+
 ```sql
 SELECT $1:title, $1:author
 FROM @ndjson_query_stage
@@ -66,7 +86,15 @@ FROM @ndjson_query_stage
 );
 ```
 
-If the NDJSON files are compressed with gzip, we can use the following query:
+**Explanation:**
+- `$1:title` and `$1:author`: Extract specific fields from the JSON object. The `$1` represents the entire JSON object as a variant, and `:field_name` accesses individual fields
+- `@ndjson_query_stage`: References the external stage created in Step 1
+- `FILE_FORMAT => 'ndjson_query_format'`: Uses the custom file format defined in Step 2
+- `PATTERN => '.*[.]ndjson'`: Regex pattern that matches all files ending with `.ndjson`
+
+### Querying Compressed Files
+
+If the NDJSON files are compressed with gzip, modify the pattern to match compressed files:
 
 ```sql
 SELECT $1:title, $1:author
@@ -76,9 +104,11 @@ FROM @ndjson_query_stage
 PATTERN => '.*[.]ndjson[.]gz'
 );
 ```
+
+**Key difference:** The pattern `.*[.]ndjson[.]gz` matches files ending with `.ndjson.gz`. Databend automatically decompresses gzip files during query execution thanks to the `COMPRESSION = AUTO` setting in the file format.
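The two patterns can be compared outside Databend with a short Python sketch. This rests on assumptions: Python's `re` module stands in for Databend's regex engine, the pattern is assumed to be matched against the whole file name (hence `fullmatch`), and the file names are hypothetical. The `gzip` round trip only illustrates that compressed NDJSON is still one JSON object per line after decompression; it says nothing about how Databend implements it.

```python
import gzip
import json
import re

# The two PATTERN values from the queries, compiled with Python's regex engine.
PLAIN = re.compile(r".*[.]ndjson")
GZIPPED = re.compile(r".*[.]ndjson[.]gz")

# Hypothetical file names in a stage.
names = ["books.ndjson", "books.ndjson.gz", "books.json"]

# fullmatch assumes the pattern must cover the entire name.
plain_matches = [name for name in names if PLAIN.fullmatch(name)]
gz_matches = [name for name in names if GZIPPED.fullmatch(name)]

# Compress one NDJSON line, then decompress and parse it again.
payload = gzip.compress(b'{"title": "Title_0", "author": "Author_0"}\n')
text = gzip.decompress(payload).decode("utf-8")
rows = [json.loads(line) for line in text.splitlines() if line]
```

Under these assumptions each pattern selects exactly one of the three names, which is why the compressed query needs its own `PATTERN` value.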
 ### Query with Metadata
 
-Query NDJSON files directly from a stage, including metadata columns like `METADATA$FILENAME`and `METADATA$FILE_ROW_NUMBER`:
+You can also include file metadata in your queries, which is useful for tracking data lineage and debugging:
 
 ```sql
 SELECT
@@ -90,4 +120,20 @@ FROM @ndjson_query_stage
 FILE_FORMAT => 'ndjson_query_format',
 PATTERN => '.*[.]ndjson'
 );
-```
+```
+
+**Metadata columns explained:**
+- `METADATA$FILENAME`: Shows which file each row came from - helpful when querying multiple files
+- `METADATA$FILE_ROW_NUMBER`: Shows the line number within the source file - useful for tracking specific records
+
+**Use cases:**
+- **Data lineage**: Track which source file contributed each record
+- **Debugging**: Identify problematic records by file and line number
+- **Incremental processing**: Process only specific files or ranges within files
+
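The debugging use case can be sketched in Python by pairing every record with its file name and row number, mimicking what the two metadata columns provide. Everything here is hypothetical: the file names, the contents, and the `rows_with_metadata` helper. The `first_row=0` default is also an assumption, so check Databend's metadata documentation for the numbering it actually uses.

```python
import json

# Hypothetical stage contents: file name -> NDJSON text.
FILES = {
    "books_a.ndjson": (
        '{"title": "Title_0", "author": "Author_0"}\n'
        '{"title": "Title_1"}\n'
    ),
    "books_b.ndjson": '{"title": "Title_2", "author": "Author_2"}\n',
}


def rows_with_metadata(files, first_row=0):
    """Yield (filename, row_number, record) for every line of every file.

    first_row is an assumption about where numbering starts.
    """
    for filename, text in files.items():
        for row_number, line in enumerate(text.splitlines(), start=first_row):
            yield filename, row_number, json.loads(line)


# Locate problematic records (here: missing "author") by file and row.
missing_author = [
    (filename, row_number)
    for filename, row_number, record in rows_with_metadata(FILES)
    if "author" not in record
]
```

With the sample data above, the only incomplete record is the second line of `books_a.ndjson`, and the pair of file name and row number is enough to find it in the source.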
+## Related Documentation
+
+- [Loading NDJSON Files](../03-load-semistructured/03-load-ndjson.md) - How to load NDJSON data into tables
+- [NDJSON File Format Options](/sql/sql-reference/file-format-options#ndjson-options) - Complete NDJSON format configuration
+- [CREATE STAGE](/sql/sql-commands/ddl/stage/ddl-create-stage) - Managing external and internal stages
+- [Querying Metadata](./04-querying-metadata.md) - More details about metadata columns