README.md: 102 additions & 4 deletions
@@ -188,10 +188,16 @@ Other options and flags are also available:
 $ timescaledb-parallel-copy --help
 
 Usage of timescaledb-parallel-copy:
-  -batch-error-output-dir string
-    	directory to store batch errors. Settings this will save a .csv file with the contents of the batch that failed and continue with the rest of the data.
+  -auto-column-mapping
+    	Automatically map CSV headers to database columns with the same names
+  -batch-byte-size int
+    	Max number of bytes to send in a batch (default 20971520)
   -batch-size int
-    	Number of rows per insert (default 5000)
+    	Number of rows per insert. It will be limited by batch-byte-size (default 5000)
+  -buffer-byte-size int
+    	Number of bytes to buffer, it has to be big enough to hold a full row (default 2097152)
+  -column-mapping string
+    	Column mapping from CSV to database columns (format: "csv_col1:db_col1,csv_col2:db_col2" or JSON)
   -columns string
     	Comma-separated columns present in CSV
   -connection string
@@ -236,6 +242,7 @@ Usage of timescaledb-parallel-copy:
     	Number of parallel requests to make (default 1)
 ```
 
+
 ## Purpose
 
 PostgreSQL native `COPY` function is transactional and single-threaded, and may not be suitable for ingesting large
@@ -251,7 +258,7 @@ less often. This improves memory management and keeps operations on the disk as
 
 We welcome contributions to this utility, which like TimescaleDB is released under the Apache2 Open Source License. The same [Contributors Agreement](//github.com/timescale/timescaledb/blob/master/CONTRIBUTING.md) applies; please sign the [Contributor License Agreement](https://cla-assistant.io/timescale/timescaledb-parallel-copy) (CLA) if you're a new contributor.
 
-###Running Tests
+## Running Tests
 
 Some of the tests require a running Postgres database. Set the `TEST_CONNINFO`
 environment variable to point at the database you want to run tests against.
@@ -264,3 +271,94 @@ For example:
 $ createdb gotest
 $ TEST_CONNINFO='dbname=gotest user=myuser' go test -v ./...
 ```
+
+## Advanced usage
+
+### Column Mapping
+
+The tool exposes two flags, `--column-mapping` and `--auto-column-mapping`, that control how CSV headers are matched to database columns.
+
+`--column-mapping` lets you specify how the columns in your CSV map to database columns. It supports two formats:
+
+**Simple format:**
+```bash
+# Map CSV columns to database columns with different names
[...]
+Example CSV file with headers matching database columns:
+```csv
+time,device_id,temperature,humidity
+2023-01-01 00:00:00,sensor_001,20.5,65.2
+2023-01-01 01:00:00,sensor_002,21.0,64.8
+```
+
+Both flags automatically skip the header row and cannot be used together with `--skip-header` or `--columns`.
+
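To make the simple `csv_col:db_col` format concrete, here is a minimal Go sketch of how such a string could be split into ordered pairs. The `ColumnPair` type and `parseSimpleMapping` function are hypothetical names introduced for illustration; this is not the tool's actual parser, and the JSON format is not covered.

```go
package main

import (
	"fmt"
	"strings"
)

// ColumnPair is a hypothetical type for this sketch: one CSV header and the
// database column it maps to.
type ColumnPair struct {
	CSVColumn string
	DBColumn  string
}

// parseSimpleMapping parses "csv_col1:db_col1,csv_col2:db_col2" into ordered
// pairs. It is an illustration only, not the tool's real implementation.
func parseSimpleMapping(spec string) ([]ColumnPair, error) {
	var pairs []ColumnPair
	for _, entry := range strings.Split(spec, ",") {
		csvCol, dbCol, found := strings.Cut(entry, ":")
		csvCol, dbCol = strings.TrimSpace(csvCol), strings.TrimSpace(dbCol)
		// Reject entries without a colon or with an empty side.
		if !found || csvCol == "" || dbCol == "" {
			return nil, fmt.Errorf("invalid mapping entry %q, want csv_col:db_col", entry)
		}
		pairs = append(pairs, ColumnPair{CSVColumn: csvCol, DBColumn: dbCol})
	}
	return pairs, nil
}

func main() {
	pairs, err := parseSimpleMapping("timestamp:time,temp:temperature")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range pairs {
		fmt.Printf("%s -> %s\n", p.CSVColumn, p.DBColumn)
	}
}
```

Keeping the pairs in a slice rather than a map preserves the order in which the user wrote them, which makes error messages easier to relate back to the flag value.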
+**Flexible Column Mapping:**
+
+Column mappings can include entries for columns that are not present in the input CSV file. This allows you to use the same mapping configuration across multiple input files with different column sets:
+
+```bash
+# Define a comprehensive mapping that works with multiple CSV formats
[...]
+Example CSV file with only some of the mapped columns:
+```csv
+timestamp,temp,humidity
+2023-01-01 00:00:00,20.5,65.2
+2023-01-01 01:00:00,21.0,64.8
+```
+
+In this case, only the `timestamp`, `temp`, and `humidity` columns from the CSV will be processed and mapped to `time`, `temperature`, and `humidity_percent` respectively. The unused mappings for `pressure` and `location` are simply ignored, allowing the same mapping configuration to work with different input files that may have varying column sets.
+
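The behavior described above, where mapping entries for absent CSV columns are ignored, can be sketched in Go as a lookup driven by the file's headers rather than by the mapping. The function name `resolveColumns` is hypothetical, the database column names for `pressure` and `location` are assumptions (the source does not show them), and treating an unmapped header as an error is also an assumption of this sketch.

```go
package main

import "fmt"

// resolveColumns returns the database column for each CSV header, in header
// order. Mapping entries whose CSV column is absent from the file are never
// looked up, so they are ignored. Rejecting an unmapped header is an
// assumption of this sketch, not documented tool behavior.
func resolveColumns(headers []string, mapping map[string]string) ([]string, error) {
	dbColumns := make([]string, 0, len(headers))
	for _, h := range headers {
		dbCol, ok := mapping[h]
		if !ok {
			return nil, fmt.Errorf("CSV column %q has no mapping", h)
		}
		dbColumns = append(dbColumns, dbCol)
	}
	return dbColumns, nil
}

func main() {
	mapping := map[string]string{
		"timestamp": "time",
		"temp":      "temperature",
		"humidity":  "humidity_percent",
		"pressure":  "pressure", // assumed target name, unused by this file
		"location":  "location", // assumed target name, unused by this file
	}
	headers := []string{"timestamp", "temp", "humidity"}

	cols, err := resolveColumns(headers, mapping)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cols) // prints: [time temperature humidity_percent]
}
```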
+You can also map different CSV column names to the same database column, as long as only one of them appears in any given input file:
+
+```bash
+# Map both 'temp' and 'temperature' to the same database column
[...]
+This allows importing from different file formats into the same table:
+
+**File A** (uses 'temp'):
+```csv
+timestamp,temp,humidity
+2023-01-01 00:00:00,20.5,65.2
+```
+
+**File B** (uses 'temperature'):
+```csv
+timestamp,temperature,humidity
+2023-01-01 02:00:00,22.1,63.5
+```
+
+Both files can use the same mapping configuration and import successfully into the same database table, even though they use different column names for the temperature data. The tool only validates for duplicate database columns among the columns actually present in each specific input file.
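
One way to picture the per-file duplicate check is the following Go sketch. It only inspects headers that actually occur in the file, so two mapping entries pointing at the same database column conflict only when both CSV columns are present. The function name `checkDuplicateTargets` and the `humidity` target name are assumptions for illustration, not the tool's code.

```go
package main

import "fmt"

// checkDuplicateTargets reports an error if two headers present in the same
// file map to the same database column. Mapping entries for headers that do
// not appear in the file are never consulted.
func checkDuplicateTargets(headers []string, mapping map[string]string) error {
	seen := make(map[string]string) // database column -> CSV header that claimed it
	for _, h := range headers {
		dbCol, ok := mapping[h]
		if !ok {
			continue // unmapped headers are out of scope for this sketch
		}
		if prev, dup := seen[dbCol]; dup {
			return fmt.Errorf("CSV columns %q and %q both map to %q", prev, h, dbCol)
		}
		seen[dbCol] = h
	}
	return nil
}

func main() {
	mapping := map[string]string{
		"timestamp":   "time",
		"temp":        "temperature",
		"temperature": "temperature",
		"humidity":    "humidity", // assumed target name
	}

	fileA := []string{"timestamp", "temp", "humidity"}
	fileB := []string{"timestamp", "temperature", "humidity"}
	both := []string{"timestamp", "temp", "temperature"}

	fmt.Println("file A:", checkDuplicateTargets(fileA, mapping))
	fmt.Println("file B:", checkDuplicateTargets(fileB, mapping))
	fmt.Println("both:  ", checkDuplicateTargets(both, mapping))
}
```

Files A and B pass because each contains only one of the two temperature headers; the third header set fails because both appear together.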
 	flag.StringVar(&escapeCharacter, "escape", "", "The ESCAPE `character` to use during COPY (default '\"')")
 	flag.StringVar(&fromFile, "file", "", "File to read from rather than stdin")
 	flag.StringVar(&columns, "columns", "", "Comma-separated columns present in CSV")
+	flag.StringVar(&columnMapping, "column-mapping", "", "Column mapping from CSV to database columns (format: \"csv_col1:db_col1,csv_col2:db_col2\" or JSON)")
+	flag.BoolVar(&autoColumnMapping, "auto-column-mapping", false, "Automatically map CSV headers to database columns with the same names")
+
 	flag.BoolVar(&skipHeader, "skip-header", false, "Skip the first line of the input")
-	flag.IntVar(&headerLinesCnt, "header-line-count", 1, "Number of header lines")
+	flag.IntVar(&headerLinesCnt, "header-line-count", 1, "(deprecated) Number of header lines")
+	flag.IntVar(&skipLines, "skip-lines", 0, "Skip the first n lines of the input. it is applied before skip-header")
 
 	flag.BoolVar(&skipBatchErrors, "skip-batch-errors", false, "if true, the copy will continue even if a batch fails")
 
@@ -103,6 +112,11 @@ func main() {
 	if dbName != "" {
 		log.Fatalf("Error: Deprecated flag -db-name is being used. Update -connection to connect to the given database")
 	}
+
+	if headerLinesCnt != 1 {
+		log.Fatalf("Error: -header-line-count is deprecated. Use -skip-lines instead")