feat: column mapping and auto column mapping #118

MetalBlueberry · 2025-07-01T08:17:18Z

The current implementation only supports --column-names which requires the user to manually provide the names for the database columns.

Given that CSV files already support headers, adding --auto-column-mapping lets the tool use that information to build --column-names dynamically. This is less error-prone and more flexible.

The --column-mapping flag is added along with --auto-column-mapping because it solves another common problem: the column names in the input CSV do not match the database column names. --auto-column-mapping is a special case of --column-mapping where source and destination have the same name.
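The relationship between the two flags can be sketched as follows. This is a minimal illustration, not the PR's code: the `ColumnMapping` type and the `autoMapping` and `resolveColumns` helpers are hypothetical names, and the real `csvcopy.ColumnsMapping` type may look different.

```go
package main

import "fmt"

// ColumnMapping pairs a CSV header name with a database column name.
// Hypothetical type for illustration only.
type ColumnMapping struct {
	CSVColumn string
	DBColumn  string
}

// autoMapping builds the identity mapping that auto column mapping implies:
// every CSV header name is used unchanged as the database column name.
func autoMapping(header []string) []ColumnMapping {
	out := make([]ColumnMapping, 0, len(header))
	for _, name := range header {
		out = append(out, ColumnMapping{CSVColumn: name, DBColumn: name})
	}
	return out
}

// resolveColumns returns the database column names in the order in which
// the columns appear in the CSV header.
func resolveColumns(header []string, mapping []ColumnMapping) ([]string, error) {
	byCSV := make(map[string]string, len(mapping))
	for _, m := range mapping {
		byCSV[m.CSVColumn] = m.DBColumn
	}
	cols := make([]string, 0, len(header))
	for _, name := range header {
		db, ok := byCSV[name]
		if !ok {
			return nil, fmt.Errorf("no mapping for CSV column %q", name)
		}
		cols = append(cols, db)
	}
	return cols, nil
}
```

With a header of `ts,temp` and a mapping of `ts` to `time` and `temp` to `temperature`, `resolveColumns` yields `time, temperature`, regardless of the order in which the mapping entries were given.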

smoya · 2025-07-01T13:05:41Z

pkg/csvcopy/csvcopy.go

+		if c.skip != 1 {
+			return Result{}, fmt.Errorf("column mapping requires skip to be exactly 1 (one header row)")
+		}


As discussed during a pairing review session, if we want to keep letting the user skip lines at the beginning of the file, this check breaks that feature.
An alternative to this, plus a test with a valid use case, can be found in my draft PR: #120
Also note there is a TODO we might want to discuss.

pkg/csvcopy/csvcopy.go

smoya · 2025-07-02T16:17:09Z

cmd/timescaledb-parallel-copy/main.go

-			csvcopy.WithSkipHeaderCount(headerLinesCnt),
-		)
+		if headerLinesCnt == 1 {
+			opts = append(opts, csvcopy.WithSkipHeader(true))


I don't think we should keep calling this "skipping headers". We just skip lines, not the header line of the file. It doesn't matter whether the first line has headers or comments.

WDYT?

I think calling WithSkipHeader and (Auto)ColumnMapping at the same time should fail. That is why I keep both; it is more of a safety check.

But now that I read this code, it makes it impossible to skip just 1 row and then use column mapping 💥

I just don't want people setting both by mistake just to realize it errors in a strange way.

So we need a way to detect:

1. A user was already using skip header (otherwise the import fails if the file has headers)
2. They set the flag for column mapping without removing skip header

I expect this to be a quite common path and if we don't error properly, it will be very confusing.

If you don't want to add a breaking change, I would just add a new arg called "skip lines" or similar, migrate usages of the previous params to the new one, and deprecate the old usage with a warning or similar.

Otherwise 🔥 and YOLO

I think a breaking change is the right thing to do. It will be straightforward to fix for anyone facing the issue if we trigger the right error message.

alejandrodnm

This is an initial review.

alejandrodnm · 2025-07-17T06:53:03Z

cmd/timescaledb-parallel-copy/main.go

+	if headerLinesCnt != 1 {
+		log.Fatalf("Error: -header-line-count is deprecated. Use -skip-lines instead")
+	}


Why are we deprecating this flag? And why in the context of this PR?

It is explained in the thread above: #118 (comment)

alejandrodnm · 2025-07-17T07:02:03Z

cmd/timescaledb-parallel-copy/main.go

+
+// parseColumnMapping parses column mapping string into csvcopy.ColumnsMapping
+// Supports two formats:
+// 1. Simple: "csv_col1:db_col1,csv_col2:db_col2"


Why do we need to support 2 methods? Wouldn't one simplify UX for the user?

Have you run tests with random column names that need to be quoted?

The simple method works for most use cases and is very comfortable to type in the terminal. BUT, as you already noticed, it will fail with strange column names, as they will probably conflict with terminal syntax.

The JSON format is bulletproof, as JSON encoding is well defined and you can easily define any column name you want without having to worry about your terminal. This includes syntax to escape characters. So there is no need for extra validation. If you mess it up, the code will fail on the first attempt to insert data into your database.

It also doesn't make a lot of sense to have unit tests for this, as it is just json.Unmarshal.
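A rough sketch of how the two formats could be told apart and parsed is shown below. It is an illustration under assumptions, not the PR's implementation: the result type is a plain `map[string]string` rather than `csvcopy.ColumnsMapping`, `parseSimpleColumnMapping` is a hypothetical helper name, and detecting JSON by a leading `{` is a guess at the dispatch rule.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// parseColumnMapping accepts either the simple "csv_col:db_col,..." form
// or a JSON object mapping CSV column names to database column names.
func parseColumnMapping(s string) (map[string]string, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return nil, fmt.Errorf("column mapping is empty")
	}
	// Assumption: anything that starts with '{' is treated as JSON.
	if strings.HasPrefix(s, "{") {
		var m map[string]string
		if err := json.Unmarshal([]byte(s), &m); err != nil {
			return nil, fmt.Errorf("invalid JSON format for column mapping: %w", err)
		}
		return m, nil
	}
	return parseSimpleColumnMapping(s)
}

// parseSimpleColumnMapping parses "csv_col1:db_col1,csv_col2:db_col2".
// CSV column names containing ',' or ':' cannot be expressed in this form.
func parseSimpleColumnMapping(s string) (map[string]string, error) {
	out := make(map[string]string)
	for _, pair := range strings.Split(s, ",") {
		csvCol, dbCol, ok := strings.Cut(pair, ":")
		csvCol, dbCol = strings.TrimSpace(csvCol), strings.TrimSpace(dbCol)
		if !ok || csvCol == "" || dbCol == "" {
			return nil, fmt.Errorf("invalid pair %q, expected csv_column:db_column", pair)
		}
		if _, dup := out[csvCol]; dup {
			return nil, fmt.Errorf("duplicate CSV column %q", csvCol)
		}
		out[csvCol] = dbCol
	}
	return out, nil
}
```

The simple form stays short for the common case, while a name such as `weird, name:x` only survives in the JSON form, which matches the trade-off described above.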

alejandrodnm · 2025-07-17T07:04:36Z

cmd/timescaledb-parallel-copy/main.go

+func parseJSONColumnMapping(jsonStr string) (csvcopy.ColumnsMapping, error) {
+	var mappingMap map[string]string
+	if err := json.Unmarshal([]byte(jsonStr), &mappingMap); err != nil {
+		return nil, fmt.Errorf("invalid JSON format for column mapping: %w", err)


Let's return the correct format as part of the error.
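As a sketch of that suggestion, the error could embed a small example of the expected shape. The return type here is a plain `map[string]string` instead of `csvcopy.ColumnsMapping`, and the `jsonMappingExample` constant and the exact wording are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonMappingExample is shown to the user when parsing fails.
// Hypothetical constant; the wording is not taken from the PR.
const jsonMappingExample = `{"csv_col1":"db_col1","csv_col2":"db_col2"}`

// parseJSONColumnMapping decodes a JSON object of CSV column names to
// database column names and reports the expected format on failure.
func parseJSONColumnMapping(jsonStr string) (map[string]string, error) {
	var mapping map[string]string
	if err := json.Unmarshal([]byte(jsonStr), &mapping); err != nil {
		return nil, fmt.Errorf(
			"invalid JSON format for column mapping (expected an object such as %s): %w",
			jsonMappingExample, err,
		)
	}
	return mapping, nil
}
```

Wrapping with `%w` keeps the original `json` error available to callers while the message itself tells the user what to type.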

alejandrodnm · 2025-07-17T07:24:30Z

pkg/csvcopy/options.go

+		if mappings == nil {
+			return errors.New("column mapping cannot be nil")
+		}
+		if c.useFileHeaders != HeaderNone {


The mutually exclusive group is getting kind of big. I don't have a strong suggestion. I was thinking of maybe having a shared prefix, but that won't work for header:

- columns
- columns-mapping
- columns-mapping-auto
- columns-skip-header

The error could be along the lines of "only one columns option is allowed". Again, not a strong suggestion.

One thing though, we should change the error messages to show the actual flags instead of the Go options function name. These are errors that are surfaced to the user.

Instead of:

header handling is already configured. Use only one of: WithSkipHeader, WithColumnMapping, or WithAutoColumnMapping

Do:

header handling is already configured. Use only one of: --skip-header, --column-mapping, or --auto-column-mapping

I've been thinking about replacing WithSkipHeader with --skip-header, but I think that has to be done specifically for the CLI interface. Otherwise, the package interface will be displaying CLI flags 😕

In any case, both are similar enough for an LLM to see the relationship 😉 so I expect humans to be smarter, right? 🐵

alejandrodnm · 2025-07-17T07:30:31Z

cmd/timescaledb-parallel-copy/main.go

 	fmt.Println(res)
 }
+
+// parseColumnMapping parses column mapping string into csvcopy.ColumnsMapping


Maybe we should move these to a config package and create some tests to validate the mapping works. We should also test weird valid Postgres column names.

I'd rather keep this function in main.go than create a separate package, especially given that it is specific to the CLI interface.
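For the "weird valid Postgres column names" concern, the usual approach is to emit every column as a quoted identifier, doubling any embedded double quote. The helpers below are a minimal sketch with hypothetical names (`quoteIdentifier`, `columnList`); they are not taken from the PR's "quote identifiers" commit.

```go
package main

import "strings"

// quoteIdentifier wraps a column name in double quotes and doubles any
// embedded double quote, following PostgreSQL's quoted-identifier syntax.
func quoteIdentifier(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// columnList renders the column names as a comma-separated list suitable
// for the column section of a COPY statement.
func columnList(cols []string) string {
	quoted := make([]string, len(cols))
	for i, c := range cols {
		quoted[i] = quoteIdentifier(c)
	}
	return strings.Join(quoted, ",")
}
```

Note that quoted identifiers are case-sensitive in PostgreSQL, so a mapping to `Time` would no longer match a column created as `time`.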

alejandrodnm · 2025-07-17T07:32:07Z

pkg/csvcopy/csvcopy.go

+
+const (
+	HeaderNone HeaderHandling = iota
+	HeaderSkip


Maybe this could help with the UX of how to present the flags to the user and the mutual exclusion by prefixing the flags with header. Related to my other comment about the validation logic.

I think you have a point here. We may want to step back and think about what the flags look like. We have a bit of a mess right now, as you already noticed.

alejandrodnm · 2025-07-17T07:40:14Z

pkg/csvcopy/csvcopy.go

+		return fmt.Errorf("failed to parse headers: %w", err)
+	}
+
+	if len(c.columnMapping) == 0 {


Shouldn't this be driven by c.useFileHeaders == HeaderAutoColumnMapping?

I think you are right, both will work as HeaderAutoColumnMapping implies len(columnMapping) == 0

alejandrodnm · 2025-07-17T07:45:13Z

pkg/csvcopy/csvcopy.go

+	return nil
+}
+
+func validateColumnMapping(columnMapping ColumnsMapping) error {


Should we check that the DB columns exists on the table?

Nope, it will be complicated at this stage as we do not have a connection to the database.

If you get it wrong, the first insert will fail with a clear error message, so I think it is not worth the complexity.

alejandrodnm · 2025-07-17T09:50:00Z

pkg/csvcopy/scan.go

 		batchSize = opts.BatchByteSize
 	}

-	if batchSize < bufferSize {


Should we keep this check?

This will fail as soon as you hit
https://github.com/timescale/timescaledb-parallel-copy/pull/118/files/8d5e5538b55c24bf31bdee981632cb7d0efdc4a8#diff-50cd4d562d776a354f7d300bd42b001d9736e761941da821c9e14344d51a78e8R171

which returns fmt.Errorf("reading lines, %w. you should provably increase batch size", err)

This test covers that case:
https://github.com/timescale/timescaledb-parallel-copy/pull/118/files/8d5e5538b55c24bf31bdee981632cb7d0efdc4a8

It is not ideal, as it fails after the initial read. But it still lets you know what the solution is.

Maybe we should move batchSize parsing up as well. I remember there was a reason why I didn't do it. But I don't remember it :harold:

alejandrodnm · 2025-07-17T09:58:14Z

I'm done with my review, feel free to implement the comments, or not, your choice.

extract logic to skip headers to Copy so it runs synchronously

742e7a0

MetalBlueberry changed the base branch from main to vperez/implement-transaction-control July 1, 2025 08:22

MetalBlueberry added 5 commits July 1, 2025 10:45

Implemented functionality to parse headers

aaf752a

properly support automatic mapping

a96067b

wops!

4029a15

cleanup comments

7659f43

cleanup

e1caee9

Base automatically changed from vperez/implement-transaction-control to main July 1, 2025 11:19

feat: skip lines and not headers

7fef706

smoya reviewed Jul 1, 2025

added failing test cases

b8992c0

MetalBlueberry commented Jul 1, 2025

pkg/csvcopy/csvcopy.go

MetalBlueberry added 6 commits July 2, 2025 17:22

remove todo comment

d1d9f31

fix incorrect row count when using column mapping

4fbdb3c

ensure options report conflicts properly

881957b

update main to use withskipheader so it properly detects flag conflicts

18e4a95

quote identifiers

6f6c8ed

wops!

0cc1034

smoya reviewed Jul 2, 2025

MetalBlueberry changed the title from "vperez/implement column mapping" to "feat: column mapping and auto column mapping" Jul 2, 2025

MetalBlueberry added 4 commits July 3, 2025 15:28

Update readme

375f0b9

add a new flag to skip lines

ec2f122

fix: properly implement flags

6c15df0

woops!!

0b67b31

MetalBlueberry marked this pull request as ready for review July 4, 2025 11:46

extend test coverage and add readme examples

8d5e553

alejandrodnm reviewed Jul 17, 2025

MetalBlueberry merged commit fb931ee into main Sep 1, 2025
3 checks passed

MetalBlueberry deleted the vperez/implement-column-mapping branch September 1, 2025 14:26
