Skip to content

Conversation

MetalBlueberry
Copy link
Contributor

@MetalBlueberry MetalBlueberry commented Jul 1, 2025

The current implementation only supports --column-names which requires the user to manually provide the names for the database columns.

Given that CSV files already support headers, adding --auto-column-mapping allows to take advantage of such information and use it to build --column-names dynamically. This is less error prone and more flexible.

The --column-mapping flag is added along with --auto-column-mapping because it resolves another common problem where your input csv column names do not match the database column names. --auto-column-mapping is a concrete case of --column-mapping where source and destination are the same name.

@MetalBlueberry MetalBlueberry changed the base branch from main to vperez/implement-transaction-control July 1, 2025 08:22
Base automatically changed from vperez/implement-transaction-control to main July 1, 2025 11:19
Comment on lines 159 to 161
if c.skip != 1 {
return Result{}, fmt.Errorf("column mapping requires skip to be exactly 1 (one header row)")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed during a pairing review session, If we want to keep letting the user skip lines at the beginning of the file, this breaks with such a feature.
An alternative to this plus a test with a valid use case can be found in my draft pr: #120
Also note there is a TODO we might want to discuss

csvcopy.WithSkipHeaderCount(headerLinesCnt),
)
if headerLinesCnt == 1 {
opts = append(opts, csvcopy.WithSkipHeader(true))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should keep calling this "skipping headers" thing. We just skip lines, not header line of the file. It doesn't matter if the first line has the headers or rather comments.

WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you call WithSkipHeader and (Auto)ColumnMapping at the same time, it should fail. That is the reason why I keep both. It is more like a safe check.

but now that I read this code, it will make it impossible to skip just 1 row and then use column mapping 💥

I just don't want people setting both by mistake just to realize it errors in a strange way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we need a way to detect

  • A user was already using skip header (otherwise the import fails if the file has headers)
  • They set the flag for column mapping without removing skip header

I expect this to be a quite common path and if we don't error properly, it will be very confusing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't want to add a breaking change, I would just add a new arg called "skip lines" or similar, transform usages of the previos params to the new one, and deprecate the usage with a warning or similar.

Otherwise 🔥 and YOLO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a breaking change is the right thing to do. It will be straightforward to change for anyone facing the issue if we trigger the right error message

@MetalBlueberry MetalBlueberry changed the title vperez/implement column mapping feat: column mapping and auto column mapping Jul 2, 2025
@MetalBlueberry MetalBlueberry marked this pull request as ready for review July 4, 2025 11:46
Copy link
Contributor

@alejandrodnm alejandrodnm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an initial review.

Comment on lines +116 to +118
if headerLinesCnt != 1 {
log.Fatalf("Error: -header-line-count is deprecated. Use -skip-lines instead")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we deprecating this flag? And why in the context of this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is in the thread above #118 (comment)


// parseColumnMapping parses column mapping string into csvcopy.ColumnsMapping
// Supports two formats:
// 1. Simple: "csv_col1:db_col1,csv_col2:db_col2"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to support 2 methods? Wouldn't one simplify UX for the user?

Have you run tests with random column names that need to be quoted?

Copy link
Contributor Author

@MetalBlueberry MetalBlueberry Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simple method works for most of use cases and is very comfortable to type in the terminal. BUT as you already noticed, it will fail with strange column names as it will provably conflict with terminal syntax.

the json format is bullet proof. as json encoding is well defined and you can easily define any column name you want without having to worry about your terminal. This includes syntax to quote characters. So there is no need for extra validation. IF you mess it up, the code will fail on the first attempt to insert data into your database.

It also doesn't make a lot of sense to have unit tests for this. As it is just json.Unmarshall

func parseJSONColumnMapping(jsonStr string) (csvcopy.ColumnsMapping, error) {
var mappingMap map[string]string
if err := json.Unmarshal([]byte(jsonStr), &mappingMap); err != nil {
return nil, fmt.Errorf("invalid JSON format for column mapping: %w", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's return the correct format as part of the error.

if mappings == nil {
return errors.New("column mapping cannot be nil")
}
if c.useFileHeaders != HeaderNone {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mutually exclusive group is getting kind of big. I don't have a strong suggestion, I was thinking of maybe using having some share prefix, but that won't work for header:

  • columns
  • columns-mapping
  • columns-mapping-auto
  • columns-skip-header

The error could be along the lines like only one columns option is allowed. Again not a strong suggestion.

One thing though, we should change the error messages to show the actual flags instead of the Go options function name. These are errors that are surfaced to the user.

Instead of:

header handling is already configured. Use only one of: WithSkipHeader, WithColumnMapping, or WithAutoColumnMapping

Do:

header handling is already configured. Use only one of: --skip-header, --column-mapping, or --auto-column-mapping

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been considering about the bit to replace WithSkipHeader with --skip-header. But I think that has to be done specifically for the CLI interface. otherwise, the PKG interface will be displaying cli flags 😕

in any case, both are similar enough for an LLM to see the relationship 😉 so I expect humans to be smarter, right? 🐵

fmt.Println(res)
}

// parseColumnMapping parses column mapping string into csvcopy.ColumnsMapping
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should move these to a config package and create some tests to validate the mapping works. We should also test weird valid Postgres column names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rather keep this function near in main.go file than to create a separate package. Specially given this is specific to the cli interface.


const (
HeaderNone HeaderHandling = iota
HeaderSkip
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this could help with the UX of how to present the flags to the user and the mutual exclusion by prefixing the flags with header. Related to my other comment about the validation logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have a point here. We may want to step back and think about how flags look like. We have a bit of a mess right now. as you already noticed

return fmt.Errorf("failed to parse headers: %w", err)
}

if len(c.columnMapping) == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be driven by c.useFileHeaders == HeaderAutoColumnMapping?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right, both will work as HeaderAutoColumnMapping implies len(columnMapping) == 0

return nil
}

func validateColumnMapping(columnMapping ColumnsMapping) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check that the DB columns exists on the table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, it will be complicated at this stage as we do not have a connection to the database.

If you get it wrong, the first insert will fail with a clear error message, so I think it is not worth the complexity.

batchSize = opts.BatchByteSize
}

if batchSize < bufferSize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep this check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fail as soon as you hit
https://github.com/timescale/timescaledb-parallel-copy/pull/118/files/8d5e5538b55c24bf31bdee981632cb7d0efdc4a8#diff-50cd4d562d776a354f7d300bd42b001d9736e761941da821c9e14344d51a78e8R171

with return fmt.Errorf("reading lines, %w. you should provably increase batch size", err)

this test covers for the case
https://github.com/timescale/timescaledb-parallel-copy/pull/118/files/8d5e5538b55c24bf31bdee981632cb7d0efdc4a8

It is not ideal, as it fails after the initial read. But it still lets you know what the solution is.

Maybe we should move batchSize parsing up as well. I remember there was a reason why I didn't do it. But I don't remember it :harold:

@alejandrodnm
Copy link
Contributor

I'm done with my review, feel free to implement the comments, or not, your choice.

@MetalBlueberry MetalBlueberry merged commit fb931ee into main Sep 1, 2025
3 checks passed
@MetalBlueberry MetalBlueberry deleted the vperez/implement-column-mapping branch September 1, 2025 14:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants