-
Notifications
You must be signed in to change notification settings - Fork 57
feat: column mapping and auto column mapping #118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pkg/csvcopy/csvcopy.go
Outdated
if c.skip != 1 { | ||
return Result{}, fmt.Errorf("column mapping requires skip to be exactly 1 (one header row)") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed during a pairing review session, If we want to keep letting the user skip lines at the beginning of the file, this breaks with such a feature.
An alternative to this plus a test with a valid use case can be found in my draft pr: #120
Also note there is a TODO we might want to discuss
csvcopy.WithSkipHeaderCount(headerLinesCnt), | ||
) | ||
if headerLinesCnt == 1 { | ||
opts = append(opts, csvcopy.WithSkipHeader(true)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should keep calling this "skipping headers" thing. We just skip lines, not header line of the file. It doesn't matter if the first line has the headers or rather comments.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you call WithSkipHeader and (Auto)ColumnMapping at the same time, it should fail. That is the reason why I keep both. It is more like a safe check.
but now that I read this code, it will make it impossible to skip just 1 row and then use column mapping 💥
I just don't want people setting both by mistake just to realize it errors in a strange way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we need a way to detect
- A user was already using skip header (otherwise the import fails if the file has headers)
- They set the flag for column mapping without removing skip header
I expect this to be a quite common path and if we don't error properly, it will be very confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't want to add a breaking change, I would just add a new arg called "skip lines" or similar, transform usages of the previos params to the new one, and deprecate the usage with a warning or similar.
Otherwise 🔥 and YOLO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a breaking change is the right thing to do. It will be straightforward to change for anyone facing the issue if we trigger the right error message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an initial review.
if headerLinesCnt != 1 { | ||
log.Fatalf("Error: -header-line-count is deprecated. Use -skip-lines instead") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we deprecating this flag? And why in the context of this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is in the thread above #118 (comment)
|
||
// parseColumnMapping parses column mapping string into csvcopy.ColumnsMapping | ||
// Supports two formats: | ||
// 1. Simple: "csv_col1:db_col1,csv_col2:db_col2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to support 2 methods? Wouldn't one simplify UX for the user?
Have you run tests with random column names that need to be quoted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simple method works for most of use cases and is very comfortable to type in the terminal. BUT as you already noticed, it will fail with strange column names as it will provably conflict with terminal syntax.
the json format is bullet proof. as json encoding is well defined and you can easily define any column name you want without having to worry about your terminal. This includes syntax to quote characters. So there is no need for extra validation. IF you mess it up, the code will fail on the first attempt to insert data into your database.
It also doesn't make a lot of sense to have unit tests for this. As it is just json.Unmarshall
func parseJSONColumnMapping(jsonStr string) (csvcopy.ColumnsMapping, error) { | ||
var mappingMap map[string]string | ||
if err := json.Unmarshal([]byte(jsonStr), &mappingMap); err != nil { | ||
return nil, fmt.Errorf("invalid JSON format for column mapping: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's return the correct format as part of the error.
if mappings == nil { | ||
return errors.New("column mapping cannot be nil") | ||
} | ||
if c.useFileHeaders != HeaderNone { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mutually exclusive group is getting kind of big. I don't have a strong suggestion, I was thinking of maybe using having some share prefix, but that won't work for header:
- columns
- columns-mapping
- columns-mapping-auto
- columns-skip-header
The error could be along the lines like only one columns option is allowed. Again not a strong suggestion.
One thing though, we should change the error messages to show the actual flags instead of the Go options function name. These are errors that are surfaced to the user.
Instead of:
header handling is already configured. Use only one of: WithSkipHeader, WithColumnMapping, or WithAutoColumnMapping
Do:
header handling is already configured. Use only one of: --skip-header, --column-mapping, or --auto-column-mapping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been considering about the bit to replace WithSkipHeader with --skip-header. But I think that has to be done specifically for the CLI interface. otherwise, the PKG interface will be displaying cli flags 😕
in any case, both are similar enough for an LLM to see the relationship 😉 so I expect humans to be smarter, right? 🐵
fmt.Println(res) | ||
} | ||
|
||
// parseColumnMapping parses column mapping string into csvcopy.ColumnsMapping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should move these to a config package and create some tests to validate the mapping works. We should also test weird valid Postgres column names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll rather keep this function near in main.go file than to create a separate package. Specially given this is specific to the cli interface.
|
||
const ( | ||
HeaderNone HeaderHandling = iota | ||
HeaderSkip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this could help with the UX of how to present the flags to the user and the mutual exclusion by prefixing the flags with header
. Related to my other comment about the validation logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you have a point here. We may want to step back and think about how flags look like. We have a bit of a mess right now. as you already noticed
return fmt.Errorf("failed to parse headers: %w", err) | ||
} | ||
|
||
if len(c.columnMapping) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be driven by c.useFileHeaders == HeaderAutoColumnMapping
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are right, both will work as HeaderAutoColumnMapping
implies len(columnMapping) == 0
return nil | ||
} | ||
|
||
func validateColumnMapping(columnMapping ColumnsMapping) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check that the DB columns exists on the table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, it will be complicated at this stage as we do not have a connection to the database.
If you get it wrong, the first insert will fail with a clear error message, so I think it is not worth the complexity.
batchSize = opts.BatchByteSize | ||
} | ||
|
||
if batchSize < bufferSize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we keep this check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fail as soon as you hit
https://github.com/timescale/timescaledb-parallel-copy/pull/118/files/8d5e5538b55c24bf31bdee981632cb7d0efdc4a8#diff-50cd4d562d776a354f7d300bd42b001d9736e761941da821c9e14344d51a78e8R171
with return fmt.Errorf("reading lines, %w. you should provably increase batch size", err)
this test covers for the case
https://github.com/timescale/timescaledb-parallel-copy/pull/118/files/8d5e5538b55c24bf31bdee981632cb7d0efdc4a8
It is not ideal, as it fails after the initial read. But it still lets you know what the solution is.
Maybe we should move batchSize parsing up as well. I remember there was a reason why I didn't do it. But I don't remember it :harold:
I'm done with my review, feel free to implement the comments, or not, your choice. |
The current implementation only supports
--column-names
which requires the user to manually provide the names for the database columns.Given that CSV files already support headers, adding
--auto-column-mapping
allows to take advantage of such information and use it to build--column-names
dynamically. This is less error prone and more flexible.The
--column-mapping
flag is added along with--auto-column-mapping
because it resolves another common problem where your input csv column names do not match the database column names.--auto-column-mapping
is a concrete case of--column-mapping
where source and destination are the same name.