-
Notifications
You must be signed in to change notification settings - Fork 3.9k
GH-41618: [C++][R][PYTHON]: Add ability to control url encoding behavior in hive partitioning #48086
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
GH-41618: [C++][R][PYTHON]: Add ability to control url encoding behavior in hive partitioning #48086
Conversation
… to control Hive partition URL encoding This commit adds a new optional boolean parameter `url_encode_hive_values` to control whether Hive partition values are URL-encoded when writing datasets. Changes: - C++: Modified HivePartitioning::FormatValues to conditionally apply URL encoding based on segment_encoding (SegmentEncoding::Uri vs SegmentEncoding::None) - Python: Added url_encode_hive_values parameter to write_dataset() and modified _ensure_write_partitioning() to create HivePartitioning with appropriate encoding - R: Added url_encode_hive_values parameter to write_dataset(), write_csv_dataset(), write_tsv_dataset(), and write_delim_dataset() - Tests: Added comprehensive test coverage across all three languages The parameter defaults to true/TRUE to maintain backward compatibility. When set to false/FALSE, partition values are used as-is in directory names, enabling clean, human-readable partition directories for local filesystems. Closes apache#41618
Changed test data from using forward slash (/) to plus (+) in partition values, as forward slashes cannot be used in directory names on Unix/macOS filesystems. The tests now use characters that are valid in filenames but still demonstrate the URL encoding functionality: - Space (encoded as %20) - Plus + (encoded as %2B) - Ampersand & (encoded as %26) - Percent % (encoded as %25) This ensures tests pass on all platforms while still validating that url_encode_hive_values parameter correctly controls URL encoding behavior.
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @CytoShahar!
Just a heads up - looking at this PR, I'm fairly confident it's AI generated in parts or as a whole. While there are no rules against doing this - I use genAI myself to help with my work - there are a few things I'd like to mention here.
-
The PR body contains a lot of redundant information. Please take a look at other similar PRs and revise it to be more in-line with those.
-
I can't speak for other maintainers, but I personally find that I'm much more motivated to review a PR where there is obvious evidence that the contributor has engaged with the AI-generated content, understands all of the code, and has updated it where necessary. Otherwise, it feels like I have more work to do compared to a human-generated PR, and these things end up slipping to the bottom of my to-do list.
I have learned some of the above through my own learning about using AI in contributions, and through making mistakes or sensing hesitance from others reviewing my work, and I think we're at a tricky point now where there is a lot of potential for getting more things done more easily, but we haven't quite figured out how to reap the benefits without added friction for all involved.
I'm happy to help if you have any questions though.
| partitioning = dplyr::group_vars(dataset), | ||
| basename_template = paste0("part-{i}.", as.character(format)), | ||
| hive_style = TRUE, | ||
| url_encode_hive_values = TRUE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for inserting this parameter here in the function signature?
|
@thisisnic I tracked down the actual issue and will push a proper fix shortly. Appreciate the clear guidance and the help. 🙏 |
Awesome, cheers! I don't suppose you'd mind also getting tests passing locally before resubmitting? As this is your first contribution to this repo, a maintainer has to manually approve every CI run, so passing locally first will make it a lot smoother. Ta :) |
Rationale for this change
Arrow currently always URL-encodes Hive partition values when writing datasets (e.g., spaces become
%20, slashes become%2F). This behavior:As reported in #41618, users working with local filesystems need human-readable directory names (e.g.,
category=Product Ainstead ofcategory=Product%20A) while maintaining compatibility with existing Arrow workflows.What changes are included in this PR?
Added a new optional boolean parameter
url_encode_hive_values(defaulttrue) to control URL encoding behavior in Hive-style partitioning across all three language bindings:C++ Core (
cpp/src/arrow/dataset/partition.cc):HivePartitioning::FormatValues()to conditionally applyUriEscape()based onsegment_encoding()SegmentEncoding::Noneis set, partition values are used as-isSegmentEncoding::Uriis set (default), maintains existing URL encoding behaviorR API (
r/R/dataset-write.R):url_encode_hive_values = TRUEparameter towrite_dataset(),write_csv_dataset(),write_tsv_dataset(),write_delim_dataset()segment_encodingparameter when creatingHivePartitioningobjectsTRUEto maintain backward compatibilityPython API (
python/pyarrow/dataset.py):url_encode_hive_values = Trueparameter towrite_dataset()_ensure_write_partitioning()to handle the parameter for all partitioning input typesHivePartitioningobjects with appropriatesegment_encodingTrueto maintain backward compatibilityThe implementation leverages the existing
segment_encodingparameter inHivePartitioning, requiring no changes to core C++ data structures.Are these changes tested?
Yes, comprehensive test coverage across all three languages:
C++ Tests (
cpp/src/arrow/dataset/partition_test.cc):WriteHiveWithSlashesInValuesDisableUrlEncodingtestSegmentEncoding::Noneis setR Tests (
r/tests/testthat/test-dataset-write.R):Python Tests (
python/pyarrow/tests/test_dataset.py):test_hive_partitioning_url_encoding()testHivePartitioningobjects and string partition specsAre there any user-facing changes?
Yes, but fully backward compatible:
New Parameter: Users can now optionally disable URL encoding in Hive-style partitioning:
R Example:
Python Example:
Backward Compatibility:
url_encode_hive_valuesdefaults totrue, maintaining existing URL encodingCloses #41618