GH-41618: [C++][R][PYTHON]: Add ability to control url encoding behavior in hive partitioning #48086

CytoShahar · 2025-11-09T03:49:01Z

Rationale for this change

Arrow currently always URL-encodes Hive partition values when writing datasets (e.g., spaces become %20, slashes become %2F). This behavior:

Cannot be disabled, even for local filesystems where special characters are valid
Creates incompatibility with non-Arrow tools expecting unencoded directory names
Makes partition directories difficult to read and debug
Causes issues when URIs are already encoded by service providers

As reported in #41618, users working with local filesystems need human-readable directory names (e.g., category=Product A instead of category=Product%20A) while maintaining compatibility with existing Arrow workflows.

What changes are included in this PR?

Added a new optional boolean parameter url_encode_hive_values (default true) to control URL encoding behavior in Hive-style partitioning across all three language bindings:

C++ Core (cpp/src/arrow/dataset/partition.cc):

Modified HivePartitioning::FormatValues() to conditionally apply UriEscape() based on segment_encoding()
When SegmentEncoding::None is set, partition values are used as-is
When SegmentEncoding::Uri is set (default), maintains existing URL encoding behavior
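
For readers who don't want to open the C++ diff, the branching described above can be sketched in a few lines of Python. This is only an illustration, not Arrow's code: `SegmentEncoding` here is a hypothetical Python mirror of the C++ enum, `format_hive_segment` is a made-up helper, and `urllib.parse.quote` stands in for `UriEscape()` on the assumption that both percent-encode everything except unreserved characters.

```python
from enum import Enum
from urllib.parse import quote


class SegmentEncoding(Enum):
    """Hypothetical Python mirror of the C++ SegmentEncoding options."""
    NONE = "none"
    URI = "uri"


def format_hive_segment(key: str, value: str,
                        encoding: SegmentEncoding = SegmentEncoding.URI) -> str:
    """Return a Hive-style 'key=value' directory name (illustrative helper)."""
    if encoding is SegmentEncoding.URI:
        # safe="" makes quote() percent-encode '/' as well as spaces etc.
        value = quote(value, safe="")
    # With SegmentEncoding.NONE the value is used as-is.
    return f"{key}={value}"


print(format_hive_segment("category", "Product A"))
# category=Product%20A
print(format_hive_segment("category", "Product A", SegmentEncoding.NONE))
# category=Product A
```

Keeping the default at `URI` in the sketch matches the PR's stated goal of leaving existing behavior unchanged unless the caller opts out.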

R API (r/R/dataset-write.R):

Added url_encode_hive_values = TRUE parameter to write_dataset(), write_csv_dataset(), write_tsv_dataset(), write_delim_dataset()
Sets segment_encoding parameter when creating HivePartitioning objects
Defaults to TRUE to maintain backward compatibility

Python API (python/pyarrow/dataset.py):

Added url_encode_hive_values = True parameter to write_dataset()
Modified _ensure_write_partitioning() to handle the parameter for all partitioning input types
Creates HivePartitioning objects with appropriate segment_encoding
Defaults to True to maintain backward compatibility

The implementation leverages the existing segment_encoding parameter in HivePartitioning, requiring no changes to core C++ data structures.

Are these changes tested?

Yes, comprehensive test coverage across all three languages:

C++ Tests (cpp/src/arrow/dataset/partition_test.cc):

Added WriteHiveWithSlashesInValuesDisableUrlEncoding test
Verifies that partition values with spaces, slashes, ampersands, and percent signs are written without URL encoding when SegmentEncoding::None is set
All existing partition tests continue to pass, ensuring backward compatibility

R Tests (r/tests/testthat/test-dataset-write.R):

Added comprehensive test covering special characters: space, slash, percent, plus, ampersand, equals, question mark
Validates directory names are correctly encoded/unencoded based on parameter value
Verifies data integrity is maintained across both encoding modes
Tests with CSV, TSV, and Parquet formats

Python Tests (python/pyarrow/tests/test_dataset.py):

Added test_hive_partitioning_url_encoding() test
Tests both URL encoding enabled (default) and disabled (new feature)
Tests with explicitly created HivePartitioning objects and string partition specs
Validates directory names and data integrity

Are there any user-facing changes?

Yes, but fully backward compatible:

New Parameter: Users can now optionally disable URL encoding in Hive-style partitioning:

R Example:

library(arrow)

# Default behavior (URL encoding enabled) - UNCHANGED
write_dataset(data, "path", partitioning = "category",
              hive_style = TRUE)  # url_encode_hive_values defaults to TRUE
# Creates: category=Product%20A/

# New behavior (URL encoding disabled)
write_dataset(data, "path", partitioning = "category",
              hive_style = TRUE, url_encode_hive_values = FALSE)
# Creates: category=Product A/

Python Example:

import pyarrow.dataset as ds

# Default behavior (URL encoding enabled) - UNCHANGED
ds.write_dataset(table, "path", partitioning=["category"],
                 partitioning_flavor="hive")  # url_encode_hive_values defaults to True
# Creates: category=Product%20A/

# New behavior (URL encoding disabled)
ds.write_dataset(table, "path", partitioning=["category"],
                 partitioning_flavor="hive", url_encode_hive_values=False)
# Creates: category=Product A/

Backward Compatibility:

Default behavior unchanged: url_encode_hive_values defaults to true, maintaining existing URL encoding
All existing code continues to work without modification
Only affects Hive-style partitioning, not directory partitioning
Reading datasets works with both encoded and unencoded partition values
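
The read-side claim can be reasoned about with a small model. The sketch below is an assumption about how a URI-decoding reader treats directory names, not Arrow's actual reader: `parse_hive_segment` is a hypothetical helper and `urllib.parse.unquote` stands in for the decoding step. In this model, decoding is a no-op for unencoded names that contain no percent sequences, but a raw value that happens to contain something like `%2B` would be altered, so such values may need decoding turned off on read as well.

```python
from urllib.parse import unquote


def parse_hive_segment(segment: str, uri_decode: bool = True) -> tuple[str, str]:
    """Split a 'key=value' directory name and optionally URI-decode the value.

    Illustrative helper only; it models a reader, it is not Arrow's reader.
    """
    # Split on the first '=' so values containing '=' stay intact.
    key, _, value = segment.partition("=")
    return key, unquote(value) if uri_decode else value


# Encoded and unencoded directory names decode to the same value.
print(parse_hive_segment("category=Product%20A"))  # ('category', 'Product A')
print(parse_hive_segment("category=Product A"))    # ('category', 'Product A')

# A raw value containing a literal percent sequence is changed by decoding.
print(parse_hive_segment("code=a%2Bb"))                     # ('code', 'a+b')
print(parse_hive_segment("code=a%2Bb", uri_decode=False))   # ('code', 'a%2Bb')
```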

Closes #41618

GitHub Issue: [R] Unable to disable url-encoding #41618

… to control Hive partition URL encoding

This commit adds a new optional boolean parameter `url_encode_hive_values` to control whether Hive partition values are URL-encoded when writing datasets.

Changes:
- C++: Modified HivePartitioning::FormatValues to conditionally apply URL encoding based on segment_encoding (SegmentEncoding::Uri vs SegmentEncoding::None)
- Python: Added url_encode_hive_values parameter to write_dataset() and modified _ensure_write_partitioning() to create HivePartitioning with appropriate encoding
- R: Added url_encode_hive_values parameter to write_dataset(), write_csv_dataset(), write_tsv_dataset(), and write_delim_dataset()
- Tests: Added comprehensive test coverage across all three languages

The parameter defaults to true/TRUE to maintain backward compatibility. When set to false/FALSE, partition values are used as-is in directory names, enabling clean, human-readable partition directories for local filesystems.

Closes apache#41618

Changed test data from using forward slash (/) to plus (+) in partition values, as forward slashes cannot be used in directory names on Unix/macOS filesystems. The tests now use characters that are valid in filenames but still demonstrate the URL encoding functionality:
- Space (encoded as %20)
- Plus + (encoded as %2B)
- Ampersand & (encoded as %26)
- Percent % (encoded as %25)

This ensures tests pass on all platforms while still validating that url_encode_hive_values parameter correctly controls URL encoding behavior.
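
The character-to-escape mapping listed in this commit message can be checked with Python's standard library. This is a sanity check under the assumption that `urllib.parse.quote` with `safe=""` produces the same escapes as Arrow's encoder for these characters; it does not exercise Arrow itself, and the `encoded` dictionary is just an illustrative name.

```python
from urllib.parse import quote

# Percent-encode each test character on its own; safe="" also escapes '/'.
encoded = {ch: quote(ch, safe="") for ch in " +&%/"}

for ch, esc in encoded.items():
    print(repr(ch), "->", esc)
# ' ' -> %20
# '+' -> %2B
# '&' -> %26
# '%' -> %25
# '/' -> %2F
```

The slash is included only to show why it was dropped from the test data: unencoded, it would be interpreted as a path separator rather than as part of the directory name.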

github-actions · 2025-11-09T03:49:24Z

⚠️ GitHub issue #41618 has been automatically assigned in GitHub to PR creator.

thisisnic

Thanks for the PR @CytoShahar!

Just a heads up - looking at this PR, I'm fairly confident it's AI generated in parts or as a whole. While there are no rules against doing this - I use genAI myself to help with my work - there are a few things I'd like to mention here.

The PR body contains a lot of redundant information. Please take a look at other similar PRs and revise it to be more in-line with those.
I can't speak for other maintainers, but I personally find that I'm much more motivated to review a PR where there is obvious evidence that the contributor has engaged with the AI-generated content, understands all of the code, and has updated it where necessary. Otherwise, it feels like I have more work to do compared to a human-generated PR, and these things end up slipping to the bottom of my to-do list.

I have learned some of the above through my own learning about using AI in contributions, and through making mistakes or sensing hesitance from others reviewing my work, and I think we're at a tricky point now where there is a lot of potential for getting more things done more easily, but we haven't quite figured out how to reap the benefits without added friction for all involved.

I'm happy to help if you have any questions though.

thisisnic · 2025-11-14T13:24:13Z

r/R/dataset-write.R

  partitioning = dplyr::group_vars(dataset),
  basename_template = paste0("part-{i}.", as.character(format)),
  hive_style = TRUE,
+  url_encode_hive_values = TRUE,


What's the reason for inserting this parameter here in the function signature?

r/tests/testthat/test-dataset-write.R

CytoShahar · 2025-11-14T17:08:03Z

@thisisnic
Thanks for the thoughtful feedback and for the patience. You were absolutely right — I leaned too much on AI and didn’t think it through enough before submitting.

I tracked down the actual issue and will push a proper fix shortly. Appreciate the clear guidance and the help. 🙏

thisisnic · 2025-11-14T17:35:38Z

> I tracked down the actual issue and will push a proper fix shortly. Appreciate the clear guidance and the help. 🙏

Awesome, cheers! I don't suppose you'd mind also getting tests passing locally before resubmitting? As this is your first contribution to this repo, a maintainer has to manually approve every CI run, so passing locally first will make it a lot smoother. Ta :)

CytoShahar added 2 commits November 8, 2025 21:40

CytoShahar requested review from AlenkaF, jonkeane, raulcd, rok and thisisnic as code owners November 9, 2025 03:49

github-actions bot added the Component: R, Component: C++, Component: Python, and awaiting review labels Nov 9, 2025

Merge branch 'main' into feature/url-encode-hive-values-clean

8f4c6b6

thisisnic requested changes Nov 14, 2025


CytoShahar requested a review from thisisnic November 14, 2025 16:29
