@CytoShahar CytoShahar commented Nov 9, 2025

Rationale for this change

Arrow currently always URL-encodes Hive partition values when writing datasets (e.g., spaces become %20, slashes become %2F). This behavior:

  • Cannot be disabled, even for local filesystems where special characters are valid
  • Creates incompatibility with non-Arrow tools expecting unencoded directory names
  • Makes partition directories difficult to read and debug
  • Causes issues when URIs are already encoded by service providers

As reported in #41618, users working with local filesystems need human-readable directory names (e.g., category=Product A instead of category=Product%20A) while maintaining compatibility with existing Arrow workflows.

What changes are included in this PR?

Added a new optional boolean parameter url_encode_hive_values (default true) to control URL encoding in Hive-style partitioning. The change spans the C++ core and the R and Python bindings:

C++ Core (cpp/src/arrow/dataset/partition.cc):

  • Modified HivePartitioning::FormatValues() to conditionally apply UriEscape() based on segment_encoding()
  • When SegmentEncoding::None is set, partition values are used as-is
  • When SegmentEncoding::Uri is set (default), maintains existing URL encoding behavior

R API (r/R/dataset-write.R):

  • Added url_encode_hive_values = TRUE parameter to write_dataset(), write_csv_dataset(), write_tsv_dataset(), write_delim_dataset()
  • Sets segment_encoding parameter when creating HivePartitioning objects
  • Defaults to TRUE to maintain backward compatibility

Python API (python/pyarrow/dataset.py):

  • Added url_encode_hive_values = True parameter to write_dataset()
  • Modified _ensure_write_partitioning() to handle the parameter for all partitioning input types
  • Creates HivePartitioning objects with appropriate segment_encoding
  • Defaults to True to maintain backward compatibility

The implementation leverages the existing segment_encoding parameter in HivePartitioning, requiring no changes to core C++ data structures.
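
For readers who want the toggle spelled out, here is a minimal Python sketch of the conditional encoding described above. It is not Arrow's code: format_hive_segment is a hypothetical helper, and urllib.parse.quote stands in for the C++ UriEscape(), whose exact escaping rules may differ.

```python
from urllib.parse import quote


def format_hive_segment(key: str, value: str, url_encode: bool = True) -> str:
    """Build one `key=value` directory name for a Hive-style partition.

    Hypothetical helper for illustration only; not part of Arrow.
    """
    # safe="" makes quote() escape every reserved character, including "/".
    encoded = quote(value, safe="") if url_encode else value
    return f"{key}={encoded}"


print(format_hive_segment("category", "Product A"))
# category=Product%20A
print(format_hive_segment("category", "Product A", url_encode=False))
# category=Product A
```

The flag only changes how the value is rendered into the path segment; the key and the `=` separator are the same in both modes.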

Are these changes tested?

Yes, tests were added in all three languages:

C++ Tests (cpp/src/arrow/dataset/partition_test.cc):

  • Added WriteHiveWithSlashesInValuesDisableUrlEncoding test
  • Verifies that partition values with spaces, slashes, ampersands, and percent signs are written without URL encoding when SegmentEncoding::None is set
  • All existing partition tests continue to pass, ensuring backward compatibility

R Tests (r/tests/testthat/test-dataset-write.R):

  • Added comprehensive test covering special characters: space, slash, percent, plus, ampersand, equals, question mark
  • Validates directory names are correctly encoded/unencoded based on parameter value
  • Verifies data integrity is maintained across both encoding modes
  • Tests with CSV, TSV, and Parquet formats

Python Tests (python/pyarrow/tests/test_dataset.py):

  • Added test_hive_partitioning_url_encoding() test
  • Tests both URL encoding enabled (default) and disabled (new feature)
  • Tests with explicitly created HivePartitioning objects and string partition specs
  • Validates directory names and data integrity

Are there any user-facing changes?

Yes, but fully backward compatible:

New Parameter: Users can now optionally disable URL encoding in Hive-style partitioning:

R Example:

# Default behavior (URL encoding enabled) - UNCHANGED
write_dataset(data, "path", partitioning = "category",
              hive_style = TRUE)  # url_encode_hive_values defaults to TRUE
# Creates: category=Product%20A/

# New behavior (URL encoding disabled)
write_dataset(data, "path", partitioning = "category",
              hive_style = TRUE, url_encode_hive_values = FALSE)
# Creates: category=Product A/

Python Example:

# Default behavior (URL encoding enabled) - UNCHANGED
ds.write_dataset(table, "path", partitioning=["category"],
                 partitioning_flavor="hive")  # url_encode_hive_values defaults to True
# Creates: category=Product%20A/

# New behavior (URL encoding disabled)
ds.write_dataset(table, "path", partitioning=["category"],
                 partitioning_flavor="hive", url_encode_hive_values=False)
# Creates: category=Product A/

Backward Compatibility:

  • Default behavior unchanged: url_encode_hive_values defaults to true, maintaining existing URL encoding
  • All existing code continues to work without modification
  • Only affects Hive-style partitioning, not directory partitioning
  • Reading datasets works with both encoded and unencoded partition values
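
The read side can be sketched the same way. This is an assumption-laden illustration, not Arrow's reader: parse_hive_segment is a hypothetical helper and urllib.parse.unquote stands in for whatever decoding Arrow applies.

```python
from typing import Tuple
from urllib.parse import unquote


def parse_hive_segment(segment: str, url_decode: bool = True) -> Tuple[str, str]:
    """Split a `key=value` directory name and optionally URL-decode the value.

    Hypothetical helper for illustration only; not part of Arrow.
    """
    # Split on the first "=" only, so values may themselves contain "=".
    key, _, raw = segment.partition("=")
    value = unquote(raw) if url_decode else raw
    return key, value


print(parse_hive_segment("category=Product%20A"))
# ('category', 'Product A')
print(parse_hive_segment("category=Product A"))
# ('category', 'Product A')
```

One caveat of this sketch: a value written unencoded that happens to contain a percent sequence (say, a literal 100%25) would decode to 100%, so a reader generally needs to know which mode the writer used.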

Closes #41618

… to control Hive partition URL encoding

This commit adds a new optional boolean parameter `url_encode_hive_values`
to control whether Hive partition values are URL-encoded when writing datasets.

Changes:
- C++: Modified HivePartitioning::FormatValues to conditionally apply URL encoding
  based on segment_encoding (SegmentEncoding::Uri vs SegmentEncoding::None)
- Python: Added url_encode_hive_values parameter to write_dataset() and modified
  _ensure_write_partitioning() to create HivePartitioning with appropriate encoding
- R: Added url_encode_hive_values parameter to write_dataset(), write_csv_dataset(),
  write_tsv_dataset(), and write_delim_dataset()
- Tests: Added comprehensive test coverage across all three languages

The parameter defaults to true/TRUE to maintain backward compatibility.
When set to false/FALSE, partition values are used as-is in directory names,
enabling clean, human-readable partition directories for local filesystems.

Closes apache#41618

Changed test data from using forward slash (/) to plus (+) in partition
values, as forward slashes cannot be used in directory names on Unix/macOS
filesystems. The tests now use characters that are valid in filenames but
still demonstrate the URL encoding functionality:
- Space (encoded as %20)
- Plus + (encoded as %2B)
- Ampersand & (encoded as %26)
- Percent % (encoded as %25)

This ensures tests pass on all platforms while still validating that
the url_encode_hive_values parameter correctly controls URL encoding behavior.
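
The encodings listed above can be sanity-checked with Python's standard library. This uses urllib.parse.quote as a stand-in, and percent_encode_all is a hypothetical name; Arrow's own escaping may treat other characters differently.

```python
from urllib.parse import quote


def percent_encode_all(text: str) -> str:
    """Percent-encode every character outside quote()'s always-safe set.

    Hypothetical helper for illustration only; not part of Arrow.
    """
    return quote(text, safe="")


# The four characters used in the updated tests.
for char in [" ", "+", "&", "%"]:
    print(repr(char), "->", percent_encode_all(char))
# ' ' -> %20
# '+' -> %2B
# '&' -> %26
# '%' -> %25
```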

github-actions bot commented Nov 9, 2025

⚠️ GitHub issue #41618 has been automatically assigned in GitHub to PR creator.

@thisisnic thisisnic (Member) left a comment

Thanks for the PR @CytoShahar!

Just a heads up - looking at this PR, I'm fairly confident it's AI generated in parts or as a whole. While there are no rules against doing this - I use genAI myself to help with my work - there are a few things I'd like to mention here.

  • The PR body contains a lot of redundant information. Please take a look at other similar PRs and revise it to be more in line with those.

  • I can't speak for other maintainers, but I personally find that I'm much more motivated to review a PR where there is obvious evidence that the contributor has engaged with the AI-generated content, understands all of the code, and has updated it where necessary. Otherwise, it feels like I have more work to do compared to a human-generated PR, and these things end up slipping to the bottom of my to-do list.

I learned some of the above from my own experience using AI in contributions, from making mistakes, and from sensing hesitance in others reviewing my work. I think we're at a tricky point now: there is a lot of potential for getting more done more easily, but we haven't quite figured out how to reap the benefits without adding friction for everyone involved.

I'm happy to help if you have any questions though.

partitioning = dplyr::group_vars(dataset),
basename_template = paste0("part-{i}.", as.character(format)),
hive_style = TRUE,
url_encode_hive_values = TRUE,

What's the reason for inserting this parameter here in the function signature?

@CytoShahar CytoShahar requested a review from thisisnic November 14, 2025 16:29
@CytoShahar (Author)

@thisisnic
Thanks for the thoughtful feedback and for the patience. You were absolutely right — I leaned too much on AI and didn’t think it through enough before submitting.

I tracked down the actual issue and will push a proper fix shortly. Appreciate the clear guidance and the help. 🙏

@thisisnic (Member)

> I tracked down the actual issue and will push a proper fix shortly. Appreciate the clear guidance and the help. 🙏

Awesome, cheers! I don't suppose you'd mind also getting tests passing locally before resubmitting? As this is your first contribution to this repo, a maintainer has to manually approve every CI run, so passing locally first will make it a lot smoother. Ta :)

Development

Successfully merging this pull request may close these issues.

[R] Unable to disable url-encoding