Skip to content

fix(grpc): percent-decode target URI path and reject non-UTF-8 characters#2677

Open
arjan-bal wants to merge 9 commits into
grpc:masterfrom
arjan-bal:non-utf-strings
Open

fix(grpc): percent-decode target URI path and reject non-UTF-8 characters#2677
arjan-bal wants to merge 9 commits into
grpc:masterfrom
arjan-bal:non-utf-strings

Conversation

@arjan-bal

@arjan-bal arjan-bal commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

This PR ensures that non-UTF-8 strings in the target URI path are rejected.

Key Changes:

  • Target::from_str: Updated to fail for targets with non-UTF-8 symbols in the percent-decoded path.
  • Target::path(): Modified to return the percent-decoded path.

@arjan-bal arjan-bal requested a review from dfawley June 5, 2026 17:44
@arjan-bal arjan-bal changed the title fix(grpc): Fix handling of non-UTF-8 path strings in target URI fix(grpc): Fix handling of non-UTF-8 strings in target URI path Jun 5, 2026
@arjan-bal arjan-bal requested a review from ejona86 June 8, 2026 05:10
@ejona86

ejona86 commented Jun 8, 2026

Copy link
Copy Markdown
Member

This seems weird to me. We should be assuming UTF-8-encoded URIs in general, so I could understand adding a way to get the un-decoded path or a path-as-bytes. And then only this code would do something different.

But I'd expect we'd decode URIs as UTF-8 in general, as the encoding of the URI can be different than the encoding of the on-disk path. For example, if the system is using the Latin-1 codepage (LANG=en_US.ISO-8859-1, not the now-typical LANG=en_US.UTF-8), then we'd decode the URI as UTF-8 and then (someone would) convert that to Latin-1 when opening the socket. Historically that was libc that would do that conversion. I see Java do that conversion. It looks like Go assumes the Linux system uses utf-8, which is actually fair these days. Go probably converts the UTF-8 to UTF-16 for Windows.

RFC 8089 encourages UTF-8 as we do represent them as strings in every other language to my knowledge. https://encoding.spec.whatwg.org/#encoding assumes Unicode, and we'd definitely be using UTF-8.

So, yeah... While it is true that the files don't even need to be any encoding at all (you can store raw binary, except for \0 and /), and rust allows you to do that, I don't think we want to deal with that in general.

@arjan-bal

Copy link
Copy Markdown
Contributor Author

@ejona86, to clarify, is the recommendation that gRPC should validate that the target URI contains only valid UTF-8 in the percent-decoded path?

@ejona86 ejona86 closed this Jun 8, 2026
@ejona86 ejona86 reopened this Jun 8, 2026
@ejona86

ejona86 commented Jun 8, 2026

Copy link
Copy Markdown
Member

Yeah, we should validate it. But either PercentDecode.decode_utf8() or decode_utf8_lossy() might work. Java uses a replacement character.. Go fails URL parsing doesn't validate. Dunno what C++ does.

Edit: C++ doesn't validate. But C++ doesn't care if you store raw binary in strings, and Go is tolerant to invalid UTF-8 in strings. Java doesn't allow that, and neither does Rust to my knowledge.

@arjan-bal arjan-bal assigned arjan-bal and unassigned ejona86 and dfawley Jun 9, 2026
@arjan-bal arjan-bal changed the title fix(grpc): Fix handling of non-UTF-8 strings in target URI path fix(grpc): percent-decode target URI path and reject non-UTF-8 characters Jun 9, 2026
@arjan-bal

arjan-bal commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

I've reworked the approach to use PercentDecode.decode_utf8() during target URL parsing, which will now fail if it encounters non-UTF-8 characters. I prefer this to using replacement characters because it prevents unexpected behaviour and fails fast to ensures users are immediately aware of the invalid target so they can fix it.

@arjan-bal arjan-bal assigned ejona86 and unassigned arjan-bal Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants