
RFC: ID_Compat_Math characters allowed in identifiers #3840


Open: wants to merge 8 commits into base: master


@Danvil Danvil commented Jul 16, 2025

This RFC extends the set of Unicode characters that can be used in identifiers with ID_Compat_Math_Start and ID_Compat_Math_Continue, most notably ∇, ∂, ∞, superscripts ⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ and subscripts ₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎.

This can be a boon to implementers of scientific concepts, as they can write, for example, `let ∇E₁₂ = 0.5;`.
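
As a rough illustration of what the two sets would admit, here is a minimal sketch that runs on today's stable Rust. The function names are hypothetical, the code point lists are hard-coded from the characters named in this thread and should be checked against the UAX 31 data files, and ASCII letters, digits, and `_` stand in for the real XID_Start/XID_Continue sets (keywords and a lone `_` are not handled). It is not how rustc's lexer is implemented.

```rust
/// Sketch of ID_Compat_Math_Start: ∂, ∇, ∞ plus the styled variants of ∂ and ∇.
fn is_id_compat_math_start(c: char) -> bool {
    matches!(
        c,
        '\u{2202}' | '\u{2207}' | '\u{221E}'
            | '\u{1D6C1}' | '\u{1D6DB}' | '\u{1D6FB}' | '\u{1D715}' | '\u{1D735}'
            | '\u{1D74F}' | '\u{1D76F}' | '\u{1D789}' | '\u{1D7A9}' | '\u{1D7C3}'
    )
}

/// Sketch of ID_Compat_Math_Continue: the start set plus superscript and
/// subscript digits, signs, and parentheses.
fn is_id_compat_math_continue(c: char) -> bool {
    is_id_compat_math_start(c)
        || matches!(
            c,
            '\u{00B2}' | '\u{00B3}' | '\u{00B9}' | '\u{2070}'
                | '\u{2074}'..='\u{207E}'
                | '\u{2080}'..='\u{208E}'
        )
}

/// Simplified identifier check (assumption: ASCII replaces the XID sets).
fn is_sketch_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(first)
            if first.is_ascii_alphabetic() || first == '_' || is_id_compat_math_start(first) =>
        {
            chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || is_id_compat_math_continue(c))
        }
        _ => false,
    }
}

fn main() {
    // The candidates are plain strings, so this compiles without the RFC.
    for candidate in ["∇E₁₂", "a₁", "x²", "∞", "₁a", "a+b"] {
        println!("{candidate:?}: {}", is_sketch_identifier(candidate));
    }
}
```

Under these assumptions a subscript may continue an identifier but not start one, which is why `₁a` is rejected while `a₁` is accepted.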

Rendered

@clarfonthey

While I mostly sympathise with this and think that it's probably fine to do this, I think that an RFC suggesting this should at minimum:

  1. Reference the actual section of UAX 31 that defines these groups of characters: https://www.unicode.org/reports/tr31/#Standard_Profiles
  2. Reference the section of UTS 55 linked in the above section that explains why you might not want to use these groups of characters, which currently cites Rust as an existing example: https://www.unicode.org/reports/tr55/#General-Security-Profile
  3. Reference the section of UTS 39 linked in the above section that explains the exact mechanisms by which the above can be made safe: https://www.unicode.org/reports/tr39/#General_Security_Profile

Note that your reference to NFKC is technically correct: Not_NFKC is one of the restricted security profile cases that is covered by UTS 39, but it's not the only one, and it's worth discussing whether Rust's handling would need to be expanded because of this case.

FWIW, I very much sympathise with both the desire to have more scientific characters in variables and the desire to hand-wave away the issues as being already solved. It's also harder than ever before to do proper research online due to the shift of focus toward crystal-ball-based decisionmaking. I mostly want to clarify where you can find the relevant Unicode resources discussing this issue, and I think that the RFC should be updated to directly reference them so that we don't try and reinvent the wheel and redo all their hard work.

Also, I think it's pretty great that Rust is explicitly mentioned in the Unicode standard as someone who does this right! I didn't know this was the case until now.

* Added links to UAX31 and others as requested in CR
* Fixed typos as requested in CR
* Extended the drawbacks section
* Other improvements
@Danvil
Author

Danvil commented Jul 16, 2025

@clarfonthey Thanks for the review! I made the requested changes and added more links to the Unicode resources and expanded some sections.
@programmerjake Thanks for the review - typos are fixed.

@ehuss added the T-lang label (relevant to the language team, which will review and decide on the RFC) on Jul 16, 2025

* Rust might want to decide in the future to give certain superscripts and subscripts syntactic meaning. For example they might want to interpret `a²` as `a * a` or `a₁` as `a[0]`. The latter sounds especially unlikely though due to the general disagreement of 0-based vs 1-based indexing.

# Rationale and alternatives
Member

@Noratrieb Noratrieb Jul 17, 2025

Rust currently just follows Unicode's recommendation on what should be allowed as a programming language identifier: https://rust-lang.github.io/rfcs/2457-non-ascii-idents.html (Annex 31).

This seems like a reasonable choice, letting the Unicode Consortium handle Unicode decisions, so while I can certainly see the motivation you presented, I am cautious about this change.

It would be very good to have a description here of why Annex 31 does not contain these symbols, if such discussion can be found anywhere, to ensure that we are not missing something important and are sure about our choice to deviate from the recommendation.

Member

> It would be very good to have a description here of why Annex 31 does not contain these symbols

UAX 31 does contain these symbols, that's what this profile comes from: https://www.unicode.org/reports/tr31/#Mathematical_Compatibility_Notation_Profile

For the question "why are they not in the default profile", the answer is basically to leave room for languages that want to do custom operators, or use these as builtin operators.

It's also just caution in expanding the set to include new meanings: while the XID set expands with each Unicode release as new characters get added, it would not be good for new types of characters to get included. If a programming language cared only about linguistic content in identifiers, it would perhaps be surprised if mathematical subscripts entered the fray. This separate profile allows for explicit choice.

> This seems like a reasonable choice, letting the Unicode Consortium handle Unicode decisions, so while I can certainly see the motivation you presented, I am cautious about this change.

The mathematical profile is included in UAX 31, the identifiers standard: that is the Unicode Consortium making a Unicode decision that these are acceptable in identifiers. It's a choice from a menu that programming languages may choose from. Rust is currently following Unicode's recommendation, and with this RFC Rust would continue to follow Unicode's recommendation.

Author

Changed formulation related to UAX31 a bit.

@Noratrieb
Member

cc @Manishearth as our Unicode person

Member

@Manishearth Manishearth left a comment

Overall seems fine to me. I didn't include this in the original RFC since IIRC the mathematical profile was still being worked on, and I didn't wish to have this facet be another thing that needed to be discussed.


# Drawbacks
[drawbacks]: #drawbacks

* Characters like `𝛁𝛛𝛻𝜕𝜵𝝏𝝯𝞉𝞩𝟃` are easily confusable with their base versions `∂∇` and can lead to subtle bugs. However, the precedent in Rust seems to be to add them alongside their base version but trigger the NFKC warning.
Member

... also the confusables warning if they get mixed. We have redundant protections here.

Author

Ack
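
To make the confusability concern concrete, here is a small sketch of how two identifiers that differ only in a styled ∇ or ∂ can be detected as colliding. The helper names are hypothetical and the mapping is hand-written for the ten styled characters listed in the drawback above; it only mimics the idea behind the NFKC and confusables checks and is not rustc's lint implementation.

```rust
/// Map a styled nabla or partial-differential sign to its base symbol.
/// All other characters are returned unchanged.
fn base_symbol(c: char) -> char {
    match c {
        // Bold, italic, bold italic, sans-serif bold, sans-serif bold italic nabla.
        '\u{1D6C1}' | '\u{1D6FB}' | '\u{1D735}' | '\u{1D76F}' | '\u{1D7A9}' => '\u{2207}',
        // The same five styles of the partial differential sign.
        '\u{1D6DB}' | '\u{1D715}' | '\u{1D74F}' | '\u{1D789}' | '\u{1D7C3}' => '\u{2202}',
        other => other,
    }
}

/// Replace every styled symbol in an identifier with its base symbol.
fn skeleton(ident: &str) -> String {
    ident.chars().map(base_symbol).collect()
}

/// Two distinct identifiers collide if they only differ in symbol styling.
fn collides(a: &str, b: &str) -> bool {
    a != b && skeleton(a) == skeleton(b)
}

fn main() {
    let bold = "\u{1D6C1}E"; // bold nabla followed by E
    let plain = "\u{2207}E"; // plain nabla followed by E
    println!("{bold} vs {plain}: collides = {}", collides(bold, plain));
}
```

A tool built along these lines would flag the bold and plain spellings as the same name, which is the situation the warnings are meant to surface.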


The characters from 5) are added to the set of Rust identifiers, but will trigger an NFKC warning when used:
```
warning: identifier contains a non normalized (NFKC) character: '𝛁'
```

Member

Probably mention that this will be `uncommon_codepoints`.

If more characters are added to this set, they may not always be non-NFKC, but they will very likely still trigger `uncommon_codepoints`.

Basically we should make it clear that using these characters will very likely trigger lints, even if more get added to the set.

Author

Done

* Clarified choice between syntactic and identifier use
* Added link to a similar C++ proposal
* Expanded the alternatives section discussing how characters
  could be given syntactic meaning instead
@Danvil
Author

Danvil commented Jul 17, 2025

@Manishearth @kennytm @Noratrieb thank you for the review! I added your suggestions and comments to the draft.

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

If this RFC is not implemented then everyone has to keep using ASCII characters for identifiers in scientific code, for example `gradient_energy` or `a_12`.

The impact of not implementing it should be fairly small, but implementing it could invite more scientifically oriented people to the Rust language and make it easier for them to implement complex concepts.

Alternatively Rust could decide to give the proposed characters syntactic meaning.

Superscript characters could be interpreted as potentiation, for example `let a = 2; let b = a²;` could be a synonym to `let a = 2; let b = a * a;`.
Member

I think a better word is exponentiation rather than potentiation (I've never heard of the latter in the context of mathematics).

Author

Done


`∞` could be a synonym or replacement for `f32::INFINITY`, however there is no precedent for using non-ASCII characters in `core`/`std` and this would likely meet considerable opposition.

Derivatives could be added as a language feature via auto-differentiation techniques, thus giving `∇` and `∂` syntactic meaning, however there is no precedent for this in other languages and similar features are usually provided by libraries.
Member

There is experimental support for automatic differentiation being worked on for rustc.
Mathematica supports the syntax $∂_{x}f$ for $\frac{∂f}{∂x}$ and the syntax $∇_{x}f$ for the gradient of $f$ with respect to $x$.

Author

Thanks for the very interesting link! Done

Having these symbols available as Rust identifiers could simplify the implementation of these concepts and stay closer to a reference publication, thus reducing confusion and implementation errors.

For example instead of:
`` ``` ``

Contributor

@Jules-Bertholet Jules-Bertholet Jul 19, 2025

Suggested change: replace `` ``` `` with `` ```rust ``.

And so on for other code blocks. Makes syntax highlighting work.

Author

Done

Contributor

Thanks! Feel free to hit “resolve” on this review thread (and others you have addressed) so it doesn’t clutter the page

Similarly `let a = [2, 0]; let b = a₁;` will naturally give a compiler error that `a₁` is an unknown identifier and not be interpreted as `let b = a[0];`.
`∞` will just be a character usable in identifiers and not be a synonym to the likes of `f32::INFINITY`.

The characters from 5) are added to the set of Rust identifiers, but will trigger an NFKC or `uncommon_codepoints` warning when used depending on their Unicode classification.


FWIW, the characters from 3) and 4) would also be included in the NFKC warning; if you look at the definition of NFKC, superscripts and subscripts are an explicit example: https://unicode.org/reports/tr15/#Compatibility_Composite_Figure

So only three characters from this set wouldn't trigger the warning. These are still three good characters to include, but it's worth noting for accuracy.


Another side note to mention, also to clarify the above, is that NFKC effectively removes super/subscripts when normalizing, and I personally think it's kind of weird that this results in some normalized characters which are not normally allowed in identifiers. (for example, parentheses and +/- signs)

I think it's particularly strange that these are included in the definition at all, but I guess it kind of makes sense.
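
The effect described above can be sketched with a hand-written fold for just the superscript and subscript characters in the proposed set. This is an assumption-laden stand-in for real NFKC normalization (which needs Unicode data tables that the standard library does not ship); the function names are hypothetical and the mappings follow the compatibility decompositions of these specific code points.

```rust
/// Compatibility mapping for the superscripts and subscripts in the profile.
/// Returns `None` for characters outside that group.
fn fold_script_char(c: char) -> Option<char> {
    let digit = |offset: u32| char::from_digit(offset, 10).expect("offset is 0..=9");
    let folded = match c {
        '\u{00B9}' => '1',
        '\u{00B2}' => '2',
        '\u{00B3}' => '3',
        // Superscript zero and superscript four through nine.
        '\u{2070}' | '\u{2074}'..='\u{2079}' => digit(c as u32 - 0x2070),
        // Subscript zero through nine.
        '\u{2080}'..='\u{2089}' => digit(c as u32 - 0x2080),
        '\u{207A}' | '\u{208A}' => '+',
        // Superscript and subscript minus map to U+2212 MINUS SIGN.
        '\u{207B}' | '\u{208B}' => '\u{2212}',
        '\u{207C}' | '\u{208C}' => '=',
        '\u{207D}' | '\u{208D}' => '(',
        '\u{207E}' | '\u{208E}' => ')',
        _ => return None,
    };
    Some(folded)
}

/// Fold every superscript or subscript in `ident`, leaving other characters alone.
fn fold_scripts(ident: &str) -> String {
    ident.chars().map(|c| fold_script_char(c).unwrap_or(c)).collect()
}

fn main() {
    for ident in ["a₁", "x⁽²⁾", "n⁺", "t₋₁"] {
        let folded = fold_scripts(ident);
        // Parentheses, signs, and `=` are operators or delimiters, not identifier characters.
        let stays_identifier_like = folded.chars().all(|c| c.is_alphanumeric() || c == '_');
        println!("{ident} -> {folded} (identifier-like after folding: {stays_identifier_like})");
    }
}
```

Folding a digit subscript keeps the name identifier-like, while the parenthesis and sign variants fold into characters that could never appear in an identifier, which is the oddity the comment points out.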

8 participants