ICU-22885 Add parsing of approximately sign #3454

sffc · 2025-03-26T02:31:11Z

This adds support for parsing the approximately sign and fixes the bug observed in ICU-22885.

Checklist

Required: Issue filed: ICU-22885
Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

FrankYFTang · 2025-03-26T19:52:16Z

Do we also need a Java fix for this?

FrankYFTang · 2025-03-26T19:53:48Z

icu4c/source/common/static_unicode_sets.cpp

+    U_ASSERT(gUnicodeSets[INFINITY_SIGN] == nullptr);
    gUnicodeSets[INFINITY_SIGN] = new UnicodeSet(u"[∞]", status);
+    U_ASSERT(gUnicodeSets[APPROXIMATELY_SIGN] == nullptr);
+    gUnicodeSets[APPROXIMATELY_SIGN] = new UnicodeSet(u"[∼~≈≃約]", status); // this set was manually curated


How is this set determined? What is it based on? Having "約" in this set is strange. Could we have a comment about this?

I claim no moral authority for how this set was formed beyond "this set was manually curated".

What I did was open the xml files and look for characters used in the approximately pattern in various locales.

So you mean these characters were gathered by looking at a particular field in some XML files? If so, could you point out which XML files and which fields, i.e. HOW you "manually curated" the set? Please give a little more detail on how you did that.

OK, here is what I tried

find common/main/* |xargs egrep approximatelySign |egrep -v "↑↑↑|unconfirmed"|cut -d '>' -f 2|cut -d '<' -f 1|sort -u
-
~
∼
≃
≈
ca.
dáàshì
dáàṣì
約

So... maybe add a comment such as "This set of characters is gathered from the values of the approximatelySign element of CLDR common/main/*.xml files."
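The set quoted earlier in this thread holds only single characters, while the command output also lists multi-character values such as "ca.". The following is a minimal sketch of separating the two kinds of values, assuming they have already been collected and decoded to UTF-32. The names `Curated` and `curate` are hypothetical, and this is not how the PR author built the set.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical container: single-code-point values are candidates for a
// character set; longer values would need separate handling.
struct Curated {
    std::set<char32_t> singles;
    std::set<std::u32string> multi;
};

// Splits the collected approximatelySign values by length and removes
// duplicates (std::set keeps one copy of each value).
Curated curate(const std::vector<std::u32string>& values) {
    Curated out;
    for (const auto& v : values) {
        if (v.size() == 1) {
            out.singles.insert(v[0]);
        } else if (!v.empty()) {
            out.multi.insert(v);
        }
    }
    return out;
}
```

Whether any given value belongs in the final set remains a manual decision; the sketch only sorts candidates into two buckets.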

sffc · 2025-03-26T22:19:47Z

Added Java and fixed the comment.

richgillam · 2025-03-26T22:50:54Z

I'm coming to the party late, so I apologize if these are dumb questions, but what are you actually doing here, and why is that the appropriate response to the original issue? If I'm reading this correctly, this makes the number parser explicitly aware of the approximately sign (in all the various locales), but just basically ignores it in parsing. Is that right, and is that what we want to do?

What does this symbol mean in practice? Why would somebody be using it in text that we parse?
I get why calling abort() is a bad idea, but why wouldn't it be better to just signal a parse error?
If ignoring the character is the right thing to do, why do we need code to explicitly identify it as the approximately sign? Couldn't you just have a generic list of characters that should be ignored in parsing?

sffc · 2025-03-26T22:59:41Z

The bug uncovered that we didn't handle the approximately sign in parsing, even though it is supported in patterns and in formatting, which I added relatively recently (a few years ago).

Treating the approximately sign the same way as the plus sign makes sense to me. With the plus sign, we accept it and it doesn't impact the resulting parsed value. I mostly copied the plus sign code to make the approximately sign code.
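As an illustration of "accept it and it doesn't impact the resulting parsed value", here is a minimal standalone sketch. It does not use ICU and is not the code in this PR: `isApproximatelySign` and `parseLenient` are hypothetical names, the character list is copied from the set quoted in the review thread, and only ASCII digits without overflow checks are handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the curated set [∼~≈≃約] quoted in the review.
static bool isApproximatelySign(char32_t c) {
    static const std::u32string kSigns = U"\u223C~\u2248\u2243\u7D04";
    return kSigns.find(c) != std::u32string::npos;
}

// Parses an optional leading '+' or approximately sign followed by ASCII
// digits. The sign is consumed but does not change the value.
// Returns false if there are no digits or an unexpected character appears.
bool parseLenient(const std::u32string& text, int64_t& out) {
    std::size_t i = 0;
    if (i < text.size() && (text[i] == U'+' || isApproximatelySign(text[i]))) {
        ++i;  // accept the sign and move on
    }
    if (i == text.size()) {
        return false;  // a sign alone is not a number
    }
    int64_t value = 0;
    for (; i < text.size(); ++i) {
        if (text[i] < U'0' || text[i] > U'9') {
            return false;
        }
        value = value * 10 + static_cast<int64_t>(text[i] - U'0');
    }
    out = value;
    return true;
}
```

With this sketch, parsing "≈200", "+200", and "200" all yield 200, which mirrors the expectation in the test added by this PR.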

richgillam · 2025-03-26T23:04:15Z

The bug uncovered that we didn't handle the approximately sign in parsing, even though it is supported in patterns and in formatting, which I added relatively recently (a few years ago).

Treating the approximately sign the same way as the plus sign makes sense to me. With the plus sign, we accept it and it doesn't impact the resulting parsed value. I mostly copied the plus sign code to make the approximately sign code.

Okay, I'll accept that. Thanks for the explanation. Given that, the code looks okay to me.

richgillam

LOKTM

FrankYFTang · 2025-03-26T23:04:52Z

icu4c/source/common/static_unicode_sets.cpp

+    U_ASSERT(gUnicodeSets[INFINITY_SIGN] == nullptr);
    gUnicodeSets[INFINITY_SIGN] = new UnicodeSet(u"[∞]", status);
+    U_ASSERT(gUnicodeSets[APPROXIMATELY_SIGN] == nullptr);
+    // This set of characters was manually curated from the values of the approximatelySign element of CLDR common/main/*.xml files.


please wrap the line in the comment. This is way too long I think.

FrankYFTang · 2025-03-26T23:05:40Z

icu4c/source/i18n/numparse_symbols.cpp



+ApproximatelySignMatcher::ApproximatelySignMatcher(const DecimalFormatSymbols& dfs, bool allowTrailing)
+        : SymbolMatcher(dfs.getConstSymbol(DecimalFormatSymbols::kApproximatelySignSymbol), unisets::APPROXIMATELY_SIGN),


line wrap for these two lines please

FrankYFTang · 2025-03-26T23:05:53Z

icu4c/source/test/intltest/numfmtst.cpp

+    dfmt.parse(u"≈200", result, status);
+    ASSERT_SUCCESS(status);
+    if (result.getInt64() != 200) {
+        errln(UnicodeString(u"Got unexpected parse result: ") + DoubleToUnicodeString(result.getInt64()));


line wrap please

FrankYFTang · 2025-03-26T23:06:41Z

icu4j/main/core/src/main/java/com/ibm/icu/impl/StaticUnicodeSets.java


+        // The following don't currently have parseLenients in data.
        unicodeSets.put(Key.INFINITY_SIGN, new UnicodeSet("[∞]").freeze());
+        // This set of characters was manually curated from the values of the approximatelySign element of CLDR common/main/*.xml files.


Line wrap the comment please.

markusicu · 2025-09-23T00:40:23Z

@sffc please take a look at the feedback from @FrankYFTang

icu4c/source/i18n/numparse_symbols.h

See unicode-org#3454

jira-pull-request-webhook · 2025-09-23T03:39:53Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

sffc · 2025-09-23T03:40:51Z

I used the tool to squash the branch because @FrankYFTang's comments are stylistic only. It would take me more time to find the branch and fix the style. I might try to get to it this week but I'm currently overwhelmed with urgent items and I want to make this mergeable for ICU 78 RC.

See unicode-org#3454

jira-pull-request-webhook · 2025-09-26T17:30:54Z

Notice: the branch changed across the force-push!

icu4c/source/common/static_unicode_sets.cpp is different
icu4c/source/i18n/numparse_symbols.cpp is different
icu4c/source/test/intltest/numfmtst.cpp is different
icu4j/main/core/src/main/java/com/ibm/icu/impl/StaticUnicodeSets.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

markusicu · 2025-09-26T17:33:17Z

Let's not have this be stuck for a second release just on line wrapping... I have amended the commit to address @FrankYFTang's feedback. @richgillam had approved this already in March, minus the new line wrapping. @sffc ok with my changes? I will also self-approve, and try to remember to merge by EOD if I don't hear otherwise.

richgillam

Still looks good to me.

sffc · 2025-09-26T19:03:53Z

Sure

sffc assigned FrankYFTang Mar 26, 2025

sffc requested a review from richgillam March 26, 2025 02:33

FrankYFTang reviewed Mar 26, 2025

sffc requested a review from FrankYFTang March 26, 2025 22:19

richgillam previously approved these changes Mar 26, 2025

FrankYFTang reviewed Mar 26, 2025

FrankYFTang reviewed Sep 23, 2025

icu4c/source/i18n/numparse_symbols.h

sffc added a commit to sffc/icu that referenced this pull request Sep 23, 2025

ICU-22885 Add parsing of approximately sign

eb4a47f

See unicode-org#3454

sffc force-pushed the ICU-22885-appx-sign branch from cb354b1 to eb4a47f Compare September 23, 2025 03:39

ICU-22885 Add parsing of approximately sign

53055a2

See unicode-org#3454

markusicu dismissed richgillam’s stale review via 53055a2 September 26, 2025 17:30

markusicu force-pushed the ICU-22885-appx-sign branch from eb4a47f to 53055a2 Compare September 26, 2025 17:30

markusicu approved these changes Sep 26, 2025

markusicu requested review from FrankYFTang and richgillam September 26, 2025 17:34

richgillam approved these changes Sep 26, 2025

markusicu merged commit 31cb585 into unicode-org:main Sep 26, 2025
104 checks passed


