Skip to content

Conversation

sffc
Copy link
Member

@sffc sffc commented Mar 26, 2025

This adds support for parsing the approximately sign and fixes the bug observed in ICU-22885.

Checklist

  • Required: Issue filed: ICU-22885
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@FrankYFTang
Copy link
Contributor

Do we also need a Java fix for this?

U_ASSERT(gUnicodeSets[INFINITY_SIGN] == nullptr);
gUnicodeSets[INFINITY_SIGN] = new UnicodeSet(u"[∞]", status);
U_ASSERT(gUnicodeSets[APPROXIMATELY_SIGN] == nullptr);
gUnicodeSets[APPROXIMATELY_SIGN] = new UnicodeSet(u"[∼~≈≃約]", status); // this set was manually curated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this set determeind? What does it based on ? having "約" in this set is strange? Could we have a comments about this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I claim no moral authority for how this set was formed beyond "this set was manually curated".

What I did was open the xml files and look for characters used in the approximately pattern in various locales.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you mean these characters are gathered by looking at the content of some xml files and some particuarl field in those xml files? If so, could you point out which XML files and which particular fields about HOW you "manually curated" ? Give a little bit more details of how you did that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, here is what I tried

find common/main/* |xargs egrep approximatelySign |egrep -v "↑↑↑|unconfirmed"|cut -d '>' -f 2|cut -d '<' -f 1|sort -u
-
~
∼
≃
≈
ca.
dáàshì
dáàṣì
約

so... maybe adding comment as "This set of characters is gathered from the values of approximatelySign element of CLDR common/main/*.xml files." /

@sffc sffc requested a review from FrankYFTang March 26, 2025 22:19
@sffc
Copy link
Member Author

sffc commented Mar 26, 2025

Added Java and fixed the comment.

@richgillam
Copy link
Contributor

I'm coming to the party late, so I apologize if these are dumb questions, but what are you actually doing here, and why is that the appropriate response to the original issue? If I'm reading this correctly, this makes the number parser explicitly aware of the approximately sign (in all the various locales), but just basically ignores it in parsing. Is that right, and is that what we want to do?

  1. What does this symbol mean in practice? Why would somebody be using it in text that we parse?
  2. I get why calling abort() is a bad idea, but why wouldn't it be better to just signal a parse error?
  3. If ignoring the character is the right thing to do, why do we need code to explicitly identify it as the approximately sign? Couldn't you just have a generic list of characters that should be ignored in parsing?

@sffc
Copy link
Member Author

sffc commented Mar 26, 2025

The bug uncovered that we didn't handle the approximately sign in parsing, even though it is supported in patters and in formatting, which I added relatively recently (a few years ago).

Treating the approximately sign the same way as the plus sign makes sense to me. With the plus sign, we accept it and it doesn't impact the resulting parsed value. I mostly copied the plus sign code to make the approximately sign code.

@richgillam
Copy link
Contributor

The bug uncovered that we didn't handle the approximately sign in parsing, even though it is supported in patters and in formatting, which I added relatively recently (a few years ago).

Treating the approximately sign the same way as the plus sign makes sense to me. With the plus sign, we accept it and it doesn't impact the resulting parsed value. I mostly copied the plus sign code to make the approximately sign code.

Okay, I'll accept that. Thanks for the explanation. Given that, the code looks okay to me.

richgillam
richgillam previously approved these changes Mar 26, 2025
Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOKTM

U_ASSERT(gUnicodeSets[INFINITY_SIGN] == nullptr);
gUnicodeSets[INFINITY_SIGN] = new UnicodeSet(u"[∞]", status);
U_ASSERT(gUnicodeSets[APPROXIMATELY_SIGN] == nullptr);
// This set of characters was manually curated from the values of the approximatelySign element of CLDR common/main/*.xml files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please wrap the line in the comment. This is way too long I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done



ApproximatelySignMatcher::ApproximatelySignMatcher(const DecimalFormatSymbols& dfs, bool allowTrailing)
: SymbolMatcher(dfs.getConstSymbol(DecimalFormatSymbols::kApproximatelySignSymbol), unisets::APPROXIMATELY_SIGN),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line wrap for these two lines please

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

dfmt.parse(u"≈200", result, status);
ASSERT_SUCCESS(status);
if (result.getInt64() != 200) {
errln(UnicodeString(u"Got unexpected parse result: ") + DoubleToUnicodeString(result.getInt64()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line wrap please

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


// The following don't currently have parseLenients in data.
unicodeSets.put(Key.INFINITY_SIGN, new UnicodeSet("[∞]").freeze());
// This set of characters was manually curated from the values of the approximatelySign element of CLDR common/main/*.xml files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line wrap te comment plese.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@markusicu
Copy link
Member

@sffc please take a look at the feedback from @FrankYFTang

sffc added a commit to sffc/icu that referenced this pull request Sep 23, 2025
@sffc sffc force-pushed the ICU-22885-appx-sign branch from cb354b1 to eb4a47f Compare September 23, 2025 03:39
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@sffc
Copy link
Member Author

sffc commented Sep 23, 2025

I used the tool to squash the branch because @FrankYFTang's comments are stylistic only. It would take me more time to find the branch and fix the style. I might try to get to it this week but I'm currently overwhelmed with urgent items and I want to make this mergeable for ICU 78 RC.

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/common/static_unicode_sets.cpp is different
  • icu4c/source/i18n/numparse_symbols.cpp is different
  • icu4c/source/test/intltest/numfmtst.cpp is different
  • icu4j/main/core/src/main/java/com/ibm/icu/impl/StaticUnicodeSets.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@markusicu
Copy link
Member

Let's not have this be stuck for a second release just on line wrapping... I have amended the commit to address @FrankYFTang 's feedback. @richgillam had approved this already in March, minus the new line wrapping. @sffc ok with my changes? I will also self-approve, and try to remember to merge by EOD if I don't hear otherwise.

Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looks good to me.

@sffc
Copy link
Member Author

sffc commented Sep 26, 2025

Sure

@markusicu markusicu merged commit 31cb585 into unicode-org:main Sep 26, 2025
104 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants