-
-
Notifications
You must be signed in to change notification settings - Fork 831
ICU-22885 Add parsing of approximately sign #3454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Do we also need a Java fix for this? |
U_ASSERT(gUnicodeSets[INFINITY_SIGN] == nullptr); | ||
gUnicodeSets[INFINITY_SIGN] = new UnicodeSet(u"[∞]", status); | ||
U_ASSERT(gUnicodeSets[APPROXIMATELY_SIGN] == nullptr); | ||
gUnicodeSets[APPROXIMATELY_SIGN] = new UnicodeSet(u"[∼~≈≃約]", status); // this set was manually curated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is this set determeind? What does it based on ? having "約" in this set is strange? Could we have a comments about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I claim no moral authority for how this set was formed beyond "this set was manually curated".
What I did was open the xml files and look for characters used in the approximately pattern in various locales.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you mean these characters are gathered by looking at the content of some xml files and some particuarl field in those xml files? If so, could you point out which XML files and which particular fields about HOW you "manually curated" ? Give a little bit more details of how you did that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, here is what I tried
find common/main/* |xargs egrep approximatelySign |egrep -v "↑↑↑|unconfirmed"|cut -d '>' -f 2|cut -d '<' -f 1|sort -u
-
~
∼
≃
≈
ca.
dáàshì
dáàṣì
約
so... maybe adding comment as "This set of characters is gathered from the values of approximatelySign element of CLDR common/main/*.xml files." /
Added Java and fixed the comment. |
I'm coming to the party late, so I apologize if these are dumb questions, but what are you actually doing here, and why is that the appropriate response to the original issue? If I'm reading this correctly, this makes the number parser explicitly aware of the approximately sign (in all the various locales), but just basically ignores it in parsing. Is that right, and is that what we want to do?
|
The bug uncovered that we didn't handle the approximately sign in parsing, even though it is supported in patters and in formatting, which I added relatively recently (a few years ago). Treating the approximately sign the same way as the plus sign makes sense to me. With the plus sign, we accept it and it doesn't impact the resulting parsed value. I mostly copied the plus sign code to make the approximately sign code. |
Okay, I'll accept that. Thanks for the explanation. Given that, the code looks okay to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LOKTM
U_ASSERT(gUnicodeSets[INFINITY_SIGN] == nullptr); | ||
gUnicodeSets[INFINITY_SIGN] = new UnicodeSet(u"[∞]", status); | ||
U_ASSERT(gUnicodeSets[APPROXIMATELY_SIGN] == nullptr); | ||
// This set of characters was manually curated from the values of the approximatelySign element of CLDR common/main/*.xml files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please wrap the line in the comment. This is way too long I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
|
||
ApproximatelySignMatcher::ApproximatelySignMatcher(const DecimalFormatSymbols& dfs, bool allowTrailing) | ||
: SymbolMatcher(dfs.getConstSymbol(DecimalFormatSymbols::kApproximatelySignSymbol), unisets::APPROXIMATELY_SIGN), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line wrap for these two lines please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dfmt.parse(u"≈200", result, status); | ||
ASSERT_SUCCESS(status); | ||
if (result.getInt64() != 200) { | ||
errln(UnicodeString(u"Got unexpected parse result: ") + DoubleToUnicodeString(result.getInt64())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line wrap please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
// The following don't currently have parseLenients in data. | ||
unicodeSets.put(Key.INFINITY_SIGN, new UnicodeSet("[∞]").freeze()); | ||
// This set of characters was manually curated from the values of the approximatelySign element of CLDR common/main/*.xml files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line wrap te comment plese.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@sffc please take a look at the feedback from @FrankYFTang |
cb354b1
to
eb4a47f
Compare
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
I used the tool to squash the branch because @FrankYFTang's comments are stylistic only. It would take me more time to find the branch and fix the style. I might try to get to it this week but I'm currently overwhelmed with urgent items and I want to make this mergeable for ICU 78 RC. |
eb4a47f
to
53055a2
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
Let's not have this be stuck for a second release just on line wrapping... I have amended the commit to address @FrankYFTang 's feedback. @richgillam had approved this already in March, minus the new line wrapping. @sffc ok with my changes? I will also self-approve, and try to remember to merge by EOD if I don't hear otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still looks good to me.
Sure |
This adds support for parsing the approximately sign and fixes the bug observed in ICU-22885.
Checklist