
Conversation

@jctian98
Contributor

@jctian98 jctian98 commented Mar 20, 2025

Add Cantonese support, which mainly follows the Chinese implementation.
Key features:
(1) A Cantonese-to-Jyutping dictionary from CC-Canto.
(2) A Jyutping-to-IPA dictionary from Wiki.

Other important points:
(1) In the Jyutping-to-IPA rules, m and ng map to different IPA depending on context: each can (a) stand alone as a whole syllable or (b) act as a consonant within a syllable. The rules therefore check case (a) first.
(2) The order of the Jyutping-to-IPA rules matters: longer Jyutping items are checked first.
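The two points above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the tables `SYLLABIC` and `RULES` and the function `jyutping_to_ipa` are hypothetical names, the tables are heavily abbreviated, and tones are left out.

```python
# Hypothetical, abbreviated tables for illustration only.
# Whole-syllable readings of m and ng (syllabic nasals).
SYLLABIC = {"m": "m\u0329", "ng": "ŋ\u030d"}
# Segment-level rules; "ng" and "aa" must win over "n", "g", and "a".
RULES = {"ng": "ŋ", "aa": "aː", "m": "m", "a": "ɐ", "n": "n", "g": "k"}


def jyutping_to_ipa(syllable: str) -> str:
    """Convert one toneless Jyutping syllable to IPA (sketch)."""
    # (1) Check the isolated-syllable case first.
    if syllable in SYLLABIC:
        return SYLLABIC[syllable]
    # (2) Otherwise scan left to right, trying longer keys first.
    keys = sorted(RULES, key=len, reverse=True)
    out = []
    i = 0
    while i < len(syllable):
        for key in keys:
            if syllable.startswith(key, i):
                out.append(RULES[key])
                i += len(key)
                break
        else:
            # No rule matched: pass the character through unchanged.
            out.append(syllable[i])
            i += 1
    return "".join(out)


print(jyutping_to_ipa("ng"))    # syllabic reading
print(jyutping_to_ipa("ngaa"))  # "ng" used as an onset consonant
```

If the keys were tried shortest-first instead, "ngaa" would be split into n + g + a + a, which is why the ordering matters.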

I haven't found a good test example for Chinese, maybe because it needs to download a file.
Any suggestions are welcome.

@jctian98
Contributor Author

There seem to be some issues with the CC-Canto dictionary: some single characters are not included, even though many words containing them are. This doesn't seem to happen with the Chinese dictionary, only with the Cantonese one.

E.g., "今" is missing, but many words containing "今" are included. Parsing "今天" fails and yields "今tʰin˥".


@jctian98
Contributor Author

Update after discussion with @dmort27

When parsing a word with several characters, also add the Jyutping of each individual character to the dictionary to avoid the OOV problem. This will definitely introduce some issues, as some characters have more than one pronunciation.
I manually checked the output of the test cases and think they are reasonable (although with some modifications to the original ones).
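A minimal sketch of that backfilling idea, assuming the lexicon is a plain dict from a word to space-separated Jyutping syllables. The function name `backfill_characters`, the sample entries, and the "first reading seen wins" policy are all assumptions for illustration, not necessarily what the PR does.

```python
def backfill_characters(lexicon: dict) -> dict:
    """Return a copy of the lexicon with single-character entries
    derived from multi-character words (sketch)."""
    out = dict(lexicon)
    for word, jyutping in lexicon.items():
        syllables = jyutping.split()
        # Only use entries where characters and syllables line up 1:1.
        if len(word) < 2 or len(word) != len(syllables):
            continue
        for char, syllable in zip(word, syllables):
            # Existing entries win; otherwise keep the first reading seen.
            # Characters with several pronunciations may get the wrong one.
            out.setdefault(char, syllable)
    return out


# Hypothetical sample entries.
lexicon = {"今日": "gam1 jat6", "天": "tin1"}
filled = backfill_characters(lexicon)
print(filled["今"], filled["天"])
```

With the single character "今" backfilled, a word such as "今天" no longer leaves "今" untranscribed.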

I still have an issue with the Jyutping -> IPA mapping. E.g.,
(1) jat1 -> jɐt̚1 when applying at -> ɐt̚ / _
(2) jɐt̚1 -> jɐtʰ̚1 when applying t -> tʰ / _
(3) jɐtʰ̚1 -> jɐtʰ̚˥ when applying 1 -> ˥ / _

Step (2) is totally unexpected, as the t here is already IPA rather than Jyutping. But I'm not sure how to avoid this. Any advice?
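The three steps can be reproduced with a small sketch in which each rule is a plain string replacement applied in sequence over the whole string. This is not Epitran's actual rule engine; `RULES` and `apply_sequentially` are hypothetical names used only to show how the output of one rule feeds the next.

```python
# Hypothetical rule list, in the order used in the example above.
# "\u031a" is the combining "no audible release" mark.
RULES = [("at", "ɐt\u031a"), ("t", "tʰ"), ("1", "˥")]


def apply_sequentially(text: str, rules) -> str:
    """Apply each rule to the whole string, one after another."""
    for src, dst in rules:
        text = text.replace(src, dst)
    return text


result = apply_sequentially("jat1", RULES)
# The IPA t produced by the first rule is aspirated by the second rule.
print(result, "tʰ" in result)
```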

@kalvinchang
Collaborator

Is t really supposed to turn into aspirated t in all environments?

@jctian98
Contributor Author

No, which is why step (2) is unexpected. The rule t -> tʰ / _ is misapplied here: it treats the IPA t as a Jyutping character and transforms it again.

@kalvinchang
Collaborator

kalvinchang commented Mar 25, 2025

Maybe make t -> tʰ word-initial?

Seems like you already covered the cases where [t] is word-final.
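As a rough illustration of that suggestion, the rule can be anchored to the start of the syllable with a regular expression. The helper `aspirate_initial_t` is hypothetical, and it assumes the input is a single syllable, so "word-initial" and "syllable-initial" coincide.

```python
import re


def aspirate_initial_t(syllable: str) -> str:
    """Apply t -> tʰ only at the start of the syllable (sketch)."""
    return re.sub(r"^t", "tʰ", syllable)


print(aspirate_initial_t("tin1"))            # initial t is aspirated
print(aspirate_initial_t("jɐt\u031a1"))      # final t is left alone
```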

@juice500ml
Collaborator

Can you also update the README.md? Thanks!

@juice500ml
Collaborator

Closing & reopening PR for triggering CI

@juice500ml juice500ml closed this Mar 25, 2025
@juice500ml juice500ml reopened this Mar 25, 2025
@juice500ml
Collaborator

@jctian98, can you resolve the merge conflicts? Also, there is a slight change in the coding style of download.py, so you might want to update it.

@jctian98
Contributor Author

jctian98 commented Mar 25, 2025

I added some special characters to mark IPA that has already been transformed, so that rules are not applied to it again.
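One way such markers could work is sketched below. It is an assumption about the general idea, not the code merged in this PR: the sentinel characters `OPEN` and `CLOSE`, the rule list, and `apply_with_markers` are all hypothetical. Each rule's output is wrapped in sentinels, later rules skip wrapped spans, and the sentinels are stripped at the end.

```python
import re

# Hypothetical sentinel characters that never occur in Jyutping or IPA.
OPEN, CLOSE = "\u27e6", "\u27e7"
PROTECTED = re.compile(f"({OPEN}[^{CLOSE}]*{CLOSE})")

# Hypothetical rule list; "\u031a" is the "no audible release" mark.
RULES = [("at", "ɐt\u031a"), ("t", "tʰ"), ("1", "˥")]


def apply_with_markers(text: str, rules) -> str:
    """Apply rules in order, never rewriting already-converted output."""
    for src, dst in rules:
        # Splitting on a capturing group keeps the protected spans.
        parts = PROTECTED.split(text)
        for i, part in enumerate(parts):
            if not part.startswith(OPEN):
                # Only unprotected text is rewritten, and the result
                # is wrapped so later rules leave it alone.
                parts[i] = part.replace(src, OPEN + dst + CLOSE)
        text = "".join(parts)
    # Remove the sentinels once all rules have run.
    return text.replace(OPEN, "").replace(CLOSE, "")


print(apply_with_markers("jat1", RULES))  # final t stays unaspirated
print(apply_with_markers("tin1", RULES))  # initial t is aspirated
```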

I think this PR is ready to review now. @dmort27

@juice500ml
Collaborator

It seems the above issue has been covered by the tests and resolved. Merging the PR. Thanks a lot, everyone!

@juice500ml juice500ml merged commit f0a02f7 into dmort27:master Apr 18, 2025
3 checks passed

3 participants