Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-22564 Java API for the host env to register a low level break eng… #2697

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

FrankYFTang
Copy link
Contributor

…ine to break CJ + Southeastern Asia script

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22564
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@FrankYFTang
Copy link
Contributor Author

@richgillam Somehow my FakeYueBreakEngine break the PersonNameConsistencyTest on yue.txt and yue_Hans.txt . I am suprise that changes to BreakIterator will impact the outcome of PersonName. Could you explain how does PersonName depends on BreakIterator?

richgillam
richgillam previously approved these changes Nov 15, 2023
Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@richgillam
Copy link
Contributor

@richgillam Somehow my FakeYueBreakEngine break the PersonNameConsistencyTest on yue.txt and yue_Hans.txt . I am suprise that changes to BreakIterator will impact the outcome of PersonName. Could you explain how does PersonName depends on BreakIterator?

PersonName is using BreakIterator for a few things, as I remember it:

  • Some field modifiers capitalize each individual word in the field, using BreakIterator to find the word boundaries.
  • I think the initial-formation stuff uses BreakIterator to break a field into words so that it can convert each of them into an initial.
  • I could be wrong, but I think the space-replacement stuff may also use BreakIterator to figure out where to insert or substitute spaces.

The first two don't seem relevant to Yue, but the third one might be (if it's real-- I don't remember for sure now). Some of the foreign-name support might be triggering the capitalization/initial logic even though it's not strictly relevant (e.g., formatting Yue with an English person name formatter). If you can tell me more about which tests are failing, what they expect, and what you're getting instead, I can help you track the problem down (if you need help, of course).

Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that what we agreed to last week? I thought I came away from that discussion thinking we were adding a third method to say whether the break engine supported a locale, not turning the isFor() method into that method by getting rid of the c parameter. Do we no longer need an isFor() function that tests a character to see if it can begin a run of characters to be broken?

Comment on lines +1086 to +1089
* Register a new external break engine. The external break engine will be adopted.
* Because ICU may choose to cache break engine internally, this must
* be called at application startup, prior to any calls to
* object methods of RuleBasedBreakIterator to avoid undefined behavior.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Register a new external break engine. The external break engine will be adopted.
* Because ICU may choose to cache break engine internally, this must
* be called at application startup, prior to any calls to
* object methods of RuleBasedBreakIterator to avoid undefined behavior.
* Register a new external break engine. The external break engine will be adopted.
* Because ICU may choose to cache break engines internally,
* to avoid undefined behavior this must
* be called at application startup, prior to any calls to
* object methods of RuleBasedBreakIterator.

Copy link
Member

@macchiati macchiati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but someone more familiar with the break iterator api should review.

* @param text A CharacterIterator representing the text
* @param rangeStart The start of the range of known characters
* @param rangeEnd The end of the range of known characters
* @param foundBreaks Output of a list of Integer to denote break positions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should specify whether the contents of the List must be cleared before calling, or whether the fillBreaks is required to clear internally.

@markusicu
Copy link
Member

AFAIK @FrankYFTang has not yet addressed the feedback on his proposal from last year, and hasn't yet sent an updated proposal.

@markusicu markusicu added incomplete Needs work; do not approve/merge as is. waiting-on-author consensus-needed labels Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
consensus-needed incomplete Needs work; do not approve/merge as is. waiting-on-author
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants