This repository has been archived by the owner on Aug 26, 2023. It is now read-only.

Fix computeSubwords method #8

Merged
merged 1 commit into from
Apr 24, 2019

@y-yammt y-yammt commented Apr 21, 2019

According to the link, the Unicode check `(c & 0xC0) == 0x80` applies only when strings are encoded in UTF-8, where it identifies continuation bytes. Java strings are encoded in UTF-16, so this PR fixes the character extraction to work on UTF-16 strings.
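For illustration, here is a minimal sketch of what code-point-aware n-gram extraction over a UTF-16 string can look like. It is not the code from this PR: the class name `SubwordSketch`, the method signature, and the `minn`/`maxn` parameters are hypothetical, and the rule that skips single-character n-grams at the word boundaries is an assumption modeled on the C++ version. The point is only that `String.codePointAt` and `Character.charCount` take over the role that the continuation-byte check plays in UTF-8.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the PR's implementation.
final class SubwordSketch {

    // Collects character n-grams of length minn..maxn, counted in Unicode
    // code points rather than in UTF-16 code units.
    static List<String> computeSubwords(String word, int minn, int maxn) {
        List<String> ngrams = new ArrayList<>();
        // Advance i by one whole code point (1 or 2 chars) per iteration.
        for (int i = 0; i < word.length(); i += Character.charCount(word.codePointAt(i))) {
            StringBuilder ngram = new StringBuilder();
            int j = i;
            for (int n = 1; j < word.length() && n <= maxn; n++) {
                int cp = word.codePointAt(j);
                ngram.appendCodePoint(cp);
                j += Character.charCount(cp);
                // Assumed boundary rule: drop 1-grams at the start or end of the word.
                boolean boundarySingle = n == 1 && (i == 0 || j == word.length());
                if (n >= minn && !boundarySingle) {
                    ngrams.add(ngram.toString());
                }
            }
        }
        return ngrams;
    }
}
```

With `minn = 1` and `maxn = 2`, the input `"a\uD83D\uDE00b"` (the letter a, the emoji U+1F600, the letter b) yields three n-grams, and the surrogate pair is never split.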

@@ -155,4 +158,23 @@ public static String readWord(java.io.Reader reader) throws IOException {
return sb.length() == 0 ? null : sb.toString();
}

@Test
public void testComputeSubwords() {
ImmutableMap<String, Set<String>> wordToSubwords = ImmutableMap.of(
Author

The test cases are generated by https://ideone.com/XR19v4 .

Owner

Well, it looks like both versions (the C++ one and your Java one) match.

@sszuev
Owner

sszuev commented Apr 24, 2019

Thanks for the PR. Good catch.
Although it seems the code could be slightly optimized (e.g. codePointAt is called twice for i = j), it works, has a test, and is better than it was.
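One possible way to avoid the repeated `codePointAt` calls, sketched under assumptions: decode the word into an `int[]` of code points once, then build each n-gram with the `String(int[], int, int)` constructor. The class name `CodePointNgrams`, the signature, and the boundary rule for single-character n-grams are hypothetical and are not taken from the merged commit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the optimization idea, not the merged code.
final class CodePointNgrams {

    static List<String> computeSubwords(String word, int minn, int maxn) {
        // Decode every code point exactly once.
        int[] cps = word.codePoints().toArray();
        List<String> ngrams = new ArrayList<>();
        for (int i = 0; i < cps.length; i++) {
            for (int n = Math.max(1, minn); n <= maxn && i + n <= cps.length; n++) {
                // Assumed boundary rule: drop 1-grams at the start or end of the word.
                boolean boundarySingle = n == 1 && (i == 0 || i + n == cps.length);
                if (!boundarySingle) {
                    // Build the n-gram from code points i .. i+n-1.
                    ngrams.add(new String(cps, i, n));
                }
            }
        }
        return ngrams;
    }
}
```

The trade-off is one extra array allocation per word in exchange for plain index arithmetic in the loops.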

@sszuev sszuev merged commit b7da617 into sszuev:master Apr 24, 2019