Fix computeSubwords method #8

y-yammt · 2019-04-21T10:58:18Z

According to the link, the check `(c & 0xC0) == 0x80` only applies to UTF-8 encoded strings, where it detects continuation bytes. Java strings are UTF-16, so this PR fixes how characters are extracted from them.
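To illustrate the difference, here is a minimal sketch, not the code from this PR. The class and method names (`CodePointSplit`, `characters`, `looksLikeUtf8Continuation`) are hypothetical. It shows that applying the UTF-8 continuation-byte mask to a UTF-16 `char` gives misleading answers, and that walking the string by code point keeps surrogate pairs together.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the library.
class CodePointSplit {

    /** Splits a UTF-16 Java string into one substring per Unicode code point. */
    static List<String> characters(String word) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < word.length()) {
            int cp = word.codePointAt(i);        // decodes a surrogate pair if present
            int len = Character.charCount(cp);   // 1 for BMP, 2 for supplementary
            out.add(word.substring(i, i + len));
            i += len;
        }
        return out;
    }

    /** The UTF-8 continuation-byte test, (wrongly) applied to a UTF-16 code unit. */
    static boolean looksLikeUtf8Continuation(char c) {
        return (c & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        // "a", U+1F600 (stored as a surrogate pair), "b"
        System.out.println(characters("a\uD83D\uDE00b").size());
        // U+3081 is a complete character, yet the mask flags it as a continuation.
        System.out.println(looksLikeUtf8Continuation('\u3081'));
        // A low surrogate really is a trailing unit, yet the mask does not flag it.
        System.out.println(looksLikeUtf8Continuation('\uDE00'));
    }
}
```

The mask only looks at bits 6 and 7 of the value, which carry no structural meaning in a UTF-16 code unit. Stepping with `Character.charCount(word.codePointAt(i))` avoids relying on it.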

y-yammt · 2019-04-21T10:59:07Z

src/test/java/cc/fasttext/DictionaryTest.java

@@ -155,4 +158,23 @@ public static String readWord(java.io.Reader reader) throws IOException {
        return sb.length() == 0 ? null : sb.toString();
    }

+    @Test
+    public void testComputeSubwords() {
+        ImmutableMap<String, Set<String>> wordToSubwords = ImmutableMap.of(


The test cases are generated by https://ideone.com/XR19v4 .

Well, it looks like both versions (the C++ one and your Java one) match.

sszuev · 2019-04-24T08:09:21Z

Thanks for the PR. Good catch.
The code could probably be optimized slightly (e.g. it calls codePointAt twice when i = j), but it works, has a test, and is better than what was there before.
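One possible way to avoid decoding the same position twice is to record the code-point boundaries once and then build the n-grams with `substring`. The sketch below is not the merged implementation. `SubwordSketch`, `ngrams`, `minn` and `maxn` are hypothetical names, `minn >= 1` is assumed, and the sketch leaves out any hashing or boundary-specific rules the real `computeSubwords` may apply.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch for illustration only; not the project's Dictionary code.
class SubwordSketch {

    /** Returns all n-grams of length minn..maxn, measured in code points. */
    static Set<String> ngrams(String word, int minn, int maxn) {
        // Single decoding pass: offsets[k] is the UTF-16 index where the
        // k-th code point starts; offsets[count] is the end of the string.
        int count = word.codePointCount(0, word.length());
        int[] offsets = new int[count + 1];
        for (int k = 0, i = 0; k < count; k++) {
            offsets[k] = i;
            i += Character.charCount(word.codePointAt(i));
        }
        offsets[count] = word.length();

        // N-gram extraction only needs the precomputed boundaries.
        Set<String> result = new LinkedHashSet<>();
        for (int start = 0; start < count; start++) {
            for (int n = minn; n <= maxn && start + n <= count; n++) {
                result.add(word.substring(offsets[start], offsets[start + n]));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The word is wrapped in '<' and '>' here purely as sample input.
        System.out.println(ngrams("<a\uD83D\uDE00>", 2, 3).size());
    }
}
```

Each code point is decoded exactly once, at the cost of an `int[]` of length `count + 1` per word.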

Fix computeSubwords method

0b8124f

sszuev merged commit b7da617 into sszuev:master Apr 24, 2019

sszuev mentioned this pull request Oct 14, 2019

Optimize Dictionary#computeSubwords #11

Open
