Speed up hangul composition #37

harendra-kumar · 2020-05-07T05:47:57Z

Use faster divisibility-by-28 check.

Perf comparison:

Benchmark       unicode-transforms(0)(ms) unicode-transforms(1)(ms)
--------------- ------------------------- -------------------------
NFC/Korean                          13.08                      9.72
NFKC/Korean                         13.39                     10.31

harendra-kumar · 2020-05-07T06:04:58Z

I earlier commented in #33 about tests not working. It seems the tests are there but they are based on first few hangul characters and for those cases the incorrect code worked fine.

harendra-kumar · 2020-05-07T12:12:57Z

Added a new commit for better speed up, results (% diff) before and after the new commit:

Benchmark       unicode-transforms(0)(ms)(base) unicode-transforms(1)(%)(-base)
--------------- ------------------------------- -------------------------------
NFC/AllChars                               9.95                           -0.05
NFC/Deutsch                                3.02                           +2.63
NFC/Devanagari                             6.69                           +7.76
NFC/English                                2.23                           -0.06
NFC/Japanese                              12.59                           -2.51
NFC/Korean                                 9.65                          -52.40
NFC/Vietnamese                            10.23                           +3.28
NFD/AllChars                               6.81                           -4.77
NFD/Deutsch                                2.38                           -4.46
NFD/Devanagari                             6.67                           +7.18
NFD/English                                1.85                           -4.66
NFD/Japanese                               8.41                           -5.57
NFD/Korean                                17.02                           -4.62
NFD/Vietnamese                             6.32                           -0.75
NFKC/AllChars                             14.87                           +0.17
NFKC/Deutsch                               3.18                           +1.78
NFKC/Devanagari                            6.75                           +6.09
NFKC/English                               2.23                           +0.10
NFKC/Japanese                             13.05                           -0.47
NFKC/Korean                               10.18                          -52.61
NFKC/Vietnamese                           10.50                           +0.39
NFKD/AllChars                             10.61                           -8.10
NFKD/Deutsch                               2.37                           +0.77
NFKD/Devanagari                            7.61                          -16.52
NFKD/English                               1.78                           -0.44
NFKD/Japanese                              8.63                           -3.32
NFKD/Korean                               16.26                           +0.02
NFKD/Vietnamese                            7.24                          -17.43

To compare just hangul:

Benchmark       unicode-transforms(0)(ms) unicode-transforms(1)(ms)
--------------- ------------------------- -------------------------
NFC/Korean                           9.65                      4.59
NFKC/Korean                         10.18                      4.80

We are now better than text-icu in all cases except mixed chars and Japanese.

Bodigrim · 2020-05-08T20:10:27Z

Could you possibly rebase this atop of #39? To estimate the effect of lookahead on its own, once divisions are as fast as possible.

Bodigrim · 2020-05-11T18:10:00Z

I rebased this branch locally atop of master. It still provides a decent improvement.

Benchmark      Before    After
--------------------------------
NFC/Korean    7.024 ms  4.847 ms
NFKC/Korean   7.341 ms  4.772 ms

Earlier were determining the type of the hangul syllable as soon as we saw it. Determining whether it is a hangul LV or not required a division operation which is expensive. Instead we now delay this decision until we see the next character. In most cases we do not need to determine this, we only need to determine it if the syllable is LV and it is followed by a T. With this we need very little division, therefore speeding up hangul composition significantly.

harendra-kumar · 2020-05-16T06:09:22Z

Rebased, this is what I see:

Benchmark       unicode-transforms(0)(ms) unicode-transforms(1)(ms)
--------------- ------------------------- -------------------------
NFC/Korean                           7.72                      4.79
NFKC/Korean                          8.37                      4.90

harendra-kumar · 2020-05-16T06:27:43Z

Comparison with text-icu:

Benchmark       ICU(ms)(base) unicode-transforms(%)(-base)
--------------- ------------- ----------------------------
NFC/AllChars             3.70                      +163.05
NFC/Deutsch              2.04                       +48.59
NFC/Devanagari           9.07                       -21.91
NFC/English              2.16                        +0.95
NFC/Japanese             3.15                      +288.14
NFC/Korean               4.50                        +6.15
NFC/Vietnamese          18.58                       -35.44
NFD/AllChars             7.59                       -13.13
NFD/Deutsch              3.64                       -36.19
NFD/Devanagari           9.02                       -12.31
NFD/English              2.86                       -37.22
NFD/Japanese             9.58                       -14.74
NFD/Korean              38.39                       -78.69
NFD/Vietnamese           6.98                        +2.26
NFKC/AllChars           10.06                       +51.46
NFKC/Deutsch             3.27                        +1.97
NFKC/Devanagari          8.81                       -11.69
NFKC/English             2.14                        +5.22
NFKC/Japanese            5.56                      +131.32
NFKC/Korean              7.86                       -36.83
NFKC/Vietnamese         17.19                       -33.78
NFKD/AllChars           11.09                        -6.46
NFKD/Deutsch             3.65                       -38.25
NFKD/Devanagari          9.03                       -24.62
NFKD/English             2.86                       -37.16
NFKD/Japanese           10.37                       -12.39
NFKD/Korean             38.49                       -79.00
NFKD/Vietnamese         12.97                       -52.90

AllChars and Japanese are the only ones where we are significantly worse than text-icu:

Benchmark       ICU(ms) unicode-transforms(ms)
--------------- ------- ----------------------
NFC/AllChars       3.70                  10.15
NFC/Deutsch        2.04                   3.11
NFC/Japanese       3.15                  12.38
NFKC/AllChars     10.06                  15.26
NFKC/Deutsch       3.27                   3.33
NFKC/Japanese      5.56                  12.95

Bodigrim

Looks good to me!

harendra-kumar requested review from Bodigrim and adithyaov May 7, 2020 05:47

harendra-kumar marked this pull request as ready for review May 7, 2020 12:13

harendra-kumar changed the base branch from compose-perf to master May 8, 2020 00:53

harendra-kumar mentioned this pull request May 10, 2020

Use multiplication instead of division by a constant #42

Merged

harendra-kumar force-pushed the compose-hangul branch from 7c4c3ea to 2f70dc9 Compare May 16, 2020 06:07

Bodigrim approved these changes May 16, 2020

View reviewed changes

harendra-kumar merged commit 2f70dc9 into master May 16, 2020

harendra-kumar deleted the compose-hangul branch July 18, 2020 19:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up hangul composition #37

Speed up hangul composition #37

harendra-kumar commented May 7, 2020

harendra-kumar commented May 7, 2020

harendra-kumar commented May 7, 2020

Bodigrim commented May 8, 2020

Bodigrim commented May 11, 2020

harendra-kumar commented May 16, 2020

harendra-kumar commented May 16, 2020

Bodigrim left a comment

Speed up hangul composition #37

Speed up hangul composition #37

Conversation

harendra-kumar commented May 7, 2020

harendra-kumar commented May 7, 2020

harendra-kumar commented May 7, 2020

Bodigrim commented May 8, 2020

Bodigrim commented May 11, 2020

harendra-kumar commented May 16, 2020

harendra-kumar commented May 16, 2020

Bodigrim left a comment

Choose a reason for hiding this comment