Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up hangul composition #37

Merged
merged 1 commit into from
May 16, 2020
Merged

Speed up hangul composition #37

merged 1 commit into from
May 16, 2020

Conversation

harendra-kumar
Copy link
Member

Use faster divisibility-by-28 check.

Perf comparison:

Benchmark       unicode-transforms(0)(ms) unicode-transforms(1)(ms)
--------------- ------------------------- -------------------------
NFC/Korean                          13.08                      9.72
NFKC/Korean                         13.39                     10.31

@harendra-kumar
Copy link
Member Author

I earlier commented in #33 about tests not working. It seems the tests are there but they are based on first few hangul characters and for those cases the incorrect code worked fine.

@harendra-kumar
Copy link
Member Author

Added a new commit for better speed up, results (% diff) before and after the new commit:

Benchmark       unicode-transforms(0)(ms)(base) unicode-transforms(1)(%)(-base)
--------------- ------------------------------- -------------------------------
NFC/AllChars                               9.95                           -0.05
NFC/Deutsch                                3.02                           +2.63
NFC/Devanagari                             6.69                           +7.76
NFC/English                                2.23                           -0.06
NFC/Japanese                              12.59                           -2.51
NFC/Korean                                 9.65                          -52.40
NFC/Vietnamese                            10.23                           +3.28
NFD/AllChars                               6.81                           -4.77
NFD/Deutsch                                2.38                           -4.46
NFD/Devanagari                             6.67                           +7.18
NFD/English                                1.85                           -4.66
NFD/Japanese                               8.41                           -5.57
NFD/Korean                                17.02                           -4.62
NFD/Vietnamese                             6.32                           -0.75
NFKC/AllChars                             14.87                           +0.17
NFKC/Deutsch                               3.18                           +1.78
NFKC/Devanagari                            6.75                           +6.09
NFKC/English                               2.23                           +0.10
NFKC/Japanese                             13.05                           -0.47
NFKC/Korean                               10.18                          -52.61
NFKC/Vietnamese                           10.50                           +0.39
NFKD/AllChars                             10.61                           -8.10
NFKD/Deutsch                               2.37                           +0.77
NFKD/Devanagari                            7.61                          -16.52
NFKD/English                               1.78                           -0.44
NFKD/Japanese                              8.63                           -3.32
NFKD/Korean                               16.26                           +0.02
NFKD/Vietnamese                            7.24                          -17.43

To compare just hangul:

Benchmark       unicode-transforms(0)(ms) unicode-transforms(1)(ms)
--------------- ------------------------- -------------------------
NFC/Korean                           9.65                      4.59
NFKC/Korean                         10.18                      4.80

We are now better than text-icu in all cases except mixed chars and Japanese.

@harendra-kumar harendra-kumar marked this pull request as ready for review May 7, 2020 12:13
@harendra-kumar harendra-kumar changed the base branch from compose-perf to master May 8, 2020 00:53
@Bodigrim
Copy link
Collaborator

Bodigrim commented May 8, 2020

Could you possibly rebase this atop of #39? To estimate the effect of lookahead on its own, once divisions are as fast as possible.

@Bodigrim
Copy link
Collaborator

I rebased this branch locally atop of master. It still provides a decent improvement.

Benchmark      Before    After
--------------------------------
NFC/Korean    7.024 ms  4.847 ms
NFKC/Korean   7.341 ms  4.772 ms

Earlier were determining the type of the hangul syllable as soon as
we saw it.  Determining whether it is a hangul LV or not required a
division operation which is expensive.

Instead we now delay this decision until we see the next character. In
most cases we do not need to determine this, we only need to determine
it if the syllable is LV and it is followed by a T. With this we
need very little division, therefore speeding up hangul composition
significantly.
@harendra-kumar
Copy link
Member Author

Rebased, this is what I see:

Benchmark       unicode-transforms(0)(ms) unicode-transforms(1)(ms)
--------------- ------------------------- -------------------------
NFC/Korean                           7.72                      4.79
NFKC/Korean                          8.37                      4.90

@harendra-kumar
Copy link
Member Author

Comparison with text-icu:

Benchmark       ICU(ms)(base) unicode-transforms(%)(-base)
--------------- ------------- ----------------------------
NFC/AllChars             3.70                      +163.05
NFC/Deutsch              2.04                       +48.59
NFC/Devanagari           9.07                       -21.91
NFC/English              2.16                        +0.95
NFC/Japanese             3.15                      +288.14
NFC/Korean               4.50                        +6.15
NFC/Vietnamese          18.58                       -35.44
NFD/AllChars             7.59                       -13.13
NFD/Deutsch              3.64                       -36.19
NFD/Devanagari           9.02                       -12.31
NFD/English              2.86                       -37.22
NFD/Japanese             9.58                       -14.74
NFD/Korean              38.39                       -78.69
NFD/Vietnamese           6.98                        +2.26
NFKC/AllChars           10.06                       +51.46
NFKC/Deutsch             3.27                        +1.97
NFKC/Devanagari          8.81                       -11.69
NFKC/English             2.14                        +5.22
NFKC/Japanese            5.56                      +131.32
NFKC/Korean              7.86                       -36.83
NFKC/Vietnamese         17.19                       -33.78
NFKD/AllChars           11.09                        -6.46
NFKD/Deutsch             3.65                       -38.25
NFKD/Devanagari          9.03                       -24.62
NFKD/English             2.86                       -37.16
NFKD/Japanese           10.37                       -12.39
NFKD/Korean             38.49                       -79.00
NFKD/Vietnamese         12.97                       -52.90

AllChars and Japanese are the only ones where we are significantly worse than text-icu:

Benchmark       ICU(ms) unicode-transforms(ms)
--------------- ------- ----------------------
NFC/AllChars       3.70                  10.15
NFC/Deutsch        2.04                   3.11
NFC/Japanese       3.15                  12.38
NFKC/AllChars     10.06                  15.26
NFKC/Deutsch       3.27                   3.33
NFKC/Japanese      5.56                  12.95

Copy link
Collaborator

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@harendra-kumar harendra-kumar merged commit 2f70dc9 into master May 16, 2020
@harendra-kumar harendra-kumar deleted the compose-hangul branch July 18, 2020 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants