Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor and optimize composeChar #33

Closed
wants to merge 4 commits into from
Closed

Conversation

harendra-kumar
Copy link
Member

Look at the commit messages for more details. May be a good idea to review individual commits. Here is the perf summary, before and after:

Benchmark      unicode-transforms(0)(ms) unicode-transforms(1)(ms)
-------------- ------------------------- -------------------------
NFC/AllChars                       19.36                     10.21
NFC/Deutsch                        13.54                      3.04
NFC/Devanagari                     17.42                      7.97
NFC/English                        12.76                      2.24
NFC/Japanese                       20.81                     11.02
NFC/Korean                         21.38                     12.96
NFC/Vietnamese                     18.62                     11.36

@harendra-kumar
Copy link
Member Author

Tests are failing, need to do a bit more work before it becomes ready.

Copy link
Collaborator

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a stunning speed up! Great!

-- Hold an L to wait for V, hold an LV to wait for T.
data JamoBuf
= JamoEmpty
| JamoLIndex {-# UNPACK #-} !Int
= JamoLIndex {-# UNPACK #-} !Int
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe {-# UNPACK #-} pragmas can be safely omitted, there is {-# OPTIONS_GHC -funbox-strict-fields #-} anyways.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it can be removed.

| JamoLV {-# UNPACK #-} !Char

data RegBuf
= RegOne !Char
| RegMany !Char !Char [Char]
Copy link
Collaborator

@Bodigrim Bodigrim May 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any invariants about which of these Chars are combining and which are not? If yes, it is worth to reflect in comments. (Previously ReBuf was guaranteed to consist of combining characters only, and Char in Stater Char Rebuf was guaranteed to be non-combining.)

You can probably put a bang before [Char] as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first one in the buffer may be a starter. Others are guaranteed to be non-starters (i.e. combining chars). I will put a comment. I will add a bang as well.

@harendra-kumar
Copy link
Member Author

Tests are fixed now. There were two screw ups during refactoring:

  1. I assumed that a non-hangul character can never decompose to hangul character. But that's not true. There are hangul characters out of the hangul range which have compatibility decompositions to hangul range.

  2. When combining starter pairs we need to combine only two contiguous starters, I combined even non-contiguous ones.

Perf comparison:

Benchmark       unicode-transforms(0)(ms) unicode-transforms(1)(ms)
--------------- ------------------------- -------------------------
NFC/AllChars                        19.92                     11.00
NFC/Deutsch                         13.25                      3.29
NFC/Devanagari                      17.47                      7.88
NFC/English                         12.58                      2.55
NFC/Japanese                        21.42                     12.34
NFC/Korean                          21.55                     13.03
NFC/Vietnamese                      18.52                     11.73

NFKC/AllChars                       23.99                     15.85
NFKC/Deutsch                        13.03                      3.56
NFKC/Devanagari                     17.09                      7.58
NFKC/English                        12.37                      2.59
NFKC/Japanese                       22.40                     13.60
NFKC/Korean                         21.31                     13.10
NFKC/Vietnamese                     18.23                     11.65

@harendra-kumar
Copy link
Member Author

RegBuf is just ReBuf without the Empty state. I guess we can remove duplication by using Just RegBuf wherever we are using ReBuf.

@Bodigrim
Copy link
Collaborator

Bodigrim commented May 6, 2020

Just may introduce an additional indirection and (worse) an additional level of laziness.

@harendra-kumar
Copy link
Member Author

I hope GHC would eliminate the indirection during simplification.

@harendra-kumar
Copy link
Member Author

harendra-kumar commented May 6, 2020

Comparison with text-icu (last column is % diff from the lower, negative means we are better). We are better in all decompositions and in some compositions. The only cases where we are signficantly worse is Japanese and Korean compositions. And we have to remember that we are doing a full decomposition and then composition.

Korean can be improved relatively easily perhaps.

Benchmark       ICU(ms)(base) unicode-transforms(%)(-base)
--------------- ------------- ----------------------------
NFC/AllChars             3.65                      +176.18
NFC/Deutsch              2.03                       +55.51
NFC/Devanagari           8.33                        -8.10
NFC/English              2.17                        +4.85
NFC/Japanese             2.99                      +309.14
NFC/Korean               4.51                      +184.43
NFC/Vietnamese          17.28                       -37.16
NFD/AllChars             7.57                       -12.89
NFD/Deutsch              3.55                       -35.48
NFD/Devanagari           9.04                       -27.18
NFD/English              2.82                       -38.42
NFD/Japanese             9.40                        -9.08
NFD/Korean              37.46                       -52.72
NFD/Vietnamese           7.03                       -15.52
NFKC/AllChars           10.03                       +50.94
NFKC/Deutsch             3.33                        +5.77
NFKC/Devanagari          8.34                        -5.40
NFKC/English             2.10                       +19.39
NFKC/Japanese            5.32                      +137.63
NFKC/Korean              7.65                       +69.89
NFKC/Vietnamese         17.37                       -37.23
NFKD/AllChars           11.15                       -13.69
NFKD/Deutsch             3.58                       -39.06
NFKD/Devanagari          9.16                       -27.47
NFKD/English             2.92                       -38.14
NFKD/Japanese           10.38                       -15.28
NFKD/Korean             37.46                       -52.94
NFKD/Vietnamese         12.98                       -53.81

@harendra-kumar
Copy link
Member Author

harendra-kumar commented May 6, 2020

Added one more commit to speed up isHangulLV check. Now Korean benchmark is pretty close to text-icu in NFC and significantly better than ICU in NFKC.

Benchmark       unicode-transforms(0)(ms) unicode-transforms(1)(ms)
--------------- ------------------------- -------------------------
NFC/Korean                          13.21                      5.13
NFKC/Korean                         13.36                      5.49

@Bodigrim
Copy link
Collaborator

Bodigrim commented May 6, 2020

Astonishing!

@harendra-kumar
Copy link
Member Author

I did not properly review the code before committing. Even though the code was wrong, the tests passed, I need to check if the tests are working properly. Unicode test suite does not have a hangul (LV, T) composition test, I had added this in an extra test suite, but it does not seem to be working.

The correct code should have been:

    assert (jamoTCount `mod` 4 == 0)
           (idiv4 == 0 && ti == 0)
    where
        i = (ord c) - hangulFirst
        idiv4 = i .&. 3
        (_, ti) = i `quotRem` jamoTCount

But this does not help much, perhaps because anyway most syllables are hangul LV so we anyway have to do the expensive quotRem. I wrote this one earlier:

-- Cheaper division by 28
{-# INLINE isDivBy28 #-}
isDivBy28 :: Int -> Bool
isDivBy28 n =
    let (q, r) = divBy32 n
    in if (q == 0)
       then r == 28 || r == 0
       else isDivBy28 (q `unsafeShiftL` 2 + r)

    where

    divBy32 x =
        let q = x `unsafeShiftR` 5
            r = x .&. 31
        in (q, r)

isHangulLV c = isDivBy28 ((ord c) - hangulFirst)

This gives us the following results:

Benchmark       unicode-transforms(0)(ms) unicode-transforms(1)(ms)
--------------- ------------------------- -------------------------
NFC/Korean                          13.21                      10.65
NFKC/Korean                         13.36                      10.65

Good improvement but not as dramatic. I am looking at another strategy which could be cheaper i.e. test for a jamo T and only then check if the previous one was an LV. In a precomposed form we will mostly see LV and LVT, rarely when we see a jamo T only then we will need the expensive division.

* unfold the non-decomposable case outside the recursive loop
* use SPEC constr on encode
There was a lot of code bloat earlier, the core size reduced from 50K lines
to 20K lines.
In composeChar:

* First check the state and then make decisions based on the char type instead
  of doing the opposite. This simplifies the code, the core size reduces by
  1/3rd (it is still huge).

* With this change the code is better amenable to spec-constr
  optimization, with -fspec-constr-count=8 it is able to produce a
  tighter loop improving performance by several times (best improvement is in
  English benchmark).
UNPACK pragmas are redundant as we use a blanket -funbox-strict-fields
@harendra-kumar
Copy link
Member Author

I dropped the last commit, will continue that in a separate PR and merge this.

@harendra-kumar
Copy link
Member Author

merged in master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants