C and LLVM codegen updates #351

MatthewFluet · 2019-11-22T11:09:29Z

Many updates to C and LLVM codegens. Highlights:

Add Machine.Program.rflow to compute {returns,raises}To control flow (654c557) and use in functor Chunkify (1b3b7b8) and in Machine IR Raise/Return transfers (cf8e487).
Add -chunk-jump-table {false|true} compile-time option to force generation of a jump table for the chunk switch (8e0dd2d, 5b6439b, 087a5b1).
Add -chunk-{{must,may}-rto-self,must-rto-sing,must-rto-other}-opt compile-time options to optimize return/raise transfers (7c10c70, 4d5abde, 4b7c649, c3b9905, 473808f).
Experiment using LLVM's cc10 (aka, ghccc) calling convention (2e26ebd).

Experiment with a new simple chunkify strategy (3330cbe, 3d9c499, 138512f, faef164, d1df0de); generally performs about the same as coalesce4096, significantly improves fib and tak (for GCC), slightly improves hamlet, but slightly worsens raytrace:

config command                                                                                                                          
C04    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9                                                           
C05    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 -chunkify simple                                          
C09    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang                                                        
C10    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang -chunkify simple                                       

task_clock ratio_means.fieller@0.95 (2-level)
program           `C05/C04` `C10/C09`
barnes-hut           0.9978    0.9589
boyer                1.064     1.076 
checksum             1.051     0.9775
count-graphs         1.005     0.9876
DLXSimulator         1.000     0.9905
even-odd             1.037     0.9989
fft                  0.9616    0.9537
fib                  0.6689    0.6260
flat-array           1.000     0.9645
hamlet               0.9547    0.9322
imp-for              1.067     1.014 
knuth-bendix         1.092     1.031 
lexgen               1.031     1.078 
life                 1.002     0.9911
logic                1.016     1.015 
mandelbrot           0.9776    1.030 
matrix-multiply      0.9903    0.9844
md5                  1.008     0.9940
merge                0.9927    1.062 
mlyacc               0.9810    1.024 
model-elimination    0.9877    0.9743
mpuz                 1.011     1.010 
nucleic              1.036     1.030 
output1              0.9943    1.021 
peek                 1.036     1.027 
pidigits             1.000     0.9653
psdes-random         1.009     1.014 
ratio-regions        0.9985    0.9881
ray                  0.9738    0.9601
raytrace             1.101     1.100 
simple               0.9620    0.9272
smith-normal-form    0.9690    0.9806
string-concat        0.9610    0.9772
tailfib              1.006     0.9292
tailmerge            0.9847    1.023 
tak                  0.8264    1.013 
tensor               1.010     0.9998
tsp                  0.9981    1.010 
tyan                 1.045     1.027 
vector-rev           1.012     0.9891
vector32-concat      0.9495    1.030 
vector64-concat      1.098     0.9744
vliw                 0.9413    1.019 
wc-input1            0.9301    1.098 
wc-scanStream        1.114     0.9234
zebra                1.008     1.001 
zern                 0.9819    1.014 
MIN                  0.6689    0.6260
GMEAN                0.9940    0.9912
MAX                  1.114     1.100

The simple chunkify strategy is not (yet) suitable for a self-compile; it can generate excessively large chunks, including one in a self-compile that takes gcc 8 minutes to compile.

Add expect: WordX.t option to RSSA and Machine Switch.T (911b5d4, e2b27ab, 695320d) and add -gc-expect {none|false|true} compile-time option, where -gc-expect false should indicate that performing a GC is cold path (823815a); no notable performance impact.
Lots of tweaks to C codegen, ultimately eliminating almost all c-chunk.h macros.
Eliminate unused Machine.Operand.Contents constructor (006269b).
Make a major refactoring of LLVM codegen (cec30c5).
Implement Real<N>_qequal for C codegen (9b7b2bd) and use is{less,lessequal} for Real<N>_l{t,e} for C codegen (7b55819).
Generalize LLVM type-based alias-analysis (27709ef).
Add -llvm-aamd scope for simple noalias/alias.scope alias-analysis metadata in LLVM codegen (b825f56); no notable performance impact.
Use C99/C11 inline for primitive and Basis Library functions (311331c, c864492, 4f2d213).
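
As a rough illustration of the `-gc-expect false` item above: one way a C backend could tell the compiler that the "perform a GC" branch is the cold path is GCC/Clang's `__builtin_expect`. The sketch below is not taken from MLton's generated code; `bump`, `collect`, and `gcCount` are hypothetical names, and whether the C codegen emits exactly this builtin is an assumption (the commit list only states that the LLVM codegen uses `@llvm.expect.i<N>`).

```c
#include <stddef.h>

/* Hypothetical stand-in for the collector: only counts invocations. */
size_t gcCount = 0;
void collect(void) { gcCount++; }

/* Hypothetical bump allocator over abstract offsets.
   Assumes frontier <= limit on entry. */
size_t bump(size_t frontier, size_t limit, size_t bytes) {
  /* Hint that the limit check is expected to be false (GC is the cold path). */
  if (__builtin_expect(limit - frontier < bytes, 0)) {
    collect();
    frontier = 0; /* pretend the collector emptied the heap */
  }
  return frontier + bytes;
}
```

The hint only affects code layout and branch prediction heuristics; the result of the comparison is unchanged, which is consistent with the "no notable performance impact" observation above.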

Add -codegen-fuse-op-and-chk {false|true} compile-time option to explicitly fuse adjacent Word<N>_<op> and Word{S,U}<N>_<op>CheckP primitives in the C and LLVM codegens (6b738b8, 3d1e89c, 68f8512, 82c019f, 61de560, 5363199, 0d46a85). It appears that GCC (and, to a lesser extent, Clang/LLVM) does not always successfully fuse adjacent Word<N>_<op> and Word{S,U}<N>_<op>CheckP primitives. The performance results reported at Add new overflow-checking primitives #273 and Remove old-style arithmetic primitives #292 suggest that this does not always have significant impact, but sometimes -codegen-fuse-op-and-chk true can have a positive impact. Unfortunately, it can also have a (significant) negative impact. In matrix-multiply and vector-rev, fusing can cause GCC to not recognize that an explicit sequence index can be replaced by a stride length; in these benchmarks, it would be nice if MLton eliminated the overflow checks.

config command                                                                                                                          
C04    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9                                                           
C09    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang                                                        
C11    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 -codegen-fuse-op-and-chk true                             
C15    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang -codegen-fuse-op-and-chk true                          

task_clock ratio_means.fieller@0.95 (2-level)
program           `C11/C04` `C15/C09`
barnes-hut           1.005     0.9925
boyer                1.052     1.013 
checksum             1.022     1.028 
count-graphs         0.9722    1.002 
DLXSimulator         1.004     0.9959
even-odd             0.8768    1.003 
fft                  0.9592    1.016 
fib                  0.9732    0.9798
flat-array           0.8148    1.019 
hamlet               0.9966    1.030 
imp-for              0.8993    0.7985
knuth-bendix         1.008     1.013 
lexgen               0.9851    1.043 
life                 0.9954    1.006 
logic                0.9994    1.014 
mandelbrot           0.9440    1.013 
matrix-multiply      1.336     1.009 
md5                  0.9604    1.007 
merge                0.9675    1.037 
mlyacc               1.032     1.029 
model-elimination    1.010     1.004 
mpuz                 1.035     0.9599
nucleic              0.9938    0.9983
output1              0.9278    0.9709
peek                 0.9850    1.035 
pidigits             0.9702    0.9538
psdes-random         1.017     0.9986
ratio-regions        0.9801    0.9887
ray                  0.9795    1.009 
raytrace             0.9959    1.026 
simple               0.9764    1.010 
smith-normal-form    1.002     1.049 
string-concat        0.7919    0.9035
tailfib              1.030     1.227 
tailmerge            1.017     0.9980
tak                  0.9790    0.9988
tensor               0.5258    1.000 
tsp                  0.9845    1.013 
tyan                 1.019     0.9739
vector-rev           1.178     1.253 
vector32-concat      0.8703    0.9230
vector64-concat      0.8906    0.9038
vliw                 0.9921    1.044 
wc-input1            1.060     0.9809
wc-scanStream        0.9166    1.040 
zebra                1.008     1.020 
zern                 1.051     1.089 
MIN                  0.5258    0.7985
GMEAN                0.9720    1.007 
MAX                  1.336     1.253

Note: the issue with md5 mentioned in the commit messages is with respect to the md5 benchmark before 2daaebf.

Overall, this simplifies the C and LLVM codegen slightly, although there is little significant performance change:

config command                                                                                                                          
C02    /home/mtf/devel/mlton/builds/g89891a411/bin/mlton -codegen c -cc gcc-9                                                           
C04    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9                                                           
C08    /home/mtf/devel/mlton/builds/g89891a411/bin/mlton -codegen llvm -cc clang                                                        
C09    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang                                                        

task_clock ratio_means.fieller@0.95 (2-level)
program           `C04/C02` `C09/C08`
barnes-hut           1.036     1.025 
boyer                0.9731    1.006 
checksum             0.9652    1.002 
count-graphs         0.9988    0.9964
DLXSimulator         0.9970    1.023 
even-odd             1.002     0.9881
fft                  1.026     0.9674
fib                  0.9034    0.7846
flat-array           1.014     1.021 
hamlet               0.9740    1.010 
imp-for              0.9707    0.9908
knuth-bendix         0.9077    0.9777
lexgen               1.048     0.8985
life                 1.002     0.9827
logic                1.006     0.9867
mandelbrot           1.000     1.011 
matrix-multiply      1.020     0.9957
md5                  0.9700    0.9960
merge                0.9974    0.9818
mlyacc               1.003     0.9824
model-elimination    0.9936    0.9817
mpuz                 0.9815    0.9466
nucleic              0.9946    1.002 
output1              1.007     1.026 
peek                 0.9832    0.9898
pidigits             0.9950    1.047 
psdes-random         1.009     0.9869
ratio-regions        0.9978    0.9725
ray                  0.9938    0.9663
raytrace             0.9975    1.032 
simple               0.9936    1.000 
smith-normal-form    1.038     0.9941
string-concat        1.041     1.014 
tailfib              0.9865    0.9741
tailmerge            1.010     1.020 
tak                  0.9331    0.9041
tensor               0.9938    0.9941
tsp                  0.9825    1.004 
tyan                 0.9960    0.9879
vector-rev           1.014     0.9091
vector32-concat      1.090     0.9016
vector64-concat      0.9994    0.9800
vliw                 0.9995    0.9876
wc-input1            0.9685    0.8634
wc-scanStream        1.178     1.105 
zebra                0.9857    0.9900
zern                 0.9733    0.9890
MIN                  0.9034    0.7846
GMEAN                0.9982    0.9815
MAX                  1.178     1.105


Using GCC's label address and computed goto features (https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html#Labels-as-Values), one can force the generation of a jump table for the chunk switch. Although GCC and clang will typically implement dense `switch` statements as a jump table, consider the following:

    switch (nextBlock) {
    case 5: goto L_5;
    case 6: goto L_6;
    ...
    case 29: goto L_29;
    default: __builtin_unreachable();
    }

GCC-7 (and earlier) and clang appear to implement this as:

    int t = nextBlock - 5;
    if (t > 24) goto L_29; else goto *jumpTable[t];

That is, it still performs a range comparison. With an explicit jump table, GCC and clang will implement this as:

    goto *jumpTable[nextBlock - 5];

(where the -5 can be incorporated into the address computation). Unfortunately, the performance impact seems negligible.
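
To make the two dispatch styles concrete, here is a minimal sketch with only three blocks. It relies on the GNU C labels-as-values extension (`&&label` and `goto *ptr`), so it needs GCC or Clang. The function names and the returned marker values are hypothetical; the real chunk functions take more parameters and jump into generated block code.

```c
/* Dispatch via a dense switch whose default is marked unreachable.
   The compiler may still emit a range check before the table jump. */
int dispatch_switch(int nextBlock) {
  switch (nextBlock) {
    case 5: goto L_5;
    case 6: goto L_6;
    case 7: goto L_7;
    default: __builtin_unreachable();
  }
L_5: return 50;
L_6: return 60;
L_7: return 70;
}

/* Dispatch via an explicit jump table of label addresses.
   There is no range check: the caller must pass a block in 5..7. */
int dispatch_table(int nextBlock) {
  static void *const jumpTable[] = { &&L_5, &&L_6, &&L_7 };
  goto *jumpTable[nextBlock - 5];
L_5: return 50;
L_6: return 60;
L_7: return 70;
}
```

Both functions behave identically for valid block numbers; the difference is only in whether the compiler is given the chance to insert a bounds comparison.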


Highlights:

- Introduce a `structure LLVM` with sub-structures for `Type`, `Value`, `Instr`, `MetaData`, and `ModuleContext`. While much of the LLVM codegen still uses strings, the modules enforce a more correct usage.
- Unify `implementsPrim` and `primApp`.
- Eliminate a number of instances of code duplication in the translation of primitives.
- Eliminate awkward `Context` type, which was a form of manual closure conversion.
- Favor direct output and using `AppendList.t` over constructing large strings.
- More closely match the C codegen.

Include `Global` and `GCState` domains, distinct from `Heap`, `Stack`, and `Other`. For each domain, include optional extra information for further distinctions:

* `GCState`: offset
* `Global`: cty, index
* `Heap`: kind, tycon, cty, offset
* `Stack`: offset

Not all options lead to sound alias analysis; limitations are noted in comments.


This allows marking an `_import`ed C function to have its prototype emitted with the `inline` keyword. This will be used to properly mark Basis Library functions (as opposed to primitives) that are provided as `inline` to be properly annotated as such in the emitted C declarations.

Previously, functions meant to be inlined because they correspond to primitives or Basis Library functions (e.g., `Real<N>` and `Real<N>.Math`) were marked `static inline` when included via `c-chunk.h`. If a C compiler chooses not to inline a function (such as clang at -O1), then each .o file included its own copy of the function (and the copy of the function provided by `libmlton.a` was not linked into the final executable). Now, functions are marked `inline` when included via `c-chunk.h` (and the corresponding `_import` is given the `inline` attribute). The C99/C11 semantics of `inline` requires the C compiler to *not* include a copy of the function in the .o file (if it chooses not to inline the function) and treat the function as an external reference. The copy of the function provided by `libmlton.a` is used to satisfy the external reference when linking.
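
The C99/C11 `inline` behaviour described above can be sketched in a single file. The function name `square_f64` is hypothetical and does not correspond to an actual runtime function; in practice the inline definition would live in a header such as `c-chunk.h` and the `extern inline` declaration in exactly one source file of the library.

```c
/* As if from a shared header: an inline definition.
   Under C99/C11 rules this alone does not emit an external symbol, so a
   translation unit that fails to inline a call refers to an external
   square_f64 that must be provided elsewhere. */
inline double square_f64(double x) { return x * x; }

/* As if in exactly one library source file: redeclaring the function with
   `extern` makes this translation unit provide the external definition
   that satisfies those references at link time. */
extern inline double square_f64(double x);
```

With `static inline` instead, every object file that does not inline the call would carry its own private copy, which is the duplication the commit message describes.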


It appears that GCC (and, to a lesser extent) Clang/LLVM do not always successfully fuse adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives. The performance results reported at MLton#273 and MLton#292 suggest that this does not always have significant impact, but a close look at the `md5` benchmark shows that the native codegen significantly outperforms the C codegen with gcc-9 due to redundant arithmetic computations (one for `Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`). This flag will be used to enable explicit fusing of adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives in the codegens.

It appears that GCC (and, to a lesser extent) Clang/LLVM do not always successfully fuse adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives. The performance results reported at MLton#273 and MLton#292 suggest that this does not always have significant impact, but a close look at the `md5` benchmark shows that the native codegen significantly outperforms the C codegen with gcc-9 due to redundant arithmetic computations (one for `Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`). These functions compute both the arithmetic result and a boolean indicating overflow (using `__builtin_<op>_overflow`). They will be used for explicit fusing of adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives in the C codegen for `-codegen-fuse-op-and-check true`.
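
A minimal sketch of the difference between the separate primitives and a fused helper, for 32-bit addition only. The name `WordS32_addAndCheck` follows the `Word{S,U}<N>_<op>AndCheck` pattern mentioned in the commit messages, but the signatures here are assumptions made for illustration, not the definitions used by the runtime.

```c
#include <stdbool.h>
#include <stdint.h>

/* Unfused: the wrapping result and the overflow test are separate
   operations, so the compiler has to notice by itself that they share
   the same underlying addition. */
uint32_t Word32_add(uint32_t x, uint32_t y) { return x + y; }

bool WordS32_addCheckP(int32_t x, int32_t y) {
  int32_t ignored;
  return __builtin_add_overflow(x, y, &ignored);
}

/* Fused (assumed signature): one builtin call yields both the wrapped
   result and the overflow flag. */
void WordS32_addAndCheck(int32_t x, int32_t y, int32_t *res, bool *overflow) {
  *overflow = __builtin_add_overflow(x, y, res);
}
```

`__builtin_add_overflow` stores the wrapped two's-complement result even when it reports overflow, so the fused helper can replace both primitives at once.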

It appears that GCC (and, to a lesser extent) Clang/LLVM do not always successfully fuse adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives. The performance results reported at MLton#273 and MLton#292 suggest that this does not always have significant impact, but a close look at the `md5` benchmark shows that the native codegen significantly outperforms the C codegen with gcc-9 due to redundant arithmetic computations (one for `Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`). (Note: Because the final md5 state is not used by the `md5` benchmark program, MLton actually optimizes out most of the md5 computation. What is left is a lot of arithmetic from `PackWord32Little.subVec` to check for indices that should raise `Subscript`.)

For example, with `-codegen-fuse-op-and-check false` and gcc-9, the `transform` function of `md5` has the following assembly:

    movl %r9d, %r10d
    subl $1, %r10d
    jo .L650
    leal -1(%r8), %r10d
    movl %r10d, %r12d
    addl %r10d, %edx
    jo .L650
    addl %r10d, %r11d
    cmpl %eax, %r11d
    jnb .L656
    movl %ebp, %edx
    addl $1, %edx
    jo .L659
    leal 1(%rcx), %edx
    movl %edx, %r11d
    imull %r9d, %r11d
    jo .L650
    imull %r8d, %edx
    movl %edx, %r11d
    addl %r10d, %r11d
    jo .L650
    leal (%rdx,%r10), %r11d
    cmpl %eax, %r11d
    jnb .L665

What seems to have happened is that gcc has arranged for equivalent values to be in `%r8` and `%r9`. In the first three lines, there is an implementation of `WordS32_subCheckP (X, 1)` using `subl/jo`, while in the fourth line, there is an implementation of `Word32_sub (X, 1)` using `lea` with an offset of `-1`. Notice that `%r10` is used for the result of both, so the fourth line is redundant (the value is already in `%r10`).

On the other hand, with `-codegen-fuse-op-and-check true` and gcc-9, the `transform` function of `md5` has the following assembly:

    movl %r8d, %r9d
    subl $1, %r9d
    jo .L645
    addl %r9d, %ecx
    jo .L645
    cmpl %edx, %ecx
    jnb .L651
    movl %eax, %ecx
    addl $1, %ecx
    jo .L654
    imull %r8d, %ecx
    jo .L645
    addl %r9d, %ecx
    jo .L645
    cmpl %edx, %ecx
    jnb .L660

On some small programs (with all `Chunk<N>` fns in the same compilation unit), Clang could observe that all `Chunk<N>` fns return the value `-2` (arising from a C call with no return point). With this knowledge, it would replace a tail call from one `Chunk<N>` fn to a `Chunk<M>` fn with a non-tail call and an explicit `ret -2`. Breaking the tail call and not performing tail-call optimization leads to unbounded C stack growth and segmentation faults. LLVM could make the same optimization, but the LLVM codegen did not exhibit the same problem (perhaps it requires a specific LLVM optimization pass that is requested by Clang at `-O1`, but not included by default by opt at `-O2`). Obscuring the manifest result value by using a function call seems to prevent the problem. (Clang could still observe that all `Chunk<N>` fns return the value `MLton_unreachable()` and make the same transformation, but presumably propagating a function call is considered more expensive than propagating a constant.) The problem could arise again with aggressive link-time optimization.


When a Machine IR temp is used as a destination for `Word{S,U}<N>_<op>AndCheck`, by having its address taken, Clang sometimes fails to turn the `alloca` introduced for the C local variable into an SSA variable. Moreover, Clang introduces `@llvm.lifetime.{start,end}` intrinsic calls at chunk entry and exit; the calls at the chunk exit (although they are no-ops) inhibit tail call optimization. Using a manifest temporary C local variable for the results of `Word{S,U}<N>_<op>AndCheck` and then copying them into Machine IR destination operands seems to avoid the problem.


MatthewFluet added 30 commits June 22, 2019 22:00

Add and use ChunkFn_t and ChunkFnPtr_t typedefs

67e2743

Simplify passing of Control.chunkTailCall to c-chunk.h

8eca49d

Add Machine.Program.rflow for {returns,raises}To control flow

654c557

Use Machine.Program.rflow in functor Chunkify

1b3b7b8

Add {raises,returns}To information to Machine IL Raise/Return transfers

cf8e487

Avoid doSwitchNextBlock when Raise/Return must be inter-chunk

793afa2

Simplify c-chunk.h macros

b9bdff1

Use more descriptive parameters in FarCall macro in c-chunk.h

423b9ad

Move goto doSwitchBlock; from ChunkSwitch to Chunk macro in `c-chunk.h`

677145e

Add and use LeaveChunk macro in c-chunk.h

df56150

Use LeaveChunk macro in Return in c-chunk.h

eb9c62a

Rather than using `goto doLeaveChunk;`.

Make ChunkSwitch exhaustive

bf9fe59

On `Return`, use mustReturnToSelf || (mayReturnToSelf && (nextChunks[nextBlock] == selfChunk)) to guard `goto doSwitchNextBlock`; this guarantees that the `ChunkSwitch` will only be entered with a block found in the chunk.
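
The guard can be read as a small predicate. The sketch below keeps the names `mustReturnToSelf`, `mayReturnToSelf`, `nextChunks`, `nextBlock`, and `selfChunk` from the commit message, but the chunk-function signature, the two dummy chunks, and the table contents are hypothetical simplifications.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed, simplified chunk-function type; the real one takes more state. */
typedef size_t (*ChunkFnPtr_t)(size_t nextBlock);

/* Two dummy chunks with distinct bodies (hypothetical). */
size_t chunkA(size_t nextBlock) { return nextBlock + 1; }
size_t chunkB(size_t nextBlock) { return nextBlock + 2; }

/* Hypothetical block-to-chunk table: block 0 is in chunkA, block 1 in chunkB. */
const ChunkFnPtr_t exampleNextChunks[] = { chunkA, chunkB };

/* True when a Return may stay inside the current chunk, i.e. when it is
   safe to take `goto doSwitchNextBlock` instead of leaving the chunk. */
bool returnStaysInChunk(bool mustReturnToSelf, bool mayReturnToSelf,
                        const ChunkFnPtr_t *nextChunks, size_t nextBlock,
                        ChunkFnPtr_t selfChunk) {
  return mustReturnToSelf
      || (mayReturnToSelf && (nextChunks[nextBlock] == selfChunk));
}
```

Because the table lookup only happens when `mayReturnToSelf` holds, a return that is known to be inter-chunk skips the comparison entirely, and the switch never sees a block that belongs to another chunk.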

Use direct call in Return when exactly one non-self target chunk

7c10c70

Remove DefineChunk macro from c-common.h

e232b8a

Add and use SwitchNextBlock macro in c-chunk.h

b45e5ef

Add ChunkSwitchCase to c-chunk.h

dcb2873

Add -chunk-jump-table {false|true} compile-time option

8e0dd2d

Reorganize c-chunk.h

b3a3ab0

Silence C compiler warnings about unused parameters/variables

f29a65f

Some chunk functions may not use `gcState`, `stackTop`, `frontier`, or `selfChunk`.

Silence C compiler warning about addresses always evaluating to true

62f07c5

Add and use %ChunkFn{,Ptr{,Arr}}_t typedefs in LLVM codegen

b6456a2

Share code for Raise and Return in C codegen

1ea83b5

Eliminate DeclareChunk macro

edbf6c9

Eliminate ChunkName and Chunkp macros

25138bc

Update LLVM codegen to match C codegen

69d4444

* Make `ChunkSwitch` exhaustive
* Use direct call in `Return` when exactly one non-self target chunk

Add and use Machine.Operand.gcField

25b1991

Perform StackTop = StackBottom + ExnStack in C codegen

869c4a2

"%Pointer" type defn is not used by LLVM codegen

8212ae3

Eliminate unnecessary cast in translation of SetExnStackLocal

fb54dcf

MatthewFluet added 28 commits July 29, 2019 16:36

Declare/Define nextChunks as ChunkFnPtr_t const

e34905f

Declare/Define more static arrays as const

04af37a

Include all codegen prims in CCodegen.implementsPrim

0a649f2

Implement Real<N>_qequal for C codegen

9b7b2bd

Use is{less,lessequal} for Real<N>_l{t,e} in C codegen

7b55819

Eliminate Machine.Statement.Noop

32606cc

Add -llvm-aamd scope for simple noalias/alias.scope alias-analysis metadata

b825f56

Mark primitive and Basis Library functions as `__attribute__((always_inline))`

4f2d213

Although the functions are small and marked `inline`, clang at -O1 does not inline the functions (even with the `-finline-functions` option).

Use -chunkify simple for a ControlFlags.Chunkify.simpleDefault

d1df0de

Add comments about fusing of adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives

82c019f

Eliminate redundant debug flags in compileC and compileS

44de25d

Add -relocation-model=pic to llc options when positionIndependent

79e7c81

See MLton#190 and MLton#191. On systems (e.g., gcc 7.04 on Ubuntu 18.04) that error when linking PIC and non-PIC code, the default behavior of `llc` to generate non-PIC code leads to link-time errors.

Implement -codegen-fuse-op-and-check true for LLVM codegen

61de560

Avoid fneg LLVM instruction (not present prior to LLVM 8.0)

74b77ea

Merge branch 'master' of github.com:MLton/mlton into c-and-llvm-codegen-updates

098009d

Make -codegen-fuse-op-and-chk true for C & LLVM order independent

0d46a85

Fusing of adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives in the C and LLVM codegens will now occur in either order.

Use @llvm.expect.i<N> in LLVM codegen for Switch.T with expect

695320d

Add and use LLVMCodegen.LLVM.ModuleContext.intrinsic

62bf6f8

Update CHANGELOG.adoc

29ae87c

MatthewFluet merged commit 7ab49d0 into MLton:master Nov 22, 2019

MatthewFluet deleted the c-and-llvm-codegen-updates branch November 22, 2019 14:23
