Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C and LLVM codegen updates #351

Merged
merged 104 commits into from
Nov 22, 2019

Conversation

MatthewFluet
Copy link
Member

Many updates to C and LLVM codegens. Highlights:

  • Add Machine.Program.rflow to compute {returns,raises}To control flow (654c557) and use in functor Chunkify (1b3b7b8) and in Machine IR Raise/Return transfers (cf8e487).
  • Add chunk-jump-table {false|true} compile-time option to force generation of a jump table for the chunk switch (8e0dd2d, 5b6439b, 087a5b1).
  • Add -chunk-{{must,may}-rto-self,must-rto-sing,must-rto-other}-opt compile-time options to optimize return/raise transfers (7c10c70, 4d5abde, 4b7c649, c3b9905, 473808f)
  • Experiment using LLVM's cc10 (aka, ghccc) calling convention (2e26ebd).
  • Experiment with a new simple chunkify strategy (3330cbe, 3d9c499, 138512f, faef164, d1df0de); generally performs about the same as coalesce4096, significantly improves fib and tak (for GCC), slightly improves hamlet, but slightly worsens raytrace:
    config command                                                                                                                          
    C04    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9                                                           
    C05    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 -chunkify simple                                          
    C09    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang                                                        
    C10    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang -chunkify simple                                       
    
    task_clock ratio_means.fieller@0.95 (2-level)
    program           `C05/C04` `C10/C09`
    barnes-hut           0.9978    0.9589
    boyer                1.064     1.076 
    checksum             1.051     0.9775
    count-graphs         1.005     0.9876
    DLXSimulator         1.000     0.9905
    even-odd             1.037     0.9989
    fft                  0.9616    0.9537
    fib                  0.6689    0.6260
    flat-array           1.000     0.9645
    hamlet               0.9547    0.9322
    imp-for              1.067     1.014 
    knuth-bendix         1.092     1.031 
    lexgen               1.031     1.078 
    life                 1.002     0.9911
    logic                1.016     1.015 
    mandelbrot           0.9776    1.030 
    matrix-multiply      0.9903    0.9844
    md5                  1.008     0.9940
    merge                0.9927    1.062 
    mlyacc               0.9810    1.024 
    model-elimination    0.9877    0.9743
    mpuz                 1.011     1.010 
    nucleic              1.036     1.030 
    output1              0.9943    1.021 
    peek                 1.036     1.027 
    pidigits             1.000     0.9653
    psdes-random         1.009     1.014 
    ratio-regions        0.9985    0.9881
    ray                  0.9738    0.9601
    raytrace             1.101     1.100 
    simple               0.9620    0.9272
    smith-normal-form    0.9690    0.9806
    string-concat        0.9610    0.9772
    tailfib              1.006     0.9292
    tailmerge            0.9847    1.023 
    tak                  0.8264    1.013 
    tensor               1.010     0.9998
    tsp                  0.9981    1.010 
    tyan                 1.045     1.027 
    vector-rev           1.012     0.9891
    vector32-concat      0.9495    1.030 
    vector64-concat      1.098     0.9744
    vliw                 0.9413    1.019 
    wc-input1            0.9301    1.098 
    wc-scanStream        1.114     0.9234
    zebra                1.008     1.001 
    zern                 0.9819    1.014 
    MIN                  0.6689    0.6260
    GMEAN                0.9940    0.9912
    MAX                  1.114     1.100 
    
    The simple chunkify strategy is not (yet) suitable for a self-compile; it can generate excessively large chunks, including one for a self-compile that requires 8min to compile by gcc.
  • Add expect: WordX.t option to RSSA and Machine Switch.T (911b5d4, e2b27ab, 695320d) and add -gc-expect {none|false|true} compile-time option, where -gc-expect false should indicate that performing a GC is cold path (823815a); no notable performance impact.
  • Lots of tweaks to C codegen, ultimately eliminating almost all c-chunk.h macros.
  • Eliminate unused Machine.Operand.Contents constructor (006269b).
  • Make a major refactoring of LLVM codegen (cec30c5).
  • Implement Real<N>_qequal for C codegen (9b7b2bd) and use is{less,lessequal} for Real<N>_l{t,e} for C codegen (7b55819).
  • Generalize LLVM type-based alias-analysis (27709ef).
  • Add -llvm-aamd scope for simple noalias/alias.scope alias-analysis metadata in LLVM codegen (b825f56); no notable performance impact.
  • Use C99/C11 inline for primitive and Basis Library functions (311331c, c864492, 4f2d213).
  • Add -codegen-fuse-op-and-chk {false|true} compile-time option to explicitly fuse adjacent Word<N>_<op> and Word{S,U}<N>_<op>CheckP primitives in the C and LLVM codegens (6b738b8, 3d1e89c, 68f8512, 82c019f, 61de560, 5363199, 0d46a85). It appears that GCC (and, to a lesser extent) Clang/LLVM do not always successfully fuse adjacent adjacent Word<N>_<op> and Word{S,U}<N>_<op>CheckP primitives. The performance results reported at Add new overflow-checking primitives #273 and Remove old-style arithmetic primitives #292 suggest that this does not always have significant impact, but sometimes -codegen-fuse-op-and-chk true can have a positive. Unfortunately, it can also have a (significant) negative impact. In matrix-multiply and vector-rev, fusing can cause GCC to not recognize that an explicit sequence index can be replaced by a stride length; in these benchmarks, it would be nice if MLton eliminated the overflow checks.
    config command                                                                                                                          
    C04    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9                                                           
    C09    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang                                                        
    C11    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 -codegen-fuse-op-and-chk true                             
    C15    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang -codegen-fuse-op-and-chk true                          
    
    task_clock ratio_means.fieller@0.95 (2-level)
    program           `C11/C04` `C15/C09`
    barnes-hut           1.005     0.9925
    boyer                1.052     1.013 
    checksum             1.022     1.028 
    count-graphs         0.9722    1.002 
    DLXSimulator         1.004     0.9959
    even-odd             0.8768    1.003 
    fft                  0.9592    1.016 
    fib                  0.9732    0.9798
    flat-array           0.8148    1.019 
    hamlet               0.9966    1.030 
    imp-for              0.8993    0.7985
    knuth-bendix         1.008     1.013 
    lexgen               0.9851    1.043 
    life                 0.9954    1.006 
    logic                0.9994    1.014 
    mandelbrot           0.9440    1.013 
    matrix-multiply      1.336     1.009 
    md5                  0.9604    1.007 
    merge                0.9675    1.037 
    mlyacc               1.032     1.029 
    model-elimination    1.010     1.004 
    mpuz                 1.035     0.9599
    nucleic              0.9938    0.9983
    output1              0.9278    0.9709
    peek                 0.9850    1.035 
    pidigits             0.9702    0.9538
    psdes-random         1.017     0.9986
    ratio-regions        0.9801    0.9887
    ray                  0.9795    1.009 
    raytrace             0.9959    1.026 
    simple               0.9764    1.010 
    smith-normal-form    1.002     1.049 
    string-concat        0.7919    0.9035
    tailfib              1.030     1.227 
    tailmerge            1.017     0.9980
    tak                  0.9790    0.9988
    tensor               0.5258    1.000 
    tsp                  0.9845    1.013 
    tyan                 1.019     0.9739
    vector-rev           1.178     1.253 
    vector32-concat      0.8703    0.9230
    vector64-concat      0.8906    0.9038
    vliw                 0.9921    1.044 
    wc-input1            1.060     0.9809
    wc-scanStream        0.9166    1.040 
    zebra                1.008     1.020 
    zern                 1.051     1.089 
    MIN                  0.5258    0.7985
    GMEAN                0.9720    1.007 
    MAX                  1.336     1.253 
    
    Note: the issue with md5 mentioned in the commit messages are with respect to the md5 benchmark before 2daaebf.

Overall, this simplifies the C and LLVM codegen slightly, although there is little significant performance change:

config command                                                                                                                          
C02    /home/mtf/devel/mlton/builds/g89891a411/bin/mlton -codegen c -cc gcc-9                                                           
C04    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9                                                           
C08    /home/mtf/devel/mlton/builds/g89891a411/bin/mlton -codegen llvm -cc clang                                                        
C09    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang                                                        

task_clock ratio_means.fieller@0.95 (2-level)
program           `C04/C02` `C09/C08`
barnes-hut           1.036     1.025 
boyer                0.9731    1.006 
checksum             0.9652    1.002 
count-graphs         0.9988    0.9964
DLXSimulator         0.9970    1.023 
even-odd             1.002     0.9881
fft                  1.026     0.9674
fib                  0.9034    0.7846
flat-array           1.014     1.021 
hamlet               0.9740    1.010 
imp-for              0.9707    0.9908
knuth-bendix         0.9077    0.9777
lexgen               1.048     0.8985
life                 1.002     0.9827
logic                1.006     0.9867
mandelbrot           1.000     1.011 
matrix-multiply      1.020     0.9957
md5                  0.9700    0.9960
merge                0.9974    0.9818
mlyacc               1.003     0.9824
model-elimination    0.9936    0.9817
mpuz                 0.9815    0.9466
nucleic              0.9946    1.002 
output1              1.007     1.026 
peek                 0.9832    0.9898
pidigits             0.9950    1.047 
psdes-random         1.009     0.9869
ratio-regions        0.9978    0.9725
ray                  0.9938    0.9663
raytrace             0.9975    1.032 
simple               0.9936    1.000 
smith-normal-form    1.038     0.9941
string-concat        1.041     1.014 
tailfib              0.9865    0.9741
tailmerge            1.010     1.020 
tak                  0.9331    0.9041
tensor               0.9938    0.9941
tsp                  0.9825    1.004 
tyan                 0.9960    0.9879
vector-rev           1.014     0.9091
vector32-concat      1.090     0.9016
vector64-concat      0.9994    0.9800
vliw                 0.9995    0.9876
wc-input1            0.9685    0.8634
wc-scanStream        1.178     1.105 
zebra                0.9857    0.9900
zern                 0.9733    0.9890
MIN                  0.9034    0.7846
GMEAN                0.9982    0.9815
MAX                  1.178     1.105 

Rather than using `goto doLeaveChunk;`.
On `Return`, use

    mustReturnToSelf || (mayReturnToSelf && (nextChunks[nextBlock] == selfChunk))

to guard `goto doSwitchNextBlock`; this guarantees that the `ChunkSwitch` will
only be entered with a block found in the chunk.
Some chunk functions may not use `gcState`, `stackTop`, `frontier`, or
`selfChunk`.
Using GCC's label address and computed goto
features (https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html#Labels-as-Values),
one can force the generation of a jump table for the chunk switch.

Although GCC and clang will typically implement dense `switch`
statements into a jump table, consider the following:

    switch (nextBlock) {
      case 5: goto L_5;
      case 6: goto L_6;
      ...
      case 29: goto L_29;
      default: __builtin_unreachable();
    }

GCC-7 (and earlier) and clang appear to implement this as:

   int t = nextBlock - 5
   if (t > 24)
     goto L_29;
   else
     goto *jumpTable[i];

That is, it still performs a range comparison.  With an explicit jump
table, GCC and clang will implement this as:

   goto *jumpTable[nextBlock - 5];

(where the -5 can be incorporated into the address computation).

Unfortunately, the performance impact seems negligible.
* Make `ChunkSwitch` exhaustive
* Use direct call in `Return` when exactly one non-self target chunk
Highlights:

- Introduce a `structure LLVM` with sub-structures for `Type`,
  `Value`, `Instr`, `MetaData`, and `ModuleContext`.  While much of
  the LLVM codegen still uses strings, the modules enforce a more
  correct usage.

- Unify `implementsPrim` and `primApp`.

- Eliminate a number of instances of code duplication in the
  translation of primitives.

- Eliminate awkward `Context` type, which was a form of manual closure
  conversion.

- Favor direct output and using `AppendList.t` over constructing large
  strings.

- More closely match the C codegen.
Include `Global` and `GCState` domains, distinct from `Heap`, `Stack`,
and `Other`.

For each domain, include optional extra information for further
distinctions:

* `GCState`: offset
* `Global`: cty, index
* `Heap`: kind, tycon, cty, offset
* `Stack`: offset

Not all options lead to sound alias analysis; limitations are noted in
comments.
This allows marking an `_import`ed C function to have its prototype
emitted with the `inline` keyword.  This will be used to properly mark
Basis Library functions (as opposed to primitives) that are provided
as `inline` to be properly annotated as such in the emitted C
declarations.
Previously, functions meant to be inlined because they correspond to
primitives or Basis Library functions (e.g., `Real<N>` and
`Real<N>.Math`) were marked `static inline` when included via
`c-chunk.h`.  If a C compiler chooses not to inline a function (such
as clang at -O1), then each .o file included its own copy of the
function (and the copy of the function provided by `libmlton.a` was
not linked into the final executable).

Now, functions are marked `inline` when included via `c-chunk.h` (and
the corresponding `_import` is given the `inline` attribute).  The
C99/C11 semantics of `inline` requires the C compiler to *not* include
a copy of the function in the .o file (if it chooses not to inline the
function) and treat the function as an external reference.  The copy
of the function provided by `libmlton.a` is used to satisfy the
external reference when linking.
…inline))`

Although the functions are small and marked `inline`, clang at -O1
does not inline the functions (even with the `-finline-functions`
option).
It appears that GCC (and, to a lesser extent) Clang/LLVM do not always
successfully fuse adjacent `Word<N>_<op>` and
`Word{S,U}<N>_<op>CheckP` primitives.  The performance results
reported at MLton#273 and
MLton#292 suggest that this does not
always have significant impact, but a close look at the `md5`
benchmark shows that the native codegen significantly outperforms the
C codegen with gcc-9 due to redundant arithmetic computations (one for
`Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`).

This flag will be used to enable explicit fusing of adjacent
`Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives in the
codegens.
It appears that GCC (and, to a lesser extent) Clang/LLVM do not always
successfully fuse adjacent `Word<N>_<op>` and
`Word{S,U}<N>_<op>CheckP` primitives.  The performance results
reported at MLton#273 and
MLton#292 suggest that this does not
always have significant impact, but a close look at the `md5`
benchmark shows that the native codegen significantly outperforms the
C codegen with gcc-9 due to redundant arithmetic computations (one for
`Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`).

These functions compute both the arithmetic result and a boolean
indicating overflow (using `__builtin_<op>_overflow`).  They will be
used for explicit fusing of adjacent `Word<N>_<op>` and
`Word{S,U}<N>_<op>CheckP` primitives in the C codegen for
`-codegen-fuse-op-and-check true`.
It appears that GCC (and, to a lesser extent) Clang/LLVM do not always
successfully fuse adjacent `Word<N>_<op>` and
`Word{S,U}<N>_<op>CheckP` primitives.  The performance results
reported at MLton#273 and
MLton#292 suggest that this does not
always have significant impact, but a close look at the `md5`
benchmark shows that the native codegen significantly outperforms the
C codegen with gcc-9 due to redundant arithmetic computations (one for
`Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`).

(Note: Because the final md5 state is not used by the `md5` benchmark
program, MLton actually optimizes out most of the md5 computation.
What is left is a lot of arithmetic from `PackWord32Little.subVec` to
check for indices that should raise `Subscript`.)

For example, with `-codegen-fuse-op-and-check false` and gcc-9, the
`transform` function of `md5` has the following assembly:

	movl	%r9d, %r10d
	subl	$1, %r10d
	jo	.L650
	leal	-1(%r8), %r10d
	movl	%r10d, %r12d
	addl	%r10d, %edx
	jo	.L650
	addl	%r10d, %r11d
	cmpl	%eax, %r11d
	jnb	.L656
	movl	%ebp, %edx
	addl	$1, %edx
	jo	.L659
	leal	1(%rcx), %edx
	movl	%edx, %r11d
	imull	%r9d, %r11d
	jo	.L650
	imull	%r8d, %edx
	movl	%edx, %r11d
	addl	%r10d, %r11d
	jo	.L650
	leal	(%rdx,%r10), %r11d
	cmpl	%eax, %r11d
	jnb	.L665

What seems to have happened is that gcc has arranged for equivalent
values to be in `%r8` and `%r9`.  In the first three lines, there is
an implementation of `WordS32_subCheckP (X, 1)` using `subl/jo`, while
in the fourth line, there is an implementation of `Word32_sub (X, 1)`
using `lea` with an offset of `-1`.  Notice that `%r10` is used for
the result of both, so the fourth line is redundant (the value is
already in `%r10`).

On the other hand, with `-codegen-fuse-op-and-check true` and gcc-9,
the `transform` function of `md5` has the following assembly:

	movl	%r8d, %r9d
	subl	$1, %r9d
	jo	.L645
	addl	%r9d, %ecx
	jo	.L645
	cmpl	%edx, %ecx
	jnb	.L651
	movl	%eax, %ecx
	addl	$1, %ecx
	jo	.L654
	imull	%r8d, %ecx
	jo	.L645
	addl	%r9d, %ecx
	jo	.L645
	cmpl	%edx, %ecx
	jnb	.L660
On some small programs (with all `Chunk<N>` fns in the same
compilation unit), Clang could observe that all `Chunk<N>` fns return
the value `-2` (arising from a C call with no return point).  With
this knowledge, it would replace a tail call from one `Chunk<N>` fn to
a `Chunk<M>` fn with a non-tail call and an explicit `ret -2`.
Breaking the tail call and not performing tail-call optimization leads
to unbounded C stack growth and segmentation faults.  LLVM could make
the same optimization, but the LLVM codegen did not exhibit the same
problem (perhaps it requires a specific LLVM optimization pass that is
requested by Clang at `-O1`, but not included by default by opt at
`-O2`).

Obscuring the manifest result value by using a function call seems to
prevent the problem (though, Clang could observe that all `Chunk<N>`
fns return the value `MLton_unreachable()` and make the same
transformation, but presumable propagating a function call is
considered more expensive than propagating a constant).  It could
arise again with aggressive link-time optimization.
See MLton#190 and MLton#191

On systems (e.g., gcc 7.04 on Ubuntu 18.04) that error when linking
PIC and non-PIC code, the default behavior of `llc` to generate
non-PIC code leads to link-time errors.
When a Machine IR temp is used as a destination for
`Word{S,U}<N>_<op>AndCheck`, by having its address taken, Clang
sometimes fails to turn the `alloca` introduced for the C local
variable into an SSA variable.  Moreover, Clang introduces
`@llvm.lifetime.{start,end}` intrinsic calls at chunk entry and exit;
the call at the chunk exit (although they are no-ops) inhibit tail
call optimization.

Using a manifest temporary C local variable for the results of
`Word{S,U}<N>_<op>AndCheck` and then copying them into Machine IR
destination operands seems avoid the problem.
Fusing of adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP`
primitives in the C and LLVM codegens will now occur in either order.
@MatthewFluet MatthewFluet merged commit 7ab49d0 into MLton:master Nov 22, 2019
@MatthewFluet MatthewFluet deleted the c-and-llvm-codegen-updates branch November 22, 2019 14:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant