-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
C and LLVM codegen updates #351
Merged
MatthewFluet
merged 104 commits into
MLton:master
from
MatthewFluet:c-and-llvm-codegen-updates
Nov 22, 2019
Merged
C and LLVM codegen updates #351
MatthewFluet
merged 104 commits into
MLton:master
from
MatthewFluet:c-and-llvm-codegen-updates
Nov 22, 2019
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Rather than using `goto doLeaveChunk;`.
On `Return`, use mustReturnToSelf || (mayReturnToSelf && (nextChunks[nextBlock] == selfChunk)) to guard `goto doSwitchNextBlock`; this guarantees that the `ChunkSwitch` will only be entered with a block found in the chunk.
Some chunk functions may not use `gcState`, `stackTop`, `frontier`, or `selfChunk`.
Using GCC's label address and computed goto features (https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html#Labels-as-Values), one can force the generation of a jump table for the chunk switch. Although GCC and clang will typically implement dense `switch` statements into a jump table, consider the following: switch (nextBlock) { case 5: goto L_5; case 6: goto L_6; ... case 29: goto L_29; default: __builtin_unreachable(); } GCC-7 (and earlier) and clang appear to implement this as: int t = nextBlock - 5 if (t > 24) goto L_29; else goto *jumpTable[i]; That is, it still performs a range comparison. With an explicit jump table, GCC and clang will implement this as: goto *jumpTable[nextBlock - 5]; (where the -5 can be incorporated into the address computation). Unfortunately, the performance impact seems negligible.
* Make `ChunkSwitch` exhaustive * Use direct call in `Return` when exactly one non-self target chunk
Highlights: - Introduce a `structure LLVM` with sub-structures for `Type`, `Value`, `Instr`, `MetaData`, and `ModuleContext`. While much of the LLVM codegen still uses strings, the modules enforce a more correct usage. - Unify `implementsPrim` and `primApp`. - Eliminate a number of instances of code duplication in the translation of primitives. - Eliminate awkward `Context` type, which was a form of manual closure conversion. - Favor direct output and using `AppendList.t` over constructing large strings. - More closely match the C codegen.
Include `Global` and `GCState` domains, distinct from `Heap`, `Stack`, and `Other`. For each domain, include optional extra information for further distinctions: * `GCState`: offset * `Global`: cty, index * `Heap`: kind, tycon, cty, offset * `Stack`: offset Not all options lead to sound alias analysis; limitations are noted in comments.
This allows marking an `_import`ed C function to have its prototype emitted with the `inline` keyword. This will be used to properly mark Basis Library functions (as opposed to primitives) that are provided as `inline` to be properly annotated as such in the emitted C declarations.
Previously, functions meant to be inlined because they correspond to primitives or Basis Library functions (e.g., `Real<N>` and `Real<N>.Math`) were marked `static inline` when included via `c-chunk.h`. If a C compiler chooses not to inline a function (such as clang at -O1), then each .o file included its own copy of the function (and the copy of the function provided by `libmlton.a` was not linked into the final executable). Now, functions are marked `inline` when included via `c-chunk.h` (and the corresponding `_import` is given the `inline` attribute). The C99/C11 semantics of `inline` requires the C compiler to *not* include a copy of the function in the .o file (if it chooses not to inline the function) and treat the function as an external reference. The copy of the function provided by `libmlton.a` is used to satisfy the external reference when linking.
…inline))` Although the functions are small and marked `inline`, clang at -O1 does not inline the functions (even with the `-finline-functions` option).
It appears that GCC (and, to a lesser extent) Clang/LLVM do not always successfully fuse adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives. The performance results reported at MLton#273 and MLton#292 suggest that this does not always have significant impact, but a close look at the `md5` benchmark shows that the native codegen significantly outperforms the C codegen with gcc-9 due to redundant arithmetic computations (one for `Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`). This flag will be used to enable explicit fusing of adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives in the codegens.
It appears that GCC (and, to a lesser extent) Clang/LLVM do not always successfully fuse adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives. The performance results reported at MLton#273 and MLton#292 suggest that this does not always have significant impact, but a close look at the `md5` benchmark shows that the native codegen significantly outperforms the C codegen with gcc-9 due to redundant arithmetic computations (one for `Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`). These functions compute both the arithmetic result and a boolean indicating overflow (using `__builtin_<op>_overflow`). They will be used for explicit fusing of adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives in the C codegen for `-codegen-fuse-op-and-check true`.
It appears that GCC (and, to a lesser extent) Clang/LLVM do not always successfully fuse adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives. The performance results reported at MLton#273 and MLton#292 suggest that this does not always have significant impact, but a close look at the `md5` benchmark shows that the native codegen significantly outperforms the C codegen with gcc-9 due to redundant arithmetic computations (one for `Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`). (Note: Because the final md5 state is not used by the `md5` benchmark program, MLton actually optimizes out most of the md5 computation. What is left is a lot of arithmetic from `PackWord32Little.subVec` to check for indices that should raise `Subscript`.) For example, with `-codegen-fuse-op-and-check false` and gcc-9, the `transform` function of `md5` has the following assembly: movl %r9d, %r10d subl $1, %r10d jo .L650 leal -1(%r8), %r10d movl %r10d, %r12d addl %r10d, %edx jo .L650 addl %r10d, %r11d cmpl %eax, %r11d jnb .L656 movl %ebp, %edx addl $1, %edx jo .L659 leal 1(%rcx), %edx movl %edx, %r11d imull %r9d, %r11d jo .L650 imull %r8d, %edx movl %edx, %r11d addl %r10d, %r11d jo .L650 leal (%rdx,%r10), %r11d cmpl %eax, %r11d jnb .L665 What seems to have happened is that gcc has arranged for equivalent values to be in `%r8` and `%r9`. In the first three lines, there is an implementation of `WordS32_subCheckP (X, 1)` using `subl/jo`, while in the fourth line, there is an implementation of `Word32_sub (X, 1)` using `lea` with an offset of `-1`. Notice that `%r10` is used for the result of both, so the fourth line is redundant (the value is already in `%r10`). On the other hand, with `-codegen-fuse-op-and-check true` and gcc-9, the `transform` function of `md5` has the following assembly: movl %r8d, %r9d subl $1, %r9d jo .L645 addl %r9d, %ecx jo .L645 cmpl %edx, %ecx jnb .L651 movl %eax, %ecx addl $1, %ecx jo .L654 imull %r8d, %ecx jo .L645 addl %r9d, %ecx jo .L645 cmpl %edx, %ecx jnb .L660
On some small programs (with all `Chunk<N>` fns in the same compilation unit), Clang could observe that all `Chunk<N>` fns return the value `-2` (arising from a C call with no return point). With this knowledge, it would replace a tail call from one `Chunk<N>` fn to a `Chunk<M>` fn with a non-tail call and an explicit `ret -2`. Breaking the tail call and not performing tail-call optimization leads to unbounded C stack growth and segmentation faults. LLVM could make the same optimization, but the LLVM codegen did not exhibit the same problem (perhaps it requires a specific LLVM optimization pass that is requested by Clang at `-O1`, but not included by default by opt at `-O2`). Obscuring the manifest result value by using a function call seems to prevent the problem (though, Clang could observe that all `Chunk<N>` fns return the value `MLton_unreachable()` and make the same transformation, but presumable propagating a function call is considered more expensive than propagating a constant). It could arise again with aggressive link-time optimization.
…>_<op>CheckP` primitives
When a Machine IR temp is used as a destination for `Word{S,U}<N>_<op>AndCheck`, by having its address taken, Clang sometimes fails to turn the `alloca` introduced for the C local variable into an SSA variable. Moreover, Clang introduces `@llvm.lifetime.{start,end}` intrinsic calls at chunk entry and exit; the call at the chunk exit (although they are no-ops) inhibit tail call optimization. Using a manifest temporary C local variable for the results of `Word{S,U}<N>_<op>AndCheck` and then copying them into Machine IR destination operands seems avoid the problem.
Fusing of adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives in the C and LLVM codegens will now occur in either order.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Many updates to C and LLVM codegens. Highlights:
Machine.Program.rflow
to compute{returns,raises}To
control flow (654c557) and use infunctor Chunkify
(1b3b7b8) and in Machine IRRaise/Return
transfers (cf8e487).chunk-jump-table {false|true}
compile-time option to force generation of a jump table for the chunk switch (8e0dd2d, 5b6439b, 087a5b1).-chunk-{{must,may}-rto-self,must-rto-sing,must-rto-other}-opt
compile-time options to optimize return/raise transfers (7c10c70, 4d5abde, 4b7c649, c3b9905, 473808f)cc10
(aka,ghccc
) calling convention (2e26ebd).simple
chunkify strategy (3330cbe, 3d9c499, 138512f, faef164, d1df0de); generally performs about the same ascoalesce4096
, significantly improvesfib
andtak
(for GCC), slightly improveshamlet
, but slightly worsensraytrace
:simple
chunkify strategy is not (yet) suitable for a self-compile; it can generate excessively large chunks, including one for a self-compile that requires 8min to compile bygcc
.expect: WordX.t option
to RSSA and MachineSwitch.T
(911b5d4, e2b27ab, 695320d) and add-gc-expect {none|false|true}
compile-time option, where-gc-expect false
should indicate that performing a GC is cold path (823815a); no notable performance impact.c-chunk.h
macros.Machine.Operand.Contents
constructor (006269b).Real<N>_qequal
for C codegen (9b7b2bd) and useis{less,lessequal}
forReal<N>_l{t,e}
for C codegen (7b55819).-llvm-aamd scope
for simplenoalias
/alias.scope
alias-analysis metadata in LLVM codegen (b825f56); no notable performance impact.inline
for primitive and Basis Library functions (311331c, c864492, 4f2d213).-codegen-fuse-op-and-chk {false|true}
compile-time option to explicitly fuse adjacentWord<N>_<op>
andWord{S,U}<N>_<op>CheckP
primitives in the C and LLVM codegens (6b738b8, 3d1e89c, 68f8512, 82c019f, 61de560, 5363199, 0d46a85). It appears that GCC (and, to a lesser extent) Clang/LLVM do not always successfully fuse adjacent adjacentWord<N>_<op>
andWord{S,U}<N>_<op>CheckP
primitives. The performance results reported at Add new overflow-checking primitives #273 and Remove old-style arithmetic primitives #292 suggest that this does not always have significant impact, but sometimes-codegen-fuse-op-and-chk true
can have a positive. Unfortunately, it can also have a (significant) negative impact. Inmatrix-multiply
andvector-rev
, fusing can cause GCC to not recognize that an explicit sequence index can be replaced by a stride length; in these benchmarks, it would be nice if MLton eliminated the overflow checks.md5
mentioned in the commit messages are with respect to themd5
benchmark before 2daaebf.Overall, this simplifies the C and LLVM codegen slightly, although there is little significant performance change: