Llvm location aliasing information #324

Merged
merged 24 commits into from
Jul 18, 2019

jasoncarr0
Contributor

@jasoncarr0 jasoncarr0 commented Jun 22, 2019

Although the type information had very little effect, passing aliasing information about the stack and heap was sufficient to affect many programs.

The only information included distinguishes the kind of operand: only StackOffset/Offset/SequenceOffset are included. Including more causes major compile-time slowdowns, even with relatively small sets, because the !noalias information must be reported in sets of total size n^2. Given that the only ways to pass information to LLVM are this and type-alias information, it seems we would need to get an update into LLVM to support more without heavy slowdown, or to have it handled as a performance bug.
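
To make the n^2 cost concrete, here is a rough sketch in Python. It assumes a simplified model in which there are n pairwise-disjoint location classes and each class's !noalias list must name every other class, and compares that with a flat TBAA-style tree that needs one node per class plus a root. The function names and the counting model are illustrative assumptions, not MLton or LLVM code.

```python
def noalias_list_entries(n: int) -> int:
    """Total scope references if each of n disjoint classes lists
    the other n - 1 classes in its !noalias set (simplified model)."""
    return n * (n - 1)


def tbaa_tree_nodes(n: int) -> int:
    """Nodes in a flat TBAA-style tree: one root plus one child per class."""
    return n + 1


if __name__ == "__main__":
    # Print how the two encodings grow as the number of classes increases.
    for n in (4, 16, 64, 256):
        print(n, noalias_list_entries(n), tbaa_tree_nodes(n))
```

Under this model, the scope-list encoding grows quadratically while the tree grows linearly. That would be consistent with finer-grained classes (such as per-offset indices) being costly to express with scopes.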

It's a bit iffy whether it's worth it. Both versions have considerable compile-time costs (which feels like an LLVM bug, considering one of them has only four disjoint classes). There's a mix of slowdowns and speedups, with perhaps a slight bias towards speedups. I've attached two output files: one with no indices, and the other with indices. Both were run on my machine, so some differences are expected, e.g. from I/O.

detailed-out.txt
simple-out.txt

Some code for object pointers is left in from attempting type-based aliasing. RepType.deObjptrs does fill a gap, and I ran into the need for it in other code. Otherwise, though, we can revert and squash to reduce churn.

@jasoncarr0 jasoncarr0 marked this pull request as ready for review June 22, 2019 00:04
Member

@MatthewFluet MatthewFluet left a comment

Some comments about the SimpleOper.t datatype.

    val fromOper =
       fn Operand.StackOffset _ => Stack
        | Operand.Offset _ => Offset
        | Operand.SequenceOffset _ => SequenceOffset

Member

@MatthewFluet MatthewFluet Jun 25, 2019

These aren't quite accurate. For Operand.Offset and Operand.SequenceOffset, we need to look at the type of the base field, because both operands can be used for non-ML data (e.g., Operand.Offset is used to access fields of GCState and Operand.SequenceOffset is used to implement the CPointer_{get,set}<ty> primitives).

Contributor Author

I've filtered based on types and put non-objptrs into the Other category (which is disjoint, rather than unknown). That should be sound for our flows, as far as I can tell.
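
The filtering described here can be sketched roughly as follows in Python. The `Operand` record and its `base_is_objptr` flag are hypothetical stand-ins for the Machine IL operand and the type of its base field; the actual implementation is the SML `SimpleOper.fromOper` in the PR.

```python
from dataclasses import dataclass
from enum import Enum


class Loc(Enum):
    STACK = "Stack"
    OFFSET = "Offset"
    SEQUENCE_OFFSET = "SequenceOffset"
    OTHER = "Other"


@dataclass(frozen=True)
class Operand:
    kind: str                     # "StackOffset", "Offset", "SequenceOffset", ...
    base_is_objptr: bool = False  # hypothetical: is the base an ML object pointer?


def classify(op: Operand) -> Loc:
    """Assign an aliasing class; non-objptr bases fall into OTHER."""
    if op.kind == "StackOffset":
        return Loc.STACK
    if op.kind == "Offset" and op.base_is_objptr:
        return Loc.OFFSET
    if op.kind == "SequenceOffset" and op.base_is_objptr:
        return Loc.SEQUENCE_OFFSET
    # e.g. Offset into GCState, or SequenceOffset used for CPointer_{get,set}
    return Loc.OTHER
```

In this sketch an `Offset` whose base is not an object pointer (such as a GCState field access) lands in `OTHER` rather than being treated as a heap access.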


structure SimpleOper = struct

datatype t = Stack
Member

While I agree that Offset and SequenceOffset would not alias, a simple Stack | Heap | Other distinction seems simpler.

Contributor Author

@jasoncarr0 jasoncarr0 Jun 30, 2019

Since I've added the offsets back in, and the complexity seems to have little effect on time for TBAA-based aliasing, I'd argue that there's little benefit to that simplicity any more.

@jasoncarr0
Contributor Author

jasoncarr0 commented Jun 25, 2019

Indeed, using type-based analysis instead of the alias.scope infrastructure was more performant for LLVM. With the full information, hamlet takes 32 seconds to compile, versus 37 with alias.scope and 30 with no aliasing information (and about 32 as well for the simple information).

I didn't see your comments when pushing the code above.

@jasoncarr0
Contributor Author

jasoncarr0 commented Jun 30, 2019

Here are some results from testing on oxygen with the current version (Stack/Offset/SequenceOffset with offsets via TBAA).
The flat-array result is invalid, as the benchmark gets eliminated entirely by LLVM (run time is 0.00s).

Mostly positive; some slowdowns are consistent, and others (lexgen) appear to be noise.

out.txt

@jasoncarr0
Contributor Author

jasoncarr0 commented Jul 3, 2019

I forgot to upload this:
Merging all stack values together does still affect some code, but the overall impact is absolutely negligible no matter which way. With this accuracy I can't distinguish any time differences from noise.

Third column is with stack indices not distinguished.
all-three.txt

The current commit distinguishes the different stack indices.

@jasoncarr0
Contributor Author

> -llvm-include-aliasing-info {true|false} might be better (to put all of the LLVM codegen compile-time options together).

Good point, applied the change.

@jasoncarr0
Contributor Author

I started a preliminary version, which is a bit unsound (having 1 objptr disjoint from unknown, rather than nested, and missing the vector/array fixes). The overall benefit seems negligible from quick testing, and the compile time seems to actually worsen on hamlet (or other larger programs), so I think we don't get much more benefit than the Stack/Heap info we had.

For master vs aliasing with types (compile time, in seconds):

benchmark  MLton0  MLton1
hamlet      36.10   43.20

(Just location aliasing brought it up to 39 seconds.)

Hamlet.MLton0.batch.0.ll has 938 lines of metadata for tbaa with these changes. Similar or more for other files in the batch.

@MatthewFluet
Member

Benchmark results (sulfur; g357a28440)

Specs:

  • 2 x Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz (8 physical cores; 16 logical cores)
  • Ubuntu 16.04.6 LTS
  • llvm: LLVM version 8.0.0

config command
C00    /home/mtf/devel/mlton/builds/20190708.201801-g357a28440/bin/mlton -codegen llvm -disable-pass bounceVars -llvm-include-aliasing-info false
C01    /home/mtf/devel/mlton/builds/20190708.201801-g357a28440/bin/mlton -codegen llvm -enable-pass bounceVars -llvm-include-aliasing-info false
C02    /home/mtf/devel/mlton/builds/20190708.201801-g357a28440/bin/mlton -codegen llvm -disable-pass bounceVars -llvm-include-aliasing-info true
C03    /home/mtf/devel/mlton/builds/20190708.201801-g357a28440/bin/mlton -codegen llvm -enable-pass bounceVars -llvm-include-aliasing-info true

Run-Time Ratio

This table shows the effect of -llvm-include-aliasing-info, without and with bounceVars:

program             C02/C00   C03/C01
barnes-hut           0.9920    0.9554
boyer                1.007     0.9484
checksum             0.9913    1.042 
count-graphs         0.9546    0.9489
DLXSimulator         0.9961    1.037 
even-odd             1.002     1.035 
fft                  1.056     0.9688
fib                  1.029     1.031 
flat-array           0.9684    0.9831
hamlet               0.9905    1.009 
imp-for              0.9996    0.8881
knuth-bendix         0.9853    0.9843
lexgen               0.9495    0.9481
life                 1.008     1.005 
logic                0.9880    1.002 
mandelbrot           1.009     0.9757
matrix-multiply      1.015     0.9985
md5                  1.135     0.9665
merge                1.042     1.054 
mlyacc               0.9979    0.9622
model-elimination    1.011     0.9846
mpuz                 0.9845    0.9215
nucleic              1.022     1.001 
output1              0.9302    1.009 
peek                 1.008     0.9801
pidigits             1.060     1.024 
psdes-random         0.9985    0.9991
ratio-regions        0.9589    1.029 
ray                  0.9877    0.9970
raytrace             1.024     1.016 
simple               1.008     0.9872
smith-normal-form    0.9877    0.9821
string-concat        0.9689    1.048 
tailfib              0.9536    0.9735
tailmerge            0.9701    1.018 
tak                  1.020     1.003 
tensor               0.9844    1.013 
tsp                  0.9830    1.023 
tyan                 1.019     0.9881
vector32-concat      0.9977    1.005 
vector64-concat      0.9822    1.023 
vector-rev           0.9076    1.006 
vliw                 1.018     1.053 
wc-input1            0.9932    0.9762
wc-scanStream        0.9834    1.043 
zebra                0.9942    0.9954
zern                 0.9954    0.9820
MIN                  0.9076    0.8881
GMEAN                0.9966    0.9956
MAX                  1.135     1.054 

There does seem to be some overlap in the effects of -llvm-include-aliasing-info and bounceVars; for example, the C02/C00 speedups of vector-rev and output1 are not present in C03/C01.

Unfortunately, -llvm-include-aliasing-info does not seem to add much over bounceVars; the largest C03/C01 speedups of imp-for and mpuz simply recover some performance that was lost by bounceVars.

This table shows the run-time ratios of all configurations relative to C00:

program             C00    C01    C02    C03
barnes-hut            1 1.005  0.9920 0.9604
boyer                 1 0.9773 1.007  0.9269
checksum              1 1.001  0.9913 1.043 
count-graphs          1 0.9818 0.9546 0.9317
DLXSimulator          1 0.7390 0.9961 0.7663
even-odd              1 0.9785 1.002  1.013 
fft                   1 1.021  1.056  0.9890
fib                   1 0.9947 1.029  1.025 
flat-array            1 0.9735 0.9684 0.9570
hamlet                1 1.014  0.9905 1.023 
imp-for               1 1.127  0.9996 1.001 
knuth-bendix          1 0.9941 0.9853 0.9785
lexgen                1 1.003  0.9495 0.9510
life                  1 0.9925 1.008  0.9971
logic                 1 0.9911 0.9880 0.9932
mandelbrot            1 1.026  1.009  1.001 
matrix-multiply       1 0.9970 1.015  0.9955
md5                   1 1.062  1.135  1.026 
merge                 1 1.098  1.042  1.157 
mlyacc                1 1.047  0.9979 1.007 
model-elimination     1 0.9851 1.011  0.9700
mpuz                  1 1.042  0.9845 0.9604
nucleic               1 0.9860 1.022  0.9871
output1               1 0.8691 0.9302 0.8767
peek                  1 1.016  1.008  0.9962
pidigits              1 1.009  1.060  1.034 
psdes-random          1 0.9993 0.9985 0.9984
ratio-regions         1 0.9784 0.9589 1.007 
ray                   1 1.013  0.9877 1.010 
raytrace              1 0.9992 1.024  1.015 
simple                1 1.028  1.008  1.015 
smith-normal-form     1 1.013  0.9877 0.9944
string-concat         1 0.9783 0.9689 1.025 
tailfib               1 0.9703 0.9536 0.9446
tailmerge             1 0.9707 0.9701 0.9877
tak                   1 1.034  1.020  1.037 
tensor                1 0.9878 0.9844 1.000 
tsp                   1 1.005  0.9830 1.028 
tyan                  1 1.020  1.019  1.008 
vector32-concat       1 0.9925 0.9977 0.9974
vector64-concat       1 0.9395 0.9822 0.9615
vector-rev            1 0.9126 0.9076 0.9181
vliw                  1 0.8854 1.018  0.9324
wc-input1             1 1.211  0.9932 1.182 
wc-scanStream         1 1.157  0.9834 1.206 
zebra                 1 0.9935 0.9942 0.9889
zern                  1 1.021  0.9954 1.002 
MIN                   1 0.7390 0.9076 0.7663
GMEAN                 1 0.9984 0.9966 0.9940
MAX                   1 1.211  1.135  1.206 
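
The two run-time tables are related: each C03/C01 entry should equal the C03 column divided by the C01 column of the table relative to C00, up to rounding. The following Python sketch checks that for three rows copied from the tables and includes a geometric-mean helper of the kind that would produce a GMEAN row. The helper names are illustrative and are not taken from the benchmark scripts.

```python
import math

# (C01, C03) run-time ratios relative to C00, copied from the table above.
relative_to_c00 = {
    "barnes-hut": (1.005, 0.9604),
    "imp-for": (1.127, 1.001),
    "mpuz": (1.042, 0.9604),
}


def c03_over_c01(c01: float, c03: float) -> float:
    """Effect of aliasing info with bounceVars enabled in both configurations."""
    return c03 / c01


def geometric_mean(ratios):
    """Geometric mean, the usual way to average ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    for name, (c01, c03) in relative_to_c00.items():
        print(f"{name}: C03/C01 ~= {c03_over_c01(c01, c03):.4f}")
```

For imp-for, for instance, 1.001 / 1.127 is about 0.888, which matches the 0.8881 reported in the first table and reflects C03 recovering the slowdown that bounceVars alone introduced in C01.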

Compile-Time Ratio

program             C00    C01    C02    C03
barnes-hut            1 0.9864 1.024  1.011 
boyer                 1 1.008  1.020  0.9875
checksum              1 1.020  0.9836 0.9768
count-graphs          1 0.9980 0.9869 1.019 
DLXSimulator          1 1.036  1.035  1.025 
even-odd              1 1.025  1.010  0.9717
fft                   1 0.9910 1.036  1.035 
fib                   1 0.9903 0.9900 1.034 
flat-array            1 1.027  1.028  0.9966
hamlet                1 1.072  1.143  1.114 
imp-for               1 1.010  1.028  1.031 
knuth-bendix          1 1.011  1.028  1.014 
lexgen                1 1.056  1.105  1.113 
life                  1 1.011  0.9711 0.9613
logic                 1 0.9897 1.158  1.115 
mandelbrot            1 1.057  1.019  1.094 
matrix-multiply       1 1.036  0.9907 1.023 
md5                   1 1.015  0.9998 0.9989
merge                 1 0.9488 0.9801 0.9855
mlyacc                1 1.236  1.150  1.345 
model-elimination     1 1.037  1.028  1.026 
mpuz                  1 1.055  1.092  1.036 
nucleic               1 0.9935 0.9963 1.034 
output1               1 1.004  0.9981 1.029 
peek                  1 1.033  0.9765 1.008 
pidigits              1 1.026  1.053  1.031 
psdes-random          1 1.113  1.041  1.179 
ratio-regions         1 1.078  1.086  1.103 
ray                   1 1.001  1.045  1.057 
raytrace              1 0.9337 0.9635 0.9848
simple                1 1.014  1.089  1.147 
smith-normal-form     1 1.031  1.006  0.9485
string-concat         1 0.9748 0.9882 1.015 
tailfib               1 1.023  1.020  1.034 
tailmerge             1 0.9973 0.9243 1.009 
tak                   1 1.034  1.031  1.017 
tensor                1 1.024  1.085  1.069 
tsp                   1 1.024  0.9996 1.027 
tyan                  1 1.035  1.021  0.9816
vector32-concat       1 0.9965 0.9901 1.011 
vector64-concat       1 0.9915 0.9788 0.9758
vector-rev            1 0.9532 0.9980 0.9686
vliw                  1 1.082  1.098  1.185 
wc-input1             1 1.063  1.035  1.057 
wc-scanStream         1 0.9963 1.047  1.025 
zebra                 1 1.019  1.033  1.103 
zern                  1 1.024  1.011  1.055 
MIN                   1 0.9337 0.9243 0.9485
GMEAN                 1 1.022  1.027  1.040 
MAX                   1 1.236  1.158  1.345 

Executable-Size Ratio

program             C00    C01    C02    C03
barnes-hut            1 1.003  0.9954 1.001 
boyer                 1 0.9976 0.9945 0.9964
checksum              1 1.000  0.9948 0.9965
count-graphs          1 1.000  0.9922 0.9956
DLXSimulator          1 1.011  0.9981 1.008 
even-odd              1 0.9999 0.9946 0.9964
fft                   1 1.003  0.9976 1.002 
fib                   1 1.000  0.9945 0.9963
flat-array            1 0.9994 0.9948 0.9965
hamlet                1 0.9974 0.9928 0.9903
imp-for               1 0.9989 0.9947 0.9960
knuth-bendix          1 1.002  0.9986 0.9994
lexgen                1 1.028  0.9960 1.024 
life                  1 1.000  0.9989 1.001 
logic                 1 0.9997 0.9915 0.9937
mandelbrot            1 0.9996 0.9953 0.9966
matrix-multiply       1 0.9996 0.9955 0.9964
md5                   1 1.008  1.003  1.006 
merge                 1 0.9996 0.9951 0.9963
mlyacc                1 1.181  0.9977 1.184 
model-elimination     1 1.000  0.9930 0.9952
mpuz                  1 1.001  0.9946 0.9981
nucleic               1 0.9989 0.9950 0.9968
output1               1 1.007  0.9987 1.002 
peek                  1 1.000  0.9950 0.9970
pidigits              1 1.005  1.000  1.001 
psdes-random          1 0.9987 0.9968 0.9974
ratio-regions         1 1.007  0.9897 0.9976
ray                   1 1.021  0.9961 1.017 
raytrace              1 1.014  0.9927 1.000 
simple                1 1.014  0.9930 1.006 
smith-normal-form     1 1.008  1.002  1.007 
string-concat         1 0.9985 0.9971 0.9975
tailfib               1 0.9995 0.9950 0.9966
tailmerge             1 1.000  0.9945 0.9971
tak                   1 1.000  0.9942 0.9967
tensor                1 1.023  1.000  1.020 
tsp                   1 1.007  0.9998 1.006 
tyan                  1 1.034  1.001  1.032 
vector32-concat       1 0.9984 0.9939 0.9955
vector64-concat       1 0.9986 0.9939 0.9956
vector-rev            1 0.9988 0.9946 0.9960
vliw                  1 1.032  0.9927 1.027 
wc-input1             1 1.011  1.001  1.007 
wc-scanStream         1 1.006  1.000  1.002 
zebra                 1 1.007  0.9961 1.0000
zern                  1 1.003  0.9991 1.002 
MIN                   1 0.9974 0.9897 0.9903
GMEAN                 1 1.009  0.9961 1.005 
MAX                   1 1.181  1.003  1.184 

@MatthewFluet
Member

I've updated Machine.Statement.object and have a few other minor edits at https://github.com/MatthewFluet/mlton/tree/llvm-location-aliasing. Could you enable Allow edits from maintainers so they can be pushed here?

@MatthewFluet
Member

If I'm understanding SimpleOper.fromOper and rawOperScopes correctly, then it doesn't seem that an Object (NONE, i) is recorded as a parent node of Object (SOME ty, i).
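
As a rough illustration of why the parent relationship matters, the Python sketch below models the scalar TBAA rule that two accesses may alias only when one type is the other or an ancestor of the other in the type tree. The node names and the dictionary encoding of the tree are assumptions made for this example; they are not the metadata that the codegen emits.

```python
def ancestors(node, parent):
    """Return node followed by all of its ancestors in the type tree."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def may_alias(a, b, parent):
    """Scalar-TBAA-style rule: may alias only if one type is the
    other or an ancestor of the other."""
    return a in ancestors(b, parent) or b in ancestors(a, parent)


# Sibling layout: typed and untyped object nodes both hang off the root.
siblings = {"Object(NONE,0)": "root", "Object(SOME ty,0)": "root"}
# Nested layout: the untyped node is the parent of the typed node.
nested = {"Object(NONE,0)": "root", "Object(SOME ty,0)": "Object(NONE,0)"}
```

With the sibling layout, `may_alias("Object(NONE,0)", "Object(SOME ty,0)", siblings)` is false, so an access of unknown type would be treated as disjoint from a typed access to the same field. With the nested layout it is true.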

@jasoncarr0
Contributor Author

jasoncarr0 commented Jul 17, 2019

> I've updated Machine.Statement.object and have a few other minor edits at https://github.com/MatthewFluet/mlton/tree/llvm-location-aliasing. Could you enable Allow edits from maintainers so they can be pushed here?

Sorry about that, I'll allow that here, but...

> If I'm understanding SimpleOper.fromOper and rawOperScopes correctly, then it doesn't seem that an Object (NONE, i) is recorded as a parent node of Object (SOME ty, i).

This was the note I made in the comment above: 1af34cb is not fully sound. It also has no real benefits and a lot of costs. Since the unsound version should, if anything, run faster and compile faster than a sound one, fixing it would only make things worse, so I wanted to drop it. I only left 1af34cb here for visibility. 95e8e37 is sound and much better for compile time, with no real run-time differences. I would recommend rebasing back and force-pushing.

@MatthewFluet
Member

O.k. So, we should revert 1af34cb before merging?

@MatthewFluet
Member

Oops, I missed your last sentence: rebasing back and force pushing. I'll do that.

Change the object allocation sequence from

    CW(Frontier) = header;
    dst = Frontier + NORMAL_METADATA_SIZE;
    Frontier += size;

to

    dst = Frontier + NORMAL_METADATA_SIZE;
    OW(dst, ~GC_HEADER_SIZE) = header;
    Frontier += size;

This ensures that the write to the heap is through an `Offset` operand
with an `Objptr` base, so that the LLVM tbaa will treat the store as
one into the heap.
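
The two sequences write the header to the same address; only the operand through which the store is expressed changes. The Python sketch below simulates both on a byte array. It assumes that NORMAL_METADATA_SIZE and GC_HEADER_SIZE are both 8 bytes (an assumption for illustration, not a value stated in this thread), and the function names are hypothetical.

```python
import struct

NORMAL_METADATA_SIZE = 8  # assumed: bytes of metadata before the object payload
GC_HEADER_SIZE = 8        # assumed: bytes occupied by the header word


def alloc_old(heap, frontier, header, size):
    # CW(Frontier) = header: store through the raw frontier pointer.
    struct.pack_into("<Q", heap, frontier, header)
    dst = frontier + NORMAL_METADATA_SIZE
    return dst, frontier + size


def alloc_new(heap, frontier, header, size):
    dst = frontier + NORMAL_METADATA_SIZE
    # OW(dst, ~GC_HEADER_SIZE) = header: store at a negative offset
    # from the new object pointer.
    struct.pack_into("<Q", heap, dst - GC_HEADER_SIZE, header)
    return dst, frontier + size
```

Under the stated size assumption, both functions return the same `(dst, new_frontier)` pair and leave the simulated heap in the same state, so the change affects only how the store is classified for aliasing.
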
@MatthewFluet
Member

Some notes for future improvements to the alias-analysis metadata:

  • The {X86,AMD64}.MemLoc.Class.t is used by the native codegens for
    (very rudimentary) may-alias analysis, where distinct classes are
    assumed to be disjoint. It uses Heap, Stack, Locals (for
    Operand.Register), Globals, and GCState.

  • According to the LLVM language reference, TBAA metadata should be an
    access tag, which references a type descriptor, but the TBAA
    metadata emitted by the LLVM codegen is just the type descriptor.
    That is, rather than

    %x = load i64, i64* %p, !tbaa !1
    
    !0 = !{!"operRoot"}
    !1 = !{!"Stack 8", !0, i64 0}
    

    it should be

    %x = load i64, i64* %p, !tbaa !2
    
    !0 = !{!"operRoot"}
    !1 = !{!"Stack 8", !0}
    !2 = !{!1, !1, i64 0}
    

    However, LLVM opt seems to recognize the former and translate it
    to the latter.

  • The cost/benefit of including indices in the alias domains has not
    been fully investigated/justified. For example, rather than

    datatype t = Stack of int
               | Offset of int
               | SequenceOffset
               | Other
    

    the simpler

    datatype t = Heap | Stack | Other
    

    or, the more sophisticated

    datatype t = Object of ObjptrTycon.t option * int
               | Stack of int
               | Other
    

    There is one known complication with using ObjptrTycon.t for
    greater precision: a sequence may be accessed at two distinct
    ObjptrTycon.ts, once as an array and then as a vector, due to
    Array_toVector. However, such ObjptrTycon.ts would only differ
    in the hasIdentity component of their OBJECT_TYPE.Sequence, so
    it might suffice to map the two such corresponding ObjptrTycon.ts
    to a canonical representative for the purposes of assigning an
    aliasing domain.

  • There is a slight unsoundness with exception raising. With
    "Raise values via stack" (#321), we now raise exception values via the
    ML stack.
    The key interaction with respect to aliasing information is at:
    https://github.com/MLton/mlton/pull/321/files#diff-bb88a6ee07c8d914cad73fe2a7bdee26R850
    Essentially, we compute the new stack top in a temporary, then write
    via that temporary with an Operand.Offset, then (in the codegen
    implementation of Machine.Transfer.Raise), set the new stack top.
    So, this is a situation where we use Operand.Offset of a CPointer,
    but are actually writing to the ML stack. One possible "fix" might be
    to change

    | Stack of StackOffset.t
    

    to

    | Stack of {base: t option, offset: StackOffset.t}
    

    or even

    | Stack of {base: t, offset: StackOffset.t}
    

    Most of the time, base would be NONE or StackTop, but for the
    exception raising, we would use the temporary operand. In either
    case, we would know that the aliasing scope of a Stack operand was
    Stack.
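
As a sketch of the access-tag form described in the TBAA note above, the following Python function generates metadata lines for a list of alias-domain names, emitting one type descriptor and one access tag per domain under a common root. The function name and the returned id map are hypothetical; only the metadata shapes are taken from the example in that note.

```python
def tbaa_metadata(domains, root="operRoot"):
    """Return (lines, tag_ids): metadata lines in access-tag form and a
    map from domain name to the id to reference as `!tbaa !N`."""
    lines = [f'!0 = !{{!"{root}"}}']
    tag_ids = {}
    next_id = 1
    for name in domains:
        type_id, tag_id = next_id, next_id + 1
        next_id += 2
        # Type descriptor: domain name and its parent (the root).
        lines.append(f'!{type_id} = !{{!"{name}", !0}}')
        # Access tag: base type, access type, offset.
        lines.append(f'!{tag_id} = !{{!{type_id}, !{type_id}, i64 0}}')
        tag_ids[name] = tag_id
    return lines, tag_ids


if __name__ == "__main__":
    lines, tags = tbaa_metadata(["Stack 8", "Offset 0", "Other"])
    print("\n".join(lines))
```

For the single domain "Stack 8" this reproduces the three lines `!0`, `!1`, and `!2` from the note, with `!2` being the tag that a load or store would reference.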

@MatthewFluet MatthewFluet merged commit a1faf76 into MLton:master Jul 18, 2019