Remove old-style arithmetic primitives #292

Merged (19 commits) on Mar 21, 2019


daemanos
Contributor

In #273, MLton was extended to support overflow-checking primitives for arithmetic operations. This allowed checked arithmetic operations to be encoded as normal if-statements that raise an Overflow exception when appropriate, obviating the need for a special PrimOverflow exception and its supporting infrastructure in the various IR datatypes. Until now, however, the old Arith transfer-style primitives have remained in place. This pull request removes the old primitives entirely, along with the special-case code required to support them. That simplifies a number of datatypes and optimizations, which no longer need to keep track of arithmetic overflows and can instead rely on the normal exception infrastructure.
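
To illustrate the encoding described above, here is a minimal C sketch of a checked addition written as an ordinary branch on an overflow predicate, followed by a plain wrapping add. The helper names are hypothetical stand-ins modeled on the `Word{S,U}<N>_<op>CheckP` and `Word<N>_<op>` primitives mentioned in the commit messages later in this thread; they are not MLton's actual runtime functions, and "raise Overflow" is modeled here as a `false` return value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a WordS32_addCheckP-style predicate:
   true when the signed 32-bit addition x + y would overflow.
   Uses the GCC/Clang builtin and discards the computed result. */
static bool words32_add_check_p(int32_t x, int32_t y) {
    int32_t ignored;
    return __builtin_add_overflow(x, y, &ignored);
}

/* Hypothetical stand-in for a Word32_add-style primitive: wrapping
   addition with no check. The arithmetic is done on unsigned values
   to avoid signed-overflow undefined behavior; the conversion back
   to int32_t wraps on GCC and Clang. */
static int32_t word32_add(int32_t x, int32_t y) {
    return (int32_t)((uint32_t)x + (uint32_t)y);
}

/* Checked addition as a normal if-statement. Returning false models
   raising Overflow through the ordinary exception path, so no
   special arithmetic transfer is needed. */
static bool checked_add(int32_t x, int32_t y, int32_t *out) {
    if (words32_add_check_p(x, y)) {
        return false;
    }
    *out = word32_add(x, y);
    return true;
}
```

In this shape the overflow test is just another conditional, so passes that already understand branches need no arithmetic-specific cases.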

@MatthewFluet
Member

Performance

MLton0 -- /home/mtf/devel/mlton/builds/20190217.152511-g9ba427a/bin/mlton -codegen amd64
MLton1 -- /home/mtf/devel/mlton/builds/20190319.152439-g9d251db/bin/mlton -codegen amd64
MLton2 -- /home/mtf/devel/mlton/builds/20190217.152511-g9ba427a/bin/mlton -codegen c
MLton3 -- /home/mtf/devel/mlton/builds/20190319.152439-g9d251db/bin/mlton -codegen c
MLton4 -- /home/mtf/devel/mlton/builds/20190217.152511-g9ba427a/bin/mlton -codegen llvm
MLton5 -- /home/mtf/devel/mlton/builds/20190319.152439-g9d251db/bin/mlton -codegen llvm
run time ratio
benchmark         MLton0 MLton1 MLton2 MLton3 MLton4 MLton5
barnes-hut          1.00   0.98   1.00   1.01   0.98   0.97
boyer               1.00   1.02   0.99   0.99   1.00   1.00
checksum            1.00   1.00   1.12   0.95   0.76   0.70
count-graphs        1.00   1.00   0.91   0.81   0.96   0.88
DLXSimulator        1.00   1.00   1.01   1.01   0.96   0.97
even-odd            1.00   1.00   1.11   1.20   1.00   1.00
fft                 1.00   0.98   0.90   0.90   0.84   0.84
fib                 1.00   1.00   1.19   1.18   1.08   1.22
flat-array          1.00   1.00   2.29   2.29   0.00   0.00
hamlet              1.00   1.01   2.17   2.08   1.99   2.21
imp-for             1.00   1.00   1.35   0.97   0.45   0.45
knuth-bendix        1.00   1.00   1.06   1.52   1.18   1.45
lexgen              1.00   0.99   0.94   1.01   0.90   0.96
life                1.00   1.00   1.05   1.04   1.09   1.04
logic               1.00   1.03   1.11   1.11   1.09   1.10
mandelbrot          1.00   1.00   0.38   0.42   0.29   0.29
matrix-multiply     1.00   1.00   0.75   0.57   0.48   0.48
md5                 1.00   1.00   1.31   1.28   1.03   1.03
merge               1.00   0.99   1.00   1.00   0.99   0.99
mlyacc              1.00   0.95   1.11   1.08   1.13   1.06
model-elimination   1.00   1.02   1.54   1.59   1.43   1.47
mpuz                1.00   1.00   0.85   0.83   0.48   0.50
nucleic             1.00   0.99   0.86   0.85   0.87   0.87
output1             1.00   1.00   1.19   1.12   1.05   1.09
peek                1.00   1.00   1.05   1.07   0.17   0.16
psdes-random        1.00   1.00   0.73   0.75   0.80   0.80
ratio-regions       1.00   1.00   1.01   1.01   0.96   0.96
ray                 1.00   0.96   0.90   0.91   0.95   1.01
raytrace            1.00   0.99   0.99   0.97   0.97   0.99
simple              1.00   1.02   1.15   1.17   1.29   1.16
smith-normal-form   1.00   0.99   0.99   0.99   0.98   0.99
string-concat       1.00   1.00   1.01   1.02   0.30   0.27
tailfib             1.00   1.01   0.64   0.54   0.42   0.37
tak                 1.00   1.00   1.20   1.11   1.12   1.11
tensor              1.00   0.99   1.07   0.60   0.31   0.31
tsp                 1.00   1.00   0.70   0.70   0.69   0.68
tyan                1.00   1.00   1.10   1.12   1.06   1.05
vector32-concat     1.00   1.00   1.03   1.03   0.28   0.28
vector64-concat     1.00   1.00   1.03   1.03   0.38   0.37
vector-rev          1.00   1.00   1.02   1.01   0.66   0.67
vliw                1.00   0.97   1.17   1.14   0.98   1.04
wc-input1           1.00   1.00   1.45   0.99   0.99   0.98
wc-scanStream       1.00   1.00   1.90   1.17   1.04   1.02
zebra               1.00   1.00   0.98   0.98   1.03   1.02
zern                1.00   1.00   0.97   0.98   0.79   0.80
size
benchmark            MLton0    MLton1     MLton2    MLton3    MLton4    MLton5
barnes-hut          176,143   176,159    174,200   172,888   166,791   166,791
boyer               243,313   243,313    236,417   236,593   219,249   219,233
checksum            117,505   117,617    123,841   124,113   116,513   116,593
count-graphs        145,009   142,609    148,161   146,753   138,129   137,329
DLXSimulator        209,020   209,020    210,676   210,588   199,284   200,060
even-odd            117,473   117,473    123,905   124,065   116,625   116,625
fft                 142,251   142,251    147,030   146,806   131,990   132,134
fib                 117,393   117,393    123,777   123,937   116,561   116,577
flat-array          117,121   117,121    123,553   123,617   116,065   116,065
hamlet            1,434,172 1,368,524  1,467,332 1,409,796 1,534,804 1,501,716
imp-for             117,185   117,185    123,377   123,521   116,225   116,225
knuth-bendix        186,060   186,060    189,044   189,124   179,188   179,380
lexgen              290,875   290,891    305,459   305,491   292,483   293,683
life                141,057   141,057    144,961   145,249   135,953   135,953
logic               197,361   197,361    197,481   197,721   179,705   179,641
mandelbrot          117,217   117,233    127,185   127,329   116,225   116,225
matrix-multiply     119,521   119,521    128,785   128,833   117,249   117,249
md5                 144,620   144,620    148,124   148,460   139,404   139,452
merge               118,897   118,897    124,977   125,121   117,681   117,681
mlyacc              643,499   640,811    653,563   648,379   643,259   639,499
model-elimination   795,998   793,646    834,998   819,422   894,246   894,206
mpuz                123,489   123,489    129,441   129,489   121,553   121,649
nucleic             297,193   297,193    272,554   273,114   268,826   268,826
output1             151,712   151,712    155,312   155,440   146,528   147,472
peek                150,108   150,108    153,740   153,836   145,532   145,548
psdes-random        121,489   121,489    127,905   127,825   119,665   119,649
ratio-regions       144,081   144,081    150,953   150,937   141,353   141,401
ray                 250,002   250,578    250,395   250,843   236,562   236,050
raytrace            368,932   368,516    352,186   351,290   323,636   324,020
simple              345,149   347,261    366,423   369,391   352,702   354,726
smith-normal-form   279,781   279,781    257,005   257,573   246,317   246,821
string-concat       119,073   119,073    125,521   125,633   118,033   118,129
tailfib             117,217   117,217    123,617   123,729   116,209   116,209
tak                 117,393   117,393    123,809   124,001   116,625   116,625
tensor              179,236   178,452    178,236   179,148   166,780   166,860
tsp                 158,804   158,132    161,435   161,355   145,516   145,292
tyan                223,532   223,516    224,684   225,964   212,908   215,036
vector32-concat     118,241   118,241    124,657   124,753   117,377   117,361
vector64-concat     118,273   118,273    124,721   124,833   117,377   117,361
vector-rev          118,049   118,049    124,561   124,673   116,929   116,929
vliw                505,453   506,013    545,189   537,021   553,829   542,685
wc-input1           178,995   178,963    182,347   184,075   172,251   173,931
wc-scanStream       188,099   188,067    192,203   189,995   183,243   183,419
zebra               225,308   225,308    224,812   226,740   211,756   214,532
zern                153,185   153,217    154,183   154,231   142,695   142,695
compile time
benchmark         MLton0 MLton1 MLton2 MLton3 MLton4 MLton5
barnes-hut          2.92   2.92   3.51   3.33  4.00   4.11
boyer               3.30   3.39   5.37   5.42  6.55   6.49
checksum            2.47   2.46   2.66   2.59  2.70   2.72
count-graphs        2.63   2.48   3.06   3.00  3.47   3.48
DLXSimulator        3.15   3.19   4.12   4.13  5.20   5.32
even-odd            2.47   2.45   2.60   2.64  2.72   2.73
fft                 2.55   2.60   2.84   2.85  3.07   3.06
fib                 2.46   2.48   2.63   2.63  2.60   2.69
flat-array          2.47   2.46   2.62   2.64  2.46   2.59
hamlet             14.06  14.00  24.68  23.27 45.79  44.97
imp-for             2.47   2.49   2.60   2.61  2.67   2.65
knuth-bendix        2.89   2.88   3.50   3.65  5.00   4.80
lexgen              3.67   3.53   4.86   5.04  7.34   7.09
life                2.58   2.60   2.90   2.97  3.36   3.36
logic               3.00   3.09   3.73   3.69  5.49   5.32
mandelbrot          2.48   2.49   2.63   2.61  2.52   2.72
matrix-multiply     2.48   2.51   2.65   2.67  2.76   2.76
md5                 2.64   2.66   3.00   3.04  3.32   3.10
merge               2.48   2.48   2.64   2.65  2.62   2.77
mlyacc              7.66   7.78  10.65  10.46 17.45  17.84
model-elimination   7.77   7.78  12.68  12.71 22.98  23.23
mpuz                2.51   2.50   2.75   2.73  2.96   2.80
nucleic             4.14   4.14   6.18   6.20  7.22   6.95
output1             2.65   2.66   2.88   3.09  3.42   3.49
peek                2.45   2.66   3.00   3.04  3.44   3.34
psdes-random        2.50   2.51   2.70   2.66  2.63   2.84
ratio-regions       2.76   2.78   3.30   3.33  3.67   3.68
ray                 3.44   3.54   4.42   4.45  5.58   6.00
raytrace            4.49   4.61   6.40   6.49  9.14   9.64
simple              4.03   3.84   5.11   5.31  7.96   8.48
smith-normal-form   3.82   3.79   6.90   7.01  9.13   9.50
string-concat       2.47   2.49   2.61   2.47  2.84   2.76
tailfib             2.25   2.49   2.58   2.67  2.67   2.70
tak                 2.44   2.45   2.57   2.63  2.63   2.72
tensor              3.05   2.99   3.47   3.77  4.39   4.40
tsp                 2.72   2.72   3.46   3.19  3.55   3.56
tyan                3.28   3.24   4.38   4.33  6.05   5.83
vector32-concat     2.46   2.52   2.62   2.63  2.72   2.72
vector64-concat     2.46   2.50   2.60   2.64  2.70   2.71
vector-rev          2.46   2.46   2.71   2.60  2.74   2.64
vliw                5.93   5.99   8.57   8.63 13.29  14.32
wc-input1           2.92   2.87   3.51   3.59  4.27   4.29
wc-scanStream       2.86   2.93   3.48   3.68  4.35   4.11
zebra               3.31   3.30   4.07   4.32  5.34   5.61
zern                2.62   2.65   2.84   2.73  3.20   3.30
run time
benchmark         MLton0 MLton1 MLton2 MLton3 MLton4 MLton5
barnes-hut         28.58  27.94  28.59  28.74  28.02  27.58
boyer              57.50  58.68  56.85  56.99  57.31  57.31
checksum           25.35  25.34  28.31  24.18  19.28  17.69
count-graphs       39.55  39.70  36.00  31.97  38.10  34.79
DLXSimulator       32.32  32.48  32.50  32.50  31.13  31.33
even-odd           39.08  39.09  43.50  46.85  39.02  39.01
fft                31.41  30.90  28.37  28.22  26.52  26.28
fib                17.99  17.97  21.49  21.24  19.40  21.92
flat-array         23.58  23.56  53.99  54.10   0.00   0.00
hamlet             39.62  39.99  86.12  82.22  79.01  87.44
imp-for            24.46  24.49  32.91  23.66  10.95  11.01
knuth-bendix       34.00  34.15  36.05  51.55  40.03  49.42
lexgen             33.56  33.26  31.44  34.04  30.31  32.23
life               38.83  38.92  40.64  40.37  42.47  40.36
logic              34.66  35.79  38.63  38.52  37.73  38.21
mandelbrot         35.80  35.81  13.45  14.93  10.40  10.45
matrix-multiply    29.72  29.69  22.15  16.98  14.16  14.22
md5                28.11  28.06  36.79  36.04  28.93  29.08
merge              32.36  32.17  32.27  32.39  31.96  32.14
mlyacc             32.44  30.98  35.93  35.08  36.68  34.43
model-elimination  38.01  38.71  58.68  60.36  54.40  55.79
mpuz               29.94  29.90  25.50  24.99  14.30  14.89
nucleic            33.74  33.54  28.86  28.79  29.28  29.33
output1            29.97  30.03  35.76  33.59  31.62  32.58
peek               34.24  34.16  35.93  36.58   5.74   5.64
psdes-random       34.00  33.90  24.85  25.37  27.36  27.31
ratio-regions      48.24  48.24  48.60  48.92  46.30  46.25
ray                39.55  37.91  35.55  35.90  37.47  39.88
raytrace           36.60  36.38  36.19  35.47  35.48  36.38
simple             29.33  29.88  33.70  34.45  37.98  34.01
smith-normal-form  39.47  39.23  39.19  39.24  38.79  38.92
string-concat      91.32  91.34  92.27  93.13  27.62  24.68
tailfib            38.04  38.34  24.32  20.60  16.00  13.90
tak                30.83  30.83  36.90  34.35  34.61  34.36
tensor             39.63  39.24  42.37  23.90  12.32  12.29
tsp                37.61  37.75  26.19  26.31  25.88  25.68
tyan               30.51  30.47  33.59  34.32  32.20  31.96
vector32-concat    82.46  82.37  84.75  84.74  23.36  23.23
vector64-concat    91.46  91.88  93.85  93.77  34.95  33.90
vector-rev         26.53  26.54  27.19  26.76  17.64  17.68
vliw               28.16  27.22  32.98  32.20  27.67  29.22
wc-input1          43.86  43.85  63.82  43.25  43.22  43.20
wc-scanStream      21.84  21.85  41.54  25.59  22.65  22.27
zebra              30.29  30.24  29.79  29.68  31.18  30.93
zern               31.87  31.96  31.02  31.18  25.23  25.36

@MatthewFluet MatthewFluet merged commit e7b6276 into MLton:master Mar 21, 2019
MatthewFluet added a commit to MatthewFluet/mlton that referenced this pull request Nov 5, 2019
It appears that GCC (and, to a lesser extent, Clang/LLVM) does not always
successfully fuse adjacent `Word<N>_<op>` and
`Word{S,U}<N>_<op>CheckP` primitives.  The performance results
reported at MLton#273 and
MLton#292 suggest that this does not
always have significant impact, but a close look at the `md5`
benchmark shows that the native codegen significantly outperforms the
C codegen with gcc-9 due to redundant arithmetic computations (one for
`Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`).

This flag will be used to enable explicit fusing of adjacent
`Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives in the
codegens.
MatthewFluet added a commit to MatthewFluet/mlton that referenced this pull request Nov 5, 2019
It appears that GCC (and, to a lesser extent, Clang/LLVM) does not always
successfully fuse adjacent `Word<N>_<op>` and
`Word{S,U}<N>_<op>CheckP` primitives.  The performance results
reported at MLton#273 and
MLton#292 suggest that this does not
always have significant impact, but a close look at the `md5`
benchmark shows that the native codegen significantly outperforms the
C codegen with gcc-9 due to redundant arithmetic computations (one for
`Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`).

These functions compute both the arithmetic result and a boolean
indicating overflow (using `__builtin_<op>_overflow`).  They will be
used for explicit fusing of adjacent `Word<N>_<op>` and
`Word{S,U}<N>_<op>CheckP` primitives in the C codegen for
`-codegen-fuse-op-and-check true`.
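
As a rough sketch of what such a fused function could look like, the following C helpers return the overflow flag and store the arithmetic result through a pointer in a single call. The function names and signatures are hypothetical and chosen only for illustration; the only part taken from the commit message is the use of `__builtin_<op>_overflow`, which GCC and Clang both provide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fused helper for signed 32-bit subtraction: stores the
   wrapped result of x - y in *res and returns true on overflow. */
static inline bool words32_sub_and_check(int32_t x, int32_t y, int32_t *res) {
    return __builtin_sub_overflow(x, y, res);
}

/* Hypothetical fused helper for unsigned 32-bit addition: stores the
   wrapped result of x + y in *res and returns true on overflow. */
static inline bool wordu32_add_and_check(uint32_t x, uint32_t y, uint32_t *res) {
    return __builtin_add_overflow(x, y, res);
}
```

Because the result and the flag come from one operation, the C compiler is not asked to discover on its own that a separate check and a separate operation compute the same value.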
MatthewFluet added a commit to MatthewFluet/mlton that referenced this pull request Nov 5, 2019
It appears that GCC (and, to a lesser extent, Clang/LLVM) does not always
successfully fuse adjacent `Word<N>_<op>` and
`Word{S,U}<N>_<op>CheckP` primitives.  The performance results
reported at MLton#273 and
MLton#292 suggest that this does not
always have significant impact, but a close look at the `md5`
benchmark shows that the native codegen significantly outperforms the
C codegen with gcc-9 due to redundant arithmetic computations (one for
`Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`).

(Note: Because the final md5 state is not used by the `md5` benchmark
program, MLton actually optimizes out most of the md5 computation.
What is left is a lot of arithmetic from `PackWord32Little.subVec` to
check for indices that should raise `Subscript`.)

For example, with `-codegen-fuse-op-and-check false` and gcc-9, the
`transform` function of `md5` has the following assembly:

	movl	%r9d, %r10d
	subl	$1, %r10d
	jo	.L650
	leal	-1(%r8), %r10d
	movl	%r10d, %r12d
	addl	%r10d, %edx
	jo	.L650
	addl	%r10d, %r11d
	cmpl	%eax, %r11d
	jnb	.L656
	movl	%ebp, %edx
	addl	$1, %edx
	jo	.L659
	leal	1(%rcx), %edx
	movl	%edx, %r11d
	imull	%r9d, %r11d
	jo	.L650
	imull	%r8d, %edx
	movl	%edx, %r11d
	addl	%r10d, %r11d
	jo	.L650
	leal	(%rdx,%r10), %r11d
	cmpl	%eax, %r11d
	jnb	.L665

What seems to have happened is that gcc has arranged for equivalent
values to be in `%r8` and `%r9`.  In the first three lines, there is
an implementation of `WordS32_subCheckP (X, 1)` using `subl/jo`, while
in the fourth line, there is an implementation of `Word32_sub (X, 1)`
using `lea` with an offset of `-1`.  Notice that `%r10` is used for
the result of both, so the fourth line is redundant (the value is
already in `%r10`).

On the other hand, with `-codegen-fuse-op-and-check true` and gcc-9,
the `transform` function of `md5` has the following assembly:

	movl	%r8d, %r9d
	subl	$1, %r9d
	jo	.L645
	addl	%r9d, %ecx
	jo	.L645
	cmpl	%edx, %ecx
	jnb	.L651
	movl	%eax, %ecx
	addl	$1, %ecx
	jo	.L654
	imull	%r8d, %ecx
	jo	.L645
	addl	%r9d, %ecx
	jo	.L645
	cmpl	%edx, %ecx
	jnb	.L660
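
For readers who prefer C to assembly, the first seven instructions of the fused listing above have roughly the following shape. This is an interpretive sketch, not MLton output: the function and enum names are hypothetical, and it assumes that the `jo` targets correspond to raising `Overflow` and the `jnb` targets to raising `Subscript`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome codes standing in for normal control flow and
   the two exception paths. */
enum status { ST_OK, ST_OVERFLOW, ST_SUBSCRIPT };

/* Sketch of: subl/jo, addl/jo, cmpl/jnb.
   Each arithmetic step produces its result and its overflow flag
   together, and the final comparison is an unsigned bounds test. */
static enum status index_check(int32_t n, int32_t base, uint32_t len,
                               int32_t *idx) {
    int32_t t, i;
    if (__builtin_sub_overflow(n, 1, &t)) {    /* subl $1 ; jo */
        return ST_OVERFLOW;
    }
    if (__builtin_add_overflow(base, t, &i)) { /* addl ; jo */
        return ST_OVERFLOW;
    }
    if ((uint32_t)i >= len) {                  /* cmpl ; jnb (unsigned) */
        return ST_SUBSCRIPT;
    }
    *idx = i;
    return ST_OK;
}
```

The unsigned comparison also rejects negative indices, since they convert to large unsigned values.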
MatthewFluet added a commit that referenced this pull request Nov 22, 2019
Updates to C and LLVM codegens. Highlights:

* Add `Machine.Program.rflow` to compute `{returns,raises}To` control
  flow (654c557) and use in `functor Chunkify` (1b3b7b8) and in
  Machine IR `Raise/Return` transfers (cf8e487).
* Add `-chunk-jump-table {false|true}` compile-time option to force
  generation of a jump table for the chunk switch (8e0dd2d,
  5b6439b, 087a5b1).
* Add `-chunk-{{must,may}-rto-self,must-rto-sing,must-rto-other}-opt`
  compile-time options to optimize return/raise transfers (7c10c70,
  4d5abde, 4b7c649, c3b9905, 473808f).
* Experiment using LLVM's `cc10` (aka, `ghccc`) calling convention
  (2e26ebd).
* Experiment with a new `simple` chunkify strategy (3330cbe,
  3d9c499, 138512f, faef164, d1df0de); generally performs
  about the same as `coalesce4096`, significantly improves `fib` and
  `tak` (for GCC), slightly improves `hamlet`, but slightly worsens
  `raytrace`:

  config command                                                                                                                          
  C04    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9                                                           
  C05    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 -chunkify simple                                          
  C09    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang                                                        
  C10    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang -chunkify simple                                       

  task_clock ratio_means.fieller@0.95 (2-level)
  program           `C05/C04` `C10/C09`
  barnes-hut           0.9978    0.9589
  boyer                1.064     1.076 
  checksum             1.051     0.9775
  count-graphs         1.005     0.9876
  DLXSimulator         1.000     0.9905
  even-odd             1.037     0.9989
  fft                  0.9616    0.9537
  fib                  0.6689    0.6260
  flat-array           1.000     0.9645
  hamlet               0.9547    0.9322
  imp-for              1.067     1.014 
  knuth-bendix         1.092     1.031 
  lexgen               1.031     1.078 
  life                 1.002     0.9911
  logic                1.016     1.015 
  mandelbrot           0.9776    1.030 
  matrix-multiply      0.9903    0.9844
  md5                  1.008     0.9940
  merge                0.9927    1.062 
  mlyacc               0.9810    1.024 
  model-elimination    0.9877    0.9743
  mpuz                 1.011     1.010 
  nucleic              1.036     1.030 
  output1              0.9943    1.021 
  peek                 1.036     1.027 
  pidigits             1.000     0.9653
  psdes-random         1.009     1.014 
  ratio-regions        0.9985    0.9881
  ray                  0.9738    0.9601
  raytrace             1.101     1.100 
  simple               0.9620    0.9272
  smith-normal-form    0.9690    0.9806
  string-concat        0.9610    0.9772
  tailfib              1.006     0.9292
  tailmerge            0.9847    1.023 
  tak                  0.8264    1.013 
  tensor               1.010     0.9998
  tsp                  0.9981    1.010 
  tyan                 1.045     1.027 
  vector-rev           1.012     0.9891
  vector32-concat      0.9495    1.030 
  vector64-concat      1.098     0.9744
  vliw                 0.9413    1.019 
  wc-input1            0.9301    1.098 
  wc-scanStream        1.114     0.9234
  zebra                1.008     1.001 
  zern                 0.9819    1.014 
  MIN                  0.6689    0.6260
  GMEAN                0.9940    0.9912
  MAX                  1.114     1.100 

  The `simple` chunkify strategy is not (yet) suitable for a
  self-compile; it can generate excessively large chunks, including
  one for a self-compile that requires 8min to compile by `gcc`.
* Add `expect: WordX.t option` to RSSA and Machine `Switch.T`
  (911b5d4, e2b27ab, 695320d) and add `-gc-expect
  {none|false|true}` compile-time option, where `-gc-expect false`
  should indicate that performing a GC is cold path (823815a); no
  notable performance impact.
* Lots of tweaks to C codegen, ultimately eliminating almost all
  `c-chunk.h` macros.
* Eliminate unused `Machine.Operand.Contents` constructor (006269b).
* Make a major refactoring of LLVM codegen (cec30c5).
* Implement `Real<N>_qequal` for C codegen (9b7b2bd) and use
  `is{less,lessequal}` for `Real<N>_l{t,e}` for C codegen (7b55819).
* Generalize LLVM type-based alias-analysis (27709ef).
* Add `-llvm-aamd scope` for simple `noalias`/`alias.scope`
  alias-analysis metadata in LLVM codegen (b825f56); no notable
  performance impact.
* Use C99/C11 `inline` for primitive and Basis Library functions
  (311331c, c864492, 4f2d213).
* Add `-codegen-fuse-op-and-chk {false|true}` compile-time option to
  explicitly fuse adjacent `Word<N>_<op>` and
  `Word{S,U}<N>_<op>CheckP` primitives in the C and LLVM codegens
  (6b738b8, 3d1e89c, 68f8512, 82c019f, 61de560, 5363199,
  0d46a85).  It appears that GCC (and, to a lesser extent,
  Clang/LLVM) does not always successfully fuse adjacent
  `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives.  The
  performance results reported at #273 and #292 suggest that this
  does not always have a significant impact, but sometimes
  `-codegen-fuse-op-and-chk true` can have a positive impact.
  Unfortunately, it can also have a (significant) negative impact.  In
  `matrix-multiply` and `vector-rev`, fusing can cause GCC to not
  recognize that an explicit sequence index can be replaced by a
  stride length; in these benchmarks, it would be nice if MLton
  eliminated the overflow checks.

  config command                                                                                                                          
  C04    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9                                                           
  C09    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang                                                        
  C11    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 -codegen-fuse-op-and-chk true                             
  C15    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang -codegen-fuse-op-and-chk true                          

  task_clock ratio_means.fieller@0.95 (2-level)
  program           `C11/C04` `C15/C09`
  barnes-hut           1.005     0.9925
  boyer                1.052     1.013 
  checksum             1.022     1.028 
  count-graphs         0.9722    1.002 
  DLXSimulator         1.004     0.9959
  even-odd             0.8768    1.003 
  fft                  0.9592    1.016 
  fib                  0.9732    0.9798
  flat-array           0.8148    1.019 
  hamlet               0.9966    1.030 
  imp-for              0.8993    0.7985
  knuth-bendix         1.008     1.013 
  lexgen               0.9851    1.043 
  life                 0.9954    1.006 
  logic                0.9994    1.014 
  mandelbrot           0.9440    1.013 
  matrix-multiply      1.336     1.009 
  md5                  0.9604    1.007 
  merge                0.9675    1.037 
  mlyacc               1.032     1.029 
  model-elimination    1.010     1.004 
  mpuz                 1.035     0.9599
  nucleic              0.9938    0.9983
  output1              0.9278    0.9709
  peek                 0.9850    1.035 
  pidigits             0.9702    0.9538
  psdes-random         1.017     0.9986
  ratio-regions        0.9801    0.9887
  ray                  0.9795    1.009 
  raytrace             0.9959    1.026 
  simple               0.9764    1.010 
  smith-normal-form    1.002     1.049 
  string-concat        0.7919    0.9035
  tailfib              1.030     1.227 
  tailmerge            1.017     0.9980
  tak                  0.9790    0.9988
  tensor               0.5258    1.000 
  tsp                  0.9845    1.013 
  tyan                 1.019     0.9739
  vector-rev           1.178     1.253 
  vector32-concat      0.8703    0.9230
  vector64-concat      0.8906    0.9038
  vliw                 0.9921    1.044 
  wc-input1            1.060     0.9809
  wc-scanStream        0.9166    1.040 
  zebra                1.008     1.020 
  zern                 1.051     1.089 
  MIN                  0.5258    0.7985
  GMEAN                0.9720    1.007 
  MAX                  1.336     1.253 

  Note: the issue with `md5` mentioned in the commit messages is with
  respect to the `md5` benchmark before 2daaebf.
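
The remark above about `matrix-multiply` and `vector-rev` concerns loops whose index increment carries an overflow check. The C sketch below shows that loop shape next to the unchecked form; it is a hypothetical illustration of why the check is redundant, not code from either benchmark, and it makes no claim to reproduce GCC's behavior on them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sums v[0..n-1] with an explicitly checked index increment.
   Returning false models raising Overflow. Inside the loop,
   i < n <= INT32_MAX holds, so i + 1 can never overflow and the
   check is dead; a compiler that proves this could drop it. */
static bool sum_with_checked_index(const int32_t *v, int32_t n, int64_t *sum) {
    int64_t acc = 0;
    int32_t i = 0;
    while (i < n) {
        acc += v[i];
        int32_t next;
        if (__builtin_add_overflow(i, 1, &next)) {
            return false;
        }
        i = next;
    }
    *sum = acc;
    return true;
}

/* The same loop once the redundant check has been eliminated. This
   simple counted form is the kind of loop that is easy to rewrite
   in terms of a pointer stride. */
static int64_t sum_unchecked(const int32_t *v, int32_t n) {
    int64_t acc = 0;
    for (int32_t i = 0; i < n; i++) {
        acc += v[i];
    }
    return acc;
}
```

Both functions return the same sum for any valid input; the difference is only in how much the C compiler must prove before it can simplify the index arithmetic.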

Overall, this simplifies the C and LLVM codegen slightly, although
there is little significant performance change:

config command                                                                                                                          
C02    /home/mtf/devel/mlton/builds/g89891a411/bin/mlton -codegen c -cc gcc-9                                                           
C04    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9                                                           
C08    /home/mtf/devel/mlton/builds/g89891a411/bin/mlton -codegen llvm -cc clang                                                        
C09    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang                                                        

task_clock ratio_means.fieller@0.95 (2-level)
program           `C04/C02` `C09/C08`
barnes-hut           1.036     1.025 
boyer                0.9731    1.006 
checksum             0.9652    1.002 
count-graphs         0.9988    0.9964
DLXSimulator         0.9970    1.023 
even-odd             1.002     0.9881
fft                  1.026     0.9674
fib                  0.9034    0.7846
flat-array           1.014     1.021 
hamlet               0.9740    1.010 
imp-for              0.9707    0.9908
knuth-bendix         0.9077    0.9777
lexgen               1.048     0.8985
life                 1.002     0.9827
logic                1.006     0.9867
mandelbrot           1.000     1.011 
matrix-multiply      1.020     0.9957
md5                  0.9700    0.9960
merge                0.9974    0.9818
mlyacc               1.003     0.9824
model-elimination    0.9936    0.9817
mpuz                 0.9815    0.9466
nucleic              0.9946    1.002 
output1              1.007     1.026 
peek                 0.9832    0.9898
pidigits             0.9950    1.047 
psdes-random         1.009     0.9869
ratio-regions        0.9978    0.9725
ray                  0.9938    0.9663
raytrace             0.9975    1.032 
simple               0.9936    1.000 
smith-normal-form    1.038     0.9941
string-concat        1.041     1.014 
tailfib              0.9865    0.9741
tailmerge            1.010     1.020 
tak                  0.9331    0.9041
tensor               0.9938    0.9941
tsp                  0.9825    1.004 
tyan                 0.9960    0.9879
vector-rev           1.014     0.9091
vector32-concat      1.090     0.9016
vector64-concat      0.9994    0.9800
vliw                 0.9995    0.9876
wc-input1            0.9685    0.8634
wc-scanStream        1.178     1.105 
zebra                0.9857    0.9900
zern                 0.9733    0.9890
MIN                  0.9034    0.7846
GMEAN                0.9982    0.9815
MAX                  1.178     1.105