
Benchmark on a bigger example #4

Open · hirrolot opened this issue Jul 11, 2024 · 10 comments


hirrolot commented Jul 11, 2024

We need to benchmark the supercompiler on a bigger example than those in the examples/ folder to see which parts take the most time. The usual suspects are substitution and homeomorphic embedding.

But before that, Mazeppa needs to be smart enough to emit reasonable code for all programs (see #2). It makes no sense to benchmark an algorithm that still needs conceptual changes.

Note: all the benchmarks below were conducted on AMD Ryzen 9 5900HX.
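For context, the homeomorphic embedding relation that the whistle is built on looks roughly like this (a minimal textbook sketch in OCaml, not Mazeppa's actual implementation; the term type and function names are assumptions for illustration):

(* A textbook homeomorphic embedding over first-order terms: [t1] is
   embedded in [t2] if [t1] can be obtained from [t2] by deleting some
   operators ("diving") and matching the rest pairwise ("coupling"). *)
type term =
  | Var of string
  | Call of string * term list

let rec embeds t1 t2 = by_diving t1 t2 || by_coupling t1 t2

(* Diving: [t1] is embedded in some argument of [t2]. *)
and by_diving t1 t2 =
  match t2 with
  | Var _ -> false
  | Call (_, args) -> List.exists (embeds t1) args

(* Coupling: the heads coincide and the arguments embed pairwise. *)
and by_coupling t1 t2 =
  match (t1, t2) with
  | Var _, Var _ -> true
  | Call (op1, args1), Call (op2, args2) ->
    op1 = op2
    && List.length args1 = List.length args2
    && List.for_all2 embeds args1 args2
  | _ -> false

Since the driver compares each new node against many ancestor nodes, this check runs very often, which is consistent with it dominating the profile.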

hirrolot self-assigned this Jul 11, 2024
hirrolot commented:

It seems that the self-interpreter example is the best candidate for benchmarking so far. The input list length can be set to 10 or some higher value. Once I'm done with a few implementation changes, I'll post benchmarking data here, along with musings about what improvements can be made.

hirrolot commented:

I'm now pretty confident that homeomorphic embedding is what slows down the supercompilation process the most (unsurprisingly). Take this example:

main() := loop(5000u32, expand());

loop(i, _t) := match =(i, 0u32) {
    T() -> Done(),
    F() -> loop(-(i, 1u32), expand())
};

expand() := Combine(
    Foo(Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar()),
    Foo(Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar()),
    Foo(Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar()),
    Foo(Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar()),
    Foo(Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar()),
    Foo(Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar()),
    Foo(Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar()),
    Foo(Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar()),
    Foo(Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar()),
    Foo(Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar(), Bar())
);

The results are:

$ time ../scripts/mazeppa.sh run

real    0m37.573s
user    0m37.530s
sys     0m0.040s

However, if I change the implementation of Homeomorphic_emb.decide to just return false:

$ time ../scripts/mazeppa.sh run

real    0m2.162s
user    0m2.115s
sys     0m0.047s
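For concreteness, that stub amounts to something like the following (the exact signature of Homeomorphic_emb.decide is an assumption here; only the constant result matters):

(* Hypothetical stub: never blow the whistle. This discards the
   termination guarantee of supercompilation and is only useful for
   isolating the cost of the embedding check itself. *)
let decide _t1 _t2 = false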


hirrolot commented Aug 1, 2024

After commit 1731459 on the above example (0m37.573s real previously):

real    0m15.506s
user    0m15.375s
sys     0m0.130s

A self-interpreter with the following main function:

main() :=
    let input :=
        cons(Const(1i32), cons(Const(2i32), cons(Const(3i32),
        cons(Const(4i32), cons(Const(5i32), cons(Const(6i32),
        cons(Const(7i32), cons(Const(8i32), cons(Const(9i32),
        cons(Const(10i32), nil()))))))))));
    eval(Nil(), FCall("sum", Cons(input, Nil())));

Before:

real    0m7.495s
user    0m7.446s
sys     0m0.049s

After:

real    0m0.507s
user    0m0.458s
sys     0m0.049s


hirrolot commented Aug 9, 2024

Benchmarking a self-interpreter with the following main function:

main() :=
    let input :=
        cons(Const(10i32), cons(Const(10i32), cons(Const(10i32),
        cons(Const(10i32), cons(Const(10i32), cons(Const(10i32),
        cons(Const(10i32), cons(Const(10i32), cons(Const(10i32),
        cons(Const(10i32), cons(Const(10i32), cons(Const(10i32),
        cons(Const(10i32), cons(Const(10i32), cons(Const(10i32),
        nil())))))))))))))));
    eval(Nil(), FCall("sum", Cons(input, Nil())));

Before commit e61c811 (now using hyperfine):

Benchmark 1: ../scripts/mazeppa.sh run
  Time (mean ± σ):      8.524 s ±  0.195 s    [User: 8.469 s, System: 0.053 s]
  Range (min … max):    8.293 s …  8.861 s    10 runs

After:

Benchmark 1: ../scripts/mazeppa.sh run
  Time (mean ± σ):      2.521 s ±  0.047 s    [User: 2.449 s, System: 0.072 s]
  Range (min … max):    2.469 s …  2.615 s    10 runs
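(For reference, these hyperfine numbers come from an invocation along these lines; the exact command line is an assumption:)

$ hyperfine '../scripts/mazeppa.sh run'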

Also, it turns out that both caches, the global size cache and the local result cache, improve performance a lot: removing either of them makes performance much worse.
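A sketch of what such a result cache could look like, building on the embedding sketch above (the module name echoes Result_cache from the actual code, but the shape shown here is an assumption):

(* Memoize the embedding check on pairs of terms. [Hashtbl] uses
   structural hashing, which works for the simple [term] type above. *)
module Result_cache = struct
  let create n : (term * term, bool) Hashtbl.t = Hashtbl.create n

  let find_or_add cache key compute =
    match Hashtbl.find_opt cache key with
    | Some result -> result
    | None ->
      let result = compute () in
      Hashtbl.add cache key result;
      result
end

let decide cache t1 t2 =
  Result_cache.find_or_add cache (t1, t2) (fun () -> embeds t1 t2)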


hirrolot commented Aug 9, 2024

Also, I should note that the above contrived example now takes around 32 seconds to supercompile (before adding the result cache, it used to take around 15 seconds). The reasons are:

  • Querying and extending the cache is useless because there seems to be little or no duplication of work.
  • Result_cache.create 128 takes time.
  • Passing the cache as a function parameter takes time.

However, I think that self-interpretation represents a more "real-life" supercompilation scenario, which is far more important to us. Therefore, I'll keep the result cache anyway.


hirrolot commented Aug 10, 2024

This is a self-interpretation memory benchmark (yes, now using generateList):

main() := sum(generateList(10i32));

--print-gc-stats gives the following data:

minor_collections:      405           
major_collections:      13
compactions:            0
forced_major_collections: 1

minor_words:    102296710
promoted_words:   1392482
major_words:      1701843

top_heap_words: 1177774
heap_words:      110856
live_words:       17435
free_words:       92917
largest_free:         0
fragments:          504

live_blocks: 2937
free_blocks: 0
heap_chunks: 0

That is, Mazeppa has allocated a total of ~782.8 megabytes:

  • (minor_words + major_words − promoted_words)×8÷1024÷1024 = (102296710 + 1701843 − 1392482)×8÷1024÷1024 ≈ 782.8

However, only ~13.0 megabytes (1701843 × 8 ÷ 1024 ÷ 1024) have ended up in the major heap. Also, there have been about 30 times as many minor collections as major collections (405 vs. 13).
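The same arithmetic can be reproduced programmatically from OCaml's standard Gc module, whose stat record carries exactly the counters printed above (a sketch; it assumes a 64-bit platform where one word is 8 bytes):

let () =
  let s = Gc.stat () in
  let mb words = words *. 8.0 /. 1024.0 /. 1024.0 in
  (* Total allocation = minor + major − promoted (promoted words are
     counted in both the minor and major totals). *)
  Printf.printf "total allocated: %.1f MB\n"
    (mb (s.Gc.minor_words +. s.Gc.major_words -. s.Gc.promoted_words));
  Printf.printf "major heap:      %.1f MB\n" (mb s.Gc.major_words)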

What does this data tell us? Probably that the OCaml GC is doing its job well, collecting many short-lived values in the minor heap while keeping major-heap usage to a minimum.

hirrolot commented:

A self-interpreter with the following main function:

main() := sum(generateList(20i32));

Without Flambda (OCaml version 5.1.2):

Benchmark 1: dune exec mazeppa --release -- run
  Time (mean ± σ):     10.742 s ±  0.098 s    [User: 10.661 s, System: 0.083 s]
  Range (min … max):   10.611 s … 10.933 s    10 runs

With Flambda (OCaml version 5.2.0):

Benchmark 1: dune exec mazeppa --release -- run
  Time (mean ± σ):     10.296 s ±  0.114 s    [User: 10.201 s, System: 0.097 s]
  Range (min … max):   10.182 s … 10.510 s    10 runs

So even without any fine-tuning, performance with Flambda is slightly better. It's probably worth experimenting with different Flambda options for optimization more aggressive than -O3.
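If anyone wants to experiment, such flags would go into a dune stanza along these lines (the stanza shape is standard dune; the executable name and the particular flags are illustrative, not the project's actual configuration):

; Pass extra Flambda flags to ocamlopt for this executable.
(executable
 (name mazeppa)
 (ocamlopt_flags (:standard -O3 -unbox-closures -inline 200)))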

hirrolot commented:

The same self-interpreter invocation as in the message above:

main() := sum(generateList(20i32));

After commit 11b6018 (with Flambda):

Benchmark 1: dune exec mazeppa --release -- run
  Time (mean ± σ):      8.365 s ±  0.076 s    [User: 8.284 s, System: 0.084 s]
  Range (min … max):    8.300 s …  8.537 s    10 runs

As for memory usage, it's about 1.5× better: previously, Mazeppa allocated ~755.1 MB on the major heap; now it's ~486.3 MB. That is, both speed and memory usage have improved.

hirrolot commented:

After commit 44e42fb:

Benchmark 1: dune exec mazeppa --release -- run
  Time (mean ± σ):      8.169 s ±  0.057 s    [User: 8.085 s, System: 0.084 s]
  Range (min … max):    8.085 s …  8.256 s    10 runs

hirrolot commented:

Running time remains the same after commit 5462b02. Memory usage is reduced by a few megabytes (~484.4 MB on the major heap compared to ~486.3 MB previously).
