Some Improvements for Benchmark Page #127

Open
mert-kurttutan opened this issue Apr 24, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

mert-kurttutan commented Apr 24, 2024

To make the benchmark clearer (and easier to reproduce, modulo hardware specs), I think a few more details are needed.
For instance, it is said that the benchmarks are run with 12 threads, but it is not fully clear how many threads are actually used. Several factors determine this, depending on what kind of wrappers are used around the BLAS implementation.

To give an example, the number of threads is managed by $OMP_NUM_THREADS when OpenMP parallelization is enabled (or by $MKL_NUM_THREADS and $OMP_NUM_THREADS for Intel MKL). What I am saying is that it can be difficult to conclude how many threads are actually used.

It would be better to state the number of threads explicitly, via environment variables.
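As a rough illustration of what stating the thread count explicitly could look like, here is a minimal sketch that reports the two environment variables mentioned above next to the number of logical CPUs the process can see. The helper `parse_thread_var` is a hypothetical name, not part of any library, and the sketch simplifies by treating nested lists such as `4,2` as invalid. Which variable a given backend actually honors still depends on how it was built.

```rust
use std::env;
use std::thread;

/// Parse the value of a thread-count environment variable.
/// Returns None if the variable is unset or is not a positive integer.
/// Simplification (assumption): nested lists like "4,2" count as invalid.
fn parse_thread_var(value: Option<&str>) -> Option<usize> {
    value?.trim().parse::<usize>().ok().filter(|&n| n > 0)
}

fn main() {
    // Variables named in this issue; which one wins depends on the backend.
    for name in ["OMP_NUM_THREADS", "MKL_NUM_THREADS"] {
        let raw = env::var(name).ok();
        match parse_thread_var(raw.as_deref()) {
            Some(n) => println!("{name}={n}"),
            None => println!("{name} is unset or invalid (backend default applies)"),
        }
    }
    // Logical CPUs visible to this process, for context.
    match thread::available_parallelism() {
        Ok(n) => println!("available_parallelism={n}"),
        Err(e) => println!("available_parallelism unknown: {e}"),
    }
}
```

Printing this at the start of a benchmark run would record the settings alongside the results.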

Single-threaded results for the other libraries would also be beneficial to include on the benchmark page.

The theoretical limit should also be included (in GFLOPS).
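For reference, the theoretical limit is usually estimated as cores × clock frequency × FLOPs per cycle per core. The sketch below uses a hypothetical machine (12 cores at 3.0 GHz with two 8-lane f32 FMA units per core), not the hardware behind the benchmark page. The function name `peak_gflops` is made up for illustration.

```rust
/// Theoretical peak in GFLOPS = cores * clock (GHz) * FLOPs per cycle per core.
fn peak_gflops(cores: u32, clock_ghz: f64, flops_per_cycle: u32) -> f64 {
    f64::from(cores) * clock_ghz * f64::from(flops_per_cycle)
}

fn main() {
    // Hypothetical machine, chosen only for illustration:
    // 2 FMA units * 8 f32 lanes * 2 FLOPs per FMA = 32 FLOPs per cycle per core.
    let flops_per_cycle = 2 * 8 * 2;
    let peak = peak_gflops(12, 3.0, flops_per_cycle);
    println!("theoretical f32 peak: {peak:.1} GFLOPS");
}
```

The real clock under sustained vector load is often lower than the nominal figure, so a number computed this way is an upper bound.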

mert-kurttutan added the enhancement label Apr 24, 2024
@oscardssmith

also, I think it would be nice if the benchmarks were line graphs rather than tables.

@mert-kurttutan
Author

Another detail:

  • Sometimes the OS does funny things when choosing which CPU cores to run on. So it would be nicer to set CPU affinity through some API, either from Rust or from the command line. For me, the easiest option is to use taskset, since it does not require any change to the code base.
  • Regarding the CPU core issue (due to the OS assigning different clock speeds to CPU cores): when you measure the time taken, through either criterion or std::time, the Rust version of the gemm functions sometimes ends up on slower CPU cores (longer run time) and sometimes on the faster ones. Funnily enough, the versions from C codebases, e.g. OpenBLAS, consistently end up on the faster CPU cores.
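To make the taskset suggestion concrete, here is a minimal sketch of a small launcher that wraps a benchmark command in `taskset -c <cpu-list>` (the Linux util-linux tool). The helper `pinned_command`, the CPU list `0-11`, and the choice of `cargo bench` are assumptions for illustration. The sketch only builds and prints the command, so it has no side effects.

```rust
use std::process::Command;

/// Build (but do not run) `taskset -c <cpu_list> <program> <args...>`.
fn pinned_command(cpu_list: &str, program: &str, args: &[&str]) -> Command {
    let mut cmd = Command::new("taskset");
    cmd.arg("-c").arg(cpu_list).arg(program).args(args);
    cmd
}

fn main() {
    // Hypothetical choice: pin to logical CPUs 0-11 and run `cargo bench`.
    let cmd = pinned_command("0-11", "cargo", &["bench"]);
    // Only print the command here; calling `.status()` on it would launch it.
    println!("would run: {cmd:?}");
}
```

The same effect is available without any code by prefixing the benchmark invocation with `taskset -c 0-11` in the shell.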

@mert-kurttutan
Author

> also, I think it would be nice if the benchmarks were line graphs rather than tables.

I think this would be especially useful when you want to represent GFLOPS, which the benchmark page should. Otherwise, the time taken involves wildly different scales, making it a bit more difficult to interpret.
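A minimal sketch of the conversion from measured time to GFLOPS, assuming the usual count of 2·m·n·k floating-point operations for a matrix multiply of an m×k matrix by a k×n matrix. The function name `gemm_gflops` and the timings in `main` are hypothetical, not measured results.

```rust
use std::time::Duration;

/// GFLOPS for an (m x k) * (k x n) matrix multiply, counting 2*m*n*k FLOPs.
fn gemm_gflops(m: u64, n: u64, k: u64, elapsed: Duration) -> f64 {
    let flops = 2.0 * m as f64 * n as f64 * k as f64;
    flops / elapsed.as_secs_f64() / 1e9
}

fn main() {
    // Hypothetical timings, for illustration only.
    let samples = [
        (64u64, Duration::from_micros(10)),
        (1024u64, Duration::from_millis(40)),
    ];
    for (size, elapsed) in samples {
        let g = gemm_gflops(size, size, size, elapsed);
        println!("n={size}: {elapsed:?} -> {g:.1} GFLOPS");
    }
}
```

Run times that differ by orders of magnitude across matrix sizes map to GFLOPS values on a comparable scale, which is what makes a line graph readable.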
