
pthreads doesn't seem to work #288

Closed
Lephar opened this issue Dec 14, 2018 · 11 comments

@Lephar

Lephar commented Dec 14, 2018

After following the multithreading docs and several failed attempts, I decided to open this issue, though the source of the problem may very well be my lack of understanding of the library internals. Either way, it would be a good idea to add some simple multithreading examples under examples/tapi, since there are no complete code examples in Multithreading.md, and there aren't many tutorials or docs on the web yet for a relatively new library like this.
Anyway, bli_dgemm() works perfectly fine on a single core, but none of the ways specified in the multithreading docs has any effect on the computation time, and bli_thread_get_num_threads() always returns -1. Here is what I did:
Configured as follows before running make (I also tried auto instead of x86_64, and --enable-threading=pthreads instead of -t pthreads):

    ./configure -t pthreads x86_64 

Way 1:

    export BLIS_NUM_THREADS=16
    ./my_program

Way 2:

    export BLIS_JC_NT=2
    export BLIS_IC_NT=2
    export BLIS_JR_NT=2
    export BLIS_IR_NT=2
    ./my_program

I also added bli_thread_set_num_threads() to my source and tried it both with and without the environment variables:

    bli_thread_set_num_threads(16);
    bli_dgemm(BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE, m, n, k, &alpha, A, rsa, csa, B, rsb, csb, &beta, C, rsc, csc);

The code compiles and runs fine with no syntax or linking errors, and the matrix results are still correct; the problem is simply that threading has no effect. Tested on Arch Linux (kernel 4.19) with both the AUR package and the 0.5.0 release downloaded from this GitHub page. Same results.

@fgvanzee
Member

@Lephar Sorry to hear about your frustrations with getting multithreading to work as intended, and thanks for reaching out to us.

Way 1 should have worked. Way 2 should also have worked, though the efficiency might not have been great. The third way should have worked as well.

You did not indicate the dimensions of your gemm problem. What are the values for m, n, and k? You may not notice a speedup for very small problems. (We consider "very small" to be anything in the neighborhood of m = n = k = 50 or smaller.)

Also, can you give me some information about your hardware? Also, what compiler, and what version of that compiler, are you using? It sounds like you're using Linux, so that rules out a lot that can typically go wrong.

I have some other ideas/things to try, but let's start with this for now.

@Lephar
Author

Lephar commented Dec 14, 2018

Hello, you are right, those are some important details I skipped. m = n = k = 4000: both are 4000x4000 square matrices, filled with random data. The computation takes around 2.7-2.8 seconds on a single thread of my Intel i7-8750H (6 cores, 12 threads). I also watched CPU usage during the calculation, and it stays around 8% the whole time (roughly 1/12 of CPU capacity). Both the library and the program are compiled with the same gcc version, gcc 8.2.1 (x64); I include blis/blis.h and link with -lblis. I can upload the full code when I have access to my PC. Oh, and it is nothing near frustration, this is an awesome library :D

@fgvanzee
Member

fgvanzee commented Dec 14, 2018

@Lephar Thanks for your kind words about BLIS. I'm glad you sound mostly satisfied, this hiccup aside.

Before I suggest what to try next, a few comments:

  • Looks like you have very recent hardware (Coffee Lake), so BLIS should be selecting the haswell sub-configuration when you configure with configure auto. You can find the sub-configuration selected a couple dozen lines into the configure output. (We use the haswell sub-config on Haswell, Broadwell, Skylake (desktop), Kaby Lake, and Coffee Lake because they are actually all very similar as far as the AVX/FMA instructions we need.)
  • While your hardware supports hyperthreading (which is what I assume you meant by your reference to 12 threads across 6 cores), in our experience we have found that the gemm operation is very rarely (if ever) helped by it, and almost always hindered. So I recommend using a maximum of 6 threads.
  • Your problem size of 4000 is more than large enough to see meaningful speedup when parallelizing. So no obvious issue here.
  • Your version of gcc is very new; so new, in fact, that I haven't even been able to play with 8.x. Though I don't think that should be an issue. (Unless it is? I can't rule anything out yet, I suppose.)
  • I noticed that you tried both the auto and x86_64 targets. You may already be aware, but x86_64, which builds a "fat" library, is only really useful when deploying pre-built binaries to an audience where your users might have different systems. Generally, auto is fine for most people's purposes. I would encourage you to limit yourself to auto going forward as we troubleshoot since it simplifies things and builds faster.
  • Also going forward, as we troubleshoot, let's stick with git-cloned source, and let's stay on the master branch (which is the default).

Now, some things to try.

First, I'd like to standardize the test environment, so I'm going to ask you to run the testsuite. With BLIS built (and configured with ./configure -t pthreads auto), change into the testsuite directory and build the testsuite:

cd testsuite
make -j

The testsuite is comprehensive, but you don't need to run the whole thing. Instead, you can limit which tests are run by editing input.operations and changing a couple lines, starting at line 279:

2        # gemm
-1 -1 -1 #   dimensions: m n k
nn       #   parameters: transa transb

I changed the first line to a 2, which means "override all other tests and only run this one". I also changed the parameter chars to nn, which means "only test the case where neither A nor B is transposed". (The -1 -1 -1 means "bind each dimension to the problem size," which results in square matrices being used.) Let's also make sure only dgemm runs by changing line 25 of input.general to:

d       # Datatype(s) to test:

And to smooth the results, let's take the best of three trials by changing line 11 to:

3       # Number of repeats per experiment (best result is reported)

Now, I'd like you to set BLIS_JC_NT and BLIS_IC_NT to 2 and unset BLIS_NUM_THREADS:

unset BLIS_NUM_THREADS
export BLIS_JC_NT=2 BLIS_IC_NT=2

Now run the testsuite:

./test_libblis.x

The first crucial portion of the output ends around line 66:

% --- BLIS parallelization info ---
% 
% multithreading                 pthreads
% 
% thread auto-factorization        
%   m dim thread ratio           2
%   n dim thread ratio           1
%   jr max threads               4
%   ir max threads               1
% 
% ways of parallelism     nt    jc    pc    ic    jr    ir
%   environment        unset     2     1     2     1     1
%   gemm   (m,n,k=1000)          2     1     2     1     1
%   herk   (m,k=1000)            2     1     2     1     1
%   trmm_l (m,n=1000)            2     1     2     1     1
%   trmm_r (m,n=1000)            1     1     2     2     1
%   trsm_l (m,n=1000)            2     1     1     2     1
%   trsm_r (m,n=1000)            1     1     4     1     1
% 
% thread partitioning              
%   jr/ir loops                  slab

This confirms that BLIS used the parallelization scheme I specified. The second crucial part is the actual performance at the end of the output. Here's what I get on my 4-core Haswell system:

% --- gemm ---
% 
% gemm m n k                  -1 -1 -1
% gemm operand params         nn
% 

% blis_<dt><op>_<params>_<stor>      m     n     k   gflops   resid      result
blis_dgemm_nn_rrr                  100   100   100     0.55   1.27e-17   PASS
blis_dgemm_nn_rrr                  200   200   200   100.07   1.01e-17   PASS
blis_dgemm_nn_rrr                  300   300   300   132.54   8.85e-18   PASS
blis_dgemm_nn_rrr                  400   400   400   151.86   1.38e-17   PASS
blis_dgemm_nn_rrr                  500   500   500   159.72   9.02e-18   PASS

% blis_<dt><op>_<params>_<stor>      m     n     k   gflops   resid      result
blis_dgemm_nn_ccc                  100   100   100    26.16   2.94e-17   PASS
blis_dgemm_nn_ccc                  200   200   200   103.93   2.28e-17   PASS
blis_dgemm_nn_ccc                  300   300   300   134.13   1.97e-17   PASS
blis_dgemm_nn_ccc                  400   400   400   147.33   2.73e-17   PASS
blis_dgemm_nn_ccc                  500   500   500   149.35   2.02e-17   PASS

This performance is about right given that single-threaded dgemm() peaks out at about 48 gflops on my system.
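As a rough sanity check on that 48-gflops figure, peak double-precision throughput can be estimated from clock speed and FLOPs per cycle. Note the 3.0 GHz clock and the 16 DP FLOPs/cycle (two AVX2 FMA units x 4 doubles/vector x 2 flops/FMA on Haswell) are my assumptions, not figures from this thread:

```python
def theoretical_peak_gflops(ghz, flops_per_cycle, cores=1):
    """Estimate theoretical peak double-precision GFLOPS."""
    return ghz * flops_per_cycle * cores

# Haswell with AVX2+FMA: 2 FMA units * 4 doubles/vector * 2 flops/FMA = 16
print(theoretical_peak_gflops(3.0, 16))           # 48.0 (single core)
print(theoretical_peak_gflops(3.0, 16, cores=4))  # 192.0 (whole chip)
```

A well-tuned dgemm typically sustains a large fraction of this theoretical peak, which is why ~48 gflops single-threaded is a plausible ceiling here.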

Let me know what you see in the testsuite output, and that might tell us more about what's going on.

PS: Another interesting data point would be to go through all of the motions above, with the only difference being that you build BLIS via ./configure -t openmp auto, which will cause BLIS to extract multithreaded parallelism in terms of OpenMP instead of pthreads. (Don't forget to rebuild the testsuite binary after rebuilding BLIS.) If configuring with OpenMP works, then that would suggest there is something broken with pthreads specifically, as opposed to multithreading in general.

@fgvanzee
Member

fgvanzee commented Dec 14, 2018

One more important thing: we need a reference point from which to measure speedup. So after running with export BLIS_JC_NT=2 BLIS_IC_NT=2, go ahead and set those both to 1:

export BLIS_JC_NT=1 BLIS_IC_NT=1

(Alternatively, you can unset the variables.) Then run the testsuite with the same input.general and input.operations as above. This will give us a sense of your single-core performance. Here's mine:

% --- gemm ---
% 
% gemm m n k                  -1 -1 -1
% gemm operand params         nn
% 

% blis_<dt><op>_<params>_<stor>      m     n     k   gflops   resid      result
blis_dgemm_nn_rrr                  100   100   100    31.23   1.27e-17   PASS
blis_dgemm_nn_rrr                  200   200   200    41.83   1.01e-17   PASS
blis_dgemm_nn_rrr                  300   300   300    42.92   8.85e-18   PASS
blis_dgemm_nn_rrr                  400   400   400    46.58   1.38e-17   PASS
blis_dgemm_nn_rrr                  500   500   500    45.73   9.02e-18   PASS

% blis_<dt><op>_<params>_<stor>      m     n     k   gflops   resid      result
blis_dgemm_nn_ccc                  100   100   100    32.21   2.94e-17   PASS
blis_dgemm_nn_ccc                  200   200   200    43.53   2.28e-17   PASS
blis_dgemm_nn_ccc                  300   300   300    44.89   1.97e-17   PASS
blis_dgemm_nn_ccc                  400   400   400    46.60   2.73e-17   PASS
blis_dgemm_nn_ccc                  500   500   500    45.73   2.02e-17   PASS

@Lephar
Author

Lephar commented Dec 15, 2018

That was very helpful, thank you for your time. I did exactly as you said and got some interesting results:

$ unset BLIS_NUM_THREADS
$ export BLIS_JR_NT=1 BLIS_IR_NT=1
$ export BLIS_JC_NT=1 BLIS_IC_NT=1
$ ./test_libblis.x

% --- BLIS parallelization info ---
% 
% multithreading                 pthreads
% 
% thread auto-factorization        
%   m dim thread ratio           2
%   n dim thread ratio           1
%   jr max threads               4
%   ir max threads               1
% 
% ways of parallelism     nt    jc    pc    ic    jr    ir
%   environment        unset     1     1     1     1     1
%   gemm   (m,n,k=1000)          1     1     1     1     1
%   herk   (m,k=1000)            1     1     1     1     1
%   trmm_l (m,n=1000)            1     1     1     1     1
%   trmm_r (m,n=1000)            1     1     1     1     1
%   trsm_l (m,n=1000)            1     1     1     1     1
%   trsm_r (m,n=1000)            1     1     1     1     1
% 
% thread partitioning              
%   jr/ir loops                  slab

% --- gemm ---
% 
% gemm m n k                  -1 -1 -1
% gemm operand params         nn
% 

% blis_<dt><op>_<params>_<stor>      m     n     k   gflops   resid      result
blis_dgemm_nn_rrr                  100   100   100    38.40   1.27e-17   PASS
blis_dgemm_nn_rrr                  200   200   200    49.12   1.01e-17   PASS
blis_dgemm_nn_rrr                  300   300   300    52.13   8.85e-18   PASS
blis_dgemm_nn_rrr                  400   400   400    51.48   1.38e-17   PASS
blis_dgemm_nn_rrr                  500   500   500    55.56   9.02e-18   PASS

% blis_<dt><op>_<params>_<stor>      m     n     k   gflops   resid      result
blis_dgemm_nn_ccc                  100   100   100    41.35   2.94e-17   PASS
blis_dgemm_nn_ccc                  200   200   200    52.63   2.28e-17   PASS
blis_dgemm_nn_ccc                  300   300   300    54.11   1.97e-17   PASS
blis_dgemm_nn_ccc                  400   400   400    56.72   2.73e-17   PASS
blis_dgemm_nn_ccc                  500   500   500    55.56   2.02e-17   PASS
$ export BLIS_JC_NT=2 BLIS_IC_NT=2
$ ./test_libblis.x

% --- BLIS parallelization info ---
% 
% multithreading                 pthreads
% 
% thread auto-factorization        
%   m dim thread ratio           2
%   n dim thread ratio           1
%   jr max threads               4
%   ir max threads               1
% 
% ways of parallelism     nt    jc    pc    ic    jr    ir
%   environment        unset     2     1     2     1     1
%   gemm   (m,n,k=1000)          2     1     2     1     1
%   herk   (m,k=1000)            2     1     2     1     1
%   trmm_l (m,n=1000)            2     1     2     1     1
%   trmm_r (m,n=1000)            1     1     2     2     1
%   trsm_l (m,n=1000)            2     1     1     2     1
%   trsm_r (m,n=1000)            1     1     4     1     1
% 
% thread partitioning              
%   jr/ir loops                  slab

% --- gemm ---
% 
% gemm m n k                  -1 -1 -1
% gemm operand params         nn
% 

% blis_<dt><op>_<params>_<stor>      m     n     k   gflops   resid      result
blis_dgemm_nn_rrr                  100   100   100    14.37   1.27e-17   PASS
blis_dgemm_nn_rrr                  200   200   200    77.87   1.01e-17   PASS
blis_dgemm_nn_rrr                  300   300   300   105.84   8.85e-18   PASS
blis_dgemm_nn_rrr                  400   400   400   114.56   1.38e-17   PASS
blis_dgemm_nn_rrr                  500   500   500   151.54   9.02e-18   PASS

% blis_<dt><op>_<params>_<stor>      m     n     k   gflops   resid      result
blis_dgemm_nn_ccc                  100   100   100    14.84   2.94e-17   PASS
blis_dgemm_nn_ccc                  200   200   200    70.21   2.28e-17   PASS
blis_dgemm_nn_ccc                  300   300   300   126.25   1.97e-17   PASS
blis_dgemm_nn_ccc                  400   400   400   124.77   2.73e-17   PASS
blis_dgemm_nn_ccc                  500   500   500   139.97   2.02e-17   PASS
$ export BLIS_JC_NT=2 BLIS_IC_NT=3
$ ./test_libblis.x

% --- BLIS parallelization info ---
% 
% multithreading                 pthreads
% 
% thread auto-factorization        
%   m dim thread ratio           2
%   n dim thread ratio           1
%   jr max threads               4
%   ir max threads               1
% 
% ways of parallelism     nt    jc    pc    ic    jr    ir
%   environment        unset     2     1     3     1     1
%   gemm   (m,n,k=1000)          2     1     3     1     1
%   herk   (m,k=1000)            2     1     3     1     1
%   trmm_l (m,n=1000)            2     1     3     1     1
%   trmm_r (m,n=1000)            1     1     3     2     1
%   trsm_l (m,n=1000)            2     1     1     3     1
%   trsm_r (m,n=1000)            1     1     6     1     1

% --- gemm ---
% 
% gemm m n k                  -1 -1 -1
% gemm operand params         nn
% 

% blis_<dt><op>_<params>_<stor>      m     n     k   gflops   resid      result
blis_dgemm_nn_rrr                  100   100   100     8.24   1.27e-17   PASS
blis_dgemm_nn_rrr                  200   200   200    43.04   1.01e-17   PASS
blis_dgemm_nn_rrr                  300   300   300    78.57   8.85e-18   PASS
blis_dgemm_nn_rrr                  400   400   400   129.32   1.38e-17   PASS
blis_dgemm_nn_rrr                  500   500   500   186.11   9.02e-18   PASS

% blis_<dt><op>_<params>_<stor>      m     n     k   gflops   resid      result
blis_dgemm_nn_ccc                  100   100   100    14.77   2.94e-17   PASS
blis_dgemm_nn_ccc                  200   200   200    39.11   2.28e-17   PASS
blis_dgemm_nn_ccc                  300   300   300   115.58   1.97e-17   PASS
blis_dgemm_nn_ccc                  400   400   400   165.24   2.73e-17   PASS
blis_dgemm_nn_ccc                  500   500   500   153.82   2.02e-17   PASS

So it is definitely working. I also tried some extra cases and found that 6 threads is indeed optimal, as you said. The gflops values of the multithreaded tests varied by about ±20 between executions, even when I set the number of repeats higher than 3, but I still got the idea.
After confirming that pthreads works, I experimented with BLIS_NUM_THREADS, and the problem turned out to be that it is overridden by the other environment variables. After unsetting them (instead of setting them all to 1), BLIS_NUM_THREADS also worked as expected and allocated threads automatically. All of these cases hold true for my program too. bli_thread_set_ways() also works, but bli_thread_get_num_threads() returns -1 in every case except BLIS_NUM_THREADS. So it is mostly resolved; the confusion was caused by bli_thread_get_num_threads() returning -1 and "1 1 1 1 1" yielding almost the same result as "2 1 2 2 2" in my case.
Only one problem remains: bli_thread_set_num_threads() has no effect, whether the environment variables are set or unset, and I have no explanation for that.
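Taking the best 500x500x500 rrr numbers from the two runs above gives a concrete speedup figure (the gflops values are copied from this thread; the efficiency interpretation is mine):

```python
single_thread = 55.56   # gflops with JC=IC=JR=IR=1
four_threads  = 151.54  # gflops with JC=2, IC=2 (4 threads total)

speedup = four_threads / single_thread
efficiency = speedup / 4
print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
# speedup: 2.73x, parallel efficiency: 68%
```

The sub-linear scaling is consistent with the run-to-run variance reported here, since the pthreads build does not pin threads to cores.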

@fgvanzee
Member

The gflops values of multithreaded tests varied about ±20 between executions even when I set number of repeats to a number bigger than 3, but still I got the idea.

This could be because BLIS does not make any attempt to bind threads to cores via CPU affinity when configured with pthreads. Unfortunately, pthreads has no "native" mechanism for specifying affinity; you would have to call an operating system function such as sched_setaffinity() (Linux only). By contrast, you could configure BLIS to use OpenMP (instead of pthreads), which would give you the ability to set affinity via environment variables (e.g. GOMP_CPU_AFFINITY, OMP_PROC_BIND).

I am planning to add a section to the Multithreading.md documentation that discusses affinity, particularly via OpenMP.

After confirming the pthreads are working, I experimented with BLIS_NUM_THREADS and problem seemed that it is overriden by other environment variables

Yes, this behavior is intentional. We had to decide which variable(s) would take precedence if both ways (automatic and manual) were employed, and we decided that any specification of the manual way should override the automatic way. Sorry you had to discover this empirically; the policy was very intentional on our part and should have been covered in the Multithreading.md documentation. I am planning to add a paragraph on the topic.
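In shell terms, this precedence rule means the manual variables must be unset, not merely set to 1, before BLIS_NUM_THREADS takes effect (./my_program stands in for any BLIS-linked binary):

```shell
# Manual per-loop variables override BLIS_NUM_THREADS, so clear them first:
unset BLIS_JC_NT BLIS_IC_NT BLIS_JR_NT BLIS_IR_NT
export BLIS_NUM_THREADS=6
# ./my_program    # BLIS now factorizes the 6 threads across loops itself
```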

I will look into the remaining issue regarding bli_thread_get_num_threads(). Thanks again for your feedback.

@fgvanzee
Member

bli_thread_get_num_threads returning -1 and "1 1 1 1 1" yielding almost same result as "2 1 2 2 2" in my case.

@Lephar I meant to comment on this in my previous reply: I found this to be a bit surprising, though entirely believable. Sometimes, oversubscribing threads relative to the number of physical cores causes downright awful performance, and it seems like that was happening in your case. (@dnparikh Your initial intuition looks correct in this case: his over-subscription did choke the CPU.)

I think I figured out the problem with bli_thread_get_num_threads(). The best way to characterize it is probably as a failure on my part to document an extra requirement when setting threading globally at runtime (that is, via bli_thread_set_num_threads() or bli_thread_set_ways()). But before I declare this issue totally figured out in my head, I'd like you to confirm the fix, which is to insert a function call to bli_init() sometime/anywhere before you call bli_thread_set_num_threads() or bli_thread_set_ways().
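The suggested workaround condenses to the following sketch. It will not build without a BLIS installation; the thread count of 6 follows the hardware discussion above, and the commented-out dgemm call mirrors the original report:

```c
#include "blis.h"

int main(void)
{
    /* Workaround under test in this thread: initialize the library
       explicitly BEFORE setting the global thread count at runtime. */
    bli_init();

    bli_thread_set_num_threads(6);  /* 6 physical cores; avoid hyperthreads */

    /* ... allocate and fill A, B, C, set the strides, then for example:
       bli_dgemm(BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE, m, n, k,
                 &alpha, A, rsa, csa, B, rsb, csb, &beta, C, rsc, csc); */

    bli_finalize();
    return 0;
}
```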

@fgvanzee
Member

@Lephar Happily, I realized that our bli_thread_set_num_threads() issue was actually not a failure of proper documentation, but rather a problem in the code. This is good news because it can easily be fixed and would not require any changes to the documentation. You can still go ahead and try calling bli_init() up-front since this is actually a sort of manual way of fixing the problem, whereas I am going to implement a more permanent fix that activates automatically.

@fgvanzee
Member

@Lephar Also, please try out 93d5631, which hopefully contains a permanent fix to the issue. It also includes additions to the Multithreading.md documentation.

Thanks to users like you, we are able to find little issues like this that might otherwise go unnoticed. We sincerely appreciate your feedback. :)

@Lephar
Author

Lephar commented Dec 18, 2018

This could be because BLIS does not make any attempt to bind threads to cores via CPU affinity when configured with pthreads. Unfortunately, pthreads has no "native" mechanism for specifying affinity; you would have to call an operating system function such as sched_setaffinity() (Linux only).

@Lephar I meant to comment on this in my previous reply: I found this to be a bit surprising, though entirely believable. Sometimes, oversubscribing threads relative to the number of physical cores causes downright awful performance, and it seems like that was happening in your case.

Yeah, I was expecting a performance penalty (or inconsistency between runs) caused by cache invalidation when using multithreading; I just wasn't expecting it to happen at around 8-16 threads on 4000x4000 matrices. But these operations work on very contiguous memory after all, so I should have realized they would be affected by locality much more than other operations.

Yes, this behavior is intentional. We had to decide which variable(s) would take precedence if both ways were employed (automatic and manual). We decided that any specification of the manual way should override the automatic way. Sorry you had to discover this empirically.

Well, this is the logical behaviour. It was my oversight to set them to their defaults instead of unsetting them.

But before I declare this issue totally figured out in my head, I'd like you to confirm the fix, which is to insert a function call to bli_init() sometime/anywhere before you call bli_thread_set_num_threads() or bli_thread_set_ways().

Yes, adding the bli_init() call solves the problem: the job finishes in the expected time, and bli_thread_get_num_threads() returns the correct value now. I also tried the master branch with commit 93d5631 and it works!
I want to thank you and the BLIS team for your efforts. It is my pleasure to be of any help on this.

@fgvanzee
Member

Great. I'll consider this issue closed. If you encounter any other problems, simply open another issue and let us know what you're seeing.

I was about to invite you to join our blis-devel mailing list, but then I noticed you were already a member. :) Thanks for your interest in BLIS. Please keep in touch.
