Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/pprof: GOOS=darwin over-reports CPU usage in mutex-heavy workload #68477

Open
rhysh opened this issue Jul 16, 2024 · 6 comments
Open
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@rhysh
Copy link
Contributor

rhysh commented Jul 16, 2024

Go version

$ ~/sdk/go1.23/bin/go version
go version go1.23rc2 darwin/arm64

Output of go env in your module/workspace:

$ ~/sdk/go1.23/bin/go env -changed
$

What did you do?

On an M1 Macbook Air, which I understand to have 8 cores of 1 thread each (4 performance, 4 efficiency), I ran the runtime's ChanContended benchmark while varying GOMAXPROCS, collecting a CPU profile for each.

$ for i in 1 2 4 8; do ~/sdk/go1.23/bin/go test runtime -run=^$ -bench=ChanContended -cpu=$i -cpuprofile=/tmp/cpu$i; done

What did you see happen?

The profile reports using an average of 0.62 threads of on-CPU time when GOMAXPROCS is 1, and 1.21 when GOMAXPROCS is 2.

However, when GOMAXPROCS is 4 and 8, it reports using 125.54 and 403.14 threads of on-CPU time respectively. It claims that the program consumed 930 seconds of on-CPU time over a period of 2.31 seconds of wall-clock time.

That's more than 50 times more hardware threads than the machine really has.

What did you expect to see?

I expected the profile to report no more than 100 CPU profile samples per thread-second.


This may be related to #57722 . In that issue, GOOS=darwin CPU profiles included the right number of samples but skewed them towards a particular portion of the workload. But here, the workload itself is experiencing extreme over-sampling.

$ for i in 1 2 4 8; do ~/sdk/go1.23/bin/go test runtime -run=^$ -bench=ChanContended -cpu=$i -cpuprofile=/tmp/cpu$i; done
goos: darwin
goarch: arm64
pkg: runtime
cpu: Apple M1
BenchmarkChanContended 	  446238	      2638 ns/op
PASS
ok  	runtime	2.113s
goos: darwin
goarch: arm64
pkg: runtime
cpu: Apple M1
BenchmarkChanContended-2   	  354793	      3331 ns/op
PASS
ok  	runtime	2.241s
goos: darwin
goarch: arm64
pkg: runtime
cpu: Apple M1
BenchmarkChanContended-4   	  231181	      5051 ns/op
PASS
ok  	runtime	1.560s
goos: darwin
goarch: arm64
pkg: runtime
cpu: Apple M1
BenchmarkChanContended-8   	  202797	      5902 ns/op
PASS
ok  	runtime	2.468s

$ for i in 1 2 4 8; do ~/sdk/go1.23/bin/go tool pprof /tmp/cpu$i </dev/null 2>&1 | grep Duration; done
Duration: 1.95s, Total samples = 1.21s (62.01%)
Duration: 2.09s, Total samples = 2.53s (121.18%)
Duration: 1.40s, Total samples = 175.74s (12553.94%)
Duration: 2.31s, Total samples = 929.99s (40314.39%)

A combined execution trace and CPU profile show a change in the density of the magenta "CPU profile sample" annotations at the time when the test transitions from GOMAXPROCS=2 to GOMAXPROCS=3. (It's like it's the opposite of #35057 , where GOOS=linux profiles under-sampled when the program's CPU usage crossed the 250% boundary.)

$ ~/sdk/go1.23/bin/go test runtime -run=^$ -bench=ChanContended -cpu=1,2,3,4,5,6,7,8 -cpuprofile=/tmp/cpu -trace=/tmp/cct
goos: darwin
goarch: arm64
pkg: runtime
cpu: Apple M1
BenchmarkChanContended     	  447841	      2642 ns/op
BenchmarkChanContended-2   	  354488	      3363 ns/op
BenchmarkChanContended-3   	  265177	      4586 ns/op
BenchmarkChanContended-4   	  237844	      5083 ns/op
BenchmarkChanContended-5   	  215295	      5618 ns/op
BenchmarkChanContended-6   	  207518	      5817 ns/op
BenchmarkChanContended-7   	  209977	      5872 ns/op
BenchmarkChanContended-8   	  206673	      5850 ns/op
PASS
ok  	runtime	11.268s

$ ~/sdk/go1.23/bin/go tool trace /tmp/cct
Screenshot 2024-07-16 at 4 20 20 PM
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 16, 2024
@rhysh
Copy link
Contributor Author

rhysh commented Jul 17, 2024

My OS version is macOS Sonoma 14.4.1.

The issue is also present with go1.17.13.

@cherrymui cherrymui added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 19, 2024
@cherrymui cherrymui added this to the Backlog milestone Jul 19, 2024
@mknyszek
Copy link
Contributor

mknyszek commented Jul 24, 2024

Wasn't there a bug in Darwin that CPU sampling is biased toward syscalls? That might explain this. #57722 maybe?

@rhysh
Copy link
Contributor Author

rhysh commented Jul 24, 2024

Yes, #57722 was on my mind too.

What I see here is overcounting, which is more extreme than "bias". Maybe in small doses, the overcounting appears as if it were (only) bias?

@prattmic
Copy link
Member

This seems like potentially an OS bug. Does it reproduce with a 4 thread C program?

If not trivially, I wonder if it is related to threads sleeping and waking (in lock2 presumably). Maybe each thread wake-up is erroneously receiving a SIGPROF?

@rhysh
Copy link
Contributor Author

rhysh commented Jul 25, 2024

OS bug seems likely, yes. I agree that a C reproducer would be a good next step. I have the next few weeks planned out with other work — thanks in advance if someone beats me to it — otherwise I'll take a stab at that later.

Yes, syscalls are related: In lock_sema.go, lock2 does a progression of 1/ procyield, 2/ osyield, 3/ semasleep. Steps 2 and 3 involve syscalls. If I remove either step 2 or 3, the problem persists. If I remove both of those, so lock2's only form of delay is procyield (no syscalls), then the magnitude of the CPU profile stays within the believable range.

I've updated to macOS Sonoma 14.5 and the issue is still present.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Development

No branches or pull requests

6 participants