eBPF event loop blockage finder #569

Closed
kvakil opened this issue Jul 11, 2022 · 10 comments

Comments

@kvakil

kvakil commented Jul 11, 2022

Hi -- I'd like to share an eBPF use-case which I found quite useful in
my day job. (Please let me know if there is a better forum to do this.)

We were experiencing long (10s+) event loop blockages, which were
affecting our performance. We were alerted to this issue by
node-blocked.

We had two initial ideas:

  1. CPU profiles: but the overhead felt too high especially since the
    blockages only happened rarely.
  2. async_hooks (specifically blocked-at): but the overhead was
    unacceptable.

The solution we landed on used eBPF, particularly bpftrace:

/* Whenever any thread enters uv__run_timers, record the current time
   in nanoseconds in a map. */
u:NODE_PATH:uv__run_timers { @[tid] = nsecs; }

/* Whenever any thread returns from uv__run_check, clear its time from
   the map. */
ur:NODE_PATH:uv__run_check /@[tid]/ { delete(@[tid]); }

/* 99 times a second, check if any running thread has been blocked
   for longer than 10 seconds. If so, take a core dump and stop
   this script. Note: system() requires running bpftrace with
   --unsafe. */
p:hz:99 /@[tid]/ {
    if (nsecs - @[tid] > 10000000000) {
        system("gcore %d", pid);
        exit();
    }
}

We ran this script on a bunch of machines, and eventually it spit out a
coredump. We opened the coredump with llnode and found the cause
via v8 backtrace.

Questions for this group

  • I will also create a separate issue about how llnode is no longer
    supported, but I think this is still useful functionality. For
    example, you can use it to get histograms of event loop blockages
    which is independently useful for workload characterization.

  • On Node's side, it would be nicer if event loop stages were exposed as
    stable tracepoints instead of uprobes. This would make it easier for
    people to package similar tools.

  • One could also imagine a weaker version of this functionality being
    built into Node.js: collecting a JavaScript backtrace when the event
    loop has been blocked for too long. From talking with other
    engineers, I've heard that attributing event loop blockages is a
    common problem when running Node.js at scale. Is there interest in
    having this in Node.js core?
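
The histogram idea from the first bullet could look roughly like the sketch below. It is an illustration, not the script from this report: it reuses the same two uprobes and the NODE_PATH placeholder, and the map names @start and @iteration_ms are made up for the example. The measured span also covers the poll phase, so time a thread spends asleep waiting for I/O is counted, and long buckets on an idle process are not necessarily blockages.

```bpftrace
/* Record when each thread enters uv__run_timers. */
u:NODE_PATH:uv__run_timers { @start[tid] = nsecs; }

/* On return from uv__run_check, add the elapsed time (in milliseconds)
   to a power-of-two histogram and clear the start time. */
ur:NODE_PATH:uv__run_check /@start[tid]/ {
    @iteration_ms = hist((nsecs - @start[tid]) / 1000000);
    delete(@start[tid]);
}

/* Drop the bookkeeping map so only the histogram is printed on exit. */
END { clear(@start); }
```

bpftrace prints @iteration_ms when the script exits, for example on Ctrl-C.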

@kvakil
Author

kvakil commented Jul 11, 2022

  • One could also imagine a weaker version of this functionality being
    built into Node.js: collecting a JavaScript backtrace when the event
    loop has been blocked for too long. From talking with other
    engineers, I've heard that attributing event loop blockages is a
    common problem when running Node.js at scale. Is there interest in
    having this in Node.js core?

Here is a PoC to show what I mean: https://git.sr.ht/~kvakil/fast-blocked-at/tree
See native.cc for the native code and test.js for the example usage.

@RafaelGSS
Member

I will also create a separate issue about how llnode is no longer
supported, but I think this is still useful functionality. For
example, you can use it to get histograms of event loop blockages
which is independently useful for workload characterization.

For llnode, see: nodejs/node#43289

On Node's side, it would be nicer if event loop stages were exposed as
stable tracepoints instead of uprobes. This would make it easier for
people to package similar tools.

Yes, that's a good improvement indeed.


Basically, in our last WG meeting, I said to wait a while regarding eBPF support, because I'm definitely interested in helping with that. Unfortunately, I have had no time recently to investigate, but as soon as possible I'll try to collect what Node.js would need to support it.

Feel free to investigate too; any help is welcome! I'd say that as long as it doesn't hurt Node.js performance, it will definitely be accepted by the team.

@theanarkh

theanarkh commented Jul 14, 2022

I previously wrote a demo that uses eBPF to measure the cost of each libuv phase. I think eBPF is very useful for solving this kind of problem.
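
For readers who cannot open the linked demo, per-phase timing with bpftrace might be sketched as follows. This is not the demo's code. It assumes the libuv symbols uv__run_timers and uv__io_poll are present and not inlined in the node binary, NODE_PATH stands for the path to that binary, and the map names are invented for illustration. Only two phases are shown; others would follow the same pattern.

```bpftrace
/* Timers phase: accumulate total nanoseconds spent in uv__run_timers. */
u:NODE_PATH:uv__run_timers { @t_timers[tid] = nsecs; }
ur:NODE_PATH:uv__run_timers /@t_timers[tid]/ {
    @timers_ns = sum(nsecs - @t_timers[tid]);
    delete(@t_timers[tid]);
}

/* Poll phase: uv__io_poll, including time spent asleep waiting for I/O. */
u:NODE_PATH:uv__io_poll { @t_poll[tid] = nsecs; }
ur:NODE_PATH:uv__io_poll /@t_poll[tid]/ {
    @poll_ns = sum(nsecs - @t_poll[tid]);
    delete(@t_poll[tid]);
}

/* Drop the bookkeeping maps so only the totals are printed on exit. */
END { clear(@t_timers); clear(@t_poll); }
```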

@RafaelGSS
Member

I previously wrote a demo that uses eBPF to measure the cost of each libuv phase. I think eBPF is very useful for solving this kind of problem.

It seems private. 404.

@theanarkh

theanarkh commented Jul 14, 2022

I previously wrote a demo that uses eBPF to measure the cost of each libuv phase. I think eBPF is very useful for solving this kind of problem.

It seems private. 404.

Oh, sorry, fixed: the link (src/uv_xxx files).

@kvakil
Author

kvakil commented Jul 14, 2022

I previously wrote a demo that uses eBPF to measure the cost of each libuv phase. I think eBPF is very useful for solving this kind of problem.

Thanks for the confirmation that this approach has been useful to
others. The accompanying blog post shows one of the big problems
with uprobes: there are cases (usually inlining) where attaching the
probe fails. A stable tracepoint interface would be better from that
perspective.

On Node's side, it would be nicer if event loop stages were exposed as
stable tracepoints instead of uprobes. This would make it easier for
people to package similar tools.

Yes, that's a good improvement indeed.

Basically, in our last WG meeting, I said to wait a while regarding eBPF support, because I'm definitely interested in helping with that. Unfortunately, I have had no time recently to investigate, but as soon as possible I'll try to collect what Node.js would need to support it.

I appreciate the insight. I think that the next step here is to
re-implement eBPF probes in libuv, e.g. through libuv/libuv#2209. There
is also some previous discussion on whether native tracepoints are
useful in this WG (#10, #163).

@theanarkh

Yes, I think the problem with uprobes/kprobes is that function names may change. I'm looking forward to seeing libuv support these capabilities (tracing the event loop).

@kvakil
Author

kvakil commented Jul 18, 2022

It turns out Node.js recently removed the dtrace probes:
nodejs/node#43652. It seems sort of like native tracing capabilities
are not considered "first-class" in the node ecosystem, so pursuing this
further seems a little futile. :\

kvakil closed this as completed Jul 18, 2022

@RafaelGSS
Member

Anyway, I'll investigate it. For further updates, subscribe to: #535

@JiaHuann

JiaHuann commented Apr 13, 2023

Yes, I think the problem with uprobes/kprobes is that function names may change. I'm looking forward to seeing libuv support these capabilities (tracing the event loop).

Actually, I want to work on this as well. The "metrics" API seems to have some overhead from additional "epoll_wait" calls (libuv-issues-3937). Using eBPF instead is a great idea, but I am not sure how much work it would take.
