
TLSWrap leaking huge amounts of memory #3047

Closed
arthurschreiber opened this issue Sep 24, 2015 · 13 comments
Labels
memory Issues and PRs related to the memory management or memory footprint. tls Issues and PRs related to the tls subsystem.

Comments

@arthurschreiber

We recently upgraded our application to node 4.0 (and 4.1). During load and performance testing we recognized that memory usage is going through the roof.

Here's a graph that shows this:

[graph: memory usage during the load test]

Basically, with the start of the load test, memory usage quickly climbed from ~150MB up to ~1GB and then stabilized at around 800MB.

I went ahead and took a heapdump and opened it in the Chrome Developer Tools.

[screenshot: heap snapshot in Chrome Developer Tools]

Here, you can see that ~400MB are used by 2 TLSWrap instances.

(EDIT: This is on 4.1, but we've seen this also on 4.0)

//cc @indutny @ChALkeR

@ChALkeR ChALkeR added the memory Issues and PRs related to the memory management or memory footprint. label Sep 24, 2015
@mscdex mscdex added the tls Issues and PRs related to the tls subsystem. label Sep 24, 2015
@PSanni

PSanni commented Sep 24, 2015

Same issue here.
It's consuming a massive amount of memory.

@indutny
Member

indutny commented Sep 24, 2015

Hello! Thanks for submitting this. Is there any way to reproduce it locally?

@arthurschreiber
Author

Nope, no idea. So far I've made several attempts at reproducing this locally, and none of them worked.

I pulled heapdumps from many other workers we have running (we're using the node.js cluster module), and I can see similar behaviour on other nodes as well. It looks like some of these TLSWrap instances are simply left behind and never freed.

I know that a locally reproducible test case would be awesome, but I don't have any. 😞 Any ideas on how I could provide more info on this issue?

@yjhjstz

yjhjstz commented Sep 25, 2015

If you don't mind, could you upload the heap dump? I can analyze it further. @arthurschreiber

@arthurschreiber
Author

@yjhjstz Can I send you the dump via email? I don't want to upload it to a public place, as I'm not sure how much sensitive data is in there. I can compress the dumps down to ~30MB. Is that ok?

@arthurschreiber
Author

Ok, this is fixed by #3059.

Our application has an uncaughtException handler which does catch exceptions and forces the process to continue. (We know this is bad, and are actively working on changing it to gracefully restart the process instead).

The issue fixed by #3059, combined with this uncaughtException handler, caused requests and their associated TLSWrap objects to leak and never be collected.
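
For illustration, a minimal sketch of the kind of handler described above (the logging is hypothetical, not our actual code): swallowing the exception keeps the process alive, so any request that was in flight when it fired never completes and its TLSWrap stays reachable.

```js
// Hypothetical sketch of the anti-pattern described above (not the actual code).
// Swallowing the exception keeps the process alive, but any request that was in
// flight when the exception fired never completes, so its socket and TLSWrap
// stay reachable and are never garbage collected.
process.on('uncaughtException', (err) => {
  console.error('uncaught exception, continuing anyway:', err.stack);
  // no process.exit() here: the event loop just keeps running
});
```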

Thanks everyone for the help!

@ChALkeR
Member

ChALkeR commented Sep 26, 2015

Everything below is off-topic and represents my personal opinion (well, except the «don't continue after an uncaughtException» part).

Our application has an uncaughtException handler which does catch exceptions and forces the process to continue. (We know this is bad, and are actively working on changing it to gracefully restart the process instead).

That is very bad and leaves the process in a corrupted state.

Also, restarting your process from within the uncaughtException handler is not an ideal approach either: your process could die for other reasons, for example an assertion on the C++ side (yes, that would be a bug) or an OOM. Or your process could hang if the logic is broken somewhere in your code. None of that would be covered by restarting the process from within the uncaughtException handler.

IMO, the ideal approach in production would be a system-level supervisor that launches your process and automatically restarts it when it dies for any reason. Also, by sending periodic «I'm alive» pings from your process to the supervisor, you can protect your process from hangs: if the process doesn't send a ping within the given time, the supervisor will restart it.

See http://0pointer.de/blog/projects/watchdog.html for an example.
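
As a rough illustration of the supervisor-plus-ping idea (the file names, the 'alive' message, and the timings below are made-up assumptions, not any particular library's API):

```js
// supervisor.js: a minimal sketch of the idea above. Restart the worker when
// it exits, and also kill and restart it when it stops sending periodic pings
// over the IPC channel.
const fork = require('child_process').fork;

const PING_TIMEOUT_MS = 10000;

function start() {
  const worker = fork('./worker.js'); // hypothetical worker entry point
  let watchdog = armWatchdog(worker);

  worker.on('message', (msg) => {
    if (msg === 'alive') {            // worker pings us periodically
      clearTimeout(watchdog);
      watchdog = armWatchdog(worker);
    }
  });

  worker.on('exit', (code, signal) => {
    clearTimeout(watchdog);
    console.error('worker died (code %s, signal %s), restarting', code, signal);
    start();
  });
}

function armWatchdog(worker) {
  return setTimeout(() => {
    console.error('no ping received in time, killing hung worker');
    worker.kill('SIGKILL');           // the exit handler above restarts it
  }, PING_TIMEOUT_MS);
}

start();

// In worker.js, the ping would be something like:
//   setInterval(() => process.send('alive'), 5000);
```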

@arthurschreiber
Author

@ChALkeR Thanks for your insight.

The problem is that just restarting the application while a lot of requests are in flight is really uncool.

We already have a process supervisor and a watchdog + ping in place, so that's already covered.

The approach we want to take is to use the uncaught exception handler to try to shut everything down gracefully, but to force a shutdown after some time in case the exception has left the process in a state where a graceful shutdown is not possible.
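
Roughly, a minimal sketch of that approach (the `server` variable, the 30-second timeout, and the exit code are assumptions for illustration):

```js
// Sketch only: `server` is assumed to be the HTTPS/TLS server handling requests.
process.on('uncaughtException', (err) => {
  console.error('uncaught exception, shutting down:', err.stack);

  // Stop accepting new connections and exit once in-flight requests drain.
  server.close(() => process.exit(1));

  // If draining takes too long (or the process is too broken to shut down
  // cleanly), force an exit after 30 seconds.
  const killTimer = setTimeout(() => process.exit(1), 30000);
  killTimer.unref(); // don't let this timer itself keep the process alive
});
```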

@indutny
Member

indutny commented Sep 26, 2015

LET THE UNCAUGHT EXCEPTION CONTROVERSY BEGIN

@arthurschreiber
Author

LET THE UNCAUGHT EXCEPTION CONTROVERSY BEGIN

@defunctzombie
Contributor

@ChALkeR The issue with uncaughtException is that it can range from a typo in JavaScript code to something fundamentally wrong with sockets (as in this case). Just dying and restarting is indeed destructive for in-flight requests when the cause is a typo that realistically might affect only a single user or call, versus taking the whole system down, and there may be a non-trivial cost to booting back up. I'm not arguing one way or the other, but it is important to understand why uncaughtException leads folks to try to continue: sometimes you just miss a silly thing and there are no consequences to continuing.

@Fishrock123
Contributor

Refs: #3068

@its-eli-z

Hey guys,

[screenshot: heap snapshot showing multiple large TLSWrap objects]

I have the same issue with these TLSWrap objects. After a few hours of running, there are more than 10 huge objects per worker.

  • Happens with node v0.12.7, v4.1.1, and v4.1.2.
  • I'm running forever with the cluster module (2 workers).
  • Normal memory usage is ~450MB; after a few hours it suddenly jumps to 3.5GB (at which point latency gets too high and my load balancer takes the machine out of rotation). My heapdump looks almost the same as the one posted earlier.
  • I have an uncaughtException handler that just logs the error and then immediately calls process.exit(). Isn't that the same as when node crashes and forever automatically restarts it?

Any ideas?
Thanks,
