Fixes ray dashboard hanging problem (#1088) #1109

Merged (2 commits), Aug 21, 2022

Michaelvll
Collaborator

Closes #1088.

This fixes the Ray dashboard hanging problem. By inspecting the output of py-spy dump --locals --pid <dashboard.py pid>, we found that the dashboard has leaked _monitor_job threads from dashboard/modules/job/job_manager.py. The problem is caused by await job_supervisor.ping.remote(), which may fail to raise the expected exception after the job_supervisor actor has exited.
After switching the await to ray.get, the problem appears to be solved.
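
The failure mode described above can be illustrated without Ray. The sketch below is not the patch from this PR (which replaces the await with ray.get). It uses only Python's asyncio, with a hypothetical FakeSupervisor class standing in for the job_supervisor actor, to show how a monitoring loop stalls when an awaited ping neither returns nor raises, and how bounding the wait with asyncio.wait_for (a generic, swapped-in technique) lets the loop exit. The names monitor_job, ping_timeout, and interval are assumptions made for illustration.

```python
import asyncio


class FakeSupervisor:
    """Hypothetical stand-in for the job_supervisor actor (not Ray's API)."""

    def __init__(self) -> None:
        self.alive = True

    async def ping(self) -> str:
        if self.alive:
            return "pong"
        # Simulate the reported failure mode: once the actor has exited,
        # the call neither returns nor raises.
        await asyncio.Event().wait()
        return "unreachable"


async def monitor_job(
    supervisor: FakeSupervisor,
    ping_timeout: float = 0.1,
    interval: float = 0.01,
) -> str:
    """Ping until the supervisor stops answering, then report that it is gone."""
    pings = 0
    while True:
        try:
            # Bound each ping so a silent supervisor cannot stall the loop.
            await asyncio.wait_for(supervisor.ping(), timeout=ping_timeout)
        except asyncio.TimeoutError:
            return f"supervisor unresponsive after {pings} pings"
        pings += 1
        await asyncio.sleep(interval)


async def demo() -> str:
    supervisor = FakeSupervisor()
    task = asyncio.create_task(monitor_job(supervisor))
    await asyncio.sleep(0.05)   # let a few pings succeed
    supervisor.alive = False    # the "actor" exits silently
    return await asyncio.wait_for(task, timeout=2.0)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Without the wait_for bound, the ping call in this sketch would wait forever once alive is set to False, which resembles the leaked _monitor_job behavior reported here.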

Tested:

  • Ran the reproduction code from the issue description; the problem does not occur.
  • Ran the user's program; it did not get stuck after submitting 500 spot jobs overnight. (Previously, it would get stuck after 100-200 jobs.)

Member

@concretevitamin concretevitamin left a comment

This is perhaps the toughest bug to track down so far @Michaelvll! Great debugging.

Consider running the smoke tests before merging, as it's deep in the job submission code path.

@Michaelvll
Collaborator Author

Michaelvll commented Aug 21, 2022

Good point!
Tested:

  • tests/run_smoke_tests.sh

@Michaelvll
Collaborator Author

The target user's program has now successfully submitted 2000 jobs. The problem should be solved. ; )

@Michaelvll Michaelvll merged commit c224819 into master Aug 21, 2022
@Michaelvll Michaelvll deleted the patch-job-manager branch August 21, 2022 23:38
@Michaelvll Michaelvll mentioned this pull request Aug 22, 2022
3 tasks
Successfully merging this pull request may close these issues.

[Spot] Ray dashboard hangs making ray job commands not responding