[Job] Job queue fails to schedule pending jobs #1130
Comments
Ray has just released 2.0 (which fixes a lot of bugs). Could you give it a try? Feel free to keep assigning me if issues remain.
The problem still exists after upgrading to ray==2.0.0, but I did find reproducible code with
If we replace the
* fix placement group not scheduled issue
* Add original job queue test back
* fix smoke test
* make number of jobs smaller
* enlarge task numbers
* format
* Fix test_smoke
* format
* More accurate test_smoke by checking the PENDING jobs
* condition for on-prem case
* format
Oh, then it seems to be a Ray client issue. Ray client is still not very stable; try to avoid using
This problem seems to be caused by Ray: a placement group does not become ready for `ray.get()` after earlier ones are released.

Reproducible code (as mentioned in #1125):

After canceling the first 16 jobs, only 4 more jobs are scheduled.

This could be a bug in Ray's placement group scheduling. Our generated Ray program fails to return from `ray.get(pg.ready())`, although `ray status` and `ray.util.placement_group_table()` indicate that 16 placement groups were removed and 16 placement groups were scheduled.

Hey @suquark, since you are more familiar with Ray's placement groups, it would be very helpful if you could take a look at this problem.
Update:
The problem also appears after changing the CPU requirement of each task to 1.
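For anyone trying to compare what `ray.util.placement_group_table()` reports with what the program actually observes, a small diagnostic sketch may help. This is not the reproduction from #1125. The helper names `count_pg_states`, `wait_for_pg`, and `demo` are hypothetical, and the sketch assumes the table maps placement group IDs to dicts with a `"state"` entry. It waits on `pg.ready()` with a timeout instead of blocking forever in `ray.get()`.

```python
from collections import Counter
from typing import Any, Dict, Mapping


def count_pg_states(pg_table: Mapping[str, Mapping[str, Any]]) -> Dict[str, int]:
    """Count placement groups by reported state (hypothetical helper).

    Assumes each entry of the table is a dict with a "state" key,
    e.g. "PENDING", "CREATED", or "REMOVED".
    """
    return dict(Counter(info.get("state", "UNKNOWN") for info in pg_table.values()))


def wait_for_pg(pg, timeout_s: float = 30.0) -> bool:
    """Return True if pg.ready() resolves within timeout_s seconds."""
    import ray  # imported lazily so count_pg_states works without Ray installed

    # ray.wait returns (ready, not_ready); a timeout avoids hanging forever.
    ready, _ = ray.wait([pg.ready()], timeout=timeout_s)
    return len(ready) == 1


def demo() -> None:
    """Illustrative only: request one 1-CPU placement group and report states."""
    import ray
    from ray.util import placement_group, placement_group_table

    ray.init()
    pg = placement_group([{"CPU": 1}])
    if not wait_for_pg(pg, timeout_s=30.0):
        # Print what Ray itself reports when the group never becomes ready.
        print("still pending:", count_pg_states(placement_group_table()))
```

Calling `demo()` on the affected cluster after canceling jobs would show whether pending groups stay `PENDING` while the table already lists the canceled ones as `REMOVED`.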