[Job] Job queue fail to schedule pending jobs #1130

Closed
Michaelvll opened this issue Aug 26, 2022 · 3 comments · Fixed by #1134
Labels
help wanted Extra attention is needed

Comments

@Michaelvll
Collaborator

Michaelvll commented Aug 26, 2022

This problem seems to be caused by Ray: ray.get() on a pending placement group does not return, even after earlier placement groups have been released.

Reproduction steps (as mentioned in #1125):

sky launch -c repr --cloud gcp ''

for i in {1..100}; do
  sky exec -d repr "echo start; sleep 1000000000000000"
done

sky cancel repr 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

watch -n 5 "sky queue repr | grep RUNNING | wc -l"

After canceling the first 16 jobs, there are only 4 more jobs scheduled.

This could be a bug in Ray's placement group scheduling. Our generated Ray program never returns from ray.get(pg.ready()), although ray status and ray.util.placement_group_table() indicate that 16 placement groups were removed and 16 placement groups were scheduled.
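To compare what the cluster reports with what the jobs observe, the placement group table can be tallied by state. The following is a minimal sketch, not part of the original report: `tally_states` and `main` are hypothetical helper names, and it assumes that `ray.util.placement_group_table()` called without arguments returns a dict keyed by placement group ID whose values carry a `"state"` field.

```python
from collections import Counter


def tally_states(table):
    """Count placement groups per state.

    `table` is assumed to look like {pg_id: {"state": "CREATED", ...}, ...}.
    Entries without a "state" key are counted as "UNKNOWN".
    """
    return Counter(info.get("state", "UNKNOWN") for info in table.values())


def main():
    # Ray is imported lazily so that tally_states stays usable without it.
    import ray
    from ray.util import placement_group_table

    # Attach to the cluster that is already running on this node.
    ray.init(address="auto")
    for state, count in sorted(tally_states(placement_group_table()).items()):
        print(f"{state}: {count}")
```

Calling `main()` on the head node prints one line per state. If the removed and created counts match while jobs are still blocked on `pg.ready()`, the mismatch is on the waiting side rather than in the scheduler's bookkeeping.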

Hey @suquark, since you are more familiar with Ray's placement groups, it would be very helpful if you could take a look at this problem.

Update:
The problem also appears after changing the CPU requirement of each task to 1.

@Michaelvll Michaelvll changed the title [Ray] Placement group will not be got after released [Job] Job queue fail to schedule pending jobs Aug 26, 2022
@Michaelvll Michaelvll added the help wanted Extra attention is needed label Aug 26, 2022
@suquark
Collaborator

suquark commented Aug 26, 2022

Ray just released Ray 2.0 (which fixes a lot of bugs); could you give it a try? Feel free to keep me assigned if the issue remains.

@Michaelvll
Collaborator Author

Michaelvll commented Aug 29, 2022

The problem still exists after upgrading to ray==2.0.0, but I did find a reproduction that uses ray job directly, without going through SkyPilot's code path:

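# reproduce.py
import pathlib
import sys
import time

import ray

# Job index passed by reproduce.sh; used for the namespace and the marker file.
job_id = sys.argv[1]

# Connecting through the Ray client (ray://) is what triggers the problem.
ray.init('ray://localhost:10001', namespace=f'my-{job_id}', log_to_driver=True)

# Request half a CPU and block until the placement group is scheduled.
pg = ray.util.placement_group([{"CPU": 0.5}])
ray.get(pg.ready())

# Marker file showing that this job obtained its placement group.
pathlib.Path(f'{job_id}-ready').touch()

# Hold the placement group until the job is stopped.
while True:
    time.sleep(1)

# reproduce.sh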
#!/bin/bash

for i in {1..100}; do
    ray job submit --address=http://127.0.0.1:8265 --submission-id $i-job --no-wait python ./reproduce.py $i
done

sleep 10

for i in {1..16}; do
    ray job stop --address=http://127.0.0.1:8265 $i-job &
done

If we replace ray://localhost:10001 with auto, the problem disappears. According to the dashboard (thanks to the observability in ray==2.0.0), when connecting through ray://, each job is split into two jobs (one with a Job_ID and another with a Submission_ID).
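A variant of reproduce.py that connects with auto and bounds the wait might look like the sketch below. It is an illustration rather than the fix merged for this issue: `init_kwargs`, `main`, the `use_client` flag, and the 60-second default are all hypothetical, and it uses `PlacementGroup.wait(timeout_seconds=...)` instead of `ray.get(pg.ready())` so that a stuck placement group surfaces as an error instead of a silent hang.

```python
def init_kwargs(job_id, use_client=False):
    """Build ray.init() arguments for one job.

    use_client=True reproduces the original ray:// (Ray client) connection;
    the default connects directly to the local cluster with "auto".
    """
    address = "ray://localhost:10001" if use_client else "auto"
    return {
        "address": address,
        "namespace": f"my-{job_id}",
        "log_to_driver": True,
    }


def main(job_id, timeout_s=60.0):
    # Imports are local so that init_kwargs can be used without Ray installed.
    import pathlib

    import ray

    ray.init(**init_kwargs(job_id))

    pg = ray.util.placement_group([{"CPU": 0.5}])
    # wait() returns False if the group is not ready within the timeout.
    if not pg.wait(timeout_seconds=timeout_s):
        raise TimeoutError(
            f"placement group for job {job_id} not ready after {timeout_s}s")

    # Same marker file as in reproduce.py.
    pathlib.Path(f"{job_id}-ready").touch()
```

Unlike reproduce.py, this sketch returns once the marker file is written, so the placement group is not held afterwards; keep the sleep loop if the goal is to fill the cluster as in the reproduction.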

Michaelvll added a commit that referenced this issue Aug 29, 2022
* fix placement group not scheduled issue

* Add original job queue test back

* fix smoke test

* make number of jobs smaller

* enlarge task numbers

* format

* Fix test_smoke

* format

* More accurate test_smoke by checking the PENDING jobs

* condition for on-prem case

* format
@suquark
Collaborator

suquark commented Sep 8, 2022

Oh, then it seems to be a Ray client issue. Ray client is still not very stable; try to avoid ray:// as much as possible, because it goes through Ray client.
