[Job] Job queue fail to schedule pending jobs #1130

Michaelvll · 2022-08-26T23:10:56Z

This problem seems to be caused by Ray: ray.get() on a pending placement group never returns, even after earlier placement groups have been released.

Code to reproduce (as mentioned in #1125):

sky launch -c repr --cloud gcp ''

for i in {1..100}; do
  sky exec -d repr "echo start; sleep 1000000000000000"
done

sky cancel repr 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

watch -n 5 "sky queue repr | grep RUNNING | wc -l"

After canceling the first 16 jobs, only 4 more jobs are scheduled.

This could be a bug in Ray's placement group scheduling. Our generated Ray program fails to return from ray.get(pg.ready()), although ray status and ray.util.placement_group_table() indicate that 16 placement groups were removed and 16 placement groups were scheduled.
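One way to check the mismatch described above is to tally placement group states from ray.util.placement_group_table() and to bound the wait on pg.ready() with a timeout, so that a stuck group shows up as a timeout rather than a hang. This is a minimal sketch, assuming a Ray cluster is already running and that each table entry carries a "state" key; the helper names count_pg_states, wait_until_ready, and report_if_stuck and the 30-second timeout are illustrative and are not part of SkyPilot's generated program.

```python
from collections import Counter


def count_pg_states(table):
    """Tally placement groups by state.

    `table` is assumed to have the shape returned by
    ray.util.placement_group_table(): a dict mapping a placement group ID
    to an info dict that contains a "state" key.
    """
    return Counter(info.get("state", "UNKNOWN") for info in table.values())


def wait_until_ready(pg, timeout_s=30.0):
    """Return True if `pg` becomes ready within `timeout_s` seconds."""
    import ray  # imported lazily so count_pg_states works without Ray
    from ray.exceptions import GetTimeoutError

    try:
        ray.get(pg.ready(), timeout=timeout_s)
        return True
    except GetTimeoutError:
        return False


def report_if_stuck(pg, timeout_s=30.0):
    """Print per-state placement group counts if `pg` does not become ready."""
    from ray.util import placement_group_table

    if not wait_until_ready(pg, timeout_s):
        # Compare these counts against the output of `ray status`.
        print(dict(count_pg_states(placement_group_table())))
```

After ray.init() and pg = ray.util.placement_group([{"CPU": 0.5}]), calling report_if_stuck(pg) would print the state counts only when the group fails to become ready in time.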

Hey @suquark, since you are more familiar with Ray's placement groups, it would be very helpful if you could take a look at this problem.

Update:
The problem also appears after changing the CPU requirement of each task to 1.


suquark · 2022-08-26T23:25:02Z

Ray just released 2.0 (which fixes a lot of bugs); could you give it a try? Feel free to keep assigning me if issues remain.

Michaelvll · 2022-08-29T04:40:43Z

The problem still exists after upgrading to ray==2.0.0, but I did find a way to reproduce it with ray job directly, without going through SkyPilot's code path:

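# reproduce.py
import pathlib
import sys
import time

import ray

# Job index passed on the command line by reproduce.sh.
job_id = sys.argv[1]

# Connecting through ray:// goes through Ray Client.
ray.init('ray://localhost:10001', namespace=f'my-{job_id}', log_to_driver=True)

# Reserve half a CPU and block until the placement group is scheduled.
pg = ray.util.placement_group([{"CPU": 0.5}])
ray.get(pg.ready())

# Marker file showing that this job obtained its placement group.
pathlib.Path(f'{job_id}-ready').touch()

# Keep the job alive so the placement group stays held.
while True:
    time.sleep(1)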

# reproduce.sh
#!/bin/bash

for i in {1..100}; do
    ray job submit --address=http://127.0.0.1:8265 --submission-id $i-job --no-wait python ./reproduce.py $i
done

sleep 10

for i in {1..16}; do
    ray job stop --address=http://127.0.0.1:8265 $i-job &
done

If we replace ray://localhost:10001 with auto, the problem disappears. According to the dashboard (thanks to the observability in ray==2.0.0), with the ray:// address each job is split into two jobs (one with a Job_ID and another with a Submission_ID).
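For comparison, a variant of reproduce.py that defaults to address="auto" and only uses the ray:// address on request might look like the following. This is only a sketch: the build_init_kwargs helper, the RAY_REPRO_ADDRESS environment variable, and the timeout are hypothetical additions for switching between the two modes and are not part of the original script.

```python
import os
import pathlib
import time

# Hypothetical environment variable for switching connection modes.
ADDRESS_ENV = "RAY_REPRO_ADDRESS"


def build_init_kwargs(job_id, environ=None):
    """Build keyword arguments for ray.init().

    Defaults to address="auto" (connect directly to the running cluster);
    set RAY_REPRO_ADDRESS=ray://localhost:10001 to go through Ray Client.
    """
    environ = os.environ if environ is None else environ
    return {
        "address": environ.get(ADDRESS_ENV, "auto"),
        "namespace": f"my-{job_id}",
        "log_to_driver": True,
    }


def main(job_id, timeout_s=60.0):
    import ray  # imported lazily so the helper above is usable without Ray

    ray.init(**build_init_kwargs(job_id))
    pg = ray.util.placement_group([{"CPU": 0.5}])
    # Raises ray.exceptions.GetTimeoutError instead of blocking forever.
    ray.get(pg.ready(), timeout=timeout_s)
    pathlib.Path(f"{job_id}-ready").touch()

    # Keep the job alive so the placement group stays held.
    while True:
        time.sleep(1)
```

Calling main(sys.argv[1]) from the script entry point would keep the same command line as the original, so reproduce.sh could be reused unchanged to compare the two connection modes.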

* fix placement group not scheduled issue
* Add original job queue test back
* fix smoke test
* make number of jobs smaller
* enlarge task numbers
* format
* Fix test_smoke
* format
* More accurate test_smoke by checking the PENDING jobs
* condition for on-prem case
* format

suquark · 2022-09-08T22:55:52Z

Oh, then it seems to be a Ray Client issue. Ray Client is still not very stable; try to avoid using ray:// as much as possible, because that goes through Ray Client.

Michaelvll changed the title ~~[Ray] Placement group will not be got after released~~ [Job] Job queue fail to schedule pending jobs Aug 26, 2022

Michaelvll added the help wanted Extra attention is needed label Aug 26, 2022

Michaelvll mentioned this issue Aug 28, 2022

Increase thread limit and fix nofile limit #1128

Merged

5 tasks

Michaelvll mentioned this issue Aug 29, 2022

Fix placement group not scheduled issue (issue #1130) #1134

Merged

2 tasks

Michaelvll closed this as completed in #1134 Aug 29, 2022

Michaelvll mentioned this issue Aug 29, 2022

[Job] Submitting jobs without delay can cause jobs PENDING forever #1125

Closed
