Cluster unable to get resource on Azure A10 Instance #3310

Closed
binarycrayon opened this issue Mar 13, 2024 · 2 comments · Fixed by #3313
Labels
clouds Cloud support and cloud-specific features

Comments

@binarycrayon

Resources requested:

resources:
  cloud: azure
  ports: 8080
  accelerators: A10:1
  region: westus2

I am able to provision the instance, but the launch then blocks at: INFO: Waiting for task resources on 1 node. This will block if the cluster is full.

======== Autoscaler status: 2024-03-13 21:41:27.656076 ========
Node status
---------------------------------------------------------------
Active:
 1 ray.head.default
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/1.0 A10
 0.0/6.0 CPU
 0B/31.73GiB memory
 0B/15.87GiB object_store_memory

Demands:
 {'CPU': 0.5, 'A10': 1.0, 'GPU': 1.0} * 1 (STRICT_SPREAD): 1+ pending placement groups
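
One possible reading of the status output above: the pending bundle asks for a `GPU` resource, while the Usage section lists only A10, CPU, memory and object_store_memory. The sketch below simply compares the two sets of numbers copied from that output. The helper name `missing_resources` is hypothetical, and this is a simplified per-node check for illustration, not Ray's actual placement-group scheduling logic.

```python
# Illustrative only: a simplified feasibility check, not Ray's scheduler.
def missing_resources(available, demand):
    """Return the part of `demand` that `available` cannot satisfy."""
    return {
        name: amount - available.get(name, 0.0)
        for name, amount in demand.items()
        if available.get(name, 0.0) < amount
    }


# Totals copied from the "Usage" section of the status output above.
node_total = {"A10": 1.0, "CPU": 6.0}
# Bundle copied from the "Demands" section of the same output.
bundle = {"CPU": 0.5, "A10": 1.0, "GPU": 1.0}

shortfall = missing_resources(node_total, bundle)
print(shortfall)  # {'GPU': 1.0}
```

Under this reading, the node advertises the A10 accelerator by name but no generic `GPU` resource, which would be enough to keep the placement group pending even though the cluster is otherwise idle.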

binarycrayon changed the title from "Unable to get resource on Azure A10 Instance" to "Cluster unable to get resource on Azure A10 Instance" on Mar 13, 2024

@Michaelvll
Collaborator

Thank you for reporting this issue @binarycrayon! We just pushed a fix for this in #3313. Could you help test if it works with A10 GPUs on Azure, as we don't have the quota for A10 on Azure? : )

If you would like to test it, you can install the fix from that PR with:
pip uninstall skypilot skypilot-nightly; pip install git+https://github.com/skypilot-org/skypilot.git@bcac2d764ae5e5fcac8fd64549888573a0b1d39a

Michaelvll added the clouds (Cloud support and cloud-specific features) label on Mar 14, 2024

@binarycrayon
Author

Yes, confirmed the fix worked. Thanks so much for the quick fix!

I 03-14 20:22:51 cloud_vm_ray_backend.py:4237] Creating a new cluster: 'dialogue-choice-gemma-2b' [1x Azure(Standard_NV6ads_A10_v5, {'A10': 1}, ports=['8080'])].
I 03-14 20:22:51 cloud_vm_ray_backend.py:4237] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 03-14 20:22:57 cloud_vm_ray_backend.py:1364] To view detailed progress: tail -n100 -f /home/../sky_logs/sky-2024-03-14-20-22-48-834635/provision.log
I 03-14 20:22:58 cloud_vm_ray_backend.py:1754] Launching on Azure westus2
I 03-14 20:25:28 log_utils.py:45] Head node is up.
I 03-14 20:28:16 cloud_vm_ray_backend.py:1602] Successfully provisioned or found existing VM.
I 03-14 20:28:20 cloud_vm_ray_backend.py:3076] Running setup on 1 node.
