
ERROR when using multi-gpu training #18

Closed
tsWen0309 opened this issue Nov 19, 2023 · 7 comments

@tsWen0309

Hi, thanks for sharing your work. I can't train your model on my GPUs (two RTX 4090s). Is there any solution?

[screenshot of the error attached]

@LYMDLUT

LYMDLUT commented Nov 22, 2023

> Hi, thanks for sharing your work. I can't train your model on my GPUs (two RTX 4090s). Is there any solution?

I have run into the same bug.

@christopher-beckham

christopher-beckham commented Nov 26, 2023

I had a similar issue; make sure you're using PyTorch 1.12, as per the environment.yml file.
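For anyone checking, a quick way to confirm which PyTorch/CUDA/cuDNN build an environment is actually using (a minimal check, not part of the repo):

```python
# Minimal environment check (not part of the repo): print the PyTorch, CUDA, and
# cuDNN versions visible to the current interpreter, plus the number of GPUs.
import torch

print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("cudnn:", torch.backends.cudnn.version())
print("gpus:", torch.cuda.device_count())
```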

@tsWen0309 (Author)

> Hi, thanks for sharing your work. I can't train your model on my GPUs (two RTX 4090s). Is there any solution?
>
> I have run into the same bug.

I solved the problem by using PyTorch 1.13.1 with CUDA 11.6 and cuDNN 8.3.2.0.

@RachelTeamo

I tried this code. When I set --nproc_per_node=1 the code works fine, but as soon as --nproc_per_node>1 (e.g., --nproc_per_node=2) it fails with the same error as in the screenshot above. Is there a solution for this? My torch version is 2.1 because I'm using an H800 GPU.

@marcoamonteiro

@RachelTeamo to run on torch 2.1, replace line 89 in training_loop.py:
ddp = torch.nn.parallel.DistributedDataParallel(net, device_ids=[device], broadcast_buffers=False)
with
ddp = torch.nn.parallel.DistributedDataParallel(net, device_ids=[dist.get_rank()], broadcast_buffers=False)

I couldn't find anything in the PyTorch docs warning about this change in the DDP API, but it solved the issue for me.
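For context, a minimal sketch of what that change amounts to (assumptions: one process per GPU on a single node, so dist.get_rank() doubles as the local GPU index; this is not the repo's exact code):

```python
# Minimal per-process DDP setup sketch (not the repo's code). Assumes one process
# per GPU on a single node, so the global rank doubles as the local GPU index.
import torch
import torch.distributed as dist

def wrap_for_ddp(net: torch.nn.Module) -> torch.nn.parallel.DistributedDataParallel:
    rank = dist.get_rank()          # global rank == local GPU index on a single node
    torch.cuda.set_device(rank)     # bind this process to its own GPU
    net = net.to(rank)              # move parameters and buffers onto that GPU
    # device_ids takes a single-element list naming this process's device; an explicit
    # integer index avoids handing DDP an ambiguous or mismatched torch.device object.
    return torch.nn.parallel.DistributedDataParallel(
        net, device_ids=[rank], broadcast_buffers=False)
```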

@RachelTeamo

Thanks for your suggestion. I replaced the code as you described, but the issue still exists.

@Shiien

Shiien commented Mar 22, 2024

I solved this by commenting out lines 79-84:

# if dist.get_rank() == 0:
#     with torch.no_grad():
#         images = torch.zeros([batch_gpu, net.img_channels, net.img_resolution, net.img_resolution], device=device)
#         sigma = torch.ones([batch_gpu], device=device)
#         labels = torch.zeros([batch_gpu, net.label_dim], device=device)
#         misc.print_module_summary(net, [images, sigma, labels], max_nesting=2)

and by setting
ddp = torch.nn.parallel.DistributedDataParallel(net, device_ids=[dist.get_rank()], broadcast_buffers=False)
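If it helps to separate machine or driver problems from the repo's training code, here is a tiny standalone DDP smoke test (a sketch; the file name ddp_smoke_test.py and the torchrun invocation are my own, not from the repo):

```python
# Minimal DDP smoke test (not from the repo). Save as ddp_smoke_test.py (hypothetical
# name) and launch with, e.g.:
#   torchrun --standalone --nproc_per_node=2 ddp_smoke_test.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")   # torchrun supplies the rendezvous env vars
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    net = torch.nn.Linear(8, 8).to(rank)
    ddp = torch.nn.parallel.DistributedDataParallel(
        net, device_ids=[rank], broadcast_buffers=False)

    x = torch.randn(4, 8, device=rank)
    loss = ddp(x).sum()
    loss.backward()                            # exercises the gradient all-reduce across ranks
    print(f"rank {rank}: ok, loss={loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If this script runs cleanly on both GPUs but the training script still fails, the problem is more likely in how the repo constructs the DDP wrapper than in the CUDA/NCCL setup.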
