Does ME 0.5.4 support A100? #445

Closed
wangfudong opened this issue Feb 22, 2022 · 3 comments

@wangfudong

Sorry to bother you again; I have already read some similar ME issues, such as issue#330, issue#350, and issue#52.

My problem
I built ME 0.5.4 in an Anaconda virtual environment with:
pytorch=1.7.1;
cudatoolkit=11.0 or 10.2 (matching the system CUDA, 11.0 or 10.2 respectively)

and system:
ubuntu 18.04
nvidia driver: 450.80.2 (or 450.102.04)
gcc 7.5.0

With the environment above, ME 0.5.4 passes my tests (including ME.Conv, ME.BN, ME.ReLU, ME.interpolation, and loss.backward) on T4 and P40 GPUs, but fails on an A100 with the error 'cudaErrorNoKernelImageForDevice no kernel image is available for execution on the device'.
The detailed error output:
{"@timestamp":"2022-02-22 00:14:15.936","@message":" sparse_tensor = ME.SparseTensor(code, coord_sparse)"}
{"@timestamp":"2022-02-22 00:14:15.936","@message":" File "/opt/conda/envs/3dr3_cu113/lib/python3.8/site-packages/MinkowskiEngine/MinkowskiSparseTensor.py", line 275, in __init__"}
{"@timestamp":"2022-02-22 00:14:15.936","@message":" coordinates, features, coordinate_map_key = self.initialize_coordinates("}
{"@timestamp":"2022-02-22 00:14:15.936","@message":" File "/opt/conda/envs/3dr3_cu113/lib/python3.8/site-packages/MinkowskiEngine/MinkowskiSparseTensor.py", line 304, in initialize_coordinates"}
{"@timestamp":"2022-02-22 00:14:15.936","@message":" ) = self._manager.insert_and_map(coordinates, *coordinate_map_key.get_key())"}
{"@timestamp":"2022-02-22 00:14:15.936","@message":" File "/opt/conda/envs/3dr3_cu113/lib/python3.8/site-packages/MinkowskiEngine/MinkowskiCoordinateManager.py", line 179, in insert_and_map"}
{"@timestamp":"2022-02-22 00:14:15.936","@message":" return self._manager.insert_and_map(coordinates, tensor_stride, string_id)"}
{"@timestamp":"2022-02-22 00:14:15.936","@message":"RuntimeError: CUDA error encountered at: /tmp/pip-req-build-16c08htu/src/3rdparty/concurrent_unordered_map.cuh:595: 209 cudaErrorNoKernelImageForDevice no kernel image is available for execution on the device"}

At first, I guessed it might be caused by an incompatibility between PyTorch 1.7 and the compute capability of the A100. However, pytorch-1.7.1+cuda-11.0+driver-450.80.2 does support the A100 (I ran a simple network without ME and it passed).
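For readers hitting the same error: "no kernel image is available" generally means the compiled extension contains no GPU code that the current device can run. The T4 and P40 have compute capability 7.5 and 6.1, while the A100 is 8.0, so a build that only targeted the older cards would fail in this way. The sketch below is a rough check under that assumption. The helper names `parse_arch_list` and `can_run_on` are hypothetical (they are not part of ME or PyTorch), and the function only handles numeric entries such as "7.5" or "8.0+PTX", not named architectures.

```python
def parse_arch_list(arch_list):
    """Parse a TORCH_CUDA_ARCH_LIST-style string such as "7.0 7.5+PTX".

    Returns a list of ((major, minor), has_ptx) entries.
    Assumption: only numeric entries are used (no names like "Ampere").
    """
    entries = []
    for token in arch_list.replace(";", " ").split():
        has_ptx = token.endswith("+PTX")
        major, minor = token.replace("+PTX", "").split(".")
        entries.append(((int(major), int(minor)), has_ptx))
    return entries


def can_run_on(capability, arch_list):
    """Rough check: could a build for `arch_list` run on a device with `capability`?

    Rules applied (standard CUDA compatibility, simplified):
      * binary code for X.y runs on devices X.z with z >= y (same major version)
      * PTX for X.y can be JIT-compiled on any device with capability >= X.y
    """
    cap = tuple(capability)
    for arch, has_ptx in parse_arch_list(arch_list):
        binary_ok = arch[0] == cap[0] and arch[1] <= cap[1]
        ptx_ok = has_ptx and arch <= cap
        if binary_ok or ptx_ok:
            return True
    return False


# A build that only targeted a T4 (7.5) has nothing an A100 (8.0) can run.
print(can_run_on((8, 0), "7.5"))
# The same build is fine on the T4 itself.
print(can_run_on((7, 5), "7.5"))
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability(0)` returns the tuple to pass as `capability`. Which architectures a given ME build actually contains depends on how it was compiled, so the arch list passed in here is something you have to know or guess from the build environment.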

Have you tested ME on the A100, and does it work well?
Thank you very much~

@wangfudong

The problem has been solved by using Docker.

@RozDavid

As a note for others who find this issue like me but don't want to use Docker for training: we just have to add export TORCH_CUDA_ARCH_LIST="6.0 6.1 6.2 7.0 7.2 7.5 8.0 8.6" to our script before pip installing ME.
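A minimal sketch of the same idea driven from Python, for anyone who installs from a setup script rather than a shell. It sets TORCH_CUDA_ARCH_LIST only for the pip subprocess. The helper names, the package spec "MinkowskiEngine==0.5.4", and the `--no-cache-dir` flag are my assumptions, not something stated in this thread; the flag is there so that pip does not reuse a wheel it built earlier for a different architecture list.

```python
import os
import subprocess
import sys

# Architecture list taken from the comment above.
ARCH_LIST = "6.0 6.1 6.2 7.0 7.2 7.5 8.0 8.6"


def build_env(arch_list=ARCH_LIST, base_env=None):
    """Return a copy of the environment with TORCH_CUDA_ARCH_LIST set."""
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_CUDA_ARCH_LIST"] = arch_list
    return env


def install_command(package="MinkowskiEngine==0.5.4"):
    """Build the pip command (package spec is an assumption, adjust as needed)."""
    return [sys.executable, "-m", "pip", "install", "--no-cache-dir", package]


def install(arch_list=ARCH_LIST):
    """Run pip with the architecture list set; raises if the build fails."""
    subprocess.run(install_command(), env=build_env(arch_list), check=True)


# Nothing is installed on import; call install() explicitly, e.g.:
#   install("7.5 8.0")
```

One caveat, as far as I know: the CUDA toolkit has to know every architecture in the list. nvcc gained 8.0 support in CUDA 11.0 and 8.6 in CUDA 11.1, so with the CUDA 11.0 setup from the original report the 8.6 entry may need to be dropped, and CUDA 10.2 cannot target the A100 at all.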

@Xnhyacinth

by using

How did you solve it?
