🐛 Bug
The whole point of the Pipe module is to split a batch into chunks microbatches and process them through the pipeline stages, so that different microbatches run on different GPUs at the same time. The benchmark in benchmarks/transformer.py doesn't specify chunks, so it defaults to chunks=1 and never exercises the microbatch logic. Worse, changing the benchmark to set chunks=2 or chunks=4 makes it slower, whereas I would expect more chunks to give more parallelism.
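To make the expectation concrete, here is a minimal sketch of an idealized GPipe-style schedule. It is not taken from the Pipe module's source; the function name `clock_cycles` and its parameters are chosen for illustration, and the schedule assumes every stage takes one clock tick per microbatch.

```python
def clock_cycles(num_microbatches, num_stages):
    """Yield, per clock tick, the (microbatch, stage) pairs that could run concurrently.

    Idealized model (an assumption, not the library's scheduler):
    microbatch i reaches stage j at tick i + j.
    """
    for tick in range(num_microbatches + num_stages - 1):
        yield [
            (tick - stage, stage)
            for stage in range(num_stages)
            if 0 <= tick - stage < num_microbatches
        ]


# Two stages, matching the two-GPU setup in this report.
for chunks in (1, 2, 4):
    schedule = list(clock_cycles(chunks, num_stages=2))
    print(f"chunks={chunks}: {schedule}")
```

With chunks=1 every tick holds a single (microbatch, stage) pair, so only one GPU is busy at a time. With chunks=2 or more, the middle ticks hold two pairs, which is the overlap I expected to see reflected in the timings.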
Command
PYTHONPATH=$PWD python benchmarks/transformer.py
To Reproduce
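Steps to reproduce the behavior:
1. Run PYTHONPATH=$PWD python benchmarks/transformer.py
2. Change line 263 of benchmarks/transformer.py to specify chunks=2, e.g. p = pipe.Pipe(model, balance, chunks=2), and rerun the command
3. Change the same line to specify chunks=4 and rerun the command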
chunks=1: test loss 5.57 | time: 30.72s | words: 2304870 | wps: 75028.93
chunks=2: test loss 5.58 | time: 53.51s | words: 2304870 | wps: 43077.41
chunks=4: test loss 5.57 | time: 81.93s | words: 2304870 | wps: 28133.60
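The size of the regression can be read off these numbers directly. The short script below only restates the figures reported above relative to the chunks=1 run; the helper name `relative_to_baseline` is made up for this illustration.

```python
# Figures copied from the benchmark output above.
results = {
    1: {"time_s": 30.72, "wps": 75028.93},
    2: {"time_s": 53.51, "wps": 43077.41},
    4: {"time_s": 81.93, "wps": 28133.60},
}


def relative_to_baseline(results, baseline=1):
    """Return slowdown (time ratio) and throughput ratio versus the baseline run."""
    base = results[baseline]
    return {
        chunks: {
            "slowdown": run["time_s"] / base["time_s"],
            "throughput": run["wps"] / base["wps"],
        }
        for chunks, run in results.items()
    }


for chunks, ratios in relative_to_baseline(results).items():
    print(
        f"chunks={chunks}: {ratios['slowdown']:.2f}x time, "
        f"{ratios['throughput']:.2f}x throughput"
    )
```

So chunks=2 takes roughly 1.7 times as long as chunks=1, and chunks=4 roughly 2.7 times as long, while the test loss stays essentially unchanged.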
Expected behavior
chunks=N should be faster than chunks=1 for some N > 1 when there is more than one device
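As a rough upper bound on what I would hope for, the sketch below computes the ideal pipeline speedup under strong assumptions: all stages cost the same, compute time scales linearly with microbatch size, and there is no transfer, synchronization, or kernel-launch overhead. None of that is guaranteed on real hardware, and `ideal_speedup` is a name invented for this example.

```python
def ideal_speedup(chunks, num_stages):
    """Ideal speedup of a pipelined run over the chunks=1 run.

    Assumed model: with chunks=1 the stages run back to back, costing
    num_stages time units. With m chunks there are (m + num_stages - 1)
    clock ticks, each costing 1/m of a unit.
    """
    if chunks < 1 or num_stages < 1:
        raise ValueError("chunks and num_stages must be positive")
    baseline_time = num_stages
    pipelined_time = (chunks + num_stages - 1) / chunks
    return baseline_time / pipelined_time


# Two stages, matching the two GPUs listed in the environment.
for chunks in (1, 2, 4):
    print(f"chunks={chunks}: ideal speedup {ideal_speedup(chunks, num_stages=2):.2f}x")
```

Under this model two devices would give about 1.33x at chunks=2 and 1.6x at chunks=4, so even allowing for generous overhead I would not expect a slowdown as large as the one measured.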
Environment
Collecting environment information...
PyTorch version: 1.6.0
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100
Nvidia driver version: 418.116.00
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.6.0
[pip3] torchtext==0.7.0
[pip3] torchvision==0.7.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.1.243 h6bb024c_0
[conda] mkl 2020.1 217
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.1.0 py37h23d657b_0
[conda] mkl_random 1.1.1 py37h0da4684_0 conda-forge
[conda] numpy 1.19.1 py37hbc911f0_0
[conda] numpy-base 1.19.1 py37hfa32c7d_0
[conda] pytorch 1.6.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] torchtext 0.7.0 pypi_0 pypi
[conda] torchvision 0.7.0 py37_cu101 pytorch