
getindex for CuArray to fix repeated indices error #1131

Closed · wants to merge 7 commits

Conversation

jgreener64 (Contributor)

I spent quite a while working out where my gradients were wrong before eventually tracking it down to the GPU repeated-index issue that has been mentioned in multiple places.

Based on a solution suggested at https://discourse.julialang.org/t/increment-elements-of-array-by-index/49694, I added a hack to get the correct gradients, which can be found in this PR. It uses scatter! from NNlib to accumulate values over repeated indices. It is considerably slower than the current version, which on the GPU gives silently incorrect and even stochastic gradients. However, it is still faster than moving the array off the GPU to do this step.

This isn't currently suitable for merging since it adds dependencies and isn't tested beyond the simple case. It might be useful as a starting point for a discussion on how to get this important issue fixed, though, and also might be useful as a hack to others.
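As a minimal illustration of the scatter! behavior being relied on (this example is mine, not part of the PR diff): with repeated indices, scatter! sums the contributions rather than overwriting them.

using NNlib

dst = zeros(Float32, 3)
NNlib.scatter!(+, dst, Float32[2, 2, 2], [1, 1, 2])
dst # Float32[4.0, 2.0, 0.0]: both values mapping to index 1 are summed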

src/lib/array.jl Outdated
Comment on lines 44 to 45
using CUDA
import NNlib
Member

these imports shouldn't be here

Member

I think that's what @jgreener64 was referring to with "This isn't currently suitable for merging since it adds dependencies..."

Member

ah sorry, missed that

src/lib/array.jl (outdated; resolved)
src/lib/array.jl Outdated

∇getindex(x::CuArray, inds::Tuple{AbstractArray{<:Integer}}) = dy -> begin
    dx = _zero(x, eltype(dy))          # Zygote-internal helper: zeroed buffer matching x
    NNlib.scatter!(+, dx, dy, inds[1]) # sums dy entries that share an index
Member

Since we use only inds[1], we should restrict the input to ::AbstractArray{<:Integer}, or even better AbstractVector{<:Integer}.

Member

The tuple wrapper comes from the varargs in the adjoint, so unless that's changed too there's not much that can be done.

@CarloLucibello (Member)

I think we should prioritize correctness and avoid silent bugs before aiming for speed. Maybe taking a fast path after a "has duplicates" check can help performance a bit?
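A rough sketch of that suggestion (the helper name is made up, and allunique is assumed to run on CPU indices):

using CUDA, NNlib

# Hypothetical fast path: plain assignment when indices are unique (no write
# races possible), scatter!-based accumulation when they are repeated.
function accum_gradient(x::CuArray, inds::AbstractVector{<:Integer}, dy)
    dx = CUDA.zeros(eltype(dy), size(x))
    if allunique(inds)
        dx[inds] = dy            # race-free: each destination is written once
    else
        NNlib.scatter!(+, dx, dy, inds)
    end
    return dx
end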

@CarloLucibello (Member)

Since Zygote doesn't depend on CUDA or NNlib, I'm not sure where we should put this. Here, but inside a @require? In NNlib/NNlibCUDA?

@jgreener64 (Contributor, Author)

Maybe taking a fast path after a "has duplicates" check can help performance a bit?

Yes I would imagine so.

I'm not sure where we should put this

There is already a CUDA @require block in lib/broadcast.jl, though that still leaves the NNlib dependency. Another option is to reimplement a simple version of scatter here.
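For illustration, a nested Requires block along those lines might look like the following; the UUIDs are the registered ones for CUDA and NNlib, the signature just mirrors the PR, and this is a sketch rather than a tested layout.

# In practice this would sit inside the package's __init__,
# as the existing CUDA block in lib/broadcast.jl does.
using Requires

@require CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba" begin
    @require NNlib = "872c559c-99b0-510c-b3b7-b6c96a88d5cd" begin
        ∇getindex(x::CUDA.CuArray, inds::Tuple{AbstractVector{<:Integer}}) = dy -> begin
            dx = CUDA.zeros(eltype(dy), size(x))
            NNlib.scatter!(+, dx, dy, inds[1])
        end
    end
end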

@ToucheSir (Member)

Is this something ReverseDiff could benefit from as well? If so, we could consider adding it as an rrule with RuleConfig{>:HasReverseMode} to NNlibCUDA.
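Sketched as an rrule, that might look roughly like this (the signature and placement are assumptions; the real rule would need to cover more index types):

using ChainRulesCore, CUDA, NNlib

function ChainRulesCore.rrule(::RuleConfig{>:HasReverseMode}, ::typeof(getindex),
                              x::CuArray, inds::AbstractVector{<:Integer})
    y = x[inds]
    function getindex_pullback(dy)
        dx = CUDA.zeros(eltype(x), size(x))
        NNlib.scatter!(+, dx, unthunk(dy), inds)  # accumulate repeated indices correctly
        return (NoTangent(), dx, NoTangent())
    end
    return y, getindex_pullback
end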

@DhairyaLGandhi (Member) commented Dec 9, 2021

I don't think this is something that can be solved well with an adjoint at this point. The forward pass value would be incorrect by this point already. Specifically, note JuliaGPU/CUDA.jl#89, which describes why this happens. Best fixed at the CUDA.jl level based on how it is implemented.

@jgreener64 (Contributor, Author)

The forward pass value would be incorrect by this point already.

Not in general, since assignment broadcasting is used in the backward pass whether or not it is used in the forward pass. For example, on the current release:

using Zygote, CUDA

mul2(x) = 2x

function f(a)
    v = view(a, [1, 1, 2])
    sum(mul2.(v))
end

a = [1.0f0, 1.0f0, 1.0f0]
f(a) # Correct, 6.0f0
f(cu(a)) # Correct, 6.0f0

gradient(f, a) # Correct, (Float32[4.0, 2.0, 0.0],)
gradient(f, cu(a)) # Incorrect, (Float32[2.0, 2.0, 0.0],)

The gradient is correct with the change in this PR.

Best fixed at the CUDA.jl level based on how it is implemented.

The problem is that this seems difficult to fix at the CUDA level based on what @maleadt says at JuliaGPU/CUDA.jl#89.

@CarloLucibello (Member)

Can we have some benchmarks here?

@ToucheSir (Member)

So the main concern wrt. dependencies here is not CUDA (because there's already a Requires block for it in Zygote) but NNlib(CUDA). I think one way to make this more palatable would be to extract scatter!(+, dx, dy, inds) into a kernel and create a wrapper accumulation function that dispatches to this kernel when passed CuArrays. The default fallback could reuse the code in https://github.com/FluxML/Zygote.jl/blob/v0.6.33/src/lib/array.jl#L38-L39.
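In code, that suggestion might take roughly this shape (both function names here are hypothetical):

using CUDA

# Generic fallback, mirroring the existing view-and-accumulate code;
# sequential CPU writes mean repeated indices accumulate correctly.
function accum_index(x::AbstractArray, dy, inds...)
    dx = zero(x)
    dxv = view(dx, inds...)
    dxv .= dxv .+ dy
    return dx
end

# GPU method dispatching to a dedicated kernel copied from NNlibCUDA,
# avoiding racing parallel writes.
function accum_index(x::CuArray, dy, inds...)
    dx = CUDA.zeros(eltype(dy), size(x))
    scatter_add!(dx, dy, inds...)  # hypothetical internal scatter kernel
    return dx
end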

@jgreener64 (Contributor, Author)

I added the check for unique indices. Here are some benchmarks on an NVIDIA RTX A6000 with 1000 indices:

using Zygote, CUDA, BenchmarkTools, Random
using Zygote: ∇getindex

arr = cu(rand(Float32, 1000))
inds_norep = shuffle(1:1000)
inds_norep_cu = cu(inds_norep)
inds_rep = rand(1:1000, 1000)
inds_rep_cu = cu(inds_rep)
yb = cu(rand(Float32, 1000))

# On master
@btime $(∇getindex(arr, (inds_norep   ,)))($(yb)) # 11.474 μs (35 allocations: 2.28 KiB)
@btime $(∇getindex(arr, (inds_norep_cu,)))($(yb)) # 15.839 μs (31 allocations: 10.06 KiB)
@btime $(∇getindex(arr, (inds_rep     ,)))($(yb)) # 11.380 μs (35 allocations: 2.28 KiB)
@btime $(∇getindex(arr, (inds_rep_cu  ,)))($(yb)) # 16.069 μs (31 allocations: 10.06 KiB)

# With this change
@btime $(∇getindex(arr, (inds_norep   ,)))($(yb)) # 23.712 μs (53 allocations: 59.36 KiB)
@btime $(∇getindex(arr, (inds_norep_cu,)))($(yb)) # 35.649 μs (50 allocations: 67.16 KiB)
@btime $(∇getindex(arr, (inds_rep     ,)))($(yb)) # 3.626 ms (26013 allocations: 1.29 MiB)
@btime $(∇getindex(arr, (inds_rep_cu  ,)))($(yb)) # 3.663 ms (26014 allocations: 1.29 MiB)

So it's a roughly 2x slowdown for the previously correct case of no repeated indices (due to the uniqueness check), and a roughly 300x slowdown for the previously incorrect case of repeated indices (due to the use of NNlib.scatter!).

@CarloLucibello (Member)

What should we do here? I think we should take the performance hit for the sake of correctness.

So the main concern wrt. dependencies here is not CUDA (because there's already a Requires block for it in Zygote) but NNlib(CUDA). I think one way to make this more palatable would be to extract scatter!(+, dx, dy, inds) into a kernel and create a wrapper accumulation function that dispatches to this kernel when passed CuArrays. The default fallback could reuse the code in https://github.com/FluxML/Zygote.jl/blob/v0.6.33/src/lib/array.jl#L38-L39.

I don't quite understand this suggestion.

@ToucheSir (Member)

I meant that we could copy the scatter kernel definition from NNlibCUDA and use it as an internal function. Pretty messy, but a couple of duplicated methods isn't the end of the world.

@CarloLucibello (Member)

Yes, we could do that; it's not ideal, but it seems the best alternative.

@CarloLucibello (Member)

@jgreener64 do you want to implement this plan?

@jgreener64 (Contributor, Author)

Looking at this again, another option when the indices are not unique is to do the broadcast on the CPU and move the result back to the GPU. I added that in the last commit. Compared to the above, it benchmarks as:

@btime $(∇getindex(arr, (inds_norep   ,)))($(yb)) # 24.253 μs (57 allocations: 59.64 KiB)
@btime $(∇getindex(arr, (inds_norep_cu,)))($(yb)) # 37.904 μs (53 allocations: 67.39 KiB)
@btime $(∇getindex(arr, (inds_rep     ,)))($(yb)) # 12.634 μs (21 allocations: 19.92 KiB)
@btime $(∇getindex(arr, (inds_rep_cu  ,)))($(yb)) # 20.333 μs (22 allocations: 19.94 KiB)

i.e. it is fast, though I imagine it increases GPU memory requirements. It may be a good way to go here due to its simplicity.
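A minimal sketch of that fallback (the loop stands in for the view broadcast; everything else mirrors the PR's pullback shape):

∇getindex(x::CuArray, inds::Tuple{AbstractVector{<:Integer}}) = dy -> begin
    dxc = zeros(eltype(dy), size(x))     # accumulate on the CPU
    dyc, ic = Array(dy), Array(inds[1])  # move gradient and indices off the GPU
    for (k, i) in enumerate(ic)
        dxc[i] += dyc[k]                 # sequential writes handle repeats correctly
    end
    cu(dxc)                              # copy the result back to the GPU
end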

@mcabbott (Member)

Do these times need a CUDA.@sync? It seems a little crazy that copying to an Array could be quicker, but if it really is, then that's simple. And if so, it should probably skip the check? allunique is surprisingly slow; it has to make a Dict, I think:

julia> @btime allunique($(shuffle!(collect(1:1000))));
  min 11.958 μs, mean 20.876 μs (17 allocations, 49.14 KiB)

@jgreener64 (Contributor, Author)

I'm not fully sure, but with this benchmark I think it is quicker to do it on the CPU and copy across. Skipping the uniqueness check and always doing this, which I just added to the branch, gives:

@btime $(∇getindex(arr, (inds_norep   ,)))($(yb)) # 11.843 μs (11 allocations: 16.30 KiB)
@btime $(∇getindex(arr, (inds_norep_cu,)))($(yb)) # 19.797 μs (12 allocations: 16.31 KiB)
@btime $(∇getindex(arr, (inds_rep     ,)))($(yb)) # 12.079 μs (11 allocations: 16.30 KiB)
@btime $(∇getindex(arr, (inds_rep_cu  ,)))($(yb)) # 19.589 μs (12 allocations: 16.31 KiB)

This is very similar to the results on master above, but I'm unsure of the wider impact of changing every indexing operation on a GPU array, as it may be slower or memory-heavy in other contexts.

Are there downstream benchmark tests that can be run (e.g. for Flux) to assess this?

@ToucheSir (Member)

Could you package up some of the test cases you're using and put them in https://github.com/FluxML/Zygote.jl/blob/master/test/cuda.jl?

@mcabbott added the CUDA label (All things GPU) on Jul 4, 2022
@mcabbott (Member) commented Aug 3, 2022

JuliaDiff/ChainRules.jl#655 implements the same copy-to-CPU idea as 98e87c3. Comments (or benchmarking) would be welcome.

Adding test cases here would also be a great idea, as they will be run on a real GPU via buildkite.

@jgreener64 (Contributor, Author)

I added a simple test case that is fixed by this PR.

Yes, it probably makes sense to deal with this on the ChainRules.jl side long term.

Is there a way to benchmark performance on various downstream tasks, such as in Flux? I'm a little worried about the consequences for GPU memory if a CPU-GPU conversion happens with every indexing step, though in my tests above it seemed okay. The approach in JuliaDiff/ChainRules.jl#655 of allowing integers/ranges without doing a uniqueness check seems sensible.

@ToucheSir mentioned this pull request on Oct 6, 2023
@jgreener64 (Contributor, Author)

I think this was fixed in JuliaDiff/ChainRules.jl#655.

@jgreener64 closed this on Oct 6, 2023
Labels: CUDA (All things GPU)