
Thread Safety #62

Closed · lassepe opened this issue May 16, 2022 · 10 comments · Fixed by #63

Comments

@lassepe (Contributor) commented May 16, 2022

When I call the PATH solver from multiple threads at the same time, I reliably run into a segfault. Is there a workaround to avoid this?

Here is an MWE (make sure to start the Julia REPL with multiple threads):

using PATHSolver: PATHSolver
using SparseArrays: SparseArrays

PATHSolver.c_api_License_SetString("2830898829&Courtesy&&&USR&45321&5_1_2021&1000&PATH&GEN&31_12_2025&0_0_0&6000&0_0")

function solve_example()
    M = convert(
        SparseArrays.SparseMatrixCSC{Cdouble,Cint},
        SparseArrays.sparse([
            0 0 -1 -1
            0 0 1 -2
            1 -1 2 -2
            1 2 -2 4
        ]),
    )
    status, z, info = PATHSolver.solve_mcp(
        M,
        Float64[2, 2, -2, -6],
        fill(0.0, 4),
        fill(10.0, 4),
        [0.0, 0.0, 0.0, 0.0];
        output = "yes"
    )
end

Threads.@threads for _ in 1:10
    solve_example()
end
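
For the crash to reproduce, the session needs more than one thread. A quick sanity check (the value 4 below is just an example):

# Start Julia with e.g. `julia --threads=4` or set JULIA_NUM_THREADS=4.
@assert Threads.nthreads() > 1  # with one thread the loop above runs serially and won't crash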
@odow (Collaborator) commented May 16, 2022

We don't do anything special, so I assume the underlying solver is not thread safe.

@lassepe (Contributor, Author) commented May 16, 2022

I did a quick run with GDB and the segfault shows up here. That may not be the actual point of the segfault, though; Julia may just have hit a GC safepoint when this happened. Here's the GDB backtrace:

Thread 3 "julia" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe2f3b700 (LWP 199218)]
0x00007ffff752c3e3 in jl_mutex_wait (lock=0x7ffff7888920 <codegen_lock>, safepoint=1)
    at /home/lassepe/worktree/julia/src/julia_locks.h:34
34	            jl_gc_safepoint_(self->ptls);
(gdb) bt
#0  0x00007ffff752c3e3 in jl_mutex_wait (lock=0x7ffff7888920 <codegen_lock>, safepoint=1)
    at /home/lassepe/worktree/julia/src/julia_locks.h:34
#1  0x00007ffff752c536 in jl_mutex_lock (lock=0x7ffff7888920 <codegen_lock>)
    at /home/lassepe/worktree/julia/src/julia_locks.h:85
#2  0x00007ffff752db6d in jl_generate_fptr (mi=0x7ffff0a17350, world=31344)
    at /home/lassepe/worktree/julia/src/jitlayers.cpp:316
#3  0x00007ffff7481916 in jl_compile_method_internal (mi=0x7ffff0a17350, world=31344)
    at /home/lassepe/worktree/julia/src/gf.c:1980
#4  0x00007ffff748275e in _jl_invoke (F=0x7ffff086f1b0, args=0x7fffef1f3408, nargs=0, mfunc=0x7ffff0a17350, world=31344)
    at /home/lassepe/worktree/julia/src/gf.c:2239
#5  0x00007ffff7483185 in jl_apply_generic (F=0x7ffff086f1b0, args=0x7fffef1f3408, nargs=0)
    at /home/lassepe/worktree/julia/src/gf.c:2429
#6  0x00007ffff74a8cdc in jl_apply (args=0x7fffef1f3400, nargs=1) at /home/lassepe/worktree/julia/src/julia.h:1788
#7  0x00007ffff74aac20 in start_task () at /home/lassepe/worktree/julia/src/task.c:877

@odow (Collaborator) commented May 16, 2022

That just means the issue is probably in us cleaning up an instance of Path while another is running, and that the two instances share some global state in C. If I recall correctly, there's a workspace feature of the C library that we aren't using at the moment, but I'm not sure whether that would help.

@lassepe (Contributor, Author) commented May 17, 2022

Thank you for your swift response. I dug a little deeper, and it does indeed seem to be a global-state issue inside PATH. At least the function signature of the solver suggests that there is only one global solver object:

FUN_DECL(Void) Path_Create(Int maxSize, Int maxNNZ)

PATHSolver.jl never calls that directly, but I guess the workspace would be created as a result of the call to Path_Solve. So when the MCP is cleaned up one probably destroys the workspace out from under the other solver running on the same workspace in parallel.

@odow (Collaborator) commented May 17, 2022

So when the MCP is cleaned up one probably destroys the workspace out from under the other solver running on the same workspace in parallel.

Yes. We can probably fix this by manually creating and managing our own workspace, but I don't know what this entails. You'd have to look into the C code. I'll review a PR if you have time to work on it 😄

You can find the C header files here: http://www.cs.wisc.edu/~ferris/path/path_5.0.05_Linux64.zip
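
A minimal sketch for fetching and unpacking that archive from Julia to inspect the headers (it assumes a system unzip binary is available):

import Downloads
# Download the PATH 5.0.05 Linux64 distribution linked above and unpack it locally.
zipfile = Downloads.download("http://www.cs.wisc.edu/~ferris/path/path_5.0.05_Linux64.zip")
run(`unzip -o $zipfile -d path_5.0.05`)  # relies on a system unzip binary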

@lassepe (Contributor, Author) commented May 17, 2022

A colleague of mine contacted the PATH team and Steven Dirkse replied:

Unfortunately, PATH is indeed known to NOT be thread-safe. There are
multiple reasons for this that go deep into the way PATH is organized. We
spent some time considering this issue some time ago and concluded that the
fixes would require a substantial amount of effort, something we could not
commit to then or since.

I cannot suggest any workaround I would feel comfortable and safe with,
apart from avoiding multi-threaded use of PATH.

Therefore, I guess there's not much we can do on this end. A note on this issue in the README or somewhere in the docs may be the best solution for now.
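
Until then, the only safe pattern seems to be making sure that at most one PATH solve runs at any given time. A minimal sketch of that idea (not an official workaround, just serializing the calls behind a global lock, which of course gives up the parallel speedup):

# Route every PATH call through one lock so two solves can never overlap,
# even if the surrounding code is multi-threaded.
const PATH_LOCK = ReentrantLock()

function solve_example_serialized()
    return lock(PATH_LOCK) do
        solve_example()  # the MWE function from the top of this issue
    end
end

Threads.@threads for _ in 1:10
    solve_example_serialized()
end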

@odow (Collaborator) commented May 17, 2022

Okay, I'll add a note. Thanks for digging into this further!

@odow (Collaborator) commented May 17, 2022

I'll also say that threading might be broken, but you can still use process-based parallelism:

import Distributed
Distributed.addprocs(4)
Distributed.@everywhere begin
    import Pkg
    Pkg.activate("/tmp/path")  # Or wherever your environment is
end

Distributed.@everywhere begin
    import PATHSolver
    import SparseArrays
    function solve_example(i)
        println("Solving i=$i on node $(Distributed.myid())")
        M = convert(
            SparseArrays.SparseMatrixCSC{Cdouble,Cint},
            SparseArrays.sparse([
                0 0 -1 -1
                0 0 1 -2
                1 -1 2 -2
                1 2 -2 4
            ]),
        )
        return PATHSolver.solve_mcp(
            M,
            Cdouble[2, 2, -2, -6],
            fill(0.0, 4),
            fill(10.0, 4),
            Cdouble[0.0, 0.0, 0.0, 0.0],
            output = "no",
        )
    end
end

results = Distributed.pmap(solve_example, 1:10)
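
If the individual problems are small, the per-call overhead can dominate; one way to amortize it (assuming the problems are independent) is to let pmap ship them to the workers in batches via its batch_size keyword:

# Each remote call now carries a whole batch of problems instead of a single one;
# the batch size of 50 is an arbitrary value to tune for the workload.
results = Distributed.pmap(solve_example, 1:1000; batch_size = 50)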

@lassepe (Contributor, Author) commented May 17, 2022

Right, multi-processing works because each process loads the library separately. Unfortunately, in my current use case the problems are very small, so multi-processing adds too much overhead to provide an overall speedup. Still good to keep in mind for other settings, though.

Should I close this issue since it's not really actionable, or do you want to keep it around as an additional warning/reminder?

@odow (Collaborator) commented May 17, 2022

I'll close it when I merge #63
