Open
Labels: bug (Something isn't working)
Description
Affects: PythonCall
Describe the bug
This is a very quirky bug. I get a segmentation fault when using Python's gymnasium
package from multiple worker processes while a Flux model is loaded on the GPU.
Setup:
]add CondaPkg
]add PythonCall
]add Flux
]add CUDA
using CondaPkg
CondaPkg.add("gymnasium")
CondaPkg.add("swig")
CondaPkg.add("gymnasium-box2d")
CondaPkg.add("gymnasium-other")
Run (the crash is non-deterministic; try running this a few times on a machine with an NVIDIA GPU):
using Distributed
addprocs(12; env=["CUDA_HARD_MEMORY_LIMIT" => "5%", "CUDA_MEMORY_POOL" => "none"])
@everywhere begin
    using CUDA
    using Flux
    using CondaPkg
    using PythonCall
    function initialize_car_racing_env(_)
        gym = pyimport("gymnasium")
        x = Flux.Dense(512=>512) |> gpu
        env = gym.make("CarRacing-v3")
        obs, info = env.reset()
        env.close()
        return 1
    end
end
for generation in 1:10_000
    if generation % 100 == 0
        println("Generation: $generation")
    end
    pmap(initialize_car_racing_env, 1:12)
end
Stack trace:
From worker 5:
From worker 5: [35654] signal 11: Segmentation fault
From worker 5: in expression starting at none:0
From worker 5: jl_gc_state_set at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia_threads.h:334 [inlined]
From worker 5: jl_gc_state_set at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia_threads.h:329 [inlined]
From worker 5: jl_gc_state_save_and_set at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia_threads.h:340
From worker 5: throw_internal_altstack at /cache/build/builder-demeter6-6/julialang/julia-master/src/task.c:755 [inlined]
From worker 5: ijl_sig_throw at /cache/build/builder-demeter6-6/julialang/julia-master/src/task.c:800
From worker 5: Allocations: 21901595 (Pool: 21900914; Big: 681); GC: 219
ERROR: Worker 5 terminated.LoadError:
ProcessExitedException(Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
  [1] (::Base.var"#wait_locked#832")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base ./stream.jl:970
  [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base ./stream.jl:978
  [3] unsafe_read
    @ ./io.jl:891 [inlined]
  [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base ./io.jl:890
  [5] read!
    @ ./io.jl:895 [inlined]
  [6] deserialize_hdr_raw
    @ ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/messages.jl:167 [inlined]
  [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/sh
Your system
Please provide detailed information about your system:
- The operating system
5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
- The version of Julia, Python, PythonCall, JuliaCall and any other affected packages
[052768ef] CUDA v5.5.2
[992eb4ea] CondaPkg v0.2.24
[587475ba] Flux v0.14.25
[6099a3de] PythonCall v0.9.23 `https://github.com/JuliaPy/PythonCall.jl.git#main`
[02a925ec] cuDNN v1.4.0
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 64 × AMD Ryzen Threadripper PRO 5975WX 32-Cores
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)
Environment:
LD_LIBRARY_PATH =
CondaPkg Status /home/garbus/.julia/environments/v1.11/CondaPkg.toml
Environment
/home/garbus/.julia/environments/v1.11/.CondaPkg/env
Packages
gymnasium v1.0.0
gymnasium-box2d v1.0.0
gymnasium-other v1.0.0
swig v4.2.1
Additional context
I'm researching embodied AI and want to use Julia's distributed capabilities for training while still evaluating on Python environments.