
Supporting Distributed OptiGraphs#144

Merged
dlcole3 merged 61 commits into plasmo-dev:main from dlcole3:dc/distributed
Nov 6, 2025

Conversation

@dlcole3
Collaborator

@dlcole3 dlcole3 commented Aug 4, 2025

This PR supports distributing OptiGraphs. Most of the source code is contained in the new distributed folder. A set of unit tests and two pages of documentation have also been added.

The core structures are outlined in distributed/core_types.jl and include the RemoteOptiGraph, RemoteNodeRef, RemoteVariableRef, RemoteEdgeRef, and RemoteOptiEdge. Passing any of these objects directly to the remote worker also results in passing the whole RemoteOptiGraph to the remote worker, so proxy objects have also been defined that allow for only passing the minimum data to create the corresponding objects or accessing the objects on the remote.

In defining these new structures, I have tried to extend the same functions that have been extended for Plasmo. One exception is the relax_integrality function which could perhaps still be included in this PR.
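The proxy idea described above can be illustrated with a small Distributed sketch. This is not the PR's actual API — `NodeProxy` and `LOCAL_NODES` are hypothetical stand-ins — but it shows why shipping a lightweight reference avoids serializing the graph that owns the object:

```julia
using Distributed
addprocs(1)

# Hypothetical proxy: holds only the owning worker id and an index, so
# sending it between processes copies a few bytes, not the whole graph.
@everywhere struct NodeProxy
    worker::Int
    index::Int
end

# Stand-in for per-worker node storage (each worker gets its own Dict).
@everywhere const LOCAL_NODES = Dict{Int,String}()

# Create a "node" on worker 2 and return only a proxy to it.
proxy = remotecall_fetch(2) do
    LOCAL_NODES[1] = "node data"
    NodeProxy(myid(), 1)
end

# Resolve the proxy on its owning worker to access the real object.
data = remotecall_fetch(p -> LOCAL_NODES[p.index], proxy.worker, proxy)
```

The same shape presumably applies to the PR's variable, node, and edge references: the proxy travels, the data stays put.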

Member

@jalving jalving left a comment


This PR is very comprehensive. I only commented on minor things. For the sake of getting a working distributed version released, I think it is okay to release soon since base Plasmo code is hardly touched.

In the future, it would be great to find ways to cut down on duplicate implementations of the same Plasmo/JuMP interface, but this might be unavoidable given how JuMP extensions work.
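One common way to trim duplicated wrapper methods in Julia is to generate forwarding definitions with an `@eval` loop. A minimal sketch (the `Wrapper` type here is illustrative; whether this pattern fits the JuMP extension interface is exactly the open question above):

```julia
struct Wrapper
    inner::Vector{Int}   # stand-in for the wrapped graph/model
end

# Generate one forwarding method per listed function instead of
# writing each wrapper method out by hand.
for f in (:length, :isempty, :first)
    @eval Base.$f(w::Wrapper) = Base.$f(w.inner)
end

w = Wrapper([1, 2, 3])
length(w)   # forwards to length(w.inner)
```

This works well for simple pass-through queries; methods that must translate between local and remote reference types still need hand-written bodies.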

@jalving
Member

jalving commented Aug 18, 2025

I get this error from distributed when I run tests locally:

Unhandled Task ERROR: IOError: stream is closed or unusable
Stacktrace:
  [1] check_open
    @ ./stream.jl:388 [inlined]
  [2] uv_write_async(s::Base.PipeEndpoint, p::Ptr{UInt8}, n::UInt64)
    @ Base ./stream.jl:1108
  [3] uv_write(s::Base.PipeEndpoint, p::Ptr{UInt8}, n::UInt64)
    @ Base ./stream.jl:1069
  [4] unsafe_write(s::Base.PipeEndpoint, p::Ptr{UInt8}, n::UInt64)
    @ Base ./stream.jl:1154
  [5] write
    @ ./strings/io.jl:248 [inlined]
  [6] print
    @ ./strings/io.jl:250 [inlined]
  [7] print(::Base.PipeEndpoint, ::String, ::String)
    @ Base ./strings/io.jl:46
  [8] println(::Base.PipeEndpoint, ::String, ::Vararg{Any})
    @ Base ./strings/io.jl:75
  [9] println(xs::String)
    @ Base ./coreio.jl:4
 [10] (::Distributed.var"#37#38"{Int64, Base.PipeEndpoint})()
    @ Distributed ~/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:289

But tests still pass.

I think using @suppress causes issues with distributed with how stdout gets forwarded. I think it should be simplest to just use JuMP.set_silent(graph). You might need to set a HiGHS optimizer attribute as well.
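For a plain JuMP model the suggestion looks like the sketch below; `set_silent` sets the solver-independent silencing attribute, and `output_flag` is the HiGHS-specific log switch alluded to above. How this maps onto a `RemoteOptiGraph` depends on the extension:

```julia
using JuMP, HiGHS

model = Model(HiGHS.Optimizer)
set_silent(model)                           # solver-independent silencing
set_attribute(model, "output_flag", false)  # HiGHS-specific log switch

@variable(model, x >= 0)
@objective(model, Min, x)
optimize!(model)
```

Unlike `@suppress`, neither call touches the process's stdout stream, so nothing is left for Distributed's output forwarding to trip over.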

@dlcole3
Collaborator Author

dlcole3 commented Aug 18, 2025

> I get this error from distributed when I run tests locally: `Unhandled Task ERROR: IOError: stream is closed or unusable` […] But tests still pass. I think using @suppress causes issues with distributed with how stdout gets forwarded. I think it should be simplest to just use JuMP.set_silent(graph).

I have had this error come up a couple of times for me, but it did not always seem to be reproducible. I will take a closer look at the tests and see if it is from @suppress or something else. At least once or twice when this error came up for me, I reran the code and it worked the second time. I'll see if I can recreate the error and figure out what is throwing it.

@dlcole3
Collaborator Author

dlcole3 commented Aug 20, 2025

@jalving Another thing I would be interested in your feedback on for the code review is how best to distinguish local link constraints on the remote graph. Some of this became a little more clear when I was writing the tests. There are RemoteEdgeRefs and RemoteOptiEdges for link constraints. The RemoteEdgeRefs are references to edges that exist on the OptiGraph on the remote worker, while RemoteOptiEdges exist on the main worker and link across 2 or more RemoteOptiGraph objects. I chose to make separate objects for these so that it is easy to distinguish between them. For instance, PlasmoBenders will only use the RemoteOptiEdges because they link the subproblems.
A challenge that comes from this is deciding what these functions should return. For instance, num_constraints(::RemoteOptiGraph) returns the number of constraints stored on remote workers and does not count the constraints stored on the RemoteOptiEdges (but maybe it should). Happy to rewrite some of these functions based on your suggestions. I also have all_link_constraints, which returns constraints on RemoteOptiEdges, and all_remote_link_constraints, which returns the constraints stored on RemoteEdgeRefs (that is, the latter returns the link constraints on the remote worker). This may not be the best way of doing it, but regardless of what we do, I think I should add a page to the documentation that explains this more clearly before we merge the PR.
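The two accessors described above can be summarized in pseudocode (illustrative only — helper names like `workers_of` and `num_constraints_on_worker` are not part of the PR):

```julia
# all_link_constraints(rg)        -> constraints on the RemoteOptiEdges
#                                    (cross-worker links held on the main worker)
# all_remote_link_constraints(rg) -> constraints on the RemoteEdgeRefs
#                                    (links living inside each worker's OptiGraph)

# One option for the num_constraints question: count both sources.
function num_constraints_including_links(rg)
    on_workers   = sum(num_constraints_on_worker(rg, w) for w in workers_of(rg))
    cross_worker = length(all_link_constraints(rg))
    return on_workers + cross_worker
end
```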

I think differentiating between types of edges as you did is necessary for the distributed implementation. Plasmo already takes a centralized approach to handling subgraphs (parent graphs own linking information), so denoting RemoteOptiEdge for links across workers is consistent with the abstraction. I see the RemoteOptiGraph that you implemented as an effective way to refer to an optigraph on a worker and contain the extra information about the process connectivity. The only confusion for me is whether RemoteOptiEdge is the name we want to go with since technically the edge itself is not 'remote'; it connects remote objects.

For querying the remote graph, my opinion is that num_constraints(graph::RemoteOptiGraph) should return all the constraints (including cross-process constraints) to be consistent with the optigraph interface. We can still have other specific methods for differentiating between the two types of constraints.

@jalving I think I am going to go with InterWorkerEdges unless you have a better idea for the name.

@dlcole3
Collaborator Author

dlcole3 commented Aug 20, 2025

> I get this error from distributed when I run tests locally: `Unhandled Task ERROR: IOError: stream is closed or unusable` […] But tests still pass. I think using @suppress causes issues with distributed with how stdout gets forwarded. I think it should be simplest to just use JuMP.set_silent(graph).

I updated the script to skip @suppress on the remote graph optimize calls, and I extended JuMP.set_silent (this was missed before). I am not able to replicate this error, and the tests all pass on my machine (with the exception of the KaHyPar tests, but they pass on GitHub). Let me know if you have any issues with the tests now.

@jalving
Member

jalving commented Aug 23, 2025

> I updated the script to skip @suppress on the remote graph optimize calls, and I extended JuMP.set_silent (this was missed before). […] Let me know if you have any issues with the tests now.

This fixed it 👍

@dlcole3
Collaborator Author

dlcole3 commented Aug 24, 2025

@jalving I renamed RemoteOptiEdge to InterWorkerEdge. The OptiGraph object has a field called optiedges, and I followed a similar approach for the RemoteOptiGraph, where it has a field optiedges, but since we are calling these InterWorkerEdges, do you think we should still call this field optiedges? It is more consistent with the OptiGraph object, and they are effectively optiedges, but I was uncertain if this could cause some confusion or not.

@jalving
Member

jalving commented Aug 24, 2025

> I renamed RemoteOptiEdge to InterWorkerEdge. The OptiGraph object has a field called optiedges, and I followed a similar approach for the RemoteOptiGraph […] do you think we should still call this field optiedges?

Keeping the attribute name as optiedges is fine. They are already typed as InterWorkerEdges. Users are supposed to use the getter functions anyways 🙂

@jalving
Member

jalving commented Aug 24, 2025

@dlcole3 feel free to merge when ready

@dlcole3 dlcole3 merged commit 1ddb773 into plasmo-dev:main Nov 6, 2025
5 checks passed


Development

Successfully merging this pull request may close these issues.

num_constraints should ignore variable constraints by default.
