
multi-viewer: synchronise shared-state mutations + skip empty render cycles #171

Open

mushogenshin wants to merge 4 commits into nmfisher:develop from mushogenshin:fix/multi-viewer-serialisation

Conversation

@mushogenshin
Contributor

Summary

Four fixes that together make Thermion's multi-viewer pipeline (multiple `SwapChain`s sharing one `Engine` / `Renderer`) safe under contention. Each addresses a distinct race exposed when several viewers mount, dispose, or render concurrently. Single-viewer is unaffected.

Background

Filament documents `Renderer` as a per-window primitive — the example in `Engine.h` shows one `Engine` ↔ one `SwapChain` ↔ one `Renderer` ↔ one `View`, and `Renderer.h:343` says "It is recommended to use the same swapChain for every call to beginFrame, failing to do so can result in losing all or part of the FrameInfo history." Thermion shares a single `mRenderer` across every `SwapChain` in the engine and calls `beginFrame(SC_i)` / `endFrame()` for each entry in `mViewAttachments` inside one `render()` call. That pattern works in single-viewer apps but has multiple race windows the moment more than one viewer is involved.

This PR closes those windows. (Filament's per-window assumption is itself worth addressing long-term via per-SwapChain Renderers, but the changes below make the shared-Renderer-with-multiple-SwapChains pattern reliable in the meantime.)

Fix 1 — `flushAndWait` before destroying a SwapChain

`Engine::destroy(SwapChain*)` is documented as not synchronising with the GPU; explicit guidance is to call `flushAndWait()` first if a frame might still be in flight. `destroySwapChain` previously called `detachAll` (which removes the swap chain from `mViewAttachments` synchronously) and then immediately `Engine_destroySwapChainRenderThread`. Filament's backend command queue could still have a `BEGIN_FRAME` / `END_FRAME` pair referencing the swap chain when the synchronous `engine->destroy(swapChain)` frees the underlying object. The backend then runs endFrame on freed memory and aborts at `Renderer.cpp:490` — "SwapChain must remain valid until endFrame is called."

Insert `await flush()` between `detachAll` and `Engine_destroySwapChainRenderThread`.
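
In sketch form the destroy path becomes the following (only the placement of `flush()` is the change; `FFISwapChain` and the exact binding signatures here are stand-ins, not the actual diff):

```dart
// Sketch of the ordering; names approximate the real bindings.
Future<void> destroySwapChain(FFISwapChain swapChain) async {
  // 1. Remove the swap chain from mViewAttachments so no further
  //    RenderManager::render iteration touches it.
  await renderManager.detachAll(swapChain);

  // 2. flushAndWait: the backend drains every command submitted so far,
  //    including any endFrame that still references this swap chain.
  await flush();

  // 3. Only now is the synchronous engine->destroy(swapChain) safe.
  Engine_destroySwapChainRenderThread(engine, swapChain.pointer);
}
```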

Fix 2 — skip begin/end cycle when SwapChain has no attached views

Both `setRendering(false)` and `renderManager.detach(view)` remove the view from a swap chain entry's view list but leave the swap chain entry itself in `mViewAttachments` for later reattachment. When `renderSwapChainAt` then iterates that entry, it calls `mRenderer->beginFrame(...)` / `endFrame()` with no actual rendering work in between. That is merely wasted work in single-viewer apps, but dangerous in multi-viewer ones: the per-pair internal state on Filament's `mSwapChain` gets set and cleared rapidly across the iteration, opening a race window for the validity assertion whenever concurrent destroy/attach traffic from a sibling viewer hits Filament's internal state.

Scan `attachment.views` before calling `beginFrame` and return false if all entries are null. The begin/end pair then only happens when there is actual rendering work, which is what Filament's Renderer expects.

Fix 3 — serialise `destroySwapChain` + snapshot `_syncViews` iteration

Two changes in one commit because they address overlapping symptoms in the same dispose path:

(a) `destroySwapChain` serialisation: in mass-dispose scenarios (e.g., a stress-test screen toggling 8 viewers down to 1), eight concurrent `destroySwapChain` calls each ran their own `detachAll → flush → destroy` sequence in unsynchronised parallel. Each call's `flush()` had no stable target — other destroys were concurrently queueing fresh commands on Filament's backend, so no individual flush guaranteed a quiet backend. The shared Renderer's `mSwapChain` could end up pointing at a SwapChain another destroy was in the middle of freeing. Add a `static Future _destroyChain` to `FFIFilamentApp` so each `destroySwapChain` awaits the previous one — flush() now has a quiet backend at the moment it runs.
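
A rough sketch of the chaining (the wrapped body is the detachAll → flush → destroy sequence from Fix 1; `_detachFlushAndDestroy` is an illustrative name, not the actual method):

```dart
class FFIFilamentApp {
  // Each destroySwapChain starts only after the previous one has fully
  // completed, so its flush() runs against a quiet backend.
  static Future<void> _destroyChain = Future.value();

  Future<void> destroySwapChain(FFISwapChain swapChain) {
    final op =
        _destroyChain.then((_) => _detachFlushAndDestroy(swapChain));
    // Keep the chain alive even if one destroy throws.
    _destroyChain = op.catchError((_) {});
    return op;
  }
}
```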

(b) `_syncViews` iteration snapshot: the loop iterated `_attachments.keys` directly, awaiting a render-thread round-trip on each iteration. Concurrent `attach` / `detach` from sibling viewers mutated the live map across the await, leaving the iterator with stale state — `_syncViews` then pushed inconsistent view lists to C++ RenderManager. Snapshot keys + per-key views at function entry; iterate the snapshot.
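
A sketch of the snapshot, assuming `_attachments` maps swap chains to view lists; `_pushToRenderThread` is a stand-in for the actual `RenderManager_setRenderableRenderThread` round-trip:

```dart
Future<void> _syncViews() async {
  // Point-in-time snapshot: sibling viewers can keep mutating the live
  // _attachments map across the awaits below without invalidating this
  // loop's state.
  final snapshot = {
    for (final entry in _attachments.entries)
      entry.key: List.of(entry.value),
  };

  for (final swapChain in snapshot.keys) {
    // Each iteration is a render-thread round-trip.
    await _pushToRenderThread(swapChain, snapshot[swapChain]!);
  }
}
```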

Fix 4 — serialise `attach` / `detach` / `detachAll` through a static Future chain

Fix 3 kept iteration self-consistent across its own awaits, but didn't synchronise between sibling viewers' attach/detach calls. Each viewer's snapshot was taken at its own moment; one viewer's snapshot could capture another viewer's view that was being concurrently destroyed via `destroyView` (which itself awaits `detach(view)`). By the time the snapshot's `_syncViews` reached `RenderManager_setRenderableRenderThread`, the captured view's native handle had been freed — Filament dereferenced the dangling pointer and the process took a SIGSEGV (past the precondition checks).

Introduce a static `_opChain` Future and a `_serialize` helper in `FFIRenderManager`. `attach`, `detach`, and `detachAll` now each await the previous op before running. Concurrent calls queue cleanly; view destruction can no longer interleave inside another viewer's sync.
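
The pattern, in sketch form (only `_opChain` and `_serialize` are names from the change; the wrapped bodies and parameter types are abbreviated):

```dart
class FFIRenderManager {
  // One shared chain: a queued op starts only after the previous one,
  // including all of its _syncViews round-trips, has completed.
  static Future<void> _opChain = Future.value();

  static Future<T> _serialize<T>(Future<T> Function() op) {
    final result = _opChain.then((_) => op());
    // Swallow errors on the chain itself so one failed op does not
    // block later ops; callers still see the error via `result`.
    _opChain = result.then<void>((_) {}, onError: (_) {});
    return result;
  }

  Future<void> attach(FFIView view, FFISwapChain swapChain) =>
      _serialize(() async {
        // ...record the attachment, then await _syncViews()...
      });

  Future<void> detach(FFIView view) => _serialize(() async {
        // ...remove the view, then await _syncViews()...
      });
}
```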

Result

Combined, every shared-state mutation in the multi-viewer pipeline runs serially. The spike app's stress-test screen on Android — which previously crashed on every viewer-count toggle — now mounts, renders, region-picks, and disposes cleanly across 1 / 4 / 8 viewer counts and arbitrary toggle sequences.

Verification

  • 1 → 4 → 8 → 4 → 1 → 4 toggle sequence on Android emulator: stable. No SIGABRT, no SIGSEGV, no `endFrame:490` precondition.
  • iOS, macOS, Windows: existing test paths pass — no new locks introduced where none were needed.

Related

…ing endFrame

Filament's `Engine::destroy(SwapChain*)` is documented as not
synchronising with the GPU; explicit guidance is to call
`flushAndWait()` first if a frame might still be in flight.

`destroySwapChain` previously called `detachAll(swapChain)` (which
takes RenderManager's mutex and removes the swap chain from C++
mViewAttachments synchronously, so no further RenderManager::render
iteration touches it) and then immediately
`Engine_destroySwapChainRenderThread`. That's enough on Thermion's
own render thread, but it's not enough for Filament's *backend*
command queue: Filament's BEGIN_FRAME / render / END_FRAME
commands are processed on its backend thread independently of the
caller. An endFrame queued during the last RenderManager::render
before detachAll could still be sitting in the backend queue when
the synchronous `engine->destroy(swapChain)` frees the SwapChain
object. The backend then runs endFrame on freed memory and aborts
at Renderer.cpp:490 — "SwapChain must remain valid until endFrame
is called."

Symptom: SIGABRT on multi-viewer dispose / mount toggles in
Flutter apps. Single-viewer is unaffected because the frame
scheduler doesn't have anything queued to drain at dispose time.
The earlier hypothesis (Issue nmfisher#167) was that the precondition
itself was the bug; that turned out to be a separate list-aliasing
issue. With nmfisher#167 / nmfisher#168 / nmfisher#170 / nmfisher#171 in place, this surfaced as
the real underlying race that all four were downstream of.

Fix: insert `await flush()` (which calls
Engine_flushAndWaitRenderThread) between detachAll and
Engine_destroySwapChainRenderThread. flushAndWait blocks until the
backend has processed every command submitted up to this point,
including any pending endFrame referencing the swap chain. Then
destroy is safe.

flush() is heavy but acceptable on a destroy path that already
implies the surface is going away.
…ched views

Filament's `Renderer` class is documented as a per-window primitive
(see Renderer.h: "endFrame() schedules the current frame to be
displayed on the Renderer's window."), but Thermion's RenderManager
shares a single Renderer across every SwapChain in the Engine,
calling `beginFrame(SC_i)` / `endFrame()` for each entry in
`mViewAttachments` inside one render() call. That works fine when
each entry has views — Filament's internal `mSwapChain` is set,
read, and reset cleanly per pair. It misbehaves when an entry has
ZERO views attached: the begin/end pair fires the same state
machine as a real frame but does nothing useful, opening a window
for SwapChain validity assertions to trip during the dispose /
mount transitions multi-viewer apps create.

Empty-views entries arise easily: `setRendering(false)` and
`renderManager.detach(view)` both remove the view from the entry's
view list but leave the SwapChain entry in `mViewAttachments` for
later reattachment. Each render iteration then ran a wasted
begin/end against that empty entry, while concurrent destroy /
attach traffic from the disposing or mounting viewer hit Filament's
internal state. On Android and Windows this reliably fired
`endFrame:490 — SwapChain must remain valid until endFrame is
called.` on multi-viewer count toggles. Single-viewer never
triggered it because there was never an empty entry to iterate.

Fix: scan `attachment.views` before calling beginFrame and
return false (skip) if all entries are null. The begin/end pair
only happens when there's actual rendering work, which is what
Filament's Renderer expects. Single-viewer behaviour is
unchanged. Multi-viewer dispose / mount no longer races against
empty begin/end cycles.

Two fixes for the residual `endFrame:490 — SwapChain must remain
valid` precondition that fires on mass dispose / mount in multi-
viewer apps (8 viewers in particular):

1) **`FFIRenderManager._syncViews` snapshots before iterating.**

   The old loop iterated `_attachments.keys` and `_attachments[key]`
   directly. Each iteration awaits a render-thread round-trip, and
   while awaiting, concurrent `attach` / `detach` calls (common
   when multiple viewers mount/dispose at once) mutate the live
   map. The iterator state across awaits ends up stale: some swap
   chains get skipped, others written with old data. The C++
   RenderManager then sees inconsistent attachments and the next
   render tickles the SwapChain validity assertion.

   Snapshot the keys + per-key view list at function entry, then
   iterate the snapshot. Tolerates concurrent modifications: if
   an entry is removed by the time we get back to it, the views
   list is null and we skip it cleanly.

2) **`FFIFilamentApp.destroySwapChain` serialises through a static
   Future chain.**

   With 8 simultaneous viewers disposing, eight `destroySwapChain`
   calls run concurrently. Each calls `flush()` (Engine::flushAndWait)
   between detach and engine->destroy — but if other destroys are
   queueing more commands on Filament's backend at the same
   instant, no individual flush guarantees a quiet backend. The
   shared Renderer's `mSwapChain` state ends up pointing at a
   SwapChain another destroy is in the middle of freeing.

   Serialise: each destroy awaits the previous one before starting.
   The flush in each destroy now has a stable target — nothing
   else is queueing destroy work — so it truly drains, and the
   synchronous engine->destroy can proceed safely.

Together with the empty-views skip in `RenderManager::renderSwapChainAt`
(52c33f7), these fixes close the timing windows for the
shared-Renderer-with-multiple-SwapChains pattern. Filament's
Renderer is documented as per-window, but Thermion's design
shares it; both patches make the shared use safe under contention.

Tested scenario: stress-test screen toggle 1 → 4 → 8 → 1 → 4. The
4-viewer state had been working (with frame drops); 8-viewer was
initially stable but crashed on extended use; 8 → 1 transitions crashed
reliably. All three are race surfaces with shared destroy/sync
flow; serialisation + snapshotting eliminate them.
… Future chain

Previous fix (d261b8f) snapshotted `_syncViews` iteration so the
loop wouldn't see stale state across awaits. That kept iteration
self-consistent, but didn't address concurrent calls *between*
sibling viewers: in multi-viewer apps where multiple viewers
mount or dispose simultaneously, sibling viewers' `attach`,
`detach`, and `detachAll` calls run in parallel. Each takes its
own snapshot at a different moment — and one viewer's snapshot
can capture another viewer's view that's about to be destroyed
via `destroyView`. By the time the snapshot's `_syncViews`
reaches `RenderManager_setRenderableRenderThread`, that view's
native handle has been freed; Filament dereferences the dangling
pointer and the process takes a SIGSEGV.

(The previous SIGABRT preconditions all triggered on detected
inconsistent state at the same boundary; SIGSEGV happens when
the inconsistent state slips past the precondition checks and
hits real memory.)

Fix: introduce a static `_opChain` Future and a `_serialize`
helper that wraps each mutation of `_attachments`. attach,
detach, and detachAll now each await the previous op before
running — concurrent calls queue up cleanly, each runs to
completion (including its `_syncViews` round-trips) before the
next starts. View destruction (`destroyView` awaits
`detach(view)` first) cannot interleave inside another viewer's
sync.

Combined with the destroySwapChain serialisation in
ffi_filament_app.dart (also d261b8f), every shared-state
mutation in the multi-viewer pipeline now runs serially.

Tested scenario was the same one that surfaced the SIGSEGV:
toggle stress-test from 8 viewers back to 1. The dispose flow
fires 8 simultaneous viewer.dispose chains plus 8 simultaneous
ThermionWidget.dispose async closures, each touching the shared
`_attachments` and View handle state. Without serialisation the
attach for the new single viewer would race against destroyView
for the eight outgoing viewers; with serialisation, the ordering
is enforced at the boundary that matters.
@nmfisher
Owner

Thanks for this (and all the other PRs), these look like some very valuable contributions. Unfortunately this week is looking quite busy so I can't promise when I can get to them, but as soon as I get a chance I'll come back to review.
