multi-viewer: synchronise shared-state mutations + skip empty render cycles #171
Open
mushogenshin wants to merge 4 commits into
Conversation
…ing endFrame

Filament's `Engine::destroy(SwapChain*)` is documented as not synchronising with the GPU; explicit guidance is to call `flushAndWait()` first if a frame might still be in flight. `destroySwapChain` previously called `detachAll(swapChain)` (which takes RenderManager's mutex and removes the swap chain from the C++ `mViewAttachments` synchronously, so no further `RenderManager::render` iteration touches it) and then immediately `Engine_destroySwapChainRenderThread`.

That's enough on Thermion's own render thread, but it's not enough for Filament's *backend* command queue: Filament's BEGIN_FRAME / render / END_FRAME commands are processed on its backend thread independently of the caller. An endFrame queued during the last `RenderManager::render` before `detachAll` could still be sitting in the backend queue when the synchronous `engine->destroy(swapChain)` frees the SwapChain object. The backend then runs endFrame on freed memory and aborts at Renderer.cpp:490 — "SwapChain must remain valid until endFrame is called."

Symptom: SIGABRT on multi-viewer dispose / mount toggles in Flutter apps. Single-viewer is unaffected because the frame scheduler doesn't have anything queued to drain at dispose time. The earlier hypothesis (Issue nmfisher#167) was that the precondition itself was the bug; that turned out to be a separate list-aliasing issue. With nmfisher#167 / nmfisher#168 / nmfisher#170 / nmfisher#171 in place, this surfaced as the real underlying race that all four were downstream of.

Fix: insert `await flush()` (which calls `Engine_flushAndWaitRenderThread`) between `detachAll` and `Engine_destroySwapChainRenderThread`. flushAndWait blocks until the backend has processed every command submitted up to this point, including any pending endFrame referencing the swap chain. Then destroy is safe. flush() is heavy but acceptable on a destroy path that already implies the surface is going away.
…ched views

Filament's `Renderer` class is documented as a per-window primitive (see Renderer.h: "endFrame() schedules the current frame to be displayed on the Renderer's window."), but Thermion's RenderManager shares a single Renderer across every SwapChain in the Engine, calling `beginFrame(SC_i)` / `endFrame()` for each entry in `mViewAttachments` inside one render() call. That works fine when each entry has views — Filament's internal `mSwapChain` is set, read, and reset cleanly per pair. It misbehaves when an entry has ZERO views attached: the begin/end pair fires the same state machine as a real frame but does nothing useful, opening a window for SwapChain validity assertions to trip during the dispose / mount transitions multi-viewer apps create.

Empty-views entries arise easily: `setRendering(false)` and `renderManager.detach(view)` both remove the view from the entry's view list but leave the SwapChain entry in `mViewAttachments` for later reattachment. Each render iteration then ran a wasted begin/end against that empty entry, while concurrent destroy / attach traffic from the disposing or mounting viewer hit Filament's internal state. On Android and Windows this reliably fired `endFrame:490 — SwapChain must remain valid until endFrame is called.` on multi-viewer count toggles. Single-viewer never triggered it because there was never an empty entry to iterate.

Fix: scan `attachment.views` before calling beginFrame and return false (skip) if all entries are null. The begin/end pair only happens when there's actual rendering work, which is what Filament's Renderer expects. Single-viewer behaviour is unchanged. Multi-viewer dispose / mount no longer races against empty begin/end cycles.
Two fixes for the residual `endFrame:490 — SwapChain must remain valid` precondition that fires on mass dispose / mount in multi-viewer apps (8 viewers in particular):

1) **`FFIRenderManager._syncViews` snapshots before iterating.** The old loop iterated `_attachments.keys` and `_attachments[key]` directly. Each iteration awaits a render-thread round-trip, and while awaiting, concurrent `attach` / `detach` calls (common when multiple viewers mount/dispose at once) mutate the live map. The iterator state across awaits ends up stale: some swap chains get skipped, others written with old data. The C++ RenderManager then sees inconsistent attachments and the next render tickles the SwapChain validity assertion. Snapshot the keys + per-key view list at function entry, then iterate the snapshot. This tolerates concurrent modification: if an entry is removed by the time we get back to it, the views list is null and we skip it cleanly.

2) **`FFIFilamentApp.destroySwapChain` serialises through a static Future chain.** With 8 simultaneous viewers disposing, eight `destroySwapChain` calls run concurrently. Each calls `flush()` (Engine::flushAndWait) between detach and engine->destroy — but if other destroys are queueing more commands on Filament's backend at the same instant, no individual flush guarantees a quiet backend. The shared Renderer's `mSwapChain` state ends up pointing at a SwapChain that another destroy is in the middle of freeing. Serialise: each destroy awaits the previous one before starting. The flush in each destroy now has a stable target — nothing else is queueing destroy work — so it truly drains, and the synchronous engine->destroy can proceed safely.

Together with the empty-views skip in `RenderManager::renderSwapChainAt` (52c33f7), these fixes close the timing windows for the shared-Renderer-with-multiple-SwapChains pattern. Filament's Renderer is documented as per-window, but Thermion's design shares it; both patches make the shared use safe under contention.

Tested scenario: stress-test screen toggle 1 → 4 → 8 → 1 → 4. The 4-viewer state had been working (with frame drops); 8-viewer was initially stable but crashed on extended use; 8 → 1 transitions crashed reliably. All three are race surfaces in the shared destroy/sync flow; serialisation + snapshotting eliminate them.
… Future chain

Previous fix (d261b8f) snapshotted `_syncViews` iteration so the loop wouldn't see stale state across awaits. That kept iteration self-consistent, but didn't address concurrent calls *between* sibling viewers: in multi-viewer apps where multiple viewers mount or dispose simultaneously, sibling viewers' `attach`, `detach`, and `detachAll` calls run in parallel. Each takes its own snapshot at a different moment — and one viewer's snapshot can capture another viewer's view that's about to be destroyed via `destroyView`. By the time the snapshot's `_syncViews` reaches `RenderManager_setRenderableRenderThread`, that view's native handle has been freed; Filament dereferences the dangling pointer and the process takes a SIGSEGV. (The previous SIGABRT preconditions all triggered on detected inconsistent state at the same boundary; SIGSEGV happens when the inconsistent state slips past the precondition checks and hits real memory.)

Fix: introduce a static `_opChain` Future and a `_serialize` helper that wraps each mutation of `_attachments`. attach, detach, and detachAll now each await the previous op before running — concurrent calls queue up cleanly, each runs to completion (including its `_syncViews` round-trips) before the next starts. View destruction (`destroyView` awaits `detach(view)` first) cannot interleave inside another viewer's sync. Combined with the destroySwapChain serialisation in ffi_filament_app.dart (also d261b8f), every shared-state mutation in the multi-viewer pipeline now runs serially.

Tested scenario was the same one that surfaced the SIGSEGV: toggle stress-test from 8 viewers back to 1. The dispose flow fires 8 simultaneous viewer.dispose chains plus 8 simultaneous ThermionWidget.dispose async closures, each touching the shared `_attachments` and View handle state. Without serialisation the attach for the new single viewer would race against destroyView for the eight outgoing viewers; with serialisation, the ordering is enforced at the boundary that matters.
Owner
Thanks for this (and all the other PRs), these look like some very valuable contributions. Unfortunately this week is looking quite busy so I can't promise when I can get to them, but as soon as I get a chance I'll come back to review.
Summary
Four fixes that together make Thermion's multi-viewer pipeline (multiple `SwapChain`s sharing one `Engine` / `Renderer`) safe under contention. Each addresses a distinct race exposed when several viewers mount, dispose, or render concurrently. Single-viewer is unaffected.
Background
Filament documents `Renderer` as a per-window primitive — the example in `Engine.h` shows one `Engine` ↔ one `SwapChain` ↔ one `Renderer` ↔ one `View`, and `Renderer.h:343` says "It is recommended to use the same swapChain for every call to beginFrame, failing to do so can result in losing all or part of the FrameInfo history." Thermion shares a single `mRenderer` across every `SwapChain` in the engine and calls `beginFrame(SC_i)` / `endFrame()` for each entry in `mViewAttachments` inside one `render()` call. That pattern works in single-viewer apps but has multiple race windows the moment more than one viewer is involved.
This PR closes those windows. (Filament's per-window assumption is itself worth addressing long-term via per-SwapChain Renderers, but the changes below make the shared-Renderer-with-multiple-SwapChains pattern reliable in the meantime.)
Fix 1 — `flushAndWait` before destroying a SwapChain
`Engine::destroy(SwapChain*)` is documented as not synchronising with the GPU; explicit guidance is to call `flushAndWait()` first if a frame might still be in flight. `destroySwapChain` previously called `detachAll` (which removes the swap chain from `mViewAttachments` synchronously) and then immediately `Engine_destroySwapChainRenderThread`. Filament's backend command queue could still have a `BEGIN_FRAME` / `END_FRAME` pair referencing the swap chain when the synchronous `engine->destroy(swapChain)` frees the underlying object. The backend then runs endFrame on freed memory and aborts at `Renderer.cpp:490` — "SwapChain must remain valid until endFrame is called."
Insert `await flush()` between `detachAll` and `Engine_destroySwapChainRenderThread`.
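As a rough sketch, the Dart side of the destroy path after the fix looks like this. `detachAll` and `flush` come from the PR text; `destroySwapChainOnRenderThread` and the exact types/signatures are illustrative assumptions, not the actual bindings:

```dart
// Sketch of FFIFilamentApp.destroySwapChain after the fix.
Future<void> destroySwapChain(FFISwapChain swapChain) async {
  // Synchronously removes the swap chain from C++ mViewAttachments, so
  // no further RenderManager::render iteration touches it.
  await renderManager.detachAll(swapChain);

  // NEW: drain Filament's backend. flush() invokes
  // Engine_flushAndWaitRenderThread, which blocks until every command
  // submitted so far, including any pending endFrame still referencing
  // this swap chain, has been processed.
  await flush();

  // Only now is the synchronous engine->destroy(swapChain) safe.
  await destroySwapChainOnRenderThread(swapChain);
}
```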
Fix 2 — skip begin/end cycle when SwapChain has no attached views
Both `setRendering(false)` and `renderManager.detach(view)` remove the view from a swap chain entry's view list but leave the swap chain entry itself in `mViewAttachments` for later reattachment. When `renderSwapChainAt` then iterated that entry, it called `mRenderer->beginFrame(...)` / `endFrame()` with no actual rendering work in between. That is merely wasted work in single-viewer apps, but dangerous in multi-viewer ones: the per-pair internal state on Filament's `mSwapChain` is set and cleared rapidly across the iteration, opening a race window for the validity assertion when concurrent destroy/attach traffic from a sibling viewer hits Filament's internal state.
Scan `attachment.views` before calling `beginFrame`. Return false if all are null. The begin/end pair only happens when there's actual rendering work — which is what Filament's Renderer expects.
Fix 3 — serialise `destroySwapChain` + snapshot `_syncViews` iteration
Two changes in one commit because they address overlapping symptoms in the same dispose path:
(a) `destroySwapChain` serialisation: in mass-dispose scenarios (e.g., a stress-test screen toggling 8 viewers down to 1), eight concurrent `destroySwapChain` calls each ran its own `detachAll → flush → destroy` sequence in unsynchronised parallel. Each call's `flush()` had no stable target — other destroys were concurrently queueing fresh commands on Filament's backend, so no individual flush guaranteed a quiet backend. The shared Renderer's `mSwapChain` could end up pointing at a SwapChain another destroy was in the middle of freeing. Add a `static Future _destroyChain` to `FFIFilamentApp` so each `destroySwapChain` awaits the previous one — flush() now has a quiet backend at the moment it runs; a sketch of the chaining pattern follows below.
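The chaining pattern might look like the following. `_destroyChain` is named in the PR; `_destroySwapChain` stands in for the detachAll → flush → destroy body sketched under Fix 1, and the error handling shown is an assumption:

```dart
// Seeded with a completed future; every destroy appends itself.
static Future<void> _destroyChain = Future.value();

Future<void> destroySwapChain(FFISwapChain swapChain) {
  // Each destroy waits for the previous one, so at most one
  // detachAll -> flush -> destroy sequence is in flight, and flush()
  // runs against a backend no sibling destroy is writing to.
  final op = _destroyChain.then((_) => _destroySwapChain(swapChain));
  // Keep errors out of the stored chain so one failed destroy cannot
  // wedge every later one; callers still observe the error via `op`.
  _destroyChain = op.catchError((_) {});
  return op;
}
```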
(b) `_syncViews` iteration snapshot: the loop iterated `_attachments.keys` directly, awaiting a render-thread round-trip on each iteration. Concurrent `attach` / `detach` from sibling viewers mutated the live map across the await, leaving the iterator with stale state — `_syncViews` then pushed inconsistent view lists to C++ RenderManager. Snapshot keys + per-key views at function entry; iterate the snapshot.
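A sketch of the snapshotting loop, assuming `_attachments` maps swap chains to view lists as the PR text implies; `_setRenderable` is a hypothetical stand-in for the per-view render-thread round-trip:

```dart
Future<void> _syncViews() async {
  // Snapshot keys and per-key view lists at entry; the live map may be
  // mutated by concurrent attach/detach while we await below.
  final keys = _attachments.keys.toList();
  final snapshot = {for (final k in keys) k: _attachments[k]?.toList()};

  for (final swapChain in keys) {
    // Entry removed by a concurrent detach before we reached it: skip
    // cleanly rather than push stale views to the C++ RenderManager.
    if (!_attachments.containsKey(swapChain)) continue;
    for (final view in snapshot[swapChain] ?? const []) {
      // Each render-thread round-trip suspends here; iterating the
      // snapshot keeps this loop's state valid across the await.
      await _setRenderable(swapChain, view);
    }
  }
}
```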
Fix 4 — serialise `attach` / `detach` / `detachAll` through a static Future chain
Fix 3 kept iteration self-consistent across its own awaits, but didn't synchronise between sibling viewers' attach/detach calls. Each viewer's snapshot was taken at its own moment; one viewer's snapshot could capture another viewer's view that was being concurrently destroyed via `destroyView` (which itself awaits `detach(view)`). By the time the snapshot's `_syncViews` reached `RenderManager_setRenderableRenderThread`, the captured view's native handle had been freed — Filament dereferenced the dangling pointer and the process took a SIGSEGV (past the precondition checks).
Introduce a static `_opChain` Future and a `_serialize` helper in `FFIRenderManager`. `attach`, `detach`, and `detachAll` now each await the previous op before running. Concurrent calls queue cleanly; view destruction can no longer interleave inside another viewer's sync.
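A minimal sketch of the serialisation helper: the `_opChain` / `_serialize` names come from the PR, while `_attachImpl` and friends are hypothetical stand-ins for the previous method bodies:

```dart
// All _attachments mutations funnel through this single chain.
static Future<void> _opChain = Future.value();

// Queue [op] behind every previously queued mutation; errors are kept
// out of the stored chain so one failure cannot wedge the queue.
static Future<T> _serialize<T>(Future<T> Function() op) {
  final result = _opChain.then((_) => op());
  _opChain = result.then((_) {}).catchError((_) {});
  return result;
}

Future<void> attach(FFIView view, FFISwapChain swapChain) =>
    _serialize(() => _attachImpl(view, swapChain));

Future<void> detach(FFIView view) => _serialize(() => _detachImpl(view));

Future<void> detachAll(FFISwapChain swapChain) =>
    _serialize(() => _detachAllImpl(swapChain));
```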
Result
Combined, every shared-state mutation in the multi-viewer pipeline runs serially. The spike app's stress-test screen on Android — which previously crashed on every viewer-count toggle — now mounts, renders, region-picks, and disposes cleanly across 1 / 4 / 8 viewer counts and arbitrary toggle sequences.
Verification
Related