Skip to content

android: getSwapChains() should return a snapshot, not the live list#167

Closed
mushogenshin wants to merge 18 commits into
nmfisher:developfrom
mushogenshin:fix/android-swapchain-self-destroy
Closed

android: getSwapChains() should return a snapshot, not the live list#167
mushogenshin wants to merge 18 commits into
nmfisher:developfrom
mushogenshin:fix/android-swapchain-self-destroy

Conversation

@mushogenshin
Copy link
Copy Markdown
Contributor

Problem

On Android, every `ThermionViewer` mount aborts the process with:

```
E/Filament: Precondition
E/Filament: in endFrame:490
E/Filament: reason: SwapChain must remain valid until endFrame is called.
F/libc: Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE)
```

Reproducible with `examples/flutter/quickstart` unmodified, both on the emulator (`sdk gphone64 arm64`, API 35) and on a physical Android device. iOS, macOS, and Windows are unaffected.

Cause

The crash is not a destroy-vs-`endFrame` race. `Filament` complains because the SwapChain pointer was freed before `endFrame` ran. How that happened is a list-aliasing bug in `FFIFilamentApp.getSwapChains()`:

```dart
// thermion_dart/lib/src/filament/src/implementation/ffi_filament_app.dart
final _swapChains = [];

Future<Iterable> getSwapChains() async {
return _swapChains; // ← live reference, not a snapshot
}
```

The Android branch of `thermion_flutter_plugin_native.dart` (`createTextureAndBindToView`) reads the swap chains, creates a new one for the surface texture, then destroys whatever was in the snapshot:

```dart
if (Platform.isAndroid) {
final swapChains = await FilamentApp.instance!.getSwapChains(); // aliases _swapChains
final swapChain = await FilamentApp.instance!.createSwapChain( // appends to _swapChains
Pointer.fromAddress(descriptor.windowHandle!));
if (swapChains.isNotEmpty) { // now contains [newSwapChain]
await FilamentApp.instance!.destroySwapChain(swapChains.first); // destroys the NEW one
}
await FilamentApp.instance!.renderManager.attach(view, swapChain);
// `swapChain` now points to freed memory.
}
```

The intent reads cleanly — "snapshot the existing chains, create the new one, tear down old chains" — but because `getSwapChains()` returned the live list reference, `createSwapChain` mutated the very list the plugin was iterating. `swapChains.first` was the swap chain the plugin had just created. The plugin destroyed it, attached the view to a freed pointer, and the next render hit the precondition.

iOS, macOS, and Windows don't reach this code path (their plugin branches don't iterate `getSwapChains()`).

Reproduction

```bash
git clone https://github.com/nmfisher/thermion
cd thermion/examples/flutter/quickstart
flutter pub get
flutter run -d
```

Crashes on first viewer mount, before any user interaction. Log signature:

```
[INFO] FFIFilamentApp: Created swapchain from window
[INFO] FFIFilamentApp: Destroying swapchain
[INFO] FFIFilamentApp: Destroyed swapchain
E/Filament: Precondition in endFrame:490 — SwapChain must remain valid until endFrame is called.
F/libc: Fatal signal 6 (SIGABRT)
```

Fix

One-line change in `getSwapChains` to return an unmodifiable snapshot:

```dart
Future<Iterable> getSwapChains() async {
return List.unmodifiable(_swapChains);
}
```

Defensive at the API layer — fixes the Android plugin's logic without modifying the plugin, and protects any future caller from the same aliasing pitfall.

Verification

  • `examples/flutter/quickstart` runs without crashing on `sdk gphone64 arm64` (API 35) emulator on macOS / Apple Silicon.
  • Verified by spike author downstream that the same fix unblocks a multi-viewer app on Android device.
  • iOS, macOS, Windows behaviour unchanged (their plugin paths don't iterate `getSwapChains`).

Related fork patches

This is the fourth Android-/iOS-/Windows-targeted fix on the mushogenshin/thermion fork. The other three are filed as #160 (iOS vulkan exclusion), #163 (eager-scale gesture for arena conflict), and #164 (Windows native build hook + plugin includes). All independent of this one.

The build hook in thermion_dart/hook/build.dart enumerates every
.cpp file under native/src/ and applies platform-specific exclusions
for `windows`, `d3d`, and `linux` paths. There is no analogous
exclusion for `vulkan/`, so on iOS the hook compiles
`native/src/vulkan/VulkanUtils.cpp` and
`native/src/vulkan/BaseVulkanTexture.cpp`. Both reference symbols in
the `bluevk::` namespace, but iOS is Metal-only and the platform
branch never adds `bluevk` to its `libs` list.

Result on iOS device builds (Xcode 15, Flutter master):

    Undefined symbols for architecture arm64:
      "bluevk::vkMapMemory", referenced from:
          thermion::vulkan::readVkImageToBitmap(...) in VulkanUtils-*.o
      "bluevk::vkFreeMemory", referenced from: ...
      ... (~50 more bluevk:: symbols)
    ld: symbol(s) not found for architecture arm64

The whole `flutter build ios` fails inside `dart_build` because the
shared lib never links.

Fix: skip vulkan-prefixed source paths when targetOS is iOS, mirroring
the structure of the existing windows / linux exclusions a few lines
above. macOS still gets the vulkan sources (it links `bluevk`),
Android still gets them (also links `bluevk`), Linux still gets them.
Only iOS changes.

Verified: `flutter build ios --debug --no-codesign` now produces a
working `Runner.app` on top of upstream develop.
The mobile gesture handler in `_MobileListenerWidget` used a regular
`GestureDetector` whose internal `ScaleGestureRecognizer` waits for
movement to cross `kPanSlop` before claiming the gesture arena. When
a Thermion view sits inside an ancestor `Scrollable` (e.g. ListView,
PageView, CustomScrollView), the ancestor's `VerticalDragGesture-
Recognizer` reaches its acceptance threshold first and wins the
arena — touches starting on the viewer get interpreted as page
scrolls, not viewport gestures. The result on iOS is "sometimes
tumbles, sometimes scrolls" depending on drag direction and speed.

Fix: switch to `RawGestureDetector` with a `_EagerScaleGesture-
Recognizer` subclass that calls `resolve(GestureDisposition.accepted)`
inside `addAllowedPointer`. The arena is claimed on PointerDown, so
ancestor scrollables never get the chance to win. Single-finger orbit
and pinch-zoom both resolve to the viewport regardless of how the
viewer is composed.

Tap and double-tap can no longer be detected via separate
`TapGestureRecognizer` / `DoubleTapGestureRecognizer` entries (the
eager scale wins arena before they ever accept), so they're
synthesized inside the scale callbacks: a "tap" is a scale gesture
whose total focal-point movement stayed below 8 px and whose duration
stayed below 250 ms; a "double-tap" is two such taps within 300 ms
of each other and within 8 px of each other. Both still feed
`InputHandler` via the existing `TouchEvent(tap, ...)` / `TouchEvent(
doubleTap, ...)` events, so callers see no API change.

Verified: `flutter build ios --debug --no-codesign` succeeds; running
on iPhone the multi-viewer stress test (8 viewers in a vertical
ListView) now lets the parent scroll only when drags start in the
gutter and routes every touch starting on a viewer to the viewport
deterministically.

Filing alongside the iOS vulkan-source exclusion (PR nmfisher#160).
The Windows branch of the build hook gathers cl.exe options
(includes, defines, response file with sources) but stops short of
adding the /link separator and /LIBPATH:$libDir. Without those, the
linker has no idea where the Filament .lib artifacts that pub get
downloaded actually live, so it fails with LNK1104 / LNK2019 on
every Filament symbol reference and cl.exe exits 2.

The library *inputs* themselves don't need to be added on the
command line because native/include/ThermionWin32.h declares all of
them via #pragma comment(lib, "filament.lib") and friends, and that
header is transitively included by the Windows vulkan/d3d sources
plus the generic c_api headers (TCamera.h, TRenderer.cpp,
TRenderManager.cpp). The pragma directives emit linker directives
that the linker honours automatically — but it still has to find
the .lib files via a search path, hence /LIBPATH.

Uncomment /link and /LIBPATH:$libDir; leave /DLL out (CBuilder.library
already passes /LD which makes /DLL redundant) and leave the
individual sources out (they're already in the response file).

Verification needed on Windows: I don't have a Windows dev box.
Filing this as a separate fork branch so it can be rebased / dropped
independently of the iOS fixes (nmfisher#160 vulkan-source exclusion, nmfisher#163
eager scale).
Thermion's build hook routes all log output through a Logger that
writes to .dart_tool/thermion_dart/log/build.log. native_toolchain_c
captures subprocess (cl.exe, clang, ld) stdout/stderr and routes
the stderr through `logger.severe(...)`. On a build failure
runProcess throws a ProcessException whose message is just the
command line and exit code — the actual compiler / linker output
stays inside that build.log file in the pub cache, where it is
invisible to CI logs and to anyone running `flutter build` who
isn't already poking around in `.dart_tool/`.

Result: compile / link failures look completely opaque downstream.
The exception output shows the cl.exe command, exit code 2, and
nothing else. Has been blocking diagnosis of the Windows native
build for several CI iterations.

Mirror SEVERE-level records to stderr inside the existing log
handler so the real error reaches whoever's watching. Successful
builds aren't noisier — compilers don't emit much stderr on
success — and the build.log file behavior is unchanged.
native_toolchain_c.runCl already injects /Fe:, /LD, its own /link
separator, /MACHINE, and /LIBPATH (from libraryDirectories) when
dynamicLibrary is set. By manually adding our own '/link
/LIBPATH:$libDir' to the user flags list, cl.exe's parser stopped
reading compile-time flags at our separator — pushing the toolchain's
auto-injected /LD and /Fe: into LINK's argument vector, where LINK
ignored them as LNK4044. With /LD never reaching cl.exe, no /DLL was
forwarded to the linker, which then tried to produce an EXE and bailed
out with LNK1561 ("entry point must be defined").

Drop the redundant /link /LIBPATH and let `libraryDirectories: [libDir]`
do its job. Also drop /VERBOSE from the compile flag list — it's a
linker option that cl.exe parses as deprecated /V<string> (warning
D9035); if the verbose link map is ever needed, it belongs after
native_toolchain_c's own /link separator.

Diagnosed via the stderr-tee added in the prior commit; the smoking
gun was the cl.exe command dump in build.log showing two /link
separators and unrecognized cl flags trailing the first one.
Windows filesystems are case-insensitive, so the basename collision
between native/src/scene/Gizmo.cpp and native/include/material/gizmo.c
caused both source files to compile to the same gizmo.obj path. The
material was added to the source list AFTER the scene class, so its
.obj write-clobbered the class's, leaving thermion::Gizmo's
constructor, pick, highlight, and unhighlight symbols out of the link.
TGizmo.cpp's c_api wrappers then failed with four LNK2019s and a
final LNK1120 ("4 unresolved externals"). LINK's earlier LNK4042
warning ("object specified more than once; extras ignored") was the
breadcrumb.

Rename the material .c file to gizmo_material.c (gizmo.h stays put,
since only the .c basename collided). materialSources entry in
build.dart updated to match. The "gizmo" key still drives the
GIZMO_ENABLED=1 define so existing call sites and tests are
unaffected.

Verified with `flutter test` on Windows after both this and the
preceding cl/link fix: 500/500 tests pass, thermion_dart.dll
produced.
The link hook called CLinker.library(... LinkerOptions.manual(...))
with no `sources`. native_toolchain_c.runCl translates that into:

    cl.exe /O2 /LD /Fe:thermion_dart.dll /link /OPT:REF /MACHINE:X64
           /LIBPATH:<empty link/ output dir>

i.e. cl.exe with no inputs at all, which exits immediately with
`cl : Command line error D8003 : missing source filename`. Under
`flutter test` the link hook is not invoked, so this only surfaces
during a real `flutter build windows --release` — where it shows up
as the terse "Linking native assets failed" through MSBuild, with
the actual D8003 stranded inside the per-package build.log in the
pub cache.

The link phase here is optional: the build hook already produced
thermion_dart.dll and added a CodeAsset for it. Pass those
build-hook assets through unchanged on Windows. Keep the CLinker
call on platforms where it currently works.
…call

`thermion_flutter_plugin`'s own translation units include Filament
headers (transitively, via `WindowsVulkanContext.h` → `Platform.h`),
but `target_include_directories(... INTERFACE ...)` only exposes the
DART_PKG_HEADERS paths to consumers, not to the plugin itself. cl.exe
then could not find `<utils/...>` headers when compiling the plugin,
producing C3083 / C2039 / C4430 on filament headers.

Switch the scope to PUBLIC so the plugin's own compile and any
consumer both see the headers.

Also drop the immediately-following

    include_directories(${PLUGIN_NAME} INTERFACE ...)

call. `include_directories()` accepts only path arguments — the
target name and `INTERFACE` keyword are silently treated as bogus
"directory" entries, so the call doesn't do what its author
intended; it just adds two non-existent paths to the directory-level
include set.
Filament's `backend/Platform.h` friend-declares
`utils::io::ostream& operator<<(...)` without first declaring the
nested namespace. Sibling Filament headers like
`backend/DriverEnums.h` carry the forward declaration

    namespace utils::io { class ostream; }

so the rest of Filament compiles cleanly. The plugin's translation
unit reaches `Platform.h` (via `WindowsVulkanContext.h`) before any
of those siblings, leaving the namespace undeclared and cl.exe
rejecting the friend with C3083 ("the symbol to the left of '::'
must be a type"), C2039 ("'ostream': is not a member of 'utils'"),
and C4430.

Mirror Filament's own forward declaration in the plugin header,
just before the WindowsVulkanContext include. Pulling the full
`<utils/ostream.h>` would also work but drags in more than the
friend needs; the namespace + class fwd-decl matches what
DriverEnums.h does.
…ying its own swapchain

Real cause of the SIGABRT-on-mount on Android. Not a destroy-vs-
endFrame race (the previous flush() patch was wrong). It's a
straight-up API bug in FFIFilamentApp.getSwapChains():

  Future<Iterable<SwapChain>> getSwapChains() async {
    return _swapChains;   // ← live reference, not a snapshot
  }

In `thermion_flutter_plugin_native.dart` the Android branch of
createTextureAndBindToView reads this list, creates a new swap
chain, then destroys whatever was in the snapshot:

  final swapChains = await FilamentApp.instance!.getSwapChains();
  final swapChain  = await FilamentApp.instance!.createSwapChain(...);
  if (swapChains.isNotEmpty) {
    await FilamentApp.instance!.destroySwapChain(swapChains.first);
  }
  await FilamentApp.instance!.renderManager.attach(view, swapChain);

The intent reads correctly: "snapshot existing chains; create the
new one; tear down old chains". But because getSwapChains returned
the live list reference rather than a copy, `swapChains` aliased
`_swapChains`. createSwapChain appended to _swapChains, so
swapChains.first immediately referred to the *new* swap chain we
just created. The plugin then destroyed its own new swap chain
and attached the view to a freed pointer. The next render hit
Filament's `SwapChain must remain valid until endFrame is called`
precondition and the process aborted with SIGABRT.

Reproduced on every viewer mount on Android — including
examples/flutter/quickstart, ARM64 emulator and physical device.
iOS / macOS / Windows are unaffected because their plugin paths
don't iterate getSwapChains().

Fix: return List.unmodifiable(_swapChains). Defensive at the API
layer; the call-site logic in plugin_native.dart is now correct
without modification.
The Android branch of `createTextureAndBindToView` (in
`thermion_flutter_plugin_native.dart`) destroys the *previous* swap
chain for the size-change case so the renderer doesn't keep an old,
wrong-sized surface around. The previous version chose what to
destroy by iterating FilamentApp's global swap chain list and
picking `swapChains.first`. After the upstream getSwapChains()
snapshot fix (nmfisher#167) the list is correctly snapshot, but in a
multi-viewer app `swapChains.first` is whichever viewer's swap
chain happens to be earliest in the list — usually a *different*
viewer's, not this one's. So mounting viewer #N destroyed viewer
#(N-1)'s swap chain. The renderer for #(N-1) then ran with a
freed pointer; symptom on screen was middle viewers freezing
while only the most recently mounted one (whose swap chain
nothing destroyed) kept rendering.

Fix: track the swap chain currently bound to each view in a
`<View, SwapChain>` map and destroy *that* one, scoped to this
view only. Other viewers' swap chains stay live.

Order also rearranged: attach the new swap chain *before*
destroying the old one, so the view never has a window of being
unattached. Other ordering inside the function is unchanged.

`_viewSwapChains` mirrors the pattern of the existing
`_viewRenderTargets` map a few lines above. No cleanup hook on
view-destroy, same as `_viewRenderTargets`'s existing convention;
that's a separate lifecycle concern.

Reproduces with the spike's stress-test screen at 4 / 8 viewers
on Android emulator. Single-viewer is unaffected (map starts
empty, no destroy on first mount). iOS / macOS / Windows are
unaffected — only the Android plugin branch ever ran the destroy
logic.
Fixes a bug introduced by this same patch: the synthesised tap was
dispatched with `event.focalPoint`, which Flutter documents as
*global* coordinates (relative to the screen / Flutter window).
`TouchEvent.localPosition` consumers — most importantly
`View.pick(x, y, …)` — expect coordinates relative to the
rendered viewport, not the entire screen. Whenever the
ThermionWidget wasn't anchored at (0, 0) on screen (i.e. every
realistic layout: anything inside a Scaffold body, a ListView, a
Padding, etc.), picks landed outside the viewport and the result
handler always returned the background entity.

Symptom on Android: region clicks did nothing in any layout that
nested the viewer below an AppBar / inside a list. iOS / macOS
hadn't been hit yet because the spike author's iOS test layout
happened to have the viewer near the screen origin.

Fix: capture `event.localFocalPoint` (widget-relative) into
`_scaleStartFocal`, and compare with `event.localFocalPoint` in
onUpdate so the tap-vs-drag movement threshold is also computed
in widget-local space. The `inputHandler.handle(ScaleStart/Update
Event(localFocalPoint: …))` calls keep passing
`event.focalPoint` (global) since that matches the upstream
convention used elsewhere in this file for those events — and
those events are consumed by orbit-camera code that uses deltas,
not absolute positions, so the global/local distinction doesn't
affect them.

Doesn't change behaviour for any caller that doesn't use the
synthesised tap (orbit, pinch-zoom, double-tap detection's
duration-based half is already correct).
After the previous patch (per-view swap chain tracking) the Android
plugin correctly created and reused a SwapChain bound to each
View. Disposal of the chain when the widget unmounted, however, was
not wired up — the SwapChain stayed in `_viewSwapChains` and stayed
attached to the RenderManager, while `ThermionWidget.dispose()`
released the underlying SurfaceTexture (and thus the native
window).

Result on the next frame: Filament's render thread iterates the
attached swap chains, hits the now-orphaned one, and calls
`eglSwapBuffers` on the freed native window. The emulator's EGL
backend logs:

  E/EGL_emulation: egl_window_surface_t::swapBuffers called with
  NULL buffer
  E/EGL_emulation: tid <n>: swapBuffers(764): error 0x300d
  (EGL_BAD_SURFACE)

repeatedly until the next frame after the SwapChain is collected
elsewhere. On the spike's stress-test screen at 4 / 8 viewers,
even brief widget rebuilds (e.g. toggling viewer count) flood the
log and cause visible frame drops because every render iteration
attempts the bad swap.

Fix: add a public `releaseTextureBindingForView(View view)` method
to `ThermionFlutterPlugin` that releases plugin-side resources
bound to a view (currently: the Android per-view SwapChain). The
native plugin removes the entry from `_viewSwapChains` and calls
`destroySwapChain` (which detaches from RenderManager and destroys
the Filament SwapChain). Web plugin is a no-op — it doesn't track
per-view bindings.

`ThermionWidget.dispose()` now calls this BEFORE destroying the
texture descriptor. The two are sequenced inside an unawaited
async closure so dispose() stays synchronous for the framework
but the swap chain destroy completes before the SurfaceTexture
release. Other platforms aren't affected — releaseTextureBinding
ForView is a no-op when no per-view binding exists.
`ThermionViewerFFI.dispose()` destroyed the underlying Filament
view without first removing it from `FFIRenderManager`'s Dart-side
`_attachments` map. The map keys each swap-chain entry to a list
of (renderOrder, View) tuples; if a view is destroyed while still
listed, any later `attach` / `detach` / `_syncViews` call from a
*different* viewer will iterate the now-stale tuple, retrieve the
freed view's native pointer, and pass it to
`RenderManager_setRenderableRenderThread`. Filament then does a
`handle_cast` and aborts with the generic
"Postcondition: corrupted heap Handle ... tag=(no tag)".

Reproduces reliably in multi-viewer Flutter apps when one viewer
is disposed concurrently with another viewer being mounted:

  - Old viewer.dispose() runs (queued onto whatever the host
    serialises on, e.g. a static `_createSerial` Future chain).
  - New viewer's createTextureAndBindToView calls
    renderManager.attach(newView, newSwapChain), which calls
    `_syncViews()` — this walks every entry in `_attachments`,
    including the entry containing the just-destroyed old view.
  - The freed pointer hits Filament's handle table → SIGABRT.

Symptom on Android (which is the most multi-viewer-heavy platform
in practice) is an immediate crash on toggling viewer count.
iOS / macOS happen to work because their host plugins don't
require the same reattach pattern, but the bug is real on every
platform — the symptom is just less reproducible.

Fix: call `FilamentApp.instance!.renderManager.detach(view)`
before `View_setScene(view, nullptr)` and `destroyView(view)`.
`detach(view)` (no swap chain argument) iterates `_attachments`
and removes the view from every entry, then runs `_syncViews()`
to push the cleaned state to C++ RenderManager. After that,
destroying the view is safe — no other code path will try to
dereference its handle.
…ing endFrame

Filament's `Engine::destroy(SwapChain*)` is documented as not
synchronising with the GPU; explicit guidance is to call
`flushAndWait()` first if a frame might still be in flight.

`destroySwapChain` previously called `detachAll(swapChain)` (which
takes RenderManager's mutex and removes the swap chain from C++
mViewAttachments synchronously, so no further RenderManager::render
iteration touches it) and then immediately
`Engine_destroySwapChainRenderThread`. That's enough on Thermion's
own render thread, but it's not enough for Filament's *backend*
command queue: Filament's BEGIN_FRAME / render / END_FRAME
commands are processed on its backend thread independently of the
caller. An endFrame queued during the last RenderManager::render
before detachAll could still be sitting in the backend queue when
the synchronous `engine->destroy(swapChain)` frees the SwapChain
object. The backend then runs endFrame on freed memory and aborts
at Renderer.cpp:490 — "SwapChain must remain valid until endFrame
is called."

Symptom: SIGABRT on multi-viewer dispose / mount toggles in
Flutter apps. Single-viewer is unaffected because the frame
scheduler doesn't have anything queued to drain at dispose time.
The earlier hypothesis (Issue nmfisher#167) was that the precondition
itself was the bug; that turned out to be a separate list-aliasing
issue. With nmfisher#167 / nmfisher#168 / nmfisher#170 / nmfisher#171 in place, this surfaced as
the real underlying race that all four were downstream of.

Fix: insert `await flush()` (which calls
Engine_flushAndWaitRenderThread) between detachAll and
Engine_destroySwapChainRenderThread. flushAndWait blocks until the
backend has processed every command submitted up to this point,
including any pending endFrame referencing the swap chain. Then
destroy is safe.

flush() is heavy but acceptable on a destroy path that already
implies the surface is going away.
…ched views

Filament's `Renderer` class is documented as a per-window primitive
(see Renderer.h: "endFrame() schedules the current frame to be
displayed on the Renderer's window."), but Thermion's RenderManager
shares a single Renderer across every SwapChain in the Engine,
calling `beginFrame(SC_i)` / `endFrame()` for each entry in
`mViewAttachments` inside one render() call. That works fine when
each entry has views — Filament's internal `mSwapChain` is set,
read, and reset cleanly per pair. It misbehaves when an entry has
ZERO views attached: the begin/end pair fires the same state
machine as a real frame but does nothing useful, opening a window
for SwapChain validity assertions to trip during the dispose /
mount transitions multi-viewer apps create.

Empty-views entries arise easily: `setRendering(false)` and
`renderManager.detach(view)` both remove the view from the entry's
view list but leave the SwapChain entry in `mViewAttachments` for
later reattachment. Each render iteration then ran a wasted
begin/end against that empty entry, while concurrent destroy /
attach traffic from the disposing or mounting viewer hit Filament's
internal state. On Android and Windows this reliably fired
`endFrame:490 — SwapChain must remain valid until endFrame is
called.` on multi-viewer count toggles. Single-viewer never
triggered it because there was never an empty entry to iterate.

Fix: scan `attachment.views` before calling beginFrame and
return false (skip) if all entries are null. The begin/end pair
only happens when there's actual rendering work, which is what
Filament's Renderer expects. Single-viewer behaviour is
unchanged. Multi-viewer dispose / mount no longer races against
empty begin/end cycles.
Two fixes for the residual `endFrame:490 — SwapChain must remain
valid` precondition that fires on mass dispose / mount in multi-
viewer apps (8 viewers in particular):

1) **`FFIRenderManager._syncViews` snapshots before iterating.**

   The old loop iterated `_attachments.keys` and `_attachments[key]`
   directly. Each iteration awaits a render-thread round-trip, and
   while awaiting, concurrent `attach` / `detach` calls (common
   when multiple viewers mount/dispose at once) mutate the live
   map. The iterator state across awaits ends up stale: some swap
   chains get skipped, others written with old data. The C++
   RenderManager then sees inconsistent attachments and the next
   render tickles the SwapChain validity assertion.

   Snapshot the keys + per-key view list at function entry, then
   iterate the snapshot. Tolerates concurrent modifications: if
   an entry is removed by the time we get back to it, the views
   list is null and we skip it cleanly.

2) **`FFIFilamentApp.destroySwapChain` serialises through a static
   Future chain.**

   With 8 simultaneous viewers disposing, eight `destroySwapChain`
   calls run concurrently. Each calls `flush()` (Engine::flushAndWait)
   between detach and engine->destroy — but if other destroys are
   queueing more commands on Filament's backend at the same
   instant, no individual flush guarantees a quiet backend. The
   shared Renderer's `mSwapChain` state ends up pointing at a
   SwapChain another destroy is in the middle of freeing.

   Serialise: each destroy awaits the previous one before starting.
   The flush in each destroy now has a stable target — nothing
   else is queueing destroy work — so it truly drains, and the
   synchronous engine->destroy can proceed safely.

Together with the empty-views skip in `RenderManager::renderSwap-
ChainAt` (52c33f7), these fixes close the timing windows for the
shared-Renderer-with-multiple-SwapChains pattern. Filament's
Renderer is documented as per-window, but Thermion's design
shares it; both patches make the shared use safe under contention.

Tested scenario: stress-test screen toggle 1 → 4 → 8 → 1 → 4. The
4-viewer state had been working (with frame drops); 8-viewer was
stable but crashed on extended use; 8 → 1 transitions crashed
reliably. All three are race surfaces with shared destroy/sync
flow; serialisation + snapshotting eliminate them.
… Future chain

Previous fix (d261b8f) snapshotted `_syncViews` iteration so the
loop wouldn't see stale state across awaits. That kept iteration
self-consistent, but didn't address concurrent calls *between*
sibling viewers: in multi-viewer apps where multiple viewers
mount or dispose simultaneously, sibling viewers' `attach`,
`detach`, and `detachAll` calls run in parallel. Each takes its
own snapshot at a different moment — and one viewer's snapshot
can capture another viewer's view that's about to be destroyed
via `destroyView`. By the time the snapshot's `_syncViews`
reaches `RenderManager_setRenderableRenderThread`, that view's
native handle has been freed; Filament dereferences the dangling
pointer and the process takes a SIGSEGV.

(The previous SIGABRT preconditions all triggered on detected
inconsistent state at the same boundary; SIGSEGV happens when
the inconsistent state slips past the precondition checks and
hits real memory.)

Fix: introduce a static `_opChain` Future and a `_serialize`
helper that wraps each mutation of `_attachments`. attach,
detach, and detachAll now each await the previous op before
running — concurrent calls queue up cleanly, each runs to
completion (including its `_syncViews` round-trips) before the
next starts. View destruction (`destroyView` awaits
`detach(view)` first) cannot interleave inside another viewer's
sync.

Combined with the destroySwapChain serialisation in
ffi_filament_app.dart (also d261b8f), every shared-state
mutation in the multi-viewer pipeline now runs serially.

Tested scenario was the same one that surfaced the SIGSEGV:
toggle stress-test from 8 viewers back to 1. The dispose flow
fires 8 simultaneous viewer.dispose chains plus 8 simultaneous
ThermionWidget.dispose async closures, each touching the shared
`_attachments` and View handle state. Without serialisation the
attach for the new single viewer would race against destroyView
for the eight outgoing viewers; with serialisation, the ordering
is enforced at the boundary that matters.
@mushogenshin
Copy link
Copy Markdown
Contributor Author

Closing in favour of more focused PRs.

This branch was originally about a single bug (`getSwapChains()` returning a live list reference) but accumulated nine commits as we kept finding adjacent races in the multi-viewer dispose / mount path on Android. That makes for an unwieldy review. I've split the contents into three independently-reviewable PRs against `develop`, each with a single coherent narrative:

Also: #168 — follow-up to the merged #163 fixing the synthesised tap to use `event.localFocalPoint` (Android picking) — the cherry-pick of that fix was on this branch as ec6b305; it's now its own PR against develop since it logically belongs with #163.

The branch (`fix/android-swapchain-self-destroy`) is preserved on the fork for spike continuity but is no longer the upstream PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant