feat: improve handling of UI↔Background initialization timeout errors#40306
feat: improve handling of UI↔Background initialization timeout errors#40306gauthierpetetin wants to merge 65 commits intomainfrom
Conversation
- Add restore accounts option on critical error screen when vault backup exists - Background: reload all UI ports on timeout restore (align with vault recovery) - Consolidate critical error analytics to two events: Critical Error Screen Viewed (with restore_accounts_enabled) and Critical Error Restore Wallet Button Pressed - Add track-critical-error.ts for early Segment tracking - Critical startup handler: three-phase check (liveness, init, state sync) with BACKGROUND_INITIALIZED and distinct timeout messages - E2E: onboardThenTriggerTimeOutFlow, CriticalErrorPage restore link and waitForPageAfterExtensionReload; use waitUntil instead of driver.delay after restore confirm; home page send button supports EVM and non-EVM - Unit tests: display-critical-error (restore link), critical-startup-error-handler - i18n and manifest flags for init/state-sync hang simulation Co-authored-by: Cursor <cursoragent@cursor.com>
|
CLA Signature Action: All authors have signed the CLA. You may need to manually re-run the blocking PR check if it doesn't pass in a few minutes. |
✨ Files requiring CODEOWNER review ✨🧪 @MetaMask/qa (2 files, +94 -15)
|
Builds ready [56aeb76]
⚡ Performance Benchmarks (1410 ± 113 ms)
🌐 Dapp Page Load BenchmarksCurrent Commit: 📄 Localhost MetaMask Test DappSamples: 100 Summary
📈 Detailed Results
Bundle size diffs [🚨 Warning! Bundle size has increased!]
|
Remove backgroundInitializationTimeout and backgroundStateSyncTimeout from en and en_GB; they were never referenced in code (error messages are hardcoded in critical-startup-error-handler). Co-authored-by: Cursor <cursoragent@cursor.com>
…tical-error flow Co-authored-by: Cursor <cursoragent@cursor.com>
Builds ready [1a0a013]
⚡ Performance Benchmarks (1353 ± 105 ms)
🌐 Dapp Page Load BenchmarksCurrent Commit: 📄 Localhost MetaMask Test DappSamples: 100 Summary
📈 Detailed Results
Bundle size diffs [🚨 Warning! Bundle size has increased!]
|
…atch - Add onupgradeneeded handler when opening metamask-backup so the store object store is created if the DB is created here first (avoids corrupting backup for PersistenceManager). - Close DB connection in catch path to avoid leaking the connection when transaction or store access fails. Co-authored-by: Cursor <cursoragent@cursor.com>
Use fixed 3s delay then single switchToWindowWithTitle (like VaultRecoveryPage.clickRecoveryButton) instead of polling. Polling was flaky in CI and could contribute to ECONNREFUSED. Co-authored-by: Cursor <cursoragent@cursor.com>
Builds ready [d491397]
⚡ Performance Benchmarks (1370 ± 113 ms)
🌐 Dapp Page Load BenchmarksCurrent Commit: 📄 Localhost MetaMask Test DappSamples: 100 Summary
📈 Detailed Results
Bundle size diffs [🚨 Warning! Bundle size has increased!]
|
Co-authored-by: Cursor <cursoragent@cursor.com>
…rs restore tests Co-authored-by: Cursor <cursoragent@cursor.com>
Builds ready [5b10d08]
⚡ Performance Benchmarks
🌐 Dapp Page Load BenchmarksCurrent Commit: 📄 Localhost MetaMask Test DappSamples: 100 Summary
📈 Detailed Results
Bundle size diffs [🚨 Warning! Bundle size has increased!]
|
…d derived type Made-with: Cursor
Made-with: Cursor
Builds ready [8cc0f2e]
⚡ Performance Benchmarks
🌐 Dapp Page Load BenchmarksCurrent Commit: 📄 Localhost MetaMask Test DappSamples: 100 Summary
📈 Detailed Results
Bundle size diffs [🚨 Warning! Bundle size has increased!]
|
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Builds ready [4f0d8e9]
⚡ Performance Benchmarks
🌐 Dapp Page Load BenchmarksCurrent Commit: 📄 Localhost MetaMask Test DappSamples: 100 Summary
📈 Detailed Results
Bundle size diffs [🚨 Warning! Bundle size has increased!]
|
app/scripts/background.js
Outdated
| }); | ||
| } | ||
| if (hasVault(backup)) { | ||
| await initBackground(backup); |
There was a problem hiding this comment.
Hm, repairAndReinitialize now calls initBackground(...) again without destroying the existing controller / global listeners first.
That didn't happen in the old corruption flow because recovery only ever happened before background init completed, but this PR now reaches this path from the state-sync timeout after BACKGROUND_INITIALIZED has already been sent. At that point we already have a live controller and process-wide side effects (store subscriptions, webRequest listener, DeepLinkRouter, tab listeners, intervals, etc.), so reinitializing here duplicates a lot of background state machines and listeners after recovery.
We might need a larger refactor of the initialization process in order to make this work, but even then, we might not be able to properly tear down everything we need torn down, since anything could have set up misc globally listeners and streams.
I had an LLM write a test to test for this, here: https://github.com/MetaMask/metamask-extension/pull/41023/files#diff-6d83f689934173d72a111dcf3beb2c1b045a65cdd2f9ebea6a15b85e44850b24R190
It tests that the DeeplinkRouter is doubly subscribed after the restore call, and thus emits too many events.
There was a problem hiding this comment.
You're right, well done for the catch. It's not an easy one to fix :/
I looked at what we'd need to implement/refactor to properly teardown everything that needs to be and the amount of changes looks important. Such a global teardown function would also be hard to test and maintain.
I'm a bit disappointed by this conclusion, but I tend to think we should remove the "Restore accounts" option on the critical error screen for cases where the background is already initialized, i.e. after UI receives BACKGROUND_INITIALIZED message. The only option left in that case would be "Restart MetaMask" or "Contact support".
Does that sound reasonable to you, or do you have any other option in mind?
There was a problem hiding this comment.
The solution you recommended on Slack is now implemented in this secondary PR and merged.
Builds ready [a6135e0]
⚡ Performance Benchmarks
🌐 Dapp Page Load BenchmarksCurrent Commit: 📄 Localhost MetaMask Test DappSamples: 100 Summary
📈 Detailed Results
Bundle size diffs [🚨 Warning! Bundle size has increased!]
|
…itical error restore flow (#41093) ## **Description** This secondary PR will be merged into this [primary PR](#40306) soon. It's just kept as separate for now to facilitate reviews. [](https://codespaces.new/MetaMask/metamask-extension/pull/41093?quickstart=1) ## **Changelog** CHANGELOG entry: null ## **Related issues** See [primary PR](#40306) ## **Manual testing steps** See [primary PR](#40306) ## **Screenshots/Recordings** <img width="1721" height="745" alt="Screenshot 2026-03-20 at 17 22 01" src="https://github.com/user-attachments/assets/880481a2-744c-40a3-aae5-cec2ceb3a78f" /> ## **Pre-merge author checklist** - [ ] I've followed [MetaMask Contributor Docs](https://github.com/MetaMask/contributor-docs) and [MetaMask Extension Coding Standards](https://github.com/MetaMask/metamask-extension/blob/main/.github/guidelines/CODING_GUIDELINES.md). - [ ] I've completed the PR template to the best of my ability - [ ] I’ve included tests if applicable - [ ] I’ve documented my code using [JSDoc](https://jsdoc.app/) format if applicable - [ ] I’ve applied the right labels on the PR (see [labeling guidelines](https://github.com/MetaMask/metamask-extension/blob/main/.github/guidelines/LABELING_GUIDELINES.md)). Not required for external contributors. ## **Pre-merge reviewer checklist** - [ ] I've manually tested the PR (e.g. pull and build branch, run the app, test code being changed). - [ ] I confirm that this PR addresses all acceptance criteria described in the ticket it closes and includes the necessary testing evidence such as recordings and or screenshots. <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Medium risk because it changes the critical startup error restore path and background initialization sequencing, relying on persisted `storage.local` state and tab handoff behavior across `runtime.reload()`. > > **Overview** > Refactors the critical error restore flow to **open a `https://metamask.io/restoring#...` tab, persist restore metadata in `storage.local`, and trigger a `runtime.reload()`-based restart** instead of reinitializing the background in-place. > > On restart, `background.js` now checks for a pending restore, initializes from the IndexedDB backup, sets `FirstTimeFlowType.restore`, and then **hands off the restoring tab to the extension UI** (only if the tab URL still matches, including locale redirects). Tests and e2e flows were updated to match the new behavior, including more robust window/tab handling after reload and optional skipping of multichain sync waits in CI. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit c3d55e6. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY -->
…and rename to tab-handoff Co-locate tab handoff logic with the other critical-error modules (recovery, tracking) for better discoverability. Made-with: Cursor
| }, | ||
| this.addMultichainWalletButton, | ||
| ], | ||
| [this.addMultichainAccountButton, this.addMultichainWalletButton], |
There was a problem hiding this comment.
After a critical-error restore, the multichain account button displays "Syncing..." while background sync is in progress, so matching by both data-testid and text "Add account" causes a timeout in CI where sync takes longer. Removing the text requirement lets checkPageIsLoaded succeed as soon as the button element exists, and the new waitForSync option controls whether to additionally wait for sync to complete.
…airCallback Now that the critical error handler uses runtime.reload() instead of in-process reinit, repairAndReinitialize has only one caller. Inline it to match the original main structure. Made-with: Cursor
…r in repair.ts Both runRepairAndReloadExtension and runRepairAndReloadPorts were thin wrappers. Inline them at their single call sites to reduce indirection and minimize the diff against main. Made-with: Cursor
Builds ready [f00fc58]
⚡ Performance Benchmarks
🌐 Dapp Page Load BenchmarksCurrent Commit: 📄 Localhost MetaMask Test DappSamples: 100 Summary
📈 Detailed Results
Bundle size diffs [🚨 Warning! Bundle size has increased!]
|
…during onboarding The merge from main changed onboarding MetaMetrics flow to use ensureParticipateInMetaMetricsIsUnchecked(), which takes ~10s. This gave the background enough time to write the backup to IndexedDB, causing simulateBackgroundStateSyncHang to block state sync when handleSidepanelPostOnboarding navigated to home.html — even though the hang should only activate after a runtime.reload(). Fix: check backup existence once at startup in startExtensionInitialization and share the result (backupExistedAtStartup) with both simulateBackground*Hang checks, so they only trigger after a restart. Made-with: Cursor
…ame flag - Replace backup?.KeyringController with hasVault(backup) in background.js and persistence-manager.ts for consistency - Compute hasVault once (backupHasVault) and share between the simulateBackground*Hang snapshot and pending-restore check - Rename backupExistedAtStartup → inTestHasVault for consistency with inTestRestoreFlow Made-with: Cursor
…ession Rename types, functions, variables, and comments for consistency: - PendingCriticalErrorRestore → CriticalErrorRestoreSession - readPendingCriticalErrorRestore → readCriticalErrorRestoreSession - clearPendingCriticalErrorRestore → clearCriticalErrorRestoreSession - pendingRestore → restoreInProgress Made-with: Cursor
Made-with: Cursor
|
Builds ready [40192a0]
⚡ Performance Benchmarks
🌐 Dapp Page Load BenchmarksCurrent Commit: 📄 Localhost MetaMask Test DappSamples: 100 Summary
📈 Detailed Results
Bundle size diffs [🚨 Warning! Bundle size has increased!]
|
davidmurdoch
left a comment
There was a problem hiding this comment.
mostly just a review of the background.js changes. i'll review more files later!
| testingFlags?.simulateBackgroundStateSyncHang || | ||
| testingFlags?.simulateBackgroundInitializationHang | ||
| ) { | ||
| await persistenceManager.open(); |
There was a problem hiding this comment.
ah, just noticed there is a potential race condition bug in persistence manager, as multiple callsites could try to open the db at the same time.
might need to fix persistenceManager's open so multiple opens can't be "active" at the same time. Maybe open here should be guarded with a if (this.#openPromise) return await this.#openPromise;, and then getBackup itself could also call this.#open() on its own so we don't have to call it here too.
| } | ||
|
|
||
| try { | ||
| controller.onboardingController.setFirstTimeFlowType( |
| initBackground(backup); | ||
| try { | ||
| await isInitialized; | ||
| } catch (error) { | ||
| log.error('critical-error-restore: initialization failed', error); | ||
| return; | ||
| } |
There was a problem hiding this comment.
why not await initBackground(backup) here instead of calling then awaiting isInitialized? I'm guessing there would be a bug here if you did, related to initBackground never rejecting since it is already wrapped in a try/catch itself?
Perhaps the pre-existing vault corruption repairCallback function already has this bug, as this is almost the same code. Hmm, this is worth discussing I think.
| connectWindowPostMessage(port); | ||
| connectWindowPostMessage( | ||
| port, | ||
| isMetaMaskUIPort ? removeCriticalErrorListeners : undefined, |
There was a problem hiding this comment.
do we need the ternary here? removeCriticalErrorListeners is only defined when isMetaMaskUIPort is true
| } | ||
| } | ||
| if (!process.env.SKIP_BACKGROUND_INITIALIZATION) { | ||
| async function startExtensionInitialization() { |
There was a problem hiding this comment.
startExtensionInitialization calls initBackground which calls initialize.
Wonder if we can either improve the naming here, or just the code organization in a way that we don't need 3 cascading initialization functions. Worth talking about i think.
| return; | ||
| } | ||
|
|
||
| const restoreInProgress = await readCriticalErrorRestoreSession(browser); |
There was a problem hiding this comment.
this introduces a new point of failure, as readCriticalErrorRestoreSession reads from storage.local, which is probably the root of almost all of our problems this PR is working to improve. haha
Perhaps readCriticalErrorRestoreSession should try/catch itself and always swallow all errors (but log them, or course)?
| const backupHasVault = hasVault(backup); | ||
|
|
||
| if ( | ||
| testingFlags?.simulateBackgroundStateSyncHang || | ||
| testingFlags?.simulateBackgroundInitializationHang | ||
| ) { | ||
| inTestHasVault = backupHasVault; | ||
| } |
There was a problem hiding this comment.
The inTest related logic is hard to follow. This was already a confusing pattern before your PR (and I'm to blame, as I added the first instance of inTest ? getManifestFlags().testing : undefined right in this same file!😅), but now that you are storing global state in runtime code specifically for tests, it gets even harder. haha
What if, and this is just an idea off the top of my head -- not a request for changes, we did something like a test helper thing.
import { inTestFlags, inTestState } = "./some-path/in-test-helper.ts";
// then we could do things like:
inTestFlags?.has(["simulateBackgroundStateSyncHang", "simulateBackgroundInitializationHang"]);
// and
inTestState?.get("value");
inTestState?.set("value", something);
inTestState?.remove();
// and maybe in the future if we find a need:
inTestDb?.set/get
// in-test-helper.ts pseudo code
const inTest = process.env.IN_TEST;
export const inTestFlags= inTest ? getTheFlagsMap() : null;
export const inTestState = inTest ? new Map() : null;
export const inTestDb= inTest ? getTheDbImpl() : null;
A potential big downside is that this might not be compiled out completely like it can when everything is defined in a single file. This can be solved in webpack config rules, but it is really gross to do it (and we don't do anything like this yet).
What do you think? Is it worth discussing further?
| } | ||
|
|
||
| if (restoreInProgress) { | ||
| await clearCriticalErrorRestoreSession(browser); |
There was a problem hiding this comment.
this is also sort of another new point of failure. hmmm. it might not be possible to avoid adding one here though, since we really DO need to clear out the old restore session, or we'll send the user into a loop. If we reject when trying to clear the restore session, we'll still them into a loop... but at least we'll have logs telling us where things when wrong.
I'm just mentioning this in case you hadn't yet thought of it, and could think of a better idea than I can.



Description
This change improves handling of UI↔Background initialization timeout errors by distinguishing:
All these errors bring the user critical error screen where UI can now show a "Restore accounts" option when a vault backup exists, that user can click to trigger the existing vault recovery flow.
UI↔Background initialization flow
Sequence diagram (old flow)
The UI used a 2-phase startup check: first liveness (ALIVE), then a single wait for state sync (START_UI_SYNC). Because there was no separate signal for “background initialization done,” any timeout after liveness was reported as a generic “UI initialization timeout”, so we could not tell whether the background was stuck initializing or stuck sending state.
sequenceDiagram participant UI as UI (CriticalStartupErrorHandler) participant Port as Runtime Port participant BG as Background participant MC as MetaMaskController Note over UI,MC: Phase 1 – Liveness (timeout 15s) UI->>Port: connect Port->>BG: onConnect BG->>Port: postMessage(ALIVE) Port->>UI: ALIVE UI->>UI: phase 1 OK → start single “sync” phase Note over UI,MC: Single phase – wait for state sync (timeout 16s) Note over BG,MC: Background does init + state sync with no separate signal to UI BG->>BG: await isInitialized (controllers) BG->>MC: connectWindowPostMessage(port) MC->>Port: START_UI_SYNC + initialState Port->>UI: START_UI_SYNC UI->>UI: OK → normal app load Note over UI,MC: On timeout Note over UI: Phase 1 timeout → "Background connection unresponsive" Note over UI: Phase 2 timeout → "UI initialization timeout" (same for init or state-sync hang) UI->>UI: displayCriticalErrorMessage(...) UI->>UI: Show "MetaMask had trouble starting" (no Restore link)Timeouts (old behavior)
ALIVESTART_UI_SYNCSequence diagram (new flow)
The UI now uses a 3-phase startup check. Each phase waits for a message from the background.
This allows to know what step caused the timeout issue when one occurs.
sequenceDiagram participant UI as UI (CriticalStartupErrorHandler) participant Port as Runtime Port participant BG as Background participant MC as MetaMaskController Note over UI,MC: Phase 1 – Liveness (timeout 15s) UI->>Port: connect Port->>BG: onConnect BG->>Port: postMessage(ALIVE) Port->>UI: ALIVE UI->>UI: phase 1 OK → start Phase 2 Note over UI,MC: Phase 2 – Initialization (timeout 16s) BG->>BG: await isInitialized (controllers) BG->>Port: postMessage(BACKGROUND_INITIALIZED) Port->>UI: BACKGROUND_INITIALIZED UI->>UI: phase 2 OK → start Phase 3 Note over UI,MC: Phase 3 – State sync (timeout 16s) BG->>MC: connectWindowPostMessage(port) MC->>Port: START_UI_SYNC + initialState Port->>UI: START_UI_SYNC UI->>UI: phase 3 OK → normal app load Note over UI,MC: On timeout in any phase UI->>UI: displayCriticalErrorMessage(...) UI->>UI: Show "MetaMask had trouble starting" + Restore link if backup existsTimeouts (new behavior)
ALIVEBACKGROUND_INITIALIZEDSTART_UI_SYNCNote for reviewers: the heart of the logic is located in these 4 files:
app/scripts/background.jsshared/lib/error-utils.jshasBackupis true.ui/helpers/utils/critical-startup-error-handler.tsui/helpers/utils/display-critical-error.tscheckVaultBackupExists()(IndexedDB) to show "Restore accounts" and notifies background when the user clicks restore.Most other files are there for adding tests or metrics.
Changelog
CHANGELOG entry: Adds "Restore accounts" on critical error screen when a backup exists.
Related issues
Fixes: #15629
Manual testing steps
yarn build:test(or use manifest flag to simulate init/state-sync hang).simulateBackgroundInitializationHang).simulateBackgroundStateSyncHang) and confirm the same restore flow.PS: To set remote feature flags, you can add the following in .manifest-overrides.json
and the following line in .metamaskrc:
Screenshots/Recordings
Before
We didn't know what was causing the timeout, and there was no way to trigger the vault recovery flow (i.e. "Restore accounts"):
After
We now differentiate whether timeout occurred during background initialization or while background sends the state to UI (state sync).
Also, we've added "Restore accounts" amongst options user can trigger.
Timeout during background initialization:

Timeout while background sends the state to UI:

Pre-merge author checklist
Pre-merge reviewer checklist
Note
Medium Risk
Touches UI↔background initialization, error handling, and recovery/repair flows; mistakes could block startup or trigger unintended reload/restore behavior, though changes are guarded and well-tested.
Overview
Improves critical startup timeout handling by splitting UI startup checks into three phases (background liveness, background initialized, and state sync) with distinct timeout errors, powered by a new
BACKGROUND_INITIALIZED_METHODsignal from the background.Adds a restore-from-backup path on the critical error screen: the UI now detects whether an IndexedDB vault backup exists, shows a "Restore accounts" link when it does, and can trigger a new timeout-specific repair message (
METHOD_REPAIR_DATABASE_TIMEOUT) back to the background.Refactors repair/backup utilities into shared helpers (
shared/lib/backup,app/scripts/lib/repair) and wires a newCriticalErrorHandlerin the background to coordinate repair + reload across all UI ports, including early Segment metrics (CriticalErrorScreenViewed,CriticalErrorRestoreWalletButtonPressed). Includes new unit tests and E2E coverage for init/state-sync timeout restore flows, plus i18n support for a reorderable critical error footer.Written by Cursor Bugbot for commit ca022d3. This will update automatically on new commits. Configure here.