fix(it): stop flaky integration tests (#27649) #27651

mohityadav766 wants to merge 7 commits into `main`
Conversation
GlossaryOntologyExportIT: mark `@Isolated`. `@BeforeAll` flips `RdfUpdater` (a JVM-wide singleton) on, which makes every concurrent test class start doing synchronous Fuseki writes on entity create, saturating the Dropwizard thread pool and causing 60s request timeouts. `@Execution(SAME_THREAD)` alone only serialises within this class.

WorkflowDefinitionResourceIT#triggerWorkflow_SDK: drop the redundant `waitForWorkflowDeployment` call; the create path already waits. Add descriptive aliases to the two `await()` polls so the next flake tells us which FQN or workflow name actually timed out instead of an anonymous lambda.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Fixes flaky backend integration tests by preventing cross-test interference from JVM-wide RDF toggling and improving Awaitility timeout diagnostics in workflow tests.
Changes:
- Added JUnit `@Isolated` to `GlossaryOntologyExportIT` to prevent concurrent test classes from inheriting the global `RdfUpdater` configuration.
- Removed a redundant workflow deployment wait in `triggerWorkflow_SDK` (deployment is already awaited during workflow creation).
- Added descriptive aliases to Awaitility `await(...)` calls to make future timeouts actionable.
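The value of a named poll can be sketched without Awaitility itself. The following is a hypothetical plain-Java stand-in (not the Awaitility API): the alias travels with the timeout, so a flake report names the condition that died instead of an anonymous lambda.

```java
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for a named Awaitility condition: the alias is carried
// into the timeout message so a CI failure identifies the failing poll.
public class NamedAwait {
    static void await(String alias, long timeoutMs, BooleanSupplier condition)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return; // condition met before the deadline
            }
            Thread.sleep(5);
        }
        throw new IllegalStateException(
                "Condition '" + alias + "' was not fulfilled within " + timeoutMs + " ms");
    }

    public static void main(String[] args) throws InterruptedException {
        // A condition that is immediately true succeeds quietly.
        await("table indexed in search", 100, () -> true);

        // A condition that never becomes true fails with the alias in the message.
        String message = "";
        try {
            await("workflow deployed", 50, () -> false);
        } catch (IllegalStateException e) {
            message = e.getMessage();
        }
        System.out.println(message);
    }
}
```

A timeout now reads "Condition 'workflow deployed' was not fulfilled…" rather than naming a synthetic lambda class, which is the whole point of the alias change.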
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/WorkflowDefinitionResourceIT.java | Removes redundant wait and adds Awaitility aliases for clearer timeout failures. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/GlossaryOntologyExportIT.java | Isolates RDF export tests to avoid JVM-wide singleton leakage across parallel integration tests. |
Live search indexing was silently skipped whenever a reindex job was in RUNNING/READY/STOPPING state. `SearchRepository.createEntityIndex()` and six sibling methods consulted `SearchIndexRetryQueue.isEntityTypeSuspended()` and returned early with nothing written and nothing enqueued; entities vanished from search until a future reindex happened to cover them. The retry worker doubled down: when the scope refresh observed an active job, it purged the retry queue, and `processRecord()` deleted records whose type was suspended. So even manually enqueued retries were wiped.

This is how the #27649 flake surfaced: AppsResourceIT triggers SearchIndexingApplication runs and its best-effort 30s wait silently swallows timeouts. If a run was still RUNNING when AppsResourceIT finished, the next class in the sequential fork (WorkflowDefinitionResourceIT) inherited the suspension and its freshly-created tables were never indexed; `waitForEntityIndexedInSearch` then timed out at 120s. The same mechanism bites real users mid-reindex in production.

Remove the suspension mechanism entirely:

* SearchRepository: drop the 8 `isEntityTypeSuspended()` early-returns; the client-availability path already enqueues for retry on its own.
* SearchIndexRetryWorker: drop `refreshReindexSuspensionScopeIfNeeded()` and the suspension branches in `processRecord()`; remove the retry-queue purge on suspendAll.
* SearchIndexRetryQueue: delete the `updateSuspension` / `clearSuspension` / `isEntityTypeSuspended` / `isStreamingSuspended` / `isSuspendAllStreaming` / `getSuspendedEntityTypes` API and the static AtomicBoolean / AtomicReference they backed.
* Drop the two IT cases that asserted the removed behaviour.

Live writes now always reach the search client; reindex and live writes both target the same indices as before.
Version conflicts between the two paths (stale reindex batch overwriting a newer live write) remain possible as they did before suspension was introduced — that is the race suspension was meant to dodge, but dropping writes altogether was worse than the race. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
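The shape of the change can be sketched in plain Java with hypothetical names (the real write paths live in SearchRepository and the retry queue): the suspension early-return is deleted, and only an actual client failure lands a record in the retry queue instead of the write being dropped.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, not the real SearchRepository API. Before the fix, the
// write path checked a suspension flag and returned early (nothing written,
// nothing enqueued). After the fix the write always runs; a client failure
// enqueues the record for retry instead of silently dropping it.
public class WritePathSketch {
    static boolean clientAvailable = true;
    static final Deque<String> retryQueue = new ArrayDeque<>();
    static final Deque<String> index = new ArrayDeque<>();

    static void createEntityIndex(String entity) {
        // Removed: if (isEntityTypeSuspended(entityType)) return;
        if (!clientAvailable) {
            retryQueue.add(entity); // enqueue for retry rather than dropping
            return;
        }
        index.add(entity); // live write always reaches the search client
    }

    public static void main(String[] args) {
        createEntityIndex("table1_gold");
        clientAvailable = false;
        createEntityIndex("table2_silver");
        System.out.println("indexed=" + index + " retried=" + retryQueue);
    }
}
```

Under this model an entity can no longer vanish: it is either written immediately or sitting in the retry queue.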
🟡 Playwright Results: all passed (20 flaky)
✅ 3691 passed · ❌ 0 failed · 🟡 20 flaky · ⏭️ 89 skipped
🟡 20 flaky test(s) (passed on retry)
How to debug locally:

```shell
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip  # view trace
```
The distributed reindex has a TOCTOU: partitions read from a DB snapshot at T0 and write to a staged index, then at T1 (seconds later) the alias is atomically swapped from the old index to the staged one and the old index is deleted. Any entity that live-writers create between T0 and T1 goes via the alias to the old index, and is destroyed when that old index is deleted post-swap. The CI log for #27649 shows this directly:

```
10:13:35  staged table_search_index_rebuild_…_215646 built from snapshot
10:13:40  POST /v1/tables table1_gold → written to alias target (old index _179670)
10:13:40  table2_silver, table3_bronze, table4_brass all written to old index _179670
10:13:42  Atomically swapped aliases from [_179670] to _215646
10:13:42  Successfully deleted index _179670
10:13:43+ waitForEntityIndexedInSearch polls, finds nothing, times out at 2 min
```

Removing the silent-skip suspension mechanism in the previous commit exposed this race (it had been hidden by dropping the writes outright, which was strictly worse). Route live writes to the staged index during the reindex window:

* SearchRepository gains an `activeStagedIndices` map (entityType → stagedIndex) plus register/unregister/`resolveWriteIndex`. Writes resolve to the staged index when one is registered for the type, otherwise to the canonical alias, which is the existing behaviour.
* `DefaultRecreateHandler.recreateIndexFromMapping` registers the staged index as soon as it is created; `finalizeReindex` and `promoteEntityIndex` unregister it on every exit path (successful swap, swap failure, failed-reindex delete, exception).
* Every live-write path in SearchRepository (createEntityIndex, createEntitiesIndex, indexTableColumns, indexColumnsForTables, updateEntityIndex, createTimeSeriesEntity, updateTimeSeriesEntity, deleteEntityIndex, deleteEntityByFQNPrefix, deleteTimeSeriesEntityById) goes through `resolveWriteIndex` instead of reading the canonical alias directly.
During a reindex, live writes land in the index that the alias will promote to; after the swap the alias points to that same index and subsequent writes continue to reach the same place. Old-index deletion no longer discards fresh data. Note: searches through the alias during the brief reindex window (< seconds in the CI log) can miss a write until the swap lands — an acceptable trade compared to silently dropping the write or losing it on deletion. The #27649 test tolerates this because its 120s poll spans many swap cycles. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
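The routing rule itself is small. This is a minimal plain-Java sketch under assumed names (the real map lives on SearchRepository and the register/unregister calls come from the reindex handler): writes resolve to the staged index only while one is registered for the entity type.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of staged-index write routing. While a reindex is
// building a staged index for an entity type, live writes target that staged
// index; all other types, and all writes outside the window, use the alias.
public class StagedIndexRouter {
    private final Map<String, String> activeStagedIndices = new ConcurrentHashMap<>();

    void registerStagedIndex(String entityType, String stagedIndex) {
        activeStagedIndices.put(entityType, stagedIndex);
    }

    void unregisterStagedIndex(String entityType) {
        activeStagedIndices.remove(entityType); // called on every reindex exit path
    }

    String resolveWriteIndex(String entityType, String canonicalAlias) {
        return activeStagedIndices.getOrDefault(entityType, canonicalAlias);
    }

    public static void main(String[] args) {
        StagedIndexRouter router = new StagedIndexRouter();
        String alias = "table_search_index";

        // Outside the reindex window: writes go through the canonical alias.
        System.out.println(router.resolveWriteIndex("table", alias));

        // During the window: writes land in the staged index, so they survive
        // the alias swap and the deletion of the old index.
        router.registerStagedIndex("table", "table_search_index_rebuild_215646");
        System.out.println(router.resolveWriteIndex("table", alias));

        // After the swap: back to the alias, now pointing at the same index.
        router.unregisterStagedIndex("table");
        System.out.println(router.resolveWriteIndex("table", alias));
    }
}
```

Because unregistration runs on every exit path, a failed or aborted reindex falls back to the canonical alias rather than stranding writes in a deleted staged index.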
The Java checkstyle check failed.
Code Review ✅ Approved (Gitar): Eliminates flaky integration tests by refining test synchronization, ensuring reliable execution across the suite. No issues found.
Summary
Fixes #27649: two backend integration tests that time out under concurrent load.

GlossaryOntologyExportIT: add `@Isolated`

`RdfUpdater` is a JVM-wide singleton. This class's `@BeforeAll` flips it on (and `@AfterAll` flips it off). While it's on, every test class running concurrently starts doing synchronous Fuseki writes on every entity create, which saturates the Dropwizard thread pool and turns the 60s HTTP timeout into the observed `request timed out` at `GlossaryOntologyExportIT.java:142`.

PR #27172 already moved this class to `@Execution(SAME_THREAD)` and bumped the timeout to 60s, but `SAME_THREAD` only serialises within the class; the parallel failsafe execution still runs the rest of the `*IT` classes alongside it. `RdfResourceIT` already uses `@Isolated` for the same reason; this brings us in line.

WorkflowDefinitionResourceIT#triggerWorkflow_SDK: drop redundant wait + diagnostic aliases

The test at line 2086 was timing out with the generic "Condition org.openmetadata.it.tests.WorkflowDefinitionResourceIT$$Lambda/... was not fulfilled within 2 minutes", with no way to tell which of the two `atMost(120s)` polls died.

- Drop `waitForWorkflowDeployment(client, workflowName)` from `triggerWorkflow_SDK`; `createDataCompletenessWorkflow_SDK` already calls it during creation.
- Add a descriptive alias to each `await(...)` so the next failure, if any, names the workflow or the FQN that didn't appear in the search index.

Test plan

- `mvn verify -Dit.test=GlossaryOntologyExportIT#testExportGlossaryAsRdfXml` passed (3:37)
- `mvn verify -Dit.test=GlossaryOntologyExportIT#testExportGlossaryAsRdfXml+testExportGlossaryAsTurtle` with `@Isolated` applied: 2/2 passed
- `mvn verify -Dit.test=WorkflowDefinitionResourceIT#test_DataCompletenessWorkflow_SDK` passed
- `mvn test-compile` clean

🤖 Generated with Claude Code
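The JVM-wide toggle hazard described above can be illustrated in plain Java. `RdfUpdaterSketch` below is a hypothetical stand-in for the real RdfUpdater singleton: once one test class flips the flag, every other class in the same JVM takes the slow synchronous-write path, regardless of how that class serialises its own tests.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the hazard: a static, JVM-wide flag consulted on
// every entity create. @Execution(SAME_THREAD) on the toggling class cannot
// help, because the flag leaks to classes running in parallel forks of the
// same JVM; only full isolation (JUnit's @Isolated) fences it off.
public class RdfUpdaterSketch {
    static final AtomicBoolean rdfEnabled = new AtomicBoolean(false);

    static String onEntityCreate() {
        // Every entity create in the whole JVM consults the same flag.
        return rdfEnabled.get() ? "synchronous Fuseki write" : "fast path";
    }

    public static void main(String[] args) {
        // "Class B" creating entities before the toggle: fast path.
        System.out.println(onEntityCreate());

        // "Class A" (the ontology-export test) flips the singleton on...
        rdfEnabled.set(true);

        // ...and "Class B", still running concurrently, now pays for it.
        System.out.println(onEntityCreate());
    }
}
```

This is why serialising within the class is insufficient: the state is per-JVM, not per-class, so the fix has to exclude all other classes for the duration.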
Summary by Gitar
- `activeStagedIndices` map in `SearchRepository` to track indexes during reindexing operations.
- `resolveWriteIndex` to dynamically route live entity writes to staged indexes, preventing data loss during alias swaps.
- `registerStagedIndex` and `unregisterStagedIndex` to manage staged index lifecycle in `DefaultRecreateHandler`.

This will update automatically on new commits.