fix(it): stop flaky integration tests (#27649) #27651

mohityadav766 wants to merge 7 commits into `main`
Conversation
GlossaryOntologyExportIT: mark `@Isolated`. `@BeforeAll` flips `RdfUpdater` (a JVM-wide singleton) on, which makes every concurrent test class start doing synchronous Fuseki writes on entity create, saturating the Dropwizard thread pool and causing 60s request timeouts. `@Execution(SAME_THREAD)` alone only serialises within this class.

WorkflowDefinitionResourceIT#triggerWorkflow_SDK: drop the redundant `waitForWorkflowDeployment` call; the create path already waits. Add descriptive aliases to the two `await()` polls so the next flake tells us which FQN or workflow name actually timed out instead of an anonymous lambda.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Fixes flaky backend integration tests by preventing cross-test interference from JVM-wide RDF toggling and improving Awaitility timeout diagnostics in workflow tests.
Changes:
- Added JUnit `@Isolated` to `GlossaryOntologyExportIT` to prevent concurrent test classes from inheriting the global `RdfUpdater` configuration.
- Removed a redundant workflow deployment wait in `triggerWorkflow_SDK` (deployment is already awaited during workflow creation).
- Added descriptive aliases to Awaitility `await(...)` calls to make future timeouts actionable.
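The value of a named poll can be sketched without Awaitility itself. The following is a hypothetical plain-Java stand-in (not the Awaitility API): the alias travels with the timeout, so a flake report names the condition that died instead of an anonymous lambda.

```java
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for a named Awaitility condition: the alias is carried
// into the timeout message so a CI failure identifies the failing poll.
public class NamedAwait {
    static void await(String alias, long timeoutMs, BooleanSupplier condition)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return; // condition met before the deadline
            }
            Thread.sleep(5);
        }
        throw new IllegalStateException(
                "Condition '" + alias + "' was not fulfilled within " + timeoutMs + " ms");
    }

    public static void main(String[] args) throws InterruptedException {
        // A condition that is immediately true succeeds quietly.
        await("table indexed in search", 100, () -> true);

        // A condition that never becomes true fails with the alias in the message.
        String message = "";
        try {
            await("workflow deployed", 50, () -> false);
        } catch (IllegalStateException e) {
            message = e.getMessage();
        }
        System.out.println(message);
    }
}
```

A timeout now reads "Condition 'workflow deployed' was not fulfilled…" rather than naming a synthetic lambda class, which is the whole point of the alias change.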
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/WorkflowDefinitionResourceIT.java | Removes redundant wait and adds Awaitility aliases for clearer timeout failures. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/GlossaryOntologyExportIT.java | Isolates RDF export tests to avoid JVM-wide singleton leakage across parallel integration tests. |
Live search indexing was silently skipped whenever a reindex job was in RUNNING/READY/STOPPING state. `SearchRepository.createEntityIndex()` and six sibling methods consulted `SearchIndexRetryQueue.isEntityTypeSuspended()` and returned early with nothing written and nothing enqueued; entities vanished from search until a future reindex happened to cover them. The retry worker doubled down: when the scope refresh observed an active job, it purged the retry queue, and `processRecord()` deleted records whose type was suspended. So even manually enqueued retries were wiped.

This is how the #27649 flake surfaced: AppsResourceIT triggers SearchIndexingApplication runs and its best-effort 30s wait silently swallows timeouts. If a run was still RUNNING when AppsResourceIT finished, the next class in the sequential fork (WorkflowDefinitionResourceIT) inherited the suspension and its freshly-created tables were never indexed; `waitForEntityIndexedInSearch` then timed out at 120s. The same mechanism bites real users mid-reindex in production.

Remove the suspension mechanism entirely:

* SearchRepository: drop the 8 `isEntityTypeSuspended()` early-returns; the client-availability path already enqueues for retry on its own.
* SearchIndexRetryWorker: drop `refreshReindexSuspensionScopeIfNeeded()` and the suspension branches in `processRecord()`; remove the retry-queue purge on suspendAll.
* SearchIndexRetryQueue: delete the `updateSuspension` / `clearSuspension` / `isEntityTypeSuspended` / `isStreamingSuspended` / `isSuspendAllStreaming` / `getSuspendedEntityTypes` API and the static AtomicBoolean / AtomicReference they backed.
* Drop the two IT cases that asserted the removed behaviour.

Live writes now always reach the search client; reindex and live writes both target the same indices as before.
Version conflicts between the two paths (stale reindex batch overwriting a newer live write) remain possible as they did before suspension was introduced — that is the race suspension was meant to dodge, but dropping writes altogether was worse than the race. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
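The shape of the change can be sketched in plain Java with hypothetical names (the real write paths live in SearchRepository and the retry queue): the suspension early-return is deleted, and only an actual client failure lands a record in the retry queue instead of the write being dropped.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, not the real SearchRepository API. Before the fix, the
// write path checked a suspension flag and returned early (nothing written,
// nothing enqueued). After the fix the write always runs; a client failure
// enqueues the record for retry instead of silently dropping it.
public class WritePathSketch {
    static boolean clientAvailable = true;
    static final Deque<String> retryQueue = new ArrayDeque<>();
    static final Deque<String> index = new ArrayDeque<>();

    static void createEntityIndex(String entity) {
        // Removed: if (isEntityTypeSuspended(entityType)) return;
        if (!clientAvailable) {
            retryQueue.add(entity); // enqueue for retry rather than dropping
            return;
        }
        index.add(entity); // live write always reaches the search client
    }

    public static void main(String[] args) {
        createEntityIndex("table1_gold");
        clientAvailable = false;
        createEntityIndex("table2_silver");
        System.out.println("indexed=" + index + " retried=" + retryQueue);
    }
}
```

Under this model an entity can no longer vanish: it is either written immediately or sitting in the retry queue.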
🟡 Playwright Results: all passed (20 flaky)
✅ 3691 passed · ❌ 0 failed · 🟡 20 flaky · ⏭️ 89 skipped
🟡 20 flaky test(s) (passed on retry)
How to debug locally:

```shell
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip  # view trace
```
The distributed reindex has a TOCTOU: partitions read from a DB snapshot at T0 and write to a staged index, then at T1 (seconds later) the alias is atomically swapped from the old index to the staged one and the old index is deleted. Any entity that live-writers create between T0 and T1 goes via the alias to the old index, and is destroyed when that old index is deleted post-swap. The CI log for #27649 shows this directly:

```
10:13:35  staged table_search_index_rebuild_…_215646 built from snapshot
10:13:40  POST /v1/tables table1_gold → written to alias target (old index _179670)
10:13:40  table2_silver, table3_bronze, table4_brass all written to old index _179670
10:13:42  Atomically swapped aliases from [_179670] to _215646
10:13:42  Successfully deleted index _179670
10:13:43+ waitForEntityIndexedInSearch polls, finds nothing, times out at 2 min
```

Removing the silent-skip suspension mechanism in the previous commit exposed this race (it had been hidden by dropping the writes outright, which was strictly worse). Route live writes to the staged index during the reindex window:

* SearchRepository gains an `activeStagedIndices` map (entityType → stagedIndex) plus register/unregister/`resolveWriteIndex`. Writes resolve to the staged index when one is registered for the type, otherwise to the canonical alias, which is the existing behaviour.
* `DefaultRecreateHandler.recreateIndexFromMapping` registers the staged index as soon as it is created; `finalizeReindex` and `promoteEntityIndex` unregister it on every exit path (successful swap, swap failure, failed-reindex delete, exception).
* Every live-write path in SearchRepository (createEntityIndex, createEntitiesIndex, indexTableColumns, indexColumnsForTables, updateEntityIndex, createTimeSeriesEntity, updateTimeSeriesEntity, deleteEntityIndex, deleteEntityByFQNPrefix, deleteTimeSeriesEntityById) goes through `resolveWriteIndex` instead of reading the canonical alias directly.
During a reindex, live writes land in the index that the alias will promote to; after the swap the alias points to that same index and subsequent writes continue to reach the same place. Old-index deletion no longer discards fresh data. Note: searches through the alias during the brief reindex window (< seconds in the CI log) can miss a write until the swap lands — an acceptable trade compared to silently dropping the write or losing it on deletion. The #27649 test tolerates this because its 120s poll spans many swap cycles. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
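The routing rule itself is small. This is a minimal plain-Java sketch under assumed names (the real map lives on SearchRepository and the register/unregister calls come from the reindex handler): writes resolve to the staged index only while one is registered for the entity type.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of staged-index write routing. While a reindex is
// building a staged index for an entity type, live writes target that staged
// index; all other types, and all writes outside the window, use the alias.
public class StagedIndexRouter {
    private final Map<String, String> activeStagedIndices = new ConcurrentHashMap<>();

    void registerStagedIndex(String entityType, String stagedIndex) {
        activeStagedIndices.put(entityType, stagedIndex);
    }

    void unregisterStagedIndex(String entityType) {
        activeStagedIndices.remove(entityType); // called on every reindex exit path
    }

    String resolveWriteIndex(String entityType, String canonicalAlias) {
        return activeStagedIndices.getOrDefault(entityType, canonicalAlias);
    }

    public static void main(String[] args) {
        StagedIndexRouter router = new StagedIndexRouter();
        String alias = "table_search_index";

        // Outside the reindex window: writes go through the canonical alias.
        System.out.println(router.resolveWriteIndex("table", alias));

        // During the window: writes land in the staged index, so they survive
        // the alias swap and the deletion of the old index.
        router.registerStagedIndex("table", "table_search_index_rebuild_215646");
        System.out.println(router.resolveWriteIndex("table", alias));

        // After the swap: back to the alias, now pointing at the same index.
        router.unregisterStagedIndex("table");
        System.out.println(router.resolveWriteIndex("table", alias));
    }
}
```

Because unregistration runs on every exit path, a failed or aborted reindex falls back to the canonical alias rather than stranding writes in a deleted staged index.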
The Java checkstyle check failed.
Code Review ✅ Approved (Gitar): Eliminates flaky integration tests by refining test synchronization, ensuring reliable execution across the suite. No issues found.
Summary
Fixes #27649: two backend integration tests that time out under concurrent load.

GlossaryOntologyExportIT: add `@Isolated`

`RdfUpdater` is a JVM-wide singleton. This class's `@BeforeAll` flips it on (and `@AfterAll` flips it off). While it's on, every test class running concurrently starts doing synchronous Fuseki writes on every entity create, which saturates the Dropwizard thread pool and turns the 60s HTTP timeout into the observed `request timed out` at `GlossaryOntologyExportIT.java:142`.

PR #27172 already moved this class to `@Execution(SAME_THREAD)` and bumped the timeout to 60s, but `SAME_THREAD` only serialises within the class; the parallel failsafe execution still runs the rest of the `*IT` classes alongside it. `RdfResourceIT` already uses `@Isolated` for the same reason; this brings us in line.

WorkflowDefinitionResourceIT#triggerWorkflow_SDK: drop redundant wait + diagnostic aliases

The test at line 2086 was timing out with the generic "Condition org.openmetadata.it.tests.WorkflowDefinitionResourceIT$$Lambda/... was not fulfilled within 2 minutes", with no way to tell which of the two `atMost(120s)` polls died.

- Drop `waitForWorkflowDeployment(client, workflowName)` from `triggerWorkflow_SDK`; `createDataCompletenessWorkflow_SDK` already calls it during creation.
- Add a descriptive alias to each `await(...)` so the next failure, if any, names the workflow or the FQN that didn't appear in the search index.

Test plan

- `mvn verify -Dit.test=GlossaryOntologyExportIT#testExportGlossaryAsRdfXml` passed (3:37)
- `mvn verify -Dit.test=GlossaryOntologyExportIT#testExportGlossaryAsRdfXml+testExportGlossaryAsTurtle` with `@Isolated` applied: 2/2 passed
- `mvn verify -Dit.test=WorkflowDefinitionResourceIT#test_DataCompletenessWorkflow_SDK` passed
- `mvn test-compile` clean

🤖 Generated with Claude Code
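The JVM-wide toggle hazard described above can be illustrated in plain Java. `RdfUpdaterSketch` below is a hypothetical stand-in for the real RdfUpdater singleton: once one test class flips the flag, every other class in the same JVM takes the slow synchronous-write path, regardless of how that class serialises its own tests.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the hazard: a static, JVM-wide flag consulted on
// every entity create. @Execution(SAME_THREAD) on the toggling class cannot
// help, because the flag leaks to classes running in parallel forks of the
// same JVM; only full isolation (JUnit's @Isolated) fences it off.
public class RdfUpdaterSketch {
    static final AtomicBoolean rdfEnabled = new AtomicBoolean(false);

    static String onEntityCreate() {
        // Every entity create in the whole JVM consults the same flag.
        return rdfEnabled.get() ? "synchronous Fuseki write" : "fast path";
    }

    public static void main(String[] args) {
        // "Class B" creating entities before the toggle: fast path.
        System.out.println(onEntityCreate());

        // "Class A" (the ontology-export test) flips the singleton on...
        rdfEnabled.set(true);

        // ...and "Class B", still running concurrently, now pays for it.
        System.out.println(onEntityCreate());
    }
}
```

This is why serialising within the class is insufficient: the state is per-JVM, not per-class, so the fix has to exclude all other classes for the duration.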
Summary by Gitar
- `activeStagedIndices` map in `SearchRepository` to track indexes during reindexing operations.
- `resolveWriteIndex` to dynamically route live entity writes to staged indexes, preventing data loss during alias swaps.
- `registerStagedIndex` and `unregisterStagedIndex` to manage staged index lifecycle in `DefaultRecreateHandler`.

This will update automatically on new commits.