
fix(rdf): ensure Fuseki dataset before indexing and report failures correctly #27630

Open

pmbrull wants to merge 3 commits into main from fix/rdf-app-ensure-dataset-and-error-reporting

Conversation

pmbrull (Collaborator) commented Apr 22, 2026

Summary

Two related reliability fixes for the RDF indexing app, discovered while debugging a customer instance where the RDF app logged 404/405 errors for every entity yet reported them all as successes.

Issue 1 — silent data loss when the Fuseki dataset disappears. JenaFusekiStorage.ensureDatasetExists() runs only in the constructor (OM server startup). If the Fuseki dataset is removed later — e.g. Fuseki container restart without pod restart when FUSEKI_BASE is on an ephemeral filesystem, admin-API deletion by another process — the RDF connection silently keeps issuing SPARQL calls that come back 404 (query) / 405 (update). The indexing app keeps running and logs thousands of errors, but never recovers.

Issue 2 — false success counts. RdfRepository.createOrUpdate() caught all exceptions and only logged them. RdfBatchProcessor.processEntities() wraps the call in try { ... successCount++; } catch { failedCount++; }, but because createOrUpdate never threw, every failure was counted as a success. The app reported failedRecords: 0 alongside thousands of server-log errors.
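A minimal, self-contained sketch of the counting bug (names here are illustrative stand-ins, not the real OpenMetadata classes): the batch loop can only increment failedCount if createOrUpdate actually throws, so swallowing the exception turns every failure into a "success".

```java
import java.util.List;

public class BatchCountSketch {

  // Simulated storage call that fails for some entities.
  static void store(String entity) {
    if (entity.startsWith("bad")) {
      throw new RuntimeException("404 from Fuseki");
    }
  }

  // Before the fix: log and swallow, so the caller never sees the failure.
  static void createOrUpdateSwallowing(String entity) {
    try {
      store(entity);
    } catch (RuntimeException e) {
      System.err.println("Failed to store " + entity + ": " + e.getMessage());
    }
  }

  // After the fix: log, then rethrow so the caller can count the failure.
  static void createOrUpdateRethrowing(String entity) {
    try {
      store(entity);
    } catch (RuntimeException e) {
      System.err.println("Failed to store " + entity + ": " + e.getMessage());
      throw e;
    }
  }

  // Mirrors the batch loop: try { ...; success++ } catch { failed++ }.
  static int[] process(List<String> entities, boolean rethrow) {
    int success = 0;
    int failed = 0;
    for (String entity : entities) {
      try {
        if (rethrow) {
          createOrUpdateRethrowing(entity);
        } else {
          createOrUpdateSwallowing(entity);
        }
        success++;
      } catch (RuntimeException e) {
        failed++;
      }
    }
    return new int[] {success, failed};
  }

  public static void main(String[] args) {
    List<String> entities = List.of("good1", "bad1", "good2");
    int[] before = process(entities, false); // {3, 0}: failure hidden
    int[] after = process(entities, true);   // {2, 1}: failure counted
    System.out.println("swallow: " + before[0] + "/" + before[1]);
    System.out.println("rethrow: " + after[0] + "/" + after[1]);
  }
}
```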

Changes

  • RdfStorageInterface: new default void ensureStorageReady() (no-op for non-Fuseki implementations).
  • JenaFusekiStorage: stores endpoint/username/password as fields; implements ensureStorageReady() by first running a SPARQL ASK (testConnection()) — this works even when the Fuseki admin API is disabled — then, on failure, attempting ensureDatasetExists(...) and re-verifying. Throws IllegalStateException with a clear message pointing at endpoint/credentials/permissions if the dataset still isn't reachable. Also reloads the ontology when we had to (re)create the dataset.
  • RdfRepository: new ensureStorageReady() that delegates to the storage. createOrUpdate() now rethrows after logging so callers can count failures.
  • RdfIndexApp.execute(): calls rdfRepository.ensureStorageReady() right after the isEnabled() check; on failure, marks the job FAILED with a clear error message and returns, instead of running to completion with partial data.
  • DistributedRdfIndexExecutor.joinJob(): same call on the participant path, so distributed participants also recover between runs (and fail loudly if they can't).

RdfUpdater (the REST-API-path caller) already wraps createOrUpdate in try/catch, so user-facing API calls are not affected by the rethrow change.
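The readiness flow the bullets describe can be sketched as follows. This is a hedged sketch: the Probe interface and its methods are stand-ins for the real Fuseki calls (testConnection / ensureDatasetExists) described above, not the actual OpenMetadata API.

```java
public class ReadinessSketch {

  interface Probe {
    boolean testConnection();   // SPARQL ASK; works even without the admin API
    void ensureDatasetExists(); // best-effort (re)creation via the admin API
  }

  // Returns true when the dataset had to be (re)created, signalling that the
  // caller should reload the ontology; throws when the dataset is still
  // unreachable after the creation attempt.
  static boolean ensureStorageReady(Probe probe, String endpoint) {
    if (probe.testConnection()) {
      return false; // dataset reachable; nothing recreated
    }
    probe.ensureDatasetExists(); // try to (re)create, then re-verify
    if (!probe.testConnection()) {
      throw new IllegalStateException(
          "RDF storage is not ready: Fuseki dataset at " + endpoint
              + " is unreachable; check endpoint, credentials, and permissions");
    }
    return true;
  }

  // A probe whose dataset "appears" once ensureDatasetExists is called,
  // simulating recovery from an ephemeral-FUSEKI_BASE restart.
  static class RecoveringProbe implements Probe {
    boolean datasetExists = false;

    public boolean testConnection() {
      return datasetExists;
    }

    public void ensureDatasetExists() {
      datasetExists = true;
    }
  }

  public static void main(String[] args) {
    boolean recreated = ensureStorageReady(new RecoveringProbe(), "http://fuseki:3030/ds");
    System.out.println("recreated: " + recreated); // true -> reload ontology
  }
}
```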

Not fixed here

This is app-level idempotency; it does not fix the underlying Fuseki persistence issue that made the dataset disappear in the first place (FUSEKI_BASE/admin-created datasets need to be on a PVC, and the Fuseki container should not be OOMKilled). Those are infra-side fixes that belong outside this PR.

Test plan

  • RDF indexing with a missing dataset on startup: previously the app left the knowledge graph partially populated until the dataset reappeared, then silently broke again on the next restart. With this change, either (a) the dataset is auto-recreated and indexing proceeds, or (b) the job fails with an RDF storage is not ready: … message surfaced in the app run record, instead of silently reporting success.
  • Pre-existing RDF integration tests in openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/RdfResourceIT.java continue to pass (default ensureStorageReady() no-op doesn't affect in-memory/test-mode paths; Fuseki path only throws on genuine unreachability).
  • With a forced failure injected in storeEntity, the batch processor now correctly reports failedRecords > 0 in the app run stats instead of 0.

fix(rdf): ensure Fuseki dataset before indexing and report failures correctly

If the Fuseki dataset disappears after server startup (container restart
without pod restart, ephemeral FUSEKI_BASE, external admin deletion),
the RDF connection silently keeps issuing requests that come back as
404/405, producing partial knowledge graphs. Also, RdfRepository.createOrUpdate
swallowed exceptions, so RdfBatchProcessor counted every failure as a success.

- Add RdfStorageInterface.ensureStorageReady() with a default no-op; Fuseki
  implementation first tests the connection with an ASK query (works even
  when the admin API is disabled), attempts dataset creation if missing,
  verifies again, and throws a clear IllegalStateException otherwise.
- RdfIndexApp.execute() and DistributedRdfIndexExecutor.joinJob() call it
  before processing, so both the coordinator and participant paths recover
  transparently from a Fuseki dataset loss between runs.
- RdfRepository.createOrUpdate() now rethrows after logging, so the batch
  processor correctly increments failedCount instead of reporting false
  successes. Entity-hook callers (RdfUpdater) already wrap in try/catch
  so API-path behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 22, 2026 13:19
github-actions Bot added the Ingestion and safe to test (run secure Github workflows on PRs) labels Apr 22, 2026
github-actions Bot (Contributor) commented:

The Java checkstyle failed.

Please run mvn spotless:apply in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Java code formatting.

You can install the pre-commit hooks with make install_test precommit_install.


Copilot AI left a comment


Pull request overview

Improves reliability of the RDF indexing app by proactively verifying RDF storage readiness (especially for Fuseki datasets that can disappear after startup) and by fixing failure reporting so indexing errors aren’t incorrectly counted as successes.

Changes:

  • Add a new ensureStorageReady() hook to the RDF storage interface and implement it for Fuseki to verify/recreate datasets and reload ontology when needed.
  • Make RdfRepository.createOrUpdate() rethrow on failures so batch indexing can correctly count failures.
  • Fail RDF indexing jobs early (both coordinator and distributed participant paths) when RDF storage is not ready, surfacing a clear failure message in job records.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/storage/RdfStorageInterface.java: Adds the ensureStorageReady() default no-op contract for readiness checks.
  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/storage/JenaFusekiStorage.java: Stores connection config fields; implements readiness verification + best-effort dataset creation + ontology reload.
  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/RdfRepository.java: Adds repository-level ensureStorageReady() and rethrows from createOrUpdate() to propagate failures.
  • openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/rdf/distributed/DistributedRdfIndexExecutor.java: Ensures distributed participants check RDF storage readiness before working.
  • openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/rdf/RdfIndexApp.java: Aborts the indexing job early and marks it FAILED when RDF storage cannot be made ready.
Comments suppressed due to low confidence (1)

openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/rdf/distributed/DistributedRdfIndexExecutor.java:136

  • joinJob() now invokes RdfRepository.getInstance().ensureStorageReady() on the participant path, but there’s no unit test asserting this readiness check happens (and that failures are handled as intended for participants). Please add coverage in DistributedRdfIndexExecutorTest (or participant tests) to verify ensureStorageReady() is called and to validate behavior when it throws.
  public void joinJob(RdfIndexJob job, EventPublisherJob jobConfiguration)
      throws InterruptedException {
    RdfRepository.getInstance().ensureStorageReady();
    currentJob = job;
    coordinatorOwnedJob = false;
    stopped.set(false);
    localExecutionCleaned.set(false);
    runWorkers(jobConfiguration, false);

Comment on lines 151 to 155
    LOG.error(
        "Failed to create/update entity {} in RDF - Type: {}, FQN: {}",
        entity.getId(),
        entity.getEntityReference().getType(),
        entity.getFullyQualifiedName(),

Copilot AI Apr 22, 2026


createOrUpdate now logs an error and then rethrows, but the two current callers (RdfBatchProcessor and RdfUpdater) also log on catch. This will produce duplicate error logs (and stack traces) per failure, which can be very noisy during large reindex runs. Consider logging only at one layer (e.g., remove/downgrade this LOG.error here and let callers log with entity context), or rethrow without logging and rely on upstream handling.

Comment on lines +85 to +96
  public void ensureStorageReady() {
    if (testConnection()) {
      LOG.debug("Fuseki dataset at {} is accessible", endpoint);
      return;
    }

    LOG.warn(
        "Fuseki dataset at {} is not accessible; attempting to (re)create it before running",
        endpoint);
    ensureDatasetExists(endpoint, username, password);

    if (!testConnection()) {

Copilot AI Apr 22, 2026


ensureStorageReady() calls testConnection() twice on the unhappy path. In JenaFusekiStorage, testConnection() currently logs at ERROR with a full stack trace for expected transient/unready states, so a single readiness check can emit multiple ERROR stack traces before the method throws its own IllegalStateException. Consider reducing testConnection() logging to DEBUG/INFO (or only logging in ensureStorageReady()), so readiness probes don’t spam logs during failures/retries.

Comment on lines +145 to +156
    try {
      rdfRepository.ensureStorageReady();
    } catch (Exception e) {
      LOG.error("RDF storage is not ready; aborting indexing job", e);
      updateJobStatus(EventPublisherJob.Status.FAILED);
      jobData.setFailure(
          new IndexingError()
              .withErrorSource(IndexingError.ErrorSource.JOB)
              .withMessage("RDF storage is not ready: " + e.getMessage()));
      sendUpdates(jobExecutionContext, true);
      return;
    }

Copilot AI Apr 22, 2026


New failure behavior was added here (abort the job and mark it FAILED when rdfRepository.ensureStorageReady() throws), but RdfIndexAppTest doesn’t appear to cover it. Please add a unit test that stubs ensureStorageReady() to throw and asserts: status becomes FAILED, jobData.failure.message is set, and updates are sent (or updateRecordToDbAndNotify invoked) so the failure is surfaced to users.
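A framework-free sketch of the test this comment asks for (class and method names here are hypothetical stand-ins, not the real RdfIndexApp/RdfIndexAppTest types): a repository stub that throws should flip the job to FAILED and record a failure message.

```java
public class AbortPathSketch {

  enum Status { RUNNING, FAILED }

  static class JobData {
    Status status = Status.RUNNING;
    String failureMessage;
  }

  interface Repo {
    void ensureStorageReady();
  }

  // Mirrors the early-abort behavior added in RdfIndexApp.execute().
  static void execute(Repo repo, JobData job) {
    try {
      repo.ensureStorageReady();
    } catch (Exception e) {
      job.status = Status.FAILED;
      job.failureMessage = "RDF storage is not ready: " + e.getMessage();
      return; // abort before any batch processing
    }
    // ...normal indexing would run here...
  }

  public static void main(String[] args) {
    JobData job = new JobData();
    execute(
        () -> {
          throw new IllegalStateException("dataset missing");
        },
        job);
    System.out.println(job.status + ": " + job.failureMessage);
  }
}
```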


gitar-bot Bot commented Apr 22, 2026

Code Review ✅ Approved

Ensures the Fuseki dataset exists prior to indexing and improves error reporting for failed operations. No issues found.



github-actions Bot commented Apr 22, 2026

🔴 Playwright Results — 1 failure(s), 17 flaky

✅ 3693 passed · ❌ 1 failed · 🟡 17 flaky · ⏭️ 89 skipped

Shard       Passed  Failed  Flaky  Skipped
🟡 Shard 1     480       0      1        4
🟡 Shard 2     652       0      4        7
🟡 Shard 3     664       0      2        1
🔴 Shard 4     642       1      5       27
✅ Shard 5     611       0      0       42
🟡 Shard 6     644       0      5        8

Genuine Failures (failed on all attempts)

Pages/DataContracts.spec.ts › Create Data Contract and validate for Table (shard 4)
Error: Wait for data contract execution to complete

expect(received).toEqual(expected) // deep equality

Expected: StringMatching /(Aborted|Success|Failed|PartialSuccess|Queued)/
Received: "Running"

Call Log:
- Timeout 600000ms exceeded while waiting on the predicate
🟡 17 flaky test(s) (passed on retry)
  • Features/TeamsHierarchy.spec.ts › Delete Parent Team (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/DomainFilterQueryFilter.spec.ts › Multi-nested domain hierarchy: filters should scope correctly at every level (shard 2, 1 retry)
  • Features/ExploreQuickFilters.spec.ts › tier with assigned asset appears in dropdown, tier without asset does not (shard 2, 1 retry)
  • Features/Glossary/GlossaryHierarchy.spec.ts › should cancel move operation (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Flow/PersonaDeletionUserProfile.spec.ts › User profile loads correctly before and after persona deletion (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Not_Set (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Data Product announcement create, edit & delete (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Certification Add Remove (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Tier Add, Update and Remove (shard 4, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Table (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify Tag filter for Lineage (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/PlatformLineage.spec.ts › Verify domain platform view (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@pmbrull pmbrull requested a review from harshach April 22, 2026 17:19
Copilot AI review requested due to automatic review settings April 23, 2026 05:28

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment on lines +84 to +88
  public void ensureStorageReady() {
    if (testConnection()) {
      LOG.debug("Fuseki dataset at {} is accessible", endpoint);
      return;
    }

Copilot AI Apr 23, 2026


ensureStorageReady() currently treats the storage as ready if a SPARQL query succeeds (testConnection()), but indexing requires write access too (e.g., storeEntity() calls connection.update() and connection.load()). In a configuration where queries are allowed but updates are blocked (common with Fuseki auth/permissions), ensureStorageReady() will return successfully and the job will still run and fail every entity. Consider extending readiness checks to also validate a lightweight update/write operation (e.g., a no-op/safe DELETE WHERE against a dedicated healthcheck graph) and throw with an error message that explicitly calls out missing update permissions when writes are not allowed.
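The suggested read+write probe could look roughly like this. Everything here is illustrative: the Endpoint interface and the healthcheck graph IRI are hypothetical, not part of the PR or of Jena's actual API; the SPARQL strings show the shape of a safe write probe.

```java
public class WriteProbeSketch {

  interface Endpoint {
    boolean ask(String sparql);  // read path (SPARQL ASK)
    void update(String sparql);  // write path; throws on 405 / missing perms
  }

  // Hypothetical dedicated graph so the probe never touches real data.
  static final String HEALTH_GRAPH = "urn:openmetadata:healthcheck";

  static void ensureReadWriteReady(Endpoint endpoint) {
    if (!endpoint.ask("ASK { }")) {
      throw new IllegalStateException("RDF storage is not ready: query failed");
    }
    try {
      // No-op delete scoped to the healthcheck graph: exercises the update
      // path (and therefore update permissions) without modifying anything.
      endpoint.update("DELETE WHERE { GRAPH <" + HEALTH_GRAPH + "> { ?s ?p ?o } }");
    } catch (RuntimeException e) {
      throw new IllegalStateException(
          "RDF storage is readable but not writable; check update permissions", e);
    }
  }

  public static void main(String[] args) {
    // A read-only endpoint: queries succeed, updates are rejected.
    Endpoint readOnly =
        new Endpoint() {
          public boolean ask(String sparql) {
            return true;
          }

          public void update(String sparql) {
            throw new RuntimeException("405 Method Not Allowed");
          }
        };
    try {
      ensureReadWriteReady(readOnly);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage()); // calls out the missing write perms
    }
  }
}
```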

Comment on lines +145 to +151
    try {
      rdfRepository.ensureStorageReady();
    } catch (Exception e) {
      LOG.error("RDF storage is not ready; aborting indexing job", e);
      updateJobStatus(EventPublisherJob.Status.FAILED);
      jobData.setFailure(
          new IndexingError()

Copilot AI Apr 23, 2026


This adds a new failure path where the job aborts early if rdfRepository.ensureStorageReady() throws. Since there are existing unit tests covering execute() behavior in RdfIndexAppTest, it would be good to add/adjust tests to assert that (1) job status becomes FAILED and (2) jobData.failure is populated with an IndexingError message when storage readiness checks fail.

