
fix(rdf): ensure Fuseki dataset before indexing and report failures correctly #27630

Open

pmbrull wants to merge 3 commits into main from fix/rdf-app-ensure-dataset-and-error-reporting

Conversation

pmbrull (Collaborator) commented Apr 22, 2026

Summary

Two related reliability fixes for the RDF indexing app, discovered while debugging a customer instance where the RDF app logged 404/405 errors for every entity yet reported them all as successes.

Issue 1 — silent data loss when the Fuseki dataset disappears. JenaFusekiStorage.ensureDatasetExists() runs only in the constructor (OM server startup). If the Fuseki dataset is removed later — e.g. Fuseki container restart without pod restart when FUSEKI_BASE is on an ephemeral filesystem, admin-API deletion by another process — the RDF connection silently keeps issuing SPARQL calls that come back 404 (query) / 405 (update). The indexing app keeps running and logs thousands of errors, but never recovers.

Issue 2 — false success counts. RdfRepository.createOrUpdate() caught all exceptions and only logged them. RdfBatchProcessor.processEntities() wraps the call in try { ... successCount++; } catch { failedCount++; }, but because createOrUpdate never threw, every failure was counted as a success. The app reported failedRecords: 0 alongside thousands of server-log errors.
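A minimal, self-contained sketch of the counting bug (names here are illustrative stand-ins, not the real OpenMetadata classes): the batch loop can only increment failedCount if createOrUpdate actually throws, so swallowing the exception turns every failure into a "success".

```java
import java.util.List;

public class BatchCountSketch {

  // Simulated storage call that fails for some entities.
  static void store(String entity) {
    if (entity.startsWith("bad")) {
      throw new RuntimeException("404 from Fuseki");
    }
  }

  // Before the fix: log and swallow, so the caller never sees the failure.
  static void createOrUpdateSwallowing(String entity) {
    try {
      store(entity);
    } catch (RuntimeException e) {
      System.err.println("Failed to store " + entity + ": " + e.getMessage());
    }
  }

  // After the fix: log, then rethrow so the caller can count the failure.
  static void createOrUpdateRethrowing(String entity) {
    try {
      store(entity);
    } catch (RuntimeException e) {
      System.err.println("Failed to store " + entity + ": " + e.getMessage());
      throw e;
    }
  }

  // Mirrors the batch loop: try { ...; success++ } catch { failed++ }.
  static int[] process(List<String> entities, boolean rethrow) {
    int success = 0;
    int failed = 0;
    for (String entity : entities) {
      try {
        if (rethrow) {
          createOrUpdateRethrowing(entity);
        } else {
          createOrUpdateSwallowing(entity);
        }
        success++;
      } catch (RuntimeException e) {
        failed++;
      }
    }
    return new int[] {success, failed};
  }

  public static void main(String[] args) {
    List<String> entities = List.of("good1", "bad1", "good2");
    int[] before = process(entities, false); // {3, 0}: failure hidden
    int[] after = process(entities, true);   // {2, 1}: failure counted
    System.out.println("swallow: " + before[0] + "/" + before[1]);
    System.out.println("rethrow: " + after[0] + "/" + after[1]);
  }
}
```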

Changes

  • RdfStorageInterface: new default void ensureStorageReady() (no-op for non-Fuseki implementations).
  • JenaFusekiStorage: stores endpoint/username/password as fields; implements ensureStorageReady() by first running a SPARQL ASK (testConnection()) — this works even when the Fuseki admin API is disabled — then, on failure, attempting ensureDatasetExists(...) and re-verifying. Throws IllegalStateException with a clear message pointing at endpoint/credentials/permissions if the dataset still isn't reachable. Also reloads the ontology when we had to (re)create the dataset.
  • RdfRepository: new ensureStorageReady() that delegates to the storage. createOrUpdate() now rethrows after logging so callers can count failures.
  • RdfIndexApp.execute(): calls rdfRepository.ensureStorageReady() right after the isEnabled() check; on failure, marks the job FAILED with a clear error message and returns, instead of running to completion with partial data.
  • DistributedRdfIndexExecutor.joinJob(): same call on the participant path, so distributed participants also recover between runs (and fail loudly if they can't).

RdfUpdater (the REST-API-path caller) already wraps createOrUpdate in try/catch, so user-facing API calls are not affected by the rethrow change.
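The readiness flow the bullets describe can be sketched as follows. This is a hedged sketch: the Probe interface and its methods are stand-ins for the real Fuseki calls (testConnection / ensureDatasetExists) described above, not the actual OpenMetadata API.

```java
public class ReadinessSketch {

  interface Probe {
    boolean testConnection();   // SPARQL ASK; works even without the admin API
    void ensureDatasetExists(); // best-effort (re)creation via the admin API
  }

  // Returns true when the dataset had to be (re)created, signalling that the
  // caller should reload the ontology; throws when the dataset is still
  // unreachable after the creation attempt.
  static boolean ensureStorageReady(Probe probe, String endpoint) {
    if (probe.testConnection()) {
      return false; // dataset reachable; nothing recreated
    }
    probe.ensureDatasetExists(); // try to (re)create, then re-verify
    if (!probe.testConnection()) {
      throw new IllegalStateException(
          "RDF storage is not ready: Fuseki dataset at " + endpoint
              + " is unreachable; check endpoint, credentials, and permissions");
    }
    return true;
  }

  // A probe whose dataset "appears" once ensureDatasetExists is called,
  // simulating recovery from an ephemeral-FUSEKI_BASE restart.
  static class RecoveringProbe implements Probe {
    boolean datasetExists = false;

    public boolean testConnection() {
      return datasetExists;
    }

    public void ensureDatasetExists() {
      datasetExists = true;
    }
  }

  public static void main(String[] args) {
    boolean recreated = ensureStorageReady(new RecoveringProbe(), "http://fuseki:3030/ds");
    System.out.println("recreated: " + recreated); // true -> reload ontology
  }
}
```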

Not fixed here

This is app-level idempotency; it does not fix the underlying Fuseki persistence issue that made the dataset disappear in the first place (FUSEKI_BASE/admin-created datasets need to be on a PVC, and the Fuseki container should not be OOMKilled). Those are infra-side fixes that belong outside this PR.

Test plan

  • RDF indexing with a missing dataset on startup: previously the app left the knowledge graph partially populated until the dataset reappeared, then silently broke again on the next restart. With this change, either (a) the dataset is auto-recreated and indexing proceeds, or (b) the job fails with an RDF storage is not ready: … message surfaced in the app run record, instead of silently reporting success.
  • Pre-existing RDF integration tests in openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/RdfResourceIT.java continue to pass (default ensureStorageReady() no-op doesn't affect in-memory/test-mode paths; Fuseki path only throws on genuine unreachability).
  • With a forced failure injected in storeEntity, the batch processor now correctly reports failedRecords > 0 in the app run stats instead of 0.

fix(rdf): ensure Fuseki dataset before indexing and report failures correctly

If the Fuseki dataset disappears after server startup (container restart
without pod restart, ephemeral FUSEKI_BASE, external admin deletion),
the RDF connection silently keeps issuing requests that come back as
404/405, producing partial knowledge graphs. Also, RdfRepository.createOrUpdate
swallowed exceptions, so RdfBatchProcessor counted every failure as a success.

- Add RdfStorageInterface.ensureStorageReady() with a default no-op; Fuseki
  implementation first tests the connection with an ASK query (works even
  when the admin API is disabled), attempts dataset creation if missing,
  verifies again, and throws a clear IllegalStateException otherwise.
- RdfIndexApp.execute() and DistributedRdfIndexExecutor.joinJob() call it
  before processing, so both the coordinator and participant paths recover
  transparently from a Fuseki dataset loss between runs.
- RdfRepository.createOrUpdate() now rethrows after logging, so the batch
  processor correctly increments failedCount instead of reporting false
  successes. Entity-hook callers (RdfUpdater) already wrap in try/catch
  so API-path behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 22, 2026 13:19
github-actions Bot added the Ingestion and safe to test (run secure Github workflows on PRs) labels Apr 22, 2026
github-actions Bot (Contributor) commented:

The Java checkstyle failed.

Please run mvn spotless:apply in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Java code formatting.

You can install the pre-commit hooks with make install_test precommit_install.


Copilot AI left a comment


Pull request overview

Improves reliability of the RDF indexing app by proactively verifying RDF storage readiness (especially for Fuseki datasets that can disappear after startup) and by fixing failure reporting so indexing errors aren’t incorrectly counted as successes.

Changes:

  • Add a new ensureStorageReady() hook to the RDF storage interface and implement it for Fuseki to verify/recreate datasets and reload ontology when needed.
  • Make RdfRepository.createOrUpdate() rethrow on failures so batch indexing can correctly count failures.
  • Fail RDF indexing jobs early (both coordinator and distributed participant paths) when RDF storage is not ready, surfacing a clear failure message in job records.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/storage/RdfStorageInterface.java: Adds the ensureStorageReady() default no-op contract for readiness checks.
  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/storage/JenaFusekiStorage.java: Stores connection config fields; implements readiness verification + best-effort dataset creation + ontology reload.
  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/RdfRepository.java: Adds repository-level ensureStorageReady() and rethrows from createOrUpdate() to propagate failures.
  • openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/rdf/distributed/DistributedRdfIndexExecutor.java: Ensures distributed participants check RDF storage readiness before working.
  • openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/rdf/RdfIndexApp.java: Aborts the indexing job early and marks it FAILED when RDF storage cannot be made ready.
Comments suppressed due to low confidence (1)

openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/rdf/distributed/DistributedRdfIndexExecutor.java:136

  • joinJob() now invokes RdfRepository.getInstance().ensureStorageReady() on the participant path, but there’s no unit test asserting this readiness check happens (and that failures are handled as intended for participants). Please add coverage in DistributedRdfIndexExecutorTest (or participant tests) to verify ensureStorageReady() is called and to validate behavior when it throws.
  public void joinJob(RdfIndexJob job, EventPublisherJob jobConfiguration)
      throws InterruptedException {
    RdfRepository.getInstance().ensureStorageReady();
    currentJob = job;
    coordinatorOwnedJob = false;
    stopped.set(false);
    localExecutionCleaned.set(false);
    runWorkers(jobConfiguration, false);

Comment on lines 151 to 155
    LOG.error(
        "Failed to create/update entity {} in RDF - Type: {}, FQN: {}",
        entity.getId(),
        entity.getEntityReference().getType(),
        entity.getFullyQualifiedName(),

Copilot AI Apr 22, 2026


createOrUpdate now logs an error and then rethrows, but the two current callers (RdfBatchProcessor and RdfUpdater) also log on catch. This will produce duplicate error logs (and stack traces) per failure, which can be very noisy during large reindex runs. Consider logging only at one layer (e.g., remove/downgrade this LOG.error here and let callers log with entity context), or rethrow without logging and rely on upstream handling.

Comment on lines +85 to +96
  public void ensureStorageReady() {
    if (testConnection()) {
      LOG.debug("Fuseki dataset at {} is accessible", endpoint);
      return;
    }

    LOG.warn(
        "Fuseki dataset at {} is not accessible; attempting to (re)create it before running",
        endpoint);
    ensureDatasetExists(endpoint, username, password);

    if (!testConnection()) {

Copilot AI Apr 22, 2026


ensureStorageReady() calls testConnection() twice on the unhappy path. In JenaFusekiStorage, testConnection() currently logs at ERROR with a full stack trace for expected transient/unready states, so a single readiness check can emit multiple ERROR stack traces before the method throws its own IllegalStateException. Consider reducing testConnection() logging to DEBUG/INFO (or only logging in ensureStorageReady()), so readiness probes don’t spam logs during failures/retries.

Comment on lines +145 to +156
    try {
      rdfRepository.ensureStorageReady();
    } catch (Exception e) {
      LOG.error("RDF storage is not ready; aborting indexing job", e);
      updateJobStatus(EventPublisherJob.Status.FAILED);
      jobData.setFailure(
          new IndexingError()
              .withErrorSource(IndexingError.ErrorSource.JOB)
              .withMessage("RDF storage is not ready: " + e.getMessage()));
      sendUpdates(jobExecutionContext, true);
      return;
    }

Copilot AI Apr 22, 2026


New failure behavior was added here (abort the job and mark it FAILED when rdfRepository.ensureStorageReady() throws), but RdfIndexAppTest doesn’t appear to cover it. Please add a unit test that stubs ensureStorageReady() to throw and asserts: status becomes FAILED, jobData.failure.message is set, and updates are sent (or updateRecordToDbAndNotify invoked) so the failure is surfaced to users.
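A framework-free sketch of the test this comment asks for (class and method names here are hypothetical stand-ins, not the real RdfIndexApp/RdfIndexAppTest types): a repository stub that throws should flip the job to FAILED and record a failure message.

```java
public class AbortPathSketch {

  enum Status { RUNNING, FAILED }

  static class JobData {
    Status status = Status.RUNNING;
    String failureMessage;
  }

  interface Repo {
    void ensureStorageReady();
  }

  // Mirrors the early-abort behavior added in RdfIndexApp.execute().
  static void execute(Repo repo, JobData job) {
    try {
      repo.ensureStorageReady();
    } catch (Exception e) {
      job.status = Status.FAILED;
      job.failureMessage = "RDF storage is not ready: " + e.getMessage();
      return; // abort before any batch processing
    }
    // ...normal indexing would run here...
  }

  public static void main(String[] args) {
    JobData job = new JobData();
    execute(
        () -> {
          throw new IllegalStateException("dataset missing");
        },
        job);
    System.out.println(job.status + ": " + job.failureMessage);
  }
}
```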


gitar-bot Bot commented Apr 22, 2026

Code Review ✅ Approved

Ensures the Fuseki dataset exists prior to indexing and improves error reporting for failed operations. No issues found.



github-actions Bot commented Apr 22, 2026

🔴 Playwright Results — 1 failure(s), 17 flaky

✅ 3693 passed · ❌ 1 failed · 🟡 17 flaky · ⏭️ 89 skipped

Shard       Passed  Failed  Flaky  Skipped
🟡 Shard 1     480       0      1        4
🟡 Shard 2     652       0      4        7
🟡 Shard 3     664       0      2        1
🔴 Shard 4     642       1      5       27
✅ Shard 5     611       0      0       42
🟡 Shard 6     644       0      5        8

Genuine Failures (failed on all attempts)

Pages/DataContracts.spec.ts › Create Data Contract and validate for Table (shard 4)
Error: Wait for data contract execution to complete

expect(received).toEqual(expected) // deep equality

Expected: StringMatching /(Aborted|Success|Failed|PartialSuccess|Queued)/
Received: "Running"

Call Log:
- Timeout 600000ms exceeded while waiting on the predicate
🟡 17 flaky test(s) (passed on retry)
  • Features/TeamsHierarchy.spec.ts › Delete Parent Team (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/DomainFilterQueryFilter.spec.ts › Multi-nested domain hierarchy: filters should scope correctly at every level (shard 2, 1 retry)
  • Features/ExploreQuickFilters.spec.ts › tier with assigned asset appears in dropdown, tier without asset does not (shard 2, 1 retry)
  • Features/Glossary/GlossaryHierarchy.spec.ts › should cancel move operation (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Flow/PersonaDeletionUserProfile.spec.ts › User profile loads correctly before and after persona deletion (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Not_Set (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Data Product announcement create, edit & delete (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Certification Add Remove (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Tier Add, Update and Remove (shard 4, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Table (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify Tag filter for Lineage (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/PlatformLineage.spec.ts › Verify domain platform view (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@pmbrull pmbrull requested a review from harshach April 22, 2026 17:19
Copilot AI review requested due to automatic review settings April 23, 2026 05:28

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment on lines +84 to +88
  public void ensureStorageReady() {
    if (testConnection()) {
      LOG.debug("Fuseki dataset at {} is accessible", endpoint);
      return;
    }

Copilot AI Apr 23, 2026


ensureStorageReady() currently treats the storage as ready if a SPARQL query succeeds (testConnection()), but indexing requires write access too (e.g., storeEntity() calls connection.update() and connection.load()). In a configuration where queries are allowed but updates are blocked (common with Fuseki auth/permissions), ensureStorageReady() will return successfully and the job will still run and fail every entity. Consider extending readiness checks to also validate a lightweight update/write operation (e.g., a no-op/safe DELETE WHERE against a dedicated healthcheck graph) and throw with an error message that explicitly calls out missing update permissions when writes are not allowed.
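The suggested read+write probe could look roughly like this. Everything here is illustrative: the Endpoint interface and the healthcheck graph IRI are hypothetical, not part of the PR or of Jena's actual API; the SPARQL strings show the shape of a safe write probe.

```java
public class WriteProbeSketch {

  interface Endpoint {
    boolean ask(String sparql);  // read path (SPARQL ASK)
    void update(String sparql);  // write path; throws on 405 / missing perms
  }

  // Hypothetical dedicated graph so the probe never touches real data.
  static final String HEALTH_GRAPH = "urn:openmetadata:healthcheck";

  static void ensureReadWriteReady(Endpoint endpoint) {
    if (!endpoint.ask("ASK { }")) {
      throw new IllegalStateException("RDF storage is not ready: query failed");
    }
    try {
      // No-op delete scoped to the healthcheck graph: exercises the update
      // path (and therefore update permissions) without modifying anything.
      endpoint.update("DELETE WHERE { GRAPH <" + HEALTH_GRAPH + "> { ?s ?p ?o } }");
    } catch (RuntimeException e) {
      throw new IllegalStateException(
          "RDF storage is readable but not writable; check update permissions", e);
    }
  }

  public static void main(String[] args) {
    // A read-only endpoint: queries succeed, updates are rejected.
    Endpoint readOnly =
        new Endpoint() {
          public boolean ask(String sparql) {
            return true;
          }

          public void update(String sparql) {
            throw new RuntimeException("405 Method Not Allowed");
          }
        };
    try {
      ensureReadWriteReady(readOnly);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage()); // calls out the missing write perms
    }
  }
}
```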

Comment on lines +145 to +151
    try {
      rdfRepository.ensureStorageReady();
    } catch (Exception e) {
      LOG.error("RDF storage is not ready; aborting indexing job", e);
      updateJobStatus(EventPublisherJob.Status.FAILED);
      jobData.setFailure(
          new IndexingError()

Copilot AI Apr 23, 2026


This adds a new failure path where the job aborts early if rdfRepository.ensureStorageReady() throws. Since there are existing unit tests covering execute() behavior in RdfIndexAppTest, it would be good to add/adjust tests to assert that (1) job status becomes FAILED and (2) jobData.failure is populated with an IndexingError message when storage readiness checks fail.

