feat: CIO - Add shared flag for COS ingestion to allow documents to be indexed without an owner #1808

Open
ricofurtado wants to merge 6 commits into main from cio-shared-files

Conversation

@ricofurtado ricofurtado commented Jun 9, 2026

Collaborator

This pull request introduces a new "shared" ingestion mode for IBM COS connectors, allowing documents to be indexed without an owner so that all users in an OpenRAG instance can access them. The change is implemented across the frontend, backend, and SDKs, with UI updates to expose the option where appropriate, backend validation and propagation of the shared flag, and adjustments to document processing logic to omit owner fields when shared mode is enabled.

Frontend and UI Enhancements:

  • Added a "Make documents available to all users" toggle in the ingestion settings UI (IngestSettings and SharedBucketView components), shown only for COS ingestion, and ensured the setting is passed through to connector sync requests. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]

Backend and API Changes:

  • Extended the connector sync API and service logic to accept and propagate a shared flag, with validation to ensure it is only used with the IBM COS connector. [1] [2] [3] [4] [5] [6] [7] [8] [9]

Document Processing Logic:

  • Introduced a helper function resolve_shared_owner_fields and updated document processing so that when shared mode is enabled, owner fields are omitted from indexed chunks, making them visible to all users according to OpenSearch DLS rules. [1] [2] [3] [4] [5] [6]

SDK Updates:

  • Added support for the shared flag in both Python and TypeScript SDKs, allowing clients to request shared ingestion from code. [1] [2] [3] [4] [5] [6] [7]

These changes collectively enable a temporary, instance-wide document sharing mechanism for COS-ingested documents, pending future implementation of more granular access control.
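For readers following the description, here is a minimal sketch of what a helper like resolve_shared_owner_fields could look like. Only the function name and the (None, None, None) result for shared mode come from this PR; the parameter names and their order are assumptions made for illustration.

```python
from typing import Optional, Tuple

OwnerFields = Tuple[Optional[str], Optional[str], Optional[str]]


def resolve_shared_owner_fields(
    owner_user_id: Optional[str],
    owner_name: Optional[str],
    owner_email: Optional[str],
    shared: bool = False,
) -> OwnerFields:
    """Return the owner fields to index (parameter names are assumed)."""
    if shared:
        # Shared mode: drop every owner field so the chunk is indexed ownerless.
        return (None, None, None)
    # Default mode: pass the caller's identity through unchanged.
    return (owner_user_id, owner_name, owner_email)


owned = resolve_shared_owner_fields("u1", "Ada", "ada@example.com")
ownerless = resolve_shared_owner_fields("u1", "Ada", "ada@example.com", shared=True)
```

Keeping the decision in one helper means every ingestion path resolves ownership the same way instead of re-checking the flag inline.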

Summary by CodeRabbit

  • New Features

    • Added ability to make Cloud Object Storage documents available to all instance users during ingestion.
    • Introduced "Make documents available to all users" toggle in ingest settings interface.
  • Tests

    • Added comprehensive unit and integration tests for shared document visibility and access control behavior.

@github-actions github-actions Bot added frontend 🟨 Issues related to the UI/UX backend 🔷 Issues related to backend services (OpenSearch, Langflow, APIs) tests labels Jun 9, 2026

coderabbitai Bot commented Jun 9, 2026

Contributor

Walkthrough

This PR adds a COS-scoped shared flag enabling ownerless document indexing across the frontend, API, services, and indexing pipeline. Frontend components expose a "Make documents available to all users" toggle, the API validates the flag is only used with IBM COS, and processors omit owner fields when shared. Tests verify DLS visibility behavior and serialization correctness.

Changes

Shared Document Indexing Feature

| Layer / File(s) | Summary |
| --- | --- |
| Type and contract definitions: frontend/components/cloud-picker/types.ts, frontend/components/connectors/shared-bucket-view.tsx, frontend/app/api/mutations/useSyncConnector.ts, src/api/connectors.py | Frontend and API layer types declare new shared and showShared boolean fields in IngestSettings, SharedBucketViewProps, sync mutation body, and ConnectorSyncBody request model. |
| Frontend UI and mutation wiring: frontend/components/cloud-picker/ingest-settings.tsx, frontend/components/connectors/shared-bucket-view.tsx, frontend/enhancements/connectors/ibm-cos/components/bucket-view.tsx | Components accept showShared prop, render "Make documents available to all users" toggle when enabled, wire ingestSettings.shared, and conditionally include shared in sync mutation payload. |
| API request validation and routing: src/api/connectors.py | Connector sync endpoint validates shared=true only for ibm_cos connector type (returns HTTP 400 otherwise) and threads shared flag into all three sync execution paths (explicitly selected files, bucket-filtered full ingest, and full ingest without constraints). |
| Service-layer shared parameter propagation: src/connectors/service.py | ConnectorService.sync_connector_files and sync_specific_files accept shared parameter and pass it to ConnectorFileProcessor instantiation. |
| Processor owner-field resolution logic: src/models/processors.py | Added resolve_shared_owner_fields() helper returning (None, None, None) when shared=True; TaskProcessor and ConnectorFileProcessor integrate shared flag into deletion (using broader filename query when shared) and ingestion paths (using resolved owner fields). |
| Index writer and filename-replace query: src/services/document_index_writer.py, src/utils/opensearch_queries.py | Index writer conditionally sets owner field only when context owner is not None; new build_replace_filename_query() targets documents matching filename where owner equals user OR owner field is absent (shared documents). |
| Langflow service owner serialization fixes: src/services/langflow_file_service.py | Langflow service fixes metadata and header serialization: owner metadata only added when truthy, global headers use value or "" pattern to avoid stringifying None as "None" string. |
| Unit and integration tests: tests/unit/test_shared_flag.py, tests/integration/core/test_shared_flag_dls.py | Comprehensive unit tests validate owner resolution, index document construction, query structures, and API guard behavior; integration tests verify DLS visibility with ownerless vs. null-owner documents. |
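The filename-replace query summarized above ("owner equals user OR owner field is absent") can be expressed in OpenSearch query DSL roughly as follows. This is a sketch under assumptions: the function name and the owner field come from the walkthrough, while the signature, the filename field name, and the exact clause layout are guesses rather than the code in src/utils/opensearch_queries.py.

```python
from typing import Any, Dict


def build_replace_filename_query(filename: str, owner_user_id: str) -> Dict[str, Any]:
    """Match chunks of `filename` that the user owns OR that have no owner field."""
    return {
        "query": {
            "bool": {
                # Always restrict to the file being replaced (field name assumed).
                "filter": [{"term": {"filename": filename}}],
                "should": [
                    # Documents owned by the requesting user...
                    {"term": {"owner": owner_user_id}},
                    # ...or shared documents, which carry no owner field at all.
                    {"bool": {"must_not": [{"exists": {"field": "owner"}}]}},
                ],
                # At least one of the two ownership clauses has to match.
                "minimum_should_match": 1,
            }
        }
    }


query = build_replace_filename_query("report.pdf", "u1")
```

A body like this would typically be passed to a delete-by-query call so that stale chunks are removed before the file is reindexed.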

Sequence Diagram(s)

sequenceDiagram
  participant UI as Frontend UI
  participant API as Connector API
  participant Service as ConnectorService
  participant Processor as ConnectorFileProcessor
  participant TaskProc as TaskProcessor
  participant IndexWriter as DocumentIndexWriter
  participant OpenSearch
  
  UI->>API: POST /sync body={shared: true, connector_type: "ibm_cos"}
  API->>API: validate shared only with ibm_cos
  API->>Service: sync_connector_files(..., shared=true)
  Service->>Processor: create ConnectorFileProcessor(shared=true)
  Processor->>TaskProc: process_document_standard(..., shared=true)
  TaskProc->>TaskProc: resolve_shared_owner_fields(..., shared=true) → (None, None, None)
  TaskProc->>IndexWriter: index with owner=(None), owner_name=(None), owner_email=(None)
  IndexWriter->>IndexWriter: conditionally omit "owner" key when None
  IndexWriter->>OpenSearch: index chunk without owner field
  OpenSearch-->>OpenSearch: DLS: document accessible to all users (no owner constraint)
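The last two IndexWriter steps in the diagram hinge on leaving the owner key out of the chunk entirely rather than writing it as null. A minimal sketch of that step follows; the function name and the text and filename fields are hypothetical, and only the "set owner only when not None" behavior is taken from the walkthrough.

```python
from typing import Any, Dict, Optional


def build_chunk_document(
    text: str, filename: str, owner: Optional[str] = None
) -> Dict[str, Any]:
    """Build the body of one indexed chunk (hypothetical shape)."""
    doc: Dict[str, Any] = {"text": text, "filename": filename}
    if owner is not None:
        # Owned ingestion: record the owner on the chunk.
        doc["owner"] = owner
    # Shared ingestion: no "owner" key is emitted at all, not even as None.
    return doc


owned_doc = build_chunk_document("hello", "a.txt", owner="u1")
shared_doc = build_chunk_document("hello", "a.txt")
```

The integration tests listed in the walkthrough compare ownerless and null-owner documents, which is why the sketch omits the key instead of storing None.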

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • langflow-ai/openrag#1695: Both PRs modify src/connectors/service.py's ConnectorService.sync_specific_files method—this PR adds shared flag threading, while that PR refactors folder-expansion/sync logic in the same method.
  • langflow-ai/openrag#1694: Both PRs extend src/models/processors.py's TaskProcessor.process_document_standard(...) signature—this PR adds shared for ownerless documents, while that PR adds connector_file_id for deletion targeting.

Suggested reviewers

  • edwinjosechittilappilly
  • lucaseduoli
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 51.61% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding a shared flag for COS ingestion to allow documents to be indexed without an owner, which is the core feature implemented across frontend, backend, and processing logic. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@github-actions github-actions Bot added enhancement 🔵 New feature or request and removed enhancement 🔵 New feature or request labels Jun 9, 2026

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/api/connectors.py (1)

734-755: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Forward shared in the no-selection resync branch too.

Lines 734-755 call sync methods without shared=body.shared. For ibm_cos requests where shared=true and no selected_files/sync_all/bucket_filter is provided, the flag is silently ignored.

Suggested fix
                 task_id = await connector_service.sync_specific_files(
                     working_connection.connection_id,
                     user.user_id,
                     ids_to_sync,
                     jwt_token=jwt_token,
                     replace_duplicates=_connector_sync_should_replace(connector_type),
+                    shared=body.shared,
                 )
             else:
                 # Fallback: use filename filtering (for Langflow-ingested files without document_id)
                 logger.info(
                     "Syncing files by filename filter (document_id not available)",
@@
                 task_id = await connector_service.sync_connector_files(
                     working_connection.connection_id,
                     user.user_id,
                     max_files=None,
                     jwt_token=jwt_token,
                     filename_filter=set(existing_filenames),
                     replace_duplicates=_connector_sync_should_replace(connector_type),
+                    shared=body.shared,
                 )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/connectors.py` around lines 734 - 755, The resync branches call
connector_service.sync_specific_files and connector_service.sync_connector_files
without passing the shared flag, so when body.shared is true (e.g., ibm_cos
shared=true) it gets ignored; update both calls to include shared=body.shared
(alongside existing args like working_connection.connection_id, user.user_id,
jwt_token, filename_filter/ids_to_sync, replace_duplicates) so the connector
service receives the shared flag in the no-selection and specific-files
branches.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/api/connectors.py`:
- Around line 551-555: The 400 response for the shared-flag check returns
{"detail": ...} which the frontend expects under error.error; update the
JSONResponse in the body.shared check (the branch using JSONResponse when
connector_type != "ibm_cos") to return an object with an "error" key containing
the message (you may keep "detail" if desired) and keep status_code=400 so the
frontend can read error.error; modify the JSONResponse payload in that branch
accordingly.

In `@src/models/processors.py`:
- Line 920: The duplicate-replacement path currently filters deletes by
owner_user_id which skips shared documents (shared=self.shared) and leaves
orphaned chunks when replace_duplicates=True; update the delete logic used by
the replace_duplicates flow (the function/method that constructs the
owner-scoped delete query — look for uses of owner_user_id in the
replace_duplicates handling) to branch on self.shared: if self.shared is True,
run the delete without the owner_user_id filter (or explicitly delete rows where
owner_user_id IS NULL), otherwise keep the existing owner-scoped delete; ensure
the same branch is applied wherever replace_duplicates is handled so shared docs
are properly removed before reindexing.

In `@tests/unit/test_shared_flag.py`:
- Line 4: Remove the unused TestClient import to satisfy lint (Ruff F401):
delete the "from fastapi.testclient import TestClient" line (or replace it with
a used import) in the tests/unit/test_shared_flag.py file so that TestClient is
no longer referenced in the module; ensure no other references to TestClient
remain (e.g., in functions or fixtures) before committing.
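To illustrate the first inline comment, here is a framework-free sketch of the shared-flag guard that returns the message under an "error" key while keeping "detail". The helper name, return shape, and message text are hypothetical; the actual endpoint returns a JSONResponse, which this sketch deliberately avoids so that it runs with the standard library alone.

```python
from typing import Any, Dict, Optional, Tuple


def validate_shared_flag(
    connector_type: str, shared: bool
) -> Optional[Tuple[int, Dict[str, Any]]]:
    """Return (status_code, payload) when the request must be rejected, else None."""
    if shared and connector_type != "ibm_cos":
        # Message wording is illustrative, not the text used in the PR.
        message = "The shared flag is only supported for the ibm_cos connector"
        # "error" is the key the frontend reads; "detail" is kept for other clients.
        return 400, {"error": message, "detail": message}
    return None


rejected = validate_shared_flag("google_drive", True)
accepted = validate_shared_flag("ibm_cos", True)
```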

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b3b16009-78a9-413f-8102-220954eb99d3

📥 Commits

Reviewing files that changed from the base of the PR and between cb3f5ed and d3b2dac.

📒 Files selected for processing (13)
  • frontend/app/api/mutations/useSyncConnector.ts
  • frontend/components/cloud-picker/ingest-settings.tsx
  • frontend/components/cloud-picker/types.ts
  • frontend/components/connectors/shared-bucket-view.tsx
  • frontend/enhancements/connectors/ibm-cos/components/bucket-view.tsx
  • sdks/python/openrag_sdk/documents.py
  • sdks/typescript/src/documents.ts
  • src/api/connectors.py
  • src/connectors/service.py
  • src/models/processors.py
  • src/services/document_index_writer.py
  • tests/integration/core/test_shared_flag_dls.py
  • tests/unit/test_shared_flag.py

Comment thread src/api/connectors.py
Comment thread src/models/processors.py
Comment thread tests/unit/test_shared_flag.py Outdated
Improving error return

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@github-actions github-actions Bot added enhancement 🔵 New feature or request and removed enhancement 🔵 New feature or request labels Jun 10, 2026
@edwinjosechittilappilly

Collaborator

Good One!
Shared Feedback Async.

@github-actions github-actions Bot added the enhancement 🔵 New feature or request label Jun 10, 2026

@coderabbitai coderabbitai Bot left a comment

Contributor

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/services/langflow_file_service.py (1)

379-387: ⚠️ Potential issue | 🟡 Minor

Remove or wire metadata_tweaks in langflow_file_service.py

In src/services/langflow_file_service.py (lines 379-387), metadata_tweaks is only constructed (via the if owner* guards) and then immediately logged; it’s not used to populate tweaks, headers, or any request payload field. Either remove the list/guards or pass metadata_tweaks into the actual data structure that Langflow/OpenSearch consumes.
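As a rough illustration of the serialization behavior involved here, the sketch below builds owner metadata only from truthy values and uses the value or "" pattern for headers, so that None is never sent as the string "None". The function names, the key/value entry shape, and the X-Example-* header names are all hypothetical, not the ones used by langflow_file_service.py.

```python
from typing import Dict, List, Optional


def build_owner_metadata(
    owner: Optional[str], owner_name: Optional[str], owner_email: Optional[str]
) -> List[Dict[str, str]]:
    """Collect metadata entries only for owner values that are truthy."""
    candidates = {"owner": owner, "owner_name": owner_name, "owner_email": owner_email}
    return [{"key": key, "value": value} for key, value in candidates.items() if value]


def build_owner_headers(
    owner: Optional[str], owner_name: Optional[str], owner_email: Optional[str]
) -> Dict[str, str]:
    """Serialize owner values as headers, mapping None to an empty string."""
    return {
        "X-Example-Owner": owner or "",
        "X-Example-Owner-Name": owner_name or "",
        "X-Example-Owner-Email": owner_email or "",
    }


shared_metadata = build_owner_metadata(None, None, None)
shared_headers = build_owner_headers(None, None, None)
```

Whatever shape the real entries take, the review point still applies: a list built this way only has an effect once it is merged into the payload that is actually sent.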

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/services/langflow_file_service.py` around lines 379 - 387, The
metadata_tweaks list is being built (metadata_tweaks) and logged but never
applied to the request payload; either remove the unused metadata_tweaks
construction or wire it into the outgoing data structure (e.g., merge
metadata_tweaks into the existing tweaks list or attach it to the request
headers/payload that is sent to Langflow/OpenSearch). Locate the block that
constructs tweaks (or the payload builder/send function) and either extend that
tweaks variable with metadata_tweaks or include metadata_tweaks under the
appropriate payload key before the request is made; if unused, delete the
metadata_tweaks variable and the conditional guards and the logger line
(logger.info(...)) to avoid dead code. Ensure you reference and update
metadata_tweaks, tweaks, and the payload-sending function so the metadata is
actually consumed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0eafac68-0ae2-4e3b-8c99-9befd8ab86c7

📥 Commits

Reviewing files that changed from the base of the PR and between 989c18a and 4b50bf1.

📒 Files selected for processing (1)
  • src/services/langflow_file_service.py

Labels

backend 🔷 Issues related to backend services (OpenSearch, Langflow, APIs) enhancement 🔵 New feature or request frontend 🟨 Issues related to the UI/UX tests

2 participants