feat: CIO - Add shared flag for COS ingestion to allow documents to be indexed without an owner#1808
feat: CIO - Add shared flag for COS ingestion to allow documents to be indexed without an owner#1808ricofurtado wants to merge 6 commits into
Conversation
…xed without an owner
WalkthroughThis PR adds a COS-scoped ChangesShared Document Indexing Feature
Sequence Diagram(s)sequenceDiagram
participant UI as Frontend UI
participant API as Connector API
participant Service as ConnectorService
participant Processor as ConnectorFileProcessor
participant TaskProc as TaskProcessor
participant IndexWriter as DocumentIndexWriter
participant OpenSearch
UI->>API: POST /sync body={shared: true, connector_type: "ibm_cos"}
API->>API: validate shared only with ibm_cos
API->>Service: sync_connector_files(..., shared=true)
Service->>Processor: create ConnectorFileProcessor(shared=true)
Processor->>TaskProc: process_document_standard(..., shared=true)
TaskProc->>TaskProc: resolve_shared_owner_fields(..., shared=true) → (None, None, None)
TaskProc->>IndexWriter: index with owner=(None), owner_name=(None), owner_email=(None)
IndexWriter->>IndexWriter: conditionally omit "owner" key when None
IndexWriter->>OpenSearch: index chunk without owner field
OpenSearch-->>OpenSearch: DLS: document accessible to all users (no owner constraint)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/api/connectors.py (1)
734-755:⚠️ Potential issue | 🟠 Major | ⚡ Quick winForward
sharedin the no-selection resync branch too.Lines 734-755 call sync methods without
shared=body.shared. Foribm_cosrequests whereshared=trueand noselected_files/sync_all/bucket_filteris provided, the flag is silently ignored.Suggested fix
task_id = await connector_service.sync_specific_files( working_connection.connection_id, user.user_id, ids_to_sync, jwt_token=jwt_token, replace_duplicates=_connector_sync_should_replace(connector_type), + shared=body.shared, ) else: # Fallback: use filename filtering (for Langflow-ingested files without document_id) logger.info( "Syncing files by filename filter (document_id not available)", @@ task_id = await connector_service.sync_connector_files( working_connection.connection_id, user.user_id, max_files=None, jwt_token=jwt_token, filename_filter=set(existing_filenames), replace_duplicates=_connector_sync_should_replace(connector_type), + shared=body.shared, )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/api/connectors.py` around lines 734 - 755, The resync branches call connector_service.sync_specific_files and connector_service.sync_connector_files without passing the shared flag, so when body.shared is true (e.g., ibm_cos shared=true) it gets ignored; update both calls to include shared=body.shared (alongside existing args like working_connection.connection_id, user.user_id, jwt_token, filename_filter/ids_to_sync, replace_duplicates) so the connector service receives the shared flag in the no-selection and specific-files branches.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/api/connectors.py`:
- Around line 551-555: The 400 response for the shared-flag check returns
{"detail": ...} which the frontend expects under error.error; update the
JSONResponse in the body.shared check (the branch using JSONResponse when
connector_type != "ibm_cos") to return an object with an "error" key containing
the message (you may keep "detail" if desired) and keep status_code=400 so the
frontend can read error.error; modify the JSONResponse payload in that branch
accordingly.
In `@src/models/processors.py`:
- Line 920: The duplicate-replacement path currently filters deletes by
owner_user_id which skips shared documents (shared=self.shared) and leaves
orphaned chunks when replace_duplicates=True; update the delete logic used by
the replace_duplicates flow (the function/method that constructs the
owner-scoped delete query — look for uses of owner_user_id in the
replace_duplicates handling) to branch on self.shared: if self.shared is True,
run the delete without the owner_user_id filter (or explicitly delete rows where
owner_user_id IS NULL), otherwise keep the existing owner-scoped delete; ensure
the same branch is applied wherever replace_duplicates is handled so shared docs
are properly removed before reindexing.
In `@tests/unit/test_shared_flag.py`:
- Line 4: Remove the unused TestClient import to satisfy lint (Ruff F401):
delete the "from fastapi.testclient import TestClient" line (or replace it with
a used import) in the tests/unit/test_shared_flag.py file so that TestClient is
no longer referenced in the module; ensure no other references to TestClient
remain (e.g., in functions or fixtures) before committing.
---
Outside diff comments:
In `@src/api/connectors.py`:
- Around line 734-755: The resync branches call
connector_service.sync_specific_files and connector_service.sync_connector_files
without passing the shared flag, so when body.shared is true (e.g., ibm_cos
shared=true) it gets ignored; update both calls to include shared=body.shared
(alongside existing args like working_connection.connection_id, user.user_id,
jwt_token, filename_filter/ids_to_sync, replace_duplicates) so the connector
service receives the shared flag in the no-selection and specific-files
branches.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b3b16009-78a9-413f-8102-220954eb99d3
📒 Files selected for processing (13)
frontend/app/api/mutations/useSyncConnector.tsfrontend/components/cloud-picker/ingest-settings.tsxfrontend/components/cloud-picker/types.tsfrontend/components/connectors/shared-bucket-view.tsxfrontend/enhancements/connectors/ibm-cos/components/bucket-view.tsxsdks/python/openrag_sdk/documents.pysdks/typescript/src/documents.tssrc/api/connectors.pysrc/connectors/service.pysrc/models/processors.pysrc/services/document_index_writer.pytests/integration/core/test_shared_flag_dls.pytests/unit/test_shared_flag.py
Improving error return Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
|
Good One! |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/services/langflow_file_service.py (1)
379-387:⚠️ Potential issue | 🟡 MinorRemove or wire
metadata_tweaksinlangflow_file_service.pyIn
src/services/langflow_file_service.py(lines 379-387),metadata_tweaksis only constructed (via theif owner*guards) and then immediately logged; it’s not used to populatetweaks, headers, or any request payload field. Either remove the list/guards or passmetadata_tweaksinto the actual data structure that Langflow/OpenSearch consumes.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/services/langflow_file_service.py` around lines 379 - 387, The metadata_tweaks list is being built (metadata_tweaks) and logged but never applied to the request payload; either remove the unused metadata_tweaks construction or wire it into the outgoing data structure (e.g., merge metadata_tweaks into the existing tweaks list or attach it to the request headers/payload that is sent to Langflow/OpenSearch). Locate the block that constructs tweaks (or the payload builder/send function) and either extend that tweaks variable with metadata_tweaks or include metadata_tweaks under the appropriate payload key before the request is made; if unused, delete the metadata_tweaks variable and the conditional guards and the logger line (logger.info(...)) to avoid dead code. Ensure you reference and update metadata_tweaks, tweaks, and the payload-sending function so the metadata is actually consumed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/services/langflow_file_service.py`:
- Around line 379-387: The metadata_tweaks list is being built (metadata_tweaks)
and logged but never applied to the request payload; either remove the unused
metadata_tweaks construction or wire it into the outgoing data structure (e.g.,
merge metadata_tweaks into the existing tweaks list or attach it to the request
headers/payload that is sent to Langflow/OpenSearch). Locate the block that
constructs tweaks (or the payload builder/send function) and either extend that
tweaks variable with metadata_tweaks or include metadata_tweaks under the
appropriate payload key before the request is made; if unused, delete the
metadata_tweaks variable and the conditional guards and the logger line
(logger.info(...)) to avoid dead code. Ensure you reference and update
metadata_tweaks, tweaks, and the payload-sending function so the metadata is
actually consumed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0eafac68-0ae2-4e3b-8c99-9befd8ab86c7
📒 Files selected for processing (1)
src/services/langflow_file_service.py
This pull request introduces a new "shared" ingestion mode for IBM COS connectors, allowing documents to be indexed without an owner so that all users in an OpenRAG instance can access them. The change is implemented across the frontend, backend, and SDKs, with UI updates to expose the option where appropriate, backend validation and propagation of the shared flag, and adjustments to document processing logic to omit owner fields when shared mode is enabled.
Frontend and UI Enhancements:
IngestSettingsandSharedBucketViewcomponents), shown only for COS ingestion, and ensured the setting is passed through to connector sync requests. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]Backend and API Changes:
sharedflag, with validation to ensure it is only used with the IBM COS connector. [1] [2] [3] [4] [5] [6] [7] [8] [9]Document Processing Logic:
resolve_shared_owner_fieldsand updated document processing so that when shared mode is enabled, owner fields are omitted from indexed chunks, making them visible to all users according to OpenSearch DLS rules. [1] [2] [3] [4] [5] [6]SDK Updates:
sharedflag in both Python and TypeScript SDKs, allowing clients to request shared ingestion from code. [1] [2] [3] [4] [5] [6] [7]These changes collectively enable a temporary, instance-wide document sharing mechanism for COS-ingested documents, pending future implementation of more granular access control.l
Summary by CodeRabbit
New Features
Tests