-
Notifications
You must be signed in to change notification settings - Fork 19
Feat/relationship2 #219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/relationship2 #219
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the
📝 WalkthroughWalkthroughIntroduces document relationship support to OpenRAG by adding relationship_id and parent_id fields to track document groupings (e.g., folders, email threads) and hierarchies. A new RelationshipAwareRetriever expands search results with related and ancestor documents. Includes database schema changes, new API endpoints, retriever implementation, and comprehensive tests. Changes
Sequence DiagramsequenceDiagram
participant Client
participant RelationshipAwareRetriever
participant VectorDB
participant Deduplicator
Client->>RelationshipAwareRetriever: search(query, include_related, include_ancestors)
RelationshipAwareRetriever->>VectorDB: semantic_search(query)
VectorDB-->>RelationshipAwareRetriever: base_results
alt include_related enabled
RelationshipAwareRetriever->>VectorDB: get_related_chunks(relationship_id, limit)
VectorDB-->>RelationshipAwareRetriever: related_results
end
alt include_ancestors enabled
RelationshipAwareRetriever->>VectorDB: get_ancestor_chunks(file_id, limit)
VectorDB-->>RelationshipAwareRetriever: ancestor_results
end
RelationshipAwareRetriever->>Deduplicator: deduplicate(base + related + ancestors)
Deduplicator-->>RelationshipAwareRetriever: unique_results
RelationshipAwareRetriever-->>Client: expanded_results
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
e8e1863 to
d207cb7
Compare
c73abd0 to
128040f
Compare
128040f to
a6fa6f5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/content/docs/documentation/API.mdx (1)
249-250: Documentation may be inconsistent with implementation.The PR objectives state that
include_relatedandinclude_ancestorsparameters were "excluded from the/partition/{partition}/file/{file_id}endpoint." However, line 249 states "Query Parameters: Same as partition search," which would imply these parameters are available.Please verify and update the documentation to accurately reflect which parameters are supported for file-specific search.
🤖 Fix all issues with AI agents
In `@docs/content/docs/documentation/linked_files.md`:
- Around line 45-58: Several fenced code blocks in the linked_files.md examples
are missing language identifiers; update each fenced block by adding the
appropriate language tag (e.g., add ```text for the Email A (root) tree block,
```http for the GET /{partition}/relationships/{relationship_id} request block,
and ```json for the response block that starts with {"files": [...]}), and
similarly add ```json/```http/```text/```python where appropriate for the other
examples called out in the comment so markdownlint MD040 is satisfied.
In `@openrag/components/indexer/vectordb/utils.py`:
- Around line 66-74: The current to_dict (and similarly get_file_ancestors)
builds the dict with explicit keys first and then expands file_metadata,
allowing metadata to override authoritative fields; change the merge order so
metadata is merged first and the explicit fields (partition_name, file_id,
relationship_id, parent_id, etc.) are applied after (e.g., start from a copy of
file_metadata then set the explicit keys or use {**metadata, "partition": ...,
...}) to ensure the explicit column values win and avoid mutating the original
file_metadata.
In `@openrag/components/indexer/vectordb/vectordb.py`:
- Around line 1055-1088: get_ancestor_chunks currently assumes Milvus will
preserve the order of file IDs from
partition_file_manager.get_ancestor_file_ids, but query() does not guarantee
ordering for IN filters; after calling self._async_client.query(...) collect
results into a mapping keyed by res["file_id"] (or group by file_id) and then
iterate ancestor_file_ids in order to extend the final list of Documents so the
output order matches ancestor_file_ids; ensure you apply the limit consistently
(e.g., stop adding Documents once limit is reached) and construct Document(...)
with the same metadata filtering used now.
In `@openrag/routers/partition.py`:
- Around line 392-426: The route for get_related_files currently uses
"/{partition}/relationships/{relationship_id}" which only captures a single path
segment and will fail for relationship IDs containing slashes (e.g., folder
paths); update the route decorator on get_related_files to use the path
converter by changing the route to
"/{partition}/relationships/{relationship_id:path}" so FastAPI captures the full
relationship_id including slashes, leaving the handler signature (partition,
relationship_id) and dependencies (get_vectordb, require_partition_viewer)
unchanged.
In `@pyproject.toml`:
- Line 52: Replace the exact-version dependency string "pymilvus==2.5.12" with a
minimum-version constraint to match project conventions; update the entry to
either "pymilvus>=2.5.12" or, preferably, bump to the current stable
"pymilvus>=2.6.7" so the project uses a minimum version (not an exact pin) and
can pick up the latest fixes.
In `@tests/api_tests/test_search.py`:
- Around line 163-169: The test currently builds relationship_ids by filtering
out docs missing relationship_id, so missing fields can be ignored; update the
assertions to first assert every document in data_with["documents"] includes
"relationship_id" in its metadata (e.g., collect missing = [doc for doc in
data_with["documents"] if "relationship_id" not in doc["metadata"]] and assert
missing is empty), then build relationship_ids from metadata["relationship_id"]
for all docs and assert len(relationship_ids) == 1 to enforce both presence and
consistency; refer to the variables relationship_ids and data_with["documents"]
when making the checks.
- Around line 48-84: The indexed_folder_partition fixture can proceed when the
indexer returns no task info or when polling never reaches SUCCESS, causing
flaky tests; update the fixture (indexed_folder_partition) to: 1) check
api_client.post/response status and fail/skip if non-2xx, 2) when the response
contains neither task_status_url nor task_id immediately skip or raise so we
don't proceed blindly, 3) extend and enforce the polling timeout in the existing
loop that polls task_path (used for task_status_url/task_id) and after the loop
verify the final state is SUCCESS (otherwise pytest.skip with the task_data),
and 4) ensure consistent handling of task_state values in the break/skip logic
so indexing must complete before returning created_partition.
🧹 Nitpick comments (5)
openrag/components/retriever.py (1)
183-241: Deduplicate file_ids before ancestor expansion.
file_infoscollects a tuple per chunk; if multiple chunks come from the same file,get_ancestor_chunkscan be called repeatedly for the same file. A small dedupe saves extra DB work.♻️ Suggested refactor
- file_infos = [] # List of (partition, file_id) tuples + file_infos = set() # Unique (partition, file_id) tuples ... - if self.include_ancestors: - file_infos.append((metadata.get("partition"), metadata.get("file_id"))) + if self.include_ancestors: + file_infos.add((metadata.get("partition"), metadata.get("file_id")))docs/content/docs/documentation/linked_files.md (1)
270-276: Optional: soften “very” per style lint.LanguageTool flagged the phrase “very large threads.” Consider “large threads” instead.
openrag/routers/search.py (3)
44-54: Deduplicate ancestor lookups per file.
resultsare chunk-level, so the same file can repeat and trigger redundant ancestor fetches. A set keeps calls proportional to unique files.♻️ Proposed refactor
- file_infos = [] # List of (partition, file_id) tuples + file_infos = set() # Unique (partition, file_id) tuples @@ - if include_ancestors: - file_infos.append((metadata.get("partition"), metadata.get("file_id"))) + if include_ancestors: + file_infos.add((metadata.get("partition"), metadata.get("file_id")))Also applies to: 78-81
102-185: Remove the duplicate_expand_with_related_chunksdefinition.Two identical copies make maintenance error-prone; keep a single source of truth.
376-408: Skip expansion (and vectordb dependency) when flags are fixed false.This endpoint hardcodes the flags to false, so the helper is a no-op but still pulls the dependency and logs an expansion. Consider removing the call (or making expansion configurable here).
♻️ Proposed refactor
- vectordb=Depends(get_vectordb), @@ - # Expand with related/ancestor chunks if requested - results = await _expand_with_related_chunks( - results=results, - vectordb=vectordb, - include_related=False, - include_ancestors=False, - ) - - log.info( - "Expanded results with related/ancestor chunks.", - results=len(results), - )
| @router.get( | ||
| "/{partition}/relationships/{relationship_id}", | ||
| description="""Get all files in a relationship group. | ||
|
|
||
| **Parameters:** | ||
| - `partition`: The partition name | ||
| - `relationship_id`: The relationship group identifier (e.g., email thread ID, folder path) | ||
|
|
||
| **Response:** | ||
| Returns all files that share the same relationship_id: | ||
| - `files`: List of file objects with metadata | ||
|
|
||
| **Use Cases:** | ||
| - Get all emails in a thread | ||
| - Get all documents in a folder | ||
| - Get all related documents in a group | ||
|
|
||
| **Permissions:** | ||
| - Requires partition viewer role or higher | ||
| """, | ||
| ) | ||
| async def get_related_files( | ||
| request: Request, | ||
| partition: str, | ||
| relationship_id: str, | ||
| vectordb=Depends(get_vectordb), | ||
| partition_viewer=Depends(require_partition_viewer), | ||
| ): | ||
| log = logger.bind(partition=partition, relationship_id=relationship_id) | ||
| files = await vectordb.get_files_by_relationship.remote(partition=partition, relationship_id=relationship_id) | ||
| log.debug("Listed related files", file_count=len(files)) | ||
| return JSONResponse( | ||
| status_code=status.HTTP_200_OK, | ||
| content={"files": files}, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# First, examine the actual code in partition.py
head -n 435 openrag/routers/partition.py | tail -n 50Repository: linagora/openrag
Length of output: 1493
🏁 Script executed:
# Check if there are other similar routes in the file
rg "relationships|relationship_id" openrag/routers/partition.py -B 2 -A 5Repository: linagora/openrag
Length of output: 1041
🏁 Script executed:
# Search for any existing path converters in the codebase to see patterns
rg ":path" openrag/ -t pyRepository: linagora/openrag
Length of output: 42
🌐 Web query:
FastAPI path parameters slashes URL encoding {param:path} converter behavior
💡 Result:
-
Default behavior (
{param}/{param:str}): Starlette/FastAPI path params match only one path segment—they stop at the next/. (starlette.io)- Example:
/items/{id}matches/items/abcbut not/items/a/b.
- Example:
-
“Catch-all” behavior (
{param:path}): the:pathconverter matches the rest of the URL path, including additional/characters. (starlette.io)- Example:
/files/{file_path:path}matches/files/images/logo.pngand setsfile_path="images/logo.png". (fastapi.tiangolo.com) - If you need the captured value to start with
/, FastAPI’s docs note you can use a double slash like/files//home/johndoe/myfile.txt. (fastapi.tiangolo.com)
- Example:
-
About URL-encoding slashes (
%2F): in many setups, the incoming URL path is decoded before routing, so%2Fbecomes/and is treated as a path separator, which can prevent route matching (often resulting in 404). (stackoverflow.com)- Practical implication: you generally can’t rely on
%2Fto represent a “literal slash inside a single path parameter segment” in FastAPI/Starlette routing. If you truly need to transmit arbitrary strings that may contain/, prefer a query parameter or encode the whole value (e.g., base64url). One workaround sometimes used is double-encoding on the client and decoding twice on the server. (stackoverflow.com)
- Practical implication: you generally can’t rely on
Citations:
- 1: https://www.starlette.io/routing/?utm_source=openai
- 2: https://www.starlette.io/routing/?utm_source=openai
- 3: https://fastapi.tiangolo.com/vi/tutorial/path-params/?utm_source=openai
- 4: https://fastapi.tiangolo.com/vi/tutorial/path-params/?utm_source=openai
- 5: https://stackoverflow.com/questions/72801333/how-to-pass-url-as-a-path-parameter-to-a-fastapi-route?utm_source=openai
- 6: https://stackoverflow.com/questions/72801333/how-to-pass-url-as-a-path-parameter-to-a-fastapi-route?utm_source=openai
Use :path converter for relationship_id to support folder paths.
The route currently uses /{partition}/relationships/{relationship_id}, which only matches a single path segment. Relationship IDs like documents/2024/q1 (folder paths) will fail to match because the default path parameter stops at the first /. URL-encoding slashes as %2F won't work either—FastAPI decodes them before routing, treating them as path separators and resulting in 404 errors.
Change to /{partition}/relationships/{relationship_id:path} to capture the full relationship identifier including slashes.
🔧 Suggested route change
`@router.get`(
- "/{partition}/relationships/{relationship_id}",
+ "/{partition}/relationships/{relationship_id:path}",📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| @router.get( | |
| "/{partition}/relationships/{relationship_id}", | |
| description="""Get all files in a relationship group. | |
| **Parameters:** | |
| - `partition`: The partition name | |
| - `relationship_id`: The relationship group identifier (e.g., email thread ID, folder path) | |
| **Response:** | |
| Returns all files that share the same relationship_id: | |
| - `files`: List of file objects with metadata | |
| **Use Cases:** | |
| - Get all emails in a thread | |
| - Get all documents in a folder | |
| - Get all related documents in a group | |
| **Permissions:** | |
| - Requires partition viewer role or higher | |
| """, | |
| ) | |
| async def get_related_files( | |
| request: Request, | |
| partition: str, | |
| relationship_id: str, | |
| vectordb=Depends(get_vectordb), | |
| partition_viewer=Depends(require_partition_viewer), | |
| ): | |
| log = logger.bind(partition=partition, relationship_id=relationship_id) | |
| files = await vectordb.get_files_by_relationship.remote(partition=partition, relationship_id=relationship_id) | |
| log.debug("Listed related files", file_count=len(files)) | |
| return JSONResponse( | |
| status_code=status.HTTP_200_OK, | |
| content={"files": files}, | |
| ) | |
| `@router.get`( | |
| "/{partition}/relationships/{relationship_id:path}", | |
| description="""Get all files in a relationship group. | |
| **Parameters:** | |
| - `partition`: The partition name | |
| - `relationship_id`: The relationship group identifier (e.g., email thread ID, folder path) | |
| **Response:** | |
| Returns all files that share the same relationship_id: | |
| - `files`: List of file objects with metadata | |
| **Use Cases:** | |
| - Get all emails in a thread | |
| - Get all documents in a folder | |
| - Get all related documents in a group | |
| **Permissions:** | |
| - Requires partition viewer role or higher | |
| """, | |
| ) | |
| async def get_related_files( | |
| request: Request, | |
| partition: str, | |
| relationship_id: str, | |
| vectordb=Depends(get_vectordb), | |
| partition_viewer=Depends(require_partition_viewer), | |
| ): | |
| log = logger.bind(partition=partition, relationship_id=relationship_id) | |
| files = await vectordb.get_files_by_relationship.remote(partition=partition, relationship_id=relationship_id) | |
| log.debug("Listed related files", file_count=len(files)) | |
| return JSONResponse( | |
| status_code=status.HTTP_200_OK, | |
| content={"files": files}, | |
| ) |
🤖 Prompt for AI Agents
In `@openrag/routers/partition.py` around lines 392 - 426, The route for
get_related_files currently uses "/{partition}/relationships/{relationship_id}"
which only captures a single path segment and will fail for relationship IDs
containing slashes (e.g., folder paths); update the route decorator on
get_related_files to use the path converter by changing the route to
"/{partition}/relationships/{relationship_id:path}" so FastAPI captures the full
relationship_id including slashes, leaving the handler signature (partition,
relationship_id) and dependencies (get_vectordb, require_partition_viewer)
unchanged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In `@docs/content/docs/documentation/linked_files.md`:
- Around line 45-58: The fenced code blocks in the linked_files.md examples
(e.g., the block starting "Email A (root)" and the other sample blocks like
"Documents/2024/Q1/", "Search: \"budget report\"", and "Query → Hybrid Search →
Reranking → Data Expansion → LLM") are missing language identifiers; add a
language tag (suggest: text) after each opening ``` to satisfy markdownlint
MD040 and improve readability, updating every similar code fence in the file
(including the other occurrences noted in the review) so each begins with
```text.
In `@openrag/components/indexer/vectordb/utils.py`:
- Around line 619-628: get_file_ancestors is vulnerable to key collisions
because file_metadata is spread after the explicit fields so metadata can
overwrite file_id/partition/parent_id/relationship_id/depth; fix by spreading
row.file_metadata (or {}) first and then specifying the explicit keys (file_id,
partition, parent_id, relationship_id, depth) so the query-derived values
override any same-named keys in row.file_metadata; update the list comprehension
in utils.py accordingly and mirror the same change pattern used to fix to_dict()
to ensure consistent behavior.
- Around line 66-75: The to_dict method in the class uses metadata spreading
after explicit keys which allows keys in file_metadata to override actual column
values; modify to_dict so metadata is merged first and then explicit columns
override it (i.e., spread self.file_metadata or {} first, then set "partition":
self.partition_name, "file_id": self.file_id, "relationship_id":
self.relationship_id, "parent_id": self.parent_id) to ensure
partition_name/file_id/relationship_id/parent_id always take precedence.
In `@openrag/components/indexer/vectordb/vectordb.py`:
- Around line 1036-1053: The current code builds filter_expr via string
interpolation (file_id_list / filter_expr) and passes it to
self._async_client.query; replace this with a parameterized filter expression
(e.g., "partition == {partition} and file_id in {file_ids}") and call
self._async_client.query with filter_params={"partition": partition, "file_ids":
file_ids} instead of interpolating file_ids into a string so the query uses
Pymilvus templating and avoids injection/malformed-expression issues while
keeping collection_name, limit and output_fields the same.
In `@openrag/components/retriever.py`:
- Around line 190-203: The deduplication set seen_ids currently includes None
when doc.metadata lacks "_id", causing later docs without "_id" to be treated as
duplicates; update the comprehension that builds seen_ids to skip None (only add
ids when doc.metadata.get("_id") is not None) and likewise update any subsequent
dedup checks that compare metadata.get("_id") (in places that inspect chunks for
duplicates) to first check for a non-None id before using it; reference the
seen_ids variable and the chunks/doc.metadata usage, and preserve existing
behavior for include_related and include_ancestors logic while avoiding adding
or matching on None.
In `@openrag/routers/search.py`:
- Around line 17-184: There are two identical definitions of the async function
_expand_with_related_chunks which causes the second to shadow the first; remove
the duplicate definition so only a single async def
_expand_with_related_chunks(results, vectordb, include_related,
include_ancestors, related_limit=20) remains (the one containing seen_ids,
expanded_results, relationship_ids, file_infos and the calls to
vectordb.get_related_chunks.remote / get_ancestor_chunks.remote), and run tests
to ensure no references break.
In `@openrag/tests/test_relationships_integration.py`:
- Around line 18-24: Docstring in class TestFolderScenario contains a missing
closing parenthesis in the phrase "(same relationship_id."; update the docstring
in TestFolderScenario to read "(same relationship_id)." so the sentence is
properly closed and punctuation is correct.
🧹 Nitpick comments (6)
tests/api_tests/test_search.py (1)
48-84: Consider extracting shared task-polling logic to reduce duplication.The task polling logic (lines 64-82) is nearly identical to the existing
indexed_partitionfixture (lines 26-46). Consider extracting this into a helper function to improve maintainability.♻️ Suggested helper extraction
def _wait_for_indexing_task(api_client, data, context_info=""): """Wait for an indexing task to complete.""" if "task_status_url" in data: task_url = data["task_status_url"] task_path = "/" + "/".join(task_url.split("/")[3:]) elif "task_id" in data: task_path = f"/indexer/task/{data['task_id']}" else: time.sleep(3) return True for _ in range(30): task_response = api_client.get(task_path) task_data = task_response.json() state = task_data.get("task_state", "") if state in ["SUCCESS", "COMPLETED", "success", "completed"]: return True elif state in ["FAILED", "failed", "FAILURE", "failure"]: pytest.skip(f"Indexing failed{context_info}: {task_data}") time.sleep(2) return FalseThen use it in both fixtures to reduce code duplication.
openrag/components/retriever.py (1)
227-248: Consider deduplicating file_infos to avoid redundant ancestor fetches.If multiple chunks from the same file exist in the initial results,
file_infoswill contain duplicate(partition, file_id)tuples, causing redundant calls toget_ancestor_chunks.♻️ Suggested optimization
# Collect unique relationship_ids and file_ids from results relationship_ids = set() - file_infos = [] # List of (partition, file_id) tuples + file_infos = set() # Set of (partition, file_id) tuples for doc in chunks: metadata = doc.metadata if self.include_related and metadata.get("relationship_id"): relationship_ids.add((metadata.get("partition"), metadata.get("relationship_id"))) if self.include_ancestors: - file_infos.append((metadata.get("partition"), metadata.get("file_id"))) + partition = metadata.get("partition") + file_id = metadata.get("file_id") + if partition and file_id: + file_infos.add((partition, file_id))openrag/routers/partition.py (1)
413-426: Unusedrequestparameter in endpoint.The
request: Requestparameter on line 414 is not used in the function body. Consider removing it if not needed for future use.♻️ Suggested cleanup
async def get_related_files( - request: Request, partition: str, relationship_id: str, vectordb=Depends(get_vectordb), partition_viewer=Depends(require_partition_viewer), ):openrag/routers/search.py (2)
376-408: Remove unnecessary expansion code from file-scoped search.The
search_fileendpoint addsvectordbdependency and calls_expand_with_related_chunkswith hardcodedinclude_related=False, include_ancestors=False, which is a no-op (the function returns early). The misleading log message at lines 405-408 claims expansion occurred when it didn't.Either:
- Remove the unused code if expansion isn't intended for file-scoped search
- Add
include_relatedandinclude_ancestorsquery parameters if the feature should be supported♻️ Proposed fix (Option 1 - remove dead code)
async def search_file( request: Request, partition: str, file_id: str, text: str = Query(..., description="Text to search semantically"), top_k: int = Query(5, description="Number of top results to return"), indexer=Depends(get_indexer), - vectordb=Depends(get_vectordb), partition_viewer=Depends(require_partition_viewer), ): - log = logger.bind( - partition=partition, - file_id=file_id, - query=text, - top_k=top_k, - include_related=False, - include_ancestors=False, - ) + log = logger.bind(partition=partition, file_id=file_id, query=text, top_k=top_k) results = await indexer.asearch.remote(query=text, top_k=top_k, partition=partition, filter={"file_id": file_id}) log.info("Semantic search on specific file completed.", result_count=len(results)) - # Expand with related/ancestor chunks if requested - results = await _expand_with_related_chunks( - results=results, - vectordb=vectordb, - include_related=False, - include_ancestors=False, - ) - - log.info( - "Expanded results with related/ancestor chunks.", - results=len(results), - ) - documents = [
40-42: Filter out None values when building seen_ids set.If any document lacks
_idin metadata,Nonegets added toseen_ids. While the check at line 67 prevents adding chunks without_id, havingNonein the set could mask issues.♻️ Suggested improvement
- seen_ids = {doc.metadata.get("_id") for doc in results} + seen_ids = {doc.metadata.get("_id") for doc in results if doc.metadata.get("_id")}openrag/tests/test_relationships_integration.py (1)
334-340: Consider using pytest.importorskip or a fixture for conditional imports.The try/except pattern for skipping tests when hydra config is unavailable is repeated in multiple tests. Consider consolidating:
♻️ Suggested improvement using pytest.importorskip
`@pytest.fixture` def relationship_aware_retriever_cls(): """Import RelationshipAwareRetriever, skip test if unavailable.""" return pytest.importorskip( "components.retriever", reason="Requires hydra config to be available" ).RelationshipAwareRetrieverThen inject this fixture into tests that need it.
| def to_dict(self): | ||
| metadata = self.file_metadata or {} | ||
| d = {"partition": self.partition_name, "file_id": self.file_id, **metadata} | ||
| d = { | ||
| "partition": self.partition_name, | ||
| "file_id": self.file_id, | ||
| "relationship_id": self.relationship_id, | ||
| "parent_id": self.parent_id, | ||
| **metadata, | ||
| } | ||
| return d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential key collision in to_dict() when spreading metadata.
If file_metadata contains keys named relationship_id or parent_id, they would overwrite the actual column values since **metadata comes after the explicit keys.
🐛 Proposed fix - ensure column values take precedence
def to_dict(self):
metadata = self.file_metadata or {}
+ # Remove potential conflicts - column values should take precedence
+ metadata = {k: v for k, v in metadata.items() if k not in ("partition", "file_id", "relationship_id", "parent_id")}
d = {
"partition": self.partition_name,
"file_id": self.file_id,
"relationship_id": self.relationship_id,
"parent_id": self.parent_id,
**metadata,
}
return dAlternatively, spread metadata first and override with column values:
def to_dict(self):
metadata = self.file_metadata or {}
d = {
**metadata,
"partition": self.partition_name,
"file_id": self.file_id,
"relationship_id": self.relationship_id,
"parent_id": self.parent_id,
}
return d🤖 Prompt for AI Agents
In `@openrag/components/indexer/vectordb/utils.py` around lines 66 - 75, The
to_dict method in the class uses metadata spreading after explicit keys which
allows keys in file_metadata to override actual column values; modify to_dict so
metadata is merged first and then explicit columns override it (i.e., spread
self.file_metadata or {} first, then set "partition": self.partition_name,
"file_id": self.file_id, "relationship_id": self.relationship_id, "parent_id":
self.parent_id) to ensure partition_name/file_id/relationship_id/parent_id
always take precedence.
| # Build filter expression for Milvus query | ||
| file_id_list = ", ".join(f'"{fid}"' for fid in file_ids) | ||
| filter_expr = f'partition == "{partition}" and file_id in [{file_id_list}]' | ||
|
|
||
| results = await self._async_client.query( | ||
| collection_name=self.collection_name, | ||
| filter=filter_expr, | ||
| limit=limit, | ||
| output_fields=["*"], | ||
| ) | ||
|
|
||
| return [ | ||
| Document( | ||
| page_content=res["text"], | ||
| metadata={k: v for k, v in res.items() if k not in ["text", "vector"]}, | ||
| ) | ||
| for res in results | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# First, let's find and examine the vectordb.py file
find . -name "vectordb.py" -path "*/vectordb/*" | head -5Repository: linagora/openrag
Length of output: 109
🏁 Script executed:
# Let's check the file structure to understand the context better
git ls-files | grep -E "vectordb|indexer" | head -20Repository: linagora/openrag
Length of output: 962
🌐 Web query:
Pymilvus parameterized queries filter expression injection prevention
💡 Result:
To prevent filter expression injection in PyMilvus, avoid building expr/filter strings via concatenation (e.g., f-strings) with untrusted input. Instead:
1) Use Milvus “filter templating” (parameterized filters)
Milvus supports placeholders in the filter expression plus a filter_params dict, for search/query/delete. This is the closest equivalent to parameterized queries in PyMilvus. Build a static template and pass user values only through filter_params. [1]
expr = "age > {age} AND city IN {cities}"
params = {"age": user_age, "cities": user_cities} # user_* are untrusted
res = client.query(
"people",
filter=expr,
output_fields=["age", "city"],
filter_params=params,
)2) If you must interpolate strings, follow Milvus escape rules
When you need string literals in expressions, you must escape quotes/backslashes per Milvus’ documented rules (e.g., \\' inside single-quoted strings, \\" inside double-quoted strings, etc.). [2]
3) Treat the expression text as code: constrain it
Even with templating, don’t let users control:
- field names
- operators (
==,IN, etc.) - parentheses/boolean structure
Whitelist allowed fields/operators and map user choices to known-safe templates, because expr/filter is ultimately a boolean expression string evaluated by Milvus. [3]
Sources: Milvus “Filter Templating” docs (placeholders + filter_params) [1], Milvus filter escape rules [2], PyMilvus query API requiring a boolean expression string [3].
Citations: [1] (milvus.io) [2] (blog.milvus.io) [3] (milvus.io)
Citations:
- 1: https://milvus.io/docs/pt/filtering-templating.md
- 2: https://blog.milvus.io/docs/keyword-match.md?utm_source=openai
- 3: https://milvus.io/api-reference/pymilvus/v2.6.x/ORM/Collection/query.md?utm_source=openai
🌐 Web query:
Milvus filter expression syntax special characters escaping
💡 Result:
In Milvus filter expressions, special-character escaping is only relevant inside string constants (strings wrapped in single quotes '...' or double quotes "..."). Milvus uses backslash escape sequences inside those strings:
-
Escape backslash and control escapes:
\\→ literal\\\t→ tab\\n→ newline [1]
-
If your string is single-quoted (
'...'):- To include a literal single quote:
\\'
Example:'It\\'s milvus'[1] - A double quote can be written as
"or\\"[1]
- To include a literal single quote:
-
If your string is double-quoted (
"..."):- To include a literal double quote:
\\"
Example:"He said \\"Hi\\""[1] - A single quote can be written as
'or\\'[1]
- To include a literal double quote:
Related (JSON fields): Milvus stores JSON string values “as-is”, but unescaped quote characters that terminate the surrounding string literal make the value invalid (e.g., 'a'b' or "a"b" are invalid), while properly escaped variants like 'a"b', "a'b", 'a\\\\'b', "a\\\\"b" are valid [2]
Common gotcha: if you build the filter expression inside another language string (Python/Java/JSON), you may need to escape twice (once for your language literal, once for Milvus). [1]
Sources: [1] Milvus docs “Text Match / Escape rules in filter expressions”; [2] Milvus docs “Use JSON Fields / notes on string values and escaping”.
🏁 Script executed:
# Let's examine the vectordb.py file around the lines mentioned in the review
wc -l ./openrag/components/indexer/vectordb/vectordb.pyRepository: linagora/openrag
Length of output: 114
🏁 Script executed:
# Read the specific section mentioned in the review
sed -n '1000,1100p' ./openrag/components/indexer/vectordb/vectordb.pyRepository: linagora/openrag
Length of output: 3556
🏁 Script executed:
# Also look for other usages of filter expressions in this file to understand the pattern
rg -n "filter_expr|filter=" ./openrag/components/indexer/vectordb/vectordb.py | head -20Repository: linagora/openrag
Length of output: 872
Use filter_params to prevent filter expression injection instead of string interpolation.
Lines 1037–1038 and 1072–1073 build filter expressions via string interpolation. While file_ids from partition_file_manager are from a trusted source, Pymilvus supports filter templating with filter_params for safer parameterized queries. The codebase already uses this pattern elsewhere (e.g., lines 637, 829).
Replace the string interpolation with:
filter_expression = "partition == {partition} and file_id in {file_ids}"
results = await self._async_client.query(
collection_name=self.collection_name,
filter=filter_expression,
filter_params={"partition": partition, "file_ids": file_ids},
limit=limit,
output_fields=["*"],
)This eliminates the risk of malformed expressions if a file_id contains special characters and aligns with the safer patterns already in use throughout the module.
🤖 Prompt for AI Agents
In `@openrag/components/indexer/vectordb/vectordb.py` around lines 1036 - 1053,
The current code builds filter_expr via string interpolation (file_id_list /
filter_expr) and passes it to self._async_client.query; replace this with a
parameterized filter expression (e.g., "partition == {partition} and file_id in
{file_ids}") and call self._async_client.query with filter_params={"partition":
partition, "file_ids": file_ids} instead of interpolating file_ids into a string
so the query uses Pymilvus templating and avoids injection/malformed-expression
issues while keeping collection_name, limit and output_fields the same.
Add relationship_id and parent_id fields to support document linking:
- relationship_id: Groups related documents (email threads, folders)
- parent_id: Hierarchical parent reference (parent email, parent folder)
Changes:
- Add SQLAlchemy columns and indexes to File model
- Add Alembic migration for new database columns
- Add PartitionFileManager query methods with recursive CTE for ancestors
- Add VectorDB wrapper methods and Milvus INVERTED indexes
- Add API endpoints: GET /{partition}/relationships/{id} and ancestors
- Add include_related/include_ancestors params to search endpoints
- Add RelationshipAwareRetriever for context-aware retrieval
- Add unit tests (14) and integration tests (11)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
a6fa6f5 to
0aea2ce
Compare
Relationship-Aware Document Retrieval
This PR adds support for document relationships in OpenRAG, enabling context-aware retrieval across linked files such as email threads and folder-related document groups.
Key Changes
Introduced a new
relationshipAwareretriever configuration with Hydra defaults.Extended the
Filedatabase model withrelationship_idandparent_id, plus supporting indexes.Added new VectorDB utilities for:
Updated indexing to include inverted indexes for relationship fields.
Expanded API documentation with new parameters:
include_relatedinclude_ancestorsAdded a dedicated documentation page describing the relationship model and usage.
Outcome
OpenRAG can now retrieve not only relevant chunks, but also related and contextual documents, improving performance on threaded and grouped datasets.
Summary by CodeRabbit
Release Notes
New Features
include_relatedandinclude_ancestorssearch parameters to expand results with contextually-linked documentsDocumentation
✏️ Tip: You can customize this high-level summary in your review settings.