Skip to content

Conversation

@Ahmath-Gadji
Copy link
Collaborator

@Ahmath-Gadji Ahmath-Gadji commented Jan 20, 2026

Relationship-Aware Document Retrieval

This PR adds support for document relationships in OpenRAG, enabling context-aware retrieval across linked files such as email threads and folder-related document groups.

Key Changes

  • Introduced a new relationshipAware retriever configuration with Hydra defaults.

  • Extended the File database model with relationship_id and parent_id, plus supporting indexes.

  • Added new VectorDB utilities for:

    • Fetching documents by relationship group
    • Traversing ancestor chains via recursive SQL CTEs
  • Updated indexing to include inverted indexes for relationship fields.

  • Expanded API documentation with new parameters:

    • include_related
    • include_ancestors
  • Added a dedicated documentation page describing the relationship model and usage.

Outcome

OpenRAG can now retrieve not only relevant chunks, but also related and contextual documents, improving performance on threaded and grouped datasets.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added relationship-aware retriever for intelligent document context expansion
    • Added include_related and include_ancestors search parameters to expand results with contextually-linked documents
    • Added new endpoints to retrieve documents by relationship and view ancestor document chains
    • Support for organizing documents through relationships, enabling folder grouping and email thread hierarchies
  • Documentation

    • Added comprehensive guides for relationship modeling and linked file management

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Jan 20, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

  • 🔍 Trigger a full review
📝 Walkthrough

Walkthrough

Introduces document relationship support to OpenRAG by adding relationship_id and parent_id fields to track document groupings (e.g., folders, email threads) and hierarchies. A new RelationshipAwareRetriever expands search results with related and ancestor documents. Includes database schema changes, new API endpoints, retriever implementation, and comprehensive tests.

Changes

Cohort / File(s) Summary
Configuration
\.hydra_config/config\.yaml, \.hydra_config/retriever/relationshipAware\.yaml
Updated default retriever to "relationshipAware"; added new retriever config with relationship-aware options (related_limit, include_related, include_ancestors) sourced from environment variables
Database Schema & Model
openrag/components/indexer/vectordb/utils\.py, openrag/components/indexer/vectordb/vectordb\.py, openrag/scripts/migrations/alembic/versions/a1b2c3d4e5f6_add_document_relationships\.py
Added relationship_id and parent_id columns with indexes to File model; introduced traversal methods (get_files_by_relationship, get_file_ancestors, get_ancestor_chunks, get_related_chunks); created Alembic migration with upgrade/downgrade logic for schema changes
Retriever Implementation
openrag/components/retriever\.py
Added RelationshipAwareRetriever class extending base retriever to optionally expand results with related and ancestor chunks; integrated into RetrieverFactory registry
API Endpoints
openrag/routers/partition\.py, openrag/routers/search\.py
Added GET endpoints for retrieving files by relationship and ancestors; extended search endpoints with include_related and include_ancestors query parameters; introduced internal helper for expanding results with deduplicated related/ancestor chunks
Documentation
docs/content/docs/documentation/API\.mdx, docs/content/docs/documentation/linked_files\.md
Added API documentation for relationship metadata during upload, semantic search parameters, and usage examples; created comprehensive linked_files guide covering relationship paradigms, endpoints, and use cases
Unit & Integration Tests
openrag/components/test_relationships\.py, openrag/tests/test_relationships_integration\.py, tests/api_tests/conftest\.py, tests/api_tests/test_search\.py
Added comprehensive test suite covering file relationship operations, ancestor traversal with recursive CTE, folder and email thread scenarios, retriever integration tests with mocked VectorDB, and API search tests
Package Configuration
pyproject\.toml, openrag/tests/__init__\.py
Replaced langchain-milvus dependency with pymilvus==2.5.12; added integration tests package marker

Sequence Diagram

sequenceDiagram
    participant Client
    participant RelationshipAwareRetriever
    participant VectorDB
    participant Deduplicator

    Client->>RelationshipAwareRetriever: search(query, include_related, include_ancestors)
    RelationshipAwareRetriever->>VectorDB: semantic_search(query)
    VectorDB-->>RelationshipAwareRetriever: base_results
    
    alt include_related enabled
        RelationshipAwareRetriever->>VectorDB: get_related_chunks(relationship_id, limit)
        VectorDB-->>RelationshipAwareRetriever: related_results
    end
    
    alt include_ancestors enabled
        RelationshipAwareRetriever->>VectorDB: get_ancestor_chunks(file_id, limit)
        VectorDB-->>RelationshipAwareRetriever: ancestor_results
    end
    
    RelationshipAwareRetriever->>Deduplicator: deduplicate(base + related + ancestors)
    Deduplicator-->>RelationshipAwareRetriever: unique_results
    RelationshipAwareRetriever-->>Client: expanded_results
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • dodekapod

Poem

🐰 Hops through folders, threads align,
Relationships in schema shine!
Parents link to children dear,
Ancestors appear so clear.
Search expands with bonds so fine! 🌟

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 79.52% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Feat/relationship2' is vague and uses a generic branch name format without clearly describing the feature or primary change being implemented. Consider renaming to a more descriptive title such as 'Add relationship-aware document retrieval with folder and thread support' or similar to clearly convey the feature's purpose.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Ahmath-Gadji Ahmath-Gadji force-pushed the feat/relationship2 branch 5 times, most recently from e8e1863 to d207cb7 Compare January 26, 2026 13:05
@Ahmath-Gadji Ahmath-Gadji marked this pull request as ready for review January 28, 2026 14:51
@coderabbitai coderabbitai bot added enhancement New feature or request feat Add a new feature labels Jan 28, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/content/docs/documentation/API.mdx (1)

249-250: Documentation may be inconsistent with implementation.

The PR objectives state that include_related and include_ancestors parameters were "excluded from the /partition/{partition}/file/{file_id} endpoint." However, line 249 states "Query Parameters: Same as partition search," which would imply these parameters are available.

Please verify and update the documentation to accurately reflect which parameters are supported for file-specific search.

🤖 Fix all issues with AI agents
In `@docs/content/docs/documentation/linked_files.md`:
- Around line 45-58: Several fenced code blocks in the linked_files.md examples
are missing language identifiers; update each fenced block by adding the
appropriate language tag (e.g., add ```text for the Email A (root) tree block,
```http for the GET /{partition}/relationships/{relationship_id} request block,
and ```json for the response block that starts with {"files": [...]}), and
similarly add ```json/```http/```text/```python where appropriate for the other
examples called out in the comment so markdownlint MD040 is satisfied.

In `@openrag/components/indexer/vectordb/utils.py`:
- Around line 66-74: The current to_dict (and similarly get_file_ancestors)
builds the dict with explicit keys first and then expands file_metadata,
allowing metadata to override authoritative fields; change the merge order so
metadata is merged first and the explicit fields (partition_name, file_id,
relationship_id, parent_id, etc.) are applied after (e.g., start from a copy of
file_metadata then set the explicit keys or use {**metadata, "partition": ...,
...}) to ensure the explicit column values win and avoid mutating the original
file_metadata.

In `@openrag/components/indexer/vectordb/vectordb.py`:
- Around line 1055-1088: get_ancestor_chunks currently assumes Milvus will
preserve the order of file IDs from
partition_file_manager.get_ancestor_file_ids, but query() does not guarantee
ordering for IN filters; after calling self._async_client.query(...) collect
results into a mapping keyed by res["file_id"] (or group by file_id) and then
iterate ancestor_file_ids in order to extend the final list of Documents so the
output order matches ancestor_file_ids; ensure you apply the limit consistently
(e.g., stop adding Documents once limit is reached) and construct Document(...)
with the same metadata filtering used now.

In `@openrag/routers/partition.py`:
- Around line 392-426: The route for get_related_files currently uses
"/{partition}/relationships/{relationship_id}" which only captures a single path
segment and will fail for relationship IDs containing slashes (e.g., folder
paths); update the route decorator on get_related_files to use the path
converter by changing the route to
"/{partition}/relationships/{relationship_id:path}" so FastAPI captures the full
relationship_id including slashes, leaving the handler signature (partition,
relationship_id) and dependencies (get_vectordb, require_partition_viewer)
unchanged.

In `@pyproject.toml`:
- Line 52: Replace the exact-version dependency string "pymilvus==2.5.12" with a
minimum-version constraint to match project conventions; update the entry to
either "pymilvus>=2.5.12" or, preferably, bump to the current stable
"pymilvus>=2.6.7" so the project uses a minimum version (not an exact pin) and
can pick up the latest fixes.

In `@tests/api_tests/test_search.py`:
- Around line 163-169: The test currently builds relationship_ids by filtering
out docs missing relationship_id, so missing fields can be ignored; update the
assertions to first assert every document in data_with["documents"] includes
"relationship_id" in its metadata (e.g., collect missing = [doc for doc in
data_with["documents"] if "relationship_id" not in doc["metadata"]] and assert
missing is empty), then build relationship_ids from metadata["relationship_id"]
for all docs and assert len(relationship_ids) == 1 to enforce both presence and
consistency; refer to the variables relationship_ids and data_with["documents"]
when making the checks.
- Around line 48-84: The indexed_folder_partition fixture can proceed when the
indexer returns no task info or when polling never reaches SUCCESS, causing
flaky tests; update the fixture (indexed_folder_partition) to: 1) check
api_client.post/response status and fail/skip if non-2xx, 2) when the response
contains neither task_status_url nor task_id immediately skip or raise so we
don't proceed blindly, 3) extend and enforce the polling timeout in the existing
loop that polls task_path (used for task_status_url/task_id) and after the loop
verify the final state is SUCCESS (otherwise pytest.skip with the task_data),
and 4) ensure consistent handling of task_state values in the break/skip logic
so indexing must complete before returning created_partition.
🧹 Nitpick comments (5)
openrag/components/retriever.py (1)

183-241: Deduplicate file_ids before ancestor expansion.

file_infos collects a tuple per chunk; if multiple chunks come from the same file, get_ancestor_chunks can be called repeatedly for the same file. A small dedupe saves extra DB work.

♻️ Suggested refactor
-        file_infos = []  # List of (partition, file_id) tuples
+        file_infos = set()  # Unique (partition, file_id) tuples
...
-            if self.include_ancestors:
-                file_infos.append((metadata.get("partition"), metadata.get("file_id")))
+            if self.include_ancestors:
+                file_infos.add((metadata.get("partition"), metadata.get("file_id")))
docs/content/docs/documentation/linked_files.md (1)

270-276: Optional: soften “very” per style lint.

LanguageTool flagged the phrase “very large threads.” Consider “large threads” instead.

openrag/routers/search.py (3)

44-54: Deduplicate ancestor lookups per file.

results are chunk-level, so the same file can repeat and trigger redundant ancestor fetches. A set keeps calls proportional to unique files.

♻️ Proposed refactor
-    file_infos = []  # List of (partition, file_id) tuples
+    file_infos = set()  # Unique (partition, file_id) tuples
@@
-        if include_ancestors:
-            file_infos.append((metadata.get("partition"), metadata.get("file_id")))
+        if include_ancestors:
+            file_infos.add((metadata.get("partition"), metadata.get("file_id")))

Also applies to: 78-81


102-185: Remove the duplicate _expand_with_related_chunks definition.

Two identical copies make maintenance error-prone; keep a single source of truth.


376-408: Skip expansion (and vectordb dependency) when flags are fixed false.

This endpoint hardcodes the flags to false, so the helper is a no-op but still pulls the dependency and logs an expansion. Consider removing the call (or making expansion configurable here).

♻️ Proposed refactor
-    vectordb=Depends(get_vectordb),
@@
-    # Expand with related/ancestor chunks if requested
-    results = await _expand_with_related_chunks(
-        results=results,
-        vectordb=vectordb,
-        include_related=False,
-        include_ancestors=False,
-    )
-
-    log.info(
-        "Expanded results with related/ancestor chunks.",
-        results=len(results),
-    )

Comment on lines +392 to +449
@router.get(
"/{partition}/relationships/{relationship_id}",
description="""Get all files in a relationship group.

**Parameters:**
- `partition`: The partition name
- `relationship_id`: The relationship group identifier (e.g., email thread ID, folder path)

**Response:**
Returns all files that share the same relationship_id:
- `files`: List of file objects with metadata

**Use Cases:**
- Get all emails in a thread
- Get all documents in a folder
- Get all related documents in a group

**Permissions:**
- Requires partition viewer role or higher
""",
)
async def get_related_files(
request: Request,
partition: str,
relationship_id: str,
vectordb=Depends(get_vectordb),
partition_viewer=Depends(require_partition_viewer),
):
log = logger.bind(partition=partition, relationship_id=relationship_id)
files = await vectordb.get_files_by_relationship.remote(partition=partition, relationship_id=relationship_id)
log.debug("Listed related files", file_count=len(files))
return JSONResponse(
status_code=status.HTTP_200_OK,
content={"files": files},
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, examine the actual code in partition.py
head -n 435 openrag/routers/partition.py | tail -n 50

Repository: linagora/openrag

Length of output: 1493


🏁 Script executed:

# Check if there are other similar routes in the file
rg "relationships|relationship_id" openrag/routers/partition.py -B 2 -A 5

Repository: linagora/openrag

Length of output: 1041


🏁 Script executed:

# Search for any existing path converters in the codebase to see patterns
rg ":path" openrag/ -t py

Repository: linagora/openrag

Length of output: 42


🌐 Web query:

FastAPI path parameters slashes URL encoding {param:path} converter behavior

💡 Result:

  • Default behavior ({param} / {param:str}): Starlette/FastAPI path params match only one path segment—they stop at the next /. (starlette.io)

    • Example: /items/{id} matches /items/abc but not /items/a/b.
  • “Catch-all” behavior ({param:path}): the :path converter matches the rest of the URL path, including additional / characters. (starlette.io)

    • Example: /files/{file_path:path} matches /files/images/logo.png and sets file_path="images/logo.png". (fastapi.tiangolo.com)
    • If you need the captured value to start with /, FastAPI’s docs note you can use a double slash like /files//home/johndoe/myfile.txt. (fastapi.tiangolo.com)
  • About URL-encoding slashes (%2F): in many setups, the incoming URL path is decoded before routing, so %2F becomes / and is treated as a path separator, which can prevent route matching (often resulting in 404). (stackoverflow.com)

    • Practical implication: you generally can’t rely on %2F to represent a “literal slash inside a single path parameter segment” in FastAPI/Starlette routing. If you truly need to transmit arbitrary strings that may contain /, prefer a query parameter or encode the whole value (e.g., base64url). One workaround sometimes used is double-encoding on the client and decoding twice on the server. (stackoverflow.com)

Citations:


Use :path converter for relationship_id to support folder paths.

The route currently uses /{partition}/relationships/{relationship_id}, which only matches a single path segment. Relationship IDs like documents/2024/q1 (folder paths) will fail to match because the default path parameter stops at the first /. URL-encoding slashes as %2F won't work either—FastAPI decodes them before routing, treating them as path separators and resulting in 404 errors.

Change to /{partition}/relationships/{relationship_id:path} to capture the full relationship identifier including slashes.

🔧 Suggested route change
`@router.get`(
-    "/{partition}/relationships/{relationship_id}",
+    "/{partition}/relationships/{relationship_id:path}",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@router.get(
"/{partition}/relationships/{relationship_id}",
description="""Get all files in a relationship group.
**Parameters:**
- `partition`: The partition name
- `relationship_id`: The relationship group identifier (e.g., email thread ID, folder path)
**Response:**
Returns all files that share the same relationship_id:
- `files`: List of file objects with metadata
**Use Cases:**
- Get all emails in a thread
- Get all documents in a folder
- Get all related documents in a group
**Permissions:**
- Requires partition viewer role or higher
""",
)
async def get_related_files(
request: Request,
partition: str,
relationship_id: str,
vectordb=Depends(get_vectordb),
partition_viewer=Depends(require_partition_viewer),
):
log = logger.bind(partition=partition, relationship_id=relationship_id)
files = await vectordb.get_files_by_relationship.remote(partition=partition, relationship_id=relationship_id)
log.debug("Listed related files", file_count=len(files))
return JSONResponse(
status_code=status.HTTP_200_OK,
content={"files": files},
)
`@router.get`(
"/{partition}/relationships/{relationship_id:path}",
description="""Get all files in a relationship group.
**Parameters:**
- `partition`: The partition name
- `relationship_id`: The relationship group identifier (e.g., email thread ID, folder path)
**Response:**
Returns all files that share the same relationship_id:
- `files`: List of file objects with metadata
**Use Cases:**
- Get all emails in a thread
- Get all documents in a folder
- Get all related documents in a group
**Permissions:**
- Requires partition viewer role or higher
""",
)
async def get_related_files(
request: Request,
partition: str,
relationship_id: str,
vectordb=Depends(get_vectordb),
partition_viewer=Depends(require_partition_viewer),
):
log = logger.bind(partition=partition, relationship_id=relationship_id)
files = await vectordb.get_files_by_relationship.remote(partition=partition, relationship_id=relationship_id)
log.debug("Listed related files", file_count=len(files))
return JSONResponse(
status_code=status.HTTP_200_OK,
content={"files": files},
)
🤖 Prompt for AI Agents
In `@openrag/routers/partition.py` around lines 392 - 426, The route for
get_related_files currently uses "/{partition}/relationships/{relationship_id}"
which only captures a single path segment and will fail for relationship IDs
containing slashes (e.g., folder paths); update the route decorator on
get_related_files to use the path converter by changing the route to
"/{partition}/relationships/{relationship_id:path}" so FastAPI captures the full
relationship_id including slashes, leaving the handler signature (partition,
relationship_id) and dependencies (get_vectordb, require_partition_viewer)
unchanged.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Fix all issues with AI agents
In `@docs/content/docs/documentation/linked_files.md`:
- Around line 45-58: The fenced code blocks in the linked_files.md examples
(e.g., the block starting "Email A (root)" and the other sample blocks like
"Documents/2024/Q1/", "Search: \"budget report\"", and "Query → Hybrid Search →
Reranking → Data Expansion → LLM") are missing language identifiers; add a
language tag (suggest: text) after each opening ``` to satisfy markdownlint
MD040 and improve readability, updating every similar code fence in the file
(including the other occurrences noted in the review) so each begins with
```text.

In `@openrag/components/indexer/vectordb/utils.py`:
- Around line 619-628: get_file_ancestors is vulnerable to key collisions
because file_metadata is spread after the explicit fields so metadata can
overwrite file_id/partition/parent_id/relationship_id/depth; fix by spreading
row.file_metadata (or {}) first and then specifying the explicit keys (file_id,
partition, parent_id, relationship_id, depth) so the query-derived values
override any same-named keys in row.file_metadata; update the list comprehension
in utils.py accordingly and mirror the same change pattern used to fix to_dict()
to ensure consistent behavior.
- Around line 66-75: The to_dict method in the class uses metadata spreading
after explicit keys which allows keys in file_metadata to override actual column
values; modify to_dict so metadata is merged first and then explicit columns
override it (i.e., spread self.file_metadata or {} first, then set "partition":
self.partition_name, "file_id": self.file_id, "relationship_id":
self.relationship_id, "parent_id": self.parent_id) to ensure
partition_name/file_id/relationship_id/parent_id always take precedence.

In `@openrag/components/indexer/vectordb/vectordb.py`:
- Around line 1036-1053: The current code builds filter_expr via string
interpolation (file_id_list / filter_expr) and passes it to
self._async_client.query; replace this with a parameterized filter expression
(e.g., "partition == {partition} and file_id in {file_ids}") and call
self._async_client.query with filter_params={"partition": partition, "file_ids":
file_ids} instead of interpolating file_ids into a string so the query uses
Pymilvus templating and avoids injection/malformed-expression issues while
keeping collection_name, limit and output_fields the same.

In `@openrag/components/retriever.py`:
- Around line 190-203: The deduplication set seen_ids currently includes None
when doc.metadata lacks "_id", causing later docs without "_id" to be treated as
duplicates; update the comprehension that builds seen_ids to skip None (only add
ids when doc.metadata.get("_id") is not None) and likewise update any subsequent
dedup checks that compare metadata.get("_id") (in places that inspect chunks for
duplicates) to first check for a non-None id before using it; reference the
seen_ids variable and the chunks/doc.metadata usage, and preserve existing
behavior for include_related and include_ancestors logic while avoiding adding
or matching on None.

In `@openrag/routers/search.py`:
- Around line 17-184: There are two identical definitions of the async function
_expand_with_related_chunks which causes the second to shadow the first; remove
the duplicate definition so only a single async def
_expand_with_related_chunks(results, vectordb, include_related,
include_ancestors, related_limit=20) remains (the one containing seen_ids,
expanded_results, relationship_ids, file_infos and the calls to
vectordb.get_related_chunks.remote / get_ancestor_chunks.remote), and run tests
to ensure no references break.

In `@openrag/tests/test_relationships_integration.py`:
- Around line 18-24: Docstring in class TestFolderScenario contains a missing
closing parenthesis in the phrase "(same relationship_id."; update the docstring
in TestFolderScenario to read "(same relationship_id)." so the sentence is
properly closed and punctuation is correct.
🧹 Nitpick comments (6)
tests/api_tests/test_search.py (1)

48-84: Consider extracting shared task-polling logic to reduce duplication.

The task polling logic (lines 64-82) is nearly identical to the existing indexed_partition fixture (lines 26-46). Consider extracting this into a helper function to improve maintainability.

♻️ Suggested helper extraction
def _wait_for_indexing_task(api_client, data, context_info=""):
    """Wait for an indexing task to complete."""
    if "task_status_url" in data:
        task_url = data["task_status_url"]
        task_path = "/" + "/".join(task_url.split("/")[3:])
    elif "task_id" in data:
        task_path = f"/indexer/task/{data['task_id']}"
    else:
        time.sleep(3)
        return True

    for _ in range(30):
        task_response = api_client.get(task_path)
        task_data = task_response.json()
        state = task_data.get("task_state", "")
        if state in ["SUCCESS", "COMPLETED", "success", "completed"]:
            return True
        elif state in ["FAILED", "failed", "FAILURE", "failure"]:
            pytest.skip(f"Indexing failed{context_info}: {task_data}")
        time.sleep(2)
    return False

Then use it in both fixtures to reduce code duplication.

openrag/components/retriever.py (1)

227-248: Consider deduplicating file_infos to avoid redundant ancestor fetches.

If multiple chunks from the same file exist in the initial results, file_infos will contain duplicate (partition, file_id) tuples, causing redundant calls to get_ancestor_chunks.

♻️ Suggested optimization
         # Collect unique relationship_ids and file_ids from results
         relationship_ids = set()
-        file_infos = []  # List of (partition, file_id) tuples
+        file_infos = set()  # Set of (partition, file_id) tuples

         for doc in chunks:
             metadata = doc.metadata
             if self.include_related and metadata.get("relationship_id"):
                 relationship_ids.add((metadata.get("partition"), metadata.get("relationship_id")))
             if self.include_ancestors:
-                file_infos.append((metadata.get("partition"), metadata.get("file_id")))
+                partition = metadata.get("partition")
+                file_id = metadata.get("file_id")
+                if partition and file_id:
+                    file_infos.add((partition, file_id))
openrag/routers/partition.py (1)

413-426: Unused request parameter in endpoint.

The request: Request parameter on line 414 is not used in the function body. Consider removing it if not needed for future use.

♻️ Suggested cleanup
 async def get_related_files(
-    request: Request,
     partition: str,
     relationship_id: str,
     vectordb=Depends(get_vectordb),
     partition_viewer=Depends(require_partition_viewer),
 ):
openrag/routers/search.py (2)

376-408: Remove unnecessary expansion code from file-scoped search.

The search_file endpoint adds vectordb dependency and calls _expand_with_related_chunks with hardcoded include_related=False, include_ancestors=False, which is a no-op (the function returns early). The misleading log message at lines 405-408 claims expansion occurred when it didn't.

Either:

  1. Remove the unused code if expansion isn't intended for file-scoped search
  2. Add include_related and include_ancestors query parameters if the feature should be supported
♻️ Proposed fix (Option 1 - remove dead code)
 async def search_file(
     request: Request,
     partition: str,
     file_id: str,
     text: str = Query(..., description="Text to search semantically"),
     top_k: int = Query(5, description="Number of top results to return"),
     indexer=Depends(get_indexer),
-    vectordb=Depends(get_vectordb),
     partition_viewer=Depends(require_partition_viewer),
 ):
-    log = logger.bind(
-        partition=partition,
-        file_id=file_id,
-        query=text,
-        top_k=top_k,
-        include_related=False,
-        include_ancestors=False,
-    )
+    log = logger.bind(partition=partition, file_id=file_id, query=text, top_k=top_k)
     results = await indexer.asearch.remote(query=text, top_k=top_k, partition=partition, filter={"file_id": file_id})
     log.info("Semantic search on specific file completed.", result_count=len(results))

-    # Expand with related/ancestor chunks if requested
-    results = await _expand_with_related_chunks(
-        results=results,
-        vectordb=vectordb,
-        include_related=False,
-        include_ancestors=False,
-    )
-
-    log.info(
-        "Expanded results with related/ancestor chunks.",
-        results=len(results),
-    )
-
     documents = [

40-42: Filter out None values when building seen_ids set.

If any document lacks _id in metadata, None gets added to seen_ids. While the check at line 67 prevents adding chunks without _id, having None in the set could mask issues.

♻️ Suggested improvement
-    seen_ids = {doc.metadata.get("_id") for doc in results}
+    seen_ids = {doc.metadata.get("_id") for doc in results if doc.metadata.get("_id")}
openrag/tests/test_relationships_integration.py (1)

334-340: Consider using pytest.importorskip or a fixture for conditional imports.

The try/except pattern for skipping tests when hydra config is unavailable is repeated in multiple tests. Consider consolidating:

♻️ Suggested improvement using pytest.importorskip
`@pytest.fixture`
def relationship_aware_retriever_cls():
    """Import RelationshipAwareRetriever, skip test if unavailable."""
    return pytest.importorskip(
        "components.retriever",
        reason="Requires hydra config to be available"
    ).RelationshipAwareRetriever

Then inject this fixture into tests that need it.

Comment on lines 66 to 75
def to_dict(self):
metadata = self.file_metadata or {}
d = {"partition": self.partition_name, "file_id": self.file_id, **metadata}
d = {
"partition": self.partition_name,
"file_id": self.file_id,
"relationship_id": self.relationship_id,
"parent_id": self.parent_id,
**metadata,
}
return d
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Potential key collision in to_dict() when spreading metadata.

If file_metadata contains keys named relationship_id or parent_id, they would overwrite the actual column values since **metadata comes after the explicit keys.

🐛 Proposed fix - ensure column values take precedence
     def to_dict(self):
         metadata = self.file_metadata or {}
+        # Remove potential conflicts - column values should take precedence
+        metadata = {k: v for k, v in metadata.items() if k not in ("partition", "file_id", "relationship_id", "parent_id")}
         d = {
             "partition": self.partition_name,
             "file_id": self.file_id,
             "relationship_id": self.relationship_id,
             "parent_id": self.parent_id,
             **metadata,
         }
         return d

Alternatively, spread metadata first and override with column values:

     def to_dict(self):
         metadata = self.file_metadata or {}
         d = {
             **metadata,
             "partition": self.partition_name,
             "file_id": self.file_id,
             "relationship_id": self.relationship_id,
             "parent_id": self.parent_id,
         }
         return d
🤖 Prompt for AI Agents
In `@openrag/components/indexer/vectordb/utils.py` around lines 66 - 75, The
to_dict method in the class uses metadata spreading after explicit keys which
allows keys in file_metadata to override actual column values; modify to_dict so
metadata is merged first and then explicit columns override it (i.e., spread
self.file_metadata or {} first, then set "partition": self.partition_name,
"file_id": self.file_id, "relationship_id": self.relationship_id, "parent_id":
self.parent_id) to ensure partition_name/file_id/relationship_id/parent_id
always take precedence.

Comment on lines +1036 to +1053
# Build filter expression for Milvus query
file_id_list = ", ".join(f'"{fid}"' for fid in file_ids)
filter_expr = f'partition == "{partition}" and file_id in [{file_id_list}]'

results = await self._async_client.query(
collection_name=self.collection_name,
filter=filter_expr,
limit=limit,
output_fields=["*"],
)

return [
Document(
page_content=res["text"],
metadata={k: v for k, v in res.items() if k not in ["text", "vector"]},
)
for res in results
]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's find and examine the vectordb.py file
find . -name "vectordb.py" -path "*/vectordb/*" | head -5

Repository: linagora/openrag

Length of output: 109


🏁 Script executed:

# Let's check the file structure to understand the context better
git ls-files | grep -E "vectordb|indexer" | head -20

Repository: linagora/openrag

Length of output: 962


🌐 Web query:

Pymilvus parameterized queries filter expression injection prevention

💡 Result:

To prevent filter expression injection in PyMilvus, avoid building expr/filter strings via concatenation (e.g., f-strings) with untrusted input. Instead:

1) Use Milvus “filter templating” (parameterized filters)

Milvus supports placeholders in the filter expression plus a filter_params dict, for search/query/delete. This is the closest equivalent to parameterized queries in PyMilvus. Build a static template and pass user values only through filter_params. [1]

expr = "age > {age} AND city IN {cities}"
params = {"age": user_age, "cities": user_cities}  # user_* are untrusted

res = client.query(
    "people",
    filter=expr,
    output_fields=["age", "city"],
    filter_params=params,
)

2) If you must interpolate strings, follow Milvus escape rules

When you need string literals in expressions, you must escape quotes/backslashes per Milvus’ documented rules (e.g., \\' inside single-quoted strings, \\" inside double-quoted strings, etc.). [2]

3) Treat the expression text as code: constrain it

Even with templating, don’t let users control:

  • field names
  • operators (==, IN, etc.)
  • parentheses/boolean structure

Whitelist allowed fields/operators and map user choices to known-safe templates, because expr/filter is ultimately a boolean expression string evaluated by Milvus. [3]

Sources: Milvus “Filter Templating” docs (placeholders + filter_params) [1], Milvus filter escape rules [2], PyMilvus query API requiring a boolean expression string [3].

Citations: [1] (milvus.io) [2] (blog.milvus.io) [3] (milvus.io)

Citations:


🌐 Web query:

Milvus filter expression syntax special characters escaping

💡 Result:

In Milvus filter expressions, special-character escaping is only relevant inside string constants (strings wrapped in single quotes '...' or double quotes "..."). Milvus uses backslash escape sequences inside those strings:

  • Escape backslash and control escapes:

    • \\ → literal \
    • \\t → tab
    • \\n → newline  [1]
  • If your string is single-quoted ('...'):

    • To include a literal single quote: \\'
      Example: 'It\\'s milvus'  [1]
    • A double quote can be written as " or \\"  [1]
  • If your string is double-quoted ("..."):

    • To include a literal double quote: \\"
      Example: "He said \\"Hi\\""  [1]
    • A single quote can be written as ' or \\'  [1]

Related (JSON fields): Milvus stores JSON string values “as-is”, but unescaped quote characters that terminate the surrounding string literal make the value invalid (e.g., 'a'b' or "a"b" are invalid), while properly escaped variants like 'a"b', "a'b", 'a\\\\'b', "a\\\\"b" are valid  [2]

Common gotcha: if you build the filter expression inside another language string (Python/Java/JSON), you may need to escape twice (once for your language literal, once for Milvus).  [1]

Sources: [1] Milvus docs “Text Match / Escape rules in filter expressions”; [2] Milvus docs “Use JSON Fields / notes on string values and escaping”.


🏁 Script executed:

# Let's examine the vectordb.py file around the lines mentioned in the review
wc -l ./openrag/components/indexer/vectordb/vectordb.py

Repository: linagora/openrag

Length of output: 114


🏁 Script executed:

# Read the specific section mentioned in the review
sed -n '1000,1100p' ./openrag/components/indexer/vectordb/vectordb.py

Repository: linagora/openrag

Length of output: 3556


🏁 Script executed:

# Also look for other usages of filter expressions in this file to understand the pattern
rg -n "filter_expr|filter=" ./openrag/components/indexer/vectordb/vectordb.py | head -20

Repository: linagora/openrag

Length of output: 872


Use filter_params to prevent filter expression injection instead of string interpolation.

Lines 1037–1038 and 1072–1073 build filter expressions via string interpolation. While file_ids from partition_file_manager are from a trusted source, Pymilvus supports filter templating with filter_params for safer parameterized queries. The codebase already uses this pattern elsewhere (e.g., lines 637, 829).

Replace the string interpolation with:

filter_expression = "partition == {partition} and file_id in {file_ids}"
results = await self._async_client.query(
    collection_name=self.collection_name,
    filter=filter_expression,
    filter_params={"partition": partition, "file_ids": file_ids},
    limit=limit,
    output_fields=["*"],
)

This eliminates the risk of malformed expressions if a file_id contains special characters and aligns with the safer patterns already in use throughout the module.

🤖 Prompt for AI Agents
In `@openrag/components/indexer/vectordb/vectordb.py` around lines 1036 - 1053,
The current code builds filter_expr via string interpolation (file_id_list /
filter_expr) and passes it to self._async_client.query; replace this with a
parameterized filter expression (e.g., "partition == {partition} and file_id in
{file_ids}") and call self._async_client.query with filter_params={"partition":
partition, "file_ids": file_ids} instead of interpolating file_ids into a string
so the query uses Pymilvus templating and avoids injection/malformed-expression
issues while keeping collection_name, limit and output_fields the same.

@Ahmath-Gadji Ahmath-Gadji marked this pull request as draft January 29, 2026 15:40
paultranvan and others added 3 commits January 29, 2026 20:47
Add relationship_id and parent_id fields to support document linking:
- relationship_id: Groups related documents (email threads, folders)
- parent_id: Hierarchical parent reference (parent email, parent folder)

Changes:
- Add SQLAlchemy columns and indexes to File model
- Add Alembic migration for new database columns
- Add PartitionFileManager query methods with recursive CTE for ancestors
- Add VectorDB wrapper methods and Milvus INVERTED indexes
- Add API endpoints: GET /{partition}/relationships/{id} and ancestors
- Add include_related/include_ancestors params to search endpoints
- Add RelationshipAwareRetriever for context-aware retrieval
- Add unit tests (14) and integration tests (11)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@Ahmath-Gadji Ahmath-Gadji deleted the feat/relationship2 branch January 30, 2026 09:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request feat Add a new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants