@smurching smurching commented Sep 25, 2025

Add Dynamic Filter Support to VectorSearchRetrieverTool

Summary

This PR adds opt-in support for LLM-generated filter parameters in VectorSearchRetrieverTool, enabling LLMs to dynamically construct filters based on natural language queries. This feature is controlled by a new dynamic_filter parameter (default: False) that exposes filter parameters in the tool schema when enabled.

Key Feature: The filter parameter description includes guidance for LLMs to use a fallback strategy - searching WITHOUT filters first to get broad results, then optionally adding filters to narrow down. This approach helps avoid zero results due to incorrect filter values while maintaining filtering flexibility.

Changes

Core Changes

1. New dynamic_filter Parameter

  • Added dynamic_filter: bool field to VectorSearchRetrieverToolMixin (default: False)
  • When True, exposes filter parameters in the tool schema for LLM-generated filters
  • When False (default), maintains backward-compatible behavior with no filter parameters exposed

2. Mutual Exclusivity Validation

  • Added @model_validator to ensure dynamic_filter=True and predefined filters cannot be used together
  • Prevents ambiguous filter configuration by enforcing one approach or the other
  • Clear error message guides users to the correct usage pattern
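
A minimal sketch of the mutual-exclusivity check described above. The PR implements this with pydantic's @model_validator on the mixin; to stay dependency-free, this sketch uses a dataclass `__post_init__` hook instead. The class name `RetrieverToolConfig` and the error text are illustrative assumptions, not the PR's actual code.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class RetrieverToolConfig:
    """Hypothetical stand-in for the tool's filter-related fields."""

    index_name: str
    filters: Optional[Dict[str, Any]] = None  # predefined filters
    dynamic_filter: bool = False  # opt-in LLM-generated filters

    def __post_init__(self) -> None:
        # Reject the ambiguous case where both filter mechanisms are set.
        if self.dynamic_filter and self.filters:
            raise ValueError(
                "Cannot use dynamic_filter=True together with predefined "
                "filters; choose one approach."
            )
```

Either setting alone passes validation; only the combination raises.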

3. Enhanced Filter Parameter Descriptions with Fallback Strategy

  • Extracts column metadata from Unity Catalog (workspace_client.tables.get())
  • Includes available columns with types in the filter parameter description
  • Example: "Available columns for filtering: product_category (STRING), product_sub_category (STRING)..."
  • NEW: Includes guidance to search WITHOUT filters first: "IMPORTANT: If unsure about filter values, try searching WITHOUT filters first to get broad results, then optionally add filters to narrow down if needed. This ensures you don't miss relevant results due to incorrect filter values."
  • Provides comprehensive operator documentation and examples
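
As a rough illustration of how such a description could be assembled from column metadata, the sketch below takes (name, type) pairs and produces a single string. The function name `build_filter_description` is hypothetical, and the exact wording and ordering in the PR's `_get_filter_param_description()` may differ.

```python
from typing import List, Tuple

FALLBACK_GUIDANCE = (
    "IMPORTANT: If unsure about filter values, try searching WITHOUT filters "
    "first to get broad results, then optionally add filters to narrow down "
    "if needed."
)


def build_filter_description(columns: List[Tuple[str, str]]) -> str:
    """Compose a filter-parameter description from (name, type) pairs."""
    parts = [
        "Optional filters to refine vector search results as an array of "
        "key-value pairs.",
        FALLBACK_GUIDANCE,
    ]
    # Skip internal columns whose names start with a double underscore.
    visible = [(name, col_type) for name, col_type in columns
               if not name.startswith("__")]
    if visible:
        listed = ", ".join(f"{name} ({col_type})" for name, col_type in visible)
        parts.append(f"Available columns for filtering: {listed}")
    return " ".join(parts)
```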

Integration Updates

OpenAI Integration (integrations/openai/src/databricks_openai/vector_search_retriever_tool.py)

  • Conditionally creates EnhancedVectorSearchRetrieverToolInput (with optional filters) or BasicVectorSearchRetrieverToolInput (without filters) based on dynamic_filter setting
  • Filter parameter is marked as Optional[List[FilterItem]] with default=None
  • Inlines column metadata extraction during tool schema generation
  • Fixed bug: Originally tried to use index.describe()["columns"] which doesn't exist; now uses Unity Catalog tables API
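
The PR builds pydantic input models for the two cases. The sketch below shows the same idea in a dependency-free form, as a JSON-schema-style dictionary that includes `filters` only when `dynamic_filter` is enabled. `build_tool_parameters` and the property descriptions are assumptions made for illustration.

```python
from typing import Any, Dict


def build_tool_parameters(
    dynamic_filter: bool, filter_description: str = ""
) -> Dict[str, Any]:
    """Return a JSON-schema-style parameter spec for the retriever tool."""
    properties: Dict[str, Any] = {
        "query": {"type": "string", "description": "The search query."}
    }
    if dynamic_filter:
        # Expose `filters` only when the caller has opted in.
        properties["filters"] = {
            "type": "array",
            "description": filter_description,
            "items": {
                "type": "object",
                "properties": {
                    "key": {"type": "string"},
                    "value": {},  # any JSON value
                },
                "required": ["key", "value"],
            },
        }
    # `filters` is never required, mirroring the default=None in the PR.
    return {"type": "object", "properties": properties, "required": ["query"]}
```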

LangChain Integration (integrations/langchain/src/databricks_langchain/vector_search_retriever_tool.py)

  • Similar conditional args_schema creation based on dynamic_filter setting
  • Filter parameter is marked as optional (Optional[List[FilterItem]] with default=None)
  • Maintains compatibility with LangChain's tool invocation patterns

Tests

New Test Coverage:

  • test_cannot_use_both_dynamic_filter_and_predefined_filters - Validates mutual exclusivity
  • test_predefined_filters_work_without_dynamic_filter - Ensures predefined filters work without dynamic mode
  • test_enhanced_filter_description_with_column_metadata - Verifies column info is included
  • test_enhanced_filter_description_without_column_metadata - Handles missing column info gracefully
  • test_filter_item_serialization - Tests FilterItem schema

Test Results:

  • ✅ OpenAI Integration: 48 tests passing
  • ✅ LangChain Integration: 37 tests passing

Usage

Basic Usage (OpenAI)

from databricks_openai import VectorSearchRetrieverTool

# Enable dynamic filters
tool = VectorSearchRetrieverTool(
    index_name="catalog.schema.my_index",
    dynamic_filter=True  # Exposes optional filter parameters to LLM
)

# LLM receives guidance to try without filters first
# Then can optionally generate filters like:
# {"query": "wireless headphones", "filters": [{"key": "price <", "value": 100}]}
result = tool.execute(
    query="wireless headphones",
    filters=[{"key": "price <", "value": 100}]  # Optional!
)

Basic Usage (LangChain)

from databricks_langchain import VectorSearchRetrieverTool

tool = VectorSearchRetrieverTool(
    index_name="catalog.schema.my_index",
    dynamic_filter=True
)

# Use with LangChain agents - filter parameter is optional
result = tool.invoke({
    "query": "wireless headphones",
    "filters": [{"key": "price <", "value": 100}]  # Optional!
})
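
Each filter item pairs a column name (optionally followed by an operator, as in "price <") with a value. Assuming the items are flattened into a single filter dictionary before the index is queried (an assumption; this description does not show that step), the conversion could look like the sketch below. `filter_items_to_dict` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional


def filter_items_to_dict(
    items: Optional[List[Dict[str, Any]]],
) -> Dict[str, Any]:
    """Flatten [{"key": ..., "value": ...}, ...] into {key: value}."""
    if not items:
        return {}  # filters omitted or empty: search unfiltered
    return {item["key"]: item["value"] for item in items}
```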

🎯 Recommendations

When to use dynamic_filter=True:

  • Column values are discoverable from context (in retrieved documents)
  • Filter requirements are simple and commonly understood (e.g., date ranges, numeric comparisons)
  • Acceptable to have some queries return zero results due to filter mismatches
  • Users can iteratively refine queries based on results
  • LLM can follow the fallback strategy (search without filters first)

When to use predefined filters:

  • Column values are constrained enums (product categories, status values, etc.)
  • Filter logic is deterministic and known in advance
  • Zero tolerance for LLM hallucination in filter values
  • Consistent, predictable behavior is required

Implementation Details

Fallback Strategy Mechanism

The fallback strategy is implemented through tool description guidance rather than execution-time logic:

  1. Filter Parameter Description includes: "IMPORTANT: If unsure about filter values, try searching WITHOUT filters first..."
  2. Filters are Optional: Marked as Optional[List[FilterItem]] with default=None
  3. LLM Follows Guidance: When the LLM sees this description, it is prompted to:
    • First invoke the tool without filters to get broad results
    • Examine the results to understand available filter values
    • Optionally invoke again with accurate filters to narrow results

This approach:

  • ✅ Leverages LLM's ability to follow instructions in tool descriptions
  • ✅ Doesn't require shipping complex merge/fallback logic
  • ✅ Simple to implement (text-based guidance)
  • ✅ Backward compatible
  • ✅ Steers LLM behavior through the tool description without changing execution

Validation

Testing in Practice

When tested with LangChain AgentExecutor, the LLM decides whether to generate filters based on the user prompt and the guidance in the tool description. When a filtered search comes back empty, it follows the IMPORTANT guidance and retries without filters.

Observed LLM Behavior

The following run illustrates this:

Example Query: "Find documentation about Data Engineering products"

LLM Actions:

  1. First attempt: Tries with a filter based on the query:

    {'query': 'Data Engineering products',
     'filters': [{'key': 'product_category', 'value': 'Data Engineering'}]}

    Result: Empty (0 results) - the category doesn't exist in the index

  2. Second attempt: Following the IMPORTANT guidance, automatically retries WITHOUT filters:

    {'query': 'Data Engineering'}

    Result: Success! Returns relevant results from actual categories (Software, Computers, etc.)

Key Observation: The guidance prompts the LLM to:

  • Try with filters when the user query suggests specific filter criteria
  • Automatically fall back to searching without filters when the first attempt fails
  • Get broader, more relevant results instead of returning zero results
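
The PR deliberately does not implement this fallback at execution time; the LLM carries it out across two tool calls. Written as explicit code, the observed behavior amounts to roughly the following. `search_with_fallback` and the `search` callable are hypothetical and stand in for the agent loop and the tool invocation.

```python
from typing import Any, Callable, Dict, List, Optional

Filters = Optional[List[Dict[str, Any]]]
SearchFn = Callable[[str, Filters], List[Any]]


def search_with_fallback(
    search: SearchFn, query: str, filters: Filters = None
) -> List[Any]:
    """Try a filtered search, then retry unfiltered if nothing comes back."""
    if filters:
        results = search(query, filters)
        if results:
            return results
    # No filters given, or the filtered search was empty: search broadly.
    return search(query, None)
```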

smurching and others added 10 commits September 24, 2025 21:01
Signed-off-by: Sid Murching <[email protected]>
- Update filter parameter description to include IMPORTANT guidance encouraging LLMs to search without filters first when unsure about filter values
- Refactor OpenAI and LangChain integrations to use mixin's _get_filter_param_description() method for consistency
- Add column metadata extraction from Unity Catalog to filter descriptions
- Update demo scripts to demonstrate fallback pattern
- When tested with LangChain AgentExecutor, LLMs intelligently decide when to generate filters and automatically fall back to no-filter searches when needed

This guidance-based approach avoids zero-result scenarios due to hallucinated filter values while maintaining filtering flexibility.

Inspired by Databricks Knowledge Assistants team's (Cindy Wang et al.) findings that searching with and without filters and merging results improves accuracy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@smurching smurching requested a review from jennsun October 21, 2025 03:33
smurching and others added 5 commits October 20, 2025 20:33
Demo scripts were used for testing and validation but are not part of the final PR.

Signed-off-by: Sid Murching <[email protected]>
- Move input model creation logic to mixin class (_create_enhanced_input_model, _create_basic_input_model)
- Update OpenAI and LangChain integrations to use shared methods
- Revert unrelated vectorstores.py change (workspace_client credential passing)
- Revert corresponding test expectation change

This consolidates duplicated logic between integrations and removes changes unrelated to the dynamic filter feature.

…gration

- Revert workspace_url and personal_access_token logic in VectorSearchClient initialization
- Revert corresponding test expectations for workspace client credentials
- These changes were unrelated to the dynamic filter feature

            if not name.startswith("__"):
                column_info.append((name, col_type))
    except Exception:
        pass
Collaborator Author

We should probably at least log a warning here

- Log a warning message when Unity Catalog table metadata cannot be fetched
- Helps diagnose why column information may be missing from filter descriptions

@smurching smurching changed the title [WIP] Add support for LLM-generated filter parameters in VectorSearchRetrieverTool Add support for LLM-generated filter parameters in VectorSearchRetrieverTool Oct 21, 2025
smurching and others added 4 commits October 20, 2025 23:11
- Remove try-catch logic in _get_filter_param_description() to fail loudly
  when table metadata is unavailable for dynamic_filter=True
- Remove tests for missing column metadata scenario (no longer supported)
- Run ruff format to fix lint issues
- Better to fail loudly than have low quality filter descriptions

Signed-off-by: Sid Murching <[email protected]>
Signed-off-by: Sid Murching <[email protected]>
Contributor

@jennsun jennsun left a comment

This looks great! My only significant comment is about handling errors during the table information retrieval logic. Let's be sure to update the documentation afterwards to explain the dynamic_filter parameter, including these points off the top of my head:

  1. When are appropriate cases to use dynamic_filter vs when predefined filters work just fine
  2. Example usage

"Optional filters to refine vector search results as an array of key-value pairs. "
"IMPORTANT: If unsure about filter values, try searching WITHOUT filters first to get broad results, "
"then optionally add filters to narrow down if needed. This ensures you don't miss relevant results due to incorrect filter values. "
)
Contributor

not a review but wanted to comment: interesting insight here to search broadly then narrow down

Contributor

Have you done testing on your own to confirm this retry filtering works as intended?

col_type = column_info_item.type_name.name
if not name.startswith("__"):
    column_info.append((name, col_type))

Contributor

Since the dynamic filter requires column metadata to be useful, can we raise a ValueError if column_info is still empty at the end of this process?

It may also be safer to wrap the table information retrieval logic in a try/except in case of failures such as permission errors, a missing table, or a table with no entries.
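
One possible shape for the handling suggested here, as a sketch only. It assumes the metadata comes from `workspace_client.tables.get(full_name=...)` and that each column exposes `name` and `type_name.name`, as in the snippet above; the function name `get_column_info`, the `table_name` parameter, and the error messages are hypothetical.

```python
import logging
from typing import Any, List, Tuple

logger = logging.getLogger(__name__)


def get_column_info(workspace_client: Any, table_name: str) -> List[Tuple[str, str]]:
    """Fetch (name, type) pairs for a table, failing loudly when unavailable."""
    try:
        table = workspace_client.tables.get(full_name=table_name)
    except Exception as exc:
        # Permission errors and missing tables surface here with context.
        logger.warning("Could not fetch metadata for %s: %s", table_name, exc)
        raise ValueError(
            f"Unable to retrieve column metadata for {table_name}"
        ) from exc

    column_info: List[Tuple[str, str]] = []
    for col in table.columns or []:
        # Skip internal columns whose names start with a double underscore.
        if not col.name.startswith("__"):
            column_info.append((col.name, col.type_name.name))

    if not column_info:
        raise ValueError(f"No filterable columns found for {table_name}")
    return column_info
```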

k=vector_search_tool.num_results,
query_type=vector_search_tool.query_type,
filter=expected_filters,
)
Contributor

If we add error handling per my comment in src/databricks_ai_bridge/vector_search_retriever_tool.py, can we add a test that simulates a failure during the WorkspaceClient.tables.get() call?
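
A test along these lines could simulate the failure with `unittest.mock`. This is a sketch under assumptions: `fetch_columns` is a hypothetical stand-in for the code under test, and the real test would target the mixin's metadata lookup and whatever exception type it ends up raising.

```python
from unittest.mock import MagicMock


def fetch_columns(workspace_client, table_name):
    """Hypothetical function under test: lets metadata errors propagate."""
    table = workspace_client.tables.get(full_name=table_name)
    return [(c.name, c.type_name.name) for c in table.columns]


def test_tables_get_failure_propagates():
    client = MagicMock()
    # Make the metadata lookup fail the way a permission check might.
    client.tables.get.side_effect = PermissionError("no access to table")
    try:
        fetch_columns(client, "catalog.schema.my_table")
    except PermissionError as exc:
        assert "no access" in str(exc)
    else:
        raise AssertionError("expected the metadata failure to propagate")
    client.tables.get.assert_called_once_with(full_name="catalog.schema.my_table")
```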
