Issue: Backend Leaking Internal Processing Data to Frontend (10-20x Payload Bloat)
Priority: HIGH
Category: Performance, Security, Data Privacy
Impact: Production-ready system exposing internal logic and wasting bandwidth
Problem Statement
The conversation API (POST /api/conversations/{id}/messages) is returning massive amounts of internal processing data that should never leave the backend. Each message response is 10-20KB when it should be 1-2KB.
What's Being Leaked
The backend currently exposes:
- Full Chain of Thought (CoT) reasoning steps with complete LLM context
- Complete document chunks in sources (not excerpts)
- Internal processing metadata (enhanced questions, reasoning traces)
- LLM prompts and intermediate answers
- Source attributions with internal document IDs
- Complete context used for reasoning (thousands of tokens)
Example of Leaked Data
Current response structure exposes:
{
"metadata": {
"search_metadata": {
"enhanced_question": "...", // ❌ Internal processing - should not be exposed
"cot_steps": [ // ❌ Full reasoning steps with complete context
{
"step_number": 1,
"question": "...",
"context_used": ["...5000+ characters of internal context..."],
"intermediate_answer": "...",
"confidence_score": 0.9,
"reasoning_trace": "..."
}
],
"cot_output": { // ❌ Duplicates cot_steps with even more detail
"original_question": "...",
"final_answer": "...",
"reasoning_steps": [...], // Full reasoning with context
"source_summary": {
"all_sources": [...],
"primary_sources": [...],
"source_usage_by_step": {...}
}
}
},
"sources": [ // ❌ Full document chunks (not excerpts)
{
"document_name": "...",
"content": "...entire document chunk, sometimes 1000+ characters...",
"metadata": {
"document_id": "uuid", // ❌ Internal ID exposed
"chunk_id": null
}
}
],
"cot_output": { // ❌ THIRD copy of CoT data
"reasoning_steps": [...],
"source_summary": {...}
}
}
}
Impact
1. Performance Impact
- 10-20x larger payloads than necessary
- Wasting user bandwidth (especially mobile users)
- Slower page loads and rendering
- Increased CDN/network costs
- Frontend has to parse and store unnecessary data
2. Security/Privacy Impact
- Exposing internal LLM prompts (prompt engineering IP)
- Exposing reasoning logic (proprietary algorithms)
- Potentially leaking sensitive document content (full chunks vs safe excerpts)
- Revealing internal document IDs (security through obscurity broken)
3. User Experience Impact
- Slower response times
- More memory usage in browser
- Unnecessary data stored in conversation history
Root Cause Analysis
Location of Issue
The problem originates in the message processing flow:
1. `MessageProcessingOrchestrator._serialize_response()` (lines ~220-280 in `message_processing_orchestrator.py`)
   - Creates `serialized_response` from `search_result`
   - Includes ALL search metadata without filtering
2. `MessageProcessingOrchestrator._store_assistant_message()` (lines ~280-340)
   - Stores full metadata in conversation message
   - No sanitization before storing or returning
3. `ConversationMessageOutput.from_db_message()` (`conversation_schema.py`, lines ~259-319)
   - Reconstructs full metadata from database
   - Returns everything without filtering
Why It Happens
- No response sanitization layer between backend processing and API response
- Defensive programming: "Include everything just in case" mentality
- CoT data duplicated 3 times: in `cot_steps`, `cot_output`, and `search_metadata.cot_output`
- No distinction between internal metadata (backend only) and user-facing metadata (frontend)
Proposed Solution
High-Level Approach
Create a response sanitization layer that filters metadata before sending to frontend.
Detailed Implementation
1. Add Response Sanitizer (message_processing_orchestrator.py)
from typing import Any

def sanitize_for_frontend(
search_metadata: dict[str, Any],
show_cot_steps: bool = False
) -> dict[str, Any]:
"""Remove internal processing data before sending to frontend.
Args:
search_metadata: Full search metadata from backend
show_cot_steps: Whether user explicitly requested CoT visibility
Returns:
Sanitized metadata safe for frontend consumption
"""
sanitized = {
"execution_time": search_metadata.get("execution_time"),
"token_count": search_metadata.get("token_count"),
"model_used": search_metadata.get("model_used"),
"confidence_score": search_metadata.get("confidence_score"),
}
# Only include CoT summary if explicitly requested by user
if show_cot_steps and search_metadata.get("cot_output"):
cot_output = search_metadata["cot_output"]
sanitized["cot_summary"] = {
"steps_count": len(cot_output.get("reasoning_steps", [])),
"total_confidence": cot_output.get("total_confidence"),
"reasoning_strategy": cot_output.get("reasoning_strategy"),
# Include only high-level step info, not full context
"steps": [
{
"step_number": step["step_number"],
"question": step["question"],
"confidence_score": step.get("confidence_score")
# ❌ DO NOT include: context_used, intermediate_answer, reasoning_trace
}
for step in cot_output.get("reasoning_steps", [])
]
}
# Include structured_answer if present (contains citations)
if search_metadata.get("structured_answer"):
sanitized["structured_answer"] = search_metadata["structured_answer"]
# structured_answer already has proper citations, no full context
# ❌ DO NOT include:
# - enhanced_question (internal processing)
# - cot_steps (full reasoning with context)
# - source_summary (internal attribution tracking)
# - integration_seamless, conversation_ui_used, etc. (internal flags)
return sanitized
def sanitize_sources(sources: list[dict[str, Any]]) -> list[dict[str, Any]]:
"""Sanitize source documents before sending to frontend.
Args:
sources: Full source documents with complete chunks
Returns:
Sanitized sources with excerpts only
"""
sanitized_sources = []
for source in sources:
# Limit content to 150 characters (excerpt only)
content = source.get("content", "")
excerpt = content[:150] + "..." if len(content) > 150 else content
sanitized_sources.append({
"document_name": source.get("document_name"),
"excerpt": excerpt, # Shortened content
"metadata": {
"score": source.get("metadata", {}).get("score"),
"page_number": source.get("metadata", {}).get("page_number"),
# ❌ DO NOT include: document_id, chunk_id (internal IDs)
}
})
return sanitized_sources
2. Apply Sanitization in _store_assistant_message()
async def _store_assistant_message(
self,
session_id: UUID,
search_result: SearchResult,
serialized_response: str,
assistant_response_tokens: int,
user_token_count: int,
user_id: UUID,
) -> ConversationMessageOutput:
"""Store assistant message with SANITIZED metadata."""
# Extract user's show_cot_steps preference
show_cot_steps = False
if search_result.metadata:
show_cot_steps = search_result.metadata.get("show_cot_steps", False)
# Sanitize search metadata before storing
sanitized_search_metadata = sanitize_for_frontend(
search_result.metadata or {},
show_cot_steps=show_cot_steps
)
# Sanitize sources before storing
sanitized_sources = sanitize_sources(search_result.documents or [])
# Build metadata with only sanitized data
metadata_dict = {
"source_documents": [doc["document_name"] for doc in search_result.documents] if search_result.documents else None,
"search_metadata": sanitized_search_metadata, # ✅ Sanitized
"cot_used": bool(search_result.cot_output),
"conversation_aware": True,
"execution_time": search_result.execution_time,
"token_count": assistant_response_tokens,
"token_analysis": search_result.token_warning,
# ❌ DO NOT store full cot_output at top level (redundant)
}
# Create message input with sanitized metadata
assistant_message_input = ConversationMessageInput(
session_id=session_id,
content=serialized_response,
role=MessageRole.ASSISTANT,
message_type=MessageType.ANSWER,
metadata=metadata_dict,
token_count=assistant_response_tokens,
execution_time=search_result.execution_time,
)
# Store in database
db_message = self.repository.create_message(assistant_message_input)
# Convert to output schema
message_output = ConversationMessageOutput.from_db_message(db_message)
# Add sanitized sources (not stored in DB, but included in response)
message_output.sources = sanitized_sources # ✅ Sanitized excerpts only
return message_output
3. Update Schema to Support Sanitized Data
No schema changes required - we're just filtering what we put into the existing fields.
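As a quick sanity check of the size claim, condensed stand-ins for the two sanitizers can be exercised against representative data (a sketch; these inline helpers mirror the functions above, not the production code, and the sample values are illustrative):

```python
import json

# Condensed stand-in for sanitize_for_frontend(): keep only the whitelisted keys.
def sanitize_for_frontend(meta: dict) -> dict:
    keep = ("execution_time", "token_count", "model_used", "confidence_score")
    return {k: meta[k] for k in keep if k in meta}

# Condensed stand-in for sanitize_sources(): excerpt content, drop internal IDs.
def sanitize_sources(sources: list[dict]) -> list[dict]:
    out = []
    for s in sources:
        content = s.get("content", "")
        excerpt = content[:150] + "..." if len(content) > 150 else content
        out.append({"document_name": s.get("document_name"), "excerpt": excerpt})
    return out

raw = {
    "search_metadata": {
        "execution_time": 1.2,
        "token_count": 480,
        "enhanced_question": "internal rewrite of the user question",
        "cot_steps": [{"context_used": ["c" * 5000]}],
    },
    "sources": [{"document_name": "doc.pdf", "content": "x" * 1200,
                 "metadata": {"document_id": "uuid-1", "chunk_id": "c-1"}}],
}
lean = {
    "search_metadata": sanitize_for_frontend(raw["search_metadata"]),
    "sources": sanitize_sources(raw["sources"]),
}
print(len(json.dumps(raw)), "->", len(json.dumps(lean)), "bytes")
```

On data shaped like the leaked example, the filtered payload comes out more than an order of magnitude smaller, which is the effect the proposal targets.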
Benefits of This Approach
- 10-20x smaller payloads (1-2KB instead of 10-20KB per message)
- Protects internal IP (LLM prompts, reasoning logic)
- Better security (no document IDs, limited content exposure)
- Faster response times (less data to transmit and parse)
- Backward compatible (same schema, just filtered data)
- Respects user preferences (CoT details only if `show_cot_steps: true`)
Implementation Plan
Phase 1: Add Sanitization Functions ✅
- Create `sanitize_for_frontend()` in `message_processing_orchestrator.py`
- Create `sanitize_sources()` in `message_processing_orchestrator.py`
- Add unit tests for sanitization logic
Phase 2: Apply Sanitization ✅
- Update `_store_assistant_message()` to use sanitization
- Update `_serialize_response()` to prepare data for sanitization
- Ensure `show_cot_steps` flag is respected
Phase 3: Verify Data Flow ✅
- Add logging to confirm sanitization is applied
- Test with `show_cot_steps: true` (should include summary)
- Test with `show_cot_steps: false` (should exclude CoT entirely)
- Verify payload sizes reduced by 10-20x
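For the payload-size checks, a small helper makes the threshold explicit in tests (a sketch; `payload_size_kb` is a hypothetical helper, not an existing utility in the codebase):

```python
import json

def payload_size_kb(response: dict) -> float:
    """Approximate wire size of a JSON body in kilobytes (pre-compression)."""
    return len(json.dumps(response, separators=(",", ":")).encode("utf-8")) / 1024

# A full 5000-char context blob vs. a 150-char excerpt:
bloated = {"sources": [{"content": "x" * 5000}]}
lean = {"sources": [{"excerpt": "x" * 150 + "..."}]}
print(payload_size_kb(bloated), payload_size_kb(lean))
```

An integration test can then assert `payload_size_kb(response) < 2.0` for the default path and `< 5.0` with `show_cot_steps: true`, matching the success criteria below.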
Phase 4: Audit Other Endpoints ✅
- Check `/api/search` endpoint for similar issues
- Check WebSocket message handling
- Check conversation export functionality
Testing Strategy
Manual Testing
1. Send message with `show_cot_steps: false` (default)
   - Verify response < 2KB
   - Verify no `cot_steps` or `reasoning_steps` in response
   - Verify sources are excerpts only (< 150 chars)
2. Send message with `show_cot_steps: true`
   - Verify `cot_summary` is present
   - Verify NO full `context_used` or `intermediate_answer`
   - Verify still < 5KB (much smaller than current 10-20KB)
3. Check database
   - Verify stored metadata is also sanitized
   - Verify no full document chunks in DB
Automated Testing
def test_sanitize_for_frontend_removes_internal_data():
"""Verify sanitization removes internal processing data."""
full_metadata = {
"execution_time": 10.5,
"token_count": 500,
"enhanced_question": "internal processing", # Should be removed
"cot_steps": [...], # Should be removed
"cot_output": {
"reasoning_steps": [
{
"step_number": 1,
"question": "...",
"context_used": ["...1000 chars..."], # Should be removed
"intermediate_answer": "..." # Should be removed
}
]
}
}
sanitized = sanitize_for_frontend(full_metadata, show_cot_steps=False)
assert "execution_time" in sanitized
assert "token_count" in sanitized
assert "enhanced_question" not in sanitized
assert "cot_steps" not in sanitized
assert "cot_summary" not in sanitized # show_cot_steps=False
def test_sanitize_sources_limits_content():
"""Verify sources are limited to excerpts."""
full_sources = [
{
"document_name": "test.pdf",
"content": "A" * 1000, # 1000 characters
"metadata": {
"score": 0.95,
"page_number": 42,
"document_id": "internal-uuid-123",
"chunk_id": "chunk-456"
}
}
]
sanitized = sanitize_sources(full_sources)
assert len(sanitized[0]["excerpt"]) <= 153 # 150 + "..."
assert "document_id" not in sanitized[0]["metadata"]
assert "chunk_id" not in sanitized[0]["metadata"]
assert sanitized[0]["metadata"]["score"] == 0.95
Files to Modify
1. `backend/rag_solution/services/message_processing_orchestrator.py`
   - Add `sanitize_for_frontend()` method (lines ~300-350)
   - Add `sanitize_sources()` method (lines ~350-380)
   - Update `_store_assistant_message()` to apply sanitization (lines ~280-340)
2. `tests/unit/services/test_message_processing_orchestrator.py`
   - Add `test_sanitize_for_frontend_removes_internal_data()`
   - Add `test_sanitize_for_frontend_includes_cot_summary_when_requested()`
   - Add `test_sanitize_sources_limits_content()`
   - Add `test_sanitize_sources_removes_internal_ids()`
3. `tests/integration/test_conversation_api.py`
   - Add test for payload size verification
   - Add test for CoT visibility control
Success Criteria
- ✅ Message payload size reduced from 10-20KB to 1-2KB (default)
- ✅ Message payload size < 5KB (with `show_cot_steps: true`)
- ✅ No `enhanced_question`, `cot_steps`, or `context_used` in responses
- ✅ Sources limited to 150-char excerpts
- ✅ No internal document IDs or chunk IDs exposed
- ✅ CoT summary only present when `show_cot_steps: true`
- ✅ All tests passing
- ✅ Manual verification with network inspector
Additional Considerations
Backward Compatibility
- Frontend already handles missing fields gracefully
- No schema changes required
- Existing clients will just receive less data (not a breaking change)
Database Storage
- Consider also sanitizing data BEFORE storing in DB
- Reduces database size and query performance impact
- Prevents accidental exposure if DB is compromised
Future Improvements
- Add response compression (gzip) for even smaller payloads
- Consider paginating conversation history (lazy load old messages)
- Add telemetry to track actual payload sizes in production
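On the compression point: even a sanitized JSON body compresses well, so gzip at the serving layer is a cheap follow-up win. A stdlib check of the effect (a sketch with illustrative data; whether and where gzip is enabled depends on the serving stack):

```python
import gzip
import json

# A plausible sanitized response body (illustrative data, not the real schema).
body = json.dumps({
    "answer": "Summary of the retrieved documents. " * 30,
    "search_metadata": {"execution_time": 1.2, "token_count": 480},
}).encode("utf-8")

compressed = gzip.compress(body)
print(f"{len(body)} bytes -> {len(compressed)} bytes gzipped")
```

If the backend is ASGI-based, this is typically a one-line middleware addition rather than application code; sanitization should still happen first, since compression hides size but not content.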
Related Issues
- Implement Structured Output with JSON Schema Validation #604 #626 - Structured output with citations (this PR exposed the data leak)
- Performance optimization efforts
- Security audit recommendations
Priority Justification
HIGH Priority because:
- Production Impact: Affects all conversation API calls
- Security Risk: Exposing internal logic and potentially sensitive content
- Performance Impact: 10-20x unnecessary bandwidth usage
- User Experience: Slower page loads, wasted mobile data
- Cost Impact: Higher CDN/network costs
Estimated Effort
- Implementation: 4-6 hours
- Testing: 2-3 hours
- Review and deployment: 1-2 hours
- Total: 1 day
References
- Current payload example: See user's browser network tab showing 10-20KB responses
- Industry best practices: REST APIs should minimize payload size
- Security principle: Never expose internal processing details to clients