
Backend leaking internal processing data to frontend (10-20x payload bloat) #628

@manavgup

Description

Issue: Backend Leaking Internal Processing Data to Frontend (10-20x Payload Bloat)

Priority: HIGH

Category: Performance, Security, Data Privacy
Impact: Production-ready system exposing internal logic and wasting bandwidth

Problem Statement

The conversation API (POST /api/conversations/{id}/messages) is returning massive amounts of internal processing data that should never leave the backend. Each message response is 10-20KB when it should be 1-2KB.

What's Being Leaked

The backend currently exposes:

  1. Full Chain of Thought (CoT) reasoning steps with complete LLM context
  2. Complete document chunks in sources (not excerpts)
  3. Internal processing metadata (enhanced questions, reasoning traces)
  4. LLM prompts and intermediate answers
  5. Source attributions with internal document IDs
  6. Complete context used for reasoning (thousands of tokens)

Example of Leaked Data

Current response structure exposes:

{
  "metadata": {
    "search_metadata": {
      "enhanced_question": "...",  // ❌ Internal processing - should not be exposed
      "cot_steps": [  // ❌ Full reasoning steps with complete context
        {
          "step_number": 1,
          "question": "...",
          "context_used": ["...5000+ characters of internal context..."],
          "intermediate_answer": "...",
          "confidence_score": 0.9,
          "reasoning_trace": "..."
        }
      ],
      "cot_output": {  // ❌ Duplicates cot_steps with even more detail
        "original_question": "...",
        "final_answer": "...",
        "reasoning_steps": [...],  // Full reasoning with context
        "source_summary": {
          "all_sources": [...],
          "primary_sources": [...],
          "source_usage_by_step": {...}
        }
      }
    },
    "sources": [  // ❌ Full document chunks (not excerpts)
      {
        "document_name": "...",
        "content": "...entire document chunk, sometimes 1000+ characters...",
        "metadata": {
          "document_id": "uuid",  // ❌ Internal ID exposed
          "chunk_id": null
        }
      }
    ],
    "cot_output": {  // ❌ THIRD copy of CoT data
      "reasoning_steps": [...],
      "source_summary": {...}
    }
  }
}
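For contrast, a sanitized response would carry only user-facing fields. The shape below is an illustrative sketch of the target (field names follow the sanitizer proposed later in this issue; the values are placeholders, not real data):

```python
# Hypothetical target shape for the sanitized response (illustrative only).
sanitized_response = {
    "metadata": {
        "search_metadata": {
            "execution_time": 1.23,
            "token_count": 512,
            "model_used": "example-model",
            "confidence_score": 0.9,
        },
        "sources": [
            {
                "document_name": "report.pdf",
                # Excerpt only, capped at 150 characters -- never the full chunk.
                "excerpt": "First 150 characters of the matched chunk...",
                # No document_id or chunk_id; only user-relevant metadata.
                "metadata": {"score": 0.95, "page_number": 12},
            }
        ],
    }
}
```

Note what is absent: no `enhanced_question`, no `cot_steps`, no duplicated `cot_output`, and no internal IDs.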

Impact

1. Performance Impact

  • 10-20x larger payloads than necessary
  • Wasting user bandwidth (especially mobile users)
  • Slower page loads and rendering
  • Increased CDN/network costs
  • Frontend has to parse and store unnecessary data

2. Security/Privacy Impact

  • Exposing internal LLM prompts (prompt engineering IP)
  • Exposing reasoning logic (proprietary algorithms)
  • Potentially leaking sensitive document content (full chunks vs safe excerpts)
  • Revealing internal document IDs (enables enumeration and correlation of internal resources)

3. User Experience Impact

  • Slower response times
  • More memory usage in browser
  • Unnecessary data stored in conversation history

Root Cause Analysis

Location of Issue

The problem originates in the message processing flow:

  1. MessageProcessingOrchestrator._serialize_response() (line ~220-280 in message_processing_orchestrator.py)

    • Creates serialized_response from search_result
    • Includes ALL search metadata without filtering
  2. MessageProcessingOrchestrator._store_assistant_message() (line ~280-340)

    • Stores full metadata in conversation message
    • No sanitization before storing or returning
  3. ConversationMessageOutput.from_db_message() (conversation_schema.py line ~259-319)

    • Reconstructs full metadata from database
    • Returns everything without filtering

Why It Happens

  • No response sanitization layer between backend processing and API response
  • Defensive programming: "Include everything just in case" mentality
  • CoT data duplicated 3 times: in cot_steps, cot_output, and search_metadata.cot_output
  • No distinction between internal metadata (backend only) and user-facing metadata (frontend)

Proposed Solution

High-Level Approach

Create a response sanitization layer that filters metadata before sending to frontend.

Detailed Implementation

1. Add Response Sanitizer (message_processing_orchestrator.py)

def sanitize_for_frontend(
    search_metadata: dict[str, Any],
    show_cot_steps: bool = False
) -> dict[str, Any]:
    """Remove internal processing data before sending to frontend.

    Args:
        search_metadata: Full search metadata from backend
        show_cot_steps: Whether user explicitly requested CoT visibility

    Returns:
        Sanitized metadata safe for frontend consumption
    """
    sanitized = {
        "execution_time": search_metadata.get("execution_time"),
        "token_count": search_metadata.get("token_count"),
        "model_used": search_metadata.get("model_used"),
        "confidence_score": search_metadata.get("confidence_score"),
    }

    # Only include CoT summary if explicitly requested by user
    if show_cot_steps and search_metadata.get("cot_output"):
        cot_output = search_metadata["cot_output"]
        sanitized["cot_summary"] = {
            "steps_count": len(cot_output.get("reasoning_steps", [])),
            "total_confidence": cot_output.get("total_confidence"),
            "reasoning_strategy": cot_output.get("reasoning_strategy"),
            # Include only high-level step info, not full context
            # Include only high-level step info, not full context.
            # Use .get() throughout so a malformed step dict cannot raise KeyError.
            "steps": [
                {
                    "step_number": step.get("step_number"),
                    "question": step.get("question"),
                    "confidence_score": step.get("confidence_score"),
                    # ❌ DO NOT include: context_used, intermediate_answer, reasoning_trace
                }
                for step in cot_output.get("reasoning_steps", [])
            ]
        }

    # Include structured_answer if present (contains citations)
    if search_metadata.get("structured_answer"):
        sanitized["structured_answer"] = search_metadata["structured_answer"]
        # structured_answer already has proper citations, no full context

    # ❌ DO NOT include:
    # - enhanced_question (internal processing)
    # - cot_steps (full reasoning with context)
    # - source_summary (internal attribution tracking)
    # - integration_seamless, conversation_ui_used, etc. (internal flags)

    return sanitized


def sanitize_sources(sources: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Sanitize source documents before sending to frontend.

    Args:
        sources: Full source documents with complete chunks

    Returns:
        Sanitized sources with excerpts only
    """
    sanitized_sources = []

    for source in sources:
        # Limit content to 150 characters (excerpt only)
        content = source.get("content", "")
        excerpt = content[:150] + "..." if len(content) > 150 else content

        sanitized_sources.append({
            "document_name": source.get("document_name"),
            "excerpt": excerpt,  # Shortened content
            "metadata": {
                "score": source.get("metadata", {}).get("score"),
                "page_number": source.get("metadata", {}).get("page_number"),
                # ❌ DO NOT include: document_id, chunk_id (internal IDs)
            }
        })

    return sanitized_sources
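As a quick sanity check, the excerpt logic can be exercised standalone. The snippet below inlines the `sanitize_sources` sketch from above so it runs on its own; the input is made-up test data:

```python
from typing import Any


def sanitize_sources(sources: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Inlined copy of the sketch above, so this snippet is self-contained."""
    sanitized_sources = []
    for source in sources:
        content = source.get("content", "")
        excerpt = content[:150] + "..." if len(content) > 150 else content
        sanitized_sources.append({
            "document_name": source.get("document_name"),
            "excerpt": excerpt,
            "metadata": {
                "score": source.get("metadata", {}).get("score"),
                "page_number": source.get("metadata", {}).get("page_number"),
            },
        })
    return sanitized_sources


result = sanitize_sources([
    {
        "document_name": "a.pdf",
        "content": "x" * 500,  # full 500-char chunk
        "metadata": {"score": 0.8, "document_id": "uuid-1", "chunk_id": "chunk-2"},
    },
])
print(len(result[0]["excerpt"]))                 # 153: 150 chars plus "..."
print("document_id" in result[0]["metadata"])    # False: internal IDs dropped
```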

2. Apply Sanitization in _store_assistant_message()

async def _store_assistant_message(
    self,
    session_id: UUID,
    search_result: SearchResult,
    serialized_response: str,
    assistant_response_tokens: int,
    user_token_count: int,
    user_id: UUID,
) -> ConversationMessageOutput:
    """Store assistant message with SANITIZED metadata."""

    # Extract user's show_cot_steps preference
    show_cot_steps = False
    if search_result.metadata:
        show_cot_steps = search_result.metadata.get("show_cot_steps", False)

    # Sanitize search metadata before storing
    sanitized_search_metadata = sanitize_for_frontend(
        search_result.metadata or {},
        show_cot_steps=show_cot_steps
    )

    # Sanitize sources before storing
    sanitized_sources = sanitize_sources(search_result.documents or [])

    # Build metadata with only sanitized data
    metadata_dict = {
        "source_documents": [doc["document_name"] for doc in search_result.documents] if search_result.documents else None,
        "search_metadata": sanitized_search_metadata,  # ✅ Sanitized
        "cot_used": bool(search_result.cot_output),
        "conversation_aware": True,
        "execution_time": search_result.execution_time,
        "token_count": assistant_response_tokens,
        "token_analysis": search_result.token_warning,
        # ❌ DO NOT store full cot_output at top level (redundant)
    }

    # Create message input with sanitized metadata
    assistant_message_input = ConversationMessageInput(
        session_id=session_id,
        content=serialized_response,
        role=MessageRole.ASSISTANT,
        message_type=MessageType.ANSWER,
        metadata=metadata_dict,
        token_count=assistant_response_tokens,
        execution_time=search_result.execution_time,
    )

    # Store in database
    db_message = self.repository.create_message(assistant_message_input)

    # Convert to output schema
    message_output = ConversationMessageOutput.from_db_message(db_message)

    # Add sanitized sources (not stored in DB, but included in response)
    message_output.sources = sanitized_sources  # ✅ Sanitized excerpts only

    return message_output

3. Update Schema to Support Sanitized Data

No schema changes required - we're just filtering what we put into the existing fields.

Benefits of This Approach

  1. 10-20x smaller payloads (1-2KB instead of 10-20KB per message)
  2. Protects internal IP (LLM prompts, reasoning logic)
  3. Better security (no document IDs, limited content exposure)
  4. Faster response times (less data to transmit and parse)
  5. Backward compatible (same schema, just filtered data)
  6. Respects user preferences (CoT details only if show_cot_steps: true)

Implementation Plan

Phase 1: Add Sanitization Functions ✅

  • Create sanitize_for_frontend() in message_processing_orchestrator.py
  • Create sanitize_sources() in message_processing_orchestrator.py
  • Add unit tests for sanitization logic

Phase 2: Apply Sanitization ✅

  • Update _store_assistant_message() to use sanitization
  • Update _serialize_response() to prepare data for sanitization
  • Ensure show_cot_steps flag is respected

Phase 3: Verify Data Flow ✅

  • Add logging to confirm sanitization is applied
  • Test with show_cot_steps: true (should include summary)
  • Test with show_cot_steps: false (should exclude CoT entirely)
  • Verify payload sizes reduced by 10-20x

Phase 4: Audit Other Endpoints ✅

  • Check /api/search endpoint for similar issues
  • Check WebSocket message handling
  • Check conversation export functionality

Testing Strategy

Manual Testing

  1. Send message with show_cot_steps: false (default)

    • Verify response < 2KB
    • Verify no cot_steps or reasoning_steps in response
    • Verify sources are excerpts only (< 150 chars)
  2. Send message with show_cot_steps: true

    • Verify cot_summary is present
    • Verify NO full context_used or intermediate_answer
    • Verify still < 5KB (much smaller than current 10-20KB)
  3. Check database

    • Verify stored metadata is also sanitized
    • Verify no full document chunks in DB
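The payload-size targets above can also be spot-checked programmatically rather than only in the browser's network tab. A rough sketch, with made-up payloads standing in for real API responses:

```python
import json

# Stand-in payloads; real responses come from POST /api/conversations/{id}/messages.
bloated = {
    "metadata": {
        "search_metadata": {
            "enhanced_question": "q" * 200,
            "cot_steps": [{"context_used": ["x" * 5000], "intermediate_answer": "a" * 500}],
        },
        "sources": [{"content": "c" * 1500}],
    }
}
sanitized = {
    "metadata": {
        "search_metadata": {"execution_time": 1.2, "token_count": 512},
        "sources": [{"excerpt": "c" * 150 + "..."}],
    }
}

# Measure the serialized size of each, as it would go over the wire.
bloated_size = len(json.dumps(bloated).encode())
sanitized_size = len(json.dumps(sanitized).encode())
print(f"bloated: {bloated_size} B, sanitized: {sanitized_size} B")
assert bloated_size > 10 * sanitized_size  # order-of-magnitude reduction
```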

Automated Testing

def test_sanitize_for_frontend_removes_internal_data():
    """Verify sanitization removes internal processing data."""
    full_metadata = {
        "execution_time": 10.5,
        "token_count": 500,
        "enhanced_question": "internal processing",  # Should be removed
        "cot_steps": [...],  # Should be removed
        "cot_output": {
            "reasoning_steps": [
                {
                    "step_number": 1,
                    "question": "...",
                    "context_used": ["...1000 chars..."],  # Should be removed
                    "intermediate_answer": "..."  # Should be removed
                }
            ]
        }
    }

    sanitized = sanitize_for_frontend(full_metadata, show_cot_steps=False)

    assert "execution_time" in sanitized
    assert "token_count" in sanitized
    assert "enhanced_question" not in sanitized
    assert "cot_steps" not in sanitized
    assert "cot_summary" not in sanitized  # show_cot_steps=False

def test_sanitize_sources_limits_content():
    """Verify sources are limited to excerpts."""
    full_sources = [
        {
            "document_name": "test.pdf",
            "content": "A" * 1000,  # 1000 characters
            "metadata": {
                "score": 0.95,
                "page_number": 42,
                "document_id": "internal-uuid-123",
                "chunk_id": "chunk-456"
            }
        }
    ]

    sanitized = sanitize_sources(full_sources)

    assert len(sanitized[0]["excerpt"]) <= 153  # 150 + "..."
    assert "document_id" not in sanitized[0]["metadata"]
    assert "chunk_id" not in sanitized[0]["metadata"]
    assert sanitized[0]["metadata"]["score"] == 0.95

Files to Modify

  1. backend/rag_solution/services/message_processing_orchestrator.py

    • Add sanitize_for_frontend() method (lines ~300-350)
    • Add sanitize_sources() method (lines ~350-380)
    • Update _store_assistant_message() to apply sanitization (lines ~280-340)
  2. tests/unit/services/test_message_processing_orchestrator.py

    • Add test_sanitize_for_frontend_removes_internal_data()
    • Add test_sanitize_for_frontend_includes_cot_summary_when_requested()
    • Add test_sanitize_sources_limits_content()
    • Add test_sanitize_sources_removes_internal_ids()
  3. tests/integration/test_conversation_api.py

    • Add test for payload size verification
    • Add test for CoT visibility control

Success Criteria

  • ✅ Message payload size reduced from 10-20KB to 1-2KB (default)
  • ✅ Message payload size < 5KB (with show_cot_steps: true)
  • ✅ No enhanced_question, cot_steps, or context_used in responses
  • ✅ Sources limited to 150-char excerpts
  • ✅ No internal document IDs or chunk IDs exposed
  • ✅ CoT summary only present when show_cot_steps: true
  • ✅ All tests passing
  • ✅ Manual verification with network inspector

Additional Considerations

Backward Compatibility

  • Frontend already handles missing fields gracefully
  • No schema changes required
  • Existing clients will just receive less data (not a breaking change)

Database Storage

  • Consider also sanitizing data BEFORE storing in DB
  • Reduces database size and query performance impact
  • Prevents accidental exposure if DB is compromised

Future Improvements

  • Add response compression (gzip) for even smaller payloads
  • Consider paginating conversation history (lazy load old messages)
  • Add telemetry to track actual payload sizes in production
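On the compression point: JSON compresses extremely well, so gzip recovers much of the bandwidth waste even independently of sanitization. A stdlib sketch of the effect (in production this would typically be handled by framework middleware, e.g. FastAPI's GZipMiddleware if that framework is in use):

```python
import gzip
import json

# Repetitive JSON payload standing in for a real bloated response.
payload = json.dumps({"sources": [{"content": "lorem ipsum " * 500}]}).encode()
compressed = gzip.compress(payload)
print(f"raw: {len(payload)} B, gzipped: {len(compressed)} B")
# Repetitive text compresses heavily; real payloads compress less, but still well.
assert len(compressed) < len(payload) // 5
```

Compression is complementary to sanitization, not a substitute: it reduces bytes on the wire but does nothing for the security/privacy leakage or the frontend parsing cost.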

Priority Justification

HIGH Priority because:

  1. Production Impact: Affects all conversation API calls
  2. Security Risk: Exposing internal logic and potentially sensitive content
  3. Performance Impact: 10-20x unnecessary bandwidth usage
  4. User Experience: Slower page loads, wasted mobile data
  5. Cost Impact: Higher CDN/network costs

Estimated Effort

  • Implementation: 4-6 hours
  • Testing: 2-3 hours
  • Review and deployment: 1-2 hours
  • Total: about one day (7-11 hours)

References

  • Current payload example: See user's browser network tab showing 10-20KB responses
  • Industry best practices: REST APIs should minimize payload size
  • Security principle: Never expose internal processing details to clients
