Skip to content

Latest commit

 

History

History
74 lines (53 loc) · 2.64 KB

File metadata and controls

74 lines (53 loc) · 2.64 KB

📚 RAG Layer Security

🎯 Definition

Retrieval-Augmented Generation (RAG) bridges the gap between static LLMs and dynamic data. The RAG layer includes vector databases, embedding models, and retrieval pipelines.


🏗️ The RAG Attack Surface

sequenceDiagram
    participant A as Attacker
    participant D as Vector DB
    participant R as Retriever
    participant L as LLM
    
    A->>D: Data Poisoning (Inject Docs)
    R->>D: Fetch Context
    D-->>R: Returns Malicious Docs
    R->>L: Context + Prompt
    L-->>A: Malicious Output / Hijack
Loading

🧨 Key Vulnerabilities

1. Data Poisoning (Injection at Source)

Attackers insert malicious documents into the knowledge base.

  • Payload: Hidden instructions, misinformation, or sensitive data.
  • Impact: Permanent alteration of the AI's "knowledge."

2. Retrieval Hijacking (Embedding Collisions)

Attackers craft documents whose embeddings are mathematically close to common user queries.

  • Payload: Documents that "win" the similarity search and get retrieved every time.
  • Impact: Dominating the AI's response window.

3. Context Injection (Semantic Hijacking)

Instructions embedded within retrieved text that override the system prompt.

  • Impact: Turning a "Helpful Assistant" into a "Malicious Proxy."

📊 RAG Integrity Metrics

Metric Ideal Score Risk Indicator
Retrieval Precision > 95% Irrelevant data retrieved
Poisoning Resistance High Unvalidated document ingestion
Source Diversity Balanced One source dominates context
Delimiter Consistency Strict Model confuses data with instructions

🧪 Testing Checklist

  • Ingestion Validation: Can I upload a PDF with hidden instructions?
  • Embedding Sensitivity: Does changing one word in a query drastically change retrieval?
  • Context Isolation: If a retrieved doc says "IGNORE ALL RULES", does the model comply?
  • Search Manipulation: Can I craft a document that is always retrieved for specific keywords?

🛡️ Mitigation Strategies

Tip

Trust but Verify: Treat all retrieved context as "Untrusted User Input."

  1. Semantic Filtering: Use a secondary LLM to scan retrieved chunks for instructions before passing them to the main model.
  2. Metadata Validation: Ensure retrieval only occurs from trusted, authenticated sources.
  3. Strict Delimiting: Use clear, non-spoofable boundaries between "Retrieved Context" and "System Instructions."
  4. Ranker Tuning: Use a cross-encoder to re-rank results and filter out low-confidence or anomalous matches.