This guide provides a single entry point for teams deploying Fred beyond a local developer setup.
It is intended for:
- DevOps / Platform engineers in charge of provisioning infrastructure (Kubernetes, databases, OpenSearch, object storage, etc.).
- Technical leads / architects who need to understand the main moving parts and dependencies.
For day-to-day developer onboarding, refer to the main README.md.
**Important: mandatory access bootstrap after a fresh install.**

Keycloak app roles do not replace team-level ReBAC rights.
Your deployment process must include a post-install automated bootstrap that assigns at least one owner or manager per team in ReBAC/OpenFGA.
This must not rely on manual first-use actions in the UI.
For the complete Fred access model and terminology, see REBAC.md.
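As an illustration, a Helm post-install hook (or a plain Kubernetes Job) can perform this bootstrap with the OpenFGA CLI. The sketch below is not shipped Fred tooling: the image, endpoint, store secret, and the `user`/`owner`/`team` names are assumptions to align with your own ReBAC model (see REBAC.md).

```yaml
# Hypothetical post-install bootstrap Job; adapt names to your OpenFGA model.
apiVersion: batch/v1
kind: Job
metadata:
  name: fred-rebac-bootstrap
  annotations:
    "helm.sh/hook": post-install   # run once after install, if you deploy with Helm
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: fga-bootstrap
          image: openfga/cli:latest
          env:
            - name: FGA_API_URL
              value: http://openfga:8080          # assumed in-cluster OpenFGA endpoint
            - name: FGA_STORE_ID
              valueFrom:
                secretKeyRef: { name: openfga-bootstrap, key: store-id }
          # Assign an initial owner to a team; user/relation/object are placeholders.
          args: ["tuple", "write", "user:alice", "owner", "team:platform"]
```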
This document focuses on production-like deployments where:
- Multiple users access Fred simultaneously.
- Data and documents must be persisted and backed up.
- External services (LLM providers, OpenSearch, object stores, IdP, etc.) are managed by a DevOps team.
It does not prescribe a specific orchestrator (Kubernetes, VMs, Docker Compose), but highlights the requirements that must be satisfied.
Fred is composed of three main runtime components:

- **Frontend UI** (`./frontend`)
  - React single-page application (Vite dev server in dev, static assets in prod).
  - Talks to the agentic backend via HTTP(S) and WebSocket.
- **Agentic backend** (`./agentic-backend`)
  - FastAPI application.
  - Hosts the multi-agent runtime (LangGraph + LangChain).
  - Integrates with:
    - LLM providers (OpenAI, Azure OpenAI, Ollama, etc.).
    - Storage: SQLite (dev/laptop default) or PostgreSQL (prod); OpenSearch is optional.
- **Knowledge Flow backend** (`./knowledge-flow-backend`)
  - FastAPI application focused on:
    - Document ingestion (PDF, DOCX, PPTX, CSV, etc.).
    - Chunking and vectorization.
    - Document search / retrieval APIs.
  - Integrates with:
    - LLM/embedding providers.
    - Either SQLite + ChromaDB (dev default) or PostgreSQL + pgvector (prod); OpenSearch is optional.
    - Optional object store (e.g., MinIO, S3) for raw documents.
In local dev, these run with no external data services (SQLite + ChromaDB embedded).
In production, you typically deploy:
- Frontend as static assets served by a reverse proxy (NGINX, ingress, etc.).
- `agentic-backend` and `knowledge-flow-backend` as separate services (Kubernetes deployments, ECS services, etc.).
- A shared persistence stack: PostgreSQL/pgvector or OpenSearch, plus an object store if you need externalized files.
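To make that shape concrete, here is a minimal Docker Compose sketch of the topology. Service names, image names/tags, and ports are illustrative assumptions, not shipped Fred artifacts; a Kubernetes layout follows the same split.

```yaml
# Illustrative topology only -- image names/tags and ports are placeholders.
services:
  frontend:
    image: nginx:1.27                           # serves the built SPA as static assets
    volumes:
      - ./frontend/dist:/usr/share/nginx/html:ro
    ports: ["8080:80"]
  agentic-backend:
    image: fred/agentic-backend:latest          # hypothetical image name
    env_file: ./agentic-backend/config/.env
    volumes:
      - ./agentic-backend/config/configuration.yaml:/app/config/configuration.yaml:ro
  knowledge-flow-backend:
    image: fred/knowledge-flow-backend:latest   # hypothetical image name
    env_file: ./knowledge-flow-backend/config/.env
    volumes:
      - ./knowledge-flow-backend/config/configuration.yaml:/app/config/configuration.yaml:ro
  postgres:
    image: pgvector/pgvector:pg16               # PostgreSQL + pgvector
    environment:
      POSTGRES_PASSWORD: ${FRED_POSTGRES_PASSWORD}
  minio:
    image: minio/minio:latest
    command: server /data
```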
Each backend has two configuration layers:

- `.env` files
  - Secrets and environment-specific credentials:
    - API keys for OpenAI / Azure / Ollama.
    - Database credentials (PostgreSQL and/or OpenSearch).
    - Object store keys.
  - Never committed to Git.
- `configuration.yaml` files
  - Functional / structural configuration:
    - Model providers, model names, temperature, timeouts.
    - Feature flags (frontend behavior, optional agents).
    - Backend integration options (SQLite vs PostgreSQL, OpenSearch optional).
Files of interest:

- `agentic-backend/config/.env`
- `agentic-backend/config/configuration.yaml`
- `knowledge-flow-backend/config/.env`
- `knowledge-flow-backend/config/configuration.yaml`

Password mapping to remember:

- Core Postgres (metadata/tags/vector, etc.) → `storage.postgres` → `FRED_POSTGRES_PASSWORD` (or `password:` in YAML).
- Tabular runtime → `storage.tabular_store` + shared `content_storage`.
  - In this recommended dataset-centric design, there is no dedicated tabular SQL database or `TABULAR_POSTGRES_PASSWORD`.
  - For MinIO/S3-compatible deployments, provide object-store credentials under `content_storage`; `storage.tabular_store` tunes artifact layout and query/runtime limits.
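As a sketch of how this mapping can look in `knowledge-flow-backend/config/configuration.yaml` (the individual key names under `postgres` and `content_storage` are assumptions; verify them against the shipped sample config):

```yaml
# Sketch only -- check key names against the shipped sample configuration.yaml.
storage:
  postgres:
    host: postgres.internal
    database: fred
    username: fred
    # Password is taken from FRED_POSTGRES_PASSWORD in config/.env;
    # a literal `password:` entry here also works but is discouraged in prod.
  tabular_store:
    artifacts_prefix: tabular/    # layout of Parquet artifacts in the object store
content_storage:
  # MinIO/S3-compatible credentials belong here (or in .env), not under tabular_store.
  endpoint: http://minio:9000
  bucket: fred-content
```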
For concrete examples of model configuration, see the main README.md.
In a production-like setup you will typically manage:

- **LLM & Embedding Providers**
  - OpenAI / Azure OpenAI.
  - Ollama (self-hosted models).
  - Azure APIM fronting Azure OpenAI.
  - Requirements: stable network access, quotas, and API keys / credentials.
- **Vector + metadata storage**
  - Dev default: SQLite + ChromaDB (embedded, zero external services).
  - Production: choose one:
    - PostgreSQL + pgvector (recommended, no OpenSearch dependency).
    - OpenSearch (only if your platform standardizes on OS/Lucene). Strict mapping and version requirements are enforced; see DEPLOYMENT_GUIDE_OPENSEARCH.md.
  - Both store document metadata and vectors; pick based on platform standards.
- **Object Storage (optional but recommended)**
  - MinIO, S3, or equivalent.
  - Holds raw ingested documents (PDF, DOCX, PPTX, CSV…).
  - Knowledge Flow stores metadata and vectors in your chosen backend (PostgreSQL/pgvector or OpenSearch), and content in the object store.
  - The same object store also holds tabular Parquet artifacts under the configured `storage.tabular_store.artifacts_prefix`.
  - Remote tabular queries read those artifacts through DuckDB `httpfs` using short-lived presigned URLs generated by Knowledge Flow.
  - This object-store path is the recommended tabular SQL mode.
- **Identity Provider (optional)**
  - Keycloak or another OIDC provider.
  - Used to harden authentication and authorization in multi-user environments.
  - See `docs/KEYCLOAK.md` for details when enabled.
This chapter explains how Fred configures HTTP clients used to communicate with LLM and embedding providers (OpenAI, Azure OpenAI, Azure APIM, Ollama), and how these settings should be tuned in production environments.
The goal is to ensure:
- Predictable latency and failure modes.
- Safe behavior under load (no hidden hangs).
- Efficient connection reuse.
- Consistent behavior across all model connectors.
Fred applies the following principles for HTTP communication with LLM providers:

- **Single shared connection pool per backend process**
  - Each backend (`agentic-backend`, `knowledge-flow-backend`) creates one shared HTTP client stack.
  - All chat, embedding, and vision models reuse this pool.
  - This makes concurrency, resource usage, and failures observable and controllable.
- **Explicit timeouts for all phases**
  - No implicit or library-default timeouts.
  - Separate budgets for:
    - Connection establishment
    - Reading responses
    - Writing requests
    - Waiting for a free connection from the pool
- **YAML-driven configuration**
  - Operators configure networking behavior in `configuration.yaml`.
  - Secrets (API keys, tokens) remain in `.env` / Kubernetes Secrets.
  - No code changes are required to adjust networking behavior.
- **Provider-agnostic behavior**
  - The same timeout and pooling concepts apply to:
    - OpenAI
    - Azure OpenAI
    - Azure APIM
    - Ollama (self-hosted)
  - Differences between providers are handled internally by Fred.
HTTP client settings are defined per model in `configuration.yaml`, under the `settings` section.

Example (OpenAI chat model):

```yaml
default_chat_model:
  provider: openai
  name: gpt-4o-mini
  settings:
    temperature: 0.0
    max_retries: 1
    timeout:
      connect: 10.0
      read: 120.0
      write: 30.0
      pool: 5.0
    http_client_limits:
      max_connections: 200
      max_keepalive_connections: 50
      keepalive_expiry_seconds: 10
```

Although the settings are declared per model, the first model initialized defines the shared HTTP client pool for the entire backend process.
Subsequent models reuse the same pool.
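For example, an embedding model declared alongside the chat model above will reuse that pool rather than creating its own. A minimal sketch, assuming the chat model is initialized first (the model key and name are illustrative):

```yaml
# Reuses the HTTP pool created for default_chat_model (initialized first);
# http_client_limits declared here would not create a second pool.
default_embedding_model:
  provider: openai
  name: text-embedding-3-small
  settings:
    timeout:
      connect: 10.0
      read: 60.0
      write: 30.0
      pool: 5.0
```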
Without explicit timeouts, production systems can:
- Hang indefinitely on network issues.
- Exhaust worker threads under load.
- Become impossible to diagnose.
Fred uses four distinct timeout phases, mapped directly to the underlying HTTP client:
| Timeout key | Meaning |
|---|---|
| `connect` | Maximum time to establish a TCP/TLS connection |
| `read` | Maximum time waiting for response data |
| `write` | Maximum time to send the request |
| `pool` | Maximum time waiting for a free connection from the pool |
Recommended baseline:

```yaml
timeout:
  connect: 10.0
  read: 120.0
  write: 30.0
  pool: 5.0
```

- `connect` (10 s): Handles cold DNS/TLS paths without blocking indefinitely.
- `read` (120 s): Allows long generations, tool calls, and large RAG responses.
- `write` (30 s): Prevents stalled uploads.
- `pool` (5 s): Ensures overload fails fast instead of silently queueing forever.
Avoid a single flat timeout such as `request_timeout: 30` in production.
It hides the root cause of failures and behaves poorly under concurrency.
Fred uses a shared HTTP connection pool to control concurrency and resource usage.
Configuration example:
```yaml
http_client_limits:
  max_connections: 200
  max_keepalive_connections: 50
  keepalive_expiry_seconds: 10
```

| Parameter | Description |
|---|---|
| `max_connections` | Maximum concurrent HTTP connections |
| `max_keepalive_connections` | Maximum idle keep-alive connections |
| `keepalive_expiry_seconds` | Time before idle connections are closed |
| Scenario | `max_connections` | `max_keepalive_connections` |
|---|---|---|
| Small deployment (1–2 replicas) | 50–100 | 10–20 |
| Medium deployment | 100–300 | 20–50 |
| High concurrency | 300–500 | 50–100 |
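When sizing, remember the limits apply per backend process: total connections to the provider scale with replica count. A sketch for a medium deployment running 3 replicas:

```yaml
# 3 replicas x 100 max_connections = up to 300 concurrent provider connections.
# Divide your provider-side connection/rate budget by replica count
# to pick per-pod limits.
http_client_limits:
  max_connections: 100
  max_keepalive_connections: 20
  keepalive_expiry_seconds: 10
```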
- Keep-alives should generally be enabled in production.
- Set `max_keepalive_connections: 0` only if you have a known issue with:
  - Reverse proxies
  - Azure APIM connection reuse
  - Load balancers that mishandle keep-alives

If you disable keep-alives, expect higher latency and CPU usage.
Fred supports `max_retries` at the model level:

```yaml
settings:
  max_retries: 0
```

- Default: `max_retries: 1`
- Avoid high retry counts:
  - They amplify load during incidents.
  - They increase tail latency.
- Prefer fast failure + retry at higher layers (agent logic, UI retry).

Retries apply only to transient network or provider errors, not semantic failures.
For Ollama, Fred does not inject a shared HTTP client instance.
Instead, it passes timeout and limit values directly to the Ollama client.
Example:

```yaml
chat_model:
  provider: ollama
  name: qwen2.5:3b-instruct
  settings:
    base_url: http://ollama:11434
    temperature: 0.0
    timeout:
      connect: 5.0
      read: 300.0
      write: 30.0
      pool: 5.0
```

- Ollama calls may require longer read timeouts, especially for larger local models.
- Connection limits are less critical for Ollama but still enforced for safety.
- Tune per environment:
  - Dev: lower limits, shorter timeouts.
  - Prod: higher limits, explicit pool timeout.
- Align with replica count:
  - More replicas → lower `max_connections` per pod.
- Monitor failures:
  - Timeouts are expected failure modes; observe and tune, don't eliminate them.
- Document deviations:
  - If you disable keep-alives or increase timeouts significantly, document why.
Fred is well equipped with KPIs and provides tools for developers and DevOps to stress-test and tune it on a local laptop as well as on deployed integration or production platforms.
- Scraping endpoint: Prometheus-format metrics are exposed by default on `http://<pod-ip>:9000/metrics` (override via `app.metrics_address` / `app.metrics_port` in the YAML). Point Prometheus or `curl` there to scrape.
- Key KPI names you'll see: `chat.exchange_latency_ms`, `chat.phase_latency_ms` (with `agent_step` dims such as `session_get_create`, `agent_init`, `history_restore`, `stream`, `persist`, `total`), `chat.user_message_total`, `agent.cache_*`, plus process gauges like `process.cpu.percent`, `process.memory.rss_mb`, `process.open_fds`.
- Bench-mode logging: For local load tests with `ws_bench`, you can enable verbose KPI summaries and process metrics via config, e.g.:

```yaml
app:
  kpi_log_summary_interval_sec: 5   # >0 to enable; 0 to disable (default in prod)
  kpi_log_summary_top_n: 20         # 0 to log all
  kpi_process_metrics_interval_sec: 10
```
In production keep these at 0 to avoid extra log volume; Kubernetes scraping covers runtime metrics.
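For reference, a minimal Prometheus scrape configuration covering both backends might look like this. The job names and service DNS names are assumptions for your environment; the ports come from the defaults above and from the knowledge-flow section below.

```yaml
# Sketch: adjust targets to your service discovery (static targets shown).
scrape_configs:
  - job_name: fred-agentic
    static_configs:
      - targets: ["agentic-backend:9000"]
  - job_name: fred-knowledge-flow
    static_configs:
      - targets: ["knowledge-flow-backend:9111"]
```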
If Grafana is fed by Prometheus and you want to avoid extra OpenSearch traffic:

- Set `storage.kpi_store` to log mode.
- Keep `app.metrics_enabled: true` (Prometheus scraping).
- Keep KPI summary/process log emitters disabled in production (`0`).
Recommended baseline:

```yaml
app:
  metrics_enabled: true
  kpi_process_metrics_interval_sec: 0
  kpi_log_summary_interval_sec: 0.0
  kpi_log_summary_top_n: 0
storage:
  kpi_store:
    type: "log"
    level: "DEBUG"
```

Important behavior:

- Prometheus metrics are still exported and usable in Grafana.
- `/kpi/query` does not return persisted historical KPI rows in this mode.
Local laptop foreground profile (`configuration.yaml`):

- Keep `app.metrics_enabled: false` to avoid local port collisions.
- Keep KPI summaries enabled for quick feedback: `kpi_log_summary_interval_sec: 10.0`, `kpi_log_summary_top_n: 20`; see the sketch below.
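Expressed as YAML, that laptop profile is simply:

```yaml
app:
  metrics_enabled: false             # avoid local port collisions
  kpi_log_summary_interval_sec: 10.0 # periodic KPI summaries in the console
  kpi_log_summary_top_n: 20
```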
| Metric name | Type | Important dimensions | Meaning / usage |
|---|---|---|---|
| `chat.user_message_total` | counter | `actor_type`, `user_id`, `agent_id` | Count of user messages ingested |
| `chat.phase_latency_ms` | gauge | `agent_step`, `status` | Per-phase latency (steps like `session_get_create`, `agent_init`, `history_restore`, `stream`, `persist`, `total`) |
| `chat.exchange_latency_ms` | timer | `status`, `agent_id`, `session_id` | End-to-end latency per exchange |
| `chat.exchange_total` | counter | `status`, `agent_id`, `session_id` | Count of exchanges |
| `agent.cache_lookup_total` | counter | `result` (hit/miss) | Cache lookups |
| `agent.cache_entries` | gauge | — | Entries in agent cache |
| `agent.cache_inflight_total` | gauge | — | In-flight cache ops |
| `agent.cache_evictions_total` | gauge | — | Cache evictions |
| `agent.cache_blocked_evictions_total` | gauge | — | Blocked evictions |
| `process.cpu.percent` | gauge | `actor_type=system` | Process CPU percent |
| `process.memory.rss_mb` | gauge | `actor_type=system` | Resident memory in MB |
| `process.memory.vms_mb` | gauge | `actor_type=system` | Virtual memory in MB |
| `process.memory.rss_percent` | gauge | `actor_type=system` | RSS as a fraction of the container limit |
| `process.memory.limit_mb` | gauge | `actor_type=system` | Memory limit seen by the process |
| `process.open_fds` | gauge | `actor_type=system` | Open file descriptors |
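These KPIs plug directly into Prometheus alerting. A sketch, assuming the exporter maps dotted KPI names to underscores (e.g. `process.memory.rss_percent` → `process_memory_rss_percent`); verify the exact exposed names on `/metrics` before using it:

```yaml
# Sketch: alert when the agentic backend nears its container memory limit.
groups:
  - name: fred-agentic
    rules:
      - alert: FredHighMemory
        expr: process_memory_rss_percent{actor_type="system"} > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Fred agentic backend is using >90% of its memory limit"
```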
- Scraping endpoint: `http://<pod-ip>:9111/metrics` by default (see `app.metrics_port` in the knowledge-flow config).
- Key KPI names (knowledge-flow backend):
| Metric name | Type | Important dimensions | Meaning / usage |
|---|---|---|---|
| `api.request_latency_ms` | timer | `route`, `method`, `status` | API latency for ingestion endpoints |
| `ingestion.document_duration_ms` | timer | `file_type`, `status`, `source` | Per-document end-to-end ingestion time |
| `rag.search_latency_ms` | timer | `index`, `user_id`, `agent_id` (dims vary) | Latency of RAG search |
| `rag.search_total` | counter | `status` (implicit via dims) | Count of RAG searches |
| `rag.search_hits_total` | counter | — | Documents returned across searches |
| `rag.search_hit_ratio` | gauge | — | Hit ratio for searches |
| `rag.search_empty_total` | counter | — | Searches with no results |
| `rag.search_error_total` | counter | — | Failed searches |
| `rag.search_top_k_total` | counter | — | Top-K docs retrieved |
| `rag.rerank_latency_ms` | timer | — | Latency of the reranking step |
| `rag.rerank_total` | counter | — | Rerank operations |
| `rag.rerank_docs_total` | counter | — | Docs evaluated in rerank |
| `rag.rerank_top_r_total` | counter | — | Docs selected after rerank |
Bench-mode summary logging can also be enabled for knowledge-flow via the same app settings as above (use non-zero values in a bench YAML; keep 0 in production).
Use two distinct paths:

- Operational logging path: write application logs to `stdout` and forward them with your platform collector (Fluent Bit, Vector, OpenTelemetry Collector, etc.) to Loki/Grafana; a collector sketch follows below.
- In-app query path: keep `storage.log_store` as `in_memory` to serve Fred UI log panels and LogGenius (`/logs/query`) without extra external writes.
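As one example of the operational path, a Vector pipeline can ship container stdout to Loki. This is a generic sketch, not Fred-specific configuration; the label selector, Loki endpoint, and label template are assumptions for your cluster.

```yaml
# vector.yaml -- forward Kubernetes container logs (stdout) to Loki.
sources:
  fred_pods:
    type: kubernetes_logs
    extra_label_selector: "app in (agentic-backend, knowledge-flow-backend)"
sinks:
  loki:
    type: loki
    inputs: [fred_pods]
    endpoint: http://loki:3100
    encoding:
      codec: json
    labels:
      app: "{{ kubernetes.pod_labels.app }}"
```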
Recommended baseline:

```yaml
storage:
  log_store:
    type: "in_memory"
```

Important behavior:

- App logs are still emitted with normal levels (`DEBUG`/`INFO`/`WARNING`/`ERROR`) to console/stdout.
- `log_store: in_memory` keeps only a recent per-process buffer (capacity 1000 events), meant for troubleshooting and UI queries.
- If `log_store` is set to OpenSearch, each log event is additionally indexed remotely by the app, which increases write traffic.

Current recommendation: keep `log_store: in_memory` in Fred backend configs and rely on stdout forwarding for production retention/search.