From f17d5d56a9c04c1d9a14da26fa8f13f77719e71a Mon Sep 17 00:00:00 2001
From: Dimos Tzimotoudis <dtzimotoudis@gmail.com>
Date: Thu, 14 May 2026 13:35:56 +0000
Subject: [PATCH] Update docs for refactored workflow

---
 README.md                               | 15 +++-----
 docs/configuration/cli-options.md       | 18 ++++++---
 docs/configuration/custom-prompts.md    |  5 ++-
 docs/configuration/environment.md       |  5 +++
 docs/configuration/providers.md         | 31 +++++++++++----
 docs/developer/gpu-settings.md          |  6 ++-
 docs/getting-started/installation.md    | 18 +++++----
 docs/getting-started/quick-start.md     |  9 +++--
 docs/how-it-works/architecture.md       | 13 ++++---
 docs/how-it-works/evaluation.md         |  2 +-
 docs/how-it-works/iteration-planning.md | 15 ++++----
 docs/index.md                           |  2 +-
 docs/reference/foundation-models.md     | 38 +++++++++----------
 docs/reference/workspace-structure.md   | 50 ++++++++++++-------------
 docs/user-guide/datasets.md             | 10 ++---
 docs/user-guide/forking.md              |  2 +-
 docs/user-guide/inference.md            |  8 ++--
 docs/user-guide/outputs.md              | 17 +++++----
 docs/user-guide/running-agent.md        | 20 +++++++---
 docs/user-guide/training.md             |  2 +-
 run.sh                                  |  4 +-
 21 files changed, 164 insertions(+), 126 deletions(-)

diff --git a/README.md b/README.md
index adb94a56..82953c31 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@ Made for biomedical data, Agentomics outperformed human experts and created new
 
 How it works
 1) Input is a CSV training dataset + optional data description
-2) Agentomics autonomously experments with various ML models and strategies
+2) Agentomics autonomously experiments with various ML models and strategies
 3) Output is a trained model ready for inference and a detailed PDF report summarizing the development process and achieved metrics
 
 For more details see: [preprint](https://www.biorxiv.org/content/10.64898/2026.01.27.702049v1)
@@ -25,7 +25,7 @@ For more details see: [preprint](https://www.biorxiv.org/content/10.64898/2026.0
 git clone https://github.com/BioGeMT/agentomics-ml.git
 cd agentomics-ml
 cp .env.example .env
-# Edit .env and set at least one API key (OPENROUTER_API_KEY or OPENAI_API_KEY)
+# Edit .env and set at least one provider key, such as OPENROUTER_API_KEY or OPENAI_API_KEY
 
 # Download example dataset
 ./scripts/download_example_dataset.sh
@@ -35,7 +35,7 @@ cp .env.example .env
 
 Recommended model: `gpt-5.1-codex-max`
 
-Outputs are saved to `outputs/<agent_id>/`, including PDF reports in `outputs/<agent_id>/pdf_reports`.
+Outputs are saved to `outputs/<agent_id>/`, including PDF reports in `outputs/<agent_id>/reports/pdf/`.
 
 ### Installation Requirements
 
@@ -52,9 +52,9 @@ For more details visit **https://biogemt.github.io/agentomics-ml/**
 - Generic: Agentomics can crunch any classification and regression datasets in CSV format.
 - Secure: Agents execute code securely in Docker with read-only mounts to your file system and are only allowed to write in a Docker Volume.
 - Reproducible: Outputs include models, scripts, and conda environments needed to run inference or re-train models with one bash command.
-- Trustworthy: If you provide a test set, Agentomics fully abstracts LLMs from accessing it, allowing you to rely on programmaticly computed and reported test set metrics.
-- Foundation models: Agentomics can leverage foundation models from huggingface for both embeddings and fine-tuning.
-- Various LLM providers: OpenAI, OpenRouter, or local models via Ollama
+- Trustworthy: If you provide a test set, Agentomics keeps it hidden from the LLM and reports programmatically computed test metrics.
+- Foundation models: Agentomics can leverage foundation models from Hugging Face for both embeddings and fine-tuning.
+- Various LLM providers: OpenAI, Anthropic, OpenRouter, Codex/ChatGPT OAuth, or local models via Ollama.
 - Reliability: Thanks to our functional validators, Agentomics creates a working model 100% of the time (when using recommended settings).
 
 ## Roadmap
@@ -62,7 +62,6 @@ Agentomics is in active development. We welcome any raised Issues and suggestion
 
 Features coming soon:
 - Support for any data type (currently only CSV datasets)
-- Run forking and continuing
 - Better local model support and configuration
 - Remote GPU support for GCP
 
@@ -80,5 +79,3 @@ bioRxiv (preprint) https://www.biorxiv.org/content/10.64898/2026.01.27.702049v1
 ## License
 
 MIT. See `LICENSE`.
-
-
diff --git a/docs/configuration/cli-options.md b/docs/configuration/cli-options.md
index 141e07f8..9fb68cdc 100644
--- a/docs/configuration/cli-options.md
+++ b/docs/configuration/cli-options.md
@@ -7,10 +7,13 @@ Complete reference for `run.sh` command-line options.
 | Option | Description | Default |
 |--------|-------------|---------|
 | `--model <name>` | LLM model to use | Interactive selection |
+| `--provider <name>` | Provider to use when multiple providers are configured | Prompted if multiple providers are available in an interactive terminal |
 | `--dataset <name>` | Dataset name | Interactive selection |
 | `--iterations <n>` | Number of iterations | Prompted in interactive mode (default 5) |
 | `--val-metric <metric>` | Validation metric to optimize | Task-based default (`AUROC` for classification, `MAE` for regression) |
+| `--task-type <classification\|regression>` | Task type to use while preparing the selected raw dataset | Dataset config or interactive preparation prompt |
 | `--timeout <seconds>` | Time limit for entire run | None |
+| `--split-timeout <seconds>` | Time limit after which the agent may no longer change the train/validation split | None |
 | `--run-python-timeout <seconds>` | Timeout in seconds for each run_python tool execution - this will determine the maximum training time | `21600` (6 hours) |
 
 The run stops when either the iteration count is reached or the timeout expires.
@@ -22,7 +25,9 @@ The run stops when either the iteration count is reached or the timeout expires.
 | `--build-images` | Build Docker images locally |
 | `--local` | Run without Docker (uses conda) |
 | `--cpu-only` | Disable GPU acceleration |
-| `--ollama` | Use local Ollama for LLM |
+| `--ollama` | Enable Docker host networking for a host Ollama server |
+| `--test` | Run the integrated test suite in Docker mode |
+| `--all-iterations-test` | Evaluate every archived iteration on the held-out test set after the run |
 
 ## Listing Options
 
@@ -39,14 +44,14 @@ The run stops when either the iteration count is reached or the timeout expires.
 |--------|-------------|
 | `--user-prompt <text>` | Custom prompt for the agent |
 | `--iteration-plan-model <name>` | LLM model used for generating the iteration plan (defaults to `--model`) |
-| `--foundation-model-type <type>` | Pre-download foundation models (`dna`, `rna`, `protein`, `molecule`, `all`) |
+| `--foundation-models-type <type>` | Enable foundation models (`dna`, `rna`, `protein`, `molecule`, `all`) |
 | `--use-provisioning-key` | Use OpenRouter temporary API key |
 | `--spend-limit <n>` | Spend limit for provisioning key (requires `--use-provisioning-key`) |
 | `--verbosity <summary\|full>` | How much agent interaction detail is printed during the run (default: `full`) |
 | `--disable-training-reporting` | Disable the TrainingReporter helper that emits structured training progress updates from the agent's training script |
 | `--split-allowed-iterations <n>` | Iterations that can modify train/val split (default 1) |
 | `--exploration-iterations <n>` | Baseline exploration iterations (default 4) |
-| `--run-python-timeout <seconds>` | Per-training timeout for `run_python` tool (default 21600) |
+| `--tags <tag...>` | Space-separated tags for W&B logging |
 
 ## Forking
 
@@ -95,7 +100,8 @@ See [Forking a Run](../user-guide/forking.md) for a full guide and examples.
 ### Using Ollama
 
 ```bash
-./run.sh --ollama
+export OLLAMA_BASE_URL=http://localhost:11434/v1
+./run.sh --ollama --provider ollama --model llama3.1 --dataset my_data
 ```
 
 ### CPU Only
@@ -104,10 +110,10 @@ See [Forking a Run](../user-guide/forking.md) for a full guide and examples.
 ./run.sh --cpu-only --model openai/gpt-4 --dataset my_data
 ```
 
-### Pre-download Foundation Models
+### Enable Foundation Models
 
 ```bash
-./run.sh --foundation-model-type protein --model openai/gpt-4
+./run.sh --foundation-models-type protein --model openai/gpt-4 --dataset my_data
 ```
 
 ### Run with locally built Docker images
diff --git a/docs/configuration/custom-prompts.md b/docs/configuration/custom-prompts.md
index 6f5df252..a31bea2d 100644
--- a/docs/configuration/custom-prompts.md
+++ b/docs/configuration/custom-prompts.md
@@ -50,7 +50,7 @@ Without customization, the agent uses:
 
 ## What Custom Prompts Affect
 
-The user prompt influences all agent steps:
+The user prompt influences the agentic steps:
 
 | Step | How It's Used |
 |------|---------------|
@@ -61,7 +61,8 @@ The user prompt influences all agent steps:
 | Model Architecture | Model selection and design |
 | Model Training | Training approach and hyperparameters |
 | Model Inference | Prediction pipeline design |
-| Validation Evaluation | What success criteria matter most |
+
+Validation evaluation itself is deterministic: it runs the generated inference script on train/validation data and scores the configured `--val-metric`.
 
 ## Prompt Tips
 
diff --git a/docs/configuration/environment.md b/docs/configuration/environment.md
index 50cdf8a1..4307c84b 100644
--- a/docs/configuration/environment.md
+++ b/docs/configuration/environment.md
@@ -30,6 +30,9 @@ At least one API key is required:
 |----------|----------|---------|
 | `OPENROUTER_API_KEY` | OpenRouter | [openrouter.ai](https://openrouter.ai/) |
 | `OPENAI_API_KEY` | OpenAI | [platform.openai.com](https://platform.openai.com/) |
+| `ANTHROPIC_API_KEY` | Anthropic | [console.anthropic.com](https://console.anthropic.com/) |
+
+The Codex provider uses `codex login` and reads `~/.codex/auth.json`. Ollama does not use an API key, but set `OLLAMA_BASE_URL` so the provider is selectable.
 
 ### Provisioning Key (Optional)
 
@@ -115,6 +118,7 @@ export CUDA_VISIBLE_DEVICES=0,1
 # LLM Provider (choose one or more)
 OPENROUTER_API_KEY=sk-or-v1-xxxxxxxxxxxx
 # OPENAI_API_KEY=sk-xxxxxxxxxxxx
+# ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxx
 
 # Weights & Biases (optional)
 WANDB_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxx
@@ -128,6 +132,7 @@ WANDB_ENTITY=my-team
 
 # Ollama (optional)
 # Configure in src/utils/providers/configured_providers.yaml
+# OLLAMA_BASE_URL=http://localhost:11434/v1
 ```
 
 ## Security Notes
diff --git a/docs/configuration/providers.md b/docs/configuration/providers.md
index 4096cc8e..fa582a8c 100644
--- a/docs/configuration/providers.md
+++ b/docs/configuration/providers.md
@@ -8,6 +8,7 @@ Agentomics-ML supports multiple LLM providers out of the box.
 |----------|---------------------|--------|
 | [OpenRouter](https://openrouter.ai/) | `OPENROUTER_API_KEY` | 100+ models |
 | [OpenAI](https://openai.com/) | `OPENAI_API_KEY` | Use `--list-models` to see available models |
+| [Anthropic](https://anthropic.com/) | `ANTHROPIC_API_KEY` | Claude models available to your account |
 | OpenAI Codex | `codex login` | Uses your local Codex/ChatGPT login |
 | [Ollama](https://ollama.ai/) | Local setup | Local models |
 
@@ -60,6 +61,21 @@ Use `./run.sh --list-models` to see what your API key can access.
 
 ---
 
+## Anthropic
+
+Direct access to Anthropic models.
+
+### Setup
+
+```bash
+export ANTHROPIC_API_KEY="sk-ant-xxxxxxxxxxxx"
+./run.sh --provider anthropic --list-models
+```
+
+Use `--provider anthropic` explicitly when other provider keys are also set.
+
+---
+
 ## Codex (ChatGPT OAuth)
 
 Experimental support for the local Codex CLI login flow.
@@ -76,7 +92,7 @@ Then run Agentomics with the `codex` provider:
 
 ```bash
 ./run.sh --provider codex --list-models
-./run.sh --provider codex --model gpt-5.4
+./run.sh --provider codex --model gpt-5.4 --dataset my_data
 ```
 
 This provider reads your local Codex auth state from `~/.codex/auth.json` and
@@ -98,15 +114,16 @@ Run models locally for privacy or offline use.
 
 ### Docker Mode (Recommended)
 
-Run with:
+Set `OLLAMA_BASE_URL` so Agentomics considers Ollama available, then run with:
 
 ```bash
-./run.sh --ollama
+export OLLAMA_BASE_URL=http://localhost:11434/v1
+./run.sh --ollama --provider ollama --model <ollama-model> --dataset <dataset>
 ```
 
 Docker mode connects to the Ollama base URL defined in
 `src/utils/providers/configured_providers.yaml`
-(default: `http://host.docker.internal:11434/v1`).
+(default: `http://localhost:11434/v1`) and uses host networking when `--ollama` is passed.
 Ensure your Ollama server is reachable from the host at `:11434`.
 
 ### Local Mode
@@ -115,7 +132,8 @@ For local mode, set the Ollama base URL in `src/utils/providers/configured_provi
 to `http://localhost:11434/v1`, then run:
 
 ```bash
-./run.sh --local
+export OLLAMA_BASE_URL=http://localhost:11434/v1
+./run.sh --local --provider ollama --model <ollama-model> --dataset <dataset>
 ```
 
 ### Popular Models
@@ -197,5 +215,4 @@ Ensure Ollama is running:
 ollama list  # Should show pulled models
 ```
 
-For Docker mode, verify that `host.docker.internal:11434` is reachable from
-containers (run with `./run.sh --ollama`).
+For Docker mode, run with `--ollama` so the container uses host networking, and verify the configured Ollama URL is reachable on the host.
diff --git a/docs/developer/gpu-settings.md b/docs/developer/gpu-settings.md
index c655d069..d6aacc43 100644
--- a/docs/developer/gpu-settings.md
+++ b/docs/developer/gpu-settings.md
@@ -122,12 +122,14 @@ Agentomics-ML supports multi-GPU training:
 - Agent-generated scripts may use DataParallel or DistributedDataParallel
 - All available GPUs are passed to containers by default
 
-To limit GPUs:
+To limit GPUs in local mode:
 
 ```bash
-CUDA_VISIBLE_DEVICES=0,1 ./run.sh  # Use only first 2 GPUs
+CUDA_VISIBLE_DEVICES=0,1 ./run.sh --local  # Use only first 2 GPUs
 ```
 
+In Docker mode, Agentomics passes all available GPUs to the container; selecting a subset requires running Docker manually with custom GPU flags.
+
 ## Docker GPU Flags
 
 When running containers manually, you can limit GPUs with Docker flags:
diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md
index 1b27c93b..bb6ed070 100644
--- a/docs/getting-started/installation.md
+++ b/docs/getting-started/installation.md
@@ -24,7 +24,7 @@ cd Agentomics-ML
 ```bash
 # Create a .env file (required for Docker mode)
 cp .env.example .env
-# Edit .env and set at least one API key
+# Edit .env and set at least one provider key
 
 # Run with pre-built images
 ./run.sh
@@ -47,13 +47,13 @@ The images will be downloaded automatically on first run. All subsequent runs wi
 ```bash
 # Create a .env file (required for Docker mode)
 cp .env.example .env
-# Edit .env and set at least one API key
+# Edit .env and set at least one provider key
 
 # Run while building images locally
 ./run.sh --build-images
 ```
 
-On first run, you'll be prompted to build the Docker images. This takes a few minutes but only needs to be done once.
+With `--build-images`, the Docker images are built locally before the run starts. This takes a few minutes but only needs to be repeated when dependencies or Dockerfiles change.
 
 ---
 
@@ -103,15 +103,16 @@ Run with local models using Ollama for privacy or offline use.
 
 ### Docker Mode Setup
 
-1. Ensure Ollama listens on the host (e.g., `0.0.0.0:11434`).
-2. Run with the `--ollama` flag:
+1. Ensure Ollama is running on the host.
+2. Make the Ollama provider selectable and choose it explicitly:
 
     ```bash
-    ./run.sh --ollama
+    export OLLAMA_BASE_URL=http://localhost:11434/v1
+    ./run.sh --ollama --provider ollama --model <ollama-model> --dataset <dataset>
     ```
 
 Docker mode connects to the URL configured in `src/utils/providers/configured_providers.yaml`
-(default: `http://host.docker.internal:11434/v1`).
+(default: `http://localhost:11434/v1`) and uses host networking when `--ollama` is passed.
 
 ### Local Mode Setup
 
@@ -119,7 +120,8 @@ For local mode, set the Ollama base URL in `src/utils/providers/configured_provi
 to `http://localhost:11434/v1`, then run:
 
 ```bash
-./run.sh --local
+export OLLAMA_BASE_URL=http://localhost:11434/v1
+./run.sh --local --provider ollama --model <ollama-model> --dataset <dataset>
 ```
 
 ---
diff --git a/docs/getting-started/quick-start.md b/docs/getting-started/quick-start.md
index 6b6a31dd..d4464e6c 100644
--- a/docs/getting-started/quick-start.md
+++ b/docs/getting-started/quick-start.md
@@ -5,7 +5,7 @@ Get Agentomics-ML running in under 5 minutes using pre-built Docker images.
 ## Prerequisites
 
 - [Docker](https://docs.docker.com/get-docker/) installed and running
-- An API key from [OpenRouter](https://openrouter.ai/) or [OpenAI](https://platform.openai.com/)
+- An API key from a configured provider, such as [OpenRouter](https://openrouter.ai/) or [OpenAI](https://platform.openai.com/)
 
 ## Steps
 
@@ -22,8 +22,8 @@ Docker mode requires a `.env` file in the repo root.
 
 ```bash
 cp .env.example .env
-# Edit .env and set at least one API key:
-# OPENROUTER_API_KEY or OPENAI_API_KEY
+# Edit .env and set at least one provider key:
+# OPENROUTER_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY
 ```
 
 ### 3. Run the Agent
@@ -39,7 +39,8 @@ The agent will prompt you to:
 1. **Select a model** - Choose from available LLMs
 2. **Select a dataset** - Use your own or download examples
 3. **Configure iterations** - How many optimization cycles to run
-4. **Choose validation metric** - see `./run.sh --list-metrics`
+
+The validation metric defaults to `AUROC` for classification and `MAE` for regression. To choose one explicitly, pass `--val-metric`; see `./run.sh --list-metrics`.
 
 ## Using Your Own Dataset
 
diff --git a/docs/how-it-works/architecture.md b/docs/how-it-works/architecture.md
index a97a1b89..e21f63e9 100644
--- a/docs/how-it-works/architecture.md
+++ b/docs/how-it-works/architecture.md
@@ -49,11 +49,12 @@ The agent analyzes the dataset:
 
 ### 3. Data Split
 
-Creates or modifies train/validation split:
+Creates, reuses, or modifies train/validation split files:
 
 - Stratified splitting for classification
 - Considers data distribution
 - May adjust split based on previous iterations
+- Reuses a supplied `validation.csv` when present
 
 **Output:** Captured in structured outputs and iteration reports
 
@@ -144,7 +145,7 @@ Each step uses an LLM agent that:
 1. Receives context (data info, previous results, and the iteration plan when applicable)
 2. Generates a structured output (validated by Pydantic)
 3. Validates the output meets requirements
-4. Retries if validation fails (up to 10 times)
+4. Retries if validation fails (up to 5 times by default)
 
 ## Iteration Flow
 
@@ -186,11 +187,11 @@ Key architecture parameters in `src/utils/config.py`:
 
 | Parameter | Default | Description |
 |-----------|---------|-------------|
-| `temperature` | 1.0 | LLM creativity level |
+| `temperature` | 0.7 | LLM creativity level |
 | `max_steps` | 100 | Max steps per agent |
-| `max_validation_retries` | 10 | Output validation retries |
-| `llm_response_timeout` | 900s | LLM response timeout |
-| `bash_tool_timeout` | 300s | Bash command timeout |
+| `max_validation_retries` | 5 | Output validation retries |
+| `llm_response_timeout` | 600s | LLM response timeout |
+| `bash_tool_timeout` | 180s | Bash command timeout |
 | `run_python_tool_timeout` | 21600s | Training timeout (6 hours, configurable via `--run-python-timeout`) |
 
 ## Next Steps
diff --git a/docs/how-it-works/evaluation.md b/docs/how-it-works/evaluation.md
index 3862100b..a64a57f1 100644
--- a/docs/how-it-works/evaluation.md
+++ b/docs/how-it-works/evaluation.md
@@ -8,7 +8,7 @@ Models are evaluated at multiple stages:
 
 | Stage | Data Used | Purpose |
 |-------|-----------|---------|
-| **Dry Run** | Small sample | Validate inference script works |
+| **Dry Run** | Prepared training data without labels | Validate inference script shape and metrics compatibility |
 | **Validation** | Validation set | Guide optimization |
 | **Train** | Training set | Detect overfitting |
 | **Test** | Hidden test set | Final unbiased evaluation |
diff --git a/docs/how-it-works/iteration-planning.md b/docs/how-it-works/iteration-planning.md
index 33f79566..d8676fbf 100644
--- a/docs/how-it-works/iteration-planning.md
+++ b/docs/how-it-works/iteration-planning.md
@@ -4,12 +4,12 @@ The iteration-plan step guides improvements between iterations.
 
 ## How It Works
 
-After each iteration completes:
+At the start of each iteration:
 
-1. **Results Collected** - Metrics, model outputs, and analysis from all steps
-2. **Plan Generated** - A separate LLM analyzes what worked and what didn't
-3. **Instructions Created** - Specific guidance for each step in the next iteration
-4. **Next Iteration Starts** - Agents receive the iteration plan as part of their context
+1. **Archived Results Loaded** - Metrics, model outputs, split history, and analysis from prior iterations
+2. **Plan Generated** - The iteration-planning model analyzes what worked and what did not
+3. **Instructions Created** - Specific guidance is produced for the remaining steps in the current iteration
+4. **Step Agents Run** - Later steps receive the iteration plan as part of their context
 
 ## Iteration-Plan Structure
 
@@ -120,11 +120,11 @@ For early iterations, the agent may suggest split changes:
 > Validation set may be too small for reliable metrics.
 > Consider increasing validation size to 30% for better estimates.
 
-After `--split-allowed-iterations`, splits are frozen to ensure fair comparison.
+After `--split-allowed-iterations`, or after `--split-timeout` when set, splits are frozen to ensure fair comparison.
 
 ## Iteration-Planning Model
 
-The iteration-planning step uses the same LLM as the main agent.
+The iteration-planning step uses `--iteration-plan-model` when provided. If that flag is omitted, the step falls back to the model given by `--model`.
 
 ## Viewing Iteration Plans
 
@@ -145,6 +145,7 @@ Each report shows:
 |-----------|---------|-------------|
 | `--exploration-iterations` | 4 | Iterations for broad exploration |
 | `--split-allowed-iterations` | 1 | Iterations that can modify splits |
+| `--split-timeout` | None | Time deadline after which splits cannot change |
 
 ## Next Steps
 
diff --git a/docs/index.md b/docs/index.md
index 9ffe96fa..2cc6a85e 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -30,7 +30,7 @@ Agentomics-ML works like an ML engineer:
 
 | Feature | Description |
 |---------|-------------|
-| **Any LLM** | Works with OpenAI, OpenRouter, or local models via Ollama |
+| **Any LLM** | Works with OpenAI, Anthropic, OpenRouter, Codex/ChatGPT OAuth, or local models via Ollama |
 | **Any Dataset** | Supports classification or regression datasets in CSV format |
 | **Secure Execution** | Docker containers with read-only access to code and isolated execution |
 | **Reproducible** | Outputs include trained models, scripts, and conda environments |
diff --git a/docs/reference/foundation-models.md b/docs/reference/foundation-models.md
index 9f629488..a88be9dc 100644
--- a/docs/reference/foundation-models.md
+++ b/docs/reference/foundation-models.md
@@ -4,7 +4,7 @@ Pre-trained models for specialized omics domains.
 
 ## Overview
 
-Foundation models are large pre-trained models specialized for specific data types. Agentomics-ML can pre-download these models to speed up agent runs.
+Foundation models are large pre-trained models specialized for specific data types. Agentomics-ML can enable selected model families so the agent can use their catalog and local/downloaded weights during a run.
 
 ## Available Types
 
@@ -15,30 +15,27 @@ Foundation models are large pre-trained models specialized for specific data typ
 | `protein` | Proteomics | Protein sequences, structure |
 | `molecule` | Chemistry | Small molecules, drugs |
 
-## Pre-downloading Models
+## Enabling Models
 
-Download foundation models before running:
+Enable foundation models before running:
 
 ```bash
-./run.sh --foundation-model-type dna
+./run.sh --foundation-models-type dna
 ```
 
-This downloads relevant models to the Docker image, avoiding download delays during agent execution.
+This enables the selected family for the run and makes its catalog available to the agent. In Docker mode, the launcher pulls or builds an image tagged for that foundation-model type. In local mode, models are downloaded into the local workspace cache.
 
-You can also use `--foundation-model-type all` to include every type.
+You can also use `--foundation-models-type all` to include every type.
 
 ## Multiple Types
 
-Download multiple types by running multiple times:
+Enable a different type per run, or use `all` when the agent should see every configured type:
 
 ```bash
-./run.sh --foundation-model-type dna
-./run.sh --foundation-model-type protein
+./run.sh --foundation-models-type dna
+./run.sh --foundation-models-type protein
 ```
 
-In local mode (`--local`), models are downloaded into the workspace instead of
-being baked into a Docker image.
-
 ## DNA Models
 
 For genomic sequence data:
@@ -103,17 +100,18 @@ Example datasets:
 Foundation model configurations are in:
 
 ```
-foundation_models/
+foundation_models/models.yaml
 ```
 
-Each type has a configuration specifying:
+The YAML file lists each model family with:
 - Model names and sources
-- Download locations
+- Domain type (`dna`, `rna`, `protein`, or `molecule`)
+- Brief summaries
 - Usage instructions for the agent
 
-## Without Pre-downloading
+## Without Foundation Models
 
-If you don't pre-download, the agent can still use foundation models but will download them during execution (slower first run).
+If `--foundation-models-type` is omitted on a fresh run, no foundation models are made available to the agent. Forked runs inherit the source run's foundation-model type when the flag is omitted.
 
 ## Storage Requirements
 
@@ -141,9 +139,9 @@ Use `--cpu-only` if GPU unavailable, but expect longer run times for foundation
 
 To add custom foundation models:
 
-1. Add configuration to `foundation_models/`
-2. Update the download script
-3. Add usage instructions for the agent
+1. Add the family to `foundation_models/models.yaml`
+2. Add a companion usage guide under `foundation_models/`
+3. Ensure the listed model names can be downloaded by `src/utils/download_foundation_models.py`
 
 ## Related
 
diff --git a/docs/reference/workspace-structure.md b/docs/reference/workspace-structure.md
index 1f311aa6..d65d63e5 100644
--- a/docs/reference/workspace-structure.md
+++ b/docs/reference/workspace-structure.md
@@ -9,15 +9,18 @@ agentomics-ml/
 ├── datasets/                 # Raw input datasets
 ├── prepared_datasets/        # Prepared training data
 ├── prepared_test_sets/       # Prepared test data (hidden)
-├── workspace/                # Active execution workspace
-│   ├── run/                  # Current run files
-│   ├── best_iteration_snapshot/ # Best iteration snapshot
-│   ├── reports/              # Iteration reports
-│   ├── extras/               # Logs and extra artifacts
-│   └── fallbacks/            # Backup for recovery
 └── outputs/                  # Final results
+
+../workspace/runs/<agent_id>/ # Local-mode active workspace
+├── run/                      # Current run files
+├── best_iteration_snapshot/  # Best iteration snapshot
+├── reports/                  # Iteration reports
+├── extras/                   # Logs and extra artifacts
+└── fallbacks/                # Reserved recovery area
 ```
 
+Docker mode uses an internal temporary workspace volume with the same layout and copies it to `outputs/<agent_id>/` when the run ends.
+
 ## datasets/
 
 Your raw input datasets:
@@ -37,7 +40,7 @@ After preparation, datasets are formatted for the agent:
 ```
 prepared_datasets/my_dataset/
 ├── train.csv              # Processed training data
-├── validation.csv         # Processed validation data
+├── validation.csv         # Processed validation data, if supplied
 ├── dataset_description.md # Copied/created description
 └── metadata.json          # Task info (type, classes, etc.)
 ```
@@ -54,16 +57,16 @@ prepared_test_sets/my_dataset/
 
 The agent never sees files in this directory during training.
 
-## workspace/
+## Active Workspace
 
-Active execution area:
+Active execution area. In local mode this is `../workspace/runs/<agent_id>/`; in Docker mode it is the temporary `/workspace` volume.
 
-### workspace/run/
+### run/
 
 Current run working directory:
 
 ```
-workspace/run/
+<workspace_root>/run/
 ├── shared/
 │   ├── .conda/                  # Shared Conda environment
 │   ├── config.json
@@ -77,12 +80,12 @@ workspace/run/
 └── ...
 ```
 
-### workspace/best_iteration_snapshot/
+### best_iteration_snapshot/
 
 Best iteration snapshot:
 
 ```
-workspace/best_iteration_snapshot/
+<workspace_root>/best_iteration_snapshot/
 ├── model_training/
 │   ├── train.py
 │   └── training_artifacts/
@@ -95,25 +98,22 @@ workspace/best_iteration_snapshot/
 
 Updated whenever a new best iteration is achieved.
 
-### workspace/fallbacks/
+### fallbacks/
 
-Recovery backup for split changes:
+Reserved recovery area:
 
 ```
-workspace/fallbacks/<agent_id>/
-├── train.csv
-├── validation.csv
-└── split_fingerprint.json
+<workspace_root>/fallbacks/
 ```
 
-Used to restore data if a split change causes issues.
+This directory may be empty for normal runs.
 
-### workspace/reports/
+### reports/
 
 Iteration reports are written here during runs. These are copied to
 `outputs/<agent_id>/reports/` after completion.
 
-### workspace/extras/
+### extras/
 
 Logs and auxiliary artifacts (metrics, run logs) are stored here and copied to
 `outputs/<agent_id>/extras/`.
@@ -174,16 +174,14 @@ rm -rf outputs/<agent_id>
 ### Clean Workspace
 
 ```bash
-rm -rf workspace/run/*
-rm -rf workspace/best_iteration_snapshot/*
-rm -rf workspace/fallbacks/*
+rm -rf ../workspace/runs/<agent_id>
 ```
 
 ### Clean Everything
 
 ```bash
 rm -rf outputs/*
-rm -rf workspace/*
+rm -rf ../workspace/*
 rm -rf prepared_datasets/*
 rm -rf prepared_test_sets/*
 ```
diff --git a/docs/user-guide/datasets.md b/docs/user-guide/datasets.md
index a2e67b9c..64b4f8d6 100644
--- a/docs/user-guide/datasets.md
+++ b/docs/user-guide/datasets.md
@@ -29,7 +29,7 @@ feature1,feature2,feature3,target
 
 ### validation.csv (Optional)
 
-Separate validation data. If not provided, the agent creates a train/validation split from `train.csv`.
+Separate validation data. If not provided, the agent creates train/validation split files during the `data_split` step and stores them under the run's `shared/splits/` directory.
 
 ### test.csv (Optional)
 
@@ -75,8 +75,8 @@ Fields:
 
 - `task_type` (optional): `"classification"` or `"regression"`; if omitted, you will be prompted during dataset preparation.
 - `target_col` (optional): column name to predict; auto-detected if omitted.
-- `positive_class` (optional): value that counts as "positive"; only applicable for some binary classification metrics, auto-detected if omitted.
-- `negative_class` (optional): value that counts as "negative"; only applicable for some binary classification metrics, auto-detected if omitted.
+- `positive_class` (optional): value that counts as "positive"; applies only to binary classification label mapping.
+- `negative_class` (optional): value that counts as "negative"; applies only to binary classification label mapping.
 
 Include only the fields you need — at minimum just `task_type`. Values from this file take precedence over auto-detection, but CLI flags (`--task-type`, `--target-col`, etc.) override the config file.
 
@@ -85,7 +85,7 @@ Include only the fields you need — at minimum just `task_type`. Values from th
 The target column is resolved in this order:
 1. CLI flag (`--target-col`)
 2. `dataset_config.json` (`target_col` field)
-3. Auto-detection from common names: `class`, `target`, `label`, `y`
+3. Auto-detection from common names: `class`, `target`, `label`, `y`, and their uppercase variants
 4. Interactive prompt (if running interactively)
 
 If all of the above fail, preparation will raise an error.
@@ -128,7 +128,7 @@ After preparation, datasets are stored in:
 ```
 prepared_datasets/my_dataset/
 ├── train.csv              # Training data
-├── validation.csv         # Validation data (created if not provided)
+├── validation.csv         # Validation data (only if provided in the raw dataset)
 ├── dataset_description.md # Copied/created description
 └── metadata.json          # Task type, classes, etc.
 
diff --git a/docs/user-guide/forking.md b/docs/user-guide/forking.md
index 58b7cb91..55d2ea20 100644
--- a/docs/user-guide/forking.md
+++ b/docs/user-guide/forking.md
@@ -69,7 +69,7 @@ Dataset and validation metric are locked to keep all iterations comparable acros
 
 When a fork is set up, the following happens before the new run starts:
 
-1. The entire source workspace is copied (models, splits, conda environment, reports).
+1. The source workspace state is copied, excluding generated reports/extras and untracked Conda environments.
 2. The git history in the run directory is checked out at the requested checkpoint — files added in later commits are removed.
 3. Absolute paths stored in step outputs are rewritten to point to the new workspace.
 4. The shared conda environment is renamed for the new run ID and updated from the stored `environment.yml`.
diff --git a/docs/user-guide/inference.md b/docs/user-guide/inference.md
index d7186668..b1dfe0a7 100644
--- a/docs/user-guide/inference.md
+++ b/docs/user-guide/inference.md
@@ -25,6 +25,8 @@ Use trained models to make predictions on new data with `scripts/inference.sh`.
 |----------|-------------|
 | `--cpu-only` | Run without GPU |
 | `--local` | Run locally without Docker |
+| `--code-path` | Relative path inside `--agent-dir` containing generated code; defaults to `best_iteration_snapshot` |
+| `--remove-conda-env` | Remove the generated inference Conda environment after the run |
 | `--help` | Show help message |
 
 ## Example
@@ -53,9 +55,7 @@ feature1,feature2,feature3
 
 ## Output Format
 
-The output format is defined by the generated `inference.py` script. For
-classification tasks, the output often includes a `numeric_label` column with
-scores in `[0, 1]`, but you should treat the exact schema as run-specific.
+The generated `inference.py` must preserve the input `id` column and write a prediction for every row. Classification runs produce `prediction` plus `probability_<class_id>` columns. Regression runs produce `prediction`. Additional columns are run-specific.
 
 ## Docker vs Local Mode
 
@@ -99,7 +99,7 @@ outputs/<agent_id>/best_iteration_snapshot/
 
 ### "Docker image not found"
 
-Run `./run.sh` once to build the Docker image, or use `--local` mode.
+The script first looks for a local `agentomics_img` image, then for the matching pre-built Docker Hub image, and builds the image locally if neither is found. Use `--local` to avoid Docker.
 
 ### "Column mismatch"
 
diff --git a/docs/user-guide/outputs.md b/docs/user-guide/outputs.md
index 6b88a3cc..34dced81 100644
--- a/docs/user-guide/outputs.md
+++ b/docs/user-guide/outputs.md
@@ -14,10 +14,13 @@ outputs/<agent_id>/
 │   │   └── inference.py      # Inference script
 │   ├── validation_evaluation/
 │   │   ├── eval_predictions_train.csv
-│   │   └── eval_predictions_validation.csv
+│   │   ├── eval_predictions_validation.csv
+│   │   └── output.json
 │   ├── runtime_info/
 │   │   └── iteration_metadata.json
 │   ├── environment.yml
+│   ├── eval_predictions_test.csv      # If a held-out test set was provided
+│   ├── test_metrics.json              # If final test evaluation succeeded
 │   └── .conda/
 ├── run/                      # All iterations + shared run state
 │   ├── shared/
@@ -103,15 +106,15 @@ Metrics depend on the selected validation metric and task type. See
 During execution, the agent uses a workspace:
 
 ```
-workspace/
+<workspace_root>/
 ├── run/                     # Active run directory
 ├── best_iteration_snapshot/    # Best iteration snapshot
 ├── reports/                 # Iteration reports
 ├── extras/                  # Logs and metrics
-└── fallbacks/               # Backup for recovery
+└── fallbacks/               # Reserved recovery area
 ```
 
-After completion, everything is copied to `outputs/`.
+In Docker mode, this is an internal temporary volume. In local mode, it lives under `../workspace/runs/<agent_id>/`. After completion, the run workspace is copied to `outputs/<agent_id>/`.
 
 ## W&B Logging
 
@@ -146,12 +149,10 @@ rm -rf outputs/<agent_id>
 rm -rf outputs/*
 ```
 
-In Docker mode, the temporary workspace volume is removed after a run. In local
-mode, you can manually clean:
+In Docker mode, the temporary workspace volume is removed after a run. In local mode, you can manually remove the active workspace:
 
 ```bash
-rm -rf workspace/run/*
-rm -rf workspace/best_iteration_snapshot/*
+rm -rf ../workspace/runs/<agent_id>
 ```
 
 ## Next Steps
diff --git a/docs/user-guide/running-agent.md b/docs/user-guide/running-agent.md
index a340968a..2fdc0053 100644
--- a/docs/user-guide/running-agent.md
+++ b/docs/user-guide/running-agent.md
@@ -17,7 +17,8 @@ You'll be prompted to select:
 1. **LLM Model** - Choose from available models
 2. **Dataset** - Select a prepared dataset
 3. **Iterations** - Number of optimization cycles (default prompt: 5)
-4. **Validation Metric** - Optional metric to optimize (defaults: `AUROC` for classification, `MAE` for regression)
+
+You are not prompted for a validation metric; pass `--val-metric` to override the task-based default (`AUROC` for classification, `MAE` for regression).
 
 ## Non-Interactive Mode
 
@@ -30,13 +31,14 @@ Supply parameters directly to skip prompts:
   --iterations 10
 ```
 
-For non-interactive runs, provide at least `--model`, `--dataset`, and `--iterations`.
+For non-interactive fresh runs, provide at least `--model` and `--dataset`. If you omit `--iterations`, the default is 5.
 
 ## Common Options
 
 | Option | Description | Example |
 |--------|-------------|---------|
 | `--model` | LLM model to use | `--model openai/gpt-4` |
+| `--provider` | Provider to use when multiple providers are configured | `--provider openai` |
 | `--dataset` | Dataset name | `--dataset my_data` |
 | `--iterations` | Number of iterations | `--iterations 15` |
 | `--val-metric` | Validation metric (optional, task-based default if omitted) | `--val-metric AUROC` |
@@ -65,16 +67,16 @@ For non-interactive runs, provide at least `--model`, `--dataset`, and `--iterat
 | `--build-images` | Build Docker images locally |
 | `--local` | Run without Docker (uses conda) |
 | `--cpu-only` | Disable GPU acceleration |
-| `--ollama` | Use local Ollama models |
+| `--ollama` | Enable Docker host networking for a host Ollama server |
 
 ## Advanced Options
 
 ### Foundation Models
 
-Pre-download domain-specific foundation models:
+Enable domain-specific foundation models:
 
 ```bash
-./run.sh --foundation-model-type dna
+./run.sh --foundation-models-type dna
 ```
 
 Available types: `dna`, `rna`, `protein`, `molecule`
@@ -104,6 +106,12 @@ Set timeout for each training execution (default is 6 hours):
 ./run.sh --run-python-timeout 43200  # 12 hours per training run
 ```
 
+You can also set a separate split deadline:
+
+```bash
+./run.sh --split-timeout 3600  # stop allowing split changes after 1 hour
+```
+
 ### Custom User Prompt
 
 Override the default optimization goal:
@@ -136,7 +144,7 @@ See [Forking a Run](forking.md) for the full guide.
 
 ## What Happens During a Run
 
-1. **Dataset Preparation** - Validates and prepares data in `prepared_datasets/`
+1. **Dataset Preparation** - Validates data, writes training/validation inputs to `prepared_datasets/`, and separates held-out tests into `prepared_test_sets/`
 2. **Iterative Development** - Agent runs exploration, training, and evaluation cycles
 3. **Snapshot Best Model** - Tracks the best-performing iteration
 4. **Final Evaluation** - Tests on held-out test set (if provided)
diff --git a/docs/user-guide/training.md b/docs/user-guide/training.md
index 638d108f..2090818c 100644
--- a/docs/user-guide/training.md
+++ b/docs/user-guide/training.md
@@ -96,7 +96,7 @@ GPU is used automatically if available. To disable:
 
 ### "Docker image not found"
 
-Run `./run.sh` once to build the Docker image, or use `--local` mode.
+The script first looks for a local `agentomics_img` image, then for the matching pre-built Docker Hub image, and builds the image locally if neither is found. Use `--local` to avoid Docker.
 
 ### "Agent directory not found"
 
diff --git a/run.sh b/run.sh
index 3c6a5a06..ac7fbf6a 100755
--- a/run.sh
+++ b/run.sh
@@ -68,7 +68,7 @@ Optional Arguments:
                       Providing it means N more exploration iterations from the fork point.
   --val-metric <name> Metric to optimize. Defaults: AUROC (classification), MAE (regression).
   --user-prompt <str> The main prompt/goal for the agent.
-                      (Default: "Create the best possible machine learning model that will generalize to new unseen data.")
+                      (Default: "Develop a machine learning model that generalizes well to new unseen data.")
 
 Forking:
   --fork-from-run <path>  Path to the source run workspace directory (the 'outputs/<run_id>' folder).
@@ -79,7 +79,7 @@ Forking:
                             --dataset    (tied to the data the source run was trained on)
                             --val-metric (must stay consistent to compare iterations across the fork)
   --fork-from-step <step> Only used with --fork-from-run. Step ID to fork from (e.g. 'model_training').
-                          Defaults to the latest completed step or iteraiton end checkpoint in the source run.
+                          Defaults to the latest completed step or iteration end checkpoint in the source run.
   --fork-from-iteration <N>
                           Only used with --fork-from-run. Iteration to fork from.
                           Defaults to the latest iteration containing the specified step or iteration end checkpoint.