BioGeMT · dimostzim · May 14, 2026
diff --git a/README.md b/README.md
@@ -8,7 +8,7 @@ Made for biomedical data, Agentomics outperformed human experts and created new
 
 How it works
 1) Input is a CSV training dataset + optional data description
-2) Agentomics autonomously experments with various ML models and strategies
+2) Agentomics autonomously experiments with various ML models and strategies
 3) Output is a trained model ready for inference and a detailed PDF report summarizing the development process and achieved metrics
 
 For more details see: [preprint](https://www.biorxiv.org/content/10.64898/2026.01.27.702049v1)
@@ -25,7 +25,7 @@ For more details see: [preprint](https://www.biorxiv.org/content/10.64898/2026.0
 git clone https://github.com/BioGeMT/agentomics-ml.git
 cd agentomics-ml
 cp .env.example .env
-# Edit .env and set at least one API key (OPENROUTER_API_KEY or OPENAI_API_KEY)
+# Edit .env and set at least one provider key, such as OPENROUTER_API_KEY or OPENAI_API_KEY
 
 # Download example dataset
 ./scripts/download_example_dataset.sh
@@ -35,7 +35,7 @@ cp .env.example .env
 
 Recommended model: `gpt-5.1-codex-max`
 
-Outputs are saved to `outputs/<agent_id>/`, including PDF reports in `outputs/<agent_id>/pdf_reports`.
+Outputs are saved to `outputs/<agent_id>/`, including PDF reports in `outputs/<agent_id>/reports/pdf/`.
 
 ### Installation Requirements
 
@@ -52,17 +52,16 @@ For more details visit **https://biogemt.github.io/agentomics-ml/**
 - Generic: Agentomics can crunch any classification and regression datasets in CSV format.
 - Secure: Agents execute code securely in Docker with read-only mounts to your file system and are only allowed to write in a Docker Volume.
 - Reproducible: Outputs include models, scripts, and conda environments needed to run inference or re-train models with one bash command.
-- Trustworthy: If you provide a test set, Agentomics fully abstracts LLMs from accessing it, allowing you to rely on programmaticly computed and reported test set metrics.
-- Foundation models: Agentomics can leverage foundation models from huggingface for both embeddings and fine-tuning.
-- Various LLM providers: OpenAI, OpenRouter, or local models via Ollama
+- Trustworthy: If you provide a test set, Agentomics keeps it hidden from the LLM and reports programmatically computed test metrics.
+- Foundation models: Agentomics can leverage foundation models from Hugging Face for both embeddings and fine-tuning.
+- Various LLM providers: OpenAI, Anthropic, OpenRouter, Codex/ChatGPT OAuth, or local models via Ollama.
 - Reliability: Thanks to our functional validators, Agentomics creates a working model 100% of the time (when using recommended settings).
 
 ## Roadmap
 Agentomics is in active development. We welcome any raised Issues and suggestions. You can also [Email Us](mailto:martinekvlastimil95@gmail.com).
 
 Features coming soon:
 - Support for any data type (currently only CSV datasets)
-- Run forking and continuing
 - Better local model support and configuration
 - Remote GPU support for GCP
 
@@ -80,5 +79,3 @@ bioRxiv (preprint) https://www.biorxiv.org/content/10.64898/2026.01.27.702049v1
 ## License
 
 MIT. See `LICENSE`.
-
-
diff --git a/docs/configuration/cli-options.md b/docs/configuration/cli-options.md
@@ -7,10 +7,13 @@ Complete reference for `run.sh` command-line options.
 | Option | Description | Default |
 |--------|-------------|---------|
 | `--model <name>` | LLM model to use | Interactive selection |
+| `--provider <name>` | Provider to use when multiple providers are configured | Prompted if multiple providers are available in an interactive terminal |
 | `--dataset <name>` | Dataset name | Interactive selection |
 | `--iterations <n>` | Number of iterations | Prompted in interactive mode (default 5) |
 | `--val-metric <metric>` | Validation metric to optimize | Task-based default (`AUROC` for classification, `MAE` for regression) |
+| `--task-type <classification\|regression>` | Task type to use while preparing the selected raw dataset | Dataset config or interactive preparation prompt |
 | `--timeout <seconds>` | Time limit for entire run | None |
+| `--split-timeout <seconds>` | Time limit after which the agent may no longer change the train/validation split | None |
 | `--run-python-timeout <seconds>` | Timeout in seconds for each run_python tool execution - this will determine the maximum training time | `21600` (6 hours) |
 
 The run stops when either the iteration count is reached or the timeout expires.
@@ -22,7 +25,9 @@ The run stops when either the iteration count is reached or the timeout expires.
 | `--build-images` | Build Docker images locally |
 | `--local` | Run without Docker (uses conda) |
 | `--cpu-only` | Disable GPU acceleration |
-| `--ollama` | Use local Ollama for LLM |
+| `--ollama` | Enable Docker host networking for a host Ollama server |
+| `--test` | Run the integrated test suite in Docker mode |
+| `--all-iterations-test` | Evaluate every archived iteration on the held-out test set after the run |
 
 ## Listing Options
 
@@ -39,14 +44,14 @@ The run stops when either the iteration count is reached or the timeout expires.
 |--------|-------------|
 | `--user-prompt <text>` | Custom prompt for the agent |
 | `--iteration-plan-model <name>` | LLM model used for generating the iteration plan (defaults to `--model`) |
-| `--foundation-model-type <type>` | Pre-download foundation models (`dna`, `rna`, `protein`, `molecule`, `all`) |
+| `--foundation-models-type <type>` | Enable foundation models (`dna`, `rna`, `protein`, `molecule`, `all`) |
 | `--use-provisioning-key` | Use OpenRouter temporary API key |
 | `--spend-limit <n>` | Spend limit for provisioning key (requires `--use-provisioning-key`) |
 | `--verbosity <summary\|full>` | How much agent interaction detail is printed during the run (default: `full`) |
 | `--disable-training-reporting` | Disable the TrainingReporter helper that emits structured training progress updates from the agent's training script |
 | `--split-allowed-iterations <n>` | Iterations that can modify train/val split (default 1) |
 | `--exploration-iterations <n>` | Baseline exploration iterations (default 4) |
-| `--run-python-timeout <seconds>` | Per-training timeout for `run_python` tool (default 21600) |
+| `--tags <tag...>` | Space-separated tags for W&B logging |
 
 ## Forking
 
@@ -95,7 +100,8 @@ See [Forking a Run](../user-guide/forking.md) for a full guide and examples.
 ### Using Ollama
 
 ```bash
-./run.sh --ollama
+export OLLAMA_BASE_URL=http://localhost:11434/v1
+./run.sh --ollama --provider ollama --model llama3.1 --dataset my_data
 ```
 
 ### CPU Only
@@ -104,10 +110,10 @@ See [Forking a Run](../user-guide/forking.md) for a full guide and examples.
 ./run.sh --cpu-only --model openai/gpt-4 --dataset my_data
 ```
 
-### Pre-download Foundation Models
+### Enable Foundation Models
 
 ```bash
-./run.sh --foundation-model-type protein --model openai/gpt-4
+./run.sh --foundation-models-type protein --model openai/gpt-4 --dataset my_data
 ```
 
 ### Run with locally built Docker images

diff --git a/docs/configuration/custom-prompts.md b/docs/configuration/custom-prompts.md
@@ -50,7 +50,7 @@ Without customization, the agent uses:
 
 ## What Custom Prompts Affect
 
-The user prompt influences all agent steps:
+The user prompt influences the agentic steps:
 
 | Step | How It's Used |
 |------|---------------|
@@ -61,7 +61,8 @@ The user prompt influences all agent steps:
 | Model Architecture | Model selection and design |
 | Model Training | Training approach and hyperparameters |
 | Model Inference | Prediction pipeline design |
-| Validation Evaluation | What success criteria matter most |
+
+Validation evaluation itself is deterministic: it runs the generated inference script on train/validation data and scores the configured `--val-metric`.
 
 ## Prompt Tips
 

diff --git a/docs/configuration/environment.md b/docs/configuration/environment.md
@@ -30,6 +30,9 @@ At least one API key is required:
 |----------|----------|---------|
 | `OPENROUTER_API_KEY` | OpenRouter | [openrouter.ai](https://openrouter.ai/) |
 | `OPENAI_API_KEY` | OpenAI | [platform.openai.com](https://platform.openai.com/) |
+| `ANTHROPIC_API_KEY` | Anthropic | [console.anthropic.com](https://console.anthropic.com/) |
+
+The Codex provider uses `codex login` and reads `~/.codex/auth.json`. Ollama does not use an API key, but set `OLLAMA_BASE_URL` so the provider is selectable.
 
 ### Provisioning Key (Optional)
 
@@ -115,6 +118,7 @@ export CUDA_VISIBLE_DEVICES=0,1
 # LLM Provider (choose one or more)
 OPENROUTER_API_KEY=sk-or-v1-xxxxxxxxxxxx
 # OPENAI_API_KEY=sk-xxxxxxxxxxxx
+# ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxx
 
 # Weights & Biases (optional)
 WANDB_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxx
@@ -128,6 +132,7 @@ WANDB_ENTITY=my-team
 
 # Ollama (optional)
 # Configure in src/utils/providers/configured_providers.yaml
+# OLLAMA_BASE_URL=http://localhost:11434/v1
 ```
 
 ## Security Notes

diff --git a/docs/configuration/providers.md b/docs/configuration/providers.md
@@ -8,6 +8,7 @@ Agentomics-ML supports multiple LLM providers out of the box.
 |----------|---------------------|--------|
 | [OpenRouter](https://openrouter.ai/) | `OPENROUTER_API_KEY` | 100+ models |
 | [OpenAI](https://openai.com/) | `OPENAI_API_KEY` | Use `--list-models` to see available models |
+| [Anthropic](https://anthropic.com/) | `ANTHROPIC_API_KEY` | Claude models available to your account |
 | OpenAI Codex | `codex login` | Uses your local Codex/ChatGPT login |
 | [Ollama](https://ollama.ai/) | Local setup | Local models |
 
@@ -60,6 +61,21 @@ Use `./run.sh --list-models` to see what your API key can access.
 
 ---
 
+## Anthropic
+
+Direct access to Anthropic models.
+
+### Setup
+
+```bash
+export ANTHROPIC_API_KEY="sk-ant-xxxxxxxxxxxx"
+./run.sh --provider anthropic --list-models
+```
+
+Use `--provider anthropic` explicitly when other provider keys are also set.
+
+---
+
 ## Codex (ChatGPT OAuth)
 
 Experimental support for the local Codex CLI login flow.
@@ -76,7 +92,7 @@ Then run Agentomics with the `codex` provider:
 
 ```bash
 ./run.sh --provider codex --list-models
-./run.sh --provider codex --model gpt-5.4
+./run.sh --provider codex --model gpt-5.4 --dataset my_data
 ```
 
 This provider reads your local Codex auth state from `~/.codex/auth.json` and
@@ -98,15 +114,16 @@ Run models locally for privacy or offline use.
 
 ### Docker Mode (Recommended)
 
-Run with:
+Set `OLLAMA_BASE_URL` so Agentomics considers Ollama available, then run with:
 
 ```bash
-./run.sh --ollama
+export OLLAMA_BASE_URL=http://localhost:11434/v1
+./run.sh --ollama --provider ollama --model <ollama-model> --dataset <dataset>
 ```
 
 Docker mode connects to the Ollama base URL defined in
 `src/utils/providers/configured_providers.yaml`
-(default: `http://host.docker.internal:11434/v1`).
+(default: `http://localhost:11434/v1`) and uses host networking when `--ollama` is passed.
 Ensure your Ollama server is reachable from the host at `:11434`.
 
 ### Local Mode
@@ -115,7 +132,8 @@ For local mode, set the Ollama base URL in `src/utils/providers/configured_provi
 to `http://localhost:11434/v1`, then run:
 
 ```bash
-./run.sh --local
+export OLLAMA_BASE_URL=http://localhost:11434/v1
+./run.sh --local --provider ollama --model <ollama-model> --dataset <dataset>
 ```
 
 ### Popular Models
@@ -197,5 +215,4 @@ Ensure Ollama is running:
 ollama list  # Should show pulled models
 ```
 
-For Docker mode, verify that `host.docker.internal:11434` is reachable from
-containers (run with `./run.sh --ollama`).
+For Docker mode, run with `--ollama` so the container uses host networking, and verify the configured Ollama URL is reachable on the host.
diff --git a/docs/developer/gpu-settings.md b/docs/developer/gpu-settings.md
@@ -122,12 +122,14 @@ Agentomics-ML supports multi-GPU training:
 - Agent-generated scripts may use DataParallel or DistributedDataParallel
 - All available GPUs are passed to containers by default
 
-To limit GPUs:
+To limit GPUs in local mode:
 
 ```bash
-CUDA_VISIBLE_DEVICES=0,1 ./run.sh  # Use only first 2 GPUs
+CUDA_VISIBLE_DEVICES=0,1 ./run.sh --local  # Use only first 2 GPUs
 ```
 
+In Docker mode, Agentomics passes all available GPUs to the container; selecting a subset requires running Docker manually with custom GPU flags.
+
 ## Docker GPU Flags
 
 When running containers manually, you can limit GPUs with Docker flags:

diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md
@@ -24,7 +24,7 @@ cd Agentomics-ML
 ```bash
 # Create a .env file (required for Docker mode)
 cp .env.example .env
-# Edit .env and set at least one API key
+# Edit .env and set at least one provider key
 
 # Run with pre-built images
 ./run.sh
@@ -47,13 +47,13 @@ The images will be downloaded automatically on first run. All subsequent runs wi
 ```bash
 # Create a .env file (required for Docker mode)
 cp .env.example .env
-# Edit .env and set at least one API key
+# Edit .env and set at least one provider key
 
 # Run while building images locally
 ./run.sh --build-images
 ```
 
-On first run, you'll be prompted to build the Docker images. This takes a few minutes but only needs to be done once.
+With `--build-images`, the Docker images are built locally before the run starts. This takes a few minutes but only needs to be repeated when dependencies or Dockerfiles change.
 
 ---
 
@@ -103,23 +103,25 @@ Run with local models using Ollama for privacy or offline use.
 
 ### Docker Mode Setup
 
-1. Ensure Ollama listens on the host (e.g., `0.0.0.0:11434`).
-2. Run with the `--ollama` flag:
+1. Ensure Ollama is running on the host.
+2. Make the Ollama provider selectable and choose it explicitly:
 
     ```bash
-    ./run.sh --ollama
+    export OLLAMA_BASE_URL=http://localhost:11434/v1
+    ./run.sh --ollama --provider ollama --model <ollama-model> --dataset <dataset>
     ```
 
 Docker mode connects to the URL configured in `src/utils/providers/configured_providers.yaml`
-(default: `http://host.docker.internal:11434/v1`).
+(default: `http://localhost:11434/v1`) and uses host networking when `--ollama` is passed.
 
 ### Local Mode Setup
 
 For local mode, set the Ollama base URL in `src/utils/providers/configured_providers.yaml`
 to `http://localhost:11434/v1`, then run:
 
 ```bash
-./run.sh --local
+export OLLAMA_BASE_URL=http://localhost:11434/v1
+./run.sh --local --provider ollama --model <ollama-model> --dataset <dataset>
 ```
 
 ---

diff --git a/docs/getting-started/quick-start.md b/docs/getting-started/quick-start.md
@@ -5,7 +5,7 @@ Get Agentomics-ML running in under 5 minutes using pre-built Docker images.
 ## Prerequisites
 
 - [Docker](https://docs.docker.com/get-docker/) installed and running
-- An API key from [OpenRouter](https://openrouter.ai/) or [OpenAI](https://platform.openai.com/)
+- An API key from a configured provider, such as [OpenRouter](https://openrouter.ai/) or [OpenAI](https://platform.openai.com/)
 
 ## Steps
 
@@ -22,8 +22,8 @@ Docker mode requires a `.env` file in the repo root.
 
 ```bash
 cp .env.example .env
-# Edit .env and set at least one API key:
-# OPENROUTER_API_KEY or OPENAI_API_KEY
+# Edit .env and set at least one provider key:
+# OPENROUTER_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY
 ```
 
 ### 3. Run the Agent
@@ -39,7 +39,8 @@ The agent will prompt you to:
 1. **Select a model** - Choose from available LLMs
 2. **Select a dataset** - Use your own or download examples
 3. **Configure iterations** - How many optimization cycles to run
-4. **Choose validation metric** - see `./run.sh --list-metrics`
+
+The validation metric defaults to `AUROC` for classification and `MAE` for regression. To choose one explicitly, pass `--val-metric`; see `./run.sh --list-metrics`.
 
 ## Using Your Own Dataset
 

diff --git a/docs/how-it-works/architecture.md b/docs/how-it-works/architecture.md
@@ -49,11 +49,12 @@ The agent analyzes the dataset:
 
 ### 3. Data Split
 
-Creates or modifies train/validation split:
+Creates, reuses, or modifies train/validation split files:
 
 - Stratified splitting for classification
 - Considers data distribution
 - May adjust split based on previous iterations
+- Reuses a supplied `validation.csv` when present
 
 **Output:** Captured in structured outputs and iteration reports
 
@@ -144,7 +145,7 @@ Each step uses an LLM agent that:
 1. Receives context (data info, previous results, and the iteration plan when applicable)
 2. Generates a structured output (validated by Pydantic)
 3. Validates the output meets requirements
-4. Retries if validation fails (up to 10 times)
+4. Retries if validation fails (up to 5 times by default)
 
 ## Iteration Flow
 
@@ -186,11 +187,11 @@ Key architecture parameters in `src/utils/config.py`:
 
 | Parameter | Default | Description |
 |-----------|---------|-------------|
-| `temperature` | 1.0 | LLM creativity level |
+| `temperature` | 0.7 | LLM creativity level |
 | `max_steps` | 100 | Max steps per agent |
-| `max_validation_retries` | 10 | Output validation retries |
-| `llm_response_timeout` | 900s | LLM response timeout |
-| `bash_tool_timeout` | 300s | Bash command timeout |
+| `max_validation_retries` | 5 | Output validation retries |
+| `llm_response_timeout` | 600s | LLM response timeout |
+| `bash_tool_timeout` | 180s | Bash command timeout |
 | `run_python_tool_timeout` | 21600s | Training timeout (6 hours, configurable via `--run-python-timeout`) |
 
 ## Next Steps
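
The retry behaviour this hunk documents (structured output is validated and retried "up to 5 times by default") can be pictured with a minimal sketch. This is not the Agentomics implementation: `run_step`, `generate`, and `validate` are hypothetical names, and treating the limit as one initial attempt plus up to `max_retries` retries is an assumption, since the docs do not say whether the first attempt counts.

```python
from typing import Callable, Optional

MAX_VALIDATION_RETRIES = 5  # documented default for max_validation_retries


def run_step(generate: Callable[[Optional[str]], dict],
             validate: Callable[[dict], Optional[str]],
             max_retries: int = MAX_VALIDATION_RETRIES) -> dict:
    """Call generate until validate accepts the output or retries run out.

    generate receives the previous validation error (None on the first call).
    validate returns None on success or an error message on failure.
    """
    feedback: Optional[str] = None
    # Assumption: one initial attempt plus up to max_retries retries.
    for _ in range(max_retries + 1):
        output = generate(feedback)
        feedback = validate(output)
        if feedback is None:
            return output
    raise RuntimeError(f"validation failed after {max_retries} retries: {feedback}")


# Hypothetical stand-ins for an LLM call and an output validator.
attempts = []


def fake_generate(feedback: Optional[str]) -> dict:
    attempts.append(feedback)
    # Produce a valid output only on the third attempt.
    return {"path": "inference.py"} if len(attempts) >= 3 else {}


def fake_validate(output: dict) -> Optional[str]:
    return None if "path" in output else "missing 'path'"


result = run_step(fake_generate, fake_validate)
```

Feeding the validator's error message back into the next generation call is one common design. Whether Agentomics passes feedback this way is not stated in the diff.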

diff --git a/docs/how-it-works/evaluation.md b/docs/how-it-works/evaluation.md
@@ -8,7 +8,7 @@ Models are evaluated at multiple stages:
 
 | Stage | Data Used | Purpose |
 |-------|-----------|---------|
-| **Dry Run** | Small sample | Validate inference script works |
+| **Dry Run** | Prepared training data without labels | Validate inference script shape and metrics compatibility |
 | **Validation** | Validation set | Guide optimization |
 | **Train** | Training set | Detect overfitting |
 | **Test** | Hidden test set | Final unbiased evaluation |
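
To illustrate how the Validation and Train rows of the table relate, here is a minimal sketch that scores both splits with the documented default metrics (`AUROC` for classification, `MAE` for regression) and compares them. The function names and toy data are hypothetical, and the pure-Python metric code is illustrative only; it is not how Agentomics computes its reported metrics.

```python
def mae(y_true, y_pred):
    """Mean absolute error (lower is better)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(y_true, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def overfit_gap(train_score, val_score, higher_is_better):
    """Positive gap means the model scores better on train than on validation."""
    if higher_is_better:
        return train_score - val_score
    return val_score - train_score


# Toy classification predictions (hypothetical data).
train_auroc = auroc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
val_auroc = auroc([0, 1, 0, 1], [0.3, 0.4, 0.6, 0.7])
gap = overfit_gap(train_auroc, val_auroc, higher_is_better=True)

# Toy regression predictions (hypothetical data).
val_mae = mae([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
```

A large positive gap suggests overfitting to the training set, which is the purpose the table gives for the Train stage. The hidden test set stays out of this loop and is scored only for the final report.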