|
8 | 8 | "\n", |
9 | 9 | "WARNING: This feature is new and extremely EXPERIMENTAL. Unlike almost everything else in DSPy, it's currently in pure proof of concept and development mode, but we release it to encourage community involvement.\n", |
10 | 10 | "\n", |
11 | | - "For this tutorial, you will also need DSPy's Arbor RL server.\n", |
| 11 | + "For this tutorial, you will also need [DSPy's Arbor RL framework](https://github.com/Ziems/arbor) which you can install with:\n", |
12 | 12 | "\n", |
13 | 13 | "```bash\n", |
14 | 14 | "> pip install -U arbor-ai\n", |
|
22 | 22 | "outputs": [], |
23 | 23 | "source": [ |
24 | 24 | "import dspy\n", |
25 | | - "from dspy.clients.lm_local_arbor import ArborProvider\n", |
26 | | - "\n", |
27 | 25 | "import arbor\n", |
| 26 | + "from arbor import ArborGRPO, ArborProvider\n", |
28 | 27 | "arbor_server_info = arbor.init() # Initialize the Arbor server in the background\n", |
29 | 28 | "\n", |
30 | 29 | "port = 7453\n", |
31 | | - "local_lm_name = \"Qwen/Qwen2.5-7B-Instruct\"\n", |
| 30 | + "local_lm_name = \"Qwen/Qwen2.5-1.5B-Instruct\"\n", |
32 | 31 | "local_lm = dspy.LM(\n", |
33 | 32 | " model=f\"openai/arbor:{local_lm_name}\",\n", |
34 | 33 | " provider=ArborProvider(),\n", |
35 | | - " temperature=0.7,\n", |
36 | | - " api_base=arbor_server_info[\"api_base\"],\n", |
| 34 | + " api_base=arbor_server_info[\"base_url\"],\n", |
| 35 | + " # Arbor checks to make sure these match the training config\n", |
| 36 | + " temperature=1.0,\n", |
| 37 | + " top_p=1.0,\n", |
| 38 | + " top_k=-1,\n", |
| 39 | + " repetition_penalty=1.0,\n", |
| 40 | + " max_tokens=2048,\n", |
37 | 41 | ")\n", |
38 | 42 | "\n", |
39 | | - "dspy.configure(lm=local_lm)\n", |
40 | | - "\n", |
41 | | - "openai_lm = dspy.LM(model=\"openai/gpt-4.1-mini\")" |
| 43 | + "dspy.configure(lm=local_lm)" |
42 | 44 | ] |
43 | 45 | }, |
44 | 46 | { |
|
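As a quick sanity check that the Arbor-served model is reachable before moving on, you can push a bare prediction through the configured LM. The snippet below is not part of the notebook diff; it simply assumes the `local_lm` setup and `dspy.configure(...)` call from the cell above, and `qa` is an illustrative name.

```python
import dspy

# Assumes the cell above already ran: arbor.init(), the ArborProvider-backed LM,
# and dspy.configure(lm=local_lm). A minimal smoke test routed through the local server.
qa = dspy.Predict("question -> answer")
print(qa(question="What is the capital of France?").answer)
```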
97 | 99 | "source": [ |
98 | 100 | "### Load the HoVer dataset.\n", |
99 | 101 | "\n", |
100 | | - "Let's load a dataset for our task. We'll load examples from the HoVer multi-hop task, where the input is a (really!) complex claim and the output we're seeking is the set of Wikipedia pages that are required to fact-check that claim." |
| 102 | + "Let's load a dataset for our task. We'll load examples from the HoVer multi-hop task, where the input is a (really!) complex claim and the output we're seeking is the set of Wikipedia pages that are required to fact-check that claim.\n", |
| 103 | + "\n", |
| 104 | + "You may have to install an older version of the dataset to get it working properly...\n", |
| 105 | + "```shell\n", |
| 106 | + "> pip install datasets==3.6.0\n", |
| 107 | + "```" |
101 | 108 | ] |
102 | 109 | }, |
103 | 110 | { |
|
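The cell that actually loads HoVer sits outside this diff hunk. For orientation, here is a rough sketch of what loading HoVer into `dspy.Example` objects might look like; the dataset id, the field names (`claim`, `supporting_facts`, `num_hops`), and the `trust_remote_code` flag are assumptions based on the Hugging Face `hover` dataset, and the pin to `datasets==3.6.0` is presumably needed because newer releases dropped support for script-based datasets like this one.

```python
from datasets import load_dataset  # datasets==3.6.0, per the note above

import dspy

# Sketch only: load the 3-hop HoVer examples and keep the claim plus the set of
# gold Wikipedia page titles the program is supposed to retrieve.
hover = load_dataset("hover-nlp/hover", trust_remote_code=True, split="train")

examples = [
    dspy.Example(
        claim=row["claim"],
        titles=sorted({fact["key"] for fact in row["supporting_facts"]}),
    ).with_inputs("claim")
    for row in hover
    if row["num_hops"] == 3
]
```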
226 | 233 | "metadata": {}, |
227 | 234 | "outputs": [], |
228 | 235 | "source": [ |
229 | | - "from dspy.teleprompt.grpo import GRPO\n", |
230 | | - "\n", |
231 | 236 | "program = ResearchHop(num_docs=4, num_hops=2)\n", |
232 | 237 | "program.set_lm(local_lm)\n", |
233 | 238 | "\n", |
234 | | - "# NOTE: Training on 6 GPUs.\n", |
| 239 | + "# NOTE: Training on 4 GPUs.\n", |
235 | 240 | "train_kwargs = {\n", |
236 | 241 | " \"per_device_train_batch_size\": 2,\n", |
237 | | - " \"gradient_accumulation_steps\": 8,\n", |
| 242 | + " \"gradient_accumulation_steps\": 24/6,\n", |
238 | 243 | " \"temperature\": 1.0,\n", |
239 | | - " \"beta\": 0.04,\n", |
240 | | - " \"learning_rate\": 1e-5,\n", |
| 244 | + " \"top_k\": -1,\n", |
| 245 | + " \"top_p\": 1.0,\n", |
| 246 | + " \"repetition_penalty\": 1.0,\n", |
| 247 | + " \"beta\": 0.00,\n", |
| 248 | + " \"learning_rate\": 1e-6,\n", |
241 | 249 | " \"gradient_checkpointing\": True,\n", |
242 | | - " \"gradient_checkpointing_kwargs\": {\"use_reentrant\": False},\n", |
243 | 250 | " \"bf16\": True,\n", |
244 | 251 | " \"lr_scheduler_type\": \"constant_with_warmup\",\n", |
| 252 | + " \"loss_type\": \"dapo\",\n", |
| 253 | + " \"max_steps\": 1000,\n", |
| 254 | + " \"report_to\": \"wandb\",\n", |
| 255 | + " \"log_completions\": True,\n", |
| 256 | + " \"logging_steps\": 1,\n", |
245 | 257 | " \"max_prompt_length\": None,\n", |
246 | 258 | " \"max_completion_length\": None,\n", |
247 | | - " \"scale_rewards\": True,\n", |
248 | | - " \"max_grad_norm\": 0.5,\n", |
249 | | - " \"lora\": True,\n", |
| 259 | + " \"scale_rewards\": False,\n", |
| 260 | + " \"max_grad_norm\": 1.0,\n", |
| 261 | + " \"lora_config\": {\n", |
| 262 | + " \"lora_alpha\": 16,\n", |
| 263 | + " \"lora_dropout\": 0.05,\n", |
| 264 | + " \"r\": 8,\n", |
| 265 | + " \"target_modules\": [\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"up_proj\", \"down_proj\", \"gate_proj\"],\n", |
| 266 | + " },\n", |
| 267 | + " \"num_training_gpus\": 3,\n", |
| 268 | + " \"num_inference_gpus\": 1,\n", |
| 269 | + " \"weight_decay\": 0.001,\n", |
250 | 270 | "}\n", |
251 | 271 | "\n", |
252 | | - "compiler = GRPO(\n", |
| 272 | + "compiler = ArborGRPO(\n", |
253 | 273 | " metric=recall,\n", |
254 | 274 | " num_dspy_examples_per_grpo_step=6,\n", |
255 | | - " num_rollouts_per_grpo_step=4,\n", |
| 275 | + " num_rollouts_per_grpo_step=24,\n", |
256 | 276 | " exclude_demos=True,\n", |
257 | | - " num_train_steps=100,\n", |
| 277 | + " num_train_steps=1000,\n", |
258 | 278 | " num_threads=16,\n", |
259 | 279 | " use_train_as_val=False,\n", |
260 | | - " num_steps_for_val=10,\n", |
| 280 | + " num_steps_for_val=50,\n", |
261 | 281 | " train_kwargs=train_kwargs,\n", |
262 | | - " report_train_scores=False,\n", |
| 282 | + " checkpoint=\"single-best\",\n", |
263 | 283 | ")\n", |
264 | 284 | "\n", |
265 | 285 | "optimized_program = compiler.compile(\n", |
266 | 286 | " student=program,\n", |
267 | 287 | " trainset=trainset,\n", |
268 | 288 | " valset=devset,\n", |
269 | | - ")\n" |
| 289 | + ")\n", |
| 290 | + "\n" |
270 | 291 | ] |
271 | 292 | }, |
272 | 293 | { |
|
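Once `compile` finishes, a natural follow-up (not shown in this diff) is to score the optimized program on the dev set with the same metric used during training. A minimal sketch, assuming the `recall` metric and `devset` defined earlier in the notebook:

```python
import dspy

# Compare the compiled program against the baseline on the held-out dev set.
evaluate = dspy.Evaluate(devset=devset, metric=recall, num_threads=16, display_progress=True)

baseline_score = evaluate(program)
optimized_score = evaluate(optimized_program)
print(baseline_score, optimized_score)
```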
290 | 311 | "cell_type": "markdown", |
291 | 312 | "metadata": {}, |
292 | 313 | "source": [ |
293 | | - "In our preliminary experiments, training above for about 18 hours boosts the recall (devset) from 61.8% to 66.2%. This is _typically_ worse on cost/quality basis than you'd get from running prompt optimizers dspy.MIPROv2 or dspy.SIMBA, but it's still a very solid start for online RL over arbitrary LM programs for small LMs." |
| 314 | + "In our preliminary experiments, training about 18 hours boosts the recall (devset) from 61.8% to 66.2%. This is _typically_ worse on cost/quality basis than you'd get from running prompt optimizers dspy.MIPROv2 or dspy.SIMBA, but it's still a very solid start for online RL over arbitrary LM programs for small LMs." |
294 | 315 | ] |
295 | 316 | } |
296 | 317 | ], |
297 | 318 | "metadata": { |
298 | 319 | "kernelspec": { |
299 | | - "display_name": "jun2024_py310", |
| 320 | + "display_name": "arbor-exps", |
300 | 321 | "language": "python", |
301 | 322 | "name": "python3" |
302 | 323 | }, |
|
310 | 331 | "name": "python", |
311 | 332 | "nbconvert_exporter": "python", |
312 | 333 | "pygments_lexer": "ipython3", |
313 | | - "version": "3.10.14" |
| 334 | + "version": "3.11.13" |
314 | 335 | } |
315 | 336 | }, |
316 | 337 | "nbformat": 4, |
|