An example combining Monte Carlo Tree Search (MCTS) and LLMs for self-improvement through critique and refinement.
This script wraps an LLM with a basic search algorithm designed to iteratively improve an initial answer to a question. At each step, the LLM critiques its previous response and attempts to improve it, with the MCTS algorithm driving exploration and backpropagation of quality signals. The goal is to converge on a better answer than the vanilla model output.
See my repo implementing MCTS in pure C for Connect-4 here.
This implementation demonstrates:
- Use of seed answers to initialise a search tree
- An LLM used in three modes (sketched in code below):
  - Critique: highlight flaws in a draft
  - Improve: rewrite the draft using the critique
  - Evaluate: score the final answer numerically
- MCTS with UCT for decision-making over answer revisions
- Logging and structure suitable for extension or adaptation
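For concreteness, the three modes can be sketched as three prompt templates around a single `llm(prompt)` helper (defined in the LiteLLM sketch further down). The prompt wording here is illustrative, not the exact text used in the script:

```python
import re

def critique(question: str, draft: str) -> str:
    """Critique mode: ask the LLM to highlight flaws in a draft answer."""
    return llm(
        f"Question: {question}\nDraft answer: {draft}\n"
        "Point out what is wrong or missing in this draft and how it could be improved."
    )

def improve(question: str, draft: str, critique_text: str) -> str:
    """Improve mode: rewrite the draft using the critique."""
    return llm(
        f"Question: {question}\nDraft answer: {draft}\nCritique: {critique_text}\n"
        "Rewrite the draft into a better answer that addresses the critique."
    )

def evaluate(question: str, answer: str) -> float:
    """Evaluate mode: score the answer numerically, normalised to [0, 1]."""
    reply = llm(
        f"Question: {question}\nAnswer: {answer}\n"
        "Rate this answer from 0 to 100. Respond in the form 'Rating: <number>'."
    )
    match = re.search(r"Rating:\s*(\d+)", reply)
    return int(match.group(1)) / 100 if match else 0.0
```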
The process is as follows (a code sketch follows the list):
- Seed: Start with a basic or uncertain answer ("I'm not sure", etc.)
- Iterate: For a fixed number of iterations:
  - Select a promising node based on UCT
  - Critique the current answer
  - Generate an improved answer
  - Evaluate the result with a rating prompt
  - Backpropagate the score to update the tree
- Select: Return the highest-value child after search
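Putting that together, the core loop looks roughly like this (reusing the helpers sketched above; the `Node` class and the exploration constant are illustrative, not the script's exact names):

```python
import math

EXPLORATION_C = 1.41  # UCT exploration constant (illustrative choice)

class Node:
    def __init__(self, answer: str, parent: "Node | None" = None):
        self.answer = answer
        self.parent = parent
        self.children: list["Node"] = []
        self.visits = 0
        self.value = 0.0  # cumulative reward from backpropagation

    def uct(self) -> float:
        """UCT = mean reward + exploration bonus."""
        if self.visits == 0:
            return float("inf")  # always visit fresh nodes first
        return (self.value / self.visits
                + EXPLORATION_C * math.sqrt(math.log(self.parent.visits) / self.visits))

def search(question: str, seed_answer: str, iterations: int) -> str:
    root = Node(seed_answer)
    for _ in range(iterations):
        # Select: descend to a promising node via max UCT.
        node = root
        while node.children:
            node = max(node.children, key=Node.uct)
        # Expand: critique the current answer, then improve it.
        child = Node(improve(question, node.answer,
                             critique(question, node.answer)), parent=node)
        node.children.append(child)
        # Simulate: rate the improved answer.
        reward = evaluate(question, child.answer)
        # Backpropagate: update visit counts and values up to the root.
        while child is not None:
            child.visits += 1
            child.value += reward
            child = child.parent
    # Return the child with the highest mean value after search.
    return max(root.children, key=lambda n: n.value / n.visits).answer
```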
I'm using LiteLLM, so you can easily change the underlying LLM. Because MCTS makes a lot of calls, I'm using Gemini, which is currently free.
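The `llm` helper from the sketches above is then just a thin wrapper around `litellm.completion`, so switching providers is a one-line model-string change (the model names here are examples):

```python
from litellm import completion

MODEL = "gemini/gemini-1.5-flash"  # e.g. swap for "gpt-4o-mini" or a local model

def llm(prompt: str) -> str:
    """Single-turn chat completion via LiteLLM."""
    response = completion(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```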
You can run the script in one of three modes:

1. If no arguments are provided, a predefined question is used:

   ```bash
   python mcts-llm.py
   ```

2. Provide your own question via the `--question` flag:

   ```bash
   python mcts-llm.py --question "Why is the sky blue?"
   ```

3. Pull a question from the algebra subset of the MATH-lighteval dataset on Hugging Face. You may specify a row number and an optional difficulty level; if no row is given, it defaults to row 0:

   ```bash
   python mcts-llm.py --math 10 --level 2
   ```
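For reference, fetching a row presumably reduces to something like the following (the dataset repo id, config name, and field names are assumptions based on the Hugging Face card, not verified against the script):

```python
from datasets import load_dataset

def fetch_math_question(row: int = 0, level: int | None = None) -> str:
    # Assumed: "algebra" is a named config with a "train" split, and rows
    # carry "problem" and "level" (e.g. "Level 2") fields as in MATH.
    ds = load_dataset("DigitalLearningGmbH/MATH-lighteval", "algebra", split="train")
    if level is not None:
        ds = ds.filter(lambda r: r["level"] == f"Level {level}")
    return ds[row]["problem"]
```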
There is also support for a rubric to aid the scoring process. To leverage a rubric, define one in a text file, e.g. `rubric.txt`, and pass it like this:

```bash
python mcts-llm.py --question "Should companies be required to disclose when users are interacting with an AI system instead of a human?" --rubric-file rubric.txt
```

If this arg is omitted, the LLM simply scores of its own accord.
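A rubric file is just plain text that gets injected into the rating prompt. As an illustration (not the rubric shipped in the repo), a minimal `rubric.txt` for the disclosure question might look like:

```text
Score the answer from 0 to 100:
- 0-40: Picks a side without acknowledging trade-offs or stakeholders.
- 41-70: Names the trade-offs (user trust, transparency, compliance cost)
  but the reasoning is shallow or one-sided.
- 71-100: Weighs competing interests, considers edge cases (e.g. hybrid
  human/AI support), and clearly justifies a position.
Reasoning quality matters more than which position is taken.
```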
I included a responsible-AI example. Rubrics can be helpful across many domains, but as a rule of thumb they are most valuable when:
- There isn't a single correct answer
- Reasoning quality matters more than outcome
- Trade-offs, ambiguity, or judgment are involved
Anthropic has lots of good posts on this.
Broadly, though, a rubric can improve the signal quality of the reward function.
Without a rubric:
- The LLM is free to apply internal heuristics
- Ratings can vary based on wording, temperature, model updates, or prompt length
With a rubric:
- The LLM is explicitly instructed what to value
- Scoring becomes more stable, targeted, and meaningful
- The reward signal reflects your definition of "better", not the LLM’s
This means MCTS optimises toward consistent objectives, not ambiguous internal preferences.
The approach is not wholly faithful to the paper below, but is broadly in line with it:
https://arxiv.org/pdf/2406.07394
Full CLI output:
The starter question is "A man and a goat are on one side of a river. They have a boat. How can they go across?", which is notoriously difficult for LLMs.
Interesting difficult questions for LLMs can be found here.
The output shows the vanilla call gets it wrong, but the MCTS-LLM gets it correct.
Obviously, a more robust eval harness would be preferable here (TODO).
────────────────────────────────────── STARTING MCTS-LLM ───────────────────────────────────────
─────────────────────────────────────────── QUESTION ───────────────────────────────────────────
A man and a goat are on one side of a river. They have a boat. How can they go across?
───────────────────────────────────── VANILLA LLM RESPONSE ─────────────────────────────────────
This sounds like a riddle! Here's the solution:
1. **The man takes the goat across the river.**
2. **The man returns alone.**
3. **The man takes the goat across the river.**
Let me know if you'd like to try another one! 😊
───────────────────────────────────────── Iteration 1 ──────────────────────────────────────────
Critique: Okay, let's critique the draft answer.
**Question:** A man and a goat are on one side of a river. They have a boat. How can they go across?
**Critique of Draft Answer: "I'm not sure"**
* **What's wrong:** The answer "I'm not sure" is insufficient. It doesn't demonstrate any attempt to solve the problem.
* **What could be improved:** Explore possible scenarios. Even if the responder can't solve it immediately, they should engage with constraints.
* **Logical Fallacies:** None directly, but lacks reasoning.
**Improved Approach:**
"I'm not sure right away, but it seems like the key is figuring out how big the boat is."
**Option 1:**
"I'm not sure immediately. I need to know how big the boat is..."
**Option 2:**
"Hmm, this sounds like a riddle. The man will probably have to make multiple trips..."
**Option 3:**
"The man could row the goat across, then row back alone..."
**Why these are better:**
- Demonstrate engagement
- Identify constraints
- Suggest path forward
Rating Response: Rating: 75
Simulated reward: 0.75
───────────────────────────────────────── Iteration 2 ──────────────────────────────────────────
Critique: Here's a critique and breakdown.
**Problem:** Avoids problem instead of solving it. Too short. No reasoning.
**Understanding the Problem:**
1. Classic river-crossing puzzle.
2. Logical Fallacies: Potential cognitive ease and framing bias.
**Improved Approach:**
"Let’s assume the boat only holds one — man or goat. That shapes the solution."
**Improved Answer:**
"1. The man takes the goat across.
2. He returns alone.
3. He takes the goat again."
**Why This is Better:**
- Engages reasoning
- Clarifies assumptions
- Provides concrete plan
Note: The answer may still be incorrect, but it's clearly superior in structure.
Rating Response: Rating: 85
Simulated reward: 0.85
───────────────────────────────────────── Iteration 3 ──────────────────────────────────────────
Critique: Let’s critique again.
**Original:** "I'm not sure"
**Problem:** No effort or reasoning. Doesn’t explain or explore alternatives.
**Fallacies:** Appeal to ignorance
**Example Better Answer:**
"The man can row the goat across. That solves the problem."
**Alternative (small boat):**
Same logic applied with constraints.
**Final Improved Answer:**
The man rows the goat across the river in the boat.
Rating Response: Rating: 100
Simulated reward: 0.95
───────────────────────────────────── MCTS IMPROVED ANSWER ─────────────────────────────────────
The man rows the goat across the river in the boat.
───────────────────────────────── GROUND TRUTH (Boxed Answer) ──────────────────────────────────
They can use the boat.
────────────────────────────────────── EVALUATION SCORES ───────────────────────────────────────
Vanilla Score Prompt →
The solution is incomplete. It ignores typical constraints (e.g. leaving the goat unattended).
Could be valid in trivial cases, but misleading in context.
**Rating: 40**
MCTS Score Prompt →
**Rating: 100**
```mermaid
graph TD
    A[Start: User Question] --> B[Initialize Root Node with Seed Answer]
    B --> C[Run MCTS Search Iterations]
    subgraph Iteration i
        C --> D[Select Node via UCT]
        D --> E{Is Node Fully Expanded?}
        E -- Yes --> D2[Select Child with Max UCT]
        E -- No --> F[Expand Node: Create Child]
        F --> G[→ Critique Answer via LLM]
        G --> H[→ Improve Answer via LLM]
        H --> I[→ Simulate: Rate Answer via LLM]
        I --> J[→ Backpropagate Score to Ancestors]
    end
    J --> K{All Iterations Complete?}
    K -- No --> C
    K -- Yes --> L[Return Best Answer]
    L --> M[Display in Console]
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style L fill:#bbf,stroke:#000,stroke-width:2px
    style M fill:#bfb,stroke:#333,stroke-width:2px
```