Your one-stop multi-LLM think tank
Shengji Tang1,2,* Weihao Lin1,3,* Peng Ye1,2,† Jingqi Ye4 Hao Li1,5 Yiqun Zhang1,6 Xiaosong Wang1
Bo Zhang1 Shuyue Hu1 Tao Chen3 Lei Bai1 Wanli Ouyang1,2
1Shanghai Artificial Intelligence Laboratory 2The Chinese University of Hong Kong 3Fudan University
4University of Science and Technology of China 5Northwestern Polytechnical University 6Northeastern University
* Equal Contribution † Corresponding Author
- [2026/06] 🌟 We open-sourced all data and code, including the aisfuture/jisi_data dataset on Hugging Face and this repository.
- [2026/05] 🏆 Our paper was accepted to ICML 2026.
We will continue to update the routing question bank with the latest open-source models, including Qwen, Kimi, GLM, DeepSeek, and others. Any updates will be announced here.
- Open-source code — JiSi routing and aggregation implementation in this repository
- Open-source dataset — ready-to-run splits, embeddings, and benchmark bank on Hugging Face
- Web demo — browser-based JiSi experience (coming soon)
- Public API — hosted routing and aggregation endpoints (coming soon)
This is the official code repository for the paper Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale. The repository is primarily intended for academic research, but you can also use it to quickly assemble your own dedicated multi-LLM think tank.
The core design is minimal and effective: instead of relying on one monolithic model, JiSi turns a pool of heterogeneous open-source LLMs into a collective system. Given a query, JiSi retrieves semantically similar support-set questions, estimates which models are likely to perform well, and then either routes to a strong expert or aggregates multiple expert responses.
This repository keeps only the JiSi method and the minimal supporting infrastructure needed to reproduce and extend it.
Modern LLMs are strong but uneven: a model that excels at mathematical reasoning may underperform on long-form QA, coding, or domain-specific reasoning. JiSi treats this heterogeneity as a resource. It builds a support set containing questions, model responses, correctness records, usage information, and embeddings. At inference time, JiSi uses this support evidence to choose or combine models at the instance level.
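As a rough illustration of instance-level selection from support evidence, the sketch below scores each model by the similarity-weighted correctness of its answers on the nearest support questions. It uses query similarity only and is not the JiSi implementation: the function names, the `embedding` and `records` keys, and the weighting rule are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def estimate_model_scores(query_vec, support, k=2):
    """Similarity-weighted mean correctness per model over the top-k neighbors.

    `support` is a list of dicts with hypothetical keys "embedding" (vector)
    and "records" (model name -> correctness score).
    """
    ranked = sorted(
        support,
        key=lambda row: cosine(query_vec, row["embedding"]),
        reverse=True,
    )
    totals, weights = {}, {}
    for row in ranked[:k]:
        # Negative similarities are clamped so they never add weight.
        w = max(cosine(query_vec, row["embedding"]), 0.0)
        for model, correct in row["records"].items():
            totals[model] = totals.get(model, 0.0) + w * correct
            weights[model] = weights.get(model, 0.0) + w
    return {m: totals[m] / weights[m] for m in totals if weights[m] > 0}


# Toy support set with 2-dimensional embeddings (made up for illustration).
support = [
    {"embedding": [1.0, 0.0], "records": {"ModelA": 1.0, "ModelB": 0.0}},
    {"embedding": [0.9, 0.1], "records": {"ModelA": 1.0, "ModelB": 1.0}},
    {"embedding": [0.0, 1.0], "records": {"ModelA": 0.0, "ModelB": 1.0}},
]
scores = estimate_model_scores([1.0, 0.0], support, k=2)
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy example the query sits next to the two support questions that ModelA answered correctly, so ModelA receives the higher estimate.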
JiSi contains three main mechanisms:
- Query-response mixed routing: retrieve support-set neighbors and estimate model capability using both query similarity and response-side evidence.
- Support-set-based aggregator selection: choose aggregation candidates from the model pool instead of using a fixed aggregator for every question.
- Adaptive routing-aggregation switch: decide whether the current instance should be answered by a single routed expert or by aggregation over multiple experts.
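One simple way to picture the switch between routing and aggregation is a margin rule over per-model capability estimates. The sketch below is hypothetical: `plan_instance`, the `margin` and `top_n` knobs, and the choice of the top-scoring model as aggregator are illustrative assumptions, not the criteria used in the paper or in run_jisi.

```python
def plan_instance(scores, margin=0.15, top_n=3):
    """Decide between routing and aggregation for one query.

    `scores` maps model name -> estimated capability on this query.
    `margin` and `top_n` are hypothetical knobs for this sketch.
    """
    if not scores:
        raise ValueError("scores must contain at least one model")
    ranked = sorted(scores, key=scores.get, reverse=True)
    clear_winner = (
        len(ranked) == 1 or scores[ranked[0]] - scores[ranked[1]] >= margin
    )
    if clear_winner:
        # One model stands out: answer with that single expert.
        return {"action": "route", "experts": ranked[:1], "aggregator": None}
    # No clear winner: collect several experts and let the strongest
    # candidate merge their answers (an assumption of this sketch).
    return {
        "action": "aggregate",
        "experts": ranked[:top_n],
        "aggregator": ranked[0],
    }


easy = plan_instance({"ModelA": 0.9, "ModelB": 0.4, "ModelC": 0.3})
hard = plan_instance(
    {"ModelA": 0.62, "ModelB": 0.60, "ModelC": 0.55, "ModelD": 0.10}
)
print(easy)
print(hard)
```

The first call routes to a single expert because the top score leads by a wide margin; the second aggregates the three closest candidates.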
In the paper, JiSi built from ten open-source LLMs reaches 72.15% average performance while reducing total cost by 53.23% compared with Gemini-3-Pro on the reported benchmark suite.
The overview framework of JiSi.
How JiSi improves existing routing and aggregation: mixed routing, support-set-based aggregator selection, and adaptive switching between routing and aggregation.
- Training-free collaboration: JiSi uses support-set retrieval and score estimation rather than training a monolithic router.
- Instance-level decisions: each query can receive a different expert set and aggregation strategy.
- Superior comprehensive performance: JiSi achieves superior or competitive results across diverse benchmarks when compared with leading open-source and closed-source LLMs.
- Strong scalability: JiSi's performance keeps improving as more open-source LLMs join the pool.
Left: comparison with open-source LLMs. Right: comparison with closed-source LLMs.
JiSi performance improves consistently as the open-source model pool grows from 5 to 10 models.
| Component | Path | Purpose |
|---|---|---|
| JiSi runner | baselines/JiSi/run_jisi.py | Main routing and aggregation entry point |
| JiSi config | baselines/JiSi/config.py | Runtime configuration and validation |
| Model API layer | baselines/JiSi/utils/ | OpenAI-compatible generation and aggregation utils |
| Data adaptor | baselines/adaptors/jisi_adaptor.py | Converts collected benchmark results into JiSi JSONL files |
| Data collector | data_collector/ | Optional utilities for collecting model outputs on supported evaluators |
| Evaluators | evaluation/ | Benchmark-specific scoring utilities |
We recommend Python 3.10 or 3.11. CUDA is recommended for large-scale runs.
```bash
conda create -n jisi python=3.10
conda activate jisi
pip install -r requirements.txt
```

Set API keys through environment variables.
```bash
cp .env.example .env
# Edit .env locally, or export variables in your shell:
export EMBEDDING_API_KEY="your-embedding-key"
export OPENAI_API_KEY="your-openai-compatible-key"
```

Copy the example configs before your first run:
```bash
cp baselines/JiSi/config/jisi/main.example.json baselines/JiSi/config/jisi/main.local.json
cp baselines/JiSi/config/jisi/api_config.example.json baselines/JiSi/config/jisi/api_config.local.json
cp config/embedding_config.example.yaml config/embedding_config.local.yaml
```

Point main.local.json at the local paths above (embedding_config_path, api_config_path, and the data/jisi/... files from step 2).
Download the ready-to-run split from aisfuture/jisi_data and copy it into the repository:
```bash
pip install -U "huggingface_hub[cli]"
DATASET_REPO=aisfuture/jisi_data
huggingface-cli download \
  --repo-type dataset "$DATASET_REPO" \
  --include "example_data/seed42_split0.7/*" "example_data/*.tar" \
  --local-dir .hf_jisi_data
mkdir -p data/jisi
cp -r .hf_jisi_data/example_data/* data/jisi/
```

The layout should look like:

```text
data/jisi/
  seed42_split0.7/
    train.jsonl
    test.jsonl
    baseline_scores.json
  train_query_embed.tar
  train_response_embed.tar
  test_response_embed.tar
```
The three *.tar files are precomputed embedding caches. When they are present, run_jisi loads them automatically and skips rebuilding the large query/response embedding banks.
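Before launching a run, it can help to confirm that the download landed where the runner expects it. The following is a minimal sketch, not part of the repository: `missing_files` and `EXPECTED` are hypothetical names, and the placement of the three *.tar caches directly under data/jisi/ is inferred from the `--include` patterns in the download command.

```python
from pathlib import Path

# Files named in the expected layout. The *.tar caches are assumed to sit
# directly under the data root, next to the seed42_split0.7/ directory.
EXPECTED = [
    "seed42_split0.7/train.jsonl",
    "seed42_split0.7/test.jsonl",
    "seed42_split0.7/baseline_scores.json",
    "train_query_embed.tar",
    "train_response_embed.tar",
    "test_response_embed.tar",
]


def missing_files(root="data/jisi"):
    """Return the expected relative paths that are not present under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).is_file()]


missing = missing_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("data/jisi layout looks complete")
```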
JiSi calls an OpenAI-compatible /embeddings endpoint at inference time. Edit config/embedding_config.local.yaml so base_url, api_model_name, and api_key match your deployment.
Option A — local server (recommended for reproduction)
Download gte-Qwen2-7B-instruct, then serve it with an OpenAI-compatible stack such as vLLM:
```bash
vllm serve <path-or-repo-to-gte-Qwen2-7B-instruct> \
  --task embed \
  --host 0.0.0.0 \
  --port 8000
```

Set base_url in embedding_config.local.yaml to http://127.0.0.1:8000/v1 and align main.local.json (embedding_base_url, embedding_model) with the served model name.
Option B — remote API
Point embedding_config.local.yaml at a hosted OpenAI-compatible embedding API (for example text-embedding-3-large on OpenAI or OpenRouter) and set EMBEDDING_API_KEY in .env.
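To sanity-check an embedding deployment independently of JiSi, a request can be sent to the /embeddings route by hand. The sketch below uses only the Python standard library and follows the OpenAI embeddings request and response shape; `build_embedding_request` and `embed` are hypothetical helper names, and the model name and `EMPTY` key are placeholders that must match your own server.

```python
import json
import urllib.request


def build_embedding_request(base_url, model, api_key, texts):
    """Build a POST request for an OpenAI-compatible /embeddings endpoint."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def embed(base_url, model, api_key, texts, timeout=30):
    """Send the request and return one vector per input text, in input order."""
    req = build_embedding_request(base_url, model, api_key, texts)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Building the request does not touch the network; call embed(...) with the
# same arguments once the server from Option A or B is reachable.
req = build_embedding_request(
    "http://127.0.0.1:8000/v1",   # base_url from embedding_config.local.yaml
    "gte-Qwen2-7B-instruct",      # placeholder: use your served model name
    "EMPTY",                      # placeholder key for a local server
    ["What is 2 + 2?"],
)
print(req.get_method(), req.full_url)
```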
Router-only:
```bash
python -m baselines.JiSi.run_jisi \
  --config baselines/JiSi/config/jisi/main.local.json
```

Or pass flags explicitly:

```bash
python -m baselines.JiSi.run_jisi \
  --train-data data/jisi/seed42_split0.7/train.jsonl \
  --test-data data/jisi/seed42_split0.7/test.jsonl \
  --baseline-scores data/jisi/seed42_split0.7/baseline_scores.json \
  --embedding-config config/embedding_config.local.yaml \
  --api-config baselines/JiSi/config/jisi/api_config.local.json \
  --mode router
```

Aggregator mode: set "mode": "aggregator" in main.local.json (or pass --mode aggregator), and ensure api_config.local.json lists every model name in your JiSi pool.
Aggregation writes result.jsonl under result_dir. Score generated answers with:
```bash
python -m baselines.JiSi.post_eval \
  --res_path results/jisi/<run_name>/result.jsonl \
  --datasets paper
```

The repository does not include large benchmark result files. The easiest path is to download the Hugging Face dataset release aisfuture/jisi_data and copy the ready-to-run example_data/ contents to data/jisi/.
The expected structure is:
```text
data/jisi/
  seed42_split0.7/
    train.jsonl
    test.jsonl
    baseline_scores.json
  train_query_embed.tar
  train_response_embed.tar
  test_response_embed.tar
```
Each JSONL row should contain at least:
```json
{
  "query": "Question or prompt text",
  "dataset": "aime",
  "index": 1,
  "split": "test",
  "records": {
    "ModelA": 1.0,
    "ModelB": 0.0
  },
  "usages": {
    "ModelA": {"prompt_tokens": 120, "completion_tokens": 64, "cost": 0.0}
  },
  "raw_output": {
    "ModelA": "ModelA response"
  },
  "gt": "ground-truth answer"
}
```

The adaptor can build this format from benchmark collection outputs:
```bash
python -m baselines.adaptors.jisi_adaptor \
  --config config/adaptor/jisi_llm_v1.example.yaml \
  --seed 42 \
  --split-ratio 0.7 \
  --output-dir data/jisi
```

We have released the related datasets, caches, and question bank for JiSi at aisfuture/jisi_data. The Hugging Face dataset release is organized as:
```text
example_data/
  seed42_split0.7/
    train.jsonl
    test.jsonl
    baseline_scores.json
  train_query_embed.tar
  train_response_embed.tar
  test_response_embed.tar
benchmark_bank/
  <benchmark>/<split-or-mode>/<model>/*.json
datasets/
  <benchmark source files>
```
For a quick JiSi run, follow Quick Start step 2, which needs only the example_data/ part of the release:

```text
example_data/
  seed42_split0.7/
    train.jsonl
    test.jsonl
    baseline_scores.json
  train_query_embed.tar
  train_response_embed.tar
  test_response_embed.tar
```
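Once the split is in place, the JSONL rows can be inspected with a few lines of Python. The sketch below assumes only the row fields documented earlier (`records`, `usages` with a `cost` entry); `summarize` is a hypothetical helper, and the two sample rows are made up so the example runs without the dataset.

```python
import json
from collections import defaultdict


def summarize(lines):
    """Per-model mean record score and total reported cost over JSONL lines."""
    score_sum = defaultdict(float)
    count = defaultdict(int)
    cost = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        for model, score in row.get("records", {}).items():
            score_sum[model] += score
            count[model] += 1
        for model, usage in row.get("usages", {}).items():
            cost[model] += usage.get("cost", 0.0)
    return {
        m: {"mean_score": score_sum[m] / count[m], "cost": cost[m]}
        for m in count
    }


# Two made-up rows in the documented format.
sample = [
    json.dumps({
        "query": "q1", "dataset": "aime", "index": 1, "split": "test",
        "records": {"ModelA": 1.0, "ModelB": 0.0},
        "usages": {"ModelA": {"prompt_tokens": 120, "completion_tokens": 64, "cost": 0.5}},
        "raw_output": {"ModelA": "ModelA response"}, "gt": "42",
    }),
    json.dumps({
        "query": "q2", "dataset": "aime", "index": 2, "split": "test",
        "records": {"ModelA": 0.0, "ModelB": 1.0},
        "usages": {"ModelA": {"prompt_tokens": 80, "completion_tokens": 32, "cost": 0.25}},
        "raw_output": {"ModelA": "ModelA response"}, "gt": "7",
    }),
]
stats = summarize(sample)
print(stats)
```

For a real split, pass an open file instead of the sample list, for example `summarize(open("data/jisi/seed42_split0.7/train.jsonl"))`.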
The released question bank covers the following benchmark families. The standard JiSi post-evaluation command supports the paper benchmark set through baselines.JiSi.post_eval --datasets paper; SWE-Bench is included in the released question bank, but it is verified later with a separate SWE-Bench submission script rather than with the standard JiSi post-evaluation command.
| Dataset id | What it measures | JiSi usage | Supported by post_eval --datasets paper |
|---|---|---|---|
| aime | Competition-style mathematical reasoning. | Exact answer/math grading for final-answer responses. | Yes |
| gpqa | Graduate-level, multiple-choice science QA. | Letter-answer extraction and exact option matching. | Yes |
| hle | Broad expert-level factual and reasoning questions from Humanity's Last Exam. | LLM-assisted grading against benchmark references. | Yes |
| livecodebench | Programming problem solving. | Code extraction and benchmark test execution through the bundled evaluator. | Yes |
| livemathbench | Recent/live mathematical reasoning problems. | Exact answer/math grading for final-answer responses. | Yes |
| mmlupro | Multi-domain, multiple-choice knowledge and reasoning. | Letter-answer extraction and exact option matching. | Yes |
| simpleqa | Short-form factual QA. | LLM-assisted correctness grading against short references. | Yes |
| arenahard | Open-ended instruction following and chat quality. | Pairwise LLM judging against baseline answers. | Yes |
| swe-bench | Repository-level software issue repair on SWE-Bench Verified. | Single-turn patch-generation records for routing and aggregation research. | No |
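To make "letter-answer extraction and exact option matching" concrete, here is a small sketch of how a multiple-choice response could be graded. It is not the bundled evaluator: the regular expression, `extract_letter`, and `is_correct` are assumptions for illustration, and real responses usually need more robust patterns.

```python
import re

# Hypothetical pattern: matches phrases such as "Answer: C" or
# "the answer is (C)". Options A-J cover up to ten choices.
_ANSWER_RE = re.compile(
    r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-J])\)?(?![A-Za-z])"
)


def extract_letter(response):
    """Return the last option letter stated in `response`, or None."""
    matches = _ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def is_correct(response, gold):
    """Exact option matching between the extracted letter and the gold label."""
    return extract_letter(response) == gold.strip().upper()


print(extract_letter("The correct answer is (C)."))
print(is_correct("I first thought the answer is A, but the final answer is D.", "d"))
```

Taking the last match rather than the first is a design choice for responses that revise an earlier guess before committing to a final option.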
The released SWE-Bench records use SWE-bench Verified, the 500-instance human-validated subset, with split=verified. JiSi treats each SWE-Bench instance as a single-turn patch-generation task: the issue statement, relevant repository context, and patch-format instruction are packed into one prompt, and the model returns one patch candidate. These records are useful for comparing routing and aggregation decisions, but they are not interactive coding-agent trajectories.
SWE-Bench cannot be checked by the standard JiSi post-evaluation flow because its verification is patch-based and requires the SWE-Bench submission/evaluation backend.
This repository includes the follow-up SWE-Bench verification helper at:
baselines/JiSi/test_swe.py
The helper is adapted from the original SWE-Bench validation flow. It reads the generated result.jsonl, filters rows whose dataset contains swe, maps each JiSi-local index to the official SWE-Bench instance_id with swe_imap.json when needed, extracts a patch from the model response, writes the SWE-Bench prediction file, and can submit it with sb-cli submit swe-bench_verified test. It is intentionally separate from baselines.JiSi.post_eval because it is a later patch-validation step, not a normal answer-extraction scorer.
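The flow described above (filter SWE rows, map indices to instance ids, extract a patch, build the prediction object) can be sketched as follows. This is an illustration, not the code in test_swe.py: the `response` field name, the `{"<index>": "<instance_id>"}` shape of the index map, and the fenced-diff extraction rule are assumptions, while the `model_patch` and `model_name_or_path` keys follow the prediction format used by this repository.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence
_PATCH_RE = re.compile(
    FENCE + r"(?:diff|patch)?[ \t]*\n(.*?)" + FENCE, re.DOTALL
)


def extract_patch(response):
    """Return the first fenced diff, the raw text if it is a bare diff, else ''."""
    match = _PATCH_RE.search(response)
    if match:
        return match.group(1)
    bare = response.lstrip()
    return response if bare.startswith(("diff --git", "--- ")) else ""


def build_predictions(rows, index_map, model_name):
    """Build a prediction object keyed by SWE-Bench instance id."""
    preds = {}
    for row in rows:
        if "swe" not in row["dataset"]:
            continue  # keep only SWE-Bench rows
        # Prefer an explicit instance_id; otherwise map the local index.
        instance_id = row.get("instance_id") or index_map[str(row["index"])]
        preds[instance_id] = {
            "model_patch": extract_patch(row["response"]),
            "model_name_or_path": model_name,
        }
    return preds


# Made-up rows; the index-to-id pairing is for illustration only.
rows = [
    {
        "dataset": "swe-bench",
        "index": 0,
        "response": "Here is the fix:\n" + FENCE
        + "diff\n--- a/f.py\n+++ b/f.py\n@@ -1 +1 @@\n-x = 1\n+x = 2\n" + FENCE,
    },
    {"dataset": "aime", "index": 3, "response": "42"},
]
preds = build_predictions(rows, {"0": "astropy__astropy-12907"}, "jisi_run_name")
print(json.dumps(preds, indent=2))
```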
For a rerun, pass the paths and run metadata through CLI arguments:
| Argument | Meaning |
|---|---|
| --res_path | Path to the JiSi result.jsonl containing generated answers. |
| --output | Output prediction JSON file consumed by sb-cli; defaults to swe_result.json beside result.jsonl. |
| --index-map | JSON mapping from JiSi-local SWE indices to official SWE-Bench instance ids. Omit it only when result rows already contain instance_id. |
| --benchmark-file | Optional released benchmark_bank SWE-Bench JSON file; the helper can derive the index to instance_id mapping from its records. |
| --run-id | Submission run id. |
| --model-name | Model or system name written into the SWE-Bench prediction file. |

The prediction file produced by the script is a JSON object keyed by official SWE-Bench instance id:
```json
{
  "astropy__astropy-12907": {
    "model_patch": "... unified diff ...",
    "model_name_or_path": "jisi_run_name"
  }
}
```

Generate the SWE-Bench prediction JSON:
```bash
python -m baselines.JiSi.test_swe \
  --res_path results/jisi/<run_name>/result.jsonl \
  --index-map path/to/swe_imap.json \
  --model-name jisi_run_name \
  --run-id jisi_swe_verified
```

If you are working from a released benchmark_bank SWE-Bench JSON file instead of swe_imap.json, pass it with --benchmark-file:
```bash
python -m baselines.JiSi.test_swe \
  --res_path results/jisi/<run_name>/result.jsonl \
  --benchmark-file .hf_jisi_data/benchmark_bank/swe-bench/verified/<model>/<file>.json \
  --model-name jisi_run_name \
  --run-id jisi_swe_verified
```

Add --submit when sb-cli is installed and SWEBENCH_API_KEY is available:
```bash
SWEBENCH_API_KEY=$SWEBENCH_API_KEY \
python -m baselines.JiSi.test_swe \
  --res_path results/jisi/<run_name>/result.jsonl \
  --index-map path/to/swe_imap.json \
  --model-name jisi_run_name \
  --run-id jisi_swe_verified \
  --submit
```

The released question bank contains benchmark responses and correctness records from the following open-source model pool. The ready-to-run example_data/ split uses ten of these models, while the broader benchmark_bank/ keeps additional model-output files where available. We will continue updating the released question bank with the latest open-source models.
| Model | In example_data/ | In benchmark_bank/ |
|---|---|---|
| deepseek-r1-0528 | Yes | Yes |
| deepseek-v3-0324 | Yes | Yes |
| deepseek-v3.1-terminus | Yes | Yes |
| deepseek-v3.2-speciale | Yes | Yes |
| deepseek-v3.2-thinking | Yes | Yes |
| glm-4.6 | Yes | Yes |
| glm-5 | No | Yes |
| intern-s1 | Yes | Yes |
| kimi-k2-0905 | Yes | Yes |
| kimi-k2.5 | No | Yes |
| minimax-m2.5 | No | Yes |
| qwen3-235b-a22b-2507 | Yes | Yes |
| qwen3-235b-a22b-thinking-2507 | Yes | Yes |
| qwen3.5-397b-a17b | No | Yes |
You can also inspect the ready-to-run split directly with datasets:
```python
from datasets import load_dataset

repo_id = "aisfuture/jisi_data"
ds = load_dataset(repo_id, "jisi_example")
print(ds)
print(ds["train"][0].keys())
```

If you followed Quick Start, you already created *.local.json and embedding_config.local.yaml. Edit those local copies and keep them untracked. The important fields are:
| Field | Meaning |
|---|---|
| train_data_path / test_data_path | Pre-split JiSi support/test files |
| baseline_scores_path | Per-model benchmark scores used for reporting |
| embedding_config_path | OpenAI-compatible embedding endpoint config |
| api_config_path | OpenAI-compatible LLM endpoint config for aggregation |
| mode | router or aggregator |
| rag_num | Number of retrieved support questions considered for routing evidence |
| agg_model | Aggregator model name, or auto to select from routed candidates |
| result_dir | Output directory for aggregation results |
The API config supports literal keys, key lists, environment variables, or named keys:
```json
{
  "extra_api_keys": {
    "OPENAI": "OPENAI_API_KEY",
    "LOCAL": "EMPTY"
  },
  "model_configs": {
    "gpt-4.1-mini": {
      "mode": "openai",
      "model_name": "gpt-4.1-mini",
      "base_url": "https://api.openai.com/v1",
      "api_key_name": "OPENAI"
    }
  }
}
```

If you want to rebuild model-output caches from scratch, use the data collector:
```bash
python -m data_collector.cli info config/data_collector_example.yaml
python -m data_collector.cli run config/data_collector_example.yaml
```

The collector writes benchmark result files under results/bench/, which can then be converted with baselines.adaptors.jisi_adaptor.
JiSi/
assets/ Figures used in the paper and README
baselines/
JiSi/ JiSi algorithm implementation
adaptors/ JiSi data adaptor only
common/cache/ Optional MySQL cache helpers
config/ Safe example configs
data/ External data placeholder
data_collector/ Optional benchmark collection utilities
evaluation/ Benchmark evaluators
generators/ OpenAI-compatible generation wrappers
results/ External result placeholder
We thank ynulihao/LLMRouterBench. This codebase was refactored from that project, and the open-source JiSi release keeps only the components needed for JiSi data preparation, routing, aggregation, and evaluation.
If you find JiSi useful, please cite:
```bibtex
@misc{tang2026gemini3prorevisitingllmrouting,
  title={Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale},
  author={Shengji Tang and Weihao Lin and Peng Ye and Jingqi Ye and Hao Li and Yiqun Zhang and Xiaosong Wang and Bo Zhang and Shuyue Hu and Tao Chen and Lei Bai and Wanli Ouyang},
  year={2026},
  eprint={2601.01330},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.01330}
}
```

This repository is released under the MIT License. Datasets, model weights, and external benchmark assets may be governed by their own licenses.






