
Commit abe8079

[Doc] add VL-LN Bench docs (#7)
* add VL-LN Bench doc
* Solve the issue from kew6688
* add more results and logs
1 parent ec69d32 commit abe8079

File tree

4 files changed: +294 -0 lines changed


source/en/user_guide/internnav/index.md

Lines changed: 1 addition & 0 deletions
@@ -13,4 +13,5 @@ myst:
  quick_start/index
  tutorials/index
+ projects/index
```
Lines changed: 277 additions & 0 deletions
@@ -0,0 +1,277 @@
# Extended benchmarks in InternNav

This page provides tutorials on using InternVLA-N1 with different benchmarks.

## VL-LN Bench

VL-LN Bench is a large-scale benchmark for Interactive Instance Goal Navigation. It provides: (1) an automatic pipeline for generating dialog-augmented trajectories, (2) a comprehensive evaluation protocol for training and assessing dialog-capable navigation models, and (3) the dataset and base model used in our experiments. For full details, see our [paper](https://arxiv.org/abs/2512.22342) and the [project website](https://0309hws.github.io/VL-LN.github.io/).

On this page, we cover how to run InternVLA-N1 on this benchmark and give an overview of the VL-LN Bench data collection pipeline.

- [Data Collection Pipeline](https://github.com/InternRobotics/VL-LN)
- [Training and Evaluation Code](https://github.com/InternRobotics/InternNav)
- [Dataset](https://huggingface.co/datasets/InternRobotics/VL-LN-Bench) and [Base Model](https://huggingface.co/InternRobotics/VL-LN-Bench-basemodel)

### Abstract

In most existing embodied navigation tasks, instructions are well-defined and unambiguous, such as instruction following and object searching. Under this idealized setting, agents are required solely to produce effective navigation outputs conditioned on vision and language inputs. However, real-world navigation instructions are often vague and ambiguous, requiring the agent to resolve uncertainty and infer user intent through active dialog. To address this gap, we propose Interactive Instance Goal Navigation (IIGN), a task that requires agents not only to generate navigation actions but also to produce language outputs via active dialog, thereby aligning more closely with practical settings. IIGN extends Instance Goal Navigation (IGN) by allowing agents to freely consult an oracle in natural language while navigating. Building on this task, we present the Vision Language-Language Navigation (VL-LN) benchmark, which provides a large-scale, automatically generated dataset and a comprehensive evaluation protocol for training and assessing dialog-enabled navigation models. VL-LN comprises over 41k long-horizon dialog-augmented trajectories for training and an automatic evaluation protocol with an oracle capable of responding to agent queries. Using this benchmark, we train a navigation model equipped with dialog capabilities and show that it achieves significant improvements over the baselines. Extensive experiments and analyses further demonstrate the effectiveness and reliability of VL-LN for advancing research on dialog-enabled embodied navigation.

![VL-LN teaser](../../../_static/image/vlln_teaser.png)

An example of the IIGN task. The oracle (top left) first gives a simple goal-oriented navigation instruction ("Search for the chair."). The agent has to locate a specific instance of the given category (chair). The agent can ask three types of questions (attribute, route, and disambiguation) to progressively resolve ambiguity and locate the target instance. The full description in the bottom right is the instruction given to the agent in the IGN task; it uniquely identifies the specific chair in this environment.

### Evaluation

#### Metrics

VL-LN Bench reports standard navigation metrics (**SR**, **SPL**, **OS**, **NE**) and introduces **Mean Success Progress (MSP)** to measure dialog utility.

- SR: Success Rate
- SPL: Success Rate weighted by Path Length
- OS: Oracle Success Rate
- NE: Navigation Error
- MSP: Mean Success Progress

Given a maximum dialog budget of \( n \) turns:

- Let \( s_0 \) be the success rate **without dialog**.
- Let \( s_i \) be the success rate with at most \( i \) dialog turns \((1 \le i \le n)\).

$$
\mathrm{MSP}=\frac{1}{n}\sum_{i=1}^{n}(s_i-s_0)
$$

MSP measures the **average success improvement** brought by dialog and favors **gains achieved with fewer turns**.

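
For concreteness, here is a minimal Python sketch of how MSP can be computed from per-budget success rates. The function name and the numbers below are placeholders for illustration, not part of the benchmark code or results:

```python
def mean_success_progress(success_rates: list[float]) -> float:
    """Compute MSP from success rates [s_0, s_1, ..., s_n].

    s_0 is the success rate without dialog; s_i is the success rate
    with at most i dialog turns.
    """
    s0, budgets = success_rates[0], success_rates[1:]
    n = len(budgets)
    return sum(s_i - s0 for s_i in budgets) / n

# Placeholder example: success rates for dialog budgets of 0, 1, 2, 3 turns.
print(mean_success_progress([14.2, 16.0, 17.5, 18.4]))  # average gain over s_0
```
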
#### Methods

We evaluate **five baseline methods**.

- **FBE**: a greedy frontier-based exploration agent that repeatedly selects the nearest frontier; it detects the target instance using an open-vocabulary detector built on **Grounded SAM 2**.
- **VLFM**: uses the [officially released version](https://github.com/bdaiinstitute/vlfm).

The following three baselines are initialized from **Qwen2.5-VL-7B-Instruct** and trained with the **InternVLA-N1** recipe, using different data mixtures. All three include the **InternVLA-N1 VLN data**. For the detailed training configuration, please refer to [this script](https://github.com/InternRobotics/InternNav/blob/dev/scripts/train/qwenvl_train/train_system2_vlln.sh).

- **VLLN-O**: additionally uses **object goal navigation** data (**23,774** trajectories).
- **VLLN-I**: additionally uses **instance goal navigation** data **without dialog** (**11,661** trajectories).
- **VLLN-D**: additionally uses **instance goal navigation** data **with dialog** (**11,661** trajectories).

#### Results

IIGN and IGN use the same episode setup but differ in how the goal is described. In IIGN, the instruction only specifies the target category (e.g., "Search for the chair."). In contrast, IGN provides a fully disambiguating description that uniquely identifies the instance in the scene (e.g., "Locate the brown leather armchair with a smooth texture and curved shape, standing straight near the wooden desk and curtain. The armchair is in the resting room."). MSP is reported only for VLLN-D, the only baseline that performs active dialog; for the other methods the column is left blank (–).

**IIGN**

| Method | SR↑ | SPL↑ | OS↑ | NE↓ | MSP↑ |
| :----: | :--: | :---: | :--: | :---: | :--: |
| FBE | 8.4 | 4.74 | 25.2 | 11.84 | – |
| VLFM | 10.2 | 6.42 | 32.4 | 11.17 | – |
| VLLN-O | 14.8 | 10.36 | 47.0 | 8.91 | – |
| VLLN-I | 14.2 | 8.18 | 47.8 | 9.54 | – |
| VLLN-D | 20.2 | 13.07 | 56.8 | 8.84 | 2.76 |

**IGN**

| Method | SR↑ | SPL↑ | OS↑ | NE↓ | MSP↑ |
| :----: | :--: | :---: | :--: | :---: | :--: |
| FBE | 7.4 | 4.45 | 33.4 | 11.78 | – |
| VLFM | 12.6 | 7.68 | 35.4 | 10.85 | – |
| VLLN-O | 5.6 | 4.24 | 25.2 | 10.76 | – |
| VLLN-I | 22.4 | 13.43 | 60.4 | 8.16 | – |
| VLLN-D | 25.0 | 15.59 | 58.8 | 7.99 | 2.16 |

Across both **IIGN** and **IGN**, **VLLN-D** achieves the best performance, highlighting the benefit of proactive querying, while still leaving substantial room for improvement. Based on our analysis of **IIGN** failure cases, we summarize the key remaining challenges:

- **Image–attribute alignment is the main bottleneck** for both IGN and IIGN.
- **Questioning remains limited**: the agent still struggles to reliably disambiguate the target instance from same-category distractors through dialog.

#### Citation

```latex
@misc{huang2025vllnbenchlonghorizongoaloriented,
      title={VL-LN Bench: Towards Long-horizon Goal-oriented Navigation with Active Dialogs},
      author={Wensi Huang and Shaohao Zhu and Meng Wei and Jinming Xu and Xihui Liu and Hanqing Wang and Tai Wang and Feng Zhao and Jiangmiao Pang},
      year={2025},
      eprint={2512.22342},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.22342},
}
```

### Deployment Tutorial

Here is a basic example of training and evaluating InternVLA-N1 on VL-LN Bench.

#### Step 1. Dataset Preparation

VL-LN Bench is built on the Matterport3D (MP3D) scene dataset, so you need to download both the MP3D scenes and the VL-LN Bench dataset.

- Scene Dataset: download the [MP3D Scene Dataset](https://niessner.github.io/Matterport/)
- [VL-LN Data](https://huggingface.co/datasets/InternRobotics/VL-LN-Bench)
- [VL-LN Base Model](https://huggingface.co/InternRobotics/VL-LN-Bench-basemodel)

Your data root should be `VL-LN-Bench/`, the root folder of the **VL-LN Bench dataset** (VL-LN Data). First, **unzip all `*.json.gz` files** under `VL-LN-Bench/traj_data/`. Then:

- Place the **VL-LN Base Model** into `VL-LN-Bench/base_model/`.
- Place the **Matterport3D (MP3D) scene dataset** into `VL-LN-Bench/scene_datasets/`.

After setup, your folder structure should look like this:

```bash
VL-LN-Bench/
├── base_model/
│   └── iign/
├── raw_data/
│   └── mp3d/
│       ├── scene_summary/
│       ├── train/
│       │   ├── train_ign.json.gz
│       │   └── train_iign.json.gz
│       └── val_unseen/
│           ├── val_unseen_ign.json.gz
│           └── val_unseen_iign.json.gz
├── scene_datasets/
│   └── mp3d/
│       ├── 17DRP5sb8fy/
│       ├── 1LXtFkjw3qL/
│       ...
└── traj_data/
    ├── mp3d_split1/
    ├── mp3d_split2/
    └── mp3d_split3/
```
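
As a convenience, here is a small Python sketch that decompresses every `*.json.gz` file under `traj_data/` in place. The helper below is illustrative and not part of the released tooling:

```python
import gzip
import shutil
from pathlib import Path

def gunzip_tree(root: str) -> None:
    """Decompress every *.json.gz under `root`, writing the .json next to it."""
    for gz_path in Path(root).rglob("*.json.gz"):
        json_path = gz_path.with_suffix("")  # strips the trailing .gz
        with gzip.open(gz_path, "rb") as src, open(json_path, "wb") as dst:
            shutil.copyfileobj(src, dst)

gunzip_tree("VL-LN-Bench/traj_data")
```
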
#### Step 2. Environment Setup

Here we set up the Python environment for VL-LN Bench and InternVLA-N1. If you have already installed the InternNav Habitat environment, you can skip those steps and run only the commands related to VL-LN Bench.

- Get Code
```bash
git clone git@github.com:InternRobotics/VL-LN.git      # code for data collection
git clone git@github.com:InternRobotics/InternNav.git  # code for training and evaluation
```

- Create Conda Environment
```bash
conda create -n vlln python=3.9 -y
conda activate vlln
```

- Install Dependencies
```bash
conda install habitat-sim=0.2.4 withbullet headless -c conda-forge -c aihabitat
cd VL-LN
pip install -r requirements.txt
cd ../InternNav
pip install -e .
```

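To confirm the simulator is importable in the `vlln` environment, you can run a quick sanity check. This is a minimal sketch; it only assumes that `habitat_sim` imports cleanly after the conda install above:

```python
# Run inside the activated `vlln` conda environment.
import habitat_sim

# habitat-sim normally exposes a version string; fall back gracefully if not.
print("habitat_sim version:", getattr(habitat_sim, "__version__", "unknown"))
```
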
#### Step 3. Guidance for Data Collection Pipeline

This step is optional. You can either use our collected data for policy training, or follow this step to collect your own training data.

- Prerequisites:
  - Get `pointnav_weights.pth` from [VLFM](https://github.com/bdaiinstitute/vlfm/tree/main/data)
  - Arrange the directory structure like this:
```bash
VL-LN
├── dialog_generation/
├── images/
├── VL-LN-Bench/
│   ├── base_model/
│   ├── raw_data/
│   ├── scene_datasets/
│   ├── traj_data/
│   └── pointnav_weights.pth
...
```

- Collect Trajectories
```bash
# If you have Slurm
sbatch generate_frontiers_dialog.sh

# Or run directly
python generate_frontiers_dialog.py \
    --task instance \
    --vocabulary hm3d \
    --scene_ids all \
    --shortest_path_threshold 0.1 \
    --target_detected_threshold 5 \
    --episodes_file_path VL-LN-Bench/raw_data/mp3d/train/train_iign.json.gz \
    --habitat_config_path dialog_generation/config/tasks/dialog_mp3d.yaml \
    --baseline_config_path dialog_generation/config/expertiments/gen_videos.yaml \
    --normal_category_path dialog_generation/normal_category.json \
    --pointnav_policy_path VL-LN-Bench/pointnav_weights.pth \
    --scene_summary_path VL-LN-Bench/raw_data/mp3d/scene_summary \
    --output_dir <PATH_TO_YOUR_OUTPUT_DIR>
```

#### Step 4. Guidance for Training and Evaluation

Here we show how to train your own model for the IIGN task and evaluate it on VL-LN Bench.

- Prerequisites
```bash
cd InternNav
# Link VL-LN Bench data into InternNav
mkdir projects && cd projects
ln -s /path/to/your/VL-LN-Bench ./VL-LN-Bench
```
- Write your OpenAI API key to `api_key.txt` (see its expected location in the structure below).
```bash
# Your final repo structure may look like
InternNav
├── assets/
├── internnav/
│   ├── habitat_vlln_extensions
│   │   ├── simple_npc
│   │   │   ├── api_key.txt
│   │   │   ...
│   ...
...
├── projects
│   ├── VL-LN-Bench/
│   │   ├── base_model/
│   │   ├── raw_data/
│   │   ├── scene_datasets/
│   │   ├── traj_data/
│   │   ...
...
```

- Start Training
```bash
# Before running, please open this script and make sure
# the "llm" path points to the correct checkpoint on your machine.
sbatch ./scripts/train/qwenvl_train/train_system2_vlln.sh
```

After training, a checkpoint folder will be saved to `checkpoints/InternVLA-N1-vlln/`. You can then evaluate either **your own checkpoint** or our **VLLN-D checkpoint**. To switch the evaluated model, update `model_path` in `scripts/eval/configs/habitat_dialog_cfg.py`.
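
Switching checkpoints is just a matter of editing that one path. The snippet below is a hypothetical excerpt (the actual structure of `habitat_dialog_cfg.py` may differ; only the `model_path` name comes from the docs):

```python
# scripts/eval/configs/habitat_dialog_cfg.py (hypothetical excerpt)
# Point model_path at the checkpoint you want to evaluate.
model_path = "checkpoints/InternVLA-N1-vlln"           # your own trained checkpoint
# model_path = "projects/VL-LN-Bench/base_model/iign"  # or the downloaded base model
```
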
- Start Evaluation
```bash
# If you have Slurm
sh ./scripts/eval/bash/srun_eval_dialog.sh

# Or run directly
python scripts/eval/eval.py \
    --config scripts/eval/configs/habitat_dialog_cfg.py
```

After running evaluation, you will get an `output/` directory like this:

```bash
# The output directory may look like:
output/
└── dialog/
    ├── vis_0/         # rendered evaluation videos for each episode
    ├── action/        # step-by-step model outputs (actions / dialogs) for each episode
    ├── progress.json  # detailed results for every episode
    └── result.json    # aggregated metrics over all evaluated episodes
```

**Notes**

1. Each `.txt` file under `action/` logs what the agent did at every step. Lines may look like:

   - `0 <talk> Tell me the room of the flat-screen TV? living room`
   - `4 <move> ←←←←`

   The first number is the **step index**. The tag indicates whether the step is an **active dialog** (`<talk>`) or a **navigation action** (`<move>`); see the parsing sketch after this list.

2. `progress.json` stores **per-episode** details, while `result.json` reports the **average performance** across all episodes in `progress.json`, including `SR`, `SPL`, `OS`, `NE`, and `STEP`.
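
As a convenience, here is a small Python sketch for post-processing these outputs. The log-line format follows the examples above; the field names read from `result.json` (`SR`, `SPL`, `OS`, `NE`, `STEP`) and the example file paths are assumptions based on note 2, not a documented schema:

```python
import json
import re
from pathlib import Path

LINE_RE = re.compile(r"^(\d+)\s+<(talk|move)>\s*(.*)$")

def parse_action_log(txt_path: str):
    """Yield (step_index, tag, payload) tuples from one action/*.txt log."""
    for line in Path(txt_path).read_text(encoding="utf-8").splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            step, tag, payload = match.groups()
            yield int(step), tag, payload

def summarize(result_json: str = "output/dialog/result.json") -> None:
    """Print the aggregated metrics reported by the evaluation."""
    metrics = json.loads(Path(result_json).read_text(encoding="utf-8"))
    for key in ("SR", "SPL", "OS", "NE", "STEP"):
        if key in metrics:
            print(f"{key}: {metrics[key]}")

# Example usage (file names depend on your evaluation run):
# for step, tag, payload in parse_action_log("output/dialog/action/episode_0.txt"):
#     print(step, tag, payload)
# summarize()
```
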
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
---
myst:
  html_meta:
    "description lang=en": |
      Documentation for users who wish to build sphinx sites with
      pydata-sphinx-theme.
---

# Projects

```{toctree}
:caption: Projects
:maxdepth: 2

benchmark
```

0 commit comments
