Welcome to the official repository for the SaFiRe model presented in "SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation."
Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. While recent methods leverage pre-trained vision backbones and larger training corpora to achieve impressive results, they predominantly focus on simple expressions: short, clear noun phrases like “red car” or “left girl”. This simplification often reduces RIS to a keyword/concept-matching problem, limiting the model’s ability to handle referential ambiguity in expressions. In this work, we identify two challenging real-world scenarios: object-distracting expressions, which involve multiple entities with contextual cues, and category-implicit expressions, where the object class is not explicitly stated. To address these challenges, we propose a novel framework, SaFiRe, which mimics the human two-phase cognitive process: first forming a global understanding, then refining it through detail-oriented inspection. This is naturally supported by Mamba’s scan-then-update property, which aligns with our phased design and enables efficient multi-cycle refinement with linear complexity. We further introduce aRefCOCO, a new benchmark designed to evaluate RIS models under ambiguous referring expressions. Extensive experiments on both standard and proposed datasets demonstrate the superiority of SaFiRe over state-of-the-art baselines.
Current RIS methods primarily focus on simple expression patterns. However, in real-world applications, referring expressions often exhibit referential ambiguity.
We summarize referential ambiguity into two challenging cases:
- object-distracting expression, e.g., “compared to the blue-shirt man, he is closer to the two giraffes”.
- category-implicit expression, e.g., “he is the taller one”.
To facilitate the study of referential ambiguity in real-world scenarios, we introduce the aRefCOCO dataset, which specifically focuses on challenging ambiguous referring expressions. You can access the dataset and related resources here:
- 👉 HuggingFace
- 👉 GitHub
To address these challenges, SaFiRe mimics the human two-phase cognitive process: a saccade phase that forms a global understanding of the scene, followed by a fixation phase that refines it through detail-oriented inspection. Mamba’s scan-then-update property naturally supports this phased design and enables efficient multi-cycle refinement with linear complexity.
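The reiteration idea can be illustrated with a toy scan-then-update loop. This is purely illustrative: the function names, the linear recurrence, and the update rule are assumptions for exposition, not the actual SaFiRe implementation.

```python
import numpy as np

def scan(features, state, decay=0.9):
    """'Saccade' phase: a linear scan over the token sequence
    (Mamba-style recurrence h_t = decay * h_{t-1} + x_t),
    producing a global summary state."""
    for x in features:
        state = decay * state + x
    return state

def update(features, state, lr=0.5):
    """'Fixation' phase: refine each token toward the global state."""
    return features + lr * (state - features)

def safire_like_refinement(features, cycles=2):
    """Alternate global scan and local update for a few cycles."""
    state = np.zeros(features.shape[-1])
    for _ in range(cycles):
        state = scan(features, state)
        features = update(features, state)
    return features

tokens = np.random.default_rng(0).normal(size=(5, 4))
refined = safire_like_refinement(tokens)
print(refined.shape)  # (5, 4)
```

Each cycle revisits the whole sequence, which is what gives the scan its linear complexity in sequence length.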
- python 3.10.13:

```shell
conda create -n SaFiRe python=3.10.13
```

- torch 2.1.1 + cu118:

```shell
pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu118
```

- install dependencies:

```shell
pip install -r requirements.txt
```

- build the kernel for VMamba dependencies:

```shell
cd selective_scan && pip install .
```
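After installation, a quick sanity check can confirm the core packages are importable. This is a convenience snippet, not part of the repository, and the module names checked are assumptions based on the steps above:

```python
import importlib.util

def check_env(modules=("torch", "torchvision")):
    """Report which required modules are importable in the current env,
    without actually importing them."""
    return {name: importlib.util.find_spec(name) is not None
            for name in modules}

status = check_env()
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```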
Refer to the aRefCOCO repository for dataset preparation instructions.
This implementation only supports multi-GPU DistributedDataParallel training. For example, to train SaFiRe with 2 GPUs, run:

```shell
python -m torch.distributed.launch --nproc_per_node=2 \
    --use_env main.py \
    --output_dir your/logging/directory \
    --if_amp \
    --batch_size 8 \
    --model-ema \
    --data-set refcoco+
```

And to evaluate on a checkpoint, run:
```shell
python -m torch.distributed.launch --nproc_per_node=1 \
    --use_env main.py \
    --if_amp \
    --eval \
    --resume your/checkpoint.pth \
    --test-split testA
```

Our best checkpoint can be downloaded from here. Note that the checkpoint was trained on the mixed datasets as described in our paper.
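RIS results are commonly reported with mask-IoU-based metrics (e.g., oIoU and mIoU). The following is a minimal, self-contained sketch of per-mask IoU for reference; it is not the repository's actual evaluation code:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union between two boolean segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0  # both empty -> perfect match

pred = np.zeros((4, 4), dtype=bool); pred[:2, :] = True  # top half predicted
gt = np.zeros((4, 4), dtype=bool); gt[:, :2] = True      # left half is ground truth
print(mask_iou(pred, gt))  # 4 / 12 ≈ 0.333
```

mIoU averages this quantity over expressions, while oIoU accumulates intersections and unions over the whole dataset before dividing, so the two can differ noticeably on small objects.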
We sincerely appreciate the contributions of the open-source community for their work on data processing and usage. The related projects are as follows: ReMamber, VMamba, LAVT.
If you find our work helpful for your research, please consider citing:
```bibtex
@article{mao2025safire,
  title   = {SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation},
  author  = {Zhenjie Mao and Yuhuan Yang and Chaofan Ma and Dongsheng Jiang and Jiangchao Yao and Ya Zhang and Yanfeng Wang},
  journal = {Advances in Neural Information Processing Systems (NeurIPS)},
  year    = {2025}
}
```
We also recommend other highly related works:
```bibtex
@article{yang2024remamber,
  title   = {ReMamber: Referring Image Segmentation with Mamba Twister},
  author  = {Yuhuan Yang and Chaofan Ma and Jiangchao Yao and Zhun Zhong and Ya Zhang and Yanfeng Wang},
  journal = {European Conference on Computer Vision (ECCV)},
  year    = {2024}
}
```

