Jaihoon Kim*, Taehoon Yoon*, Jisung Hwang*, Minhyuk Sung
Our inference-time scaling method precisely aligns pretrained flow models with user preferences such as text prompts and object quantities.
- [12/10/25] 📌 Implementation of differentiable reward (aesthetic image generation) has been released.
- [18/09/25] 🎉 Our work has been accepted to NeurIPS 2025.
- [02/07/25] ⚙️ Configuration files for compositional image generation and quantity-aware image generation are released.
- [26/06/25] 📝 Baseline implementations are released.
- [30/04/25] 🚀 The code for quantity-aware image generation has been released.
- [21/04/25] 🔥 We have released the implementation of Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing for compositional image generation.
Create and activate a Conda environment (tested with Python 3.10):
conda create -n rbf python=3.10
conda activate rbf
Clone the repository:
git clone https://github.com/KAIST-Visual-AI-Group/Flow-Inference-Time-Scaling.git
cd Flow-Inference-Time-Scaling
Install PyTorch (tested with version 2.1.0) and required dependencies:
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/openai/CLIP.git
pip install -r requirements.txt
pip install -e .
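After installation, a quick sanity check (our suggestion, not part of the repository) confirms that PyTorch sees the GPU and that CLIP is importable:

```python
# Environment sanity check (our suggestion, not part of the repository).
import torch
import clip  # installed from the OpenAI CLIP repository above

print(torch.__version__)          # expected: 2.1.0+cu121
print(torch.cuda.is_available())  # should print True on a CUDA machine
print(clip.available_models())    # should include 'ViT-L/14'
```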
For compositional image generation, we use VQAScore, which can be installed as follows:
cd third-party/t2v_metrics/
pip install -e .
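To verify the VQAScore installation, the snippet below follows the usage documented in the `t2v_metrics` repository; the model name and image path are illustrative:

```python
# Minimal VQAScore usage sketch, following the t2v_metrics documentation.
import t2v_metrics

vqa = t2v_metrics.VQAScore(model='clip-flant5-xxl')  # model name from the t2v_metrics docs
score = vqa(images=['sample.png'],                   # illustrative image path
            texts=['a red ball on top of a blue cube'])
print(score)  # higher scores indicate better image-text alignment
```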
Key configuration arguments (overridable from the command line):

- `--text_prompt`: Text prompt to guide generation. Required for reward-based sampling.
- `--filtering_method`: Strategy for selecting or pruning particles (`bon`, `smc`, `code`, `svdd`, `rbf`).
- `--batch_size`: Number of prompts or samples processed in parallel during inference.
- `--n_particles`: Number of particles used per timestep to explore the reward landscape.
- `--block_size`: Number of timesteps grouped together for blockwise updates (set to 1 except in `code.yaml`).
- `--convert_scheduler`: Apply interpolant conversion at inference time (`vp`).
- `--sample_method`: Sampling method (`sde`, `ode`).
- `--diffusion_norm`: Diffusion norm for SDE sampling.
- `--max_nfe`: Total computational budget, in number of function evaluations, available during sampling; see the toy budget sketch after this list.
- `--max_steps`: Number of denoising steps in the generative process.
- `--reward_score`: Reward function used for alignment (`vqa`, `counting`, `aesthetic`).
- `--init_n_particles`: Initial number of particles at the start of generation.
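To make the budget accounting concrete, the toy sketch below (our illustration, not the repository's implementation) shows how a fixed `max_nfe` splits evenly across `max_steps` denoising steps, and how unspent evaluations at one step could roll over to later steps, which is the intuition behind rollover budget forcing:

```python
# Toy illustration of NFE budgeting with rollover (not the repo's implementation).
import random

max_nfe, max_steps = 500, 50
per_step = max_nfe // max_steps   # uniform allocation: 10 evaluations per step

carry = 0
for step in range(max_steps):
    budget = per_step + carry
    # Pretend a good-enough particle appears after a random number of
    # evaluations; in RBF this decision comes from the reward model.
    used = min(budget, random.randint(1, per_step))
    carry = budget - used         # unspent evaluations roll over to later steps
print("evaluations left over at the end:", carry)
```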
To save GPU memory, host the VQAScore VLM on a separate device. By default, the server listens on port 5000:
python rbf/corrector/reward_model/vqa_server.py
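Before launching generation, you can check that the server is actually listening; the generic probe below is ours and does not depend on the server's internal API:

```python
# Generic reachability check for the VQA server (our snippet).
import socket

with socket.socket() as s:
    s.settimeout(2.0)
    reachable = s.connect_ex(("localhost", 5000)) == 0  # default port 5000
print("VQA server reachable" if reachable else "nothing listening on port 5000")
```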
Run compositional image generation using the following command. To prevent out-of-memory (OOM) issues, we recommend running it on a different device from the VQA server.
You may optionally override configuration values by passing arguments directly on the command line:
CUDA_VISIBLE_DEVICES=$DEVICE python main.py --config config/compositional_image/rbf.yaml text_prompt="$TEXT_PROMPT"
Run quantity-aware image generation using the following command. The reward function combines Grounding DINO and Segment Anything for robust object detection (experiments were run on a 48 GB GPU).
CUDA_VISIBLE_DEVICES=$DEVICE python main.py --config config/quantity_aware/rbf.yaml text_prompt="$TEXT_PROMPT"
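For intuition, a counting reward typically penalizes the gap between the detected and requested object counts. The sketch below is a simplified stand-in for the Grounding DINO + Segment Anything pipeline, not the repository's reward implementation:

```python
# Simplified counting reward (our illustration, not the repo's implementation).
# In the actual pipeline, `detections` would come from Grounding DINO boxes
# refined by Segment Anything masks.
def counting_reward(detections: list, target_count: int) -> float:
    """Reward peaks at 0 when the detected count matches the target."""
    return -abs(len(detections) - target_count)

# Example: the prompt asks for 5 apples, but only 3 objects are detected.
boxes = [{"box": (0, 0, 10, 10)}, {"box": (20, 0, 30, 10)}, {"box": (40, 0, 50, 10)}]
print(counting_reward(boxes, target_count=5))  # -2
```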
Download the pretrained aesthetic score model checkpoint (`sac+logos+ava1-l14-linearMSE.pth`) from this link.
Place the checkpoint in the `ckpt` directory and run the following command:
CUDA_VISIBLE_DEVICES=$DEVICE python main.py --config config/aesthetic_image/rbf_dps.yaml text_prompt="$TEXT_PROMPT"
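For reference, `sac+logos+ava1-l14-linearMSE.pth` is the LAION aesthetic predictor: a small MLP regressor on top of L2-normalized CLIP ViT-L/14 image embeddings. The loading sketch below assumes the layer sizes of the public `improved-aesthetic-predictor` code; verify them against the downloaded checkpoint:

```python
# Sketch of loading and applying the aesthetic predictor, assuming the
# LAION improved-aesthetic-predictor architecture (an MLP over CLIP
# ViT-L/14 embeddings). Verify layer sizes against the checkpoint.
import clip
import torch
import torch.nn as nn
from PIL import Image

mlp = nn.Sequential(
    nn.Linear(768, 1024), nn.Dropout(0.2),
    nn.Linear(1024, 128), nn.Dropout(0.2),
    nn.Linear(128, 64), nn.Dropout(0.1),
    nn.Linear(64, 16),
    nn.Linear(16, 1),  # scalar aesthetic score
)
state = torch.load("ckpt/sac+logos+ava1-l14-linearMSE.pth", map_location="cpu")
# The public checkpoint stores weights under a "layers." prefix; strip it.
mlp.load_state_dict({k.removeprefix("layers."): v for k, v in state.items()})
mlp.eval()

clip_model, preprocess = clip.load("ViT-L/14", device="cpu")
image = preprocess(Image.open("sample.png")).unsqueeze(0)  # illustrative path
with torch.no_grad():
    feat = clip_model.encode_image(image).float()
    feat = feat / feat.norm(dim=-1, keepdim=True)  # predictor expects unit-norm features
    print("aesthetic score:", mlp(feat).item())
```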
If you find our code helpful, please consider citing our work:
@article{kim2025inference,
title={Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing},
author={Kim, Jaihoon and Yoon, Taehoon and Hwang, Jisung and Sung, Minhyuk},
journal={arXiv preprint arXiv:2503.19385},
year={2025}
}
This repository incorporates implementations from Flow Matching and FLUX. We sincerely thank the authors for publicly releasing their codebases.
