A research project exploring Vision-Language-Action (VLA) models for autonomous driving. VLAD combines visual perception with language understanding to predict driving actions and trajectories in simulation, using CARLA and the Bench2Drive dataset.
VLAD integrates:
- Vision-Language Models (VLMs) — Qwen3-VL for multimodal reasoning over ego-view images
- DriveFusion — Transformer-based fusion of image embeddings with diffusion for trajectory prediction
- Diffusion Policy — Action prediction conditioned on visual and state inputs
- Bench2Drive — Large-scale driving dataset from CARLA (HuggingFace: rethinklab/Bench2Drive)
The model predicts future waypoints and actions from camera history, ego-state, and navigation commands, suitable for end-to-end autonomous driving in simulation.
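As a rough illustration of that prediction interface, the sketch below denoises a short horizon of 2-D waypoints from noise, conditioned on a fused context vector, in the style of a diffusion policy. All names and dimensions here (`predict_waypoints`, the linear "denoiser" stub) are hypothetical placeholders, not the project's actual API; the real denoiser is a learned, conditioned network.

```python
import numpy as np

def predict_waypoints(context, horizon=8, steps=10, seed=0):
    """Toy diffusion-style sampler: start from Gaussian noise and iteratively
    refine a (horizon, 2) trajectory toward a context-conditioned guess.
    Stands in for the learned denoising network."""
    rng = np.random.default_rng(seed)
    traj = rng.normal(size=(horizon, 2))              # pure noise at t = T
    # Linear "denoiser" stub: maps the context to a straight-line trajectory.
    target = np.outer(np.arange(1, horizon + 1), context[:2])
    for t in range(steps):
        alpha = (t + 1) / steps                       # simple noise schedule
        traj = (1 - alpha) * traj + alpha * target    # blend toward the guess
    return traj

# Context from a (hypothetical) fused image / ego-state embedding.
ctx = np.array([0.5, 0.1, 0.0, 0.2])
wps = predict_waypoints(ctx)
print(wps.shape)  # (8, 2): eight future (x, y) waypoints
```

The real model replaces the linear stub with a transformer conditioned on image embeddings, ego-state, and the navigation command.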
```
├── src/
│   ├── models/       # DriveFusion, diffusion policy, diffusion transformer
│   ├── dataloaders/  # Bench2Drive dataset loaders (single-frame & history)
│   ├── vlm/          # Qwen VLM wrappers, embedding cache
│   ├── driver/       # CARLA driver with VLM backbone
│   └── utils/        # Bench2Drive parsing, visualization
├── scripts/          # Setup, dataset download, testing
├── media/            # Example ego-view images for VLM testing
└── carla.sh          # CARLA server launcher (Docker)
```
Example ego-view image used for VLM queries:
On Bridges-2:

```bash
source scripts/setup_env.sh
# Then: conda activate ./conda/vlad
```

Local: create a conda env with Python 3.8 and install `src/requirements.txt` (PyTorch 2.2, CARLA 0.9.15, etc.).
Open two terminals:

Terminal 1 — start the CARLA server:

```bash
./carla.sh
```

Terminal 2 — run the client:

```bash
python3 src/CarlaClientTest.py
```

This spawns a vehicle in Town02 and runs autopilot for 60 seconds. CARLA listens on port 2000.
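A minimal client along those lines, sketched with the CARLA 0.9.x Python API (the actual `CarlaClientTest.py` may differ in details; this requires the server from `./carla.sh` to be running):

```python
import time
import carla

# Connect to the CARLA server started by ./carla.sh (default port 2000).
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.load_world("Town02")

# Spawn a vehicle at the first predefined spawn point.
blueprint = world.get_blueprint_library().filter("vehicle.*")[0]
spawn_point = world.get_map().get_spawn_points()[0]
vehicle = world.spawn_actor(blueprint, spawn_point)

# Let the built-in autopilot drive for 60 seconds, then clean up.
vehicle.set_autopilot(True)
time.sleep(60)
vehicle.destroy()
```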
Bench2Drive Base (full set):

```bash
scripts/download_bench2drive_base.sh
```

Test the dataloader:

```bash
python3 scripts/test_dataloader.py
```

This loads the Bench2Drive dataset, prints statistics, and saves sample visualizations (images with overlaid waypoints) to `output/samples/`.
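The overlay step amounts to projecting ego-frame waypoints into the camera image. A simplified pinhole version is sketched below; the intrinsics, camera height, and frame conventions here are illustrative, not the dataset's actual calibration:

```python
import numpy as np

def project_waypoints(waypoints, K, cam_height=1.6):
    """Project ground-plane waypoints (x forward, y left, in meters) into
    pixel coordinates with a pinhole camera mounted `cam_height` m above
    the ground. Returns (N, 2) pixel coords for points in front of the camera."""
    pts = []
    for x, y in waypoints:
        if x <= 0:               # behind the camera plane; not visible
            continue
        # Camera frame: right = -y, down = cam_height, forward = x.
        p = K @ np.array([-y, cam_height, x])
        pts.append(p[:2] / p[2])  # perspective divide
    return np.array(pts)

# Illustrative intrinsics for a 1600x900 image (fx = fy = 800).
K = np.array([[800.0,   0.0, 800.0],
              [  0.0, 800.0, 450.0],
              [  0.0,   0.0,   1.0]])
uv = project_waypoints([(5.0, 0.0), (10.0, 1.0), (-1.0, 0.0)], K)
print(uv)  # two visible points; the waypoint behind the camera is dropped
```

Drawing small circles at these pixel coordinates (e.g. with OpenCV's `cv2.circle`) yields the kind of overlay the test script saves.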
- Diffusion Policy:

  ```bash
  python3 src/models/train_diffusion_policy.py
  ```

- DriveFusion:

  ```bash
  python3 src/models/train_drivefusion.py
  ```
Both use Hydra for config and Weights & Biases for logging.
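Configs follow the usual Hydra layout. A hypothetical fragment, with file name and keys chosen purely for illustration (the repo's actual configs may be structured differently):

```yaml
# conf/train_diffusion_policy.yaml (illustrative, not the repo's actual config)
defaults:
  - _self_

model:
  horizon: 8            # number of predicted waypoints
  diffusion_steps: 100  # denoising iterations at inference

trainer:
  max_epochs: 50
  accelerator: gpu

wandb:
  project: vlad
  log_model: false
```

Any key can then be overridden from the command line, e.g. `python3 src/models/train_diffusion_policy.py trainer.max_epochs=10`.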
Key dependencies: PyTorch 2.2, CARLA 0.9.15, PyTorch Lightning, HuggingFace Hub, Hydra, OpenCV, Pandas. See src/requirements.txt for full list.
CMU Intro to Deep Learning 11785 Project — VLA for Autonomous Driving
