UDAN-CLIP (Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning) is a diffusion-based framework for underwater image enhancement. By integrating CLIP-guided semantic alignment, spatial attention mechanisms, and domain-adaptive diffusion modeling, our method restores color fidelity, contrast, and fine structures in images degraded by underwater scattering and absorption.
Underwater images suffer from complex degradations including light absorption, scattering, color casts, and artifacts—making enhancement critical for effective object detection, recognition, and scene understanding in aquatic environments. Existing methods, especially diffusion-based approaches, typically rely on synthetic paired datasets due to the scarcity of real underwater references, introducing bias and limiting generalization. Furthermore, fine-tuning these models can degrade learned priors, resulting in unrealistic enhancements due to domain shifts.
UDAN-CLIP addresses these challenges through an image-to-image diffusion framework pre-trained on synthetic underwater datasets and enhanced with a customized CLIP-based classifier, a spatial attention module, and a novel CLIP-Diffusion loss. The classifier preserves natural in-air priors and semantically guides the diffusion process, while the spatial attention module focuses on correcting localized degradations such as haze and low contrast.
Here is a pipeline diagram of UDAN-CLIP:

UDAN-CLIP achieves high-quality underwater image enhancement through four key components working in harmony:
- Domain-Adaptive Diffusion Module: Learns underwater degradation distributions and progressively restores clean images through a reverse diffusion process, preserving natural in-air priors while adapting to underwater domains.
- CLIP-Guided Classifier: Leverages vision-language alignment to semantically guide the enhancement process, ensuring restored images maintain semantic consistency with textual descriptions of "clear underwater scenes."
- Spatial Attention Mechanism: Focuses computational resources on heavily degraded regions (e.g., haze, backscatter, low-contrast areas), enabling targeted correction where it matters most.
- CLIP-Diffusion Loss: A novel loss function that strengthens visual-textual alignment during reverse diffusion, helping maintain semantic consistency throughout the enhancement pipeline.
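The CLIP-Diffusion loss described above can be sketched as a weighted sum of the usual diffusion denoising objective and a CLIP-style cosine-alignment term. The snippet below is an illustrative toy only: it uses plain Python lists in place of real tensors, and `lambda_clip` is a hypothetical weighting hyperparameter, not a value from the paper.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mse(pred, target):
    # Mean squared error: the standard noise-prediction diffusion objective.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def clip_diffusion_loss(pred_noise, true_noise, image_emb, text_emb, lambda_clip=0.1):
    # Toy combined loss: diffusion MSE plus a term that pulls the enhanced
    # image's CLIP embedding toward the embedding of a prompt such as
    # "a clear underwater scene". (lambda_clip is a hypothetical weight.)
    diffusion_term = mse(pred_noise, true_noise)
    alignment_term = 1.0 - cosine_similarity(image_emb, text_emb)
    return diffusion_term + lambda_clip * alignment_term
```

With perfect noise prediction and perfectly aligned embeddings, the combined loss is zero; misalignment adds a penalty scaled by `lambda_clip`.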
```bash
# Clone the repository
git clone https://github.com/BRAIN-Lab-AI/UDAN-CLIP.git
cd UDAN-CLIP

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Download the following datasets and place them in the data/ directory:
- T200 Dataset: Underwater images from turbid environments
- Color-Checker7: Color calibration dataset
- C60 Dataset: Comprehensive underwater image collection
Dataset links and preparation scripts will be provided soon.
Edit the config/config.yaml file with your model settings:

```yaml
model:
  diffusion_steps: 1000
  clip_model: "ViT-B/32"
  spatial_attention: true
training:
  batch_size: 8
  learning_rate: 1e-4
  epochs: 100
data:
  dataset: "T200"
  image_size: 256
```

Run inference on a single image:
```bash
python infer.py --input path/to/image.jpg --output results/
```

Run the demo script:
```bash
python infer_demo.py
```

Enhance a directory of images:
```bash
python sample.py --input_dir data/test_images/ --output_dir results/
```

Train the model:
```bash
python train.py --config config/config.yaml --gpu 0
```

Evaluate and compute metrics:
```bash
python eval.py
python final_calculate_metrics.py
```

├── __pycache__/
├── clip_model/
│ └── ViT-B-32.pt
├── config/
│ └── config.yaml
├── core/
├── data/
│ ├── T200/
│ ├── Color-Checker7/
│ └── C60/
├── misc/
├── model/
│ ├── __pycache__/
│ ├── ddpm_modules/
│ ├── sr3_modules/
│ ├── __init__.py
│ ├── base_model.py
│ ├── model.py
│ └── networks.py
├── static/
│ └── images/
│ ├── architecture_fig1.png
│ ├── architecture_fig2.png
│ ├── C60_comparison.png
│ ├── T200_comparison.png
│ ├── Color-Checker_comparison.png
│ ├── heatmap.png
│ ├── intro_fig.png
│ ├── plot1_T200.png
│ ├── plot2_Color-Checker7.png
│ ├── plot3_C60.png
│ ├── updated_zoomedin1.png
│ ├── updated_zoomedin2.png
│ └── results_table.png
├── LICENSE
├── README.md
├── eval.py
├── final_calculate_metrics.py
├── index.html
├── infer.py
├── infer_demo.py
├── metrics_util.py
├── requirement.txt
├── sample.py
└── train.py
- Domain-Adaptive Pretraining: Leverages underwater datasets while preserving natural image priors
- Progressive Restoration: Multi-step denoising for high-quality output
- Semantic Guidance: CLIP-based conditioning ensures visually coherent results
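The multi-step "progressive restoration" idea above can be illustrated with a toy reverse process: rather than jumping straight to a clean estimate, each reverse step produces a slightly cleaner sample. This is a pure-Python sketch of the control flow only, not the actual DDPM/SR3 modules in this repository; `toy_denoiser` is a made-up stand-in for the learned network.

```python
def toy_reverse_diffusion(x_noisy, denoise_step, num_steps=10):
    # Progressively refine a degraded sample over num_steps reverse steps.
    # denoise_step(x, t) returns a slightly cleaner estimate at step t.
    x = x_noisy
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)
    return x

def toy_denoiser(x, t):
    # Hypothetical "model": each step moves the sample 30% closer to a
    # clean target of 0.0, mimicking gradual multi-step convergence.
    return [v * 0.7 for v in x]
```

After 10 steps, each value shrinks by a factor of 0.7**10 (about 0.028), showing how the estimate converges gradually across the reverse chain.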
- Vision-Language Alignment: Leverages CLIP's multimodal understanding for semantic consistency
- Textual Conditioning: Uses natural language prompts to guide enhancement direction
- Contrastive Learning: Employs contrastive objectives to separate enhanced from degraded features
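The contrastive objective mentioned above can be sketched as an InfoNCE-style loss: an enhanced-image embedding (the anchor) should score high against its matching text embedding and low against degraded or mismatched negatives. This is a generic illustration of contrastive learning, not the exact loss used by UDAN-CLIP; the `temperature` value is the common CLIP default, used here only as an assumption.

```python
import math

def dot(a, b):
    # Inner product between two embedding vectors.
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchor, positive, negatives, temperature=0.07):
    # Softmax cross-entropy over similarities: the anchor should score
    # highest against its positive and low against every negative.
    sims = [dot(anchor, positive)] + [dot(anchor, n) for n in negatives]
    scaled = [s / temperature for s in sims]
    m = max(scaled)  # subtract the max for numerical stability
    exp = [math.exp(s - m) for s in scaled]
    return -math.log(exp[0] / sum(exp))
```

When the anchor matches its positive, the loss is near zero; when it matches a negative instead, the loss grows sharply, which is what separates enhanced from degraded features in embedding space.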
- Degradation Localization: Identifies and prioritizes heavily degraded regions
- Adaptive Focus: Dynamically allocates computational resources based on degradation severity
- Edge Preservation: Maintains structural integrity while removing artifacts
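The degradation-localization and adaptive-focus ideas above can be sketched as a per-pixel attention map: a degradation score is squashed through a sigmoid into a weight in (0, 1), and a correction is blended in most strongly where the weight is high. This is a toy 1-D illustration with hypothetical function names, not the repository's actual attention module.

```python
import math

def spatial_attention_map(degradation_scores):
    # Map per-pixel degradation scores to attention weights in (0, 1)
    # via a sigmoid: heavily degraded pixels get weights near 1.
    return [1.0 / (1.0 + math.exp(-s)) for s in degradation_scores]

def apply_attention(image, correction, attn):
    # Blend the correction into the image, strongest where attention is
    # high, so lightly degraded regions are left nearly untouched.
    return [x + a * (c - x) for x, c, a in zip(image, correction, attn)]
```

A heavily degraded pixel (large positive score) is pulled almost fully to the corrected value, while a clean pixel (large negative score) keeps its original value, which is how edge and structure preservation falls out of low attention weights.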
- Multiple Metrics: PSNR, SSIM, UIQM, UCIQE, and CPBD for thorough assessment
- Benchmark Datasets: Evaluated on T200, Color-Checker7, and C60
- Visual Comparisons: Qualitative results against state-of-the-art methods
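Of the metrics listed, PSNR has the simplest closed form and can be computed directly from the mean squared error between a reference and an enhanced image. The sketch below uses the standard textbook formula on flat pixel lists; the repository's own metric scripts (eval.py, final_calculate_metrics.py) may differ in details such as color-space handling.

```python
import math

def psnr(reference, enhanced, max_val=255.0):
    # Peak signal-to-noise ratio in dB; higher means closer to the reference.
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

SSIM, UIQM, UCIQE, and CPBD are more involved (structural, colorfulness, and sharpness statistics) and are best taken from established implementations rather than re-derived.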
UDAN-CLIP consistently outperforms baseline methods across all evaluation metrics and datasets:
| Dataset | Metric | Improvement over SOTA |
|---|---|---|
| T200 | PSNR | +16.15 |
| T200 | SSIM | +11.38 |
| Color-Checker7 | UIQM | +0.064 |
| C60 | CPBD | +0.165 |
Comparison on C60 dataset showing superior color correction and detail recovery
Results on turbid T200 images demonstrating haze removal and contrast enhancement
Color checker evaluation showing accurate color restoration
Performance heatmap showing enhancement quality across different degradation levels
We welcome contributions from the community! Here are some ways you can help:
- Report bugs: Open an issue if you encounter any problems
- Suggest improvements: Share ideas for enhancing the model or codebase
- Add features: Submit pull requests for new functionality
- Share results: Showcase UDAN-CLIP applications in your research
We are particularly interested in:
- Extending to underwater video enhancement
- Integration with underwater robotics platforms
- Adaptation for specific underwater environments (coral reefs, deep sea, etc.)
- Lightweight versions for edge deployment
This project is licensed under the MIT License - see the LICENSE file for details.
If you find UDAN-CLIP helpful for your research, please cite our paper:
@article{shaahid2025udanclip,
  title={Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning for Underwater Image Enhancement},
  author={Shaahid, Afrah and Behzad, Muzammil},
  journal={arXiv preprint arXiv:2505.19895},
  year={2025}
}

We thank the King Fahd University of Petroleum and Minerals and SDAIA-KFUPM JRC for Artificial Intelligence for supporting this research. We also acknowledge the developers of CLIP and the diffusion models that inspired this work.
Visit our project website for more details, visual results, and updates.
For questions or collaborations, please contact:
- Afrah Shaahid: afrahshaahid@outlook.com
- Muzammil Behzad: muzammil.behzad@kfupm.edu.sa