Sample iOS app for SinSR — single-step diffusion-based super-resolution (CVPR 2024). 4× upscaling in one denoiser pass via a Swin Transformer UNet on a VQ-VAE latent space (~113M params).
Left: bicubic 4× upscale, Right: SinSR single-step diffusion SR (128×128 → 512×512)
| Model | Size | Input | Output |
|---|---|---|---|
| SinSR_Encoder.mlpackage.zip | 39 MB | image [1,3,1024,1024] | latent [1,3,256,256] |
| SinSR_Denoiser.mlpackage.zip | 420 MB | input [1,6,256,256] | predicted_latent [1,3,256,256] |
| SinSR_Decoder.mlpackage.zip | 58 MB | latent [1,3,256,256] | image [1,3,1024,1024] |
- Download the three `.mlpackage.zip` files above
- Unzip and drag them into the Xcode project
- Build and run on a physical device (iOS 17+)
LQ image → resize 256² → Encoder → latent [1,3,256,256]
│
add noise (κ=2.0, η_T=0.99) ▼
noisy_latent
│
concat with LQ → [1,6,256,256]
│
Denoiser (single step, t=14 baked in)
│
predicted_latent
│
clamp [-1, 1] → Decoder → image [1,3,1024,1024]
The denoiser runs once (not iteratively) — that's the SinSR distillation. Swift handles noise injection, scaling, and latent space marshalling.
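The noise-injection and scaling step between the encoder and the denoiser can be sketched as follows. This is a minimal illustration in Python, not the app's Swift code. It assumes the ResShift-style formulation that SinSR builds on: Gaussian noise with standard deviation κ·√η_T is added to the LQ latent, and the result is divided by √(κ²·η_T + 1) before it enters the denoiser. The function name and the flat-list latent are hypothetical; only κ = 2.0 and η_T = 0.99 come from the diagram above.

```python
import math
import random

KAPPA = 2.0   # noise strength (kappa) from the pipeline diagram
ETA_T = 0.99  # final-step shift (eta_T) from the pipeline diagram


def prepare_denoiser_latent(lq_latent, rng):
    """Add noise to a flat list of latent values, then rescale it.

    Assumption: ResShift-style prior sample and input normalisation;
    the app's actual Swift code may differ in detail.
    """
    sigma = KAPPA * math.sqrt(ETA_T)             # std of the injected noise
    scale = math.sqrt(KAPPA ** 2 * ETA_T + 1.0)  # assumed normalisation factor
    noisy = [v + sigma * rng.gauss(0.0, 1.0) for v in lq_latent]
    scaled = [v / scale for v in noisy]
    return noisy, scaled


# Usage: a seeded generator keeps the demo repeatable.
noisy, scaled = prepare_denoiser_latent([0.1, -0.3, 0.7], random.Random(0))
```

In the real pipeline the scaled latent would then be concatenated with the LQ image along the channel axis to form the 6-channel denoiser input.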
- Swin Transformer patches required for tracing:
  - Pre-compute relative position bias as `register_buffer`
  - Replace `torch.roll` with `slice + concat`
  - Rewrite attention-mask creation to avoid `__setitem__`
  - Patch the coremltools `int` op converter to handle multi-dim tensor shape casts
- VQ-VAE decoder ships with vector quantization inside the CoreML model: the 8192-entry codebook with `argmin` nearest-neighbor lookup runs on-device.
- Denoiser input is a 6-channel concat of `[scaled_noisy_latent, lq_image]` with the timestep baked in (always `t=14` for the single-step distillation).
- Denoiser must use FP32 precision. FP16 causes a pinkish color shift via overflow inside the Swin attention layers. Encoder/Decoder are fine in FP16.
- Use `.cpuOnly` for the denoiser for best accuracy. Encoder/Decoder run happily on `.cpuAndGPU` or `.all`.
- The output has a slight, consistent color shift relative to the input. That is inherent to the SinSR distilled architecture, not a conversion artifact.
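The roll replacement mentioned in the Swin patches rests on a simple identity: a cyclic shift equals the tail slice concatenated with the head slice. The sketch below shows that index arithmetic on plain Python lists so it runs without PyTorch. The helper names are hypothetical, and the patched model would presumably apply the same slicing to tensors with `torch.cat`.

```python
def roll_1d(seq, shift):
    """Cyclically shift a list with torch.roll semantics, using slices only."""
    n = len(seq)
    if n == 0:
        return list(seq)
    s = shift % n                     # normalise negative and oversized shifts
    return seq[n - s:] + seq[:n - s]  # tail slice first, then the head slice


def roll_2d(grid, shift_h, shift_w):
    """Roll a list of rows along the height axis, then along the width axis."""
    rows = roll_1d(grid, shift_h)
    return [roll_1d(row, shift_w) for row in rows]


# Usage: shift rows down by one and columns left by one.
shifted = roll_2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]], 1, -1)
# shifted == [[8, 9, 7], [2, 3, 1], [5, 6, 4]]
```

Because the result is built only from slices and a concatenation, the traced graph needs no dedicated roll op.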

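The `argmin` nearest-neighbor lookup that the decoder performs against its codebook can be illustrated with a small sketch. It assumes squared Euclidean distance, as is usual for VQ-VAE quantization, and uses a hypothetical three-entry codebook in place of the real 8192 entries. The function name is illustrative only.

```python
def nearest_code(vector, codebook):
    """Return (index, entry) of the codebook row closest to `vector`."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # min over indices acts as argmin; ties resolve to the lowest index.
    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


# Usage with a toy codebook of 2-dimensional entries.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
index, entry = nearest_code([0.9, 1.2], toy_codebook)
# index == 1, entry == [1.0, 1.0]
```

In the converted model this lookup would run once per latent position, replacing each latent vector with its nearest codebook entry before decoding.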