flux-amd-rocm

FLUX.1-dev on AMD Radeon consumer GPUs — fast, low-VRAM, and shippable.

This repo bundles the backport patches, reference benchmarks, and plug-and-play scripts to run FLUX.1-dev on AMD ROCm with the same torchao + group_offload + use_stream=True workflow recently announced for NVIDIA (see @Sayak Paul's PR stack).

Out of the box, the upstream stack crashes on AMD in five distinct places. This repo contains the five backport patches that make it work, plus configuration presets for the most common Radeon cards.

TL;DR

git clone https://github.com/Dev-next-gen/flux-amd-rocm
cd flux-amd-rocm
python -m venv venv && source venv/bin/activate
./setup.sh                           # detects your GPU, writes .env.rocm, installs deps
source .env.rocm                     # MUST be sourced — sets AOTriton, hipBLASLt, etc.
export HF_TOKEN=hf_...               # FLUX.1-dev is gated
python generate.py "a dragon coiled around a medieval tower at sunset" \
  --out dragon.png

Without source .env.rocm, AOTriton Flash Attention falls back to the math kernel (~10× slower) and hipBLASLt tuning is disabled. The setup.sh script writes the file but does NOT source it — your shell has to, once per session.

On a single RX 7800 XT (16 GB): FLUX.1-dev 1024² in ~80 seconds, 6.4 GB peak VRAM (down from 144.9 s on the vanilla AMD path, at equal VRAM — best clean run 72.5 s; see variance note below).

Benchmarks

Full table: BENCHMARKS.md.

Single RX 7800 XT (gfx1101, 16 GB), FLUX.1-dev 1024² × 28 steps, bf16 int8

Configuration	Latency	Peak VRAM	Notes
Vanilla `sequential_cpu_offload` (pre-patch baseline)	144.9 s	6.39 GB	Only viable pre-patch path at low VRAM
`enable_model_cpu_offload`	82.1 s	12.57 GB	Pre-patch path at higher VRAM
`group_offload` + `use_stream=True` + `record_stream=True` (this repo)	~80 s	6.39 GB	Fastest at low VRAM (72–88 s observed, see note below)
`enable_model_cpu_offload` + FP16 (this repo)	75.2 s	12.57 GB	Fastest if you have 16 GB budget

Bottom line: ~45 % latency reduction at equal VRAM on consumer AMD, or 50 % VRAM reduction at equal latency. You pick.

How we got there — the patch-by-patch progression

Each row is the same group_offload path at 6.39 GB VRAM, with one additional fix applied:

Step	Patches applied	Latency	Δ vs baseline
Vanilla `sequential_cpu_offload` (ROCm state of the art pre-patch)	—	144.9 s	—
First port: `group_offload` + `use_stream=True` (4 patches)	1–4	128.3 s	-11 %
+ 5th patch: `record_stream` dispatch on AQT	1–5	88.5 s	-31 %
+ block8 tuning (this repo's default)	1–5	~80 s	-45 %

Patch #5 (record_stream) is the one that unlocks the big jump: without it, the custom transfer stream can't signal to PyTorch's allocator that the weight tensor is still in use, so the allocator keeps stalling on false conflicts. See docs/bugs.md for the full technical trace.

On reproducibility: the fastest clean run we measured was 72.5 s. On a fresh boot with a warm Triton cache we hit 72–79 s. After sustained multi-GPU workloads (thermal drift + kernel-cache churn) the same config stabilises at 85–88 s. All in the same 6.39 GB VRAM envelope. The 2.2–2.7× range vs a 4090 brackets the raw FP16 TFLOPS ratio exactly.

Reference NVIDIA (from HuggingFace docs)

RTX 4090 + quantization + torch.compile + model_cpu_offload: 32.3 s / 12.2 GB. A single 7800 XT's raw FP16 compute is ≈37 TFLOPS vs the 4090's ≈82 TFLOPS (2.22× gap). Our measured range (72–88 s vs 32.3 s = 2.2–2.7×) matches that hardware ceiling. The software gap is closed on this path.

Reproducing the benchmarks

With the env sourced and HF_TOKEN exported:

# Full sweep (sequential_cpu_offload, model_cpu_offload, group_offload+stream+record)
# Writes bench_results.json. ~5-10 min on a 7800 XT.
python bench.py --all

# Single run using an auto-detected preset (fastest way to verify setup)
python bench.py

# Force a specific preset
python bench.py --config configs/rx_7800_xt.yaml

First run is ~30 s slower (AOTriton kernel compilation). Subsequent runs hit the persistent cache at .cache/torchinductor/.

Supported hardware

GPU	Arch	Tested	Notes
RX 7800 XT	gfx1101 (RDNA3)	✅ full reference	this repo's primary target
RX 7700 XT	gfx1101	✅ (same arch)	should match 7800 XT
RX 7900 XTX	gfx1100 (RDNA3)	⚠️ expected to work	needs bench contribution
RX 7900 XT	gfx1100	⚠️ expected to work	needs bench contribution
RX 6800 / 6900 XT	gfx1030 (RDNA2)	⚠️ partial	RDNA2 lacks WMMA int8 accel, slower
MI300X	gfx942 (CDNA3)	🧪 untested	should work; has native matrix cores, likely much faster

See docs/adapt_your_gpu.md to test and contribute a config for your card.

System requirements

Component	Minimum	Recommended
GPU VRAM	8 GB (with `low_cpu_mem` preset)	16 GB
System RAM	32 GB	48 GB+
Disk (model weights)	32 GB free	—
ROCm	6.2+	7.1
PyTorch	2.7+ rocm	2.9.1+rocm7.1.1
Python	3.10+	3.12

Important

System RAM (not VRAM) is the sneaky constraint. The FLUX.1-dev load path spikes to ~35 GB briefly during bf16 → int8 quantization, even if steady-state is ~20 GB. If you have 16 GB system RAM you'll crash on load; if you have 24–32 GB use the rx_7800_xt_lowram preset which pins weights on-the-fly (-50 % RAM usage, measured +19 % latency — 86.7 s vs 72.5 s on a matched clean run; expect both to shift ~10 s upward under everyday load).

What this repo contains

The 5 backport patches

#	Bug	Location	Fix
1	`is_pinned` / `pin_memory` dispatches missing on `AffineQuantizedTensor` + `PlainAQTTensorImpl`	torchao 0.14.1	`patches/torchao_pin_memory.py`
2	`non_blocking` silently dropped in `AQT.to()` and `PlainAQTTensorImpl.to()`	torchao 0.14.1	`patches/torchao_stream_sync.py`
3	`ModuleGroup._onload_from_memory` does not synchronize the default stream against the custom transfer stream	diffusers 0.37.1	same file
4	`param.data = aqt` does not propagate to `.tensor_impl` — outer tensor wrapper on the right device but internals still on CPU	diffusers 0.37.1 (+ structural torchao issue)	same file
5	`record_stream` dispatch missing on `AffineQuantizedTensor`	torchao 0.14.1	`patches/torchao_pin_memory.py`

All five are pure Python monkey-patches applied at import time. No C/C++ build, no fork of upstream. They are designed to be easy to lift into upstream PRs against pytorch/ao and huggingface/diffusers.

See docs/bugs.md for the full technical write-up of each bug, including reproducers and root-cause analysis.

Configuration presets

configs/ contains per-GPU YAML presets (env vars, torch dtype, offload strategy, block size). configs/auto.py detects your card at runtime and picks the right preset.

Docker image

docker/Dockerfile.rocm7.1 builds a ready-to-run image with everything preinstalled. See docker/README.md.

Multi-GPU? See the companion repo

This repo is single-GPU focused. For multi-GPU on AMD ROCm — true tensor / context parallelism via diffusers' native enable_parallelism() (ring attention, Ulysses) and/or weight-sharded inference via device_map="balanced" for models too big to fit on one card — see the companion repo:

👉 diffusers-rocm-parallel

It ships the 6th AMD-specific backport patch (LSE-shape bug in diffusers 0.37's ring-attention merge on ROCm) and reference benches for 2–5 GPU configurations.

Why this exists

On April 17 2026, Sayak Paul announced the TorchAO + offloading integration in Diffusers (PR #13276 + torchao #4192 + pytorch/pytorch #175397). The announcement emphasises "consumer GPU users often restricted by heavy memory demands" — which describes AMD RDNA3 users precisely.

AMD is not mentioned in the release. Attempting to run the announced workflow on a RX 7800 XT with ROCm 7.1 surfaces five distinct bugs, most of them structural and reproducible on CUDA too. This repo documents them, fixes them, and provides a fully working path, plus reference benchmarks for anyone to reproduce.

Contributing

Benchmark your GPU: run python bench.py --config configs/auto.yaml on your card, add a row to BENCHMARKS.md via PR.
Add a config preset for an untested card: see docs/adapt_your_gpu.md.
Upstream the patches: each patch is written to be lifted into a pytorch/ao or huggingface/diffusers PR. If you want to help push them upstream, open an issue.

License

MIT — see LICENSE.

FLUX.1-dev weights are released by Black Forest Labs under their own non-commercial license. This repo does not redistribute any model weights; users must accept the license and download the weights themselves from HuggingFace.

Credits

@Sayak Paul and the HuggingFace / PyTorch / TorchAO teams for the upstream TorchAO × Diffusers integration
The ROCm and AOTriton teams at AMD for the underlying platform
Leo Camus — AMD backport patches, configuration presets, and reference benchmarks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

flux-amd-rocm

TL;DR

Benchmarks

Single RX 7800 XT (gfx1101, 16 GB), FLUX.1-dev 1024² × 28 steps, bf16 int8

How we got there — the patch-by-patch progression

Reference NVIDIA (from HuggingFace docs)

Reproducing the benchmarks

Supported hardware

System requirements

What this repo contains

The 5 backport patches

Configuration presets

Docker image

Multi-GPU? See the companion repo

Why this exists

Contributing

License

Credits

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.github/workflows		.github/workflows
configs		configs
docker		docker
docs		docs
examples		examples
patches		patches
tests		tests
.gitignore		.gitignore
BENCHMARKS.md		BENCHMARKS.md
LICENSE		LICENSE
README.md		README.md
bench.py		bench.py
generate.py		generate.py
requirements.txt		requirements.txt
setup.sh		setup.sh

Folders and files

Latest commit

History

Repository files navigation

flux-amd-rocm

TL;DR

Benchmarks

Single RX 7800 XT (gfx1101, 16 GB), FLUX.1-dev 1024² × 28 steps, bf16 int8

How we got there — the patch-by-patch progression

Reference NVIDIA (from HuggingFace docs)

Reproducing the benchmarks

Supported hardware

System requirements

What this repo contains

The 5 backport patches

Configuration presets

Docker image

Multi-GPU? See the companion repo

Why this exists

Contributing

License

Credits

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages