
Commit ce61bef

Merge pull request #814 from mi804/qwen-image-edit

Qwen image edit

2 parents f11a91e + 123f6db

14 files changed: +238 additions, -60 deletions
README.md

Lines changed: 3 additions & 0 deletions

@@ -90,6 +90,7 @@ image.save("image.jpg")
 |Model ID|Inference|Low VRAM Inference|Full Training|Validation after Full Training|LoRA Training|Validation after LoRA Training|
 |-|-|-|-|-|-|-|
 |[Qwen/Qwen-Image](https://www.modelscope.cn/models/Qwen/Qwen-Image)|[code](./examples/qwen_image/model_inference/Qwen-Image.py)|[code](./examples/qwen_image/model_inference_low_vram/Qwen-Image.py)|[code](./examples/qwen_image/model_training/full/Qwen-Image.sh)|[code](./examples/qwen_image/model_training/validate_full/Qwen-Image.py)|[code](./examples/qwen_image/model_training/lora/Qwen-Image.sh)|[code](./examples/qwen_image/model_training/validate_lora/Qwen-Image.py)|
+|[Qwen/Qwen-Image-Edit](https://www.modelscope.cn/models/Qwen/Qwen-Image-Edit)|[code](./examples/qwen_image/model_inference/Qwen-Image-Edit.py)|[code](./examples/qwen_image/model_inference_low_vram/Qwen-Image-Edit.py)|[code](./examples/qwen_image/model_training/full/Qwen-Image-Edit.sh)|[code](./examples/qwen_image/model_training/validate_full/Qwen-Image-Edit.py)|[code](./examples/qwen_image/model_training/lora/Qwen-Image-Edit.sh)|[code](./examples/qwen_image/model_training/validate_lora/Qwen-Image-Edit.py)|
 |[DiffSynth-Studio/Qwen-Image-Distill-Full](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-Distill-Full)|[code](./examples/qwen_image/model_inference/Qwen-Image-Distill-Full.py)|[code](./examples/qwen_image/model_inference_low_vram/Qwen-Image-Distill-Full.py)|[code](./examples/qwen_image/model_training/full/Qwen-Image-Distill-Full.sh)|[code](./examples/qwen_image/model_training/validate_full/Qwen-Image-Distill-Full.py)|[code](./examples/qwen_image/model_training/lora/Qwen-Image-Distill-Full.sh)|[code](./examples/qwen_image/model_training/validate_lora/Qwen-Image-Distill-Full.py)|
 |[DiffSynth-Studio/Qwen-Image-Distill-LoRA](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-Distill-LoRA)|[code](./examples/qwen_image/model_inference/Qwen-Image-Distill-LoRA.py)|[code](./examples/qwen_image/model_inference_low_vram/Qwen-Image-Distill-LoRA.py)|-|-|-|-|
 |[DiffSynth-Studio/Qwen-Image-EliGen](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-EliGen)|[code](./examples/qwen_image/model_inference/Qwen-Image-EliGen.py)|[code](./examples/qwen_image/model_inference_low_vram/Qwen-Image-EliGen.py)|-|-|[code](./examples/qwen_image/model_training/lora/Qwen-Image-EliGen.sh)|[code](./examples/qwen_image/model_training/validate_lora/Qwen-Image-EliGen.py)|
@@ -368,6 +369,8 @@ https://github.com/Artiprocher/DiffSynth-Studio/assets/35051019/59fb2f7b-8de0-44
 
 
 ## Update History
+- **August 19, 2025** 🔥 Qwen-Image-Edit is now open source. Welcome the new member of the image editing model family!
+
 - **August 18, 2025** We trained and open-sourced the Inpaint ControlNet model for Qwen-Image, [DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Inpaint](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Inpaint), which adopts a lightweight architectural design. Please refer to [our sample code](./examples/qwen_image/model_inference/Qwen-Image-Blockwise-ControlNet-Inpaint.py).
 
 - **August 15, 2025** We open-sourced the [Qwen-Image-Self-Generated-Dataset](https://www.modelscope.cn/datasets/DiffSynth-Studio/Qwen-Image-Self-Generated-Dataset). This is an image dataset generated using the Qwen-Image model, with a total of 160,000 `1024 x 1024` images. It includes the general, English text rendering, and Chinese text rendering subsets. We provide caption, entity, and control image annotations for each image. Developers can use this dataset to train models such as ControlNet and EliGen for the Qwen-Image model. We aim to promote technological development through open-source contributions!
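For orientation, here is a rough, hypothetical sketch of what the new Qwen-Image-Edit inference call looks like in DiffSynth-Studio. The pipeline class, config fields, and the `edit_image` parameter are assumed rather than taken from this commit; consult the linked `./examples/qwen_image/model_inference/Qwen-Image-Edit.py` for the actual interface.

```python
# Hypothetical sketch only: class, config, and parameter names are assumed,
# not taken from this commit. See
# examples/qwen_image/model_inference/Qwen-Image-Edit.py for the real interface.
import torch
from PIL import Image
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig  # assumed import path

pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[ModelConfig(model_id="Qwen/Qwen-Image-Edit")],  # assumed config shape
)
image = pipe(
    prompt="Change the sign text to 'Qwen-Image-Edit'",
    edit_image=Image.open("input.jpg").resize((1024, 1024)),  # assumed parameter name
    seed=0,
)
image.save("image.jpg")
```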

README_zh.md

Lines changed: 3 additions & 0 deletions

@@ -92,6 +92,7 @@ image.save("image.jpg")
 |Model ID|Inference|Low VRAM Inference|Full Training|Validation after Full Training|LoRA Training|Validation after LoRA Training|
 |-|-|-|-|-|-|-|
 |[Qwen/Qwen-Image](https://www.modelscope.cn/models/Qwen/Qwen-Image)|[code](./examples/qwen_image/model_inference/Qwen-Image.py)|[code](./examples/qwen_image/model_inference_low_vram/Qwen-Image.py)|[code](./examples/qwen_image/model_training/full/Qwen-Image.sh)|[code](./examples/qwen_image/model_training/validate_full/Qwen-Image.py)|[code](./examples/qwen_image/model_training/lora/Qwen-Image.sh)|[code](./examples/qwen_image/model_training/validate_lora/Qwen-Image.py)|
+|[Qwen/Qwen-Image-Edit](https://www.modelscope.cn/models/Qwen/Qwen-Image-Edit)|[code](./examples/qwen_image/model_inference/Qwen-Image-Edit.py)|[code](./examples/qwen_image/model_inference_low_vram/Qwen-Image-Edit.py)|[code](./examples/qwen_image/model_training/full/Qwen-Image-Edit.sh)|[code](./examples/qwen_image/model_training/validate_full/Qwen-Image-Edit.py)|[code](./examples/qwen_image/model_training/lora/Qwen-Image-Edit.sh)|[code](./examples/qwen_image/model_training/validate_lora/Qwen-Image-Edit.py)|
 |[DiffSynth-Studio/Qwen-Image-Distill-Full](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-Distill-Full)|[code](./examples/qwen_image/model_inference/Qwen-Image-Distill-Full.py)|[code](./examples/qwen_image/model_inference_low_vram/Qwen-Image-Distill-Full.py)|[code](./examples/qwen_image/model_training/full/Qwen-Image-Distill-Full.sh)|[code](./examples/qwen_image/model_training/validate_full/Qwen-Image-Distill-Full.py)|[code](./examples/qwen_image/model_training/lora/Qwen-Image-Distill-Full.sh)|[code](./examples/qwen_image/model_training/validate_lora/Qwen-Image-Distill-Full.py)|
 |[DiffSynth-Studio/Qwen-Image-Distill-LoRA](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-Distill-LoRA)|[code](./examples/qwen_image/model_inference/Qwen-Image-Distill-LoRA.py)|[code](./examples/qwen_image/model_inference_low_vram/Qwen-Image-Distill-LoRA.py)|-|-|-|-|
 |[DiffSynth-Studio/Qwen-Image-EliGen](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-EliGen)|[code](./examples/qwen_image/model_inference/Qwen-Image-EliGen.py)|[code](./examples/qwen_image/model_inference_low_vram/Qwen-Image-EliGen.py)|-|-|[code](./examples/qwen_image/model_training/lora/Qwen-Image-EliGen.sh)|[code](./examples/qwen_image/model_training/validate_lora/Qwen-Image-EliGen.py)|
@@ -384,6 +385,8 @@ https://github.com/Artiprocher/DiffSynth-Studio/assets/35051019/59fb2f7b-8de0-44
 
 
 ## Update History
+- **August 19, 2025** 🔥 Qwen-Image-Edit is open-sourced. Welcome the new member of the image editing model family!
+
 - **August 18, 2025** We trained and open-sourced the inpainting ControlNet model for Qwen-Image, [DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Inpaint](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Inpaint), which adopts a lightweight architecture. Please refer to [our sample code](./examples/qwen_image/model_inference/Qwen-Image-Blockwise-ControlNet-Inpaint.py).
 
 - **August 15, 2025** We open-sourced the [Qwen-Image-Self-Generated-Dataset](https://www.modelscope.cn/datasets/DiffSynth-Studio/Qwen-Image-Self-Generated-Dataset). This is an image dataset generated with the Qwen-Image model, containing 160,000 `1024 x 1024` images across general, English text rendering, and Chinese text rendering subsets. Each image comes with caption, entity, and control image annotations. Developers can use this dataset to train models such as ControlNet and EliGen for Qwen-Image. We aim to advance the technology through open source!

diffsynth/models/qwen_image_dit.py

Lines changed: 34 additions & 41 deletions

@@ -63,8 +63,8 @@ def __init__(self, theta: int, axes_dim: list[int], scale_rope=False):
         super().__init__()
         self.theta = theta
         self.axes_dim = axes_dim
-        pos_index = torch.arange(1024)
-        neg_index = torch.arange(1024).flip(0) * -1 - 1
+        pos_index = torch.arange(4096)
+        neg_index = torch.arange(4096).flip(0) * -1 - 1
         self.pos_freqs = torch.cat([
             self.rope_params(pos_index, self.axes_dim[0], self.theta),
             self.rope_params(pos_index, self.axes_dim[1], self.theta),
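The precomputed rotary tables grow from 1024 to 4096 positions, presumably to leave headroom for the longer sequences that image editing introduces: each extra input image shifts the frame axis (see the `idx : idx + frame` indexing in the next hunk), and text tokens are placed after the largest image coordinate. The body of `rope_params` is not part of this diff; the following is a sketch of the textbook RoPE construction it presumably matches, shown only to make the table shapes concrete.

```python
import torch

def rope_params(index: torch.Tensor, dim: int, theta: int = 10000) -> torch.Tensor:
    # Textbook RoPE frequency table: one complex rotation per (position, channel pair).
    # The actual rope_params in qwen_image_dit.py is not shown in this diff,
    # so treat this as an assumed reference implementation.
    inv_freq = 1.0 / theta ** (torch.arange(0, dim, 2, dtype=torch.float64) / dim)
    angles = torch.outer(index.to(torch.float64), inv_freq)
    return torch.polar(torch.ones_like(angles), angles)  # e^{i * position * frequency}

# The enlarged tables cover positions 0..4095 (and -4096..-1 on the negative side),
# so longer concatenated sequences stay in range.
pos_index = torch.arange(4096)
neg_index = torch.arange(4096).flip(0) * -1 - 1
print(rope_params(pos_index, 64).shape)  # torch.Size([4096, 32]), complex128
print(rope_params(neg_index, 64).shape)  # torch.Size([4096, 32])
```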
@@ -127,49 +127,42 @@ def forward(self, video_fhw, txt_seq_lens, device):
         self.pos_freqs = self.pos_freqs.to(device)
         self.neg_freqs = self.neg_freqs.to(device)
 
-        if isinstance(video_fhw, list):
-            video_fhw = video_fhw[0]
-        frame, height, width = video_fhw
-        rope_key = f"{frame}_{height}_{width}"
-
-        if rope_key not in self.rope_cache:
-            seq_lens = frame * height * width
-            freqs_pos = self.pos_freqs.split([x // 2 for x in self.axes_dim], dim=1)
-            freqs_neg = self.neg_freqs.split([x // 2 for x in self.axes_dim], dim=1)
-            freqs_frame = freqs_pos[0][:frame].view(frame, 1, 1, -1).expand(frame, height, width, -1)
+        vid_freqs = []
+        max_vid_index = 0
+        for idx, fhw in enumerate(video_fhw):
+            frame, height, width = fhw
+            rope_key = f"{idx}_{height}_{width}"
+
+            if rope_key not in self.rope_cache:
+                seq_lens = frame * height * width
+                freqs_pos = self.pos_freqs.split([x // 2 for x in self.axes_dim], dim=1)
+                freqs_neg = self.neg_freqs.split([x // 2 for x in self.axes_dim], dim=1)
+                freqs_frame = freqs_pos[0][idx : idx + frame].view(frame, 1, 1, -1).expand(frame, height, width, -1)
+                if self.scale_rope:
+                    freqs_height = torch.cat(
+                        [freqs_neg[1][-(height - height // 2) :], freqs_pos[1][: height // 2]], dim=0
+                    )
+                    freqs_height = freqs_height.view(1, height, 1, -1).expand(frame, height, width, -1)
+                    freqs_width = torch.cat([freqs_neg[2][-(width - width // 2) :], freqs_pos[2][: width // 2]], dim=0)
+                    freqs_width = freqs_width.view(1, 1, width, -1).expand(frame, height, width, -1)
+
+                else:
+                    freqs_height = freqs_pos[1][:height].view(1, height, 1, -1).expand(frame, height, width, -1)
+                    freqs_width = freqs_pos[2][:width].view(1, 1, width, -1).expand(frame, height, width, -1)
+
+                freqs = torch.cat([freqs_frame, freqs_height, freqs_width], dim=-1).reshape(seq_lens, -1)
+                self.rope_cache[rope_key] = freqs.clone().contiguous()
+            vid_freqs.append(self.rope_cache[rope_key])
+
             if self.scale_rope:
-                freqs_height = torch.cat(
-                    [
-                        freqs_neg[1][-(height - height//2):],
-                        freqs_pos[1][:height//2]
-                    ],
-                    dim=0
-                )
-                freqs_height = freqs_height.view(1, height, 1, -1).expand(frame, height, width, -1)
-                freqs_width = torch.cat(
-                    [
-                        freqs_neg[2][-(width - width//2):],
-                        freqs_pos[2][:width//2]
-                    ],
-                    dim=0
-                )
-                freqs_width = freqs_width.view(1, 1, width, -1).expand(frame, height, width, -1)
-
+                max_vid_index = max(height // 2, width // 2, max_vid_index)
             else:
-                freqs_height = freqs_pos[1][:height].view(1, height, 1, -1).expand(frame, height, width, -1)
-                freqs_width = freqs_pos[2][:width].view(1, 1, width, -1).expand(frame, height, width, -1)
-
-            freqs = torch.cat([freqs_frame, freqs_height, freqs_width], dim=-1).reshape(seq_lens, -1)
-            self.rope_cache[rope_key] = freqs.clone().contiguous()
-        vid_freqs = self.rope_cache[rope_key]
-
-        if self.scale_rope:
-            max_vid_index = max(height // 2, width // 2)
-        else:
-            max_vid_index = max(height, width)
+                max_vid_index = max(height, width, max_vid_index)
 
         max_len = max(txt_seq_lens)
-        txt_freqs = self.pos_freqs[max_vid_index: max_vid_index + max_len, ...]
+        txt_freqs = self.pos_freqs[max_vid_index : max_vid_index + max_len, ...]
+        vid_freqs = torch.cat(vid_freqs, dim=0)
+
         return vid_freqs, txt_freqs
 
 
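This rewrite is the core of the multi-image support Qwen-Image-Edit needs: `video_fhw` is now treated as a list of `(frame, height, width)` tuples, one per image in the token sequence, rather than being collapsed to its first element. Each image's frequency block is cached under a key containing its index (which offsets the frame axis), the blocks are concatenated at the end, and `max_vid_index` accumulates across all images so the text-token positions begin past every image coordinate. A minimal numeric walk-through of that bookkeeping, in plain Python with illustrative sizes rather than the module itself:

```python
# Walk through the new per-image loop with two latent grids, e.g. an
# edit-reference image and the target image (sizes here are assumptions).
video_fhw = [(1, 32, 32), (1, 48, 48)]
scale_rope = True

max_vid_index = 0
total_vid_tokens = 0
for idx, (frame, height, width) in enumerate(video_fhw):
    total_vid_tokens += frame * height * width  # tokens contributed by this image
    if scale_rope:
        # Height/width positions are centered around zero, so only half the
        # extent counts toward the largest occupied positive position.
        max_vid_index = max(height // 2, width // 2, max_vid_index)
    else:
        max_vid_index = max(height, width, max_vid_index)

max_len = max([77])  # txt_seq_lens
# Text tokens read rows [max_vid_index, max_vid_index + max_len) of pos_freqs,
# safely past every image token's coordinate.
print(total_vid_tokens, max_vid_index, max_vid_index + max_len)  # 3328 24 101
```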