---
title: "Segmentation-as-Editing for Unified Multimodal AI"
date: 2025-09-13T00:00:03+08:00
weight: 1
math: true
# draft: true
show_reading_time: true
show_bread_crumbs: true
show_post_nav_links: false # the prev/next after the content
show_code_copy_buttons: true
show_word_count: true
---

{{< button href="https://github.com/inclusionAI/Ming" label="GITHUB" external=true >}} 🤗 <a href="https://huggingface.co/inclusionAI/Ming-Lite-Omni-1.5">Hugging Face</a> | 🤖 <a href="https://www.modelscope.cn/models/inclusionAI/Ming-Lite-Omni-1.5">ModelScope</a>

# Ming-lite-omni 1.5: Segmentation-as-Editing for Unified Multimodal AI

### The Hype and the Hidden Question

The multimodal AI world has been thriving.

From the debut of Qwen-Image to the interactive editing hype sparked by Nano Banana, image editing has rapidly become the next battlefield for generative AI.

Editing fundamentally requires two distinct skill sets:
- **Know *where*, *what*, and *how* to change** (understanding the image)
- **Produce the change with high visual quality** (generating the image)

Editing's playful, highly interactive nature has pulled in users, developers, and creators alike.

But behind the noise, few are asking:

> **Beneath this prosperity, how close are we to a truly unified “understanding + generation” AI?**

### Understanding and Generation: Two Hands, Often Out of Sync

For years, we’ve chased an ambitious goal:

Build a unified multimodal model that understands the world like a scientist (e.g., image segmentation) while creating it like an artist (e.g., image editing).

In theory, these abilities should be mutually reinforcing:

> *“The deeper the understanding, the better the creation; the more the creation, the deeper the understanding.”*

Reality is messier.

In AI today:
- **Understanding = the left hand:** precise abstractions, semantic reasoning, boundaries.
- **Generation = the right hand:** coherent pixels, style, aesthetics.

But training a model to recognize 10,000 cat photos doesn’t magically make it capable of painting cats, and painting cats repeatedly doesn’t make it understand cats better.

Worse, in multitask training, the two often compete for resources — optimizations for understanding can hurt generation, and vice versa.

**We’re missing a catalyst: a task that forces the left and right hands to evolve together.**

---

### The Struggle: 16% Segmentation and Out-of-Control Generation

Before finding our solution, our unified model was struggling with generative segmentation:

Given an instruction like “*segment the banana in the upper-right corner*”, we wanted the model to output a segmentation mask directly.

The results were painful.

![Struggling with Segmentation](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*2BAkRZ9WGTcAAAAAgCAAAAgAevzJAQ/original)

On RefCOCO-val, our cIoU plateaued at **~16%**.

The root cause is the **distribution gap**.

Generative models thrive on natural, continuous image distributions. Segmentation masks, however, are synthetic, abstract, binary maps — as unnatural as it gets for an image generator.

It was like asking a painter to draw an X-ray: doable, but far from their artistic instincts.

Here, generation wasn’t helping segmentation — it was tripping it up.

We needed a new task that:
1. Met the precision demands of **understanding**.
2. Played to the strengths of **generation**.

### The “Aha” Moment: Dressing Segmentation in Color

Here’s the analogy that unlocked it for us:

> *If you want a child to mark an object, is it easier to have them draw a tight outline with a pencil, or fill it in with bright colors?*

Obviously, the latter.

Instead of forcing our model to output abstract black-and-white masks, we **turned the segmentation task into a color-editing task**.

**Example:**
- **Instruction:** *segment the banana in the upper-right*
- **Old way:** Output a mask ❌
- **New way:** Directly edit the image: “*paint the banana purple*”, “*make the banana red*”, etc. ✅

![Segmentation as Editing](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*-_O6RLOxXKcAAAAAgBAAAAgAevzJAQ/original)

This brought the task’s data distribution back to the realm of natural images — where generative models shine.
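
For readers who want a concrete picture of the data side, here is a minimal sketch of how an (image, mask) pair could be converted into such a color-editing training example. The helper name, palette, and blending weight are illustrative assumptions, not the exact recipe used for Ming-lite-omni 1.5.

```python
import numpy as np
from PIL import Image

# Illustrative palette; the actual colors used for training are not specified here.
PALETTE = {"purple": (160, 32, 240), "red": (220, 20, 60)}

def make_color_edit_pair(image_path, mask_path, object_name, color="purple", alpha=0.85):
    """Turn an (image, binary mask) pair into a color-editing example:
    an edit instruction plus a target image whose masked object is painted
    with a solid color, while every other pixel is left untouched."""
    image = np.asarray(Image.open(image_path).convert("RGB")).astype(np.float32)
    mask = np.asarray(Image.open(mask_path).convert("L")) > 127  # boolean HxW mask

    target = image.copy()
    fill = np.array(PALETTE[color], dtype=np.float32)
    # Blend the fill color into the masked region only.
    target[mask] = (1.0 - alpha) * target[mask] + alpha * fill

    instruction = f"paint the {object_name} {color}"
    return instruction, Image.fromarray(target.astype(np.uint8))
```

Under this framing the model sees only the source image and the instruction and is supervised on the recolored target, so the mask never has to appear explicitly in its input or output.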

### Why This Works: The Hidden Catalyst

That small twist turned out to be exactly the catalyst we’d been searching for.

- **Boosting Understanding:**
To color the banana without bleeding outside the boundary, the model must internally nail pixel-perfect segmentation. The segmentation step became an **implicit prerequisite** to editing.

- **Unleashing Generation:**
No more awkward synthetic masks — the model is doing what it knows best: image-to-image editing. All its strengths in shading, texture, and edge blending go into making the change look natural.

For the first time, the left hand and right hand weren’t fighting — **they were helping each other**.

---

### The Numbers: From 16% to 72.4% — and Beyond

#### 1. SOTA-level Segmentation

The cIoU score didn’t just improve — it soared from 16% to **72.4%** on RefCOCO-val, a relative gain of over **350%**.
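
Spelled out, that relative gain follows directly from the two cIoU scores:

$$
\frac{72.4 - 16}{16} \approx 3.5,
$$

i.e., an improvement of roughly 350% relative to the starting point.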

Qualitatively, the model outperformed competitors in pinpointing and segmenting targets, even in reasoning-heavy cases.

Against Qwen-Image and Nano Banana, our model:
- Located small or occluded targets more reliably.
- Produced boundaries that were visually and semantically aligned with instructions.

![Segmentation Comparison 1](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*DwJpSZyoW-YAAAAAgJAAAAgAevzJAQ/original)
*Our model (right) accurately locates and segments the target subject. Qwen-Image (second from left) fails to locate the correct target, while Nano-banana (third from left) fails to accurately segment the man's head and produces loose boundary lines.*

![Segmentation Comparison 2](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*yL2MR7vLQdEAAAAAgEAAAAgAevzJAQ/original)
*For the prompt "please segment the girl with red mask," our model (right) is precise. Qwen-Image (second from left) misses the feet, and Nano-banana (third from left) alters the subject's proportions.*

During evaluation, because our model keeps non-edited regions highly consistent, we can derive the segmentation mask directly by computing the difference between the edited result and the original image.

![Calculating difference on Ming-Lite-Omni1.5, Qwen-Image-Edit, Nano-banana](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*UJX1RJJpu3cAAAAASyAAAAgAevzJAQ/original)
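
To make this evaluation recipe concrete, here is a minimal sketch of how a mask can be recovered from an edited result and scored with cIoU (cumulative intersection over union). The difference threshold and function names are illustrative assumptions, not necessarily the exact settings used in our evaluation.

```python
import numpy as np

def mask_from_edit(original, edited, threshold=30):
    """Recover a binary mask by thresholding the per-pixel difference between
    the original image and its color-edited result (both HxWx3 uint8 arrays).
    The threshold value is hand-tuned here for illustration only."""
    diff = np.abs(original.astype(np.int16) - edited.astype(np.int16)).max(axis=-1)
    return diff > threshold

def cumulative_iou(pred_masks, gt_masks):
    """cIoU over a dataset: total intersection divided by total union,
    accumulated across all (prediction, ground-truth) pairs."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(pred_masks, gt_masks))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(pred_masks, gt_masks))
    return float(inter) / float(union)
```

Because non-edited pixels stay nearly identical, the thresholded difference isolates exactly the recolored object.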

The results show that our model’s segmentation performance is now on par with specialized vision models.

| Model Category | Model Name | RefCOCO (val) | RefCOCO+ (val) | RefCOCOg (val) |
| :--- | :--- | :---: | :---: | :---: |
| **Vision Specialist Models** | VLT | 67.5 | 56.3 | 55.0 |
| | CRIS | 70.5 | 62.3 | 59.9 |
| | LAVT | 72.7 | 62.1 | 61.2 |
| | PolyFormer-B | 74.8 | 67.6 | 67.8 |
| **MLLM + Specialist (SAM)** | LISA-7B | 74.1 | 62.4 | 66.4 |
| | PixelLM-7B | 73.0 | 66.3 | 69.3 |
| **Generative Models** | Nano-banana* | 15.7 | 13.9 | 14.9 |
| | Qwen-Image-Edit* | 30.3 | 28.8 | 34.0 |
| | **Ming-Lite-Omni1.5** | **72.4** | **62.8** | **64.3** |

*<small>For each test set, Nano-banana and Qwen-Image-Edit were evaluated on a randomly sampled subset of 500 images to reduce computational cost while preserving the key statistical trends. We observed that Nano-banana frequently fails to grasp the image-segmentation intent during inference, which leads to its comparatively lower scores. This may be attributed to differences in training objectives and data emphasis.</small>*

#### 2. Sharper, More Controllable Editing

The beauty of this method is that it not only fixed the segmentation weakness but also dramatically enhanced the model's general editing capabilities.

Because the model learned a strict “respect for boundaries” through thousands of precise-coloring exercises, this “muscle memory” for fine-grained control transferred to all editing tasks. Our edit-controllability score jumped from **7.69 to 8.12** across sub-tasks such as background, color, and material changes.

![Editing Controllability Comparison](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*szjcQqQkC80AAAAAgIAAAAgAevzJAQ/original)
*Prompt: "remove the bow tie of the man on the far right." Our model (right) precisely removes only the target bow tie while maintaining background consistency. Qwen (second from left) incorrectly removes multiple bow ties and introduces inconsistencies. Nano-banana (third from left) also struggles with consistency.*

#### 3. Stronger ID Consistency

A core challenge in portrait editing is maintaining identity. Our model excels here as well. Whether changing a hairstyle or adjusting an expression, the model skillfully preserves the person's core features.

![ID Consistency Comparison](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*Tc2-RoAHys8AAAAAd9AAAAgAevzJAQ/original)
*Top Row (Turn head): Our model (right) maintains ID and background consistency, unlike competitors. Middle Row (Smile): Our model (right) correctly follows the prompt while preserving ID, avoiding distortions seen in others. Bottom Row (Change background): Our model (right) excels at preserving the subject's ID and appearance during a background swap.*

**See More Editing Consistency in Action:**
<video src="https://gw.alipayobjects.com/v/huamei_wp0xz6/afts/video/A*CcqdTbafkt8AAAAAgEAAAAgAevzJAQ" width="704px" height="740px" controls></video>

---

### An Honest Look: Where We Can Still Improve

Despite the leap forward, challenges remain:
- **Large pose changes** (e.g., standing → running) need more reliability.
- **Multi-step or compound instructions** require better parsing and execution.
- **Instruction diversity** needs broader coverage.

These are our next milestones.

### Takeaway: The Next Catalysts Are Out There

From 16% to 72.4% — this wasn’t driven by a massive architecture overhaul or billion-image datasets.

It came from **one change in task design**.

The lesson: instead of gluing capabilities together after the fact, **find naturally cooperative tasks** — ones where solving the problem requires multiple abilities to mesh seamlessly.

“Segmentation-as-editing” is just the first example.

We suspect 3D understanding, video generation, and other domains have their own hidden catalysts, waiting to be discovered.

**At last, AI’s left and right hands have learned to high-five.**

**And this is only the overture.**

Try out our open-source model **Ming-lite-omni 1.5** on our [**GitHub Page / Demo Page**](https://github.com/inclusionAI/Ming/blob/main/cookbook.ipynb). Please star our repo if you like it!
