*For the prompt "please segment the girl with red mask," our model (right) is precise. Qwen-Image (second from left) misses the feet, and Nano-banana (third from left) alters the subject's proportions.*
During evaluation, because our model keeps non-edited regions highly consistent, we can derive the segmentation mask directly by computing the difference between the edited result and the original image.
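
As a minimal sketch of this mask-derivation step, assuming RGB images of identical size (the function name, the fixed `threshold`, and the max-over-channels differencing are illustrative assumptions, not the project's actual evaluation code):

```python
import numpy as np
from PIL import Image

def mask_from_edit(original: Image.Image, edited: Image.Image, threshold: int = 16) -> Image.Image:
    """Derive a binary mask of the edited region by differencing two images."""
    orig = np.asarray(original.convert("RGB"), dtype=np.int16)
    edit = np.asarray(edited.convert("RGB"), dtype=np.int16)
    # Per-pixel difference, taking the maximum over the RGB channels.
    diff = np.abs(orig - edit).max(axis=-1)
    # `threshold` is a hypothetical tolerance for compression/rounding noise.
    mask = (diff > threshold).astype(np.uint8) * 255
    return Image.fromarray(mask, mode="L")
```

This is why the consistency of non-edited regions matters: any stray change outside the requested edit would leak into the derived mask as false-positive pixels.
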
The results show that our model's segmentation performance is now on par with that of specialized vision models.

| Model Category | Model Name | RefCOCO (val) | RefCOCO+ (val) | RefCOCOg (val) |
| :--- | :--- | :---: | :---: | :---: |
| … | … | … | … | … |

*<small>For each test set, Nano-banana and Qwen-Image-Edit were evaluated on a randomly sampled subset of 500 images to reduce computational cost while preserving the key statistical trends. We observed that Nano-banana frequently fails to grasp the image-segmentation intent during inference, which leads to its comparatively lower evaluation metrics; this may be attributed to differences in training objectives and data emphasis.</small>*
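
For scoring derived masks against RefCOCO-style ground truth, a generic intersection-over-union computation looks like the sketch below (the exact metric variant and evaluation script behind the numbers above are not shown in this excerpt):

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum()) / float(union)
```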