
Conversation

wtomin
Collaborator

@wtomin wtomin commented Oct 8, 2025

What does this PR do?

Adds # (feature)

The inference script, documentation, and demo for CannyEdit, a training-free method for versatile image editing tasks.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes? E.g. record bug fixes or new features in What's New. Here are the
    documentation guidelines
  • Did you build and run the code without any errors?
  • Did you report the running environment (NPU type/MS version) and performance in the doc? (better record it for data loading, model inference, or training tasks)
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@SamitHuang @vigo999

Contributor

Summary of Changes

Hello @wtomin, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the CannyEdit feature, a training-free framework for versatile image editing. It includes the core inference script, a web-based demo for mask generation, and comprehensive documentation. The system supports region-specific and multi-region image edits, with capabilities for automated prompt generation and advanced attention mechanisms for precise control.

Highlights

  • New Feature: CannyEdit: This PR introduces CannyEdit, a novel training-free framework for versatile image editing tasks. It supports high-quality region-specific edits and multi-region edits in a single generation pass.
  • Comprehensive Implementation: The implementation includes an inference script, a web-based mask generation demo, detailed documentation (README), and all necessary supporting modules for the Flux model, ControlNet, AutoEncoder, and text/image embedders.
  • Automated Prompt Generation: The system can automatically generate source and target prompts using the Qwen2.5-VL-7B-Instruct model if they are not explicitly provided by the user, which is particularly useful for object addition and removal tasks.
  • Interactive Mask Generation: A Flask-based web application (app_mask.py) is provided, allowing users to interactively draw on images to generate masks using either ellipse fitting or SAM2.1, which are then used for editing.
  • Advanced Denoising and Attention Mechanisms: The core CannyEditPipeline integrates ControlNet and employs specialized denoising functions (denoise_cannyedit, denoise_cannyedit_removal) with 'Attention Amplification' and 'Cyclical Blending' techniques for precise regional control during image generation and removal.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds the inference script and demo for CannyEdit, a new image editing method. The scope of the changes is large, introducing a complete example with a web UI, inference logic, and model definitions. The code is generally well-structured, but there are several critical issues related to correctness and maintainability that need to be addressed. Key areas for improvement include fixing a major logic flaw in the main inference script, resolving TypeError bugs in model layers, improving the robustness of the web application, and addressing significant code duplication. I've provided detailed comments and suggestions to resolve these issues.

Comment on lines +380 to +449
print("Running CannyEdit")
# Stage 1: Generation
stage1 = "stage_removal"
result = cannyedit_pipeline(
    prompt_source=args.prompt_source,
    prompt_local1=args.prompt_local[0],
    prompt_target=args.prompt_target,
    prompt_local_addition=args.prompt_local[1:],
    controlnet_image=image,
    local_mask=local_mask,
    local_mask_addition=local_mask_addition,
    width=args.width,
    height=args.height,
    guidance=args.guidance,
    num_steps=args.num_steps,
    seed=args.seed,
    true_gs=args.true_gs,
    control_weight=args.control_weight,
    control_weight2=args.control_weight2,
    neg_prompt=args.neg_prompt,
    # removal_add
    neg_prompt2=args.neg_prompt2,
    timestep_to_start_cfg=args.timestep_to_start_cfg,
    stage=stage1,
    generate_save_path=args.generate_save_path,
    inversion_save_path=args.inversion_save_path,
)

# Save the edited image
if not os.path.exists(args.save_folder):
    os.mkdir(args.save_folder)
ind = len(os.listdir(args.save_folder))
result_save_path = os.path.join(args.save_folder, f"result_{ind}.png")
result.save(result_save_path)

if removal_flag is False:
    # Stage 1: Generation
    stage1 = "stage_generate"
    print("Running CannyEdit")
    result = cannyedit_pipeline(
        prompt_source=args.prompt_source,
        prompt_local1=args.prompt_local[0],
        prompt_target=args.prompt_target,
        prompt_local_addition=args.prompt_local[1:],
        controlnet_image=image,
        local_mask=local_mask,
        local_mask_addition=local_mask_addition,
        width=args.width,
        height=args.height,
        guidance=args.guidance,
        num_steps=args.num_steps,
        seed=args.seed,
        true_gs=args.true_gs,
        control_weight=args.control_weight,
        control_weight2=args.control_weight2,
        neg_prompt=args.neg_prompt,
        neg_prompt2=args.neg_prompt2,
        timestep_to_start_cfg=args.timestep_to_start_cfg,
        stage=stage1,
        generate_save_path=args.generate_save_path,
        inversion_save_path=args.inversion_save_path,
    )

    # Save the edited image
    if not os.path.exists(args.save_folder):
        os.mkdir(args.save_folder)
    ind = len(os.listdir(args.save_folder))
    result_save_path = os.path.join(args.save_folder, f"result_{ind}.png")
    result.save(result_save_path)

Contributor


critical

There is a critical logic flaw in how the cannyedit_pipeline is executed. The current structure can lead to the pipeline being run twice or not at all, depending on the input arguments.

Specifically:

  • If prompts are not provided (args.prompt_source is None), the pipeline is first executed with stage1 = "stage_removal" (lines 382-413). Then, if removal_flag is False, it runs again with stage1 = "stage_generate" (lines 417-448).
  • If prompts are provided and removal_flag is True, the pipeline is never executed.

This leads to incorrect behavior and wasted computation. The logic should be refactored to ensure the pipeline is called only once with the correct stage.

I suggest restructuring the main function as follows (a sketch is given after this list):

  1. Determine the stage based on removal_flag.
  2. Generate prompts if they are not provided.
  3. Execute the pipeline a single time with the correct parameters.
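A minimal sketch of that restructuring, assuming the surrounding variables (`args`, `image`, `local_mask`, `local_mask_addition`, `removal_flag`) are set up as in the current script; `generate_prompts` is a hypothetical placeholder for the existing Qwen2.5-VL prompt-generation logic:

```python
# Sketch only: determine the stage once, fill in missing prompts, then run the
# pipeline a single time and save the result.
stage = "stage_removal" if removal_flag else "stage_generate"

if args.prompt_source is None:
    # Fall back to automatic prompt generation only when prompts are missing.
    args.prompt_source, args.prompt_target = generate_prompts(image, args)  # hypothetical helper

print("Running CannyEdit")
result = cannyedit_pipeline(
    prompt_source=args.prompt_source,
    prompt_local1=args.prompt_local[0],
    prompt_target=args.prompt_target,
    prompt_local_addition=args.prompt_local[1:],
    controlnet_image=image,
    local_mask=local_mask,
    local_mask_addition=local_mask_addition,
    width=args.width,
    height=args.height,
    guidance=args.guidance,
    num_steps=args.num_steps,
    seed=args.seed,
    true_gs=args.true_gs,
    control_weight=args.control_weight,
    control_weight2=args.control_weight2,
    neg_prompt=args.neg_prompt,
    neg_prompt2=args.neg_prompt2,
    timestep_to_start_cfg=args.timestep_to_start_cfg,
    stage=stage,
    generate_save_path=args.generate_save_path,
    inversion_save_path=args.inversion_save_path,
)

# Save the edited image exactly once.
os.makedirs(args.save_folder, exist_ok=True)
ind = len(os.listdir(args.save_folder))
result.save(os.path.join(args.save_folder, f"result_{ind}.png"))
```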

Comment on lines +493 to +497
) -> Tensor:
    if image_proj is None:
        return self.processor(self, x, vec, pe, attention_kwargs)
    else:
        return self.processor(self, x, vec, pe, image_proj, ip_scale)
Contributor


critical

Similar to the DoubleStreamBlock, there is a TypeError here. The construct method of SingleStreamBlock calls its processor with image_proj and ip_scale, but the __call__ method of SingleStreamBlockProcessor does not accept these arguments. The signature of SingleStreamBlockProcessor.__call__ must be updated to resolve this.
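For illustration, a hedged sketch of one possible signature that matches the call site shown above (the default values are assumptions and the attention body is elided):

```python
class SingleStreamBlockProcessor:
    def __call__(self, attn, x, vec, pe, image_proj=None, ip_scale=1.0, **attention_kwargs):
        # When image_proj is None the processor behaves as before; otherwise the
        # image projection is mixed in, scaled by ip_scale. The actual attention
        # computation from the existing implementation goes here.
        ...
```

Note that the image_proj-free branch currently passes attention_kwargs positionally; under a signature like this it should be forwarded as **attention_kwargs at the call site so it does not bind to image_proj.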

Comment on lines +393 to +396
if image_proj is None:
    return self.processor(self, img, txt, vec, pe, attention_kwargs)
else:
    return self.processor(self, img, txt, vec, pe, image_proj, ip_scale)
Contributor


critical

There is a TypeError here. The construct method of DoubleStreamBlock calls self.processor (an instance of DoubleStreamBlockProcessor) with image_proj and ip_scale arguments when image_proj is not None. However, the __call__ method of DoubleStreamBlockProcessor does not accept these arguments. The signature of DoubleStreamBlockProcessor.__call__ needs to be updated to accept image_proj and ip_scale to fix this bug.
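The fix can mirror the single-stream sketch above; for example (defaults are assumptions, body elided):

```python
class DoubleStreamBlockProcessor:
    def __call__(self, attn, img, txt, vec, pe, image_proj=None, ip_scale=1.0, **attention_kwargs):
        # image_proj and ip_scale are only used when an image projection is
        # provided; otherwise the processor behaves exactly as it does today.
        ...
```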

    SAM_AVAILABLE = False
else:
    print("SAM2.1 is not available. Please install segment-anything package.")
    app.run(host="0.0.0.0", port=5000, debug=True)
Contributor


high

Running a Flask application with debug=True in a script that might be deployed is a significant security risk. The debug mode can expose sensitive information and allow arbitrary code execution. It's recommended to disable debug mode by default and make it configurable, for example, through an environment variable or a command-line argument.

    app.run(host="0.0.0.0", port=5000, debug=False)
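A minimal sketch of making debug mode opt-in via an environment variable (the variable name FLASK_DEBUG is illustrative, not something the PR defines):

```python
import os

# Debug stays off unless the caller explicitly opts in, e.g. FLASK_DEBUG=1.
debug = os.environ.get("FLASK_DEBUG", "0") == "1"
app.run(host="0.0.0.0", port=5000, debug=debug)
```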

tqdm==4.67.1
transformers==4.50.0
hydra-core>=1.3.2
torch # load SAM2 pytorch weights
Contributor


high

For reproducibility, it's crucial to pin dependency versions. The torch package is listed without a specific version, which could lead to different behavior or errors if a new version is installed. Please specify a version that is known to work with your project.

torch>=2.1.0 # load SAM2 pytorch weights

Comment on lines +98 to +115
curr_atten = attn_weight[:, :, -image_size:, 512 : 512 * (num_edit_region + 1)].copy()
attn_weight[:, :, -image_size:, 512 : 512 * (num_edit_region + 1)] = mint.where(
    union_mask == 1, curr_atten, curr_atten * (local_t2i_strength)
)
# amplify the attention between the target prompt and the whole image
curr_atten1 = attn_weight[:, :, -image_size:, :512].copy()
attn_weight[:, :, -image_size:, :512] = curr_atten1 * (context_t2i_strength)

for local_mask in local_mask_list:
    # outside the union of masks is 1
    mask1_flat = union_mask.flatten()  # (local_mask).flatten()
    mask1_indices = 512 * (num_edit_region + 1) + mint.nonzero(mask1_flat, as_tuple=True)[0]
    # mask2_flat inside the mask is 1
    mask2_flat = (1 - local_mask).flatten()
    mask2_indices = 512 * (num_edit_region + 1) + mint.nonzero(mask2_flat, as_tuple=True)[0]
    # inside the other masks is 1
    mask3_flat = 1 - mint.logical_or(mask1_flat.bool(), mask2_flat.bool()).int()
    mask3_indices = 512 * (num_edit_region + 1) + mint.nonzero(mask3_flat, as_tuple=True)[0]
Contributor


medium

The magic number 512 is used multiple times in this section. This number likely corresponds to the text embedding dimension. Hardcoding it makes the code harder to understand and maintain. It should be defined as a named constant or passed as a parameter to the function to improve clarity and make it easier to modify if the embedding dimension changes.
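For illustration, a named constant could look like this (TXT_SEQ_LEN is a hypothetical name; ideally the value would be derived from the text encoder's sequence length rather than hardcoded):

```python
TXT_SEQ_LEN = 512  # length of the text token sequence

curr_atten = attn_weight[:, :, -image_size:, TXT_SEQ_LEN : TXT_SEQ_LEN * (num_edit_region + 1)].copy()
mask1_indices = TXT_SEQ_LEN * (num_edit_region + 1) + mint.nonzero(mask1_flat, as_tuple=True)[0]
```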

Comment on lines 215 to 216
def __call__(self, attn, x, pe, **attention_kwargs):
    print("2" * 30)
Contributor


medium

This print statement appears to be a leftover from debugging. Such statements should be removed from the final code to keep the output clean.

Comment on lines +12 to +14
parent_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
sam_dir = os.path.join(parent_dir, "sam2")
sys.path.insert(0, sam_dir)
Contributor


medium

Manipulating sys.path dynamically can lead to fragile and hard-to-maintain code. It makes the script dependent on the directory from which it is run. A more robust approach would be to structure the project as a package and use relative imports, or to set the PYTHONPATH environment variable. This improves code portability and makes dependencies explicit.
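If the path manipulation has to stay for now, anchoring it to the script's own location is at least less fragile than relying on the current working directory; a minimal sketch, assuming sam2 sits one level above this script's directory as in the current layout:

```python
import sys
from pathlib import Path

# Resolve sam2 relative to this file instead of the working directory.
sam_dir = Path(__file__).resolve().parent.parent / "sam2"
sys.path.insert(0, str(sam_dir))
```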

CannyEdit is a novel training-free framework that supports multi-task image editing. It enables high-quality region-specific image edits, which is especially useful in cases where state-of-the-art free-form image editing methods fail to ground edits accurately. In addition, it supports edits on multiple user-specified regions in a single generation pass when multiple masks are given.

<p align="center">
<img src=./assets/page_imgs/grid_image.png width=500 />
Contributor


medium

In HTML, it's a best practice to enclose attribute values in quotes for correctness and readability. The src and width attributes of the <img> tag are missing quotes.

Suggested change
<img src=./assets/page_imgs/grid_image.png width=500 />
<img src="./assets/page_imgs/grid_image.png" width="500" />

parser.add_argument("--num_steps", type=int, default=50, help="The num_steps for diffusion process")
parser.add_argument("--guidance", type=float, default=4, help="The guidance for diffusion process")
parser.add_argument(
    "--seed", type=int, default=random.randint(0, 9999999), help="A seed for reproducible inference"
Contributor


medium

Using a random seed by default makes experiments non-reproducible. For scientific and debugging purposes, it's better to use a fixed default seed (e.g., 42). This ensures that anyone running the script gets the same result. The user can still override it with a specific seed if they want randomness.

Suggested change
"--seed", type=int, default=random.randint(0, 9999999), help="A seed for reproducible inference"
"--seed", type=int, default=42, help="A seed for reproducible inference"

# Invert the mask (object area becomes 0, background becomes 1)
local_mask = 1 - binary_downsampled_mask

# Convert the final mask to a PyTorch tensor
Collaborator


These code comments are inaccurate; please update them to describe what the code actually does.

@wtomin wtomin requested a review from vigo999 October 9, 2025 07:33