
Commit 7808b8e

Mark-ZhouWX committed: update readme and inference demo
1 parent ff74566 commit 7808b8e

File tree

10 files changed: +70 -25 lines


official/cv/segment-anything/README.md

Lines changed: 53 additions & 9 deletions
@@ -30,12 +30,12 @@ Beside fine-tuning our code on COCO2017 dataset which contains common seen objec
 The following shows the mask quality before and after fine-tuning.


-| pretrained_model | dataset | epochs | mIOU |
-|:----------------:| -------- |:-------------:|------|
-| sam-vit-b | COCO2017 | 0 (zero-shot) | 77.4 |
-| sam-vit-b | COCO2017 | 20 | 83.5 |
-| sam-vit-b | FLARE22 | 0 (zero-shot) | 79.5 |
-| sam-vit-b | FLARE22 | 10 | 88.1 |
+| pretrained_model | dataset | epochs | mIOU | ckpt |
+|:----------------:| -------- |:-------------:|------|------|
+| sam-vit-b | COCO2017 | 0 (zero-shot) | 74.5 | |
+| sam-vit-b | COCO2017 | 20 | 80.2 | [link](https://download-mindspore.osinfra.cn/toolkits/mindone/sam/sam_vitb_box_finetune_coco-a9b75828.ckpt) |
+| sam-vit-b | FLARE22 | 0 (zero-shot) | 78.6 | |
+| sam-vit-b | FLARE22 | 10 | 87.4 | [link](https://download-mindspore.osinfra.cn/toolkits/mindone/sam/sam_vitb_box_finetune_flare-ace06cc2.ckpt) |

 A machine with **32G Ascend memory** is required for box-prompt fine-tuning.
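For context, the mIOU column above is mean intersection-over-union between predicted and ground-truth masks. A minimal NumPy sketch of an assumed formulation follows; the repository's actual evaluation code may differ.

```python
import numpy as np

def mean_iou(pred_masks, gt_masks, eps=1e-6):
    """Mean IoU over (prediction, ground-truth) boolean mask pairs.
    Assumed formulation for illustration, not necessarily the repo's metric code."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        intersection = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(intersection / (union + eps))
    return float(np.mean(ious))
```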

@@ -82,6 +82,38 @@ Here are the examples of segmentation result predicted by box-prompt fine-tuned
 <em> FLARE22 image example </em>
 </p>

+### Finetune with point-prompt
+The point, together with the mask output from the previous step, is used as the prompt input to predict a mask.
+We follow the iterative interactive training schedule described in the official SAM paper. First, a foreground point is sampled uniformly from the ground-truth mask. After making a prediction,
+subsequent points are selected uniformly from the error region between the previous mask prediction and the ground-truth mask. Each new point is labeled foreground or background depending on whether the error region is a false negative or a false positive, as sketched below.
+The mask prediction from the previous iteration is used as an additional prompt. To encourage the model to benefit from the supplied mask, several more iterations are used in which no additional points are sampled.
+The total number of iterations and the positions where the mask-only iterations are inserted are configurable.
+
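To make the sampling rule above concrete, here is a minimal NumPy sketch. The helper name `sample_corrective_point` and the boolean-mask convention are assumptions for illustration, not the repository's actual training code.

```python
import numpy as np

def sample_corrective_point(pred_mask, gt_mask, rng=None):
    """Sample one corrective point uniformly from the error region between a
    predicted mask and the ground-truth mask (both boolean HxW arrays).
    Returns (point_xy, label): label 1 (foreground) in false-negative areas,
    label 0 (background) in false-positive areas, or (None, None) when the
    prediction is already perfect. Hypothetical sketch, not the repo's code."""
    rng = rng or np.random.default_rng()
    false_neg = gt_mask & ~pred_mask   # missed foreground -> positive click
    false_pos = pred_mask & ~gt_mask   # spurious foreground -> negative click
    ys, xs = np.nonzero(false_neg | false_pos)
    if len(ys) == 0:
        return None, None
    i = rng.integers(len(ys))
    y, x = ys[i], xs[i]
    label = 1 if false_neg[y, x] else 0
    return np.array([x, y]), label
```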
+Since the original training dataset (SA-1B) consists almost entirely of common objects, we use a medical imaging segmentation dataset, [FLARE22](https://flare22.grand-challenge.org/Dataset/) (preprocess the raw dataset as described in the previous chapter), for the fine-tuning experiment.
+We note that the SAM model exhibits strong zero-shot ability, so for most downstream datasets the fine-tuning process may mainly learn the labelling bias.
+
+For standalone fine-tuning on the FLARE22 dataset, please run:
+```shell
+python train.py -c configs/sa1b_point_finetune.yaml
+```
+
+For distributed fine-tuning on the FLARE22 dataset, please run:
+```shell
+mpirun --allow-run-as-root -n 4 python train.py -c configs/sa1b_point_finetune.yaml
+```
+
+The fine-tuned model will be saved at the work_root specified in `configs/sa1b_point_finetune.yaml`. For fast single-image inference, please run:
+
+```shell
+python point_inference.py --checkpoint=your/path/to/ckpt
+```
+
+Below is an experimental result batch-prompted with 5 points; the model is trained at scale `vit_b`. The checkpoint can be downloaded [here](https://download-mindspore.osinfra.cn/toolkits/mindone/sam/sam_vitb_point_finetune_flare-898ae8f6.ckpt).
+<div align="center">
+<img alt="img.png" src="images/tumor2_5point.png" width="600"/>
+</div>
+
+Explore more interesting applications, such as iterative positive- and negative-point prompting, described in the Demo chapter below.

 ### Finetune with text-prompt
 *Note again that text-to-mask finetune is exploratory and not robust, and the official pytorch code is not released yet.*
@@ -111,14 +143,26 @@ mpirun --allow-run-as-root -n 8 python train.py -c configs/sa1b_text_finetune_bl
 The fine-tuned model will be saved at the work_root specified in `configs/sa1b_text_finetune.yaml`. For fast single-image inference, please run:

 ```shell
-python text_inference.py --checkpoint=your/path/to/ckpt
+python text_inference.py --checkpoint=your/path/to/ckpt --text-prompt your_prompt
 ```

-Below is an experimental result prompted with `wheels`. _Note that the model is trained with limited data and the smallest SAM type `vit_b`._
+Below are some zero-shot experimental results prompted with `floor` and `buildings`. The checkpoint can be downloaded [here](https://download-mindspore.osinfra.cn/toolkits/mindone/sam/sam_vitb_text_finetune_sa1b_10k-972de39e.ckpt). _Note that the model is trained with limited data and the smallest SAM type `vit_b`._
+
 <div align="center">
-<img alt="img.png" src="images/blip2-text-prompt-wheel.png" width="600"/>
+<img src="images/dengta-floor.png" height="350" />
+
+<img src="images/dengta-buildings.png" height="350" />
 </div>

+<p align="center">
+<em> prompt: floor</em>
+
+
+<em> prompt: buildings </em>
+</p>
+
+Try more prompts like `sky` or `trees`, etc.
+
 ## Demo

 First download the weights ([sam_vit_b](https://download.mindspore.cn/toolkits/mindone/sam/sam_vit_b-35e4849c.ckpt), [sam_vit_l](https://download.mindspore.cn/toolkits/mindone/sam/sam_vit_l-1b460f38.ckpt), [sam_vit_h](https://download.mindspore.cn/toolkits/mindone/sam/sam_vit_h-c72f8ba1.ckpt)) and put them under `${project_root}/models` directory.

official/cv/segment-anything/configs/cloud/sa1b_text_finetune_blip2.yaml

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ optimizer:
   group_param:

 lr_scheduler:
-  type: segment_anything.optim.scheduler.SAMDynamicDecayLR
+  type: segment_anything.optim.scheduler.sam_dynamic_decay_lr
   learning_rate: 8e-6
   warmup_steps: 250
   decay_steps: [ 60000, 86666 ]
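For intuition, here is a minimal sketch of a warmup-then-step-decay schedule consistent with the fields above (`learning_rate`, `warmup_steps`, `decay_steps`). The decay factor and exact shape are assumptions; the actual `sam_dynamic_decay_lr` in `segment_anything.optim.scheduler` may differ.

```python
def warmup_step_decay_lr(step, base_lr=8e-6, warmup_steps=250,
                         decay_steps=(60000, 86666), decay_factor=0.1):
    """Illustrative schedule: linear warmup, then multiply the learning rate
    by decay_factor at each boundary in decay_steps. decay_factor is an
    assumption; the repo's sam_dynamic_decay_lr may use different values."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    passed = sum(step >= boundary for boundary in decay_steps)
    return base_lr * decay_factor ** passed
```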

official/cv/segment-anything/configs/cloud/sa1b_text_finetune_clip.yaml

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ optimizer:
   group_param:

 lr_scheduler:
-  type: segment_anything.optim.scheduler.SAMDynamicDecayLR
+  type: segment_anything.optim.scheduler.sam_dynamic_decay_lr
   learning_rate: 8e-6
   warmup_steps: 250
   decay_steps: [ 60000, 86666 ]

official/cv/segment-anything/demo/inference_with_prompts.py

Lines changed: 11 additions & 10 deletions
@@ -56,13 +56,13 @@ def main(args: argparse.Namespace):

 def predict_with_point(predictor, image, args: argparse.Namespace):
     # predict the first point
-    input_point = np.array([[500, 375]])
-    input_label = np.array([1])
+    input_point1 = np.array([[500, 375]])
+    input_label1 = np.array([1])

     s1 = time.time()
     masks, scores, logits = predictor.predict(
-        point_coords=input_point,
-        point_labels=input_label,
+        point_coords=input_point1,
+        point_labels=input_label1,
         multimask_output=True,
     )
     s2 = time.time()
@@ -73,7 +73,7 @@ def predict_with_point(predictor, image, args: argparse.Namespace):
     plt.figure(figsize=(10, 10))
     plt.imshow(image)
     show_mask(mask, plt.gca())
-    show_points(input_point, input_label, plt.gca())
+    show_points(input_point1, input_label1, plt.gca())
     plt.title(f"Mask {i + 1}, Score: {score:.3f}", fontsize=18)
     plt.axis('off')
     path = os.path.join(args.output_dir, f'mask_{i+1}.jpg')
@@ -83,15 +83,15 @@ def predict_with_point(predictor, image, args: argparse.Namespace):
     plt.show()

     # predict the second and third points
-    input_point = np.array([[500, 375], [1125, 625]])
-    input_label = np.array([1, 0])
+    input_point2 = np.array([[500, 375], [1125, 625]])
+    input_label2 = np.array([1, 0])

     mask_input = logits[np.argmax(scores), :, :]  # Choose the model's best mask
     print(f'mask input shape {mask_input.shape}')
     s3 = time.time()
     masks, _, _ = predictor.predict(
-        point_coords=input_point,
-        point_labels=input_label,
+        point_coords=input_point2,
+        point_labels=input_label2,
         mask_input=mask_input[None, :, :],
         multimask_output=False,
     )
@@ -101,7 +101,8 @@ def predict_with_point(predictor, image, args: argparse.Namespace):
     plt.figure(figsize=(10, 10))
     plt.imshow(image)
     show_mask(masks, plt.gca())
-    show_points(input_point, input_label, plt.gca())
+    show_points(input_point1, input_label1, plt.gca())
+    show_points(input_point2, input_label2, plt.gca())
     plt.axis('off')
     path = os.path.join(args.output_dir, f'two_point.jpg')
     print(f'saving mask at {path}')
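Condensing the diff above, here is a minimal sketch of the two-stage flow it implements, assuming a segment-anything-style `SamPredictor` with `set_image`/`predict`. The wrapper function and surrounding setup are assumptions; the prompt arguments are taken from the diff.

```python
import numpy as np

def two_stage_point_prompt(predictor, image):
    """Sketch of predict_with_point above: a first foreground click, then
    refinement with an added background click plus the best mask logits."""
    predictor.set_image(image)  # HxWx3 uint8 RGB image
    # Stage 1: one foreground point, return multiple candidate masks.
    masks, scores, logits = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
    # Stage 2: add a background point and feed back the best low-res mask.
    best_mask_logits = logits[np.argmax(scores)]
    masks, _, _ = predictor.predict(
        point_coords=np.array([[500, 375], [1125, 625]]),
        point_labels=np.array([1, 0]),
        mask_input=best_mask_logits[None, :, :],
        multimask_output=False,
    )
    return masks
```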
4 binary image files changed (350 KB, 347 KB, 338 KB, 251 KB)

official/cv/segment-anything/point_inference.py

Lines changed: 1 addition & 1 deletion
@@ -90,7 +90,7 @@ def infer(args):
     parser.add_argument(
         "--checkpoint",
         type=str,
-        default='./models/sam_vit_b-35e4849c.ckpt',
+        default='./models/sam_vitb_point_finetune_flare-898ae8f6.ckpt',
         help="Path to the checkpoint file.",
     )

official/cv/segment-anything/text_inference.py

Lines changed: 3 additions & 3 deletions
@@ -83,7 +83,7 @@ def infer(args):

 if __name__ == '__main__':
     parser = argparse.ArgumentParser(description=("Runs inference on one image"))
-    parser.add_argument("--image_path", type=str, default='./images/truck.jpg', help="Path to an input image.")
+    parser.add_argument("--image_path", type=str, default='./images/dengta.jpg', help="Path to an input image.")
     parser.add_argument(
         "--model-type",
         type=str,
@@ -100,14 +100,14 @@ def infer(args):
     parser.add_argument(
         "--text-prompt",
         type=str,
-        default='wheels',
+        default='floor',
         help="Text prompt",
     )

     parser.add_argument(
         "--checkpoint",
         type=str,
-        default='./models/sam_vit_b-35e4849c.ckpt',
+        default='./models/sam_vitb_text_finetune_sa1b_10k-972de39e.ckpt',
         help="Path to the checkpoint file.",
     )