2 changes: 1 addition & 1 deletion README.md
@@ -244,7 +244,7 @@ https://user-images.githubusercontent.com/26127866/147658296-23be4621-d37a-4fd6-

[![Gen2 WebRTC](https://user-images.githubusercontent.com/5244214/121884542-58a1bf00-cd13-11eb-851d-dc45d541e385.gif)](https://youtu.be/8aeqGgO8LjY)

## YOLO V3 V4 V5 X and P On-Camera ([here](https://github.com/luxonis/depthai-experiments/tree/master/gen2-yolo))
## YOLO V3 V4 V5 X P V8 V9 and 11 On-Camera ([here](https://github.com/luxonis/depthai-experiments/tree/master/gen2-yolo))

![yolo-logo](https://user-images.githubusercontent.com/56075061/144863247-fa819d1d-28d6-498a-89a8-c3f94d9e9357.gif)

3 changes: 2 additions & 1 deletion gen2-yolo/README.md
@@ -7,11 +7,12 @@ This repository contains code for various Yolo experiments:
| Directory | Decoding | Version | Description |
| :-------------: | :------: | :--------------------------: | ------------------------------------------------------------ |
| `main.py` | device | From https://tools.luxonis.com | Run your custom trained YOLO model that was converted using the tools.luxonis.com. Uses [DepthAI-SDK](https://docs.luxonis.com/projects/sdk/en/latest/) |
| `device-decoding` | device | V3, V3-tiny, V4, V4-tiny, V5 | General object detection using any of the versions for which we support on-device decoding. Uses [DepthAI-API]https://docs.luxonis.com/projects/api/en/latest/) |
| `device-decoding` | device | V3, V3-tiny, V4, V4-tiny, V5 | General object detection using any of the versions for which we support on-device decoding. Uses [DepthAI-API](https://docs.luxonis.com/projects/api/en/latest/) |
| `car-detection` | device | V3-tiny, V4-tiny | Car detection using YoloV3-tiny and YoloV4-tiny with on-device decoding ([DepthAI-SDK](https://docs.luxonis.com/projects/sdk/en/latest/)). |
| `host-decoding` | host | V5 | Object detection using YoloV5 and on-host decoding. |
| `yolox` | host | X | Object detection without anchors using YOLOX-tiny with on-host decoding. |
| `yolop` | host | P | Vehicle detection, road segmentation, and lane segmentation using YOLOP on OAK with on-host decoding. |
| `yolo-segmentation` | host | V5, V8, V9, 11 | Object segmentation using YOLOv5, YOLOv8, YOLOv9, YOLO11 on OAK with on-host decoding. |

## On-Device decoding

84 changes: 84 additions & 0 deletions gen2-yolo/yolo-segmentation/README.md
@@ -0,0 +1,84 @@
# YOLO segmentation with decoding on host

This example shows how to perform instance segmentation on OAK devices using YOLO models ([YOLOv5](https://github.com/ultralytics/yolov5), [YOLOv8](https://docs.ultralytics.com/models/yolov8), [YOLOv9](https://github.com/WongKinYiu/yolov9) and [YOLO11](https://docs.ultralytics.com/models/yolo11)). The models' output is decoded on the host side. The ONNX models were exported from weights pretrained on COCO and then converted to blob format to be compatible with OAK devices.

![On-host decoding YOLO segmentation in OAK](docs/oak_segmentation_example.gif)

## Pre-requisites

### 1. Install requirements

```bash
python3 -m pip install -r requirements.txt
```

### 2. Convert/export models to ONNX format

You can either train your custom model or try directly with a model trained on the COCO dataset. The latter is the case handled in this experiment.

#### **YOLOv5**

**1. Installation:**
```bash
git clone https://github.com/ultralytics/yolov5.git
cd yolov5/
pip install -r requirements.txt
```

**2. Export:**
```bash
python3 export.py --weights yolov5{n,s}-seg.pt --include onnx --imgsz 640 --opset 16 --simplify
```

#### **YOLOv8** and **YOLO11**

**1. Installation:**
```bash
pip install ultralytics
```

**2. Export:**
```bash
yolo export model=yolov8{n,s}-seg.pt format=onnx imgsz=640 half=True dynamic=False simplify=True batch=1
yolo export model=yolo11{n,s}-seg.pt format=onnx imgsz=640 half=True dynamic=False simplify=True batch=1
```

#### **YOLOv9**

**1. Installation:**
```bash
git clone https://github.com/WongKinYiu/yolov9.git
cd yolov9/
pip install -r requirements.txt
```

**2. Download weights:**
```bash
wget https://github.com/WongKinYiu/yolov9/releases/download/v0.1/gelan-c-seg.pt
```

**3. Export:**
```bash
python3 export.py --weights gelan-c-seg.pt --include onnx --imgsz 640 --batch-size 1 --simplify
```

### 3. Convert ONNX models to blob

The conversion from ONNX to blob is done with the Luxonis [Blob Converter Tool](http://blobconverter.luxonis.com). Note that the mean values in the ``Model optimizer parameters`` under ``Advanced options`` must be changed from the default ``[127.5, 127.5, 127.5]`` to ``[0, 0, 0]``.
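
If you prefer to script the conversion, the [blobconverter](https://pypi.org/project/blobconverter/) Python package can be used instead of the web tool. The snippet below is only a sketch: the model filename, shave count, OpenVINO version and scale values are assumptions and may need adjusting for your export.

```python
# Sketch: convert an exported ONNX model to a blob with the blobconverter package.
# The model path, shave count, version and scale values below are assumptions.
import blobconverter

blob_path = blobconverter.from_onnx(
    model="yolov8n-seg.onnx",          # path to your exported ONNX model
    data_type="FP16",
    shaves=6,
    version="2022.1",                  # keep in line with the OpenVINO version set in main.py
    optimizer_params=[
        "--mean_values=[0,0,0]",       # the required change described above
        "--scale_values=[255,255,255]",
    ],
)
print(f"Blob saved to: {blob_path}")
```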


## Usage

#### Inference with YOLOv5, YOLOv8, YOLOv9 or YOLO11

```bash
python3 main.py --blob <path_to_blob_file> --conf <confidence_threshold> --iou <iou_threshold> --version <yolo_version>
```

Options:
* `--blob`: Path to the YOLO blob file used for inference. ``str``
* `--conf`: Confidence threshold. Default: 0.3. ``float``
* `--iou`: NMS IoU threshold. Default: 0.5. ``float``
* `--version`: YOLO version of the model (5, 8, 9 or 11). ``int``
* `--input-shape`: Input shape of the YOLO model. Default: [640, 640]. ``int [int ...]``
* `--fps`: Camera FPS. Default: 30. ``int``
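
For example, assuming the exported ``yolov8n-seg.pt`` was converted to a blob named ``yolov8n-seg.blob`` (an illustrative filename), a typical run would be ``python3 main.py --blob yolov8n-seg.blob --version 8``, keeping the confidence and IoU thresholds at their defaults.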
263 changes: 263 additions & 0 deletions gen2-yolo/yolo-segmentation/main.py
@@ -0,0 +1,263 @@
import sys
import cv2
import depthai as dai
import numpy as np
import argparse
import time
import random


def sigmoid(z):
return 1 / (1 + np.exp(-z))

def generate_random_color():
return [random.randint(0, 255) for _ in range(3)]


coco_classes = ['person', 'bicycle', 'car', 'motorcycle', 'airplane',
'bus', 'train', 'truck', 'boat', 'traffic light',
'fire hydrant', 'stop sign', 'parking meter', 'bench',
'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant',
'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife',
'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli',
'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop',
'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven',
'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase',
'scissors', 'teddy bear', 'hair drier', 'toothbrush']

class_colors = {cls: generate_random_color() for cls in coco_classes}

pixel_conf_threshold = 0.3
available_yolo_versions = [5, 8, 9, 11]


if __name__ == '__main__':

parser = argparse.ArgumentParser(description="Object segmentation on OAK devices using YOLO models.", formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("--blob", help="Path to YOLO blob file for inference.", required=True, type=str)
parser.add_argument("--conf", help="Set the confidence threshold.", default=0.3, type=float)
parser.add_argument("--iou", help="Set the NMS IoU threshold.", default=0.5, type=float)
parser.add_argument("--version", help=f"Set the YOLO version to consider.", required=True, type=int, choices=available_yolo_versions)
parser.add_argument("--input-shape", help="Set the input shape of YOLO model.", nargs="+", default=[640,640], type=int)
parser.add_argument("--fps", help="Set the FPS.", default=30, type=int)
args = parser.parse_args()

nn_blob_path = args.blob
confidence_threshold = args.conf
iou_threshold = args.iou
yolo_version = args.version
yolo_input_shape = args.input_shape
fps = args.fps

# If a single value was given, duplicate it so the shape becomes (width, height)
yolo_input_shape *= 2 if len(yolo_input_shape) == 1 else 1

# Start defining a pipeline
pipeline = dai.Pipeline()
pipeline.setOpenVINOVersion(version = dai.OpenVINO.VERSION_2022_1)

# Define a neural network that will make predictions based on the source frames
nn_node = pipeline.create(dai.node.NeuralNetwork)
nn_node.setNumPoolFrames(4)
nn_node.input.setBlocking(False)
nn_node.setNumInferenceThreads(2)
nn_node.setBlobPath(nn_blob_path)

# Define camera source
rgb_cam_node = pipeline.create(dai.node.ColorCamera)
rgb_cam_node.setBoardSocket(dai.CameraBoardSocket.CAM_A)
rgb_cam_node.setFps(fps)
rgb_cam_node.setResolution(dai.ColorCameraProperties.SensorResolution.THE_1080_P)
rgb_cam_node.setPreviewSize(yolo_input_shape)
rgb_cam_node.setInterleaved(False)
rgb_cam_node.preview.link(nn_node.input)

# Create outputs
xout_rgb = pipeline.create(dai.node.XLinkOut)
xout_rgb.setStreamName("rgb")
xout_rgb.input.setBlocking(False)
nn_node.passthrough.link(xout_rgb.input)

xout_nn = pipeline.create(dai.node.XLinkOut)
xout_nn.setStreamName("nn")
xout_nn.input.setBlocking(False)
nn_node.out.link(xout_nn.input)


# Pipeline defined, now the device is assigned and pipeline is started
with dai.Device() as device:
device.startPipeline(pipeline)

# Output queues will be used to get the rgb frames and nn data from the outputs defined above
q_rgb = device.getOutputQueue(name="rgb", maxSize=8, blocking=False)
q_nn = device.getOutputQueue(name="nn", maxSize=8, blocking=False)

start_time = time.monotonic()
counter = 0
fps = 0
layer_info_printed = False
while True:

rgb_out = q_rgb.get()
nn_out = q_nn.get()

frame = rgb_out.getCvFrame()
frame_height, frame_width, c = frame.shape

if nn_out is not None:
layers = nn_out.getAllLayers()

if not layer_info_printed:
print("+++ Output layer info +++\n")
for layer_nr, layer in enumerate(layers):
print(f"Layer {layer_nr}")
print(f"Name: {layer.name}")
print(f"Order: {layer.order}")
print(f"dataType: {layer.dataType}")
#dims = layer.dims[::-1] # reverse dimensions
print(f"dims: {layer.dims}\n")
layer_info_printed = True
print("+++++++++++++++++++++++++\n")

# Get output0 data to parse detected bounding boxes and get mask weights
output0 = nn_out.getLayerFp16(layers[0].name)
output0 = np.array(output0)
dims0 = layers[0].dims # (1, 116, 8400) for YOLOv8,v9,11 | (1, 25200, 117) for YOLOv5
output0 = output0.reshape(dims0)
output0 = output0.squeeze(0) # (116, 8400) for YOLOv8,v9,11 | (25200, 117) for YOLOv5
if yolo_version == 5:
output0 = output0.transpose() # (117, 25200) for YOLOv5

# Get output1 data to parse segmentation mask prototypes
output1 = nn_out.getLayerFp16(layers[1].name)
output1 = np.array(output1)
dims1 = layers[1].dims # (1, 32, 160, 160)
output1 = output1.reshape(dims1)
output1 = output1.squeeze(0) # (32, 160, 160)
mask_protos = output1.reshape(output1.shape[0], output1.shape[1]*output1.shape[2]) # (32, 25600)

# Get main info from output0
if yolo_version == 5:
num_classes = output0.shape[0] - 5 - 32 # number of classes = Total number of 2nd dimension - 5 bbox. info. - 32 mask weights
bounding_boxes = output0[:4, :] # bounding boxes coordinates format: (xc, yc, w, h)
box_confidences = output0[4, :] # bounding box confidence format: (conf)
class_confidences = output0[5:(num_classes+5), :] # class confidences format: (class_0, class_1, ...)
mask_weights = output0[(num_classes+5):, :] # mask weights format: (mask_weight_0, mask_weight_1, ... , mask_weight_31)
else:
num_classes = output0.shape[0] - 4 - 32 # number of classes = Total number of 2nd dimension - 4 bbox. coord. - 32 mask weights
bounding_boxes = output0[:4, :] # bounding boxes coordinates format: (xc, yc, w, h)
class_confidences = output0[4:(num_classes+4), :] # class confidences format: (class_0, class_1, ...)
mask_weights = output0[(num_classes+4):, :] # mask weights format: (mask_weight_0, mask_weight_1, ... , mask_weight_31)

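# For every candidate box, keep only the best-scoring class and its id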
class_scores = np.max(class_confidences, axis=0)
class_ids = np.argmax(class_confidences, axis=0)

# Initial filtering based on class scores
if yolo_version == 5:
filtered_indices = box_confidences > 0.0
filtered_box_scores = box_confidences[filtered_indices]
else:
filtered_indices = class_scores > 0.0
filtered_boxes = bounding_boxes[:, filtered_indices]
filtered_class_scores = class_scores[filtered_indices]
filtered_class_ids = class_ids[filtered_indices]
filtered_mask_weights = mask_weights[:, filtered_indices]

# Format bounding box coordinates
x_center = filtered_boxes[0, :]
y_center = filtered_boxes[1, :]
box_width = filtered_boxes[2, :]
box_height = filtered_boxes[3, :]

x1 = x_center - box_width / 2.
y1 = y_center - box_height / 2.
x2 = x_center + box_width / 2.
y2 = y_center + box_height / 2.

# Apply NMS
bboxes = np.stack([x1, y1, x2, y2], axis=1)
if yolo_version == 5:
indices = cv2.dnn.NMSBoxes(bboxes.tolist(), filtered_box_scores.tolist(), confidence_threshold, iou_threshold)
else:
indices = cv2.dnn.NMSBoxes(bboxes.tolist(), filtered_class_scores.tolist(), confidence_threshold, iou_threshold)

final_boxes = [[int(v) for v in bboxes[i]] for i in indices]
final_scores = [filtered_class_scores[i] for i in indices]
final_class_ids = [filtered_class_ids[i] for i in indices]
filtered_mask_weights_t = filtered_mask_weights.transpose()
filtered_mask_weights = np.asarray([filtered_mask_weights_t[i] for i in indices]) # (N, 32)

if filtered_mask_weights.shape[0] != 0:
final_masks = filtered_mask_weights @ mask_protos # (N, 32) @ (32, 25600) -> per-detection mask logits

for i in range(len(final_boxes)):
# Get bounding box data
x1_i, y1_i, x2_i, y2_i = final_boxes[i]
score = final_scores[i]
class_id = final_class_ids[i]
class_name = coco_classes[class_id]

# Clamp coordinates
x1_i = np.clip(x1_i, 0, frame_width - 1)
y1_i = np.clip(y1_i, 0, frame_height - 1)
x2_i = np.clip(x2_i, 0, frame_width - 1)
y2_i = np.clip(y2_i, 0, frame_height - 1)

# Draw bounding box
cv2.rectangle(frame, (x1_i, y1_i), (x2_i, y2_i), color=class_colors[class_name], thickness=1)

# Get mask data
mask_ph, mask_pw = output1.shape[1:]
mask = final_masks[i].reshape(mask_ph, mask_pw)
mask = sigmoid(mask)
mask = (mask > pixel_conf_threshold).astype('uint8') * 255
mask_x1 = round(x1_i / frame_width * mask_pw)
mask_y1 = round(y1_i / frame_height * mask_ph)
mask_x2 = round(x2_i / frame_width * mask_pw)
mask_y2 = round(y2_i / frame_height * mask_ph)
mask = mask[mask_y1:mask_y2, mask_x1:mask_x2]
mask_img = mask.astype(np.uint8)

# Skip degenerate boxes whose mask crop would be empty (cv2.resize would fail)
if mask_img.size == 0 or (x2_i - x1_i) < 1 or (y2_i - y1_i) < 1:
    continue
mask_img = cv2.resize(mask_img, (x2_i - x1_i, y2_i - y1_i), interpolation=cv2.INTER_LINEAR)
mask = np.array(mask_img)

# Get polygon data
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
if not contours:
    continue
# Use the first (outer) contour, shifted back into frame coordinates
polygon = np.array([[x1_i + c[0][0], y1_i + c[0][1]] for c in contours[0]], np.int32)

# Fill polygon
overlay = np.zeros_like(frame, dtype=np.uint8)
#overlay[y1_i:y2_i, x1_i:x2_i][mask_img == 255] = (255, 255, 0)
cv2.fillPoly(overlay, [polygon], color=class_colors[class_name])
cv2.addWeighted(frame, 1.0, overlay, 0.5, 0, frame)

# Draw polygon
cv2.polylines(frame, [polygon], isClosed=True, color=(0, 0, 0), thickness=2)
#cv2.drawContours(frame, [polygon], -1, (0,0,0), 2)

# Draw detection label (class + confidence)
label = f"{class_name}: {score:.2f}"
cv2.putText(frame, label, (x1_i, y1_i - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1)

# Draw fps on frame
cv2.putText(frame, "NN fps: {:.1f}".format(fps), (2, frame_height - 4), cv2.FONT_HERSHEY_DUPLEX, 0.6, (0, 0, 255), 1)

# Show frame
cv2.imshow('res', frame)

# Compute fps
counter += 1
current_time = time.monotonic()
if (current_time - start_time) > 1:
fps = counter / (current_time - start_time)
counter = 0
start_time = current_time

if cv2.waitKey(1) == ord('q'):
cv2.destroyAllWindows()
break
6 changes: 6 additions & 0 deletions gen2-yolo/yolo-segmentation/requirements.txt
@@ -0,0 +1,6 @@
depthai==2.28.0.0
opencv-python>=4.10.0
numpy>=1.24.4
# argparse, time and random are part of the Python standard library and do not need to be installed