This fork is an MIT-licensed version of YOLO, with some bug fixes and the addition of the V9-N (nano) and V9-E (extended) variants to the original (https://github.com/MultimediaTechLab/YOLO). This repository is already capable of achieving convergence speed and accuracy comparable to the stable GPLv3 implementation.
Welcome to the official implementation of YOLOv7, YOLOv9, and YOLO-RD. This repository contains the complete codebase, pre-trained models, and detailed instructions for training and deploying YOLOv9.
- This is the official YOLO model implementation with an MIT License.
- YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
- YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors
- YOLO-RD: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary
To get started using YOLOv9's developer mode, we recommend cloning this repository and installing the required dependencies:
git clone https://github.com/PINTO0309/YOLO.git
cd YOLO
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
source .venv/bin/activate
export PYTHONWARNINGS="ignore"
For more customization details, please refer to HOWTO.
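To confirm that the environment resolved correctly, a quick sanity check is to import PyTorch inside the synced virtual environment (a minimal sketch; it assumes PyTorch is among the dependencies installed by `uv sync`):

```python
# Minimal environment check (assumes PyTorch is installed by `uv sync`).
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```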
data
└── wholebody34
├── train.pache # Cache file automatically generated when training starts
├── val.pache # Cache file automatically generated when training starts
├── images
│ ├── train
│ │ ├── 000000000036.jpg
│ │ ├── 000000000077.jpg
│ │ ├── 000000000110.jpg
│ │ ├── 000000000113.jpg
│ │ └── 000000000165.jpg
│ └── val
│ ├── 000000000241.jpg
│ ├── 000000000294.jpg
│ ├── 000000000308.jpg
│ ├── 000000000322.jpg
│ └── 000000000328.jpg
└── labels
├── train
│ ├── 000000000036.txt
│ ├── 000000000077.txt
│ ├── 000000000110.txt
│ ├── 000000000113.txt
│ └── 000000000165.txt
└── val
├── 000000000241.txt
├── 000000000294.txt
├── 000000000308.txt
├── 000000000322.txt
└── 000000000328.txt
- 000000000036.txt

| Item | Note |
|---|---|
| classId | Class ID |
| cx, cy | 0.0-1.0 normalized center coordinates |
| w, h | 0.0-1.0 normalized width and height |

classId cx cy w h
30 0.729688 0.959667 0.141042 0.080667
25 0.919385 0.974417 0.052521 0.051167
25 0.525000 0.680847 0.049167 0.071806
23 0.663813 0.657361 0.100125 0.105889
21 0.612667 0.519583 0.068542 0.068056
29 0.628292 0.896000 0.292500 0.082889
30 0.546063 0.957611 0.210792 0.084778
19 0.547917 0.417986 0.073125 0.037361
26 0.488281 0.653583 0.123104 0.151444
24 0.840208 0.778889 0.080417 0.092222
24 0.435312 0.790972 0.074375 0.089167
22 0.411469 0.557500 0.103313 0.112222
22 0.773646 0.546944 0.087708 0.110556
9 0.560417 0.366667 0.233333 0.266667
7 0.560417 0.366667 0.233333 0.266667
27 0.956385 0.970417 0.087229 0.055833
16 0.541667 0.370833 0.154167 0.197222
26 0.956385 0.970417 0.087229 0.055833
4 0.681458 0.621667 0.637083 0.756667
0 0.681458 0.621667 0.637083 0.756667
18 0.527188 0.373333 0.042917 0.047500
20 0.644792 0.370028 0.023125 0.036667
1 0.681458 0.621667 0.637083 0.756667
28 0.488281 0.653583 0.123104 0.151444
17 0.489687 0.370972 0.032917 0.020556
17 0.561875 0.350694 0.044583 0.019722
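If you want to sanity-check these annotations, the sketch below parses one label file and converts the normalized values back to pixel coordinates (a minimal illustration; the file path and image size are placeholders, not values used by this repository):

```python
# Minimal sketch: parse one YOLO-format label file and convert the normalized
# cx, cy, w, h values to pixel-space (x1, y1, x2, y2) boxes.
# The file path and image size below are placeholders for illustration only.
from pathlib import Path


def load_labels(txt_path: str, img_w: int, img_h: int):
    boxes = []
    for line in Path(txt_path).read_text().splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # skip blank or malformed lines
        class_id, cx, cy, w, h = parts
        cx, w = float(cx) * img_w, float(w) * img_w
        cy, h = float(cy) * img_h, float(h) * img_h
        boxes.append((int(class_id), cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


# Hypothetical usage; the image size must be read from the corresponding image file.
# print(load_labels("data/wholebody34/labels/train/000000000036.txt", 640, 427))
```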
yolo/config/dataset/wholebody34.yaml
path: data/wholebody34
train: train
validation: val
class_num: 34
class_list: ['body', 'adult', 'child', 'male', 'female', 'body_with_wheelchair', 'body_with_crutches', 'head', 'front', 'right-front', 'right-side', 'right-back', 'back', 'left-back', 'left-side', 'left-front', 'face', 'eye', 'nose', 'mouth', 'ear', 'collarbone', 'shoulder', 'solar_plexus', 'elbow', 'wrist', 'hand', 'hand_left', 'hand_right', 'abdomen', 'hip_joint', 'knee', 'ankle', 'foot']
auto_download:
To train YOLO on your machine/dataset:
- Modify the configuration file yolo/config/dataset/**.yaml to point to your dataset.
- Run the training script:
uv run python yolo/lazy.py task=train dataset=** use_wandb=True
uv run python yolo/lazy.py task=train task.data.batch_size=8 model=v9-c weight=False # or more args
To perform transfer learning with YOLOv9:
configs
- https://github.com/PINTO0309/YOLO/blob/wholebody/yolo/config/config.yaml
- https://github.com/PINTO0309/YOLO/blob/wholebody/yolo/config/general.yaml
- https://github.com/PINTO0309/YOLO/blob/wholebody/yolo/config/task/train.yaml
- https://github.com/PINTO0309/YOLO/blob/wholebody/yolo/config/task/validation.yaml
- https://github.com/PINTO0309/YOLO/tree/wholebody/yolo/config/model
# n, t, s, c, e
VARIANT=n
EPOCH=100
BATCHSIZE=8
uv run python yolo/lazy.py \
task=train \
name=v9-${VARIANT} \
task.epoch=${EPOCH} \
task.data.batch_size=${BATCHSIZE} \
model=v9-${VARIANT} \
dataset=wholebody34 \
device=cuda \
use_wandb=False \
use_tensorboard=True
# When specifying trained weights as initial weights
uv run python yolo/lazy.py \
task=train \
name=v9-${VARIANT} \
task.epoch=${EPOCH} \
task.data.batch_size=${BATCHSIZE} \
model=v9-${VARIANT} \
weight="runs/train/v9-n/lightning_logs/version_1/checkpoints/best_n_0002_0.0065.pt" \
dataset=wholebody34 \
device=cuda \
use_wandb=False \
use_tensorboard=True
# Automatically downloading the initial weights published by the official repository
# Default: weight=True
# Weight download path: weights/*.pt
uv run python yolo/lazy.py \
task=train \
name=v9-${VARIANT} \
task.epoch=${EPOCH} \
task.data.batch_size=${BATCHSIZE} \
model=v9-${VARIANT} \
weight=True \
dataset=wholebody34 \
device=cuda \
use_wandb=False \
use_tensorboard=True
# When starting training without initial weights
# Default: weight=True
uv run python yolo/lazy.py \
task=train \
name=v9-${VARIANT} \
task.epoch=${EPOCH} \
task.data.batch_size=${BATCHSIZE} \
model=v9-${VARIANT} \
weight=False \
dataset=wholebody34 \
device=cuda \
use_wandb=False \
use_tensorboard=True
# Resume learning from where you left off
# Please note that you must specify the Lightning checkpoint file (.ckpt)
# and not the .pt file that contains only the EMA weights.
# Unlike the official implementation, all parameters are restored from the .ckpt file,
# so training resumes exactly where it left off.
uv run python yolo/lazy.py \
task=train \
name=v9-${VARIANT} \
task.epoch=${EPOCH} \
task.data.batch_size=${BATCHSIZE} \
model=v9-${VARIANT} \
task.resume_ckpt="runs/train/v9-n/lightning_logs/version_3/checkpoints/epoch_5_step_3660.ckpt" \
dataset=wholebody34 \
device=cuda \
use_wandb=False \
use_tensorboard=True
# To run a shorter fine-tuning schedule, use the dedicated configuration
# at `yolo/config/task/trainft.yaml`
# All CLI overrides available for `task=train` (e.g., `task.data.batch_size`,
# `task.resume_ckpt`) also apply to `task=trainft`.
VARIANT=n
EPOCH=60
BATCHSIZE=8
uv run python yolo/lazy.py \
task=trainft \
name=v9-${VARIANT} \
task.epoch=${EPOCH} \
task.data.batch_size=${BATCHSIZE} \
model=v9-${VARIANT} \
weight="runs/train/v9-n/lightning_logs/version_1/checkpoints/best_n_0002_0.0065.pt" \
dataset=wholebody34 \
device=cuda \
use_wandb=False \
use_tensorboard=True
# # DDP (Distributed data parallel training), Multi-GPU training
# # Below is a sample for 8 GPUs
# # n, t, s, c, e
# VARIANT=n
# EPOCH=100
# # Number of GPUs running on one node
# NPROC=8
# # When NPROC=8, the string [0,1,2,3,4,5,6,7] is set to DEVICES.
# DEVICES="[$(seq -s, 0 $((NPROC-1)))]"
# # When there are 8 GPUs and 8 batches are assigned to each GPU
# # {Batch size per GPU} x {Number of GPUs} = {Total batch size}
# # 8 x 8 = 64
# BATCHSIZE=8
# TOTALBATCHSIZE=$((BATCHSIZE * NPROC))
# uv run torchrun \
# --nproc_per_node=${NPROC} \
# yolo/lazy.py \
# task=train \
# device=${DEVICES} \
# name=v9-${VARIANT} \
# task.epoch=${EPOCH} \
# task.data.batch_size=${TOTALBATCHSIZE} \
# task.data.cpu_num=$((TOTALBATCHSIZE / NPROC)) \
# model=v9-${VARIANT} \
# weight=False \
# dataset=wholebody34 \
# use_wandb=False \
# use_tensorboard=True
↓↓↓ Experimental implementation. Not recommended as accuracy is significantly reduced. ↓↓↓
# Online Knowledge Distillation (Teacher E → Student {C,S,T,N})
# Default: task.kd.enable=False
# ./ARCHITECTURE_ENHANCED_YOLOv9.md#8-online-knowledge-distillation-teacher-e--student-cstn
# ./yolo/config/task/train.yaml
uv run python yolo/lazy.py \
task=train \
name=v9-${VARIANT} \
task.epoch=${EPOCH} \
task.data.batch_size=${BATCHSIZE} \
model=v9-${VARIANT} \
weight=False \
task.kd.enable=True \
task.kd.teacher_model=v9-e \
task.kd.teacher_weight=weights/v9-e.pt \
task.kd.apply_to=both \
dataset=wholebody34 \
device=cuda \
use_wandb=False \
use_tensorboard=True
↑↑↑ Experimental implementation. Not recommended as accuracy is significantly reduced. ↑↑↑
Pay particular attention to the maximum number of CPU threads and the amount of RAM on the machine you are training on. I mean RAM, not VRAM. The number of worker processes started during training is batch_size + 1, so you must set batch_size to less than the maximum number of CPU threads - 1. The amount of RAM consumed also increases in proportion to the number of enabled augmentations, so you need to pay attention to the amount of RAM installed in your PC; checking only the amount of VRAM is not enough. If you need to run heavy augmentation that would exceed the RAM capacity, we recommend setting batch_size to a relatively small value.
The figure below shows the CPU and RAM status of my work PC. When I run 16 batches with the maximum number of augmentations enabled, 17 worker processes are started, which consumes so much RAM that the learning process silently aborts after a few epochs without outputting any errors.
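As a rough pre-flight check before picking batch_size, you can compare the planned worker count against the machine's CPU threads and physical RAM (a minimal Linux-oriented sketch based on the batch_size + 1 worker rule described above; the headroom margin is only a suggestion):

```python
# Rough pre-flight check (Linux): training starts batch_size + 1 worker
# processes, so batch_size should stay below (CPU threads - 1).
import os

cpu_threads = os.cpu_count() or 1
suggested_max_batch = max(1, cpu_threads - 2)  # leave one thread of headroom

total_ram_gib = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / (1024 ** 3)

print(f"CPU threads: {cpu_threads} -> keep batch_size <= {suggested_max_batch}")
print(f"Physical RAM: {total_ram_gib:.1f} GiB (watch this, not just VRAM)")
```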
Countermeasure for situations where resume is unstable and CUDA initialization errors occur: https://discuss.pytorch.org/t/dataloader-num-workers-1-cuda-initialization-error-3/159989
In some environments, specifying the following mp.set_start_method causes the process to silently terminate before training begins. Therefore, if training does not start normally in your environment, it may be a good idea to comment out the mp.set_start_method line.
yolo/lazy.py
if __name__ == "__main__":
    # Countermeasure for situations where resume is unstable and CUDA initialization error occurs
    # https://discuss.pytorch.org/t/dataloader-num-workers-1-cuda-initialization-error-3/159989
    # If the following `mp.set_start_method` is specified, there are some environments where
    # the process will silently terminate before learning begins.
    # Therefore, if you are in an environment where learning does not start normally,
    # it may be a good idea to comment out the following line: `mp.set_start_method`.
    # mp.set_start_method("spawn", force=True)  # <--- Here
    main()
To speed up training and significantly reduce VRAM consumption during training, validation is limited to a simple, minimal evaluation per epoch. Therefore, validation results for epochs other than the final one do not properly evaluate the model's true performance, but they do confirm that training is progressing normally, that accuracy is not deteriorating significantly, and that overfitting is not occurring. The true performance of the model can only be confirmed by the rigorous validation performed at the final epoch. This means that the per-epoch spot validation results do not perfectly track the true improvement of the weights as training progresses. It would be unwise to perform early stopping based solely on the validation status of each epoch. Above all, you should not use an insufficient dataset that leads to overfitting.
The final epoch performs fairly accurate validation, so it may take several minutes or more depending on the volume of your dataset.
- NMS settings for validation at each stage of training

| Setting | Intermediate Epoch | Final Epoch |
|---|---|---|
| pre_topk | 300 | 20,000 |
| max_bbox | 300 | 20,000 |
| multi_label | False | True |
| class_agnostic | False | False |
If you want to display the AP for each class for every epoch, set print_map_per_class: True in yolo/config/task/validation.yaml and start training. If print_map_per_class: False is set, AP per class is calculated and output only once, at the end of the final epoch. Since print_map_per_class takes a very long time to process, we recommend leaving it set to False so that map_per_class is calculated automatically only in the final epoch.
┏━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━┓
┃Epoch┃Avg. Precision ┃ %┃Avg. Recall ┃ %┃
┡━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━┩
│ 2│AP @ .5:.95 │000.77╎AR maxDets 1 │003.08│
│ 2│AP @ .5 │002.02╎AR maxDets 10 │006.91│
│ 2│AP @ .75 │000.45╎AR maxDets 100 │008.74│
│ 2│AP (small) │000.33╎AR (small) │001.93│
│ 2│AP (medium) │000.69╎AR (medium) │007.74│
│ 2│AP (large) │001.34╎AR (large) │008.55│
└─────┴────────────────┴──────┴────────────────┴──────┘
┏━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ ID┃Name ┃ AP┃ ID┃Name ┃ AP┃
┡━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ 0│body │ 0.0343│ 20│ear │ 0.0023│
│ 1│adult │ 0.0320│ 21│collarbone │ 0.0003│
│ 2│child │ 0.0000│ 22│shoulder │ 0.0033│
│ 3│male │ 0.0268│ 23│solar_plexus │ 0.0003│
│ 4│female │ 0.0103│ 24│elbow │ 0.0001│
│ 5│body_with_wheelchair │ 0.0029│ 25│wrist │ 0.0001│
│ 6│body_with_crutches │ 0.0455│ 26│hand │ 0.0029│
│ 7│head │ 0.0340│ 27│hand_left │ 0.0022│
│ 8│front │ 0.0102│ 28│hand_right │ 0.0027│
│ 9│right-front │ 0.0155│ 29│abdomen │ 0.0005│
│ 10│right-side │ 0.0059│ 30│hip_joint │ 0.0006│
│ 11│right-back │ 0.0023│ 31│knee │ 0.0010│
│ 12│back │ 0.0001│ 32│ankle │ 0.0012│
│ 13│left-back │ 0.0015│ 33│foot │ 0.0063│
│ 14│left-side │ 0.0025│ │ │ │
│ 15│left-front │ 0.0105│ │ │ │
│ 16│face │ 0.0047│ │ │ │
│ 17│eye │ 0.0000│ │ │ │
│ 18│nose │ 0.0000│ │ │ │
│ 19│mouth │ 0.0000│ │ │ │
└───┴─────────────────────────┴───────┴───┴─────────────────────────┴───────┘
The weights after training are output to the following path.
| File | Note |
|---|---|
| `best_{variant}_{epoch:04}_{map:.4f}.pt` | Optimized weight file containing only EMA weights. The weights with the highest mAP are automatically saved. |
| `epoch_{epoch}_step_{step}.ckpt` | A checkpoint file containing all learning logs, automatically saved by Lightning. |
| `last.pt` | Optimized weight file containing only EMA weights. The weights of the last epoch are automatically saved. |
e.g.
runs/train/v9-n/lightning_logs/version_0/checkpoints
├── best_n_0002_0.0065.pt
├── epoch_2_step_3462.ckpt
└── last.pt
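If you want to confirm what each file contains, the sketch below simply loads them with PyTorch and prints the top-level structure (the paths are the example run shown above; the exact key layout is an assumption to verify against your own checkpoints):

```python
# Minimal sketch: peek inside the saved files. Paths are the example run above.
import torch

# Lightning checkpoint: a dict that also carries optimizer/trainer state.
ckpt = torch.load(
    "runs/train/v9-n/lightning_logs/version_0/checkpoints/epoch_2_step_3462.ckpt",
    map_location="cpu",
    weights_only=False,
)
print(sorted(ckpt.keys()))

# EMA-only weight file: smaller, used as `weight=` for fine-tuning and inference.
ema = torch.load(
    "runs/train/v9-n/lightning_logs/version_0/checkpoints/best_n_0002_0.0065.pt",
    map_location="cpu",
    weights_only=False,
)
print(type(ema), list(ema.keys())[:5] if isinstance(ema, dict) else "")
```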
To use a model for object detection, use:
# n, t, s, c, e
VARIANT=n
RENDER_LABELS=False
# If you do not specify `dataset={dataset_name}` correctly,
# the classification head weights will not be loaded properly
# and you will not see any inference results.
# The number of classes in the head part of the weights used for inference
# must match `class_num`.
# https://github.com/PINTO0309/YOLO/blob/wholebody/yolo/config/dataset/wholebody34.yaml
---
path: data/wholebody34
train: train
validation: val
class_num: 34 # <--- Here
class_list: ['body', ..., 'foot']
---
uv run python yolo/lazy.py \
task=inference \
name=v9-${VARIANT} \
model=v9-${VARIANT} \
weight="runs/train/v9-n/lightning_logs/version_1/checkpoints/best_n_0002_0.0065.pt" \
dataset=wholebody34 \
task.nms.min_confidence=0.1 \
task.fast_inference=onnx \
task.data.source=data/wholebody34/images/val \
task.data.max_samples=100 \
task.render_labels=${RENDER_LABELS} \
+quite=True
To validate model performance, or generate a json file in COCO format:
# n, t, s, c, e
VARIANT=n
# Specify the same `batch_size` as the validation batch size used during training.
# Otherwise, the mAP value after validation will be significantly degraded.
# data:
# batch_size: 32
# https://github.com/PINTO0309/YOLO/blob/wholebody/yolo/config/task/validation.yaml
BATCHSIZE=32
# The higher the model's performance, the more accurate the evaluation will be
# if the MAXDET value (the upper limit of the number of detections) is set to
# a larger value. The default value is 1,000. yolo/config/task/validation.yaml
# However, setting a value that exceeds the maximum number of labels contained
# in one image will have no effect. For example, in my dataset, an image contains
# a maximum of 3,875 labels, so setting it to 4,000 is appropriate.
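# A rough way to check that upper limit on your own dataset is to take the
# maximum number of lines across the YOLO-format label files, e.g.:
#   find data/wholebody34/labels -name '*.txt' -exec wc -l {} \; | sort -n | tail -n 1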
MAXDET=20000
uv run python yolo/lazy.py \
task=validation \
name=v9-${VARIANT} \
task.data.batch_size=${BATCHSIZE} \
task.nms.pre_topk=${MAXDET} \
task.nms.max_bbox=${MAXDET} \
task.nms.multi_label=True \
task.nms.class_agnostic=False \
model=v9-${VARIANT} \
weight="runs/train/v9-n/lightning_logs/version_1/checkpoints/best_n_0002_0.0065.pt" \
dataset=wholebody34 \
device=cuda \
use_wandb=False
Use the Hydra-driven CLI to run the export task and produce a compact ONNX graph. The exporter emits a
single [batches, 4 + num_classes, boxes] tensor, keeps detection heads minimal, and derives an informative
filename (e.g. best_e_0060_0.6585_1x3x480x640.onnx). Example:
uv run python yolo/lazy.py \
task=export \
name=v9-demo \
model=v9-e \
dataset=wholebody34 \
weight="runs/trainft/v9-e/lightning_logs/version_ft0/checkpoints/best_e_0060_0.6585.pt" \
task.dynamic_batch=False \
task.dynamic_size=False \
task.image_size=480x640 \
task.batch_size=1 \
task.opset=13 \
task.half=false \
task.apply_sigmoid=True \
task.include_metadata=True
Key overrides (all optional):
- `task.batch_size`: dummy input batch size (default 1).
- `task.dynamic_batch`: `true` marks batch as symbolic `N` and names the file accordingly.
- `task.dynamic_size`: `true` marks Height and Width as symbolic `H`, `W` and names the file accordingly.
- `task.image_size`: input resolution. Accepts `'HxW'`.
- `task.opset`: ONNX opset version (default 13).
- `task.simplify`: run `onnxsim` for graph simplification.
- `task.half`: export weights/activations in FP16.
- `task.apply_sigmoid`: emit post-sigmoid class probabilities instead of raw logits.
- `task.include_metadata`: embed class names in ONNX metadata.
- `task.output_path`: explicit destination; omit to auto-name beside the weight file.
- `task.name`: experiment/run folder label (standard Hydra behaviour).
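To consume the exported graph, the sketch below runs it with onnxruntime and splits the single output tensor (a minimal illustration that assumes the `[batches, 4 + num_classes, boxes]` layout described above, with the first four channels holding box coordinates; the file name, input size, and preprocessing are placeholders):

```python
# Minimal sketch: run the exported ONNX model with onnxruntime and split the
# single [batches, 4 + num_classes, boxes] output described above.
# Assumption: the first 4 channels are box coordinates; the file name, input
# size, and preprocessing are placeholders only. NMS still has to be applied.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "best_e_0060_0.6585_1x3x480x640.onnx", providers=["CPUExecutionProvider"]
)
input_name = session.get_inputs()[0].name

dummy = np.random.rand(1, 3, 480, 640).astype(np.float32)  # replace with a real preprocessed image
(pred,) = session.run(None, {input_name: dummy})            # [1, 4 + num_classes, boxes]

boxes = pred[0, :4, :]        # box channels, one column per candidate
scores = pred[0, 4:, :]       # class scores (post-sigmoid if task.apply_sigmoid=True)
class_ids = scores.argmax(axis=0)
confidences = scores.max(axis=0)
keep = confidences > 0.25     # simple confidence threshold for illustration
print(boxes[:, keep].shape, class_ids[keep][:5], confidences[keep][:5])
```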
If you want to use WebGPU, you can use ONNX models without NMS or TensorFlow.js models without NMS. If you don't want to go through ONNX, you can output the LiteRT model directly from PyTorch using ai_edge_torch.
- ONNX to TF/LiteRT
# Transformation with `Grouped Convolution` disabled
uv run onnx2tf -i yolov9_n_wholebody25_post_0100_1x3x480x640.onnx -dgc
- TF to TFJS
uv run tensorflowjs_converter \
  --input_format tf_saved_model \
  --output_format tfjs_graph_model \
  saved_model \
  tfjs_model
# Install CUDA==12.9
# https://developer.nvidia.com/cuda-toolkit-archive
# Install TensorRT==10.13.3.9-1+cuda12.9
# https://docs.nvidia.com/deeplearning/tensorrt/latest/installing-tensorrt/installing.html
uv run sit4onnx -if best_e_0205_0.4140_1x3x640x640.onnx -oep cpu
INFO: file: best_e_0205_0.4140_1x3x640x640.onnx
INFO: providers: ['CPUExecutionProvider']
INFO: input_name.1: images shape: [1, 3, 640, 640] dtype: float32
INFO: test_loop_count: 10
INFO: total elapsed time: 3673.502206802368 ms
INFO: avg elapsed time per pred: 367.3502206802368 ms
INFO: output_name.1: output shape: [1, 38, 8400] dtype: float32
uv run sit4onnx -if best_e_0205_0.4140_1x3x640x640.onnx -oep cuda
INFO: file: best_e_0205_0.4140_1x3x640x640.onnx
INFO: providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
INFO: input_name.1: images shape: [1, 3, 640, 640] dtype: float32
INFO: test_loop_count: 10
INFO: total elapsed time: 350.10218620300293 ms
INFO: avg elapsed time per pred: 35.01021862030029 ms
INFO: output_name.1: output shape: [1, 38, 8400] dtype: float32
# It will take a while to generate the TensorrtExecutionProvider_TRTKernel_*.engine cache.
uv run sit4onnx -if best_e_0205_0.4140_1x3x640x640.onnx -oep tensorrt
INFO: file: best_e_0205_0.4140_1x3x640x640.onnx
INFO: providers: ['TensorrtExecutionProvider', 'CPUExecutionProvider']
INFO: input_name.1: images shape: [1, 3, 640, 640] dtype: float32
INFO: test_loop_count: 10
INFO: total elapsed time: 104.28452491760254 ms
INFO: avg elapsed time per pred: 10.428452491760254 ms
INFO: output_name.1: output shape: [1, 38, 8400] dtype: float32
# With NMS + TensorRT
# For models with dynamic tensors as input, specify the size of the tensor
# to be tested using the --fixed_shapes / -fs option.
uv run sit4onnx -if yolov9_n_wholebody25_post_0100_1x3xHxW.onnx -oep tensorrt -fs 1 3 480 640
INFO: file: yolov9_n_wholebody25_post_0100_1x3x480x640.onnx
INFO: providers: ['TensorrtExecutionProvider', 'CPUExecutionProvider']
INFO: input_name.1: input_bgr shape: [1, 3, 480, 640] dtype: float32
INFO: test_loop_count: 10
INFO: total elapsed time: 20.3857421875 ms
INFO: avg elapsed time per pred: 2.03857421875 ms
INFO: output_name.1: batchno_classid_score_x1y1x2y2 shape: [0, 7] dtype: float32
Contributions to the YOLO project are welcome! See CONTRIBUTING for guidelines on how to contribute.
@inproceedings{wang2022yolov7,
title={{YOLOv7}: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors},
author={Wang, Chien-Yao and Bochkovskiy, Alexey and Liao, Hong-Yuan Mark},
year={2023},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
}
@inproceedings{wang2024yolov9,
title={{YOLOv9}: Learning What You Want to Learn Using Programmable Gradient Information},
author={Wang, Chien-Yao and Yeh, I-Hau and Liao, Hong-Yuan Mark},
year={2024},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
}
@inproceedings{tsui2024yolord,
author={Tsui, Hao-Tang and Wang, Chien-Yao and Liao, Hong-Yuan Mark},
title={{YOLO-RD}: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary},
booktitle={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2025},
}