This fork is an MIT-licensed version of YOLO, with some bug fixes and the addition of the V9-N (nano) and V9-E (extended) variants to the original. This repository is already capable of achieving convergence speed and accuracy comparable to the stable GPLv3 implementation: https://github.com/MultimediaTechLab/YOLO.
Welcome to the official implementation of YOLOv7, YOLOv9, and YOLO-RD. This repository contains the complete codebase, pre-trained models, and detailed instructions for training and deploying YOLOv9.
- This is the official YOLO model implementation with an MIT License.
- YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
- YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors
- YOLO-RD: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary
To get started with YOLOv9's developer mode, we recommend cloning this repository and installing the required dependencies:
git clone https://github.com/PINTO0309/YOLO.git
cd YOLO
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
source .venv/bin/activate
export PYTHONWARNINGS="ignore"
For more customization details, please refer to HOWTO.
data
└── wholebody34
    ├── train.pache # Cache file automatically generated when training starts
    ├── val.pache # Cache file automatically generated when training starts
    ├── images
    │   ├── train
    │   │   ├── 000000000036.jpg
    │   │   ├── 000000000077.jpg
    │   │   ├── 000000000110.jpg
    │   │   ├── 000000000113.jpg
    │   │   └── 000000000165.jpg
    │   └── val
    │       ├── 000000000241.jpg
    │       ├── 000000000294.jpg
    │       ├── 000000000308.jpg
    │       ├── 000000000322.jpg
    │       └── 000000000328.jpg
    └── labels
        ├── train
        │   ├── 000000000036.txt
        │   ├── 000000000077.txt
        │   ├── 000000000110.txt
        │   ├── 000000000113.txt
        │   └── 000000000165.txt
        └── val
            ├── 000000000241.txt
            ├── 000000000294.txt
            ├── 000000000308.txt
            ├── 000000000322.txt
            └── 000000000328.txt
- `000000000036.txt`

Item | Note |
---|---|
classId | Class ID |
cx, cy | 0.0-1.0 normalized center coordinates |
w, h | 0.0-1.0 normalized width and height |

Format: `classId cx cy w h`

30 0.729688 0.959667 0.141042 0.080667
25 0.919385 0.974417 0.052521 0.051167
25 0.525000 0.680847 0.049167 0.071806
23 0.663813 0.657361 0.100125 0.105889
21 0.612667 0.519583 0.068542 0.068056
29 0.628292 0.896000 0.292500 0.082889
30 0.546063 0.957611 0.210792 0.084778
19 0.547917 0.417986 0.073125 0.037361
26 0.488281 0.653583 0.123104 0.151444
24 0.840208 0.778889 0.080417 0.092222
24 0.435312 0.790972 0.074375 0.089167
22 0.411469 0.557500 0.103313 0.112222
22 0.773646 0.546944 0.087708 0.110556
9 0.560417 0.366667 0.233333 0.266667
7 0.560417 0.366667 0.233333 0.266667
27 0.956385 0.970417 0.087229 0.055833
16 0.541667 0.370833 0.154167 0.197222
26 0.956385 0.970417 0.087229 0.055833
4 0.681458 0.621667 0.637083 0.756667
0 0.681458 0.621667 0.637083 0.756667
18 0.527188 0.373333 0.042917 0.047500
20 0.644792 0.370028 0.023125 0.036667
1 0.681458 0.621667 0.637083 0.756667
28 0.488281 0.653583 0.123104 0.151444
17 0.489687 0.370972 0.032917 0.020556
17 0.561875 0.350694 0.044583 0.019722
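A minimal sketch of reading this format, assuming the per-line `classId cx cy w h` layout above (the image size passed in is only illustrative):

```python
# Parse one label file and convert normalized boxes back to pixel coordinates.
from pathlib import Path

def parse_label_file(path: str, img_w: int, img_h: int):
    """Return (class_id, x1, y1, x2, y2) tuples in pixel coordinates."""
    boxes = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        class_id, cx, cy, w, h = line.split()
        cx, cy, w, h = float(cx), float(cy), float(w), float(h)
        boxes.append((
            int(class_id),
            (cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h,
        ))
    return boxes

# Illustrative image size; use the actual width/height of the corresponding image.
boxes = parse_label_file("data/wholebody34/labels/train/000000000036.txt", 640, 480)
print(len(boxes), "objects")
```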
yolo/config/dataset/wholebody34.yaml
path: data/wholebody34
train: train
validation: val
class_num: 34
class_list: ['body', 'adult', 'child', 'male', 'female', 'body_with_wheelchair', 'body_with_crutches', 'head', 'front', 'right-front', 'right-side', 'right-back', 'back', 'left-back', 'left-side', 'left-front', 'face', 'eye', 'nose', 'mouth', 'ear', 'collarbone', 'shoulder', 'solar_plexus', 'elbow', 'wrist', 'hand', 'hand_left', 'hand_right', 'abdomen', 'hip_joint', 'knee', 'ankle', 'foot']
auto_download:
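Since `class_num` must agree with the number of entries in `class_list`, a quick sanity check along these lines (not part of the repository) can catch mismatches before training:

```python
# Sanity-check that class_num matches the length of class_list.
import yaml  # PyYAML

with open("yolo/config/dataset/wholebody34.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["class_num"] == len(cfg["class_list"]), (
    f"class_num={cfg['class_num']} but class_list has {len(cfg['class_list'])} entries"
)
print("dataset config OK:", cfg["path"])
```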
To train YOLO on your machine/dataset:
- Modify the configuration file `yolo/config/dataset/**.yaml` to point to your dataset.
- Run the training script:
uv run python yolo/lazy.py task=train dataset=** use_wandb=True
uv run python yolo/lazy.py task=train task.data.batch_size=8 model=v9-c weight=False # or more args
To perform transfer learning with YOLOv9:
Configs:
- https://github.com/PINTO0309/YOLO/blob/wholebody/yolo/config/config.yaml
- https://github.com/PINTO0309/YOLO/blob/wholebody/yolo/config/general.yaml
- https://github.com/PINTO0309/YOLO/blob/wholebody/yolo/config/task/train.yaml
- https://github.com/PINTO0309/YOLO/blob/wholebody/yolo/config/task/validation.yaml
- https://github.com/PINTO0309/YOLO/tree/wholebody/yolo/config/model
# n, t, s, c, e
VARIANT=n
EPOCH=100
BATCHSIZE=8
uv run python yolo/lazy.py \
task=train \
name=v9-${VARIANT} \
task.epoch=${EPOCH} \
task.data.batch_size=${BATCHSIZE} \
model=v9-${VARIANT} \
dataset=wholebody34 \
device=cuda \
use_wandb=False \
use_tensorboard=True
# When specifying trained weights as initial weights
uv run python yolo/lazy.py \
task=train \
name=v9-${VARIANT} \
task.epoch=${EPOCH} \
task.data.batch_size=${BATCHSIZE} \
model=v9-${VARIANT} \
weight="runs/train/v9-n/lightning_logs/version_1/checkpoints/best_n_0002_0.0065.pt" \
dataset=wholebody34 \
device=cuda \
use_wandb=False \
use_tensorboard=True
# Automatically downloading the initial weights published by the official repository
# Default: weight=True
# Weight download path: weights/*.pt
uv run python yolo/lazy.py \
task=train \
name=v9-${VARIANT} \
task.epoch=${EPOCH} \
task.data.batch_size=${BATCHSIZE} \
model=v9-${VARIANT} \
weight=True \
dataset=wholebody34 \
device=cuda \
use_wandb=False \
use_tensorboard=True
# When starting training without initial weights
# Default: weight=True
uv run python yolo/lazy.py \
task=train \
name=v9-${VARIANT} \
task.epoch=${EPOCH} \
task.data.batch_size=${BATCHSIZE} \
model=v9-${VARIANT} \
weight=False \
dataset=wholebody34 \
device=cuda \
use_wandb=False \
use_tensorboard=True
# Resume learning from where you left off
# Please note that you must specify the Lightning checkpoint file (.ckpt)
# and not the .pt file that contains only the EMA weights.
# Unlike the official implementation, all parameters are restored from the .ckpt file,
# so training resumes exactly where it left off.
uv run python yolo/lazy.py \
task=train \
name=v9-${VARIANT} \
task.epoch=${EPOCH} \
task.data.batch_size=${BATCHSIZE} \
model=v9-${VARIANT} \
task.resume_ckpt="runs/train/v9-n/lightning_logs/version_3/checkpoints/epoch_5_step_3660.ckpt" \
dataset=wholebody34 \
device=cuda \
use_wandb=False \
use_tensorboard=True
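As a rough way to see why only the `.ckpt` can resume training exactly, you can inspect what each file type stores. The keys mentioned in the comments below are typical Lightning checkpoint contents, not guaranteed by this repository, and the paths follow the examples above:

```python
# Inspect a Lightning .ckpt versus an EMA-only .pt file.
import torch

ckpt = torch.load(
    "runs/train/v9-n/lightning_logs/version_3/checkpoints/epoch_5_step_3660.ckpt",
    map_location="cpu",
    weights_only=False,
)
# A Lightning .ckpt usually carries the full training state
# (e.g. 'state_dict', 'optimizer_states', 'lr_schedulers', 'epoch', ...),
# which is why resuming from it restores training exactly.
print(sorted(ckpt.keys()))

ema = torch.load(
    "runs/train/v9-n/lightning_logs/version_1/checkpoints/best_n_0002_0.0065.pt",
    map_location="cpu",
    weights_only=False,
)
# The .pt file holds only the EMA model weights, so optimizer/scheduler/epoch
# state is absent and it cannot be used with task.resume_ckpt.
```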
# To run a shorter fine-tuning schedule, use the dedicated configuration
# at `yolo/config/task/trainft.yaml`
# All CLI overrides available for `task=train` (e.g., `task.data.batch_size`,
# `task.resume_ckpt`) also apply to `task=trainft`.
VARIANT=n
EPOCH=60
BATCHSIZE=8
uv run python yolo/lazy.py \
task=trainft \
name=v9-${VARIANT} \
task.epoch=${EPOCH} \
task.data.batch_size=${BATCHSIZE} \
model=v9-${VARIANT} \
weight="runs/train/v9-n/lightning_logs/version_1/checkpoints/best_n_0002_0.0065.pt" \
dataset=wholebody34 \
device=cuda \
use_wandb=False \
use_tensorboard=True
# # DDP (Distributed data parallel training), Multi-GPU training
# # Below is a sample for 8 GPUs
# # n, t, s, c, e
# VARIANT=n
# EPOCH=100
# # Number of GPUs running on one node
# NPROC=8
# # When NPROC=8, the string [0,1,2,3,4,5,6,7] is set to DEVICES.
# DEVICES="[$(seq -s, 0 $((NPROC-1)))]"
# # When there are 8 GPUs and 8 batches are assigned to each GPU
# # {Batch size per GPU} x {Number of GPUs} = {Total batch size}
# # 8 x 8 = 64
# BATCHSIZE=8
# TOTALBATCHSIZE=$((BATCHSIZE * NPROC))
# uv run torchrun \
# --nproc_per_node=${NPROC} \
# yolo/lazy.py \
# task=train \
# device=${DEVICES} \
# name=v9-${VARIANT} \
# task.epoch=${EPOCH} \
# task.data.batch_size=${TOTALBATCHSIZE} \
# task.data.cpu_num=$((TOTALBATCHSIZE / NPROC)) \
# model=v9-${VARIANT} \
# weight=False \
# dataset=wholebody34 \
# use_wandb=False \
# use_tensorboard=True
=== Experimental implementation. Not recommended, as accuracy is significantly reduced. ===
# Online Knowledge Distillation (Teacher E → Student {C,S,T,N})
# Default: task.kd.enable=False
# ./ARCHITECTURE_ENHANCED_YOLOv9.md#8-online-knowledge-distillation-teacher-e--student-cstn
# ./yolo/config/task/train.yaml
uv run python yolo/lazy.py \
task=train \
name=v9-${VARIANT} \
task.epoch=${EPOCH} \
task.data.batch_size=${BATCHSIZE} \
model=v9-${VARIANT} \
weight=False \
task.kd.enable=True \
task.kd.teacher_model=v9-e \
task.kd.teacher_weight=weights/v9-e.pt \
task.kd.apply_to=both \
dataset=wholebody34 \
device=cuda \
use_wandb=False \
use_tensorboard=True
=== Experimental implementation. Not recommended, as accuracy is significantly reduced. ===
Pay particular attention to the maximum number of CPU threads and the amount of RAM on the machine you are training on. This means RAM, not VRAM. The number of worker processes started during training is `batch_size + 1`, so you must adjust `batch_size` so that this stays below the maximum number of CPU threads minus 1. RAM consumption also increases in proportion to the number of enabled augmentations, so pay attention to the amount of RAM installed in your PC; checking only the amount of VRAM is not enough. If you need to run heavy augmentation that would exceed the RAM capacity, we recommend setting `batch_size` to a relatively small value.
The figure below shows the CPU and RAM status of my work PC. When I run 16 batches with the maximum number of augmentations enabled, 17 worker processes are started, which not only consumes a lot of RAM but also causes the training process to silently abort after a few epochs without outputting any errors.

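A small back-of-the-envelope check along these lines (purely illustrative, not part of the repository) can help pick a `batch_size` that respects the rule above:

```python
# Rough check: workers = batch_size + 1 should stay below CPU threads - 1.
import os

batch_size = 16                      # illustrative value
cpu_threads = os.cpu_count() or 1
workers = batch_size + 1

if workers >= cpu_threads - 1:
    print(f"{workers} workers on {cpu_threads} CPU threads: "
          "lower batch_size or reduce augmentations to avoid exhausting RAM.")
else:
    print(f"{workers} workers on {cpu_threads} CPU threads: OK.")
```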
Countermeasure for situations where resume is unstable and a CUDA initialization error occurs: https://discuss.pytorch.org/t/dataloader-num-workers-1-cuda-initialization-error-3/159989
If `mp.set_start_method` is specified, the process will silently terminate before training begins in some environments. Therefore, if training does not start normally in your environment, it may be a good idea to comment out the `mp.set_start_method` line.
yolo/lazy.py
if __name__ == "__main__":
    # Countermeasure for situations where resume is unstable and CUDA initialization error occurs
    # https://discuss.pytorch.org/t/dataloader-num-workers-1-cuda-initialization-error-3/159989
    # If the following `mp.set_start_method` is specified, there are some environments where
    # the process will silently terminate before learning begins.
    # Therefore, if you are in an environment where learning does not start normally,
    # it may be a good idea to comment out the following line: `mp.set_start_method`.
    # mp.set_start_method("spawn", force=True)  # <--- Here
    main()
To speed up training and significantly reduce VRAM consumption, validation during training is limited to a simple, minimal evaluation per epoch. Validation results for epochs other than the final one therefore do not reflect the model's true performance; they only confirm that training is progressing normally, that accuracy is not deteriorating significantly, and that overfitting is not occurring. The true performance of the model can only be confirmed by the rigorous validation performed at the final epoch. This means that the per-epoch spot validation results do not track the true improvement of the weights exactly as training progresses, so performing early stopping based solely on the per-epoch validation status is unwise. Above all, you should not use an insufficient dataset that leads to overfitting.
The final epoch performs fairly accurate validation, so it may take several minutes or more depending on the volume of your dataset.
- NMS settings used for validation at each stage of training

Item | Intermediate Epoch | Final Epoch |
---|---|---|
pre_topk | 300 | 20,000 |
max_bbox | 300 | 20,000 |
multi_label | False | True |
class_agnostic | False | False |
If you want to display the AP for each class for every epoch, set `print_map_per_class: True` in `yolo/config/task/validation.yaml` and start training. If `print_map_per_class: False` is set, the per-class AP is calculated and output only once, at the end of the final epoch. Since `print_map_per_class` takes a very long time to process, we recommend leaving it set to `False` so that the per-class AP is calculated automatically only in the final epoch.
Epoch | Avg. Precision | % | Avg. Recall | % |
---|---|---|---|---|
2 | AP @ .5:.95 | 000.77 | AR maxDets 1 | 003.08 |
2 | AP @ .5 | 002.02 | AR maxDets 10 | 006.91 |
2 | AP @ .75 | 000.45 | AR maxDets 100 | 008.74 |
2 | AP (small) | 000.33 | AR (small) | 001.93 |
2 | AP (medium) | 000.69 | AR (medium) | 007.74 |
2 | AP (large) | 001.34 | AR (large) | 008.55 |
ID | Name | AP | ID | Name | AP |
---|---|---|---|---|---|
0 | body | 0.0343 | 20 | ear | 0.0023 |
1 | adult | 0.0320 | 21 | collarbone | 0.0003 |
2 | child | 0.0000 | 22 | shoulder | 0.0033 |
3 | male | 0.0268 | 23 | solar_plexus | 0.0003 |
4 | female | 0.0103 | 24 | elbow | 0.0001 |
5 | body_with_wheelchair | 0.0029 | 25 | wrist | 0.0001 |
6 | body_with_crutches | 0.0455 | 26 | hand | 0.0029 |
7 | head | 0.0340 | 27 | hand_left | 0.0022 |
8 | front | 0.0102 | 28 | hand_right | 0.0027 |
9 | right-front | 0.0155 | 29 | abdomen | 0.0005 |
10 | right-side | 0.0059 | 30 | hip_joint | 0.0006 |
11 | right-back | 0.0023 | 31 | knee | 0.0010 |
12 | back | 0.0001 | 32 | ankle | 0.0012 |
13 | left-back | 0.0015 | 33 | foot | 0.0063 |
14 | left-side | 0.0025 | | | |
15 | left-front | 0.0105 | | | |
16 | face | 0.0047 | | | |
17 | eye | 0.0000 | | | |
18 | nose | 0.0000 | | | |
19 | mouth | 0.0000 | | | |
The weights after training are output to the following path.
File | Note |
---|---|
`best_{variant}_{epoch:04}_{map:.4f}.pt` | Optimized weight file containing only EMA weights. The weights with the highest mAP are automatically saved. |
`epoch_{epoch}_step_{step}.ckpt` | A checkpoint file containing the full training state, automatically saved by Lightning. |
`last.pt` | Optimized weight file containing only EMA weights. The weights of the last epoch are automatically saved. |
e.g.
runs/train/v9-n/lightning_logs/version_0/checkpoints
├── best_n_0002_0.0065.pt
├── epoch_2_step_3462.ckpt
└── last.pt
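If several `best_*.pt` files accumulate across runs, a small helper like the following (hypothetical, based on the naming scheme above) can pick the one with the highest mAP:

```python
# Pick the best_*.pt file with the highest mAP parsed from its filename.
import re
from pathlib import Path

ckpt_dir = Path("runs/train/v9-n/lightning_logs/version_0/checkpoints")
pattern = re.compile(r"best_[a-z]+_(\d{4})_(\d+\.\d+)\.pt$")

candidates = []
for path in ckpt_dir.glob("best_*.pt"):
    m = pattern.match(path.name)
    if m:
        candidates.append((float(m.group(2)), int(m.group(1)), path))

if candidates:
    map_value, epoch, best_path = max(candidates)
    print(f"best checkpoint: {best_path} (epoch {epoch}, mAP {map_value:.4f})")
```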
To use a model for object detection, use:
# n, t, s, c, e
VARIANT=n
RENDER_LABELS=False
# If you do not specify `dataset={dataset_name}` correctly,
# the classification head weights will not be loaded properly
# and you will not see any inference results.
# The number of classes in the head part of the weights used for inference
# must match `class_num`.
# https://github.com/PINTO0309/YOLO/blob/wholebody/yolo/config/dataset/wholebody34.yaml
---
path: data/wholebody34
train: train
validation: val
class_num: 34 # <--- Here
class_list: ['body', ..., 'foot']
---
uv run python yolo/lazy.py \
task=inference \
name=v9-${VARIANT} \
model=v9-${VARIANT} \
weight="runs/train/v9-n/lightning_logs/version_1/checkpoints/best_n_0002_0.0065.pt" \
dataset=wholebody34 \
task.nms.min_confidence=0.1 \
task.fast_inference=onnx \
task.data.source=data/wholebody34/images/val \
task.data.max_samples=100 \
task.render_labels=${RENDER_LABELS} \
+quite=True

To validate model performance, or to generate a JSON file in COCO format:
# n, t, s, c, e
VARIANT=n
# Specify the same `batch_size` as the validation batch size used during training.
# Otherwise, the mAP value after validation will be significantly degraded.
# data:
# batch_size: 32
# https://github.com/PINTO0309/YOLO/blob/wholebody/yolo/config/task/validation.yaml
BATCHSIZE=32
# The higher the model's performance, the more accurate the evaluation will be
# if the MAXDET value (the upper limit of the number of detections) is set to
# a larger value. The default value is 1,000. yolo/config/task/validation.yaml
# However, setting a value that exceeds the maximum number of labels contained
# in one image will have no effect. For example, in my dataset, an image contains
# a maximum of 3,875 labels, so setting it to 4,000 is appropriate.
MAXDET=20000
uv run python yolo/lazy.py \
task=validation \
name=v9-${VARIANT} \
task.data.batch_size=${BATCHSIZE} \
task.nms.pre_topk=${MAXDET} \
task.nms.max_bbox=${MAXDET} \
task.nms.multi_label=True \
task.nms.class_agnostic=False \
model=v9-${VARIANT} \
weight="runs/train/v9-n/lightning_logs/version_1/checkpoints/best_n_0002_0.0065.pt" \
dataset=wholebody34 \
device=cuda \
use_wandb=False
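To turn the MAXDET guidance in the comments above into a concrete number for your own dataset, you can count the maximum number of labels in any single image, for example:

```python
# Count the maximum number of labels in any single validation image.
from pathlib import Path

label_dir = Path("data/wholebody34/labels/val")
max_labels = max(
    sum(1 for line in f.read_text().splitlines() if line.strip())
    for f in label_dir.glob("*.txt")
)
print(f"max labels per image: {max_labels}")  # choose MAXDET a little above this
```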
Use the Hydra-driven CLI to run the `export` task and produce a compact ONNX graph. The exporter emits a single `[batches, 4 + num_classes, boxes]` tensor, keeps the detection heads minimal, and derives an informative filename (e.g. `best_e_0060_0.6585_1x3x480x640.onnx`). Example:
uv run python yolo/lazy.py \
task=export \
name=v9-demo \
model=v9-e \
dataset=wholebody34 \
weight="runs/trainft/v9-e/lightning_logs/version_ft0/checkpoints/best_e_0060_0.6585.pt" \
task.dynamic_batch=False \
task.dynamic_size=False \
task.image_size=480x640 \
task.batch_size=1 \
task.opset=13 \
task.half=false \
task.apply_sigmoid=True \
task.include_metadata=True
Key overrides (all optional):
- `task.batch_size`: dummy input batch size (default 1).
- `task.dynamic_batch`: `true` marks the batch dimension as symbolic `N` and names the file accordingly.
- `task.dynamic_size`: `true` marks height and width as symbolic `H`, `W` and names the file accordingly.
- `task.image_size`: input resolution. Accepts `'HxW'`.
- `task.opset`: ONNX opset version (default 13).
- `task.simplify`: run `onnxsim` for graph simplification.
- `task.half`: export weights/activations in FP16.
- `task.apply_sigmoid`: emit post-sigmoid class probabilities instead of raw logits.
- `task.include_metadata`: embed class names in ONNX metadata.
- `task.output_path`: explicit destination; omit to auto-name beside the weight file.
- `task.name`: experiment/run folder label (standard Hydra behaviour).
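A rough post-processing sketch for the exported graph is shown below. It assumes the first four channels of the `[batches, 4 + num_classes, boxes]` output are box coordinates and, with `task.apply_sigmoid=True`, the remaining channels are per-class probabilities; check the exporter source for the exact box encoding before relying on it.

```python
# Decode the single-output ONNX graph produced by task=export.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("best_e_0060_0.6585_1x3x480x640.onnx",
                               providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 480, 640).astype(np.float32)  # replace with a real preprocessed image

(output,) = session.run(None, {input_name: dummy})  # [batches, 4 + num_classes, boxes]
boxes = output[0, :4, :].T          # (boxes, 4) box coordinates (assumed layout)
scores = output[0, 4:, :]           # (num_classes, boxes) class scores
class_ids = scores.argmax(axis=0)
confidences = scores.max(axis=0)
keep = confidences > 0.1            # NMS still needs to be applied afterwards
print("raw detections above 0.1:", int(keep.sum()))
```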

If you want to use `webgpu`, you can use ONNX models without NMS or TensorFlow.js models without NMS. If you don't want to go through ONNX, you can export a LiteRT model directly from PyTorch using ai_edge_torch.
- ONNX to TF/LiteRT
# Transformation with `Grouped Convolution` disabled
uv run onnx2tf -i yolov9_n_wholebody25_post_0100_1x3x480x640.onnx -dgc
- TF to TFJS
uv run tensorflowjs_converter \
  --input_format tf_saved_model \
  --output_format tfjs_graph_model \
  saved_model \
  tfjs_model
# Install CUDA==12.9
# https://developer.nvidia.com/cuda-toolkit-archive
# Install TensorRT==10.13.3.9-1+cuda12.9
# https://docs.nvidia.com/deeplearning/tensorrt/latest/installing-tensorrt/installing.html
uv run sit4onnx -if best_e_0205_0.4140_1x3x640x640.onnx -oep cpu
INFO: file: best_e_0205_0.4140_1x3x640x640.onnx
INFO: providers: ['CPUExecutionProvider']
INFO: input_name.1: images shape: [1, 3, 640, 640] dtype: float32
INFO: test_loop_count: 10
INFO: total elapsed time: 3673.502206802368 ms
INFO: avg elapsed time per pred: 367.3502206802368 ms
INFO: output_name.1: output shape: [1, 38, 8400] dtype: float32
uv run sit4onnx -if best_e_0205_0.4140_1x3x640x640.onnx -oep cuda
INFO: file: best_e_0205_0.4140_1x3x640x640.onnx
INFO: providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
INFO: input_name.1: images shape: [1, 3, 640, 640] dtype: float32
INFO: test_loop_count: 10
INFO: total elapsed time: 350.10218620300293 ms
INFO: avg elapsed time per pred: 35.01021862030029 ms
INFO: output_name.1: output shape: [1, 38, 8400] dtype: float32
# It will take a while to generate the TensorrtExecutionProvider_TRTKernel_*.engine cache.
uv run sit4onnx -if best_e_0205_0.4140_1x3x640x640.onnx -oep tensorrt
INFO: file: best_e_0205_0.4140_1x3x640x640.onnx
INFO: providers: ['TensorrtExecutionProvider', 'CPUExecutionProvider']
INFO: input_name.1: images shape: [1, 3, 640, 640] dtype: float32
INFO: test_loop_count: 10
INFO: total elapsed time: 104.28452491760254 ms
INFO: avg elapsed time per pred: 10.428452491760254 ms
INFO: output_name.1: output shape: [1, 38, 8400] dtype: float32
# With NMS + TensorRT
# For models with dynamic tensors as input, specify the size of the tensor
# to be tested using the --fixed_shapes / -fs option.
uv run sit4onnx -if yolov9_n_wholebody25_post_0100_1x3xHxW.onnx -oep tensorrt -fs 1 3 480 640
INFO: file: yolov9_n_wholebody25_post_0100_1x3x480x640.onnx
INFO: providers: ['TensorrtExecutionProvider', 'CPUExecutionProvider']
INFO: input_name.1: input_bgr shape: [1, 3, 480, 640] dtype: float32
INFO: test_loop_count: 10
INFO: total elapsed time: 20.3857421875 ms
INFO: avg elapsed time per pred: 2.03857421875 ms
INFO: output_name.1: batchno_classid_score_x1y1x2y2 shape: [0, 7] dtype: float32
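If you prefer not to install sit4onnx, a similar measurement can be reproduced with plain onnxruntime along these lines (provider availability depends on your local CUDA/TensorRT installation):

```python
# Time 10 inferences with onnxruntime, similar to the sit4onnx runs above.
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "best_e_0205_0.4140_1x3x640x640.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)

session.run(None, {input_name: dummy})  # warm-up (TensorRT engine build may take a while)
start = time.perf_counter()
for _ in range(10):
    session.run(None, {input_name: dummy})
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"total: {elapsed_ms:.1f} ms, avg per pred: {elapsed_ms / 10:.1f} ms")
```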
Contributions to the YOLO project are welcome! See CONTRIBUTING for guidelines on how to contribute.
@inproceedings{wang2022yolov7,
title={{YOLOv7}: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors},
author={Wang, Chien-Yao and Bochkovskiy, Alexey and Liao, Hong-Yuan Mark},
year={2023},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
}
@inproceedings{wang2024yolov9,
title={{YOLOv9}: Learning What You Want to Learn Using Programmable Gradient Information},
author={Wang, Chien-Yao and Yeh, I-Hau and Liao, Hong-Yuan Mark},
year={2024},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
}
@inproceedings{tsui2024yolord,
author={Tsui, Hao-Tang and Wang, Chien-Yao and Liao, Hong-Yuan Mark},
title={{YOLO-RD}: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary},
booktitle={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2025},
}