Open-vocabulary object detection on iOS using YOLO-World + CLIP.
Type any text — "person", "red car", "coffee cup" — and detect it in real time from the camera, photos, or videos. No fixed class list.
```
Text Input ───→ CLIP Text Encoder ──→ txt_feats [1,80,512]
                                           │
Camera/Image ─→ YOLO-World Detector ───────┼──→ boxes  [1,4,8400]
                                           └──→ scores [1,80,8400] (sigmoid-calibrated)
                                                    │
                                           NMS + Filter ──→ Bounding Boxes
```
The CoreML detector includes the full BNContrastiveHead scoring pipeline internally. Scores are pre-computed — no external parameter files needed.
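Since the model already emits sigmoid-calibrated scores, the app-side work reduces to thresholding and NMS. Below is an illustrative NumPy sketch of that decode step — not the app's actual Swift code — using the diagram's shapes with the batch dimension dropped (`boxes` as `[4, N]` xywh, `scores` as `[C, N]`); the thresholds are typical defaults, not values from this project:

```python
import numpy as np

def iou(a, b):
    """IoU of one xyxy box `a` against an [M, 4] array of xyxy boxes `b`."""
    tl = np.maximum(a[:2], b[:, :2])
    br = np.minimum(a[2:], b[:, 2:])
    inter = np.prod(np.clip(br - tl, 0, None), axis=1)
    area_a = np.prod(a[2:] - a[:2])
    area_b = np.prod(b[:, 2:] - b[:, :2], axis=1)
    return inter / (area_a + area_b - inter + 1e-9)

def decode(boxes, scores, conf_thres=0.25, iou_thres=0.5):
    """boxes: [4, N] xywh; scores: [C, N], already sigmoid-calibrated to [0, 1]."""
    cls = scores.argmax(axis=0)           # best-matching text query per anchor
    conf = scores.max(axis=0)
    keep = conf > conf_thres              # confidence filter
    boxes, cls, conf = boxes[:, keep].T, cls[keep], conf[keep]
    xy, wh = boxes[:, :2], boxes[:, 2:]   # xywh -> xyxy corners
    xyxy = np.concatenate([xy - wh / 2, xy + wh / 2], axis=1)
    order, kept = conf.argsort()[::-1], []
    while order.size:                     # greedy class-wise NMS
        i = order[0]
        kept.append(i)
        rest = order[1:]
        suppress = (cls[rest] == cls[i]) & (iou(xyxy[i], xyxy[rest]) > iou_thres)
        order = rest[~suppress]
    return xyxy[kept], cls[kept], conf[kept]
```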
| Model | Size | Description |
|---|---|---|
| `yoloworld_detector.mlpackage` | 25 MB | YOLO-World V2-S visual detector |
| `clip_text_encoder.mlpackage` | 121 MB | CLIP ViT-B/32 text encoder |
| `clip_vocab.json` | 1.6 MB | BPE vocabulary for the tokenizer |
- Camera: Real-time open-vocabulary detection
- Photo: Pick from library, detect with any text query
- Video: Pick a video, detect frame-by-frame with overlay
- Open-vocabulary: Up to 80 simultaneous queries, any text
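The 80-query cap above comes from the detector's fixed `txt_feats` input shape `[1,80,512]`: shorter query lists must be padded out to 80 slots. A minimal sketch of that padding (padding with empty strings is an assumption for illustration, not necessarily what the app does):

```python
MAX_SLOTS = 80  # fixed by the detector's txt_feats input shape [1, 80, 512]

def pad_queries(queries, max_slots=MAX_SLOTS):
    """Truncate to the slot limit, pad the rest with empty strings,
    and return the number of real (non-padding) queries."""
    queries = queries[:max_slots]
    return queries + [""] * (max_slots - len(queries)), len(queries)
```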
- iOS 16.0+
- Xcode 15.0+
- Physical device (camera + Neural Engine)
- Open `YOLOWorldDemo.xcodeproj` in Xcode
- Select your development team
- Build and run on a physical device
Models are pre-bundled. No additional setup required.
To convert with a different model size (m/l/x):
```bash
pip install ultralytics open_clip_torch coremltools torch==2.7.0
python convert_models.py --size l
```

Then replace the `.mlpackage` files in the Xcode project.
- Enter comma-separated object names in the text field (e.g., `person, dog, car`)
- Tap the search button or press return
- Switch between Camera / Photo / Video modes with the bottom buttons
- In Photo/Video mode, tap the green (+) button to pick from library
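The comma-separated text field maps directly to the detector's query list. A minimal sketch of that parsing step (a hypothetical helper, not the app's actual code), trimming whitespace, dropping blanks, and capping at the 80-slot limit:

```python
def parse_queries(text, max_slots=80):
    """Split a comma-separated query string into clean query names."""
    queries = [q.strip() for q in text.split(",") if q.strip()]
    return queries[:max_slots]
```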