Accurate, auditable image captions for LoRA dataset preparation in ComfyUI.
CaptionForge is built around a simple idea: one captioner can be useful, but one captioner is also easy to fool. Instead of asking a single model to describe an image and hoping it gets everything right, CaptionForge can ask multiple independent captioning engines to produce separate “witness accounts” of the same image. Those accounts are then merged by a text-LLM distillation pass that looks for agreement, preserves useful details, and separates likely contradictions or unsupported claims. The resulting draft is checked against the image by a final vision-language model, which acts as the image-aware judge before the final captions are exported.
The goal is not magic, and it is not perfection. CaptionForge is meant for automated captioning of large image archives and LoRA training sets where hand-captioning would be too slow, but where the usual hallucinations, omissions, and inconsistencies from a single captioning model are still a problem. The pipeline is intentionally heavier than a normal caption node, so it is best used when caption quality, auditability, and consistency matter enough to justify the extra computation.
The current v0.1.x workflow is tuned primarily for character, fashion, portrait, doll/render, cosplay, pageant, glamour, and style-LoRA datasets, where visible details such as face, hair, eyes, expression, pose, body shape, clothing construction, accessories, colors, materials, lighting, background, framing, and visual style matter.
A full workflow sample is included as a PNG with embedded ComfyUI workflow metadata:
assets/workflows/CaptionForge_FullWorkflow.png
A separate JSON export of the same workflow is also included:
assets/workflows/CaptionForge_FullWorkflow.json
In ComfyUI, load the workflow by dragging either CaptionForge_FullWorkflow.png or CaptionForge_FullWorkflow.json onto the canvas.
Clone CaptionForge into your ComfyUI custom nodes folder:
git clone https://github.com/Damkohler/CaptionForge.git ComfyUI/custom_nodes/CaptionForgeOr copy the repository manually so the folder layout is:
ComfyUI/custom_nodes/CaptionForge/
Then restart ComfyUI.
If your ComfyUI environment does not already include the needed Python packages, install CaptionForge dependencies from inside your ComfyUI Python environment. The exact command depends on how your ComfyUI install is managed, but typical options are:
cd ComfyUI/custom_nodes/CaptionForge
pip install -e .or, if you maintain dependencies manually:
pip install torch transformers accelerate huggingface-hub pillow numpy safetensors qwen-vl-utilsOptional 8-bit loading may require:
pip install bitsandbytesOllama-backed stages require a working local Ollama installation and installed Ollama model tags.
Example:
ollama pull mistral-small:24b
ollama pull gemma4:26bCaptionForge does not ship model weights. Joy, Qwen, and Ollama model downloads remain user-controlled.
CaptionForge's main pipeline is:
Pass A — raw witness captions
Joy Caption xN
Qwen Caption xN
optional Ollama VLM Caption xN
Pass B — text-LLM distillation
combine witness captions
preserve repeated and useful details
separate contradictions and weak claims
build a rich draft caption
Pass C — image-aware VLM validation
inspect the actual image
keep image-supported details
remove unsupported hallucinations
correct visible errors
produce the authoritative long caption
Pass D — deterministic export formatting
write the validated long caption
derive a shorter LoRA-length caption
derive a compact taggy caption
write TXT and JSONL audit records
The important distinction is that the expensive semantic work should mostly end at the VLM-validated long caption. The short and taggy outputs are intentionally lighter recipe-style formatting steps derived from that validated caption, not new attempts to reinterpret the image.
CaptionForge v0.1.0 is a working experimental preview for ComfyUI users and node developers who want to test a multi-pass captioning pipeline.
It is not presented as a universal replacement for a strong standalone captioner. If JoyCaption, Qwen, Florence, BLIP, WD14, or another captioning tool already gives you exactly what your dataset needs, you may not need CaptionForge. This project is aimed at cases where a single captioner is not accurate, complete, consistent, or auditable enough.
Expected v0.1.x realities:
- the workflow is computationally heavy
- large models may be slow
- model choices matter a lot
- output schemas may still evolve
- prompts and defaults may continue to be refined
- not every dataset will benefit equally
- comparison feedback is welcome
This is a heavy tool. Use it when the extra caption quality and audit trail of large automated jobs are worth the runtime cost.
You may want CaptionForge when:
- one captioner notices the face but misses clothing details
- another captioner notices clothing but misreads the pose
- a third captioner catches style or material details the others miss
- you want an LLM to consolidate agreement instead of merely accepting one model's wording
- you want a final VLM to check the draft against the actual image
- you want intermediate JSONL records for debugging and audit
- you want final captions written as sidecars beside the source images
- you need both long natural captions and compact LoRA-style derivatives
The project question is practical:
Can independent caption witnesses plus text distillation plus image-aware validation produce better dataset captions than a single captioning model alone?
For some datasets, the answer may be yes. For others, a simpler captioner may be enough. CaptionForge is designed to make that comparison visible.
CaptionForge currently favors captions that are:
- rich enough for LoRA training
- visually grounded
- less hallucinated than unvalidated text-only synthesis
- explicit about visible, trainable details
- auditable through JSONL records
- locally runnable
- model-agnostic enough to improve as better captioners, distillers, and validators become available
Useful caption details often include:
- subject type and visible style
- face shape and facial traits
- hair color and hairstyle
- eye color and makeup as separate details
- expression and pose
- hands and body position
- body shape and visible proportions when relevant
- clothing construction, layers, fit, and materials
- accessories, jewelry, nails, props, and distinctive details
- colors, textures, lighting, background, framing, and crop
Visible glamour, swimwear, lingerie, revealing clothing, cleavage, side openings, exposed midriff, or similar styling may be described neutrally when it is actually visible and relevant to the dataset. CaptionForge prompts should not invent hidden anatomy, unseen clothing, explicit acts, or contradicted details.
Node categories are being normalized under:
Captioning/CaptionForge
with active caption nodes under:
Captioning/CaptionForge/Caption Nodes
The central planning node for normal runs.
It coordinates:
- input image path or direct image passthrough
- recursive folder traversal
- filename glob filtering
- output directory
- run name
- overwrite behavior
- Pass A witness run counts
- seed schedules
- sampling schedules
- max image size
- max token budget
- LoRA trigger word
- user caption anchor
- distiller settings
- validator settings
- final export settings
- derived JSONL/TXT/config paths
The main capstone/orchestration node.
It consumes Pass A raw caption records, runs the distillation and validation stages, and exports final captions. The VLM-validated natural paragraph is the authoritative long caption. Formatting stages should not blindly rewrite that natural caption.
Python/Hugging Face JoyCaption/LLaVA-family Pass A witness.
Joy is treated as a first-class CaptionForge caption source and is often one of the strongest raw caption witnesses.
Python/Hugging Face Qwen-family Pass A witness.
Qwen is useful as a second independent captioning voice, especially when its behavior complements Joy. Optional 8-bit loading may be available where supported.
Ollama-backed VLM Pass A witness.
This node delegates image-caption generation to a local Ollama server rather than loading Hugging Face/PyTorch weights inside ComfyUI. It can use configured Ollama VLM tags such as:
gemma4:26b
qwen3.6:35B-A3B
huihui_ai/gemma-4-abliterated:26b
Its purpose is to provide access to other raw-caption witness alternatives. It's function is parallel to the Joy Caption and Qwen Caption nodes, and should not be confused with the later VLM validator/capstone role.
Shared prompt-option sidecar for caption nodes.
Template Options let one sidecar node feed consistent LoRA-relevant prompt modifiers into Joy, Qwen, Ollama, and later caption witnesses without duplicating the same option widgets on every caption node.
CaptionForge uses two model ecosystems:
- Python / Hugging Face model folders for Joy and Qwen witness engines.
- Ollama models for text-LLM distillation, image-aware VLM validation, optional formatting, and Ollama-backed caption witnesses.
Joy and Qwen use Python/Hugging Face engines that integrate with the CaptionForge process-local model cache. Those engines manage Python model residency, reuse, and eviction before loading heavyweight caption models.
Ollama-facing stages are different. Ollama models live in the Ollama daemon, not inside the CaptionForge Python model cache. Before handing work to Ollama, the Ollama Caption node and the CaptionForge capstone clear any resident CaptionForge Python/HF caption models if needed. After that handoff, Ollama owns Ollama model residency.
In short:
Joy/Qwen engines:
manage Python-hosted caption models through captionforge_model_cache
Ollama Caption and CaptionForge capstone:
clear Python-hosted models before calling the Ollama daemon
Ollama daemon:
owns Ollama model loading and residency
Large model weights are intentionally not stored in this repository.
Python-based witness models are expected under ComfyUI model folders, for example:
ComfyUI/models/LLM/JLC_JoyCaption/
ComfyUI/models/LLM/JLC_QwenCaption/
Ollama models must be installed and runnable through Ollama outside this repository.
CaptionForge does not require every supported backend to be installed for every workflow. Users can test smaller subsets first.
The file:
config/captionforge_ollama_models.json
defines user-editable Ollama model tags for dropdowns used by distiller, validator, formatter, and Ollama caption-witness nodes.
Example:
{
"distiller_models": [
"mistral-small:24b",
"VladimirGav/gemma4-26b-16GB-VRAM-Uncensored",
"deepseek-r1:32b",
"tarruda/neuraldaredevil-8b-abliterated:fp16",
"gpt-oss:20b"
],
"validator_models": [
"gemma4:26b",
"qwen3.6:35B-A3B",
"huihui_ai/gemma-4-abliterated:26b"
],
"format_models": [
"mistral-small:24b",
"VladimirGav/gemma4-26b-16GB-VRAM-Uncensored",
"gpt-oss:20b",
"deepseek-r1:32b"
],
"caption_models": [
"gemma4:26b",
"qwen3.6:35B-A3B",
"huihui_ai/gemma-4-abliterated:26b"
],
"defaults": {
"distiller_model": "mistral-small:24b",
"validator_model": "gemma4:26b",
"format_model": "mistral-small:24b",
"caption_model": "gemma4:26b"
},
"include_custom": true
}Terminology:
distiller_model text-only LLM for Pass B distillation
validator_model image-aware VLM for Pass C validation
format_model text-only LLM for formatting/taggy conversion when used
caption_model Ollama-backed Pass A image-caption witness model
Values should be concrete Ollama model tags used exactly as written.
CaptionForge writes auditable run artifacts and final sidecars during planned runs.
A typical planned run uses this structure:
<output_root>/
opt_images/
comfy_image_0000.png
comfy_image_0000_long.txt
comfy_image_0000_short.txt
comfy_image_0000_taggy.txt
comfy_image_0001.png
comfy_image_0001_long.txt
comfy_image_0001_short.txt
comfy_image_0001_taggy.txt
<run_name>__working/
<run_name>__A_RAW_CAPTIONS.jsonl
<run_name>__B_DISTILL.jsonl
<run_name>__B_DISTILL_readable.jsonl
<run_name>__B_DISTILL_readable.json
<run_name>__B_DISTILL_prompts.jsonl
<run_name>__C_VLM_VALIDATED.jsonl
<run_name>__C_VLM_VALIDATED_readable/
<run_name>__C_VLM_VALIDATOR_prompts.jsonl
<run_name>__D_FINAL_EXPORT.jsonl
<run_name>__output_paths.json
<run_name>__run_config.json
Folder-input images keep their source locations, and final TXT sidecars are written beside those original images.
Optional direct IMAGE inputs are copied into a visible output-root folder:
<output_root>/opt_images/
Final caption sidecars are written beside the resolved source image. For folder-input images, that means beside the original image. For optional direct images, that means beside the saved optional image inside opt_images/.
Final sidecars currently include:
<image_stem>_long.txt
<image_stem>_short.txt
<image_stem>_taggy.txt
Meaning:
_long.txt the authoritative VLM-validated natural caption
_short.txt a shorter LoRA-length caption derived from the long caption
_taggy.txt a compact comma-separated taggy caption derived from the long caption
Long captions are intentional in v0.1.x. The current release-candidate strategy favors preserving visible, trainable detail in the validated long caption, then deriving shorter and taggy outputs from that result.
Exact JSONL schemas may evolve during the preview phase.
Python dependencies are declared in pyproject.toml where applicable.
Typical local use may involve:
torch
transformers
accelerate
huggingface-hub
pillow
numpy
safetensors
qwen-vl-utils
Optional quantization support may involve:
bitsandbytes
Ollama-backed stages require a working local Ollama installation and installed Ollama model tags.
CaptionForge is designed for local workflows, but strong results may require large local models.
Practical performance depends on:
- GPU VRAM
- system RAM
- model size
- quantization mode
- Ollama version
- context length
- image size
- number of Pass A witness runs
- whether models are kept loaded or unloaded between runs
The author's active development environment includes an RTX 4090 Laptop GPU with 16 GB VRAM. Larger models may be slow, may require careful quantization, or may need more capable hardware.
Some experimental or unsupported code may exist in the repository for future A/B testing or research.
Experimental branches should be:
- clearly labeled
- kept out of the normal ComfyUI registration path
- not imported by
__init__.py - not shown as mainline nodes unless deliberately enabled
- treated as unsupported starting points rather than stable user features
The active public workflow should be the main Planner → Pass A witnesses → Distiller → VLM Validator → Export path.
CaptionForge currently prioritizes:
- local execution
- auditable intermediate records
- JSONL sidecars
- reusable engines separated from ComfyUI node wrappers
- planner-driven workflows
- model cache and VRAM hygiene
- strong defaults for LoRA captioning
- explicit prompt roles
- model-agnostic backends
- visible, trainable detail over generic caption prose
- practical feedback from real datasets
Useful feedback includes:
- comparisons against standalone JoyCaption, Qwen, or other captioners
- examples where CaptionForge improves caption quality
- examples where CaptionForge makes captions worse
- hallucination reports
- missed-detail reports
- model recommendations
- prompt improvements
- broken node reports
- workflow usability feedback
- VRAM/performance observations
- JSONL/audit trail suggestions
Please include enough context to reproduce the issue or evaluate the result: selected nodes, model tags, relevant settings, whether the run used direct IMAGE input or a folder path, and a small sample of generated captions when possible.
Concept and implementation by J. L. Córdova, with development assistance from ChatGPT (OpenAI).
CaptionForge's Joy/template-option workflow is locally adapted and was inspired in part by the practical template interface pattern used by the public JoyCaption Beta One Hugging Face Space:
https://huggingface.co/spaces/fffiloni/JoyCaption-Beta-One
Copyright (c) 2026 J. L. Córdova
Released under the MIT License. See LICENSE for details.
