Caption system overhaul V2 #4613
Merged
vladmandic merged 27 commits into vladmandic:dev on Feb 11, 2026
Conversation
vladmandic reviewed Feb 10, 2026
vladmandic reviewed Feb 10, 2026
vladmandic reviewed Feb 10, 2026
vladmandic reviewed Feb 10, 2026
Add comprehensive tooltips to Caption tab UI elements in locale_en.json:
- Add new "llm" section for shared LLM/VLM parameters: System prompt, Prefill, Top-K, Top-P, Temperature, Num Beams, Use Samplers, Thinking Mode, Keep Thinking Trace, Keep Prefill
- Add new "caption" section for caption-specific settings: VLM, OpenCLiP, Tagger tab labels and all their parameters, including thresholds, tag formatting, and batch options
- Consolidate accordion labels in ui_caption.py: "Caption: Advanced Options" and "Caption: Batch" shared across VLM, OpenCLiP, and Tagger tabs (localized to "Advanced Options" and "Batch" in the UI)
- Remove duplicate entries from the missing section
Add comprehensive caption/interrogate API with documentation:
- GET /sdapi/v1/interrogate: List available interrogation models
- POST /sdapi/v1/interrogate: Interrogate with OpenCLIP/BLIP/DeepDanbooru
- POST /sdapi/v1/vqa: Caption with Vision-Language Models (VLM)
- GET /sdapi/v1/vqa: List available VLM models
- POST /sdapi/v1/vqa/batch: Batch caption multiple images
- POST /sdapi/v1/tagger: Tag images with WaifuDiffusion/DeepBooru

Updates:
- Add detailed docstrings with usage examples
- Fix analyze_image response parsing for Gradio update dicts
- Add request/response models for all endpoints
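As a quick reference for the endpoints listed above, a minimal client sketch; the local address and the `image` field name are assumptions, not taken from the PR:

```python
import base64
import requests

BASE = "http://127.0.0.1:7860"  # assumed default local server address

# Read an image and encode it as base64 for the API payload
with open("example.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

# List available interrogation models
print(requests.get(f"{BASE}/sdapi/v1/interrogate").json())

# Interrogate with OpenCLIP; clip_model/mode follow ReqInterrogate, values are placeholders
payload = {"image": img, "clip_model": "ViT-L-14/openai", "mode": "fast"}
print(requests.post(f"{BASE}/sdapi/v1/interrogate", json=payload).json())
```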
- Remove unused paths import from deepbooru.py and openclip.py
- Use shared.opts.clip_models_path instead of hardcoded paths
Comprehensive test script for all Caption API endpoints:
- GET/POST /sdapi/v1/interrogate (OpenCLiP/DeepBooru)
- POST /sdapi/v1/vqa (VLM captioning)
- GET /sdapi/v1/vqa/models, /sdapi/v1/vqa/prompts
- POST /sdapi/v1/tagger
- GET /sdapi/v1/tagger/models

Usage: python cli/test-caption-api.py [--url URL] [--image PATH]
Add optional LLM generation parameters to the VQA API request model, allowing per-request override of settings:
- max_tokens, temperature, top_k, top_p, num_beams, do_sample
- thinking_mode, prefill, keep_thinking, keep_prefill

Changes:
- Add 10 new optional fields to ReqVQA model with descriptive docs
- Update get_kwargs() to support per-request overrides via singleton
- Add helper functions get_keep_thinking(), get_keep_prefill()
- Update post_vqa endpoint to pass generation kwargs
- Add _generation_overrides instance variable to VQA class
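A hedged sketch of what a per-request override could look like from a client; the generation fields come from the commit above, while the `image` field name, model name, and address are assumptions:

```python
import base64
import requests

with open("example.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

# Per-request overrides; omitted fields fall back to the server-side settings.
payload = {
    "image": img,                      # field name assumed
    "model": "Qwen2-VL",               # model name assumed
    "question": "Describe the image in detail",
    "max_tokens": 256,
    "temperature": 0.7,
    "top_k": 40,
    "top_p": 0.9,
    "do_sample": True,
    "thinking_mode": False,
}
res = requests.post("http://127.0.0.1:7860/sdapi/v1/vqa", json=payload, timeout=300)
print(res.json())
```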
Update API model field descriptions to match the hints in locale_en.json for consistency between UI and API documentation.

Updated models:
- ReqInterrogate: clip_model, blip_model, mode
- ReqVQA: model, question, system
- ReqTagger: model, threshold, character_threshold, max_tags, include_rating, sort_alpha, use_spaces, escape_brackets, exclude_tags, show_scores
Add prompt field to VQA endpoint and advanced settings to OpenCLIP endpoint to achieve full parity between UI and API capabilities.

VLM endpoint changes:
- Add prompt field for custom text input (required for the 'Use Prompt' task)
- Pass prompt to vqa.interrogate instead of a hardcoded empty string

OpenCLIP endpoint changes:
- Add 7 optional per-request override fields: min_length, max_length, chunk_size, min_flavors, max_flavors, flavor_count, num_beams
- Add get_clip_setting() helper for override support in openclip.py
- Apply overrides via update_interrogate_params() before interrogation

All new fields are optional with None defaults for backwards compatibility.
Add model architecture coverage tests:
- VQA model family detection for 19 architectures
- Florence special prompts test (<OD>, <OCR>, <CAPTION>, etc.)
- Moondream detection features test
- VQA architecture capabilities test
- Tagger model types and WD version comparison tests

Improve test validation:
- Add is_meaningful_answer() to reject responses like "."
- Verify parameters have an actual effect (not just that they are accepted)
- Show actual output traces in PASS/FAIL messages
- Fix prefill tests to verify keep_prefill behavior

Add configurable timeout:
- Default timeout increased to 300s for slow models
- Add --timeout CLI argument for customization

Other improvements:
- Add JoyCaption to recognized model families
- Reduce BLIP models to avoid reloading large models
- Better detection result validation for annotated images
- Fix get_keep_thinking() infinite recursion (was calling itself)
- Fix get_keep_prefill() infinite recursion (was calling itself)
- Fix Florence-2 to use beam search instead of sampling (sampling causes probability tensor errors with Florence-2)
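A simplified sketch of the bug class being fixed, not the actual code; the function signature and the option name `caption_keep_thinking` are assumptions:

```python
# Before (the bug): the helper returned itself and recursed forever.
# def get_keep_thinking(req):
#     return get_keep_thinking(req)

# After: use the per-request override when present, else the shared setting.
def get_keep_thinking(req, opts):
    if getattr(req, "keep_thinking", None) is not None:
        return req.keep_thinking
    return getattr(opts, "caption_keep_thinking", False)  # option name assumed
```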
DeepBooru/DeepDanbooru should only be accessed via the tagger endpoint. The interrogate endpoint is now exclusively for OpenCLIP/BLIP.
- Remove DeepDanbooru handling from post_interrogate
- Update docstring to reference the tagger endpoint for anime tagging
- Simplify code by removing if/else branching
- Update cli/api-interrogate.py to use /sdapi/v1/tagger for DeepBooru
- Handle tagger response format (scores dict or tags string)
- Remove DeepBooru test from interrogate endpoint tests
- Update API model descriptions to reference tagger for anime tagging
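For illustration, a possible client call against the tagger endpoint; the `image` field name, address, and model identifier are assumptions, while the remaining fields follow the ReqTagger model:

```python
import base64
import requests

with open("example.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

# Anime-style tagging now goes through the tagger endpoint rather than interrogate.
payload = {
    "image": img,              # field name assumed
    "model": "deepdanbooru",   # model identifier assumed
    "threshold": 0.5,
    "sort_alpha": True,
    "use_spaces": False,
}
res = requests.post("http://127.0.0.1:7860/sdapi/v1/tagger", json=payload).json()
# The response may be a scores dict or a plain tags string, per the CLI update above.
print(res)
```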
Move all caption-related modules from modules/interrogate/ to modules/caption/ for better naming consistency:
- Rename deepbooru, deepseek, joycaption, joytag, moondream3, openclip, tagger, vqa, vqa_detection, waifudiffusion modules
- Add new caption.py dispatcher module
- Remove old interrogate.py (functionality moved to caption.py)
Update all imports from modules.interrogate to modules.caption across:
- modules/shared.py, modules/shared_legacy.py
- modules/ui_caption.py, modules/ui_common.py
- modules/ui_control.py, modules/ui_control_helpers.py
- modules/ui_img2img.py, modules/ui_sections.py
- modules/ui_symbols.py, modules/ui_video_vlm.py
Update API endpoints and models for the caption module rename:
- modules/api/api.py: update imports and endpoint handlers
- modules/api/endpoints.py: update endpoint definitions
- modules/api/models.py: update request/response models
- Rename cli/api-interrogate.py to cli/api-caption.py
- Update cli/options.py, cli/process.py for new module paths
- Update cli/test-tagger.py for caption module imports
Update cli/test-caption-api.py:
- Update test structure for new caption API endpoints
- Fix Moondream gaze detection test prompt to use 'Detect Gaze' instead of 'Where is the person looking?' to match the handler trigger
- Improve test result categorization and tracking
- Update html/locale_en.json with caption-related strings
- Update README.md documentation
…prompts, fix gaze detection
- Remove caption_openclip_min_length from settings, API models, endpoints, and UI (the clip_interrogator library has no min_length support; the parameter was never functional)
- Split vlm_prompts_florence into base Florence prompts and PromptGen-only prompts (GENERATE_TAGS, Analyze, Mixed Caption require the MiaoshouAI PromptGen fine-tune)
- Add 'promptgen' category to the /vqa/prompts API endpoint
- Fix gaze detection: move the DETECT_GAZE check before the generic 'detect ' prefix to prevent "Detect Gaze" from matching as detect with target="Gaze"
- Update test suite: remove min_length tests, fix min_flavors to use mode='best', add acceptance-only notes, fix thinking trace detection, improve bracket/OCR tests, split Florence/PromptGen test coverage
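A minimal sketch of the dispatch-order fix described above; the function name and return shapes are illustrative, not the actual handler:

```python
def dispatch_task(prompt: str):
    p = prompt.strip().lower()
    if p == "detect gaze":                         # specific task must be checked first
        return ("gaze", None)
    if p.startswith("detect "):                    # generic detection with a named target
        return ("detect", prompt.strip()[len("detect "):])
    return ("caption", prompt)
```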
…tainability
Comprehensive review of modules/caption/ addressing memory management, consistency, and code quality.

Inference correctness:
- Add devices.inference_context() to _qwen(), _smol(), _sa2() handlers
- Remove redundant @torch.no_grad() decorator from joycaption predict()
- Remove dead dtype=torch.bfloat16 kwarg from the Florence loader

Memory management:
- Bound the moondream3 image cache with LRU eviction (max 8 entries)
- Replace fragile id(image) cache keys with a content-based md5 hash
- Add devices.torch_gc() after model loading in deepseek
- Move the deepbooru model to CPU before dropping the reference on unload
- Add external handler delegation to VQA.unload() (moondream3, joycaption, joytag, deepseek)
- Protect batch offload mutation with try/finally

Code deduplication:
- Extract strip_think_xml_tags() shared helper for Qwen/Gemma/SmolVLM
- Extract save_tags_to_file() into tagger.py from deepbooru and waifudiffusion

Documentation and clarity:
- Document deepseek global monkey-patches (LlamaFlashAttention2, attrdict)
- Document Florence task="task" as an intentional design choice
- Add vendored-code comment to joytag.py
- Document openclip direct .to() usage vs sd_models.move_model
- Comment model.eval() calls that are required (trust_remote_code, custom loaders) vs removed where redundant (standard from_pretrained)

API robustness:
- Add HTTP 422 error response for VQA caption error strings in API endpoints (post_vqa, _dispatch_vlm)
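A small sketch of the bounded, content-keyed cache described under memory management, assuming PIL images; this is illustrative, not the moondream3 code:

```python
import hashlib
from collections import OrderedDict

class ImageCache:
    """Bounded cache keyed by image content hash, with LRU eviction (limit of 8 per the commit)."""
    def __init__(self, max_entries: int = 8):
        self.max_entries = max_entries
        self._cache = OrderedDict()

    @staticmethod
    def key(image) -> str:
        # Content-based key: identical images share an entry, unlike id(image)
        return hashlib.md5(image.tobytes()).hexdigest()

    def get(self, image):
        k = self.key(image)
        if k in self._cache:
            self._cache.move_to_end(k)       # mark as most recently used
            return self._cache[k]
        return None

    def put(self, image, value):
        k = self.key(image)
        self._cache[k] = value
        self._cache.move_to_end(k)
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict the least recently used entry
```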
…ad support
- Add parse_florence_detections() and format_florence_response() to vqa_detection for handling Florence-2 detection output formats
- Add bypass_sdpa_hijacks() context manager to devices.py for models incompatible with SageAttention or other SDPA hijacks
- Add OpenCLIP model offload support when caption_offload is enabled
update_caption_params() was setting caption_max_length, chunk_size, and flavor_intermediate_count on the Interrogator instance, but the library reads them from self.config. The overrides were silently ignored.
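A sketch of the corrected behavior, assuming the clip_interrogator Config field names; the real helper may differ in shape:

```python
def update_caption_params(interrogator, max_length=None, chunk_size=None, flavor_intermediate_count=None):
    # Before: attributes were set on the Interrogator instance and silently ignored,
    # because the library reads these values from interrogator.config.
    if max_length is not None:
        interrogator.config.caption_max_length = max_length
    if chunk_size is not None:
        interrogator.config.chunk_size = chunk_size
    if flavor_intermediate_count is not None:
        interrogator.config.flavor_intermediate_count = flavor_intermediate_count
```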
…t, prefill tests
- Add use_safetensors=True to all 16 model from_pretrained calls to avoid downloading redundant .bin files alongside safetensors
- Add a device property to the JoyTag VisionModel so move_model can relocate it to CUDA (fixes 'ViT object has no attribute device')
- Fix Pix2Struct dtype mismatch by casting float inputs to the model dtype while preserving integer tensor types
- Patch AutoConfig.register with exist_ok=True during Ovis loading to handle duplicate aimv2 registration on model reload
- Detect Qwen VL fine-tune architecture from the config model_type instead of the repo name, fixing ToriiGate and similar third-party fine-tunes
- Change the UI default task from Short Caption to Normal Caption, and preserve it on model switch instead of resetting to Use Prompt
- Add dual-prefill testing across 5 VQA test methods using a shared _check_prefill helper
- Fix pre-existing ruff W605 in the strip_think_xml_tags docstring
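An illustrative sketch of the Pix2Struct dtype fix, assuming the inputs arrive as a dict of tensors from the processor; the helper name is hypothetical:

```python
import torch

def cast_inputs(inputs: dict, model_dtype: torch.dtype) -> dict:
    """Cast floating-point tensors to the model dtype; leave token ids and masks untouched."""
    out = {}
    for name, value in inputs.items():
        if torch.is_tensor(value) and value.is_floating_point():
            out[name] = value.to(model_dtype)
        else:
            out[name] = value  # integer tensors (input_ids, attention_mask) keep their dtype
    return out
```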
Move all caption/interrogate/tagger/VQA API code out of the monolithic endpoints.py and models.py into a new self-contained modules/api/caption.py, following the loras.py / nudenet.py self-registering pattern.
- Move 15 Pydantic models (ReqCaption, ResCaption, ReqVQA, ResVQA, ReqTagger, ResTagger, dispatch union types, etc.) from models.py
- Move 11 handler functions from endpoints.py
- Deduplicate ~150 lines via shared _do_openclip, _do_tagger, _do_vqa core functions called by both the direct and dispatch endpoints
- Add register_api() that registers all 8 caption routes
- Add promptgen field to ResVLMPrompts (bug fix: the handler returned it but the response model silently dropped it)
- Improve all endpoint docstrings and Field descriptions for the API docs
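A minimal sketch of the self-registering pattern, assuming a FastAPI app; the field names, defaults, and handler body are placeholders, not the actual modules/api/caption.py definitions:

```python
from fastapi import FastAPI
from pydantic import BaseModel, Field

class ReqTagger(BaseModel):
    image: str = Field(description="base64-encoded input image")               # field name assumed
    model: str = Field(default="wd-v1-4-convnext", description="tagger model") # default assumed
    threshold: float = Field(default=0.35, description="minimum tag confidence")

class ResTagger(BaseModel):
    caption: str = ""

def register_api(app: FastAPI):
    # the real module registers 8 caption routes and delegates to shared core functions
    @app.post("/sdapi/v1/tagger", response_model=ResTagger)
    def post_tagger(req: ReqTagger):
        return ResTagger(caption="")  # placeholder; the actual handler calls the tagger core
```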
- Add a _load_blip_model helper with an explicit cache_dir so downloads go to hfcache_dir instead of the default HF cache
- Pre-load the BLIP model/processor before creating the Interrogator config to control the download location and avoid redundant loads
- Set clip_model_path on the config for the CLIP model cache location
- Add cache_dir to Moondream model and tokenizer loading
- Rename the shadowing import in the waifudiffusion batch to avoid F823/E0606
- Fix import order in cli/api-caption.py (stdlib before third-party)
- Rename a local variable shadowing a function name in cli/api-caption.py
- Remove an unnecessary global statement in devices.bypass_sdpa_hijacks
- Remove superfluous SimpleNamespace import in cli/api-caption.py, use Map instead
- Drop the _ prefix from internal helper functions in modules/api/caption.py
- Move the DeepDanbooru model path to the top-level models folder instead of nesting under CLIP
Description
War & Peace in the description but at this point might as well go full hog.
Overhaul the caption/interrogation subsystem: rename modules/interrogate/ to modules/caption/, extract the caption API into a self-contained module (modules/api/caption.py), and improve VLM support with better offloading, inference fixes, and corrections to the model-aware prompt filtering used for the preset tasks. Full support for the WaifuDiffusion and DeepBooru (DeepDanbooru) taggers was added as well.
The goal is to make captioning a subsystem with a clean API layer, consistent naming, and robust handling, up to date with our code conventions (as much as I can manage anyway). Also got the docs for the API available under the /docs endpoint, and fixed an array of breaking bugs, less-breaking bugs, and things that just managed to miff me.

Notes
Module rename: modules/interrogate/ → modules/caption/
All caption-related modules are renamed from interrogate to caption. This is a breaking change for any external code importing from modules.interrogate. All internal imports, settings keys, and UI references are updated accordingly.
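For external code, the migration is a one-line import change; the module names below are taken from the rename list in the commits:

```python
# Before (no longer works after this PR):
# from modules.interrogate import vqa, tagger

# After:
from modules.caption import vqa, tagger
```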
Caption API extraction (modules/api/caption.py)
The caption API endpoints were previously scattered across modules/api/endpoints.py and modules/api/models.py. They are now consolidated into a single self-contained module, following the self-registering pattern used by loras.py and nudenet.py.
VLM improvements
UI changes
Other
Things to watch
Environment and Testing