Caption system overhaul V2 #4613

Merged
vladmandic merged 27 commits into vladmandic:dev from CalamitousFelicitousness:feat/caption-improvements-v2_backup
Feb 11, 2026

Conversation


CalamitousFelicitousness (Contributor) commented on Feb 2, 2026

Description

War & Peace of a description, but at this point might as well go whole hog.

Overhaul the caption/interrogation subsystem: rename modules/interrogate/ to modules/caption/, extract the caption API into a self-contained module (modules/api/caption.py), and improve VLM model support with better offloading, inference fixes, and fixes to the model-aware prompt filtering (for the preset tasks). Full support for WaifuDiffusion and DeepBooru (DeepDanbooru) taggers was added as well.

The goal is to make captioning a subsystem with a clean API layer, consistent naming, and robust handling, up to date with our code conventions (as much as I can manage anyway). Also got the docs for the API available under the /docs endpoint, and fixed an array of breaking bugs, less-breaking bugs, and things that just managed to miff me.

Notes

Module rename: modules/interrogate/ → modules/caption/

All caption-related modules are renamed from interrogate to caption. This is a breaking change for any external code importing from modules.interrogate. All internal imports, settings keys, and UI references are updated:

  • Settings keys: interrogate_* → caption_* (e.g. interrogate_vlm_model → caption_vlm_model, interrogate_offload → caption_offload)
  • Function names: vqa.interrogate() → vqa.caption()
  • Existing user config.json files with old interrogate_* keys will fall back to defaults; as discussed with @vladmandic, there is no migration, to save on code for a non-critical function.
  • Breaking compatibility also has the benefit (or curse) of acting as a smoke test for any weirdness in old-growth code that still interacted with interrogate. There was an instance or two of things injecting themselves halfway into the function for some reason.
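For any external code that imported the old module, the update is mechanical; a minimal before/after sketch (the keyword arguments shown are illustrative assumptions, not the actual signature):

```python
from PIL import Image
from modules.caption import vqa  # previously: from modules.interrogate import vqa

image = Image.open("test.jpg")
# vqa.interrogate() is now vqa.caption(); the keyword arguments below are hypothetical
answer = vqa.caption(image=image, model_name="Florence-2", question="Describe the image")
print(answer)
```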

Caption API extraction (modules/api/caption.py)

The caption API endpoints were previously scattered across modules/api/endpoints.py and modules/api/models.py. They are now consolidated into a single self-contained module with:

  • Direct endpoints: POST /sdapi/v1/openclip, POST /sdapi/v1/tagger, POST /sdapi/v1/vqa; each with backend-specific Pydantic request/response models
  • Unified dispatch: POST /sdapi/v1/caption; routes to any backend via a backend discriminator field
  • Discovery endpoints: GET /sdapi/v1/openclip, GET /sdapi/v1/vqa/models, GET /sdapi/v1/vqa/prompts, GET /sdapi/v1/tagger/models
  • Shared internal _do_openclip, _do_tagger, _do_vqa functions avoid duplication between direct and dispatch handlers
  • Old endpoint/model definitions removed from endpoints.py and models.py
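As a rough illustration of the unified dispatch route, a hedged sketch in Python (assumes a local server on the default port; the discriminator value and payload field names are assumptions, not the verified request model):

```python
import base64
import requests

with open("test.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Hypothetical payload: "backend" stands in for the discriminator field described above,
# and the remaining field names are illustrative.
payload = {"backend": "vqa", "image": image_b64, "question": "Describe the image"}
res = requests.post("http://127.0.0.1:7860/sdapi/v1/caption", json=payload, timeout=300)
print(res.json())
```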

VLM improvements

  • Model-aware prompt filtering fixed up and expanded: get_prompts_for_model() returns only the prompts supported by the selected model (e.g. Florence gets detection prompts, PromptGen gets the analysis and tagging prompts that were previously offered to the entire Florence-2 family, Moondream gets point/detect/gaze prompts)
  • Florence-2 detection parsing: Object detection, phrase grounding, OCR, and region proposal results are parsed into structured bounding-box data with optional annotated image overlay (via vqa_detection.py)
  • SDPA bypass: New devices.bypass_sdpa_hijacks() context manager to temporarily restore the original SDPA for models incompatible with SageAttention or other attention hijacks (a usage sketch follows this list)
  • Offload support: Caption models respect the caption_offload setting, moving models to CPU when not in use; previously many of them did not offload at all, or used deprecated methods
  • Moondream fixes: Fixed infinite recursion in Moondream3, disabled flex_attention to avoid a torch.compile hang, added cache_dir for downloads. I went a bit far with implementing the functionality, but it's now feature complete; thanks to caching support you can run multiple tasks on the same image (multiple detections, multiple different tasks) in milliseconds.
  • BLIP model loading: New _load_blip_model() helper pre-loads BLIP models with explicit cache_dir and sets them on the Interrogator config directly, avoiding redundant downloads to the default HF cache
  • DeepBooru endpoint migration: DeepBooru is now exclusively accessed via the /sdapi/v1/tagger endpoint (or the dispatcher), and removed from the openclip/interrogate endpoint for full consistency
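A minimal usage sketch for the SDPA bypass mentioned above, assuming a handler that wraps its forward pass in the new context manager (the model and generate call are hypothetical):

```python
from modules import devices

def run_with_original_sdpa(model, inputs):
    # Temporarily restore the original scaled_dot_product_attention for a model
    # that is incompatible with SageAttention or other attention hijacks.
    with devices.bypass_sdpa_hijacks():
        return model.generate(**inputs)  # hypothetical call; the real handlers differ per model
```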

UI changes

  • Default caption task changed from "Short Caption" to "Normal Caption"
  • Task dropdown dynamically updates available prompts when the VLM model selection changes; this had become wonky since it was first introduced, but has been tightened back up
  • Placeholder hints update per-task to guide user input; not new, but revisited and confirmed to work well
  • Tooltips and hints added to Caption tab controls
  • Localization strings updated in locale_en.json: added hints where still missing, and moved them into their own subsection for convenience.
  • Removed the Interrogate/Caption settings from the Settings menu entirely; the one remaining option (offloading) moved to the offloading section. Everything the user ever needs to set is right there in the snazzy UI.

Other

  • CLI tools updated: cli/api-interrogate.py renamed to cli/api-caption.py with updated endpoints
  • Test suite: cli/test-caption-api.py (3400 lines); comprehensive coverage for ALL caption API endpoints, parameter validation, edge cases, error handling, and deliberate error induction. Depending on how many models are available, that comes to 150+ tests, all live, verifying outputs.
  • modules/caption/caption.py: Thin facade module providing a unified entry point for the caption subsystem, so callers don't have to point at different internal functions directly

Things to watch

  • The interrogate_* → caption_* settings rename means existing user configs will lose their caption customizations on upgrade (they'll get defaults). No issues noted; the new settings in the Caption tab now save to config by default.
  • The modules/interrogate/ directory still exists after the rename (the old interrogate.py is deleted but the directory may linger if other files are present); verify clean state
  • cli/test-caption-api.py is large (3400 lines); it's a standalone test script, not a pytest suite

Environment and Testing

  • OS: Linux (WSL2, Ubuntu), Fedora, Windows 11, Android for mobile UI testing
  • GPU: NVIDIA RTX 3090 (CUDA 12.9), RTX 6000 Ada
  • Python: 3.12, 3.13
  • Tested: All three caption backends (OpenCLIP/BLIP, WaifuDiffusion/DeepBooru tagger, VLM) via both UI and API endpoints
  • Linting: ruff check and pylint on all modified files - I promise

Commits

Add comprehensive tooltips to Caption tab UI elements in locale_en.json:

- Add new "llm" section for shared LLM/VLM parameters:
  System prompt, Prefill, Top-K, Top-P, Temperature, Num Beams,
  Use Samplers, Thinking Mode, Keep Thinking Trace, Keep Prefill

- Add new "caption" section for caption-specific settings:
  VLM, OpenCLiP, Tagger tab labels and all their parameters
  including thresholds, tag formatting, batch options

- Consolidate accordion labels in ui_caption.py:
  "Caption: Advanced Options" and "Caption: Batch" shared across
  VLM, OpenCLiP, and Tagger tabs (localized to "Advanced Options"
  and "Batch" in UI)

- Remove duplicate entries from missing section

Add comprehensive caption/interrogate API with documentation:

- GET /sdapi/v1/interrogate: List available interrogation models
- POST /sdapi/v1/interrogate: Interrogate with OpenCLIP/BLIP/DeepDanbooru
- POST /sdapi/v1/vqa: Caption with Vision-Language Models (VLM)
- GET /sdapi/v1/vqa: List available VLM models
- POST /sdapi/v1/vqa/batch: Batch caption multiple images
- POST /sdapi/v1/tagger: Tag images with WaifuDiffusion/DeepBooru

Updates:
- Add detailed docstrings with usage examples
- Fix analyze_image response parsing for Gradio update dicts
- Add request/response models for all endpoints
- Remove unused paths import from deepbooru.py and openclip.py
- Use shared.opts.clip_models_path instead of hardcoded paths

Comprehensive test script for all Caption API endpoints:
- GET/POST /sdapi/v1/interrogate (OpenCLiP/DeepBooru)
- POST /sdapi/v1/vqa (VLM captioning)
- GET /sdapi/v1/vqa/models, /sdapi/v1/vqa/prompts
- POST /sdapi/v1/tagger
- GET /sdapi/v1/tagger/models

Usage: python cli/test-caption-api.py [--url URL] [--image PATH]

Add optional LLM generation parameters to the VQA API request model,
allowing per-request override of settings:

- max_tokens, temperature, top_k, top_p, num_beams, do_sample
- thinking_mode, prefill, keep_thinking, keep_prefill

Changes:
- Add 10 new optional fields to ReqVQA model with descriptive docs
- Update get_kwargs() to support per-request overrides via singleton
- Add helper functions get_keep_thinking(), get_keep_prefill()
- Update post_vqa endpoint to pass generation kwargs
- Add _generation_overrides instance variable to VQA class
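To show what the per-request overrides look like on the wire, a hedged example against the VQA endpoint (model and question are ReqVQA fields per this PR; the image field name and the model value are assumptions):

```python
import base64
import requests

with open("test.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "image": image_b64,               # assumed field name
    "model": "Qwen 2.5 VL",           # assumed model identifier
    "question": "Describe the image",
    # Optional per-request generation overrides added by this commit:
    "max_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "do_sample": True,
    "thinking_mode": False,
}
res = requests.post("http://127.0.0.1:7860/sdapi/v1/vqa", json=payload, timeout=300)
print(res.json())
```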
Update API model field descriptions to match the hints in locale_en.json
for consistency between UI and API documentation.

Updated models:
- ReqInterrogate: clip_model, blip_model, mode
- ReqVQA: model, question, system
- ReqTagger: model, threshold, character_threshold, max_tags,
  include_rating, sort_alpha, use_spaces, escape_brackets,
  exclude_tags, show_scores
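A compact request sketch exercising the ReqTagger fields listed above (the image field name and the model value are assumptions; the remaining fields are from the list):

```python
import base64
import requests

with open("test.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "image": image_b64,                    # assumed field name
    "model": "wd-v1-4-swinv2-tagger-v2",   # assumed model identifier
    "threshold": 0.35,
    "character_threshold": 0.85,
    "max_tags": 50,
    "include_rating": True,
    "sort_alpha": False,
    "use_spaces": True,
    "escape_brackets": True,
    "exclude_tags": "lowres, jpeg artifacts",
    "show_scores": False,
}
print(requests.post("http://127.0.0.1:7860/sdapi/v1/tagger", json=payload, timeout=300).json())
```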
Add prompt field to VQA endpoint and advanced settings to OpenCLIP endpoint
to achieve full parity between UI and API capabilities.

VLM endpoint changes:
- Add prompt field for custom text input (required for 'Use Prompt' task)
- Pass prompt to vqa.interrogate instead of hardcoded empty string

OpenCLIP endpoint changes:
- Add 7 optional per-request override fields: min_length, max_length,
  chunk_size, min_flavors, max_flavors, flavor_count, num_beams
- Add get_clip_setting() helper for override support in openclip.py
- Apply overrides via update_interrogate_params() before interrogation

All new fields are optional with None defaults for backwards compatibility.
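The override plumbing can be pictured as a small resolver that prefers a per-request value and otherwise falls back to the stored option; a hypothetical sketch (the actual get_clip_setting() in openclip.py may be shaped differently, and the option key in the usage comment is an assumption):

```python
from modules import shared

def get_clip_setting(name: str, override=None):
    """Illustrative only: return the per-request override when given,
    otherwise fall back to the corresponding option in shared.opts."""
    if override is not None:
        return override
    return getattr(shared.opts, name, None)

# hypothetical usage: num_beams = get_clip_setting("caption_openclip_num_beams", req.num_beams)
```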
Add model architecture coverage tests:
- VQA model family detection for 19 architectures
- Florence special prompts test (<OD>, <OCR>, <CAPTION>, etc.)
- Moondream detection features test
- VQA architecture capabilities test
- Tagger model types and WD version comparison tests

Improve test validation:
- Add is_meaningful_answer() to reject responses like "."
- Verify parameters have actual effect (not just accepted)
- Show actual output traces in PASS/FAIL messages
- Fix prefill tests to verify keep_prefill behavior

Add configurable timeout:
- Default timeout increased to 300s for slow models
- Add --timeout CLI argument for customization

Other improvements:
- Add JoyCaption to recognized model families
- Reduce BLIP models to avoid reloading large models
- Better detection result validation for annotated images
- Fix get_keep_thinking() infinite recursion (was calling itself)
- Fix get_keep_prefill() infinite recursion (was calling itself)
- Fix Florence-2 to use beam search instead of sampling
  Sampling causes probability tensor errors with Florence-2

DeepBooru/DeepDanbooru should only be accessed via the tagger endpoint.
The interrogate endpoint is now exclusively for OpenCLIP/BLIP.

- Remove DeepDanbooru handling from post_interrogate
- Update docstring to reference tagger endpoint for anime tagging
- Simplify code by removing if/else branching
- Update cli/api-interrogate.py to use /sdapi/v1/tagger for DeepBooru
- Handle tagger response format (scores dict or tags string)
- Remove DeepBooru test from interrogate endpoint tests
- Update API model descriptions to reference tagger for anime tagging

Move all caption-related modules from modules/interrogate/ to modules/caption/
for better naming consistency:
- Rename deepbooru, deepseek, joycaption, joytag, moondream3, openclip, tagger,
  vqa, vqa_detection, waifudiffusion modules
- Add new caption.py dispatcher module
- Remove old interrogate.py (functionality moved to caption.py)

Update all imports from modules.interrogate to modules.caption across:
- modules/shared.py, modules/shared_legacy.py
- modules/ui_caption.py, modules/ui_common.py
- modules/ui_control.py, modules/ui_control_helpers.py
- modules/ui_img2img.py, modules/ui_sections.py
- modules/ui_symbols.py, modules/ui_video_vlm.py

Update API endpoints and models for caption module rename:
- modules/api/api.py - update imports and endpoint handlers
- modules/api/endpoints.py - update endpoint definitions
- modules/api/models.py - update request/response models
- Rename cli/api-interrogate.py to cli/api-caption.py
- Update cli/options.py, cli/process.py for new module paths
- Update cli/test-tagger.py for caption module imports

Update cli/test-caption-api.py:
- Update test structure for new caption API endpoints
- Fix Moondream gaze detection test prompt to use 'Detect Gaze'
  instead of 'Where is the person looking?' to match handler trigger
- Improve test result categorization and tracking
- Update html/locale_en.json with caption-related strings
- Update README.md documentation

…prompts, fix gaze detection

- Remove caption_openclip_min_length from settings, API models, endpoints, and UI
  (clip_interrogator library has no min_length support; parameter was never functional)
- Split vlm_prompts_florence into base Florence prompts and PromptGen-only prompts
  (GENERATE_TAGS, Analyze, Mixed Caption require MiaoshouAI PromptGen fine-tune)
- Add 'promptgen' category to /vqa/prompts API endpoint
- Fix gaze detection: move DETECT_GAZE check before generic 'detect ' prefix
  to prevent "Detect Gaze" matching as detect target="Gaze"
- Update test suite: remove min_length tests, fix min_flavors to use mode='best',
  add acceptance-only notes, fix thinking trace detection, improve bracket/OCR tests,
  split Florence/PromptGen test coverage
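The gaze fix is purely about check ordering: the exact "Detect Gaze" task must be matched before the generic "detect " prefix, or the prefix rule swallows it as a detection with target "Gaze". An illustrative sketch of the idea (names are illustrative, not the actual handler code):

```python
def route_moondream_prompt(question: str):
    """Illustrative only: order-sensitive routing of Moondream preset tasks."""
    q = question.strip().lower()
    if q == "detect gaze":           # must be checked before the generic "detect " prefix
        return ("gaze", None)
    if q.startswith("detect "):      # generic object detection: "detect <target>"
        return ("detect", question[len("detect "):].strip())
    if q.startswith("point "):
        return ("point", question[len("point "):].strip())
    return ("query", question)
```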
…tainability

Comprehensive review of modules/caption/ addressing memory management,
consistency, and code quality:

Inference correctness:
- Add devices.inference_context() to _qwen(), _smol(), _sa2() handlers
- Remove redundant @torch.no_grad() decorator from joycaption predict()
- Remove dead dtype=torch.bfloat16 kwarg from Florence loader

Memory management:
- Bound moondream3 image cache with LRU eviction (max 8 entries)
- Replace fragile id(image) cache keys with content-based md5 hash
- Add devices.torch_gc() after model loading in deepseek
- Move deepbooru model to CPU before dropping reference on unload
- Add external handler delegation to VQA.unload() (moondream3,
  joycaption, joytag, deepseek)
- Protect batch offload mutation with try/finally

Code deduplication:
- Extract strip_think_xml_tags() shared helper for Qwen/Gemma/SmolVLM
- Extract save_tags_to_file() into tagger.py from deepbooru and
  waifudiffusion

Documentation and clarity:
- Document deepseek global monkey-patches (LlamaFlashAttention2, attrdict)
- Document Florence task="task" as intentional design choice
- Add vendored-code comment to joytag.py
- Document openclip direct .to() usage vs sd_models.move_model
- Comment model.eval() calls that are required (trust_remote_code,
  custom loaders) vs removed where redundant (standard from_pretrained)

API robustness:
- Add HTTP 422 error response for VQA caption error strings in API
  endpoints (post_vqa, _dispatch_vlm)
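The bounded image cache amounts to an ordered mapping keyed by a content hash, evicting the oldest entry past 8 items; a sketch under those assumptions (the real moondream3 handler may differ in detail):

```python
import hashlib
from collections import OrderedDict

MAX_CACHE = 8
_image_cache: OrderedDict[str, object] = OrderedDict()

def _cache_key(image) -> str:
    # Content-based key instead of id(image), so identical images hit the cache
    return hashlib.md5(image.tobytes()).hexdigest()

def get_encoded(image, encode_fn):
    key = _cache_key(image)
    if key in _image_cache:
        _image_cache.move_to_end(key)   # mark as most recently used
        return _image_cache[key]
    value = encode_fn(image)
    _image_cache[key] = value
    if len(_image_cache) > MAX_CACHE:   # LRU eviction of the oldest entry
        _image_cache.popitem(last=False)
    return value
```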
…ad support

- Add parse_florence_detections() and format_florence_response() to
  vqa_detection for handling Florence-2 detection output formats
- Add bypass_sdpa_hijacks() context manager to devices.py for models
  incompatible with SageAttention or other SDPA hijacks
- Add OpenCLIP model offload support when caption_offload is enabled

update_caption_params() was setting caption_max_length, chunk_size, and
flavor_intermediate_count on the Interrogator instance, but the library
reads them from self.config. The overrides were silently ignored.
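In other words, the clip_interrogator library reads these values from the Interrogator's config object rather than the instance, so the assignments have to land there; a minimal before/after fragment (the interrogator and value variables are illustrative):

```python
# Before: silently ignored, because the library never reads these off the instance
# interrogator.caption_max_length = max_length
# interrogator.chunk_size = chunk_size
# interrogator.flavor_intermediate_count = flavor_count

# After: write them to the config the library actually consults
interrogator.config.caption_max_length = max_length
interrogator.config.chunk_size = chunk_size
interrogator.config.flavor_intermediate_count = flavor_count
```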
…t, prefill tests

- Add use_safetensors=True to all 16 model from_pretrained calls to
  avoid downloading redundant .bin files alongside safetensors
- Add device property to JoyTag VisionModel so move_model can relocate
  it to CUDA (fixes 'ViT object has no attribute device')
- Fix Pix2Struct dtype mismatch by casting float inputs to model dtype
  while preserving integer tensor types
- Patch AutoConfig.register with exist_ok=True during Ovis loading to
  handle duplicate aimv2 registration on model reload
- Detect Qwen VL fine-tune architecture from config model_type instead
  of repo name, fixing ToriiGate and similar third-party fine-tunes
- Change UI default task from Short Caption to Normal Caption, and
  preserve it on model switch instead of resetting to Use Prompt
- Add dual-prefill testing across 5 VQA test methods using a shared
  _check_prefill helper
- Fix pre-existing ruff W605 in strip_think_xml_tags docstring
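The Pix2Struct dtype fix boils down to casting only floating-point inputs to the model dtype while leaving integer tensors (token ids, attention masks) untouched; a hedged sketch:

```python
import torch

def match_model_dtype(inputs: dict, model_dtype: torch.dtype) -> dict:
    """Illustrative helper: cast float tensors to the model dtype, keep integer tensors as-is."""
    return {
        k: (v.to(dtype=model_dtype) if torch.is_floating_point(v) else v)
        for k, v in inputs.items()
    }

# hypothetical usage: inputs = match_model_dtype(dict(processor(images=image, return_tensors="pt")), model.dtype)
```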
Move all caption/interrogate/tagger/VQA API code out of the monolithic
endpoints.py and models.py into a new self-contained modules/api/caption.py,
following the loras.py / nudenet.py self-registering pattern.

- Move 15 Pydantic models (ReqCaption, ResCaption, ReqVQA, ResVQA,
  ReqTagger, ResTagger, dispatch union types, etc.) from models.py
- Move 11 handler functions from endpoints.py
- Deduplicate ~150 lines via shared _do_openclip, _do_tagger, _do_vqa
  core functions called by both direct and dispatch endpoints
- Add register_api() that registers all 8 caption routes
- Add promptgen field to ResVLMPrompts (bug fix: handler returned it
  but response model silently dropped it)
- Improve all endpoint docstrings and Field descriptions for API docs
- Add _load_blip_model helper with explicit cache_dir so downloads
  go to hfcache_dir instead of default HF cache
- Pre-load BLIP model/processor before creating Interrogator config
  to control download location and avoid redundant loads
- Set clip_model_path on config for CLIP model cache location
- Add cache_dir to Moondream model and tokenizer loading
- Rename shadowing import in waifudiffusion batch to avoid F823/E0606
- Fix import order in cli/api-caption.py (stdlib before third-party)
- Rename local variable shadowing function name in cli/api-caption.py
- Remove unnecessary global statement in devices.bypass_sdpa_hijacks
- Remove superfluous SimpleNamespace import in cli/api-caption.py, use Map instead
- Drop _ prefix from internal helper functions in modules/api/caption.py
- Move DeepDanbooru model path to top-level models folder instead of nesting under CLIP
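The self-registering pattern referenced above (as in loras.py / nudenet.py) roughly means the module exposes a single hook that attaches its routes to the FastAPI app; a hypothetical sketch with abbreviated routes (handler names other than post_vqa are assumptions):

```python
from fastapi import FastAPI

def register_api(app: FastAPI):
    # Hypothetical sketch: each caption route is registered by the module itself
    app.add_api_route("/sdapi/v1/caption", post_caption, methods=["POST"])
    app.add_api_route("/sdapi/v1/vqa", post_vqa, methods=["POST"])
    app.add_api_route("/sdapi/v1/vqa/models", get_vqa_models, methods=["GET"])
    app.add_api_route("/sdapi/v1/tagger", post_tagger, methods=["POST"])
    # ... remaining caption routes registered the same way
```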
vladmandic (Owner) left a comment

lgtm

vladmandic merged commit 2c4d075 into vladmandic:dev on Feb 11, 2026
2 checks passed