Caption system overhaul V2 #4613

Merged
vladmandic merged 27 commits into vladmandic:dev from CalamitousFelicitousness:feat/caption-improvements-v2_backup
Feb 11, 2026

Conversation


CalamitousFelicitousness (Contributor) commented on Feb 2, 2026

Description

War & Peace of a description, but at this point might as well go whole hog.

Overhaul the caption/interrogation subsystem: rename modules/interrogate/ to modules/caption/, extract the caption API into a self-contained module (modules/api/caption.py), and improve VLM model support with better offloading, inference fixes, and fixes to the model-aware prompt filtering (for the preset tasks). Full support for WaifuDiffusion and DeepBooru (DeepDanbooru) taggers was added as well.

The goal is to make captioning a subsystem with a clean API layer, consistent naming, and robust handling, up to date with our code conventions (as much as I can manage anyway). Also got the docs for the API available under the /docs endpoint, and fixed an array of breaking bugs, less-breaking bugs, and things that just managed to miff me.

Notes

Module rename: modules/interrogate/ → modules/caption/

All caption-related modules are renamed from interrogate to caption. This is a breaking change for any external code importing from modules.interrogate. All internal imports, settings keys, and UI references are updated:

  • Settings keys: interrogate_* → caption_* (e.g. interrogate_vlm_model → caption_vlm_model, interrogate_offload → caption_offload)
  • Function names: vqa.interrogate() → vqa.caption()
  • Existing user config.json files with old interrogate_* keys will fall back to defaults; as discussed with @vladmandic, there is no migration, to save on code for a non-critical function.
  • Breaking compatibility also has the benefit (or curse) of acting as a smoke test for any weirdness in old-growth code that still interacted with interrogate. There was an instance or two of things injecting themselves halfway into the function for some reason.
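For any external code that imported the old module, the update is mechanical; a minimal before/after sketch (the keyword arguments shown are illustrative assumptions, not the actual signature):

```python
from PIL import Image
from modules.caption import vqa  # previously: from modules.interrogate import vqa

image = Image.open("test.jpg")
# vqa.interrogate() is now vqa.caption(); the keyword arguments below are hypothetical
answer = vqa.caption(image=image, model_name="Florence-2", question="Describe the image")
print(answer)
```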

Caption API extraction (modules/api/caption.py)

The caption API endpoints were previously scattered across modules/api/endpoints.py and modules/api/models.py. They are now consolidated into a single self-contained module with:

  • Direct endpoints: POST /sdapi/v1/openclip, POST /sdapi/v1/tagger, POST /sdapi/v1/vqa; each with backend-specific Pydantic request/response models
  • Unified dispatch: POST /sdapi/v1/caption; routes to any backend via a backend discriminator field
  • Discovery endpoints: GET /sdapi/v1/openclip, GET /sdapi/v1/vqa/models, GET /sdapi/v1/vqa/prompts, GET /sdapi/v1/tagger/models
  • Shared internal _do_openclip, _do_tagger, _do_vqa functions avoid duplication between direct and dispatch handlers
  • Old endpoint/model definitions removed from endpoints.py and models.py
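As a rough illustration of the unified dispatch route, a hedged sketch in Python (assumes a local server on the default port; the discriminator value and payload field names are assumptions, not the verified request model):

```python
import base64
import requests

with open("test.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Hypothetical payload: "backend" stands in for the discriminator field described above,
# and the remaining field names are illustrative.
payload = {"backend": "vqa", "image": image_b64, "question": "Describe the image"}
res = requests.post("http://127.0.0.1:7860/sdapi/v1/caption", json=payload, timeout=300)
print(res.json())
```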

VLM improvements

  • Model-aware prompt filtering fixed up and expanded: get_prompts_for_model() returns only the prompts supported by the selected model (e.g. Florence gets detection prompts, PromptGen gets the analysis and tagging prompts that were previously offered to the entire Florence-2 family, Moondream gets point/detect/gaze prompts)
  • Florence-2 detection parsing: Object detection, phrase grounding, OCR, and region proposal results are parsed into structured bounding-box data with optional annotated image overlay (via vqa_detection.py)
  • SDPA bypass: New devices.bypass_sdpa_hijacks() context manager to temporarily restore the original SDPA for models incompatible with SageAttention or other attention hijacks (a usage sketch follows this list)
  • Offload support: Caption models respect the caption_offload setting, moving models to CPU when not in use; previously many of them did not offload at all, or used deprecated methods
  • Moondream fixes: Fixed infinite recursion in Moondream3, disabled flex_attention to avoid a torch.compile hang, added cache_dir for downloads. I went a bit far with implementing the functionality, but it's now feature complete; thanks to caching support you can run multiple tasks on the same image (multiple detections, multiple different tasks) in milliseconds.
  • BLIP model loading: New _load_blip_model() helper pre-loads BLIP models with explicit cache_dir and sets them on the Interrogator config directly, avoiding redundant downloads to the default HF cache
  • DeepBooru endpoint migration: DeepBooru is now exclusively accessed via the /sdapi/v1/tagger endpoint (or the dispatcher), and removed from the openclip/interrogate endpoint for full consistency
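A minimal usage sketch for the SDPA bypass mentioned above, assuming a handler that wraps its forward pass in the new context manager (the model and generate call are hypothetical):

```python
from modules import devices

def run_with_original_sdpa(model, inputs):
    # Temporarily restore the original scaled_dot_product_attention for a model
    # that is incompatible with SageAttention or other attention hijacks.
    with devices.bypass_sdpa_hijacks():
        return model.generate(**inputs)  # hypothetical call; the real handlers differ per model
```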

UI changes

  • Default caption task changed from "Short Caption" to "Normal Caption"
  • Task dropdown dynamically updates available prompts when the VLM model selection changes; this had become wonky since it was first introduced, but has been tightened back up
  • Placeholder hints update per-task to guide user input; not new, but revisited and confirmed to work well
  • Tooltips and hints added to Caption tab controls
  • Localization strings updated in locale_en.json: added hints where still missing, and moved them into their own subsection for convenience.
  • Removed the Interrogate/Caption settings from the Settings menu entirely; the one remaining option (offloading) moved to the offloading section. Everything the user ever needs to set is right there in the snazzy UI.

Other

  • CLI tools updated: cli/api-interrogate.py renamed to cli/api-caption.py with updated endpoints
  • Test suite: cli/test-caption-api.py (3400 lines); comprehensive coverage for ALL caption API endpoints, parameter validation, edge cases, error handling, and deliberate error induction. Depending on how many models are available, that comes to 150+ tests, all live, verifying outputs.
  • modules/caption/caption.py: Thin facade module providing a unified entry point for the caption subsystem, so callers don't have to point at different internal functions directly

Things to watch

  • The interrogate_* → caption_* settings rename means existing user configs will lose their caption customizations on upgrade (they'll get defaults). No issues noted; the new settings in the Caption tab now save to config by default.
  • The modules/interrogate/ directory still exists after the rename (the old interrogate.py is deleted but the directory may linger if other files are present); verify clean state
  • cli/test-caption-api.py is large (3400 lines); it's a standalone test script, not a pytest suite

Environment and Testing

  • OS: Linux (WSL2, Ubuntu), Fedora, Windows 11, Android for mobile UI testing
  • GPU: NVIDIA RTX 3090 (CUDA 12.9), RTX 6000 Ada
  • Python: 3.12, 3.13
  • Tested: All three caption backends (OpenCLIP/BLIP, WaifuDiffusion/DeepBooru tagger, VLM) via both UI and API endpoints
  • Linting: ruff check and pylint on all modified files - I promise

Commits

Add comprehensive tooltips to Caption tab UI elements in locale_en.json:

- Add new "llm" section for shared LLM/VLM parameters:
  System prompt, Prefill, Top-K, Top-P, Temperature, Num Beams,
  Use Samplers, Thinking Mode, Keep Thinking Trace, Keep Prefill

- Add new "caption" section for caption-specific settings:
  VLM, OpenCLiP, Tagger tab labels and all their parameters
  including thresholds, tag formatting, batch options

- Consolidate accordion labels in ui_caption.py:
  "Caption: Advanced Options" and "Caption: Batch" shared across
  VLM, OpenCLiP, and Tagger tabs (localized to "Advanced Options"
  and "Batch" in UI)

- Remove duplicate entries from missing section

Add comprehensive caption/interrogate API with documentation:

- GET /sdapi/v1/interrogate: List available interrogation models
- POST /sdapi/v1/interrogate: Interrogate with OpenCLIP/BLIP/DeepDanbooru
- POST /sdapi/v1/vqa: Caption with Vision-Language Models (VLM)
- GET /sdapi/v1/vqa: List available VLM models
- POST /sdapi/v1/vqa/batch: Batch caption multiple images
- POST /sdapi/v1/tagger: Tag images with WaifuDiffusion/DeepBooru

Updates:
- Add detailed docstrings with usage examples
- Fix analyze_image response parsing for Gradio update dicts
- Add request/response models for all endpoints
- Remove unused paths import from deepbooru.py and openclip.py
- Use shared.opts.clip_models_path instead of hardcoded paths

Comprehensive test script for all Caption API endpoints:
- GET/POST /sdapi/v1/interrogate (OpenCLiP/DeepBooru)
- POST /sdapi/v1/vqa (VLM captioning)
- GET /sdapi/v1/vqa/models, /sdapi/v1/vqa/prompts
- POST /sdapi/v1/tagger
- GET /sdapi/v1/tagger/models

Usage: python cli/test-caption-api.py [--url URL] [--image PATH]

Add optional LLM generation parameters to the VQA API request model,
allowing per-request override of settings:

- max_tokens, temperature, top_k, top_p, num_beams, do_sample
- thinking_mode, prefill, keep_thinking, keep_prefill

Changes:
- Add 10 new optional fields to ReqVQA model with descriptive docs
- Update get_kwargs() to support per-request overrides via singleton
- Add helper functions get_keep_thinking(), get_keep_prefill()
- Update post_vqa endpoint to pass generation kwargs
- Add _generation_overrides instance variable to VQA class
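To show what the per-request overrides look like on the wire, a hedged example against the VQA endpoint (model and question are ReqVQA fields per this PR; the image field name and the model value are assumptions):

```python
import base64
import requests

with open("test.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "image": image_b64,               # assumed field name
    "model": "Qwen 2.5 VL",           # assumed model identifier
    "question": "Describe the image",
    # Optional per-request generation overrides added by this commit:
    "max_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "do_sample": True,
    "thinking_mode": False,
}
res = requests.post("http://127.0.0.1:7860/sdapi/v1/vqa", json=payload, timeout=300)
print(res.json())
```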
Update API model field descriptions to match the hints in locale_en.json
for consistency between UI and API documentation.

Updated models:
- ReqInterrogate: clip_model, blip_model, mode
- ReqVQA: model, question, system
- ReqTagger: model, threshold, character_threshold, max_tags,
  include_rating, sort_alpha, use_spaces, escape_brackets,
  exclude_tags, show_scores
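A compact request sketch exercising the ReqTagger fields listed above (the image field name and the model value are assumptions; the remaining fields are from the list):

```python
import base64
import requests

with open("test.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "image": image_b64,                    # assumed field name
    "model": "wd-v1-4-swinv2-tagger-v2",   # assumed model identifier
    "threshold": 0.35,
    "character_threshold": 0.85,
    "max_tags": 50,
    "include_rating": True,
    "sort_alpha": False,
    "use_spaces": True,
    "escape_brackets": True,
    "exclude_tags": "lowres, jpeg artifacts",
    "show_scores": False,
}
print(requests.post("http://127.0.0.1:7860/sdapi/v1/tagger", json=payload, timeout=300).json())
```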
Add prompt field to VQA endpoint and advanced settings to OpenCLIP endpoint
to achieve full parity between UI and API capabilities.

VLM endpoint changes:
- Add prompt field for custom text input (required for 'Use Prompt' task)
- Pass prompt to vqa.interrogate instead of hardcoded empty string

OpenCLIP endpoint changes:
- Add 7 optional per-request override fields: min_length, max_length,
  chunk_size, min_flavors, max_flavors, flavor_count, num_beams
- Add get_clip_setting() helper for override support in openclip.py
- Apply overrides via update_interrogate_params() before interrogation

All new fields are optional with None defaults for backwards compatibility.
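The override plumbing can be pictured as a small resolver that prefers a per-request value and otherwise falls back to the stored option; a hypothetical sketch (the actual get_clip_setting() in openclip.py may be shaped differently, and the option key in the usage comment is an assumption):

```python
from modules import shared

def get_clip_setting(name: str, override=None):
    """Illustrative only: return the per-request override when given,
    otherwise fall back to the corresponding option in shared.opts."""
    if override is not None:
        return override
    return getattr(shared.opts, name, None)

# hypothetical usage: num_beams = get_clip_setting("caption_openclip_num_beams", req.num_beams)
```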
Add model architecture coverage tests:
- VQA model family detection for 19 architectures
- Florence special prompts test (<OD>, <OCR>, <CAPTION>, etc.)
- Moondream detection features test
- VQA architecture capabilities test
- Tagger model types and WD version comparison tests

Improve test validation:
- Add is_meaningful_answer() to reject responses like "."
- Verify parameters have actual effect (not just accepted)
- Show actual output traces in PASS/FAIL messages
- Fix prefill tests to verify keep_prefill behavior

Add configurable timeout:
- Default timeout increased to 300s for slow models
- Add --timeout CLI argument for customization

Other improvements:
- Add JoyCaption to recognized model families
- Reduce BLIP models to avoid reloading large models
- Better detection result validation for annotated images
- Fix get_keep_thinking() infinite recursion (was calling itself)
- Fix get_keep_prefill() infinite recursion (was calling itself)
- Fix Florence-2 to use beam search instead of sampling
  Sampling causes probability tensor errors with Florence-2

DeepBooru/DeepDanbooru should only be accessed via the tagger endpoint.
The interrogate endpoint is now exclusively for OpenCLIP/BLIP.

- Remove DeepDanbooru handling from post_interrogate
- Update docstring to reference tagger endpoint for anime tagging
- Simplify code by removing if/else branching
- Update cli/api-interrogate.py to use /sdapi/v1/tagger for DeepBooru
- Handle tagger response format (scores dict or tags string)
- Remove DeepBooru test from interrogate endpoint tests
- Update API model descriptions to reference tagger for anime tagging

Move all caption-related modules from modules/interrogate/ to modules/caption/
for better naming consistency:
- Rename deepbooru, deepseek, joycaption, joytag, moondream3, openclip, tagger,
  vqa, vqa_detection, waifudiffusion modules
- Add new caption.py dispatcher module
- Remove old interrogate.py (functionality moved to caption.py)

Update all imports from modules.interrogate to modules.caption across:
- modules/shared.py, modules/shared_legacy.py
- modules/ui_caption.py, modules/ui_common.py
- modules/ui_control.py, modules/ui_control_helpers.py
- modules/ui_img2img.py, modules/ui_sections.py
- modules/ui_symbols.py, modules/ui_video_vlm.py

Update API endpoints and models for caption module rename:
- modules/api/api.py - update imports and endpoint handlers
- modules/api/endpoints.py - update endpoint definitions
- modules/api/models.py - update request/response models
- Rename cli/api-interrogate.py to cli/api-caption.py
- Update cli/options.py, cli/process.py for new module paths
- Update cli/test-tagger.py for caption module imports

Update cli/test-caption-api.py:
- Update test structure for new caption API endpoints
- Fix Moondream gaze detection test prompt to use 'Detect Gaze'
  instead of 'Where is the person looking?' to match handler trigger
- Improve test result categorization and tracking
- Update html/locale_en.json with caption-related strings
- Update README.md documentation

…prompts, fix gaze detection

- Remove caption_openclip_min_length from settings, API models, endpoints, and UI
  (clip_interrogator library has no min_length support; parameter was never functional)
- Split vlm_prompts_florence into base Florence prompts and PromptGen-only prompts
  (GENERATE_TAGS, Analyze, Mixed Caption require MiaoshouAI PromptGen fine-tune)
- Add 'promptgen' category to /vqa/prompts API endpoint
- Fix gaze detection: move DETECT_GAZE check before generic 'detect ' prefix
  to prevent "Detect Gaze" matching as detect target="Gaze"
- Update test suite: remove min_length tests, fix min_flavors to use mode='best',
  add acceptance-only notes, fix thinking trace detection, improve bracket/OCR tests,
  split Florence/PromptGen test coverage
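The gaze fix is purely about check ordering: the exact "Detect Gaze" task must be matched before the generic "detect " prefix, or the prefix rule swallows it as a detection with target "Gaze". An illustrative sketch of the idea (names are illustrative, not the actual handler code):

```python
def route_moondream_prompt(question: str):
    """Illustrative only: order-sensitive routing of Moondream preset tasks."""
    q = question.strip().lower()
    if q == "detect gaze":           # must be checked before the generic "detect " prefix
        return ("gaze", None)
    if q.startswith("detect "):      # generic object detection: "detect <target>"
        return ("detect", question[len("detect "):].strip())
    if q.startswith("point "):
        return ("point", question[len("point "):].strip())
    return ("query", question)
```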
…tainability

Comprehensive review of modules/caption/ addressing memory management,
consistency, and code quality:

Inference correctness:
- Add devices.inference_context() to _qwen(), _smol(), _sa2() handlers
- Remove redundant @torch.no_grad() decorator from joycaption predict()
- Remove dead dtype=torch.bfloat16 kwarg from Florence loader

Memory management:
- Bound moondream3 image cache with LRU eviction (max 8 entries)
- Replace fragile id(image) cache keys with content-based md5 hash
- Add devices.torch_gc() after model loading in deepseek
- Move deepbooru model to CPU before dropping reference on unload
- Add external handler delegation to VQA.unload() (moondream3,
  joycaption, joytag, deepseek)
- Protect batch offload mutation with try/finally

Code deduplication:
- Extract strip_think_xml_tags() shared helper for Qwen/Gemma/SmolVLM
- Extract save_tags_to_file() into tagger.py from deepbooru and
  waifudiffusion

Documentation and clarity:
- Document deepseek global monkey-patches (LlamaFlashAttention2, attrdict)
- Document Florence task="task" as intentional design choice
- Add vendored-code comment to joytag.py
- Document openclip direct .to() usage vs sd_models.move_model
- Comment model.eval() calls that are required (trust_remote_code,
  custom loaders) vs removed where redundant (standard from_pretrained)

API robustness:
- Add HTTP 422 error response for VQA caption error strings in API
  endpoints (post_vqa, _dispatch_vlm)
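The bounded image cache amounts to an ordered mapping keyed by a content hash, evicting the oldest entry past 8 items; a sketch under those assumptions (the real moondream3 handler may differ in detail):

```python
import hashlib
from collections import OrderedDict

MAX_CACHE = 8
_image_cache: OrderedDict[str, object] = OrderedDict()

def _cache_key(image) -> str:
    # Content-based key instead of id(image), so identical images hit the cache
    return hashlib.md5(image.tobytes()).hexdigest()

def get_encoded(image, encode_fn):
    key = _cache_key(image)
    if key in _image_cache:
        _image_cache.move_to_end(key)   # mark as most recently used
        return _image_cache[key]
    value = encode_fn(image)
    _image_cache[key] = value
    if len(_image_cache) > MAX_CACHE:   # LRU eviction of the oldest entry
        _image_cache.popitem(last=False)
    return value
```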
…ad support

- Add parse_florence_detections() and format_florence_response() to
  vqa_detection for handling Florence-2 detection output formats
- Add bypass_sdpa_hijacks() context manager to devices.py for models
  incompatible with SageAttention or other SDPA hijacks
- Add OpenCLIP model offload support when caption_offload is enabled

update_caption_params() was setting caption_max_length, chunk_size, and
flavor_intermediate_count on the Interrogator instance, but the library
reads them from self.config. The overrides were silently ignored.
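In other words, the clip_interrogator library reads these values from the Interrogator's config object rather than the instance, so the assignments have to land there; a minimal before/after fragment (the interrogator and value variables are illustrative):

```python
# Before: silently ignored, because the library never reads these off the instance
# interrogator.caption_max_length = max_length
# interrogator.chunk_size = chunk_size
# interrogator.flavor_intermediate_count = flavor_count

# After: write them to the config the library actually consults
interrogator.config.caption_max_length = max_length
interrogator.config.chunk_size = chunk_size
interrogator.config.flavor_intermediate_count = flavor_count
```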
…t, prefill tests

- Add use_safetensors=True to all 16 model from_pretrained calls to
  avoid downloading redundant .bin files alongside safetensors
- Add device property to JoyTag VisionModel so move_model can relocate
  it to CUDA (fixes 'ViT object has no attribute device')
- Fix Pix2Struct dtype mismatch by casting float inputs to model dtype
  while preserving integer tensor types
- Patch AutoConfig.register with exist_ok=True during Ovis loading to
  handle duplicate aimv2 registration on model reload
- Detect Qwen VL fine-tune architecture from config model_type instead
  of repo name, fixing ToriiGate and similar third-party fine-tunes
- Change UI default task from Short Caption to Normal Caption, and
  preserve it on model switch instead of resetting to Use Prompt
- Add dual-prefill testing across 5 VQA test methods using a shared
  _check_prefill helper
- Fix pre-existing ruff W605 in strip_think_xml_tags docstring
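The Pix2Struct dtype fix boils down to casting only floating-point inputs to the model dtype while leaving integer tensors (token ids, attention masks) untouched; a hedged sketch:

```python
import torch

def match_model_dtype(inputs: dict, model_dtype: torch.dtype) -> dict:
    """Illustrative helper: cast float tensors to the model dtype, keep integer tensors as-is."""
    return {
        k: (v.to(dtype=model_dtype) if torch.is_floating_point(v) else v)
        for k, v in inputs.items()
    }

# hypothetical usage: inputs = match_model_dtype(dict(processor(images=image, return_tensors="pt")), model.dtype)
```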
Move all caption/interrogate/tagger/VQA API code out of the monolithic
endpoints.py and models.py into a new self-contained modules/api/caption.py,
following the loras.py / nudenet.py self-registering pattern.

- Move 15 Pydantic models (ReqCaption, ResCaption, ReqVQA, ResVQA,
  ReqTagger, ResTagger, dispatch union types, etc.) from models.py
- Move 11 handler functions from endpoints.py
- Deduplicate ~150 lines via shared _do_openclip, _do_tagger, _do_vqa
  core functions called by both direct and dispatch endpoints
- Add register_api() that registers all 8 caption routes
- Add promptgen field to ResVLMPrompts (bug fix: handler returned it
  but response model silently dropped it)
- Improve all endpoint docstrings and Field descriptions for API docs
- Add _load_blip_model helper with explicit cache_dir so downloads
  go to hfcache_dir instead of default HF cache
- Pre-load BLIP model/processor before creating Interrogator config
  to control download location and avoid redundant loads
- Set clip_model_path on config for CLIP model cache location
- Add cache_dir to Moondream model and tokenizer loading
- Rename shadowing import in waifudiffusion batch to avoid F823/E0606
- Fix import order in cli/api-caption.py (stdlib before third-party)
- Rename local variable shadowing function name in cli/api-caption.py
- Remove unnecessary global statement in devices.bypass_sdpa_hijacks
- Remove superfluous SimpleNamespace import in cli/api-caption.py, use Map instead
- Drop _ prefix from internal helper functions in modules/api/caption.py
- Move DeepDanbooru model path to top-level models folder instead of nesting under CLIP
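The self-registering pattern referenced above (as in loras.py / nudenet.py) roughly means the module exposes a single hook that attaches its routes to the FastAPI app; a hypothetical sketch with abbreviated routes (handler names other than post_vqa are assumptions):

```python
from fastapi import FastAPI

def register_api(app: FastAPI):
    # Hypothetical sketch: each caption route is registered by the module itself
    app.add_api_route("/sdapi/v1/caption", post_caption, methods=["POST"])
    app.add_api_route("/sdapi/v1/vqa", post_vqa, methods=["POST"])
    app.add_api_route("/sdapi/v1/vqa/models", get_vqa_models, methods=["GET"])
    app.add_api_route("/sdapi/v1/tagger", post_tagger, methods=["POST"])
    # ... remaining caption routes registered the same way
```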
vladmandic (Owner) left a comment

lgtm

vladmandic merged commit 2c4d075 into vladmandic:dev on Feb 11, 2026
2 checks passed