This commit adds comprehensive support for image-text-to-text models to optimum-executorch, extending the existing recipe system to handle multimodal vision-language models.
Key changes:
- Added a new image-text-to-text task to the task registry
- Created ImageTextToTextExportableModule for multimodal model export
- Extended integrations to support both vision encoder and text decoder export
- Added comprehensive tests for multimodal functionality
- CLI now supports --task image-text-to-text for multimodal models
This enables users to export models like Gemma-3, LLaVA, and other vision-language models using the familiar optimum-executorch workflow:
optimum-cli export executorch --model google/gemma-3-4b-it --task image-text-to-text --recipe xnnpack
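
For the Python API, a minimal sketch of the equivalent flow is shown below. The class name `ExecuTorchModelForImageTextToText` and its `from_pretrained` arguments are assumptions modeled on the existing `ExecuTorchModelForCausalLM` API, not code confirmed by this commit:

```python
# Hypothetical sketch: loading/exporting a multimodal model from Python.
# `ExecuTorchModelForImageTextToText` is an assumed class name, chosen by
# analogy with the existing ExecuTorchModelFor* classes; verify against the
# actual optimum.executorch exports before relying on it.
from optimum.executorch import ExecuTorchModelForImageTextToText

model = ExecuTorchModelForImageTextToText.from_pretrained(
    "google/gemma-3-4b-it",
    recipe="xnnpack",  # same recipe as the CLI example above
)
```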
f"Exporting text decoder using inputs_embeds({inputs_embeds.shape}), cache_position({cache_position.shape}), dynamic_shapes={text_dynamic_shapes}, strict={text_strict}"
586
+
)
587
+
588
+
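
The logged values map onto a `torch.export` call for the decoder. The sketch below shows what such a call can look like; the tiny stand-in module, the example shapes, and the exact `dynamic_shapes` structure are illustrative assumptions, not the module's actual export code:

```python
# Sketch: exporting a decoder-style module that takes inputs_embeds and
# cache_position, with a dynamic sequence length. Shapes and names are
# illustrative only.
import torch
from torch.export import Dim, export


class TinyDecoder(torch.nn.Module):
    def forward(self, inputs_embeds, cache_position):
        # Stand-in for the real text decoder forward pass.
        return inputs_embeds.sum(dim=-1) + cache_position


inputs_embeds = torch.randn(1, 8, 64)   # (batch, seq_len, hidden)
cache_position = torch.arange(8)        # (seq_len,)

seq_len = Dim("seq_len", max=128)
text_dynamic_shapes = {
    "inputs_embeds": {1: seq_len},      # sequence length is dynamic
    "cache_position": {0: seq_len},
}

exported = export(
    TinyDecoder(),
    args=(),
    kwargs={"inputs_embeds": inputs_embeds, "cache_position": cache_position},
    dynamic_shapes=text_dynamic_shapes,
    strict=False,                       # mirrors the logged `strict` flag
)
print(exported)
```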
And the new imports added to the image-text-to-text task module, together with the note on task naming:

# Use the enhanced transformers integration for multimodal support
from ..integrations import ImageTextToTextExportableModule
from ..quantization import quantize_model_
from ..task_registry import register_task


# NOTE: It's important to map the registered task name to the pipeline name in https://github.com/huggingface/transformers/blob/main/utils/update_metadata.py.
# This will streamline using inferred task names and make exporting models to Hugging Face pipelines easier.
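
Given the `register_task` import above, the task module presumably registers a loader under the new task name. A rough sketch of what that registration could look like follows; the decorator usage, the loader name and signature, and the `ImageTextToTextExportableModule` constructor call are assumptions, not the actual code from this commit:

```python
# Hypothetical sketch of the task registration inside the new task module.
# The loader name, its signature, and the wrapper constructor are assumed.
from transformers import AutoModelForImageTextToText

from ..integrations import ImageTextToTextExportableModule
from ..task_registry import register_task


@register_task("image-text-to-text")
def load_image_text_to_text_model(model_name_or_path: str, **kwargs):
    # Load the eager multimodal model, then wrap it in the exportable module
    # that exposes the vision encoder and text decoder for export.
    model = AutoModelForImageTextToText.from_pretrained(model_name_or_path, **kwargs)
    return ImageTextToTextExportableModule(model)
```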