feat: add MiniMax provider (VLM + TTS) #2346
- Add MiniMaxVlModel (M3) for vision-language tasks via MiniMax's OpenAI-compatible endpoint at https://api.minimax.io/v1.
- Add MiniMaxTTSNode backed by MiniMax's /v1/t2a_v2 HTTP API (default model: speech-2.8-hd).
- Add MINIMAX_API_KEY / MINIMAX_BASE_URL env vars and a skipif_no_minimax pytest marker.
- Add unit tests for both providers (11 tests, all passing).
Greptile Summary
This PR adds MiniMax as a new provider with a VLM (
Confidence Score: 3/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Caller
participant MiniMaxVlModel
participant OpenAI SDK
participant MiniMax VLM API
Caller->>MiniMaxVlModel: query(image, prompt)
MiniMaxVlModel->>MiniMaxVlModel: _prepare_image() → base64
MiniMaxVlModel->>OpenAI SDK: chat.completions.create(model, messages, temperature=1.0)
OpenAI SDK->>MiniMax VLM API: POST /v1/chat/completions
MiniMax VLM API-->>OpenAI SDK: choices[0].message.content
OpenAI SDK-->>MiniMaxVlModel: response
MiniMaxVlModel-->>Caller: str (or None if content missing)
participant TextSrc as Text Observable
participant MiniMaxTTSNode
participant urllib
participant MiniMax TTS API
TextSrc->>MiniMaxTTSNode: on_next(text)
MiniMaxTTSNode->>MiniMaxTTSNode: _queue_text()
MiniMaxTTSNode->>MiniMaxTTSNode: _process_queue() [thread]
MiniMaxTTSNode->>urllib: POST /v1/t2a_v2 (JSON payload)
urllib->>MiniMax TTS API: HTTP POST
MiniMax TTS API-->>urllib: { data: { audio: "hex..." }, base_resp: {...} }
urllib-->>MiniMaxTTSNode: raw bytes
MiniMaxTTSNode->>MiniMaxTTSNode: bytes.fromhex() → mp3 bytes
MiniMaxTTSNode->>MiniMaxTTSNode: soundfile.read() → audio_array
MiniMaxTTSNode->>MiniMaxTTSNode: audio_subject.on_next(AudioEvent)
MiniMaxTTSNode->>MiniMaxTTSNode: text_subject.on_next(text)
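The TTS leg of the diagram (a JSON body with `data.audio` as hex, decoded with `bytes.fromhex()`) can be sketched as follows. This is a minimal illustration rather than the PR's code: the helper name `decode_t2a_audio` is hypothetical, and the `status_code` / `status_msg` fields inside `base_resp` are assumptions about the response shape.

```python
import json


def decode_t2a_audio(raw_body: bytes) -> bytes:
    """Decode the hex-encoded audio from a /v1/t2a_v2-style JSON body."""
    body = json.loads(raw_body)

    # Assumption: base_resp carries a numeric status_code where 0 means success.
    base_resp = body.get("base_resp") or {}
    status = base_resp.get("status_code", 0)
    if status != 0:
        message = base_resp.get("status_msg", "")
        raise RuntimeError(f"TTS request failed: {status} {message}")

    hex_audio = (body.get("data") or {}).get("audio")
    if not hex_audio:
        raise RuntimeError("TTS response contained no audio")

    # bytes.fromhex turns the hex string back into the raw audio bytes.
    return bytes.fromhex(hex_audio)
```

Checking the status before touching `data` means a failed request raises a readable error instead of falling through to an opaque decoding failure.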
Reviews (1): Last reviewed commit: "feat: add MiniMax provider (VLM + TTS)"
    response = self._client.chat.completions.create(**api_kwargs)

    return response.choices[0].message.content  # type: ignore[no-any-return]
response.choices[0].message.content is typed str | None in the OpenAI SDK. If the model returns a tool call or an empty completion the value will be None, and any downstream caller that treats the return as a plain str (e.g., query_json, caption, query_detections) will raise AttributeError. The sibling query_batch already guards against this with or ""; query should do the same.
Suggested change:
-   return response.choices[0].message.content  # type: ignore[no-any-return]
+   return response.choices[0].message.content or ""
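The effect of the suggested guard can be shown without a network call. The sketch below is illustrative only: `first_choice_text` and `_fake_response` are hypothetical names, and the `SimpleNamespace` objects merely imitate the attribute layout of a chat completion response.

```python
from types import SimpleNamespace
from typing import Any, Optional


def first_choice_text(response: Any) -> str:
    """Return the first choice's text, or "" when there is no content."""
    choices = getattr(response, "choices", None) or []
    if not choices:
        return ""
    # content may be None (e.g. an empty completion), so fall back to "".
    return choices[0].message.content or ""


def _fake_response(content: Optional[str]) -> SimpleNamespace:
    """Build a stand-in object shaped like a chat completion response."""
    message = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


text_reply = _fake_response("a red cube")
empty_reply = _fake_response(None)
```

With this guard, callers that go on to call string methods on the result get an empty string rather than `None`.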
    import urllib.error
    import urllib.request
urllib.error is imported at the top of the module but is never referenced — no except urllib.error.HTTPError / except urllib.error.URLError block exists anywhere. HTTP-level errors from urlopen (401, 429, 500, …) are caught only by the broad except Exception in _synthesize_speech, which means the error body is lost. Either remove the dead import, or use it to surface a more informative error.
Suggested change:
-   import urllib.error
    import urllib.request
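If the import is kept rather than removed, one way to use it is to catch `urllib.error.HTTPError` explicitly and include the response body in the raised error. The sketch below is a hedged illustration, not the PR's implementation: `post_json`, its `opener` parameter (there to make the function testable), and the Bearer-token header are all assumptions.

```python
import json
import urllib.error
import urllib.request
from typing import Any, Callable


def post_json(
    url: str,
    payload: dict[str, Any],
    api_key: str,
    opener: Callable[..., Any] = urllib.request.urlopen,
    timeout: float = 30.0,
) -> bytes:
    """POST a JSON payload and return the raw response body."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the API expects a Bearer token in the Authorization header.
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # HTTPError is file-like, so the server's error body can still be read.
        detail = err.read().decode("utf-8", errors="replace")
        raise RuntimeError(f"HTTP {err.code} from {url}: {detail}") from err
    except urllib.error.URLError as err:
        # Raised for connection-level failures (DNS, refused connection, ...).
        raise RuntimeError(f"Could not reach {url}: {err.reason}") from err
```

`HTTPError` is a subclass of `URLError`, so it has to be caught first; otherwise the status code and body would be swallowed by the more general handler.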
    def query_batch(
        self,
        images: list[Image],
        query: str,
        response_format: dict[str, Any] | None = None,
        **kwargs: Any,
    ) -> list[str]:
        """Query VLM with multiple images using a single API call."""
        if not images:
            return []

        if response_format:
            logger.warning(
                "MiniMax does not support response_format; ignoring and relying on prompt."
            )

        content: list[dict[str, Any]] = [
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{self._prepare_image(img)[0].to_base64()}"
                },
            }
            for img in images
        ]
        content.append({"type": "text", "text": query})

        messages = [{"role": "user", "content": content}]
        api_kwargs: dict[str, Any] = {
            "model": self.config.model_name,
            "messages": messages,
            "temperature": 1.0,
        }

        response = self._client.chat.completions.create(**api_kwargs)
        response_text = response.choices[0].message.content or ""
        # Return one response per image (same response since API analyzes all images together)
        return [response_text] * len(images)
query_batch sends all images to the model in a single call and then replicates the one response — [response_text] * len(images) — for every image. Callers that follow the standard VlModel.query_batch contract (one independent answer per image) will silently receive the same string for every element. The comment acknowledges this, but the behavior is meaningfully different from the base-class contract and could produce incorrect results in any pipeline that routes results per-image (e.g., object-detection scoring).
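One way to honor a one-answer-per-image contract is to issue a separate request for each image. The sketch below only illustrates that shape: `query_batch_per_image` is a hypothetical helper, and `query_one` stands in for a single-image call such as the model's `query` method.

```python
from typing import Any, Callable, Sequence


def query_batch_per_image(
    query_one: Callable[[Any, str], str],
    images: Sequence[Any],
    prompt: str,
) -> list[str]:
    """Issue one request per image so each result depends only on its own image."""
    # Normalise a missing answer to "" so every element is a plain string.
    return [query_one(image, prompt) or "" for image in images]
```

This trades one API call for `len(images)` calls, so it costs more in latency and quota; whether that is acceptable depends on how callers use the results.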
    def consume_text(self, text_observable: Observable) -> "AbstractTextConsumer":  # type: ignore[type-arg]
        logger.info("Starting MiniMaxTTSNode")

        self.processing_thread = threading.Thread(target=self._process_queue, daemon=True)  # type: ignore[assignment]
        self.processing_thread.start()  # type: ignore[attr-defined]

        self.subscription = text_observable.subscribe(  # type: ignore[assignment]
            on_next=self._queue_text,
            on_error=lambda e: logger.error(f"Error in MiniMaxTTSNode: {e}"),
        )

        return self
consume_text has no guard against being called more than once. A second call starts a new _process_queue thread and adds a new subscription while the first thread and subscription continue running. The leaked thread and subscription will each try to push items through audio_subject / text_subject concurrently, and they can never be cleaned up by dispose. The OpenAI TTS node may have the same pattern, but it's worth adding an early-return (or dispose-then-reinitialize) guard here.
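The early-return guard mentioned above could look roughly like this. It is a simplified stand-in, not the node itself: `SingleStartWorker` is a hypothetical class that models only the worker thread and queue, leaving out the observable subscription.

```python
import queue
import threading
from typing import Optional


class SingleStartWorker:
    """Hypothetical stand-in for a node whose worker may only be started once."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._thread: Optional[threading.Thread] = None
        self._queue: queue.Queue = queue.Queue()
        self.processed: list[str] = []

    def start(self) -> bool:
        """Start the worker thread; return False if it is already running."""
        with self._lock:
            if self._thread is not None:
                return False  # early return: no second thread is created
            self._thread = threading.Thread(target=self._process_queue, daemon=True)
            self._thread.start()
            return True

    def submit(self, text: str) -> None:
        self._queue.put(text)

    def stop(self) -> None:
        """Signal the worker to exit and wait for it, allowing a later restart."""
        with self._lock:
            thread, self._thread = self._thread, None
        if thread is not None:
            self._queue.put(None)  # sentinel that ends the worker loop
            thread.join()

    def _process_queue(self) -> None:
        while True:
            item = self._queue.get()
            if item is None:
                return
            self.processed.append(item)
```

Holding the lock around the check and the thread creation keeps two concurrent `start()` calls from both passing the guard.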
Summary
Adds MiniMax as a first-class provider for dimos, following the same pattern already used for OpenAI, Qwen, and the OpenAI TTS node.
- MiniMaxVlModel (default model MiniMax-M3) speaks to MiniMax's OpenAI-compatible endpoint at https://api.minimax.io/v1. Supports both single-image query and multi-image query_batch.
- MiniMaxTTSNode calls MiniMax's /v1/t2a_v2 HTTP API. Default model speech-2.8-hd, default voice English_Graceful_Lady.
- MINIMAX_API_KEY (required), MINIMAX_BASE_URL (optional override; defaults to overseas endpoint).
- Unit tests cover the request shape, hex-audio decoding, error status handling, and the VlModel interface contract.
API references
Implementation notes
- The VLM uses the openai SDK with a custom base_url, so no new dependency is required.
- temperature is hard-pinned to 1.0 in both query and query_batch because MiniMax rejects 0.
- response_format is dropped with a warning (not supported on MiniMax); callers should drive JSON via prompt + downstream parsing.
- The TTS node uses urllib (no new dependency) and decodes MiniMax's hex-encoded audio chunks via bytes.fromhex.
Test plan
uv run pytest dimos/models/vl/test_minimax.py dimos/stream/audio/tts/test_node_minimax.py→ 11/11 passing
MINIMAX_API_KEY
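Because response_format is not honored, the implementation notes point callers toward prompt-driven JSON plus downstream parsing. A minimal sketch of the parsing side follows; `parse_json_reply` is a hypothetical helper, and it assumes the reply contains exactly one JSON object or array, possibly wrapped in extra prose.

```python
import json
from typing import Any


def parse_json_reply(reply: str) -> Any:
    """Extract and parse the JSON object or array embedded in a model reply."""
    starts = [i for i in (reply.find("{"), reply.find("[")) if i != -1]
    if not starts:
        raise ValueError("no JSON object or array found in reply")
    start = min(starts)
    end = max(reply.rfind("}"), reply.rfind("]"))
    if end < start:
        raise ValueError("unterminated JSON in reply")
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed input.
    return json.loads(reply[start : end + 1])
```

Slicing from the first opening bracket to the last closing bracket tolerates leading or trailing chatter, but it fails if the reply contains several separate JSON values.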