feat: add MiniMax provider (VLM + TTS) by octo-patch · Pull Request #2346 · dimensionalOS/dimos

octo-patch · 2026-06-03T09:29:43Z

Summary

Adds MiniMax as a first-class provider for dimos, following the same
pattern already used for OpenAI, Qwen, and the OpenAI TTS node.

VLM: MiniMaxVlModel (default model MiniMax-M3) speaks to MiniMax's
OpenAI-compatible endpoint at https://api.minimax.io/v1. Supports both
single-image query and multi-image query_batch.
TTS: MiniMaxTTSNode calls MiniMax's /v1/t2a_v2 HTTP API. Default
model speech-2.8-hd, default voice English_Graceful_Lady.
Env vars: MINIMAX_API_KEY (required), MINIMAX_BASE_URL (optional
override — defaults to overseas endpoint).
Tests: 11 unit tests covering config defaults, missing-key errors,
request shape, hex-audio decoding, error status handling, and the
VlModel interface contract.

API references

Chat (OpenAI compatible): https://platform.minimax.io/docs/api-reference/text-openai-api
TTS (t2a_v2): https://platform.minimax.io/docs/api-reference/speech-t2a-http

Implementation notes

The VLM uses the existing openai SDK with a custom base_url, so no
new dependency is required.
temperature is hard-pinned to 1.0 in both query and query_batch
because MiniMax rejects 0.
response_format is silently dropped with a warning (not supported on
MiniMax) — callers should drive JSON via prompt + downstream parsing.
The TTS node uses urllib (no new dependency) and decodes MiniMax's
hex-encoded audio chunks via bytes.fromhex.

Test plan

uv run pytest dimos/models/vl/test_minimax.py dimos/stream/audio/tts/test_node_minimax.py
→ 11/11 passing
Optional manual end-to-end with a real MINIMAX_API_KEY

- Add MiniMaxVlModel (M3) for vision-language tasks via MiniMax's OpenAI-compatible endpoint at https://api.minimax.io/v1. - Add MiniMaxTTSNode backed by MiniMax's /v1/t2a_v2 HTTP API (default model: speech-2.8-hd). - Add MINIMAX_API_KEY / MINIMAX_BASE_URL env vars and a skipif_no_minimax pytest marker. - Add unit tests for both providers (11 tests, all passing).

greptile-apps · 2026-06-03T09:32:58Z

Greptile Summary

This PR adds MiniMax as a new provider with a VLM (MiniMaxVlModel) using the OpenAI SDK against MiniMax's compatible endpoint, and a TTS node (MiniMaxTTSNode) using urllib to call the t2a_v2 API with hex-encoded audio decoding. Both nodes follow established patterns in the codebase and introduce no new dependencies.

VLM: query() can return None (the OpenAI SDK types message.content as str | None) while the base-class contract requires str; query_batch already guards with or "" but query does not. Additionally, query_batch replicates a single shared response for every image in the list, which diverges from the expected per-image contract.
TTS: urllib.error is imported but never referenced explicitly; consume_text has no guard against being invoked more than once, which would leak threads and subscriptions.

Confidence Score: 3/5

The VLM query() method can silently return None instead of a string, which will cause AttributeError in any downstream pipeline that calls string methods on the result.

The query() method in minimax.py returns response.choices[0].message.content without a null guard. Every higher-level helper on VlModel — caption, query_json, query_detections, query_points — passes the return value directly into string operations or JSON parsing. A model response that carries a tool call or empty content will produce a None return and crash these callers at runtime. The fix is a one-line or "", already present in query_batch, so the risk is contained and straightforward to address.

dimos/models/vl/minimax.py — the query() return value and query_batch duplication semantics both need attention before the node is wired into live pipelines.

Important Files Changed

Filename	Overview
dimos/models/vl/minimax.py	New MiniMax VLM provider; `query()` can return `None` (OpenAI SDK types `message.content` as `str
dimos/stream/audio/tts/node_minimax.py	New MiniMax TTS node using urllib + hex-decode; dead `urllib.error` import, and `consume_text` has no guard against double-invocation which would leak threads/subscriptions.
dimos/models/vl/test_minimax.py	Unit tests for VLM; the `patch.object` context manager in `test_query_uses_openai_compatible_client` is unused (the `__dict__` injection alone drives the test), making the mock setup misleading but not incorrect.
dimos/stream/audio/tts/test_node_minimax.py	Unit tests for TTS node; covers hex decode, error-status handling, and request shape — well structured and complete.
dimos/models/vl/README.md	Adds MiniMax VLM documentation section with env vars, example usage, and known limitations.
default.env	Adds `MINIMAX_API_KEY` and `MINIMAX_BASE_URL` placeholders, consistent with existing provider entries.
dimos/conftest.py	Registers `skipif_no_minimax` pytest marker, matching the existing pattern for other providers.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant MiniMaxVlModel
    participant OpenAI SDK
    participant MiniMax VLM API

    Caller->>MiniMaxVlModel: query(image, prompt)
    MiniMaxVlModel->>MiniMaxVlModel: _prepare_image() → base64
    MiniMaxVlModel->>OpenAI SDK: chat.completions.create(model, messages, temperature=1.0)
    OpenAI SDK->>MiniMax VLM API: POST /v1/chat/completions
    MiniMax VLM API-->>OpenAI SDK: choices[0].message.content
    OpenAI SDK-->>MiniMaxVlModel: response
    MiniMaxVlModel-->>Caller: str (or None if content missing)

    participant TextSrc as Text Observable
    participant MiniMaxTTSNode
    participant urllib
    participant MiniMax TTS API

    TextSrc->>MiniMaxTTSNode: on_next(text)
    MiniMaxTTSNode->>MiniMaxTTSNode: _queue_text()
    MiniMaxTTSNode->>MiniMaxTTSNode: _process_queue() [thread]
    MiniMaxTTSNode->>urllib: POST /v1/t2a_v2 (JSON payload)
    urllib->>MiniMax TTS API: HTTP POST
    MiniMax TTS API-->>urllib: { data: { audio: "hex..." }, base_resp: {...} }
    urllib-->>MiniMaxTTSNode: raw bytes
    MiniMaxTTSNode->>MiniMaxTTSNode: bytes.fromhex() → mp3 bytes
    MiniMaxTTSNode->>MiniMaxTTSNode: soundfile.read() → audio_array
    MiniMaxTTSNode->>MiniMaxTTSNode: audio_subject.on_next(AudioEvent)
    MiniMaxTTSNode->>MiniMaxTTSNode: text_subject.on_next(text)

_{Reviews (1): Last reviewed commit: "feat: add MiniMax provider (VLM + TTS)" | Re-trigger Greptile}

greptile-apps · 2026-06-03T09:33:02Z

+
+        response = self._client.chat.completions.create(**api_kwargs)
+
+        return response.choices[0].message.content  # type: ignore[no-any-return]


response.choices[0].message.content is typed str | None in the OpenAI SDK. If the model returns a tool call or an empty completion the value will be None, and any downstream caller that treats the return as a plain str (e.g., query_json, caption, query_detections) will raise AttributeError. The sibling query_batch already guards against this with or ""; query should do the same.

Suggested change

return response.choices[0].message.content # type: ignore[no-any-return]

return response.choices[0].message.content or ""

greptile-apps · 2026-06-03T09:33:03Z

+import urllib.error
+import urllib.request


urllib.error is imported at the top of the module but is never referenced — no except urllib.error.HTTPError / except urllib.error.URLError block exists anywhere. HTTP-level errors from urlopen (401, 429, 500, …) are caught only by the broad except Exception in _synthesize_speech, which means the error body is lost. Either remove the dead import, or use it to surface a more informative error.

Suggested change

import urllib.error

import urllib.request

import urllib.request

greptile-apps · 2026-06-03T09:33:03Z

+    def query_batch(
+        self,
+        images: list[Image],
+        query: str,
+        response_format: dict[str, Any] | None = None,
+        **kwargs: Any,
+    ) -> list[str]:
+        """Query VLM with multiple images using a single API call."""
+        if not images:
+            return []
+
+        if response_format:
+            logger.warning(
+                "MiniMax does not support response_format; ignoring and relying on prompt."
+            )
+
+        content: list[dict[str, Any]] = [
+            {
+                "type": "image_url",
+                "image_url": {
+                    "url": f"data:image/png;base64,{self._prepare_image(img)[0].to_base64()}"
+                },
+            }
+            for img in images
+        ]
+        content.append({"type": "text", "text": query})
+
+        messages = [{"role": "user", "content": content}]
+        api_kwargs: dict[str, Any] = {
+            "model": self.config.model_name,
+            "messages": messages,
+            "temperature": 1.0,
+        }
+
+        response = self._client.chat.completions.create(**api_kwargs)
+        response_text = response.choices[0].message.content or ""
+        # Return one response per image (same response since API analyzes all images together)
+        return [response_text] * len(images)


query_batch sends all images to the model in a single call and then replicates the one response — [response_text] * len(images) — for every image. Callers that follow the standard VlModel.query_batch contract (one independent answer per image) will silently receive the same string for every element. The comment acknowledges this, but the behavior is meaningfully different from the base-class contract and could produce incorrect results in any pipeline that routes results per-image (e.g., object-detection scoring).

greptile-apps · 2026-06-03T09:33:04Z

+    def consume_text(self, text_observable: Observable) -> "AbstractTextConsumer":  # type: ignore[type-arg]
+        logger.info("Starting MiniMaxTTSNode")
+
+        self.processing_thread = threading.Thread(target=self._process_queue, daemon=True)  # type: ignore[assignment]
+        self.processing_thread.start()  # type: ignore[attr-defined]
+
+        self.subscription = text_observable.subscribe(  # type: ignore[assignment]
+            on_next=self._queue_text,
+            on_error=lambda e: logger.error(f"Error in MiniMaxTTSNode: {e}"),
+        )
+
+        return self


consume_text has no guard against being called more than once. A second call starts a new _process_queue thread and adds a new subscription while the first thread and subscription continue running. The leaked thread and subscription will each try to push items through audio_subject / text_subject concurrently, and they can never be cleaned up by dispose. The OpenAI TTS node may have the same pattern, but it's worth adding an early-return (or dispose-then-reinitialize) guard here.

octo-patch requested review from leshy, mustafab0, paul-nechifor and spomichter as code owners June 3, 2026 09:29

greptile-apps Bot reviewed Jun 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add MiniMax provider (VLM + TTS)#2346

feat: add MiniMax provider (VLM + TTS)#2346
octo-patch wants to merge 1 commit into
dimensionalOS:mainfrom
octo-patch:feature/add-minimax-provider

octo-patch commented Jun 3, 2026

Uh oh!

greptile-apps Bot commented Jun 3, 2026

Uh oh!

greptile-apps Bot Jun 3, 2026

Uh oh!

greptile-apps Bot Jun 3, 2026

Uh oh!

greptile-apps Bot Jun 3, 2026

Uh oh!

greptile-apps Bot Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant


		response = self._client.chat.completions.create(**api_kwargs)

		return response.choices[0].message.content # type: ignore[no-any-return]

	import urllib.error
	import urllib.request
	import urllib.request

Conversation

octo-patch commented Jun 3, 2026

Summary

API references

Implementation notes

Test plan

Uh oh!

greptile-apps Bot commented Jun 3, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant