Skip to content

feat: add MiniMax provider (VLM + TTS)#2346

Open
octo-patch wants to merge 1 commit into
dimensionalOS:mainfrom
octo-patch:feature/add-minimax-provider
Open

feat: add MiniMax provider (VLM + TTS)#2346
octo-patch wants to merge 1 commit into
dimensionalOS:mainfrom
octo-patch:feature/add-minimax-provider

Conversation

@octo-patch
Copy link
Copy Markdown

Summary

Adds MiniMax as a first-class provider for dimos, following the same
pattern already used for OpenAI, Qwen, and the OpenAI TTS node.

  • VLM: MiniMaxVlModel (default model MiniMax-M3) speaks to MiniMax's
    OpenAI-compatible endpoint at https://api.minimax.io/v1. Supports both
    single-image query and multi-image query_batch.
  • TTS: MiniMaxTTSNode calls MiniMax's /v1/t2a_v2 HTTP API. Default
    model speech-2.8-hd, default voice English_Graceful_Lady.
  • Env vars: MINIMAX_API_KEY (required), MINIMAX_BASE_URL (optional
    override — defaults to overseas endpoint).
  • Tests: 11 unit tests covering config defaults, missing-key errors,
    request shape, hex-audio decoding, error status handling, and the
    VlModel interface contract.

API references

Implementation notes

  • The VLM uses the existing openai SDK with a custom base_url, so no
    new dependency is required.
  • temperature is hard-pinned to 1.0 in both query and query_batch
    because MiniMax rejects 0.
  • response_format is silently dropped with a warning (not supported on
    MiniMax) — callers should drive JSON via prompt + downstream parsing.
  • The TTS node uses urllib (no new dependency) and decodes MiniMax's
    hex-encoded audio chunks via bytes.fromhex.

Test plan

  • uv run pytest dimos/models/vl/test_minimax.py dimos/stream/audio/tts/test_node_minimax.py
    → 11/11 passing
  • Optional manual end-to-end with a real MINIMAX_API_KEY

- Add MiniMaxVlModel (M3) for vision-language tasks via MiniMax's
  OpenAI-compatible endpoint at https://api.minimax.io/v1.
- Add MiniMaxTTSNode backed by MiniMax's /v1/t2a_v2 HTTP API
  (default model: speech-2.8-hd).
- Add MINIMAX_API_KEY / MINIMAX_BASE_URL env vars and a
  skipif_no_minimax pytest marker.
- Add unit tests for both providers (11 tests, all passing).
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 3, 2026

Greptile Summary

This PR adds MiniMax as a new provider with a VLM (MiniMaxVlModel) using the OpenAI SDK against MiniMax's compatible endpoint, and a TTS node (MiniMaxTTSNode) using urllib to call the t2a_v2 API with hex-encoded audio decoding. Both nodes follow established patterns in the codebase and introduce no new dependencies.

  • VLM: query() can return None (the OpenAI SDK types message.content as str | None) while the base-class contract requires str; query_batch already guards with or "" but query does not. Additionally, query_batch replicates a single shared response for every image in the list, which diverges from the expected per-image contract.
  • TTS: urllib.error is imported but never referenced explicitly; consume_text has no guard against being invoked more than once, which would leak threads and subscriptions.

Confidence Score: 3/5

The VLM query() method can silently return None instead of a string, which will cause AttributeError in any downstream pipeline that calls string methods on the result.

The query() method in minimax.py returns response.choices[0].message.content without a null guard. Every higher-level helper on VlModelcaption, query_json, query_detections, query_points — passes the return value directly into string operations or JSON parsing. A model response that carries a tool call or empty content will produce a None return and crash these callers at runtime. The fix is a one-line or "", already present in query_batch, so the risk is contained and straightforward to address.

dimos/models/vl/minimax.py — the query() return value and query_batch duplication semantics both need attention before the node is wired into live pipelines.

Important Files Changed

Filename Overview
dimos/models/vl/minimax.py New MiniMax VLM provider; query() can return None (OpenAI SDK types message.content as `str
dimos/stream/audio/tts/node_minimax.py New MiniMax TTS node using urllib + hex-decode; dead urllib.error import, and consume_text has no guard against double-invocation which would leak threads/subscriptions.
dimos/models/vl/test_minimax.py Unit tests for VLM; the patch.object context manager in test_query_uses_openai_compatible_client is unused (the __dict__ injection alone drives the test), making the mock setup misleading but not incorrect.
dimos/stream/audio/tts/test_node_minimax.py Unit tests for TTS node; covers hex decode, error-status handling, and request shape — well structured and complete.
dimos/models/vl/README.md Adds MiniMax VLM documentation section with env vars, example usage, and known limitations.
default.env Adds MINIMAX_API_KEY and MINIMAX_BASE_URL placeholders, consistent with existing provider entries.
dimos/conftest.py Registers skipif_no_minimax pytest marker, matching the existing pattern for other providers.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant MiniMaxVlModel
    participant OpenAI SDK
    participant MiniMax VLM API

    Caller->>MiniMaxVlModel: query(image, prompt)
    MiniMaxVlModel->>MiniMaxVlModel: _prepare_image() → base64
    MiniMaxVlModel->>OpenAI SDK: chat.completions.create(model, messages, temperature=1.0)
    OpenAI SDK->>MiniMax VLM API: POST /v1/chat/completions
    MiniMax VLM API-->>OpenAI SDK: choices[0].message.content
    OpenAI SDK-->>MiniMaxVlModel: response
    MiniMaxVlModel-->>Caller: str (or None if content missing)

    participant TextSrc as Text Observable
    participant MiniMaxTTSNode
    participant urllib
    participant MiniMax TTS API

    TextSrc->>MiniMaxTTSNode: on_next(text)
    MiniMaxTTSNode->>MiniMaxTTSNode: _queue_text()
    MiniMaxTTSNode->>MiniMaxTTSNode: _process_queue() [thread]
    MiniMaxTTSNode->>urllib: POST /v1/t2a_v2 (JSON payload)
    urllib->>MiniMax TTS API: HTTP POST
    MiniMax TTS API-->>urllib: { data: { audio: "hex..." }, base_resp: {...} }
    urllib-->>MiniMaxTTSNode: raw bytes
    MiniMaxTTSNode->>MiniMaxTTSNode: bytes.fromhex() → mp3 bytes
    MiniMaxTTSNode->>MiniMaxTTSNode: soundfile.read() → audio_array
    MiniMaxTTSNode->>MiniMaxTTSNode: audio_subject.on_next(AudioEvent)
    MiniMaxTTSNode->>MiniMaxTTSNode: text_subject.on_next(text)
Loading

Reviews (1): Last reviewed commit: "feat: add MiniMax provider (VLM + TTS)" | Re-trigger Greptile


response = self._client.chat.completions.create(**api_kwargs)

return response.choices[0].message.content # type: ignore[no-any-return]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 response.choices[0].message.content is typed str | None in the OpenAI SDK. If the model returns a tool call or an empty completion the value will be None, and any downstream caller that treats the return as a plain str (e.g., query_json, caption, query_detections) will raise AttributeError. The sibling query_batch already guards against this with or ""; query should do the same.

Suggested change
return response.choices[0].message.content # type: ignore[no-any-return]
return response.choices[0].message.content or ""

Comment on lines +31 to +32
import urllib.error
import urllib.request
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 urllib.error is imported at the top of the module but is never referenced — no except urllib.error.HTTPError / except urllib.error.URLError block exists anywhere. HTTP-level errors from urlopen (401, 429, 500, …) are caught only by the broad except Exception in _synthesize_speech, which means the error body is lost. Either remove the dead import, or use it to surface a more informative error.

Suggested change
import urllib.error
import urllib.request
import urllib.request

Comment on lines +128 to +165
def query_batch(
self,
images: list[Image],
query: str,
response_format: dict[str, Any] | None = None,
**kwargs: Any,
) -> list[str]:
"""Query VLM with multiple images using a single API call."""
if not images:
return []

if response_format:
logger.warning(
"MiniMax does not support response_format; ignoring and relying on prompt."
)

content: list[dict[str, Any]] = [
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{self._prepare_image(img)[0].to_base64()}"
},
}
for img in images
]
content.append({"type": "text", "text": query})

messages = [{"role": "user", "content": content}]
api_kwargs: dict[str, Any] = {
"model": self.config.model_name,
"messages": messages,
"temperature": 1.0,
}

response = self._client.chat.completions.create(**api_kwargs)
response_text = response.choices[0].message.content or ""
# Return one response per image (same response since API analyzes all images together)
return [response_text] * len(images)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 query_batch sends all images to the model in a single call and then replicates the one response — [response_text] * len(images) — for every image. Callers that follow the standard VlModel.query_batch contract (one independent answer per image) will silently receive the same string for every element. The comment acknowledges this, but the behavior is meaningfully different from the base-class contract and could produce incorrect results in any pipeline that routes results per-image (e.g., object-detection scoring).

Comment on lines +129 to +140
def consume_text(self, text_observable: Observable) -> "AbstractTextConsumer": # type: ignore[type-arg]
logger.info("Starting MiniMaxTTSNode")

self.processing_thread = threading.Thread(target=self._process_queue, daemon=True) # type: ignore[assignment]
self.processing_thread.start() # type: ignore[attr-defined]

self.subscription = text_observable.subscribe( # type: ignore[assignment]
on_next=self._queue_text,
on_error=lambda e: logger.error(f"Error in MiniMaxTTSNode: {e}"),
)

return self
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 consume_text has no guard against being called more than once. A second call starts a new _process_queue thread and adds a new subscription while the first thread and subscription continue running. The leaked thread and subscription will each try to push items through audio_subject / text_subject concurrently, and they can never be cleaned up by dispose. The OpenAI TTS node may have the same pattern, but it's worth adding an early-return (or dispose-then-reinitialize) guard here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant