MiniMax M3 real engineering worker probe on Hermes WebUI PR-derived cases #52

@franksong2702

Third-party M3 real engineering worker probe on Hermes WebUI PR-derived cases

Good-faith technical feedback from a third-party developer using real Hermes WebUI PR cases. This is a small diagnostic probe, not a general model benchmark.

Context

I received early access to M3 through MiniMax community outreach and wanted to share a concrete engineering signal back to the MiniMax / M3 team.

The test material comes from real Hermes WebUI issue / PR cases I worked with as a developer. The goal was not to rank general coding ability, but to answer a narrower question:

Can M3 act as a bounded coding worker on real bug-fix tasks, starting from the same commit and prompt, and produce fixes that pass independent PR-derived oracles?

Artifacts

Please see the companion artifact gist: https://gist.github.com/franksong2702/c06c3c2ad110baaa1fe8a599520935b4

Key files:

Evaluation Setup

  • Case set: 11 real PR-derived Hermes WebUI bug-fix cases.
  • Each case starts from the PR's pre-fix start_commit.
  • Prompt is derived from the corresponding real issue / bug description.
  • Main correctness signal: correctness_independent, using the PR-derived behavioral oracle.
  • Oracle sanity check: Method C teeth-check. The base commit should fail the oracle, and applying that PR's own code-file diff should make it pass.
  • M3 runs: 3 attempts per case through a real Hermes WebUI 8787 session, workspace set to realcase/sandbox.
  • Reference model: gpt-5.4, 1 attempt per case under the same bounded-worker prep/score flow.
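
The teeth-check idea in the setup above can be sketched as follows. This is a minimal illustration, not the harness used for this report: the oracle is modelled as a plain callable that returns True on pass, the trees are stand-in strings, and the names teeth_check, toy_oracle, and the verdict labels are all hypothetical.

```python
from typing import Callable


def teeth_check(oracle: Callable[[str], bool], base_tree: str, fixed_tree: str) -> str:
    """Classify an oracle by running it on the pre-fix and post-fix trees."""
    base_passes = oracle(base_tree)
    fixed_passes = oracle(fixed_tree)
    if not base_passes and fixed_passes:
        return "has_teeth"   # fails before the fix, passes after it
    if base_passes and fixed_passes:
        return "toothless"   # passes even without the fix
    if not base_passes and not fixed_passes:
        return "broken"      # fails even with the reference fix applied
    return "inverted"        # passes on base, fails on the fix


def toy_oracle(tree: str) -> bool:
    # Hypothetical stand-in: a real oracle would run behavioral tests on a checkout.
    return "fix" in tree


verdict = teeth_check(toy_oracle, "base", "base+fix")
```

Only an oracle classified as "has_teeth" gives a meaningful correctness signal; any other verdict suggests the oracle, not the model, needs attention.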

Because M3 has 3 attempts per case while GPT 5.4 has 1, the report does not treat "M3 best-of-3" and "GPT one-shot" as the same metric. It reports run-level, any-pass, and stable-all-pass views separately.
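
To make the three views concrete, here is a small sketch of how they could be computed from per-attempt results. The input shape (a list of case id and pass flag pairs), the function name summarize, and the toy data are assumptions for illustration; the toy rows are not results from this report.

```python
from collections import defaultdict


def summarize(runs):
    """runs: iterable of (case_id, passed) pairs, one pair per attempt."""
    by_case = defaultdict(list)
    for case_id, passed in runs:
        by_case[case_id].append(bool(passed))
    total_runs = sum(len(v) for v in by_case.values())
    return {
        # passing attempts over all attempts
        "run_level": (sum(sum(v) for v in by_case.values()), total_runs),
        # cases with at least one passing attempt
        "any_pass": (sum(any(v) for v in by_case.values()), len(by_case)),
        # cases where every attempt passes
        "stable_all_pass": (sum(all(v) for v in by_case.values()), len(by_case)),
    }


# Synthetic toy data, three attempts per case.
toy_runs = [
    ("case_a", True), ("case_a", True), ("case_a", True),
    ("case_b", True), ("case_b", False), ("case_b", False),
    ("case_c", False), ("case_c", False), ("case_c", False),
]
summary = summarize(toy_runs)
```

With one attempt per case, all three views collapse to the same number, which is why a one-shot result is only comparable to the run-level view of a multi-attempt model, not to its any-pass view.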

Result Summary

Two result scopes are reported because pr2280, although its oracle is mechanically valid, tests a pt translation gap that the prompt does not explicitly state.

With all 11 current cases:

  • GPT 5.4: 6/11 pass, one attempt per case.
  • M3: 13/33 run-level pass; 6/11 cases pass at least once; 3/11 cases pass all three attempts.
  • pr2280 is the only case where GPT 5.4 passes and M3 fails all three attempts.

Excluding pr2280 (the stricter core-10 view):

  • GPT 5.4: 5/10 pass, one attempt per case.
  • M3: 13/30 run-level pass; 6/10 cases pass at least once; 3/10 cases pass all three attempts.
  • In this view, there is no GPT-only pass case where M3 fails all three attempts.

Main Observations

  1. M3 is usable as a real engineering worker candidate.
    It can read the repo, locate relevant files, write patches, run checks, and pass independent oracles on multiple real cases.

  2. The main issue is stability rather than basic task understanding.
    Some cases pass in one M3 attempt but fail in another. This suggests M3 is better used with repeat sampling and an independent oracle / review loop than as a single unverified final output.

  3. correctness_bounded is not enough.
    In the core 10-case view, all 40 rows (10 GPT 5.4 runs plus 30 M3 runs) have correctness_bounded=True, but 22 of them still fail the independent oracle. This is a useful signal for agentic coding: local syntax checks or worker-written tests can look healthy while the target behavior is still wrong.

  4. The hardest failures are deep state-system tasks.
    Both models struggled on cases involving session state, journal recovery, state.db replay, sidecar repair, and streaming/session-list consistency. These failures look more like incomplete system-state modeling than simple file-location misses.

  5. pr2280 should be read with care.
    The oracle is not misconfigured; it includes the correct i18n parity test. However, it also exposes a pt translation gap that the prompt does not explicitly state. The report therefore gives results both with and without pr2280.
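
The gap between the bounded and independent signals in observation 3 can be checked with a few lines. This sketch rebuilds only the aggregate shape of the core-10 rows from the reported totals (5 + 13 independent passes out of 10 + 30 rows, all with correctness_bounded=True); the dict-per-row layout and the helper name bounded_but_wrong are assumptions, and no per-case detail is implied.

```python
def bounded_but_wrong(rows):
    """Count rows that look healthy locally but fail the independent oracle."""
    return sum(
        1 for r in rows
        if r["correctness_bounded"] and not r["correctness_independent"]
    )


# Aggregates reported for the core-10 view.
independent_pass = 5 + 13   # GPT 5.4 passes + M3 run-level passes
total_rows = 10 + 30        # GPT 5.4 rows + M3 rows

rows = (
    [{"correctness_bounded": True, "correctness_independent": True}] * independent_pass
    + [{"correctness_bounded": True, "correctness_independent": False}]
    * (total_rows - independent_pass)
)
gap = bounded_but_wrong(rows)
```

Here gap comes out to 22 of 40 rows, matching the figure in observation 3.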

Suggested Areas To Investigate

  1. Stability across repeated runs on the same real engineering task.
  2. Stronger self-checking against independent behavioral acceptance criteria.
  3. Better convergence on deep state-system bugs involving persisted session/journal/runtime state.
  4. Evaluation hygiene: fixed manifest, explicit run grouping, exact journal/session binding, and regular oracle teeth-checks.
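
One possible reading of "fixed manifest" in item 4 is to freeze the case list behind a content hash, so that every run group can record which manifest it was scored against. The sketch below assumes a manifest is a list of dicts with case_id and start_commit fields; that layout, the placeholder commit values, and the function name manifest_fingerprint are hypothetical and not taken from the report's tooling.

```python
import hashlib
import json


def manifest_fingerprint(cases):
    """Return a SHA-256 hex digest that is stable under case reordering."""
    canonical = json.dumps(
        sorted(cases, key=lambda c: c["case_id"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder entries; real manifests would carry actual commit hashes.
manifest = [
    {"case_id": "case_a", "start_commit": "0" * 40},
    {"case_id": "case_b", "start_commit": "1" * 40},
]
frozen = manifest_fingerprint(manifest)
```

Storing the fingerprint alongside each result row makes it easy to detect rows scored against a different case set or a different start commit.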

Limitations

  • Small sample: 10-11 cases, all from one repository and one product domain.
  • Sampling is asymmetric: GPT 5.4 has 1 attempt per case; M3 has 3 attempts per case.
  • The sample is too small to support statistical significance, so these results should not be used as a general model ranking.
  • Some auxiliary signals (scope_vs_codex, correctness_bounded) are proxies and should not replace the independent oracle.

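Summary (translated from Chinese)

This is a reference report on M3 as an engineering worker, compiled by an independent third-party developer from real Hermes WebUI PR cases. The goal is not to build a general model leaderboard, but to observe whether M3, in a real bug-fix agentic coding scenario, can produce fixes that pass the independent oracle when given the same start commit and the same prompt.

The core conclusion: M3 already has the basic capabilities of a real engineering worker, but the stability of a single output is still insufficient, so it is better suited to repeated sampling followed by filtering with an independent oracle or human review. The deep-state cases that both models failed indicate that this task set is inherently difficult, especially where multi-layer state chains such as session state, journal, state.db, and stream lifecycle are involved.

pr2280 is a case that needs separate explanation: the oracle is mechanically valid, but it includes a pt translation requirement that the prompt does not explicitly state. The report therefore gives both the with-pr2280 and without-pr2280 scopes so that this boundary is not hidden.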


Shared as good-faith technical feedback. Report and data were prepared with AI assistance; all runs and result checks were performed locally by me.
