MiniMax M3 real engineering worker probe on Hermes WebUI PR-derived cases

# Third-party M3 real engineering worker probe on Hermes WebUI PR-derived cases

> Good-faith technical feedback from a third-party developer using real Hermes WebUI PR cases. This is a small diagnostic probe, not a general model benchmark.

## Context

I received early access to M3 through MiniMax community outreach and wanted to share a concrete engineering signal back to the MiniMax / M3 team.

The test material comes from real Hermes WebUI issue / PR cases I worked with as a developer. The goal was not to rank general coding ability, but to answer a narrower question:

> Can M3 act as a bounded coding worker on real bug-fix tasks, starting from the same commit and prompt, and produce fixes that pass independent PR-derived oracles?

## Artifacts

Please see the companion artifact gist: https://gist.github.com/franksong2702/c06c3c2ad110baaa1fe8a599520935b4

Key files:

- Full report: [REPORT.md](https://gist.github.com/franksong2702/c06c3c2ad110baaa1fe8a599520935b4#file-report-md)
- Per-case result summary: [results_by_case.csv](https://gist.github.com/franksong2702/c06c3c2ad110baaa1fe8a599520935b4#file-results_by_case-csv)
- Full sanitized results: [results_full_sanitized.csv](https://gist.github.com/franksong2702/c06c3c2ad110baaa1fe8a599520935b4#file-results_full_sanitized-csv)
- Case manifest: [cases_manifest.csv](https://gist.github.com/franksong2702/c06c3c2ad110baaa1fe8a599520935b4#file-cases_manifest-csv)
- Oracle inventory: [oracle_inventory.csv](https://gist.github.com/franksong2702/c06c3c2ad110baaa1fe8a599520935b4#file-oracle_inventory-csv)

## Evaluation Setup

- Case set: 11 real PR-derived Hermes WebUI bug-fix cases.
- Each case starts from the PR's pre-fix `start_commit`.
- Prompt is derived from the corresponding real issue / bug description.
- Main correctness signal: `correctness_independent`, using the PR-derived behavioral oracle.
- Oracle sanity check (Method C teeth-check): the base commit should fail the oracle, and applying that PR's own code-file diff should make it pass.
- M3 runs: 3 attempts per case through a real Hermes WebUI 8787 session, workspace set to `realcase/sandbox`.
- Reference model: `gpt-5.4`, 1 attempt per case under the same bounded-worker prep/score flow.
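The teeth-check described in the list above can be sketched as follows. This is a minimal illustration, not the harness used for the report: `teeth_check`, `TeethCheck`, `run_oracle`, and the stub oracle are hypothetical names, and the real oracle runner is assumed to be any callable that returns `True` when the oracle passes on a given tree.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TeethCheck:
    base_fails: bool    # oracle fails on the pre-fix start commit
    fixed_passes: bool  # oracle passes once the PR's own code diff is applied

    @property
    def has_teeth(self) -> bool:
        # An oracle is only trusted when both conditions hold.
        return self.base_fails and self.fixed_passes


def teeth_check(run_oracle: Callable[[str], bool],
                base_ref: str, fixed_ref: str) -> TeethCheck:
    """Run the oracle on both trees; run_oracle(ref) is True on a pass."""
    return TeethCheck(
        base_fails=not run_oracle(base_ref),
        fixed_passes=run_oracle(fixed_ref),
    )


# Hypothetical stand-in for a real oracle run: passes only on the fixed tree.
def stub_oracle(ref: str) -> bool:
    return ref == "fixed"


result = teeth_check(stub_oracle, "base", "fixed")
print(result.has_teeth)
```

An oracle that passes on the base commit, or fails even with the PR's own diff applied, would report `has_teeth == False` and should not be used as a correctness signal.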

Because M3 has 3 attempts per case while GPT 5.4 has 1, the report does not treat "M3 best-of-3" and "GPT one-shot" as the same metric. It reports run-level, any-pass, and stable-all-pass views separately.

## Result Summary

Two result scopes are reported because `pr2280` is mechanically valid but has a prompt/oracle boundary caveat.

With all 11 current cases:

- GPT 5.4: 6/11 pass, one attempt per case.
- M3: 13/33 run-level pass; 6/11 cases pass at least once; 3/11 cases pass all three attempts.
- `pr2280` is the only case where GPT 5.4 passes and M3 fails all three attempts.

Excluding `pr2280` (the stricter core-10 view):

- GPT 5.4: 5/10 pass, one attempt per case.
- M3: 13/30 run-level pass; 6/10 cases pass at least once; 3/10 cases pass all three attempts.
- In this view, there is no GPT-only pass case where M3 fails all three attempts.
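The three views used above (run-level, any-pass, stable-all-pass) can be derived from per-run rows roughly like this. The sketch assumes each row reduces to a `(case_id, passed)` pair; the `summarize` helper and the toy rows are illustrative only and are not the report's data.

```python
from collections import defaultdict
from typing import Iterable


def summarize(rows: Iterable[tuple[str, bool]]) -> dict[str, int]:
    """Aggregate per-run pass flags into run-level and case-level counts."""
    by_case: dict[str, list[bool]] = defaultdict(list)
    for case_id, passed in rows:
        by_case[case_id].append(passed)

    runs = [p for attempts in by_case.values() for p in attempts]
    return {
        "runs": len(runs),
        "run_pass": sum(runs),                                   # run-level view
        "cases": len(by_case),
        "any_pass": sum(any(a) for a in by_case.values()),       # passes at least once
        "all_pass": sum(all(a) for a in by_case.values()),       # passes every attempt
    }


# Toy data: three hypothetical cases with three attempts each.
toy_rows = [
    ("case_a", True), ("case_a", True), ("case_a", True),
    ("case_b", True), ("case_b", False), ("case_b", False),
    ("case_c", False), ("case_c", False), ("case_c", False),
]
summary = summarize(toy_rows)
print(summary)
```

With one attempt per case, `any_pass` and `all_pass` collapse to the same number, which is why a one-shot result is not directly comparable to a best-of-3 result.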

## Main Observations

1. M3 is a viable candidate for a real engineering worker.
   It can read the repo, locate relevant files, write patches, run checks, and pass independent oracles on multiple real cases.

2. The main issue is stability rather than basic task understanding.
   Some cases pass in one M3 attempt but fail in another. This suggests M3 is better used with repeat sampling and an independent oracle / review loop than as a single unverified final output.

3. `correctness_bounded` is not enough.
   In the core 10-case view, all 40 rows have `correctness_bounded=True`, but 22 of them still fail the independent oracle. This is a useful signal for agentic coding: local syntax checks or worker-written tests can look healthy while the target behavior is still wrong.

4. The hardest failures are deep state-system tasks.
   Both models struggled on cases involving session state, journal recovery, state.db replay, sidecar repair, and streaming/session-list consistency. These failures look more like incomplete system-state modeling than simple file-location misses.

5. `pr2280` should be read with care.
   The oracle is not misconfigured; it includes the correct i18n parity test. However, it also exposes a `pt` translation gap that the prompt does not explicitly state. The report therefore gives results both with and without `pr2280`.
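The counts in observation 3 follow from the core-10 totals in the Result Summary, and the same comparison can be run over result rows. In the sketch below the totals are the ones stated in this report; the `bounded_but_wrong` helper, the row layout (dicts keyed by `correctness_bounded` and `correctness_independent`), and the toy rows are assumptions for illustration and may not match the CSV schema in the gist.

```python
# Core-10 totals as stated in the Result Summary.
gpt_rows, gpt_pass = 10, 5    # GPT 5.4: one attempt per case
m3_rows, m3_pass = 30, 13     # M3: three attempts per case

total_rows = gpt_rows + m3_rows
independent_fail = total_rows - (gpt_pass + m3_pass)
print(total_rows, independent_fail)


def bounded_but_wrong(rows: list[dict]) -> list[dict]:
    """Rows whose local checks look healthy but that fail the independent oracle."""
    return [
        r for r in rows
        if r["correctness_bounded"] and not r["correctness_independent"]
    ]


# Toy rows (hypothetical, not taken from the report's data).
toy = [
    {"run": "r1", "correctness_bounded": True, "correctness_independent": True},
    {"run": "r2", "correctness_bounded": True, "correctness_independent": False},
]
flagged = bounded_but_wrong(toy)
print([r["run"] for r in flagged])
```

Because every core-10 row has `correctness_bounded=True`, each independent-oracle failure is also a bounded-but-wrong row.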

## Suggested Areas To Investigate

1. Stability across repeated runs on the same real engineering task.
2. Stronger self-checking against independent behavioral acceptance criteria.
3. Better convergence on deep state-system bugs involving persisted session/journal/runtime state.
4. Evaluation hygiene: fixed manifest, explicit run grouping, exact journal/session binding, and regular oracle teeth-checks.

## Limitations

- Small sample: 10-11 cases, all from one repository and one product domain.
- Sampling is asymmetric: GPT 5.4 has 1 attempt per case; M3 has 3 attempts per case.
- This is not a statistically significant benchmark and should not be used as a general model ranking.
- Some auxiliary signals (`scope_vs_codex`, `correctness_bounded`) are proxies and should not replace the independent oracle.

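## Summary (translated from Chinese)

This is a reference report on M3 as an engineering worker, compiled by an independent third-party developer from real Hermes WebUI PR cases. The goal is not to build a general model leaderboard, but to observe whether M3, in a real bug-fix agentic coding setting, can produce fixes that pass the independent oracle when given the same start commit and the same prompt.

The core conclusion: M3 already has the basic capabilities of a real engineering worker, but the stability of a single output is still insufficient, so it is better suited to repeated sampling followed by filtering with an independent oracle or human review. The deep-state cases that both models fail indicate that these tasks are inherently hard, especially where they involve multi-layer state chains such as session state, journal, state.db, and stream lifecycle.

`pr2280` is a case that needs separate explanation: the oracle is mechanically valid, but it includes a `pt` translation requirement that the prompt does not explicitly state. The report therefore gives both scopes, with and without `pr2280`, to avoid hiding this boundary.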

---

Shared as good-faith technical feedback. Report and data were prepared with AI assistance; all runs and result checks were performed locally by me.

