docs(spec): SHIP-TWO-001 §65 — SHIP-005 NOT-DISCHARGE (gx10 164-run pass@1 = 34.15%, 50pp below floor) by noahgift · Pull Request #1626 · paiml/aprender

noahgift · 2026-05-11T16:12:59Z

Summary

Records the empirical RED outcome from the gx10 164-problem HumanEval run. SHIP-005 does NOT LIVE-discharge.

Verdict

passed = 56/164
pass@1   = 0.3415  ← FAIL (50.65pp below 0.848 effective floor)
pass@10  = 0.9868
pass@100 = 1.0000
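
The pass@1 figure and the gap to the floor follow directly from the counts above. A minimal sketch of that arithmetic; the variable names are illustrative, and the 0.848 floor is the value quoted in the verdict line:

```python
# Reproduce the pass@1 verdict from the reported counts.
passed = 56
total = 164
effective_floor = 0.848  # effective floor quoted in the verdict

pass_at_1 = passed / total
gap_pp = (effective_floor - pass_at_1) * 100  # gap in percentage points

print(f"pass@1 = {pass_at_1:.4f}")             # pass@1 = 0.3415
print(f"gap    = {gap_pp:.2f}pp below floor")  # gap    = 50.65pp below floor
print("verdict:", "PASS" if pass_at_1 >= effective_floor else "FAIL")  # verdict: FAIL
```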

Critical signal — model IS capable

pass@1 = 34.15% (FAIL)
pass@10 = 98.68%
pass@100 = 100.00%

The gap between pass@1 and pass@10/pass@100 points to a bug in greedy (temperature 0) sampling/decoding, not in model knowledge. By these figures, the model can solve every problem given enough samples.
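
For reference, this is how pass@k is commonly computed per problem. This description does not say which formula `apr eval` uses, so the sketch below assumes the standard unbiased HumanEval estimator, 1 - C(n-c, k) / C(n, k), and the sample counts are hypothetical rather than taken from the gx10 run:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem counts (NOT from the gx10 run).
for n, c, k in [(10, 0, 1), (10, 3, 1), (10, 3, 10)]:
    print(f"n={n} c={c} k={k} -> pass@k = {pass_at_k(n, c, k):.2f}")
```

Under this estimator a problem needs at least k samples, so pass@10 and pass@100 are only meaningful when that many completions were drawn per problem.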

Three falsifiable hypotheses

H1 (priority 50%): gx10 teacher sha256 differs from lambda-vector

gx10: 0a854098d05b...
lambda-vector: a394dd2867...
Sync canonical artifact + rerun → resolves in ~5h compute
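
H1 can be checked before spending the rerun compute by hashing the teacher artifact on each host. A minimal sketch, assuming the artifact is a single file on disk; the file names are throwaway stand-ins, not the real artifact paths:

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on throwaway files; real use would point at the teacher artifact on each host.
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp, "teacher-a.bin")  # hypothetical file names
    b = Path(tmp, "teacher-b.bin")
    c = Path(tmp, "teacher-c.bin")
    a.write_bytes(b"teacher-weights")
    b.write_bytes(b"teacher-weights")
    c.write_bytes(b"different-weights")
    same = sha256_of(a) == sha256_of(b)
    differs = sha256_of(a) != sha256_of(c)
    digest_len = len(sha256_of(a))

print(same, differs)  # True True
```

On a real host, `sha256_of(path).startswith("0a854098d05b")` would tell whether the local file is the gx10 variant quoted above.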

H2 (priority 30%): align_continuation_indent (PR #1617) too aggressive

May dedent completions that are already correctly indented
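
H2 could be tested with a regression check that an alignment step never turns a valid completion into an invalid one. The sketch below is hypothetical: `naive_align` is a deliberately over-aggressive stand-in, not the actual `align_continuation_indent` from PR #1617, and the prompt and completion are made up for illustration:

```python
import ast
import textwrap

def parses(prompt: str, completion: str) -> bool:
    """True if prompt + completion is syntactically valid Python."""
    try:
        ast.parse(prompt + completion)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False

def naive_align(completion: str) -> str:
    """Hypothetical over-aggressive aligner: strips all common leading indent."""
    return textwrap.dedent(completion)

def breaks_valid_completion(align, prompt: str, completion: str) -> bool:
    """True if the completion was valid before alignment but invalid after."""
    return parses(prompt, completion) and not parses(prompt, align(completion))

prompt = "def add(a, b):\n"
completion = "    total = a + b\n    return total\n"  # already correctly indented

print(breaks_valid_completion(naive_align, prompt, completion))    # True
print(breaks_valid_completion(lambda c: c, prompt, completion))    # False
```

Running such a check over the failed task IDs would show whether alignment, rather than generation, produced the failures.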

H3 (priority 20%): BPE tokenization artifacts on complex prompts

Methodology Lesson #12 (NEW)

A directional empirical sample can lie about full-distribution performance. The 10-problem lambda-vector sample scored 80% pass@1, and its 95% CI of [44%, 97%] contained the 86% nominal floor, so it appeared to be a strong directional signal. The 164-problem run revealed 34.15%, well outside that CI. The interval assumes a random sample of problems, but the first 10 problems happen to be the easiest; harder problems concentrate later (HumanEval/100+).
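
The quoted interval can be reproduced for 8 passes out of 10. The method is not stated here, so this sketch assumes an exact (Clopper-Pearson) binomial interval, which does yield the quoted bounds; the helper names are illustrative:

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def bisect_root(f, lo: float = 0.0, hi: float = 1.0, iters: int = 60) -> float:
    """Root of a decreasing function f on [lo, hi] with f(lo) > 0 > f(hi)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else bisect_root(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p)))
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else bisect_root(
        lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper

lo, hi = clopper_pearson(8, 10)
print(f"[{lo:.0%}, {hi:.0%}]")   # [44%, 97%]
print(lo <= 0.86 <= hi)          # True: the nominal floor sits inside the interval
print(lo <= 56 / 164 <= hi)      # False: the 164-problem result sits outside it
```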

Ship-% movement

MODEL-1 ship %: stays at 94% (ceiling without further investigation). SHIP-005 remains PARTIAL pending H1/H2/H3.
MODEL-2 ship %: unchanged at 57%.

Changes

docs/specifications/aprender-train/ship-two-models-spec.md:
- Atomic next action: v3.10.0 → v3.11.0
- New §65 section ABOVE §63 (newest-first)
evidence/section-65-ship-005-not-discharge-2026-05-11/ (NEW):
- humaneval-164-gx10.json (raw apr eval --json, 1174 lines)
- per-problem-summary.json (passed/failed task IDs)
- findings.json (structured H1/H2/H3 analysis + verdict)

🤖 Generated with Claude Code

…64-run pass@1 = 34.15%) (PMAT-CODE-SHIP-TWO-SECTION-65)

Records the empirical RED outcome from the gx10 164-problem HumanEval run. SHIP-005 does NOT LIVE-discharge; 50pp gap from the 84.80% effective floor.

Critical signal that the model IS capable:
- pass@1 = 34.15% (FAIL)
- pass@10 = 98.68%
- pass@100 = 100.00%

The bug is in greedy-temperature-0 sampling/decoding, not in model knowledge.

Three falsifiable hypotheses:

H1 (50%): gx10 teacher sha256 differs from lambda-vector
- gx10: 0a854098d05b15921c173b7c8deb87c1cbecdffc66e918825c11a02775c73666
- lambda-vector: a394dd286732a5f32dfb983fd2ea0eeba4d6239ac4c47e44bcfe62f590ddeb28
- Sync canonical artifact + rerun → resolves H1 in ~5h.

H2 (30%): align_continuation_indent (PR #1617) too aggressive

H3 (20%): BPE tokenization artifacts on complex prompts

Methodology lesson #12 NEW: A directional empirical sample (10-problem 80%) can lie about full-distribution performance (164-problem 34%).

Spec movement: v3.10.0 → v3.11.0. MODEL-1 ship %: stays at 94%.

Closes task #39 PMAT-CODE-SHIP-TWO-SECTION-65.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-05-12T15:30:44Z

Closing as superseded — the §65→§71 cascade narrative is complete on main via PRs #1629/#1631/#1633/#1634/#1636/#1642 (and the in-tree §67/§68/§69/§70/§71 sections). SHIP-005 LIVE-DISCHARGED at 86.59% pass@1 (§71); see contracts/apr-eval-humaneval-harness-invariant-v1.yaml v1.1.0 for the empirical evidence and root cause.

noahgift enabled auto-merge (squash) May 11, 2026 16:13

noahgift closed this May 12, 2026

auto-merge was automatically disabled May 12, 2026 15:30
Pull request was closed

noahgift wants to merge 1 commit into main from docs/section-65-ship-005-not-discharge
