docs(research): 5 research artifacts + 3 bench harnesses from overnight fleet run#222
docs(research): 5 research artifacts + 3 bench harnesses from overnight fleet run#222Gradata wants to merge 8 commits into
Conversation
…ht fleet run Overnight 2026-05-20→21 the autonomous research fleet (analyst agent on claude-sonnet-4-6) produced these deliverables but heartbeat budgets ran out before agents pushed them to git. Surfaced today as uncommitted-but-real work in the SDK working tree. Promoting them manually so the work isn't lost. Research docs: - convergence-curve-math.md (206 lines): exponential / power law / smoothed-MA / cumulative-plateau comparison with shipping recommendation - patch-acceptance-2026-05-21.md (222): self-healing patch acceptance rate study - graduation-quality-2026-05-21.md (231): meta-rule graduation quality audit - many-shot-ablation-2026-05-21.md (244): k=10/20/50 ablation - embedding-vs-bm25-2026-05-21.md (283): cross-language scoring comparison Bench harnesses (runnable): - bench/curve_fitting.py (537): fits the 4 curve models, exports PNG charts + JSON results for the convergence research recommendation - bench/many_shot_ablation.py (628): bench harness for the many-shot ablation - bench/cross_language_scoring.py (630): bench for BM25 vs sentence-embedding scoring Authored: analyst agent (claude-sonnet-4-6) via Paperclip company fleet. Reviewed-by: parent agent (this PR) — content is substantive, references real codebase paths, recommendations have R²/AIC math behind them. Refs: research issues afeac9d4, b3e07178, 6cecf363, 4f527f65
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
📝 WalkthroughWalkthroughThis PR adds three new benchmark harnesses (cross_language_scoring.py, curve_fitting.py, many_shot_ablation.py), multiple research/write-up documents, self-improvement telemetry and an oscillation-cycle guard (with emits wired into Brain), and tests for the oscillation guard plus a parameterized CLI install smoke test. ChangesBenchmark and Research Infrastructure
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 OpenGrep (1.22.0)OpenGrep fatal error (exit code 2): �[32m✔�[39m �[1mOpengrep OSS�[0m �[1m Loading rules from local config...�[0m Comment |
There was a problem hiding this comment.
Actionable comments posted: 16
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@Gradata/bench/cross_language_scoring.py`:
- Line 425: The call a(f"**GRA-1299**: embedding vs BM25 on zero-term-overlap
probes ") uses an unnecessary f-string prefix causing Ruff F541; replace the
f-string with a plain string literal by removing the leading "f" in the argument
to function a (i.e., change the call in cross_language_scoring.py where
a(f"...") is used to a("...")), and verify there are no interpolations that
require f-strings before committing.
- Line 391: The p95 calculation uses int(0.95 * len(latencies)) which can
overshoot; change it to compute the 95th-rank as ceil(0.95 * n) - 1 and index
sorted(latencies) with that bounded index (and handle empty latencies). Update
the line computing p95_lat (referencing the latencies list and p95_lat variable)
to calculate rank = max(0, min(len(latencies)-1, math.ceil(0.95 *
len(latencies)) - 1)) and then use sorted(latencies)[rank]; ensure math.ceil is
imported/available and guard against empty latencies to avoid IndexError.
In `@Gradata/bench/curve_fitting.py`:
- Around line 272-280: When Matplotlib is unavailable make_charts currently
returns {} which breaks ProfileResult.chart_path and JSON output; change the
exception path to return a dict mapping the profile's model identifier to an
empty string (e.g., {profile.model: ""}) so the function always returns string
paths. Update the same pattern in the other try/except blocks referenced (around
the blocks at the later occurrences) so each returns a mapping with the
appropriate model key to an empty string instead of an empty dict; ensure
references are to make_charts and ProfileResult.chart_path so callers always get
a str path value.
- Around line 479-488: The current SQL filters out sessions with zero
corrections by using WHERE type = 'CORRECTION' and COUNT(*); update the query
used where db_path/conn are defined to compute per-session correction counts
including zeros: remove the type filter and replace COUNT(*) with SUM(CASE WHEN
type = 'CORRECTION' THEN 1 ELSE 0 END) AS cnt, keeping the session IS NOT NULL
AND session > 0 condition and the GROUP BY session ORDER BY session; make the
same change to the second occurrence around lines 491-495.
- Around line 132-137: The _r2 function currently returns 1.0 whenever ss_tot ==
0, which yields false perfect scores for constant y_true; change the logic in
_r2 to only return 1.0 when both ss_tot == 0 and ss_res == 0 (i.e., predictions
exactly match the constant target), otherwise return 0.0 for the constant-target
case so a non-matching prediction is not reported as perfect; update the branch
in _r2 that handles ss_tot == 0 accordingly and keep the existing 1.0 - ss_res /
ss_tot behavior for the general case.
In `@Gradata/bench/many_shot_ablation.py`:
- Around line 133-140: The math assumes k injected items even though top_k is
truncated to corpus length; update calculations to use the actual number of
retrieved/injected items (n = len(top_k)) instead of k when computing precision,
fp_count, and any downstream cost like context_tokens; specifically, replace
uses of k in the precision and fp_count formulas with n (and guard division by
zero), and apply the same change near the other occurrence (context_tokens /
related logic) so all noise/cost metrics reflect the true number of returned
items (use symbols ranked_indices, top_k, relevant_set, relevant_in_top_k,
precision, fp_count, context_tokens, and probe.relevant_indices to locate and
fix the code).
- Around line 153-164: Guard against empty probe sets by checking probe_results
and latencies before doing aggregations: if probe_results is empty (e.g.,
num_probes==0 or _build_probes returned []), avoid dividing by n and computing
p95; instead set n=0 and safe defaults (coverage_rate=0.0, mean_precision=0.0,
mean_fp_count=0.0, fp_rate=0.0, compliance_est=0.0) and for latencies set
avg_lat and p95_lat to None or 0.0; implement this check immediately before the
existing calculations that compute n, coverage_rate, mean_precision,
mean_fp_count, fp_rate, compliance_est, avg_lat and p95_lat, and ensure
NOISE_FACTOR usage remains guarded by the empty-check so you never divide by
zero or index into an empty sorted(latencies).
In `@Gradata/docs/research/embedding-vs-bm25-2026-05-21.md`:
- Around line 49-55: Update the construction rule wording to match the reported
outcomes: replace the strict "Jaccard token similarity < 0.05" requirement with
a relaxed/accurate statement (e.g., "Jaccard token similarity ≤ 0.07" or
"primarily < 0.05, with two exceptions at 0.06–0.07") so the rule and the
results (28 probes J=0.00; probes 11 and 12 at 0.06–0.07) are consistent; keep
reference to applying the same stopword list and tokenizer as jit_inject.py and
update the phrase that currently reads "the probe must have a Jaccard token
similarity < 0.05" accordingly.
In `@Gradata/docs/research/graduation-quality-2026-05-21.md`:
- Around line 47-53: The fenced code blocks in the markdown are missing a
language tag and lack surrounding blank lines; update each problematic fence
(the block shown and the ones starting at the other noted fences) to use a
language (e.g., ```text) and ensure there is a blank line immediately before and
after each fenced block so they pass MD040 and MD031. Also mirror this change
where similar examples are generated in the generator code path around the
_passes_beta_lb_gate() area in _graduation.py so future output includes the
```text fence and blank-line padding.
- Around line 1-2: The H1 heading "Graduation Quality Audit — GRA-1293" lacks a
trailing blank line; add a single blank line immediately after that heading so
there's an empty line between the H1 and the following metadata line to satisfy
MD022 (blanks-around-headings).
- Line 137: The sentence in REC-2's rationale ("Beta(α=4, β=1) at the 5th
percentile gives ~0.48 LB, which can exceed 0.75") is self-contradictory; update
the phrasing to a correct quantitative claim by replacing "which can exceed
0.75" with a correct relation (e.g., "which is well below 0.75" or specify the
correct percentile/value if you meant a different prior), and ensure the
surrounding sentence about requiring 5 observations instead of 3 is consistent
with the corrected numeric statement; look for the exact phrase "Beta(α=4, β=1)
at the 5th percentile gives ~0.48 LB" in the REC-2 rationale and edit that
sentence only.
In `@Gradata/docs/research/many-shot-ablation-2026-05-21.md`:
- Around line 40-42: Add a language tag to the fenced code block containing the
formula `compliance_est(k) = coverage(k) × (1 − 0.30 × fp_rate(k))` (e.g.,
change ``` to ```text) so markdownlint rule MD040 is satisfied; ensure the
opening fence contains the language token and the closing fence remains
unchanged.
In `@Gradata/docs/research/patch-acceptance-2026-05-21.md`:
- Line 6: Replace the stale branch identifier "GRA-1291-prompt-injection-survey"
in the document header with the correct branch name from the PR metadata
("docs/research-overnight-fleet-deliverables") so the header accurately reflects
the delivering branch; locate the header line containing the branch token and
update that string accordingly to maintain correct traceability.
- Around line 113-115: The fenced code block containing the snippet
"<original_rule> (especially in context: word1 word2 word3)" is missing a
language identifier; update that fenced block in the document so the opening
triple-backticks include a language (e.g., use "text") to satisfy markdownlint
MD040 and ensure consistent rendering—locate the fenced block that begins with
``` before the "<original_rule>" line and change it to ```text.
- Around line 176-191: The fenced JSON example (the block starting with ```json
and the shown rule_patch_observed object) lacks blank lines before and after the
code fence, violating MD031; fix it by inserting a blank line immediately above
the opening ```json fence and another blank line immediately below the closing
``` fence so the fenced code block is separated from surrounding text and
satisfies markdownlint.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2d84943b-fd9e-450f-ac67-fd9e64bb87f1
📒 Files selected for processing (8)
Gradata/bench/cross_language_scoring.pyGradata/bench/curve_fitting.pyGradata/bench/many_shot_ablation.pyGradata/docs/research/convergence-curve-math.mdGradata/docs/research/embedding-vs-bm25-2026-05-21.mdGradata/docs/research/graduation-quality-2026-05-21.mdGradata/docs/research/many-shot-ablation-2026-05-21.mdGradata/docs/research/patch-acceptance-2026-05-21.md
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: pytest (py3.12)
- GitHub Check: pytest macos-latest / py3.12
- GitHub Check: pytest windows-latest / py3.12
- GitHub Check: pytest ubuntu-latest / py3.12
- GitHub Check: pytest (py3.11)
- GitHub Check: pytest macos-latest / py3.11
- GitHub Check: pytest windows-latest / py3.11
- GitHub Check: pytest ubuntu-latest / py3.11
🧰 Additional context used
🪛 LanguageTool
Gradata/docs/research/patch-acceptance-2026-05-21.md
[style] ~107-~107: Consider an alternative for the overused word “exactly”.
Context: ... behavioral filter is needed — which is exactly what _patches.py provides. ### 2. Th...
(EXACTLY_PRECISELY)
Gradata/docs/research/convergence-curve-math.md
[style] ~119-~119: Consider an alternative for the overused word “exactly”.
Context: ... both parametric models fail — which is exactly where Mann-Kendall (already implemented...
(EXACTLY_PRECISELY)
[style] ~143-~143: To form a complete sentence, be sure to include a subject.
Context: ...LS slope:** The current implementation. Should be replaced. Linear fit on a decaying s...
(MISSING_IT_THERE)
Gradata/docs/research/many-shot-ablation-2026-05-21.md
[uncategorized] ~117-~117: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...eals a corpus gap, not a k problem. The two TONE probes that miss at k=50 are stylistica...
(EN_COMPOUND_ADJECTIVE_INTERNAL)
[style] ~149-~149: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ctive on this corpus. k=20 breaks even. k=50 is near-global injection rather than...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
Gradata/docs/research/embedding-vs-bm25-2026-05-21.md
[style] ~82-~82: If ‘chance’ means ‘possibility’, this phrase is redundant. Consider writing “chance”.
Context: ...33** | 0.714 | 0.833 | 14.40 | Random chance on a 30-document corpus is P@1 = 0.033 ...
(RANDOM_CHANCE)
🪛 markdownlint-cli2 (0.22.1)
Gradata/docs/research/patch-acceptance-2026-05-21.md
[warning] 113-113: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 177-177: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
Gradata/docs/research/many-shot-ablation-2026-05-21.md
[warning] 40-40: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
Gradata/docs/research/graduation-quality-2026-05-21.md
[warning] 1-1: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 47-47: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 109-109: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 129-129: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 151-151: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 182-182: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
🪛 Ruff (0.15.13)
Gradata/bench/curve_fitting.py
[warning] 104-107: Use ternary operator base = 5.0 * math.exp(-0.15 * n) + 1.0 if n <= 20 else 1.0 + rng.gauss(0, 0.3) instead of if-else-block
Replace if-else-block with base = 5.0 * math.exp(-0.15 * n) + 1.0 if n <= 20 else 1.0 + rng.gauss(0, 0.3)
(SIM108)
Gradata/bench/many_shot_ablation.py
[error] 353-353: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 363-363: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 373-373: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 394-394: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 608-608: f-string without any placeholders
Remove extraneous f prefix
(F541)
Gradata/bench/cross_language_scoring.py
[warning] 35-35: Import from collections.abc instead: Callable
Import from collections.abc
(UP035)
[error] 425-425: f-string without any placeholders
Remove extraneous f prefix
(F541)
🔇 Additional comments (2)
Gradata/docs/research/convergence-curve-math.md (1)
1-207: LGTM!Gradata/bench/many_shot_ablation.py (1)
353-353: ⚡ Quick winProvide the full original review comment and any verification outputs (shell/web results) to rewrite it
I don’t have the<review_comment>content or the verification results needed to produce an updated, accurate rewritten comment.
| } | ||
|
|
||
| avg_lat = sum(latencies) / len(latencies) | ||
| p95_lat = sorted(latencies)[int(0.95 * len(latencies))] |
There was a problem hiding this comment.
Fix p95 latency percentile indexing.
Current index selection can over-shoot the intended 95th-percentile rank for common sample sizes, so reported p95 can be inaccurate.
Proposed fix
- p95_lat = sorted(latencies)[int(0.95 * len(latencies))]
+ sorted_lat = sorted(latencies)
+ p95_idx = max(0, math.ceil(0.95 * len(sorted_lat)) - 1)
+ p95_lat = sorted_lat[p95_idx]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/bench/cross_language_scoring.py` at line 391, The p95 calculation
uses int(0.95 * len(latencies)) which can overshoot; change it to compute the
95th-rank as ceil(0.95 * n) - 1 and index sorted(latencies) with that bounded
index (and handle empty latencies). Update the line computing p95_lat
(referencing the latencies list and p95_lat variable) to calculate rank = max(0,
min(len(latencies)-1, math.ceil(0.95 * len(latencies)) - 1)) and then use
sorted(latencies)[rank]; ensure math.ceil is imported/available and guard
against empty latencies to avoid IndexError.
|
|
||
| a(f"# cross-language-scoring benchmark — {run_date}") | ||
| a("") | ||
| a(f"**GRA-1299**: embedding vs BM25 on zero-term-overlap probes ") |
There was a problem hiding this comment.
Remove the unused f-string prefix to satisfy lint.
This is flagged as Ruff F541 and can block CI if lint errors are enforced.
Proposed fix
- a(f"**GRA-1299**: embedding vs BM25 on zero-term-overlap probes ")
+ a("**GRA-1299**: embedding vs BM25 on zero-term-overlap probes ")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| a(f"**GRA-1299**: embedding vs BM25 on zero-term-overlap probes ") | |
| a("**GRA-1299**: embedding vs BM25 on zero-term-overlap probes ") |
🧰 Tools
🪛 Ruff (0.15.13)
[error] 425-425: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/bench/cross_language_scoring.py` at line 425, The call
a(f"**GRA-1299**: embedding vs BM25 on zero-term-overlap probes ") uses an
unnecessary f-string prefix causing Ruff F541; replace the f-string with a plain
string literal by removing the leading "f" in the argument to function a (i.e.,
change the call in cross_language_scoring.py where a(f"...") is used to
a("...")), and verify there are no interpolations that require f-strings before
committing.
| def _r2(y_true: np.ndarray, y_pred: np.ndarray) -> float: | ||
| ss_res = float(np.sum((y_true - y_pred) ** 2)) | ||
| ss_tot = float(np.sum((y_true - np.mean(y_true)) ** 2)) | ||
| if ss_tot == 0: | ||
| return 1.0 | ||
| return 1.0 - ss_res / ss_tot |
There was a problem hiding this comment.
Handle constant-series R² without false perfect scores.
For constant y_true, returning 1.0 unconditionally can misreport a bad fit as perfect.
Proposed fix
def _r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
ss_res = float(np.sum((y_true - y_pred) ** 2))
ss_tot = float(np.sum((y_true - np.mean(y_true)) ** 2))
if ss_tot == 0:
- return 1.0
+ return 1.0 if ss_res == 0 else 0.0
return 1.0 - ss_res / ss_tot🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/bench/curve_fitting.py` around lines 132 - 137, The _r2 function
currently returns 1.0 whenever ss_tot == 0, which yields false perfect scores
for constant y_true; change the logic in _r2 to only return 1.0 when both ss_tot
== 0 and ss_res == 0 (i.e., predictions exactly match the constant target),
otherwise return 0.0 for the constant-target case so a non-matching prediction
is not reported as perfect; update the branch in _r2 that handles ss_tot == 0
accordingly and keep the existing 1.0 - ss_res / ss_tot behavior for the general
case.
| def make_charts(profile: ProfileResult, out_dir: Path) -> dict[str, str]: | ||
| """Generate one multi-panel chart for this profile. Returns {model: path}.""" | ||
| try: | ||
| import matplotlib | ||
|
|
||
| matplotlib.use("Agg") | ||
| import matplotlib.pyplot as plt | ||
| except ImportError: | ||
| return {} |
There was a problem hiding this comment.
Fix make_charts return contract to always be a string path.
make_charts currently returns {} when Matplotlib is unavailable, which propagates a non-string into ProfileResult.chart_path and JSON output.
Proposed fix
-def make_charts(profile: ProfileResult, out_dir: Path) -> dict[str, str]:
- """Generate one multi-panel chart for this profile. Returns {model: path}."""
+def make_charts(profile: ProfileResult, out_dir: Path) -> str:
+ """Generate one multi-panel chart for this profile. Returns chart path or empty string."""
@@
- except ImportError:
- return {}
+ except ImportError:
+ return ""
@@
- return str(chart_path)
+ return str(chart_path)Also applies to: 366-367, 412-413
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/bench/curve_fitting.py` around lines 272 - 280, When Matplotlib is
unavailable make_charts currently returns {} which breaks
ProfileResult.chart_path and JSON output; change the exception path to return a
dict mapping the profile's model identifier to an empty string (e.g.,
{profile.model: ""}) so the function always returns string paths. Update the
same pattern in the other try/except blocks referenced (around the blocks at the
later occurrences) so each returns a mapping with the appropriate model key to
an empty string instead of an empty dict; ensure references are to make_charts
and ProfileResult.chart_path so callers always get a str path value.
| import sqlite3 | ||
|
|
||
| db_path = Path(args.brain_path) / "system.db" | ||
| if db_path.exists(): | ||
| conn = sqlite3.connect(str(db_path)) | ||
| rows = conn.execute( | ||
| "SELECT session, COUNT(*) as cnt FROM events " | ||
| "WHERE type = 'CORRECTION' AND session IS NOT NULL AND session > 0 " | ||
| "GROUP BY session ORDER BY session" | ||
| ).fetchall() |
There was a problem hiding this comment.
Include zero-correction sessions when loading real brain data.
This query drops sessions with zero corrections, which biases the real profile and can change model ranking/recommendation.
Proposed fix
- rows = conn.execute(
- "SELECT session, COUNT(*) as cnt FROM events "
- "WHERE type = 'CORRECTION' AND session IS NOT NULL AND session > 0 "
- "GROUP BY session ORDER BY session"
- ).fetchall()
+ rows = conn.execute(
+ "SELECT session, "
+ "SUM(CASE WHEN type = 'CORRECTION' THEN 1 ELSE 0 END) AS cnt "
+ "FROM events "
+ "WHERE session IS NOT NULL AND session > 0 "
+ "GROUP BY session ORDER BY session"
+ ).fetchall()
@@
- real_sessions = [r[0] for r in rows]
- real_corrections = [float(r[1]) for r in rows]
+ real_sessions = [int(r[0]) for r in rows]
+ real_corrections = [float(r[1]) for r in rows]Also applies to: 491-495
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/bench/curve_fitting.py` around lines 479 - 488, The current SQL
filters out sessions with zero corrections by using WHERE type = 'CORRECTION'
and COUNT(*); update the query used where db_path/conn are defined to compute
per-session correction counts including zeros: remove the type filter and
replace COUNT(*) with SUM(CASE WHEN type = 'CORRECTION' THEN 1 ELSE 0 END) AS
cnt, keeping the session IS NOT NULL AND session > 0 condition and the GROUP BY
session ORDER BY session; make the same change to the second occurrence around
lines 491-495.
| ``` | ||
| compliance_est(k) = coverage(k) × (1 − 0.30 × fp_rate(k)) | ||
| ``` |
There was a problem hiding this comment.
Add a language tag to the fenced code block (MD040).
Use a language like text for the formula block to satisfy markdownlint.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 40-40: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/docs/research/many-shot-ablation-2026-05-21.md` around lines 40 - 42,
Add a language tag to the fenced code block containing the formula
`compliance_est(k) = coverage(k) × (1 − 0.30 × fp_rate(k))` (e.g., change ``` to
```text) so markdownlint rule MD040 is satisfied; ensure the opening fence
contains the language token and the closing fence remains unchanged.
| | k jump | Δ coverage | Δ compliance | Δ context tokens | Compliance / 100 extra tokens | | ||
| |--------|-----------|-------------|-----------------|-------------------------------| | ||
| | 5→10 | +0.025 | **−0.005** | +75 | −0.007 | | ||
| | 10→20 | +0.025 | +0.006 | +150 | +0.004 | | ||
| | 20→50 | +0.050 | +0.027 | +450 | +0.006 | |
There was a problem hiding this comment.
Align marginal-efficiency metric with harness output for reproducibility.
This table is compliance-per-100-tokens, but the harness/report generator currently emits coverage-per-100-tokens. With the raw output link on Line 240, readers won’t be able to reproduce this section as-is.
Also applies to: 240-240
| **Status:** INSTRUMENTED — telemetry live, behavioral data pending | ||
| **Date:** 2026-05-21 | ||
| **Author:** analyst (claude_local / sonnet-4-6) | ||
| **Branch:** GRA-1291-prompt-injection-survey |
There was a problem hiding this comment.
Fix stale branch metadata in the document header.
Line 6 lists GRA-1291-prompt-injection-survey, but this artifact is being delivered from docs/research-overnight-fleet-deliverables per PR metadata. This can mislead traceability for future audits.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/docs/research/patch-acceptance-2026-05-21.md` at line 6, Replace the
stale branch identifier "GRA-1291-prompt-injection-survey" in the document
header with the correct branch name from the PR metadata
("docs/research-overnight-fleet-deliverables") so the header accurately reflects
the delivering branch; locate the header line containing the branch token and
update that string accordingly to maintain correct traceability.
| ``` | ||
| <original_rule> (especially in context: word1 word2 word3) | ||
| ``` |
There was a problem hiding this comment.
Add a language identifier to the fenced code block.
Line 113 starts a fenced block without a language tag (markdownlint MD040). Please label it (for example, text) to keep linting and rendering consistent.
Proposed fix
-```
+```text
<original_rule> (especially in context: word1 word2 word3)</details>
<!-- suggestion_start -->
<details>
<summary>📝 Committable suggestion</summary>
> ‼️ **IMPORTANT**
> Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 113-113: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/docs/research/patch-acceptance-2026-05-21.md` around lines 113 - 115,
The fenced code block containing the snippet "<original_rule> (especially in
context: word1 word2 word3)" is missing a language identifier; update that
fenced block in the document so the opening triple-backticks include a language
(e.g., use "text") to satisfy markdownlint MD040 and ensure consistent
rendering—locate the fenced block that begins with ``` before the
"<original_rule>" line and change it to ```text.
| Expected event schema: | ||
| ```json | ||
| { | ||
| "type": "rule_patch_observed", | ||
| "source": "_patches.observe_patch", | ||
| "data": { | ||
| "category": "TONE", | ||
| "old_rule_text": "Never use exclamation marks", | ||
| "new_rule_text": "Never use exclamation marks (especially in context: email removed draft)", | ||
| "applied_at": "2026-05-21T12:00:00+00:00", | ||
| "observed_compliance_before": 3, | ||
| "observed_compliance_after_3_sessions": null | ||
| }, | ||
| "tags": ["category:TONE", "self_healing", "patch_telemetry"] | ||
| } | ||
| ``` |
There was a problem hiding this comment.
Surround the JSON fence with blank lines to satisfy markdownlint.
The fenced JSON example around Line 177 should be separated by blank lines (MD031), which improves markdown parser compatibility.
Proposed fix
Expected event schema:
+
```json
{
"type": "rule_patch_observed",
"source": "_patches.observe_patch",
"data": {
"category": "TONE",
"old_rule_text": "Never use exclamation marks",
"new_rule_text": "Never use exclamation marks (especially in context: email removed draft)",
"applied_at": "2026-05-21T12:00:00+00:00",
"observed_compliance_before": 3,
"observed_compliance_after_3_sessions": null
},
"tags": ["category:TONE", "self_healing", "patch_telemetry"]
}</details>
<!-- suggestion_start -->
<details>
<summary>📝 Committable suggestion</summary>
> ‼️ **IMPORTANT**
> Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
Expected event schema:
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 177-177: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/docs/research/patch-acceptance-2026-05-21.md` around lines 176 - 191,
The fenced JSON example (the block starting with ```json and the shown
rule_patch_observed object) lacks blank lines before and after the code fence,
violating MD031; fix it by inserting a blank line immediately above the opening
```json fence and another blank line immediately below the closing ``` fence so
the fenced code block is separated from surrounding text and satisfies
markdownlint.
Real-world bug observed 2026-05-21 on production brain: lesson 911130b3 oscillated between two rule phrasings A and B for 5 consecutive rollbacks spanning 20 days, each marked '100% reduction' in the dashboard. Root cause: `observe_patch()` had no cycle detection — every time the compliance scorer flagged the current text as 'failing,' the patcher rewrote it back to the previous text without checking it had just patched away from that text. The 'reduction' metric games itself: the new text trivially shows zero failures because it has zero observations yet. Fix: new module `_oscillation_guard.py` is consulted before each `observe_patch()` call. Detects direct A→B then B→A cycles within a 30-day / 5-patch lookback window. On detection, emits a `rule_patch_cycle_detected` event and aborts the patch instead of recording another fake-reduction row. Conservative scope (only direct cycles, only within a single category, whitespace-normalized comparison) to minimize false-positive risk. The next sibling fix (`recurrence_change → insufficient_data when <3 sessions`) is filed as a separate cloud-side issue [040a09dd]. Tests: 12 new (oscillation_guard) + 166 existing self-healing/patch tests still green. Refs: paperclip issue 1983a5c6
Activates the oscillation guard restored in 41390bf — every successful Brain.patch_rule call now consults observe_patch, which detects A→B→A→B cycles via _oscillation_guard.detect_cycle and aborts the patch if a cycle is in flight. Without this hook the cherry-picked guard files were dead code. Also includes per-rule injection telemetry (second hunk, line 1200+) that was already in flight in the working tree; preserved so cloud sync / dashboard surfaces can attribute applied rules to real sessions.
Comprehensive Bayesian analysis of lift report Beta-binomial claims: - 7 claims evaluated for statistical defensibility - Core methodology is sound but observational, not causal - Identifies claims requiring A/B testing vs paired experiments - Proposes honest reframing for each claim - Recommends prior publication caveats (rename confidence to empirical success rate) The framework correctly estimates posterior success rates from observational data. Causal claims (rule improved agent) are indefensible without control group. Practical safeguards (drift detection, explicit contradictions) work well. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@Gradata/src/gradata/brain.py`:
- Around line 763-768: The current patch observation call swallows all
exceptions with "except Exception: pass" (observe_patch(self, category,
old_description, new_description)), hide failures; replace this with either
specific exception handlers (e.g., ImportError/ModuleNotFoundError or the exact
exceptions observe_patch may raise) or at minimum catch Exception as e and call
the module/logger warning with exception info (e.g.,
logger.warning("observe_patch failed for %s: %s", category, e, exc_info=True))
so errors are not silently dropped while preserving defensive behavior.
In `@Gradata/src/gradata/enhancements/self_improvement/_oscillation_guard.py`:
- Around line 114-120: The try/except in the cycle detection path (around the
brain.query_events call that returns events) and the similar block later (lines
~171-190) silently swallow exceptions and return None; update both to catch
Exception as e and log a warning including context (e.g., which guard, the query
parameters, and that we're failing open) using the module/class logger with
exc_info=True before returning the fail-open default so failures are visible in
logs; reference the brain.query_events call and the early return None locations
when making the change.
In `@Gradata/src/gradata/enhancements/self_improvement/_patches.py`:
- Around line 37-45: Replace the bare except handlers that silently "fail open"
with logged warnings including exception info: for the try around events =
brain.query_events(...) change "except Exception: return 0" to "except Exception
as e: logger.warning('failed to query RULE_FAILURE events; failing open',
exc_info=True)" and then return 0; apply the same pattern to the other silent
except blocks noted (the try/excepts around the blocks at ~60-66, 128-129,
146-149, 192-193, 213-216) — catch Exception as e and call logger.warning(...)
with a short context message and exc_info=True before preserving the original
fail-open return/flow; if no module logger exists, create one via
logging.getLogger(__name__).
- Around line 154-193: The loop currently emits a resolved event via brain.emit
in resolve_patch_compliance but leaves the original pending event unchanged so
it gets reprocessed; after a successful emit (the variable updated), mark the
original pending event id (ev.get("id")) as resolved by emitting or updating it
via brain.emit with a payload that sets "observed_compliance_after_3_sessions":
compliance_after (and/or a "resolved" tag) so pending_events queries will
exclude it; change the try block after updated is assigned to call brain.emit
(or the existing event-update mechanism) to persist the marker for the original
event id so resolve_patch_compliance becomes idempotent and prevents metric
drift.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0c893d35-201f-473f-bb42-145ec622d7f7
📒 Files selected for processing (6)
Gradata/docs/research/lift-report-defensibility.mdGradata/src/gradata/brain.pyGradata/src/gradata/enhancements/self_improvement/_oscillation_guard.pyGradata/src/gradata/enhancements/self_improvement/_patches.pyGradata/tests/test_cli_install_agent.pyGradata/tests/test_oscillation_guard.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: pytest ubuntu-latest / py3.12
- GitHub Check: pytest windows-latest / py3.12
- GitHub Check: pytest macos-latest / py3.12
- GitHub Check: pytest ubuntu-latest / py3.11
- GitHub Check: pytest windows-latest / py3.11
- GitHub Check: pytest macos-latest / py3.11
- GitHub Check: pytest (py3.11)
- GitHub Check: pytest (py3.12)
🧰 Additional context used
📓 Path-based instructions (2)
Gradata/tests/**/*.py
📄 CodeRabbit inference engine (Gradata/AGENTS.md)
Gradata/tests/**/*.py: SetBRAIN_DIRenvironment variable viatmp_pathin conftest.py for test isolation — ensure_paths.pymodule cache refreshes when callingBrain.init()directly inside tests
Add unit tests intests/test_*.pyfor every CI push without LLM calls (deterministic); mark integration tests with@pytest.mark.integrationand skip them by default (they hit real LLM APIs)
Files:
Gradata/tests/test_cli_install_agent.pyGradata/tests/test_oscillation_guard.py
Gradata/src/**/*.py
📄 CodeRabbit inference engine (Gradata/AGENTS.md)
Gradata/src/**/*.py: Prefersentence-transformersfor local embeddings,google-genaifor Gemini embeddings,cryptographyfor AES-GCM encrypted system.db,bm25sfor BM25 rule ranking, andmem0aifor external memory adapters — guard all optional dependency imports withtry / except ImportErrorat the call site, never at module level
Maintain strict layering: Layer 0 (Primitives: _types.py, _db.py, _events.py, _paths.py, _file_lock.py; Patterns: contrib/patterns/) must never import from Layer 1 (Enhancements: enhancements/, rules/) or Layer 2 (Public API: brain.py, cli.py, daemon.py, mcp_server.py)
Never use bareexcept: pass— use typed exceptions or at minimumlogger.warning(...)withexc_info=Trueto avoid silent failure in a memory product
Never import from out-of-scope sibling directories../Sprites/or../Hausgem/withingradata/*code — that is a layering bug
Never leak private-sibling paths into public docs/code — no references to../Sprites/,../Hausgem/, email addresses, OneDrive paths, or Sprites-specific examples from insidegradata/*
Use atomic-write helper when writing JSON files to prevent corruption from mid-write crashes
Files:
Gradata/src/gradata/brain.pyGradata/src/gradata/enhancements/self_improvement/_oscillation_guard.pyGradata/src/gradata/enhancements/self_improvement/_patches.py
🪛 LanguageTool
Gradata/docs/research/lift-report-defensibility.md
[style] ~11-~11: The wording of this phrase can be improved.
Context: ...he Gradata SDK evaluates whether rules "made the agent better" using a Bayesian Beta-binomial framewo...
(MAKE_STYLE_BETTER)
[style] ~11-~11: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ...d as binomial outcomes with a Beta(1,1) prior to compute a posterior mean and 95% CI. Th...
(EN_WORDINESS_PREMIUM_PRIOR_TO)
[style] ~45-~45: The wording of this phrase can be improved.
Context: ...s." Undefendable claim: "This rule made the agent better" without an experiment. A paired test (...
(MAKE_STYLE_BETTER)
[style] ~56-~56: Consider an alternative for the overused word “exactly”.
Context: ...rior, the Bayesian credible interval is exactly the HDI (highest density interval) or q...
(EXACTLY_PRECISELY)
[style] ~137-~137: ‘in decline’ might be wordy. Consider a shorter alternative.
Context: ...fter it happens), not proactive. A rule in decline from session 100 to 200 might maintain ...
(EN_WORDINESS_PREMIUM_IN_DECLINE)
🪛 markdownlint-cli2 (0.22.1)
Gradata/docs/research/lift-report-defensibility.md
[warning] 145-145: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 150-150: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 180-180: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 186-186: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
[warning] 191-191: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
🔇 Additional comments (4)
Gradata/tests/test_cli_install_agent.py (1)
8-9: LGTM!Also applies to: 60-158
Gradata/docs/research/lift-report-defensibility.md (1)
1-205: LGTM!Gradata/src/gradata/brain.py (1)
1203-1231: LGTM!Gradata/tests/test_oscillation_guard.py (1)
92-232: LGTM!
| try: | ||
| from gradata.enhancements.self_improvement._patches import observe_patch | ||
|
|
||
| observe_patch(self, category, old_description, new_description) | ||
| except Exception: # pragma: no cover — defensive | ||
| pass |
There was a problem hiding this comment.
Replace silent swallow in patch observation path.
Line 767 swallows all errors (except Exception: pass), which hides telemetry/guard regressions after successful patching.
Proposed fix
try:
from gradata.enhancements.self_improvement._patches import observe_patch
observe_patch(self, category, old_description, new_description)
except Exception: # pragma: no cover — defensive
- pass
+ logger.warning("observe_patch failed in Brain.patch_rule", exc_info=True)As per coding guidelines, "Never use bare except: pass — use typed exceptions or at minimum logger.warning(...) with exc_info=True to avoid silent failure in a memory product".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try: | |
| from gradata.enhancements.self_improvement._patches import observe_patch | |
| observe_patch(self, category, old_description, new_description) | |
| except Exception: # pragma: no cover — defensive | |
| pass | |
| try: | |
| from gradata.enhancements.self_improvement._patches import observe_patch | |
| observe_patch(self, category, old_description, new_description) | |
| except Exception: # pragma: no cover — defensive | |
| logger.warning("observe_patch failed in Brain.patch_rule", exc_info=True) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/src/gradata/brain.py` around lines 763 - 768, The current patch
observation call swallows all exceptions with "except Exception: pass"
(observe_patch(self, category, old_description, new_description)), hide
failures; replace this with either specific exception handlers (e.g.,
ImportError/ModuleNotFoundError or the exact exceptions observe_patch may raise)
or at minimum catch Exception as e and call the module/logger warning with
exception info (e.g., logger.warning("observe_patch failed for %s: %s",
category, e, exc_info=True)) so errors are not silently dropped while preserving
defensive behavior.
| try: | ||
| events = brain.query_events( | ||
| event_type="rule_patch_observed", | ||
| limit=200, | ||
| ) | ||
| except Exception: | ||
| return None # Fail open — never block patches on a query failure. |
There was a problem hiding this comment.
Avoid silent failure in cycle detection/emission paths.
Line 119 and Line 189 return fail-open defaults with no logging. That hides guard failures and makes patch-loop incidents hard to debug.
As per coding guidelines, "Never use bare except: pass — use typed exceptions or at minimum logger.warning(...) with exc_info=True to avoid silent failure in a memory product".
Also applies to: 171-190
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/src/gradata/enhancements/self_improvement/_oscillation_guard.py`
around lines 114 - 120, The try/except in the cycle detection path (around the
brain.query_events call that returns events) and the similar block later (lines
~171-190) silently swallow exceptions and return None; update both to catch
Exception as e and log a warning including context (e.g., which guard, the query
parameters, and that we're failing open) using the module/class logger with
exc_info=True before returning the fail-open default so failures are visible in
logs; reference the brain.query_events call and the early return None locations
when making the change.
| try: | ||
| events = brain.query_events( | ||
| event_type="RULE_FAILURE", | ||
| last_n_sessions=lookback_sessions, | ||
| limit=500, | ||
| ) | ||
| except Exception: | ||
| return 0 | ||
|
|
There was a problem hiding this comment.
Add warning logs for fail-open exception paths.
These handlers fail open silently. In this telemetry module, that makes production diagnosis difficult when queries/emits degrade.
As per coding guidelines, "Never use bare except: pass — use typed exceptions or at minimum logger.warning(...) with exc_info=True to avoid silent failure in a memory product".
Also applies to: 60-66, 128-129, 146-149, 192-193, 213-216
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/src/gradata/enhancements/self_improvement/_patches.py` around lines
37 - 45, Replace the bare except handlers that silently "fail open" with logged
warnings including exception info: for the try around events =
brain.query_events(...) change "except Exception: return 0" to "except Exception
as e: logger.warning('failed to query RULE_FAILURE events; failing open',
exc_info=True)" and then return 0; apply the same pattern to the other silent
except blocks noted (the try/excepts around the blocks at ~60-66, 128-129,
146-149, 192-193, 213-216) — catch Exception as e and call logger.warning(...)
with a short context message and exc_info=True before preserving the original
fail-open return/flow; if no module logger exists, create one via
logging.getLogger(__name__).
| for ev in pending_events: | ||
| data = ev.get("data", {}) | ||
| if data.get("observed_compliance_after_3_sessions") is not None: | ||
| continue | ||
|
|
||
| patch_session = ev.get("session") or 0 | ||
| if current_session - patch_session < min_session_gap: | ||
| continue | ||
|
|
||
| category = data.get("category", "") | ||
| new_rule_text = data.get("new_rule_text", "") | ||
|
|
||
| compliance_after = _count_failures_for_rule(brain, category, new_rule_text) | ||
| compliance_before = data.get("observed_compliance_before") or 0 | ||
| improved = compliance_after < compliance_before | ||
|
|
||
| try: | ||
| updated = brain.emit( | ||
| "rule_patch_observed", | ||
| "_patches.resolve_patch_compliance", | ||
| { | ||
| **data, | ||
| "observed_compliance_after_3_sessions": compliance_after, | ||
| "compliance_improved": improved, | ||
| "resolution_session": current_session, | ||
| "original_event_id": ev.get("id"), | ||
| }, | ||
| [f"category:{category}", "self_healing", _PATCH_TAG, "resolved"], | ||
| ) | ||
| updates.append( | ||
| { | ||
| "category": category, | ||
| "compliance_before": compliance_before, | ||
| "compliance_after": compliance_after, | ||
| "improved": improved, | ||
| "event": updated if isinstance(updated, dict) else {}, | ||
| } | ||
| ) | ||
| except Exception: | ||
| continue |
There was a problem hiding this comment.
Make compliance resolution idempotent to prevent metric drift.
resolve_patch_compliance() appends resolved events but leaves the original pending event eligible for future runs. That causes repeated re-resolution of the same patch, inflating resolved counts and leaving pending counts permanently noisy in patch_acceptance_rate().
Proposed fix
- try:
- pending_events = brain.query_events(event_type="rule_patch_observed", limit=200)
+ try:
+ events = brain.query_events(event_type="rule_patch_observed", limit=500)
except Exception:
return []
+ resolved_original_ids = {
+ (e.get("data", {}) or {}).get("original_event_id")
+ for e in events
+ if (e.get("data", {}) or {}).get("observed_compliance_after_3_sessions") is not None
+ and (e.get("data", {}) or {}).get("original_event_id")
+ }
+ pending_events = [
+ e
+ for e in events
+ if (e.get("data", {}) or {}).get("observed_compliance_after_3_sessions") is None
+ and e.get("id") not in resolved_original_ids
+ ]
+
current_session = _get_current_session(brain)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Gradata/src/gradata/enhancements/self_improvement/_patches.py` around lines
154 - 193, The loop currently emits a resolved event via brain.emit in
resolve_patch_compliance but leaves the original pending event unchanged so it
gets reprocessed; after a successful emit (the variable updated), mark the
original pending event id (ev.get("id")) as resolved by emitting or updating it
via brain.emit with a payload that sets "observed_compliance_after_3_sessions":
compliance_after (and/or a "resolved" tag) so pending_events queries will
exclude it; change the try block after updated is assigned to call brain.emit
(or the existing event-update mechanism) to persist the marker for the original
event id so resolve_patch_compliance becomes idempotent and prevents metric
drift.
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
This reverts commit 4adc548.
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Summary
Overnight 2026-05-20→21 the autonomous research fleet produced these deliverables but heartbeat budgets ran out before agents called
gh pr create. The files sat uncommitted on local disk all morning. Promoting them manually so the work isn't lost.Research docs (1,186 lines)
convergence-curve-math.md— 4-model comparison (exponential / power-law / smoothed-MA / cumulative-plateau) with shipping recommendationpatch-acceptance-2026-05-21.md— self-healing patch acceptance rate studygraduation-quality-2026-05-21.md— meta-rule graduation quality auditmany-shot-ablation-2026-05-21.md— k=10/20/50 ablationembedding-vs-bm25-2026-05-21.md— cross-language scoring comparisonBench harnesses (1,795 lines, runnable)
bench/curve_fitting.py— fits 4 curve models, exports PNG + JSONbench/many_shot_ablation.py— many-shot benchbench/cross_language_scoring.py— BM25 vs embedding benchReview focus
Content is substantive (not LLM slop). References real codebase paths, recommendations have R²/AIC math. Skim the convergence-curve doc first — that's the highest-leverage one and directly informs in-flight ENG issues [441311ff] (smoothed cumulative curve) and [029731fe] (exponential-fit overlay).
Out of scope
Implementation of the recommendations — those are separate ENG PRs in flight.