Date: 2026-02-05
Scope: existing benchmark outputs are kept as-is for presentation; no reruns planned before presentation.
This note records known benchmark script/reporting caveats so interpretation is transparent.
- `loci_with_validated_precursor` naming is misleading in the lasso holdout summary.
  - Script: `scripts/benchmark_pipeline_validation.py`
  - The value is incremented once per evaluated locus-like event, not once per unique GBK file.
  - Impact: label/readability issue.
  - Performance bias risk: does not inflate top-k recall metrics.
- Redundant summary fields in lasso holdout summary.
  - Script: `scripts/benchmark_pipeline_validation.py`
  - `topk` and `locus_level_topk` currently report the same calculation.
  - Impact: presentation ambiguity (duplicate stats), not numerical corruption.
  - Performance bias risk: none (same values, not optimistic values).
- Silent exception handling while indexing GBK features.
  - Scripts:
    - `scripts/benchmark_pipeline_validation.py`
    - `beta-lactamase-bench/scripts/benchmark_beta_lactamase.py`
  - Some parse/record failures may be skipped without surfaced counts.
  - Impact: potential undercount of evaluable loci.
  - Performance bias risk: typically conservative/neutral; it does not create artificial “better” hits.
- `record_id` alone in `holdout_topk.tsv` is not globally unique.
  - Scripts:
    - `scripts/benchmark_pipeline_validation.py`
    - `beta-lactamase-bench/scripts/benchmark_beta_lactamase.py`
  - The same `record_id` could appear across different GBK files.
  - Impact: traceability issue when auditing rows manually.
  - Performance bias risk: none for computed aggregate recall.
- Core benchmark top-k helper is slightly unclear in denominator intent.
  - Script: `scripts/benchmark_core_prediction_lab_dataset.py`
  - Implementation treats unresolved ranks (`None`) as failures via denominator choice, but code style is easy to misread.
  - Impact: readability/maintainability.
  - Performance bias risk: conservative if anything (not optimistic).
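
The denominator intent described in the last item can be made explicit in a few lines. This is a minimal sketch, not the helper in `scripts/benchmark_core_prediction_lab_dataset.py`: the function name `topk_recall` and the input shape (a list of 1-based ranks, with `None` for unresolved cases) are assumptions made for illustration.

```python
from typing import Optional, Sequence


def topk_recall(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of all evaluated cases whose true item ranks within the top k.

    Unresolved ranks (None) stay in the denominator, so they count as misses.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    total = len(ranks)  # denominator: every evaluated case, resolved or not
    if total == 0:
        return 0.0
    # A hit requires a resolved rank that falls within the top k.
    hits = sum(1 for rank in ranks if rank is not None and rank <= k)
    return hits / total
```

With ranks `[1, 3, None, 7]` and `k = 3`, this returns 2/4 = 0.5. Dropping the `None` entry from the denominator would give 2/3, which is the optimistic variant the conservative reading avoids.
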
- Reported benchmark recall values are still valid for ranking performance discussion.
- Known issues are primarily naming/reporting/traceability quality problems.
- None of the listed issues are known to artificially improve benchmark recall or create optimistic bias.
- Align summary field names with true counting semantics.
- Remove redundant summary blocks.
- Log and count skipped records/exceptions explicitly.
- Add globally unique row identifiers (`gbk_path + record_id`) in top-k outputs.
- Clarify/standardize denominator semantics in core benchmark helper.
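
Two of the items above (explicit skip counting and globally unique row identifiers) could look roughly like the following. This is a hedged sketch under stated assumptions: `make_row_id`, `index_records`, the `::` separator, and the `parse` callback are hypothetical names chosen for illustration and do not reflect the current scripts.

```python
import logging
from collections import Counter
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("benchmark")


def make_row_id(gbk_path: str, record_id: str) -> str:
    """Combine file path and record ID so rows stay unique across GBK files."""
    return f"{gbk_path}::{record_id}"  # separator is an arbitrary choice


def index_records(
    items: Iterable[Tuple[str, str]],
    parse: Callable[[str, str], object],
) -> Tuple[List[Tuple[str, object]], Counter]:
    """Parse each (gbk_path, record_id) pair and count failures by exception type."""
    indexed: List[Tuple[str, object]] = []
    skipped: Counter = Counter()
    for gbk_path, record_id in items:
        row_id = make_row_id(gbk_path, record_id)
        try:
            parsed = parse(gbk_path, record_id)
        except Exception as exc:  # broad on purpose: every skip must be counted
            skipped[type(exc).__name__] += 1
            logger.warning("skipped %s: %s", row_id, exc)
            continue
        indexed.append((row_id, parsed))
    return indexed, skipped
```

Reporting `sum(skipped.values())` alongside the recall figures would show how many records were dropped from the evaluable set, and the composite row ID lets a reviewer trace any row back to a single GBK file.
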