feat: test failure comparison system with HTML diff reports #134

Merged
joellabes merged 43 commits into main from feature/test-failure-comparison on Mar 18, 2026

Conversation

@joellabes joellabes commented Mar 13, 2026

Summary

  • Adds a post-test comparison pipeline that detects failing dbt equality tests, dumps both sides to parquet, runs structured comparison, and generates HTML diff reports
  • Detects systematic diffs (e.g. true → t across all rows) via a column-exclusion pre-scan and shows them as a compact banner instead of per-row noise
  • Remaining row diffs render as a compact flat table with inline expected → actual notation
  • Comparison artifacts are extracted from containers and integrated into the existing HTML experiment reports with a "Comparisons" link

New files

  • shared/scripts/detect_failing_equality_tests.py — reads run_results.json + manifest.json to find failing equality test pairs
  • shared/scripts/dump_tables.py — exports relations to parquet/CSV (DuckDB + Snowflake)
  • shared/scripts/compare_tables.py — column diff, EXCEPT ALL row matching, fuzzy pairing with numeric tolerance, systematic diff detection, HTML rendering
  • shared/scripts/run-comparison.sh — container-side orchestration
  • tests/scripts/ — 34 unit tests
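
A minimal sketch of the detection step, for readers who want the shape of it without opening the script. It assumes the two artifacts are already loaded as dicts, that failing tests carry a status of "fail" or "error" in run_results.json, and that equality tests can be recognised by the word "equality" in the node name. The function name and the name-based filter are illustrative, not necessarily what detect_failing_equality_tests.py does.

```python
def find_failing_equality_tests(run_results: dict, manifest: dict) -> list:
    """Return failing equality tests together with the nodes they depend on."""
    failing = []
    for result in run_results.get("results", []):
        if result.get("status") not in ("fail", "error"):
            continue
        node = manifest.get("nodes", {}).get(result.get("unique_id"), {})
        # Assumption: equality tests are identifiable by name.
        if node.get("resource_type") != "test":
            continue
        if "equality" not in node.get("name", ""):
            continue
        failing.append(
            {
                "test": result["unique_id"],
                # The actual/expected relations are among the test's dependencies.
                "relations": node.get("depends_on", {}).get("nodes", []),
            }
        )
    return failing
```

The caller would load both files with json.load and pass the resulting dicts in.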

Modified files

  • ade_bench/harness.py — extracts /app/comparisons/ from container after tests
  • ade_bench/handlers/trial_handler.py — adds comparison scripts to test_scripts list
  • shared/scripts/run-dbt-test.sh — calls run-comparison.sh after dbt test
  • scripts_python/generate_results_html.py — generates comparisons.html detail page
  • scripts_python/summarize_results.py — adds conditional Comparisons link to index
  • docker/base/Dockerfile.snowflake-* — adds duckdb pip dependency for comparison engine

Test plan

  • 34 unit tests passing (uv run pytest tests/scripts/ -v)
  • Run benchmark with known-failing task, verify comparisons dir appears in trial output
  • Verify HTML report shows Comparisons link and rendered diff page
  • Verify systematic diffs show as banner (e.g. boolean formatting True → t)

🤖 Generated with Claude Code

joellabes and others added 30 commits March 5, 2026 15:32
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…th inserts)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds an alternate solution seed for snap__hosts and updates the equality
test to check against multiple answer keys (pass if any match).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… snap_hosts seed

Three bugs fixed in interact.py:
1. Pass allowed_tools from plugin set config when creating agent
2. Use run-tests.sh pipeline instead of bare dbt test
3. Copy test_setup script to container

Simplify airbnb010 solution seeds:
- Remove alternate snap_hosts seed with renamed columns
- Keep only raw-column seed (ID/NAME) as single answer key
- Update _no-op.txt seed schema and regenerate test files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sion compat

Add daily cache-busting for the dbt-fusion curl install in Docker
builds so new fusion versions are picked up automatically. Also add
dbt-fusion variants to airbnb010/011 task configs and update test SQL
to use load_relation(ref()) for fusion compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove the current_superhost_tenure column from prompts, solution
models, and seed CSVs for both airbnb010 and airbnb011. This column
relied on now() being frozen, but agents reasonably use CURRENT_DATE
instead, which made it an unfair test of snapshot logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds script to detect failing equality tests from dbt run_results.json
and resolve actual/expected relation pairs from manifest.json.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Changed _extract_comparison_artifacts to use trial_handler._task_output_path
(e.g. airbnb011/airbnb011.base.1-of-1/) instead of trial_handler.output_path
(experiment root). This ensures each trial run gets its own comparisons dir.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a column differs in >=90% of paired rows, it's promoted to a
"Systematic Diffs" banner showing the distinct value mappings (e.g.
true → t, false → f). Those columns are stripped from per-row diffs,
and remaining diffs render as a compact flat table with arrow notation
(expected → actual) instead of collapsible per-row details.

Also fixes comparison artifact extraction to scope to the trial
directory (_task_output_path) instead of the experiment root.
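
The >=90% rule above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in compare_tables.py: rows are assumed to be already paired as (expected, actual) dicts with the same keys, and the function and constant names are made up for the example.

```python
from collections import Counter

SYSTEMATIC_THRESHOLD = 0.9  # ">=90% of paired rows", per the commit message


def find_systematic_diffs(pairs, threshold=SYSTEMATIC_THRESHOLD):
    """Return {column: Counter({(expected, actual): count})} for systematic columns.

    pairs is a list of (expected_row, actual_row) dicts sharing the same keys.
    """
    if not pairs:
        return {}
    mappings = {}
    for expected, actual in pairs:
        for col, exp_val in expected.items():
            act_val = actual.get(col)
            if exp_val != act_val:
                mappings.setdefault(col, Counter())[(exp_val, act_val)] += 1
    # Keep only columns that differ in at least `threshold` of the paired rows.
    return {
        col: counts
        for col, counts in mappings.items()
        if sum(counts.values()) / len(pairs) >= threshold
    }
```

The keys of each Counter are the distinct value mappings that the banner would list; columns returned here would then be dropped from the per-row diff table.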

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
joellabes and others added 4 commits March 13, 2026 15:33
When zero rows match exactly and both tables have equal row counts,
try excluding each column one at a time from EXCEPT ALL. If excluding
a column recovers >50% of rows as matches, flag it as systematic.
Then redo EXCEPT ALL without those columns so that fuzzy matching
and row counts work correctly.

This handles the case where a formatting difference (e.g. true→t)
in a single column causes every row to fail exact matching, which
previously exceeded the fuzzy matching cap on large tables and
produced an unhelpful "2180 missing / 2180 extra" report.
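
For illustration, here is the pre-scan idea in plain Python. A Counter-based multiset difference stands in for SQL EXCEPT ALL (the PR runs this in DuckDB), rows are tuples ordered like the column list, and all names are hypothetical.

```python
from collections import Counter


def except_all(left, right):
    """Multiset difference: rows of left with no one-for-one match in right."""
    return list((Counter(left) - Counter(right)).elements())


def prescan_systematic_columns(expected, actual, columns, min_recovered=0.5):
    """Flag columns whose exclusion recovers more than min_recovered of rows."""
    if not expected or len(expected) != len(actual):
        return []
    if len(except_all(expected, actual)) < len(expected):
        return []  # some rows already match exactly, so the pre-scan is skipped
    flagged = []
    for i, col in enumerate(columns):
        # Drop column i from every row on both sides, then re-run the difference.
        exp_wo = [row[:i] + row[i + 1:] for row in expected]
        act_wo = [row[:i] + row[i + 1:] for row in actual]
        recovered = 1 - len(except_all(exp_wo, act_wo)) / len(expected)
        if recovered > min_recovered:
            flagged.append(col)
    return flagged
```

After flagging, the comparison would be repeated with the flagged columns removed, so that row counts and fuzzy matching operate on the remaining columns only.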

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The JOIN-based approach for collecting sample value pairs produced
empty results on large tables with many-to-many joins. Replaced it with
a simpler approach: get distinct values from each side independently,
pair them positionally, and filter to mismatches.
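
A rough sketch of the positional pairing, assuming the distinct values are ordered by their string form before being zipped together. The commit message does not state the ordering, so that part and the function name are assumptions.

```python
def sample_value_pairs(expected_values, actual_values, limit=10):
    """Pair distinct values from each side by position and keep mismatches."""
    # Assumption: sort by string form so values of mixed types stay comparable.
    exp = sorted(set(expected_values), key=str)
    act = sorted(set(actual_values), key=str)
    return [(e, a) for e, a in zip(exp, act) if e != a][:limit]
```

Because the pairing is positional rather than join-based, it is a heuristic for display: it gives sensible samples for simple reformatting such as true/false versus t/f, but it does not prove that a given expected value maps to a given actual value.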

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove unused imports and variables, apply black formatting to all
files changed on this branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@joellabes joellabes changed the base branch from main to feature/airbnb010-scd2-snapshot March 13, 2026 03:27
joellabes and others added 2 commits March 16, 2026 16:34
@joellabes joellabes changed the base branch from feature/airbnb010-scd2-snapshot to main March 16, 2026 05:27
joellabes and others added 7 commits March 16, 2026 18:28
…comparison

# Conflicts:
#	tasks/airbnb010/solutions/dim_superhost_evolution.sql
#	tasks/airbnb010/solutions/snap__hosts.sql
#	tasks/airbnb010/solutions/src_hosts.sql
#	tasks/airbnb010/task.yaml
#	tasks/airbnb011/seeds/solution__snap__hosts.csv
#	tasks/airbnb011/solutions/dim_superhost_evolution.sql
#	tasks/airbnb011/solutions/snap__hosts.sql
#	tasks/airbnb011/solutions/src_hosts.sql
#	tasks/airbnb011/task.yaml
- Use parameterized queries for read_parquet() to avoid path injection
- Add 30s timeout to docker exec/cp in _extract_comparison_artifacts
- Wrap DuckDB connections in try/finally for proper cleanup
- Pass relation names via temp file instead of inline shell interpolation
- Cap rendered HTML rows at 250 with truncation notice
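
Two of the hardening items above, the 30s timeout and the 250-row cap, might look something like this. It is a sketch using only the standard library; the helper names and the notice wording are invented for the example and are not taken from the PR.

```python
import subprocess

MAX_RENDERED_ROWS = 250      # row cap from the list above
DOCKER_TIMEOUT_SECONDS = 30  # timeout from the list above


def run_with_timeout(cmd, timeout=DOCKER_TIMEOUT_SECONDS):
    """Run a command (for example a docker cp call); return None on timeout."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None


def cap_rows(rows, limit=MAX_RENDERED_ROWS):
    """Return (rows_to_render, notice); notice is None if nothing was cut."""
    if len(rows) <= limit:
        return rows, None
    return rows[:limit], f"Showing first {limit} of {len(rows)} rows"
```

Returning None on timeout lets the caller treat a hung container as "no comparison artifacts" rather than failing the whole trial, which seems to be the intent of the change.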

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…comparisons

- Add alternate seed snap__hosts_aliased with HOST_ID/HOST_NAME columns
  for airbnb010 and airbnb011 tasks
- Rename comparisons directory to data_comparisons everywhere
- Rename HTML buttons: "Diffs" → "File Diffs", "Comparisons" → "Data Comparisons"
- Only compare against first solution seed (primary answer key) when
  multiple alternates exist

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… matching

The old approach rendered dbt_utils.test_equality() for each answer key
variant in separate CTEs. When column names didn't match between the
actual table and a variant seed, the macro threw a compiler error that
killed the entire test — even if another variant would have passed.

Now all equality tests use a unified Jinja loop that:
1. Gets columns via adapter.get_columns_in_relation() for both sides
2. Skips variants where column names don't match (instead of erroring)
3. Runs the EXCEPT-based comparison via run_query() at compile time
4. Short-circuits on first matching variant
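
The control flow of those four steps, written out in Python rather than Jinja for readability. This is only a model of the logic: the Table type is hypothetical, column names are compared as sorted lists, and a multiset comparison stands in for the EXCEPT-based query that the real macro runs through run_query().

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Table:
    columns: list
    rows: list  # list of tuples ordered like columns


def _multiset(table, column_order):
    idx = [table.columns.index(c) for c in column_order]
    return Counter(tuple(row[i] for i in idx) for row in table.rows)


def first_matching_variant(actual, variants):
    """Return the name of the first answer-key variant equal to actual, else None."""
    order = sorted(actual.columns)
    for name, variant in variants.items():
        if sorted(variant.columns) != order:
            continue  # step 2: skip mismatched columns instead of erroring
        if _multiset(actual, order) == _multiset(variant, order):
            return name  # step 4: short-circuit on the first match
    return None
```

A variant with different column names is simply passed over here, which is the behaviour change the commit describes: one incompatible answer key no longer prevents another from passing.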

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace 75-line inline Jinja block in all AUTO_*_equality.sql tests with
a call to a new ade_bench_equality_test macro. The macro is generated
alongside the test files and copied into /app/macros/ in the container
at test runtime.

- Add EQUALITY_MACRO_FILENAME and get_equality_macro_content() to test_generator.py
- generate_equality_test() now emits a single macro call instead of inline logic
- generate_solution_tests() accepts macros_dir and writes the macro there
- harness.py copies generated macros to /app/macros/ and writes them back
  to tasks/<id>/macros/ for debugging
- Regenerate all 108 existing AUTO_*_equality.sql files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Docker's put_archive requires the target directory to exist. Most dbt
project images don't ship a macros/ directory, causing a 404 error when
copying the generated ade_bench_equality_test macro. Create the directory
with exec_run before the copy.
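
A sketch of the fix, assuming a container object that exposes docker-SDK-style exec_run and put_archive methods. The helper names are hypothetical, and the tar archive is built in memory because put_archive expects tar data.

```python
import io
import tarfile


def build_tar(filename: str, content: str) -> bytes:
    """Build an in-memory tar archive holding a single text file."""
    data = content.encode("utf-8")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name=filename)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def copy_macro(container, dest_dir: str, filename: str, content: str) -> bool:
    """Create dest_dir in the container, then copy the file into it."""
    # Without this mkdir, put_archive fails when dest_dir does not exist.
    container.exec_run(["mkdir", "-p", dest_dir])
    return container.put_archive(dest_dir, build_tar(filename, content))
```

Using mkdir -p keeps the call safe to repeat for images that already ship a macros/ directory.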

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@joellabes joellabes merged commit f39ad21 into main Mar 18, 2026
9 checks passed