feat: test failure comparison system with HTML diff reports by joellabes · Pull Request #134 · dbt-labs/ade-bench

joellabes · 2026-03-13T03:20:44Z

Summary

Adds a post-test comparison pipeline that detects failing dbt equality tests, dumps both sides to parquet, runs structured comparison, and generates HTML diff reports
Detects systematic diffs (e.g. true → t across all rows) via column-exclusion pre-scan, shown as a compact banner instead of per-row noise
Remaining row diffs render as a compact flat table with inline expected → actual notation
Comparison artifacts are extracted from containers and integrated into the existing HTML experiment reports with a "Comparisons" link

New files

shared/scripts/detect_failing_equality_tests.py — reads run_results.json + manifest.json to find failing equality test pairs
shared/scripts/dump_tables.py — exports relations to parquet/CSV (DuckDB + Snowflake)
shared/scripts/compare_tables.py — column diff, EXCEPT ALL row matching, fuzzy pairing with numeric tolerance, systematic diff detection, HTML rendering
shared/scripts/run-comparison.sh — container-side orchestration
tests/scripts/ — 34 unit tests
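The fuzzy pairing step in compare_tables.py is only named here, not described. As a rough illustration of the idea, the sketch below greedily pairs each expected-only row with the most similar actual-only row and treats numeric cells as equal within a relative tolerance. The function names, the dict-based row shape, and the 1e-6 tolerance are assumptions made for this example, not details from the PR.

```python
import math


def values_close(expected, actual, rel_tol=1e-6):
    """Cells match if they are equal, or both numeric and within rel_tol."""
    if expected == actual:
        return True
    numeric = (int, float)
    both_numeric = isinstance(expected, numeric) and isinstance(actual, numeric)
    if both_numeric and not isinstance(expected, bool) and not isinstance(actual, bool):
        return math.isclose(expected, actual, rel_tol=rel_tol)
    return False


def pair_rows(missing, extra, rel_tol=1e-6):
    """Greedily pair expected-only rows with their most similar actual-only row.

    Returns (pairs, unpaired_missing, unpaired_extra). Hypothetical helper.
    """
    pairs, unpaired_missing = [], []
    unpaired_extra = list(extra)
    for exp in missing:
        best_idx, best_score = None, 0
        for i, act in enumerate(unpaired_extra):
            # Score = number of columns whose cells match within tolerance.
            score = sum(values_close(exp[c], act.get(c), rel_tol) for c in exp)
            if score > best_score:
                best_idx, best_score = i, score
        if best_idx is None:
            unpaired_missing.append(exp)
        else:
            pairs.append((exp, unpaired_extra.pop(best_idx)))
    return pairs, unpaired_missing, unpaired_extra
```

A greedy pass like this is quadratic in the number of unmatched rows, which is one plausible reason for capping how many rows are fed into fuzzy matching.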

Modified files

ade_bench/harness.py — extracts /app/comparisons/ from container after tests
ade_bench/handlers/trial_handler.py — adds comparison scripts to test_scripts list
shared/scripts/run-dbt-test.sh — calls run-comparison.sh after dbt test
scripts_python/generate_results_html.py — generates comparisons.html detail page
scripts_python/summarize_results.py — adds conditional Comparisons link to index
docker/base/Dockerfile.snowflake-* — adds duckdb pip dependency for comparison engine

Test plan

34 unit tests passing (uv run pytest tests/scripts/ -v)
Run benchmark with known-failing task, verify comparisons dir appears in trial output
Verify HTML report shows Comparisons link and rendered diff page
Verify systematic diffs show as banner (e.g. boolean formatting True → t)

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds an alternate solution seed for snap__hosts and updates the equality test to check against multiple answer keys (pass if any match). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… snap_hosts seed

Three bugs fixed in interact.py:
1. Pass allowed_tools from plugin set config when creating agent
2. Use run-tests.sh pipeline instead of bare dbt test
3. Copy test_setup script to container

Simplify airbnb010 solution seeds:
- Remove alternate snap_hosts seed with renamed columns
- Keep only raw-column seed (ID/NAME) as single answer key
- Update _no-op.txt seed schema and regenerate test files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…sion compat Add daily cache-busting for the dbt-fusion curl install in Docker builds so new fusion versions are picked up automatically. Also add dbt-fusion variants to airbnb010/011 task configs and update test SQL to use load_relation(ref()) for fusion compatibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove the current_superhost_tenure column from prompts, solution models, and seed CSVs for both airbnb010 and airbnb011. This column relied on now() being frozen, but agents can reasonably use CURRENT_DATE instead, which made it an unfair test of snapshot logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds script to detect failing equality tests from dbt run_results.json and resolve actual/expected relation pairs from manifest.json. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
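As a rough illustration of this detection step, the sketch below reads the two dbt artifacts and lists failing tests together with the nodes they depend on. It relies on the standard artifact layout (`results[].unique_id` and `results[].status` in run_results.json, `nodes[<id>].depends_on.nodes` in manifest.json). Identifying equality tests by the substring "equality" in the unique ID, and treating the test's dependencies as the actual/expected pair, are assumptions made for this example; the real script may resolve pairs differently.

```python
import json
from pathlib import Path


def find_failing_equality_tests(run_results_path, manifest_path, marker="equality"):
    """Return failing tests whose unique_id contains `marker` (hypothetical helper)."""
    run_results = json.loads(Path(run_results_path).read_text())
    manifest = json.loads(Path(manifest_path).read_text())
    nodes = manifest.get("nodes", {})

    failing = []
    for result in run_results.get("results", []):
        unique_id = result.get("unique_id", "")
        # dbt reports a test with failing rows as status "fail".
        if result.get("status") != "fail" or marker not in unique_id:
            continue
        # Assumption: the relations under comparison are the test's dependencies.
        deps = nodes.get(unique_id, {}).get("depends_on", {}).get("nodes", [])
        failing.append({"test": unique_id, "relations": deps})
    return failing
```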

Changed _extract_comparison_artifacts to use trial_handler._task_output_path (e.g. airbnb011/airbnb011.base.1-of-1/) instead of trial_handler.output_path (experiment root). This ensures each trial run gets its own comparisons dir. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When a column differs in >=90% of paired rows, it's promoted to a "Systematic Diffs" banner showing the distinct value mappings (e.g. true → t, false → f). Those columns are stripped from per-row diffs, and remaining diffs render as a compact flat table with arrow notation (expected → actual) instead of collapsible per-row details. Also fixes comparison artifact extraction to scope to the trial directory (_task_output_path) instead of the experiment root. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
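The promotion rule described above can be sketched in a few lines. This is an illustrative version only: it assumes paired rows arrive as (expected, actual) dicts, and the function names are hypothetical. The 90% threshold is the one stated in the commit message.

```python
from collections import Counter


def find_systematic_columns(pairs, threshold=0.9):
    """Map each column that differs in >= threshold of pairs to its value mappings."""
    if not pairs:
        return {}
    diff_counts = Counter()
    mappings = {}
    for expected, actual in pairs:
        for col, exp_val in expected.items():
            act_val = actual.get(col)
            if exp_val != act_val:
                diff_counts[col] += 1
                mappings.setdefault(col, Counter())[(exp_val, act_val)] += 1
    return {
        col: mappings[col]
        for col, n in diff_counts.items()
        if n / len(pairs) >= threshold
    }


def strip_columns(pairs, columns):
    """Drop promoted columns so per-row diffs only show the remaining noise."""
    return [
        (
            {k: v for k, v in exp.items() if k not in columns},
            {k: v for k, v in act.items() if k not in columns},
        )
        for exp, act in pairs
    ]
```

The distinct keys of each returned counter are what a banner such as "true → t, false → f" would list.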

When zero rows match exactly and both tables have equal row counts, try excluding each column one at a time from EXCEPT ALL. If excluding a column recovers >50% of rows as matches, flag it as systematic. Then redo EXCEPT ALL without those columns so that fuzzy matching and row counts work correctly. This handles the case where a formatting difference (e.g. true→t) in a single column causes every row to fail exact matching, which previously exceeded the fuzzy matching cap on large tables and produced an unhelpful "2180 missing / 2180 extra" report. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
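The pre-scan can be illustrated without a database. In the sketch below a `Counter` intersection stands in for the rows that SQL `EXCEPT ALL` would not report, so this is a pure-Python analogue rather than the DuckDB queries the PR uses. Function names and the dict-based row shape are assumptions; the "more than 50% recovered" rule and the preconditions (zero exact matches, equal row counts) come from the commit message.

```python
from collections import Counter


def except_all_matches(expected, actual, columns):
    """Count rows that match as multisets when compared on `columns` only."""
    exp = Counter(tuple(row[c] for c in columns) for row in expected)
    act = Counter(tuple(row[c] for c in columns) for row in actual)
    return sum((exp & act).values())


def prescan_systematic_columns(expected, actual, columns, recover_ratio=0.5):
    """Flag columns whose exclusion recovers more than recover_ratio of rows."""
    # Only run when nothing matches exactly and both sides have equal row counts.
    if not expected or len(expected) != len(actual):
        return []
    if except_all_matches(expected, actual, columns) > 0:
        return []
    flagged = []
    for col in columns:
        remaining = [c for c in columns if c != col]
        matched = except_all_matches(expected, actual, remaining)
        if matched / len(expected) > recover_ratio:
            flagged.append(col)
    return flagged
```

After flagging, the comparison would be rerun on the remaining columns so that row counts and fuzzy matching reflect only the non-systematic differences.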

The JOIN-based approach for collecting sample value pairs produced empty results on large tables with many-to-many joins. Replaced with simpler approach: get distinct values from each side independently, pair them positionally, and filter to mismatches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
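A minimal sketch of the positional approach follows. The commit message does not say how the distinct values are ordered before pairing, so sorting by string form is an assumption here, as are the function name and the `limit` parameter.

```python
def sample_value_pairs(expected_values, actual_values, limit=5):
    """Pair distinct values from each side by position and keep the mismatches."""

    def sort_key(value):
        # Assumption: order by string form, with None sorted last.
        return (value is None, str(value))

    exp = sorted(set(expected_values), key=sort_key)
    act = sorted(set(actual_values), key=sort_key)
    pairs = [(e, a) for e, a in zip(exp, act) if e != a]
    return pairs[:limit]
```

Positional pairing is a heuristic: it yields sensible samples when both sides have the same number of distinct values in a compatible order (as with `true`/`false` versus `t`/`f`), and can mispair values otherwise.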

Remove unused imports and variables, apply black formatting to all files changed on this branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Commenting out until dbt-labs/dbt-fusion#1447 is resolved. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…comparison

# Conflicts:
#   tasks/airbnb010/solutions/dim_superhost_evolution.sql
#   tasks/airbnb010/solutions/snap__hosts.sql
#   tasks/airbnb010/solutions/src_hosts.sql
#   tasks/airbnb010/task.yaml
#   tasks/airbnb011/seeds/solution__snap__hosts.csv
#   tasks/airbnb011/solutions/dim_superhost_evolution.sql
#   tasks/airbnb011/solutions/snap__hosts.sql
#   tasks/airbnb011/solutions/src_hosts.sql
#   tasks/airbnb011/task.yaml

- Use parameterized queries for read_parquet() to avoid path injection
- Add 30s timeout to docker exec/cp in _extract_comparison_artifacts
- Wrap DuckDB connections in try/finally for proper cleanup
- Pass relation names via temp file instead of inline shell interpolation
- Cap rendered HTML rows at 250 with truncation notice

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
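Of the hardening items in this commit, the row cap is the easiest to show in isolation. The sketch below is illustrative only: the table markup, the wording of the notice, and the function name are assumptions, while the 250-row limit is the one stated in the commit message.

```python
from html import escape

MAX_RENDERED_ROWS = 250  # cap stated in the commit message


def render_diff_rows(rows, max_rows=MAX_RENDERED_ROWS):
    """Render rows (lists of cells) as an HTML table, truncating past max_rows."""
    parts = ["<table>"]
    for row in rows[:max_rows]:
        # Escape every cell so table data cannot inject markup into the report.
        cells = "".join(f"<td>{escape(str(cell))}</td>" for cell in row)
        parts.append(f"<tr>{cells}</tr>")
    parts.append("</table>")
    if len(rows) > max_rows:
        parts.append(f"<p>Showing {max_rows} of {len(rows)} rows (truncated).</p>")
    return "\n".join(parts)
```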

…comparisons

- Add alternate seed snap__hosts_aliased with HOST_ID/HOST_NAME columns for airbnb010 and airbnb011 tasks
- Rename comparisons directory to data_comparisons everywhere
- Rename HTML buttons: "Diffs" → "File Diffs", "Comparisons" → "Data Comparisons"
- Only compare against first solution seed (primary answer key) when multiple alternates exist

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… matching

The old approach rendered dbt_utils.test_equality() for each answer key variant in separate CTEs. When column names didn't match between the actual table and a variant seed, the macro threw a compiler error that killed the entire test, even if another variant would have passed.

Now all equality tests use a unified Jinja loop that:
1. Gets columns via adapter.get_columns_in_relation() for both sides
2. Skips variants where column names don't match (instead of erroring)
3. Runs the EXCEPT-based comparison via run_query() at compile time
4. Short-circuits on first matching variant

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
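The control flow of that loop (skip variants with mismatched columns, stop at the first match) can be sketched outside Jinja. This Python version is an analogue for illustration, not the macro itself: a multiset comparison stands in for the EXCEPT-based SQL, and the function name and the `(name, columns, rows)` variant shape are assumptions.

```python
from collections import Counter


def first_matching_variant(actual_columns, actual_rows, variants):
    """Return the name of the first answer-key variant equal to the actual table.

    variants: iterable of (name, columns, rows), with rows as dicts.
    """
    ordered = sorted(actual_columns)
    actual_bag = Counter(tuple(row[c] for c in ordered) for row in actual_rows)
    for name, columns, rows in variants:
        # Skip, rather than fail, when the column names differ.
        if set(columns) != set(actual_columns):
            continue
        variant_bag = Counter(tuple(row[c] for c in ordered) for row in rows)
        if variant_bag == actual_bag:
            return name  # short-circuit on the first matching variant
    return None
```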

Replace 75-line inline Jinja block in all AUTO_*_equality.sql tests with a call to a new ade_bench_equality_test macro. The macro is generated alongside the test files and copied into /app/macros/ in the container at test runtime.

- Add EQUALITY_MACRO_FILENAME and get_equality_macro_content() to test_generator.py
- generate_equality_test() now emits a single macro call instead of inline logic
- generate_solution_tests() accepts macros_dir and writes the macro there
- harness.py copies generated macros to /app/macros/ and writes them back to tasks/<id>/macros/ for debugging
- Regenerate all 108 existing AUTO_*_equality.sql files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Docker's put_archive requires the target directory to exist. Most dbt project images don't ship a macros/ directory, causing a 404 error when copying the generated ade_bench_equality_test macro. Create the directory with exec_run before the copy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
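The fix pattern looks roughly like this with the Docker SDK for Python, where `Container.exec_run` and `Container.put_archive` are real SDK methods. The helper name, its parameters, and the choice of `mkdir -p` as the command are assumptions for this sketch; the harness may structure the copy differently.

```python
import io
import tarfile


def copy_file_to_container(container, target_dir, filename, content):
    """Ensure target_dir exists in the container, then upload one text file."""
    # put_archive fails if the destination directory is missing, so create it first.
    result = container.exec_run(["mkdir", "-p", target_dir])
    if result.exit_code != 0:
        raise RuntimeError(f"could not create {target_dir}: {result.output!r}")

    # put_archive expects a tar stream, so build one in memory.
    data = content.encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        info = tarfile.TarInfo(name=filename)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return container.put_archive(target_dir, buffer.getvalue())
```

Taking the container as a parameter keeps the helper testable with a stub object in place of a live Docker daemon.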

joellabes and others added 30 commits March 5, 2026 15:32

feat(airbnb010): add solution SQL files for SCD2 snapshot task

1ed9333

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(airbnb010): add watchdog and mutation scripts

7ee75a4

feat(airbnb010): add setup.sh with watchdog launcher

b6af143

feat(airbnb010): add task.yaml and solution.sh

084ac01

feat(airbnb010): add solution seed CSVs

1baa016

feat(airbnb010): add auto-generated test SQL files

0e302a6

style(airbnb010): apply ruff formatting to Python scripts

391fe9e

feat(airbnb011): add solution SQL files for check-strategy SCD2 task

62b481a

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(airbnb011): add watchdog and mutation scripts (no UPDATED_AT, wi…

5eef4da

…th inserts) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(airbnb011): add setup.sh with UPDATED_AT corruption and watchdog

127a75a

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(airbnb011): add task.yaml and solution.sh

933ea9b

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(airbnb011): add generated seeds and AUTO test files

0f48c1b

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge branch 'feature/airbnb011-scd2-check-strategy' into feature/air…

1a4526f

…bnb010-scd2-snapshot

style(airbnb011): format Python files with ruff

d6b6516

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(airbnb010): add alternate seed and multi-answer-key test support

3551bfb

Adds an alternate solution seed for snap__hosts and updates the equality test to check against multiple answer keys (pass if any match). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into feature/airbnb010-scd…

981412e

…2-snapshot

Merge branch 'main' into feature/airbnb010-scd2-snapshot

9869558

chore: add tests/scripts dir and duckdb to Snowflake Dockerfiles

50b204c

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: add detect_failing_equality_tests.py using run_results.json

1b82b1a

Adds script to detect failing equality tests from dbt run_results.json and resolve actual/expected relation pairs from manifest.json. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: add dump_tables.py generic table export utility

60aaf59

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: add compare_tables.py with column diff and row matching

91d7771

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: add HTML diff rendering to compare_tables.py

adf59de

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat: container-side comparison orchestration via run-comparison.sh

b7978ed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat: extract comparison artifacts from container

8fba747

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat: add comparisons page to HTML experiment reports

e338257

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

joellabes and others added 4 commits March 13, 2026 15:33

chore: increase fuzzy match row limit from 500 to 10k

1cc4029

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

style: fix ruff and black lint issues

5c10d9c

Remove unused imports and variables, apply black formatting to all files changed on this branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

joellabes changed the base branch from main to feature/airbnb010-scd2-snapshot March 13, 2026 03:27

joellabes and others added 2 commits March 16, 2026 16:34

fix(fusion): disable dbt-fusion/duckdb variants for airbnb001, 002, 008

15f4aae

Commenting out until dbt-labs/dbt-fusion#1447 is resolved. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge branch 'feature/airbnb010-scd2-snapshot' into feature/test-fail…

25d4fc5

…ure-comparison

joellabes changed the base branch from feature/airbnb010-scd2-snapshot to main March 16, 2026 05:27

joellabes and others added 7 commits March 16, 2026 18:28

style: apply black formatting to compare_tables.py

aeca009

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

joellabes merged commit f39ad21 into main Mar 18, 2026
9 checks passed

joellabes merged 43 commits into main from feature/test-failure-comparison
