feat: test failure comparison system with HTML diff reports #134
Merged
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…th inserts) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…bnb010-scd2-snapshot
Adds an alternate solution seed for snap__hosts and updates the equality test to check against multiple answer keys (pass if any match). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… snap_hosts seed

Three bugs fixed in interact.py:
1. Pass allowed_tools from plugin set config when creating agent
2. Use run-tests.sh pipeline instead of bare dbt test
3. Copy test_setup script to container

Simplify airbnb010 solution seeds:
- Remove alternate snap_hosts seed with renamed columns
- Keep only raw-column seed (ID/NAME) as single answer key
- Update _no-op.txt seed schema and regenerate test files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sion compat

Add daily cache-busting for the dbt-fusion curl install in Docker builds so new fusion versions are picked up automatically. Also add dbt-fusion variants to airbnb010/011 task configs and update test SQL to use load_relation(ref()) for fusion compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove the current_superhost_tenure column from prompts, solution models, and seed CSVs for both airbnb010 and airbnb011. This column relied on now() being frozen, but agents reasonably use CURRENT_DATE instead, which made it an unfair test of snapshot logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds script to detect failing equality tests from dbt run_results.json and resolve actual/expected relation pairs from manifest.json. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
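A minimal sketch of how such detection might look, assuming the standard dbt artifact layout (a `results` list with `unique_id` and `status` in run_results.json, and `nodes[...]["depends_on"]["nodes"]` in manifest.json). The function name and the "equality in the test id" filter are hypothetical and are not taken from the actual script.

```python
import json
from pathlib import Path


def find_failing_equality_tests(run_results_path, manifest_path):
    """Return (test_id, [dependency ids]) for each failing equality test."""
    run_results = json.loads(Path(run_results_path).read_text())
    manifest = json.loads(Path(manifest_path).read_text())
    nodes = manifest.get("nodes", {})

    failing = []
    for result in run_results.get("results", []):
        test_id = result.get("unique_id", "")
        # Assumption: equality tests can be recognised by their id; the real
        # script may use a different rule.
        if not test_id.startswith("test.") or "equality" not in test_id:
            continue
        if result.get("status") not in ("fail", "error"):
            continue
        # The manifest records which relations the test refers to; these are
        # the candidates for the actual/expected pair.
        depends_on = nodes.get(test_id, {}).get("depends_on", {}).get("nodes", [])
        failing.append((test_id, list(depends_on)))
    return failing
```

Each returned dependency list would then still need to be split into the actual relation and the expected (seed) relation, which this sketch leaves out.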
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Changed _extract_comparison_artifacts to use trial_handler._task_output_path (e.g. airbnb011/airbnb011.base.1-of-1/) instead of trial_handler.output_path (experiment root). This ensures each trial run gets its own comparisons dir. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a column differs in >=90% of paired rows, it's promoted to a "Systematic Diffs" banner showing the distinct value mappings (e.g. true → t, false → f). Those columns are stripped from per-row diffs, and remaining diffs render as a compact flat table with arrow notation (expected → actual) instead of collapsible per-row details. Also fixes comparison artifact extraction to scope to the trial directory (_task_output_path) instead of the experiment root. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
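The promotion rule described in this commit can be sketched roughly as follows. This is an illustration only: the function names and the row representation (pairs of dicts) are assumptions, and only the 90% threshold comes from the commit message.

```python
from collections import Counter

SYSTEMATIC_THRESHOLD = 0.9  # ">=90% of paired rows", per the commit message


def find_systematic_columns(pairs, threshold=SYSTEMATIC_THRESHOLD):
    """pairs: list of (expected_row, actual_row) dicts sharing the same keys.

    Returns {column: Counter of (expected, actual) value mappings} for every
    column that differs in at least `threshold` of the paired rows.
    """
    if not pairs:
        return {}
    systematic = {}
    for column in pairs[0][0]:
        mappings = Counter(
            (expected[column], actual[column])
            for expected, actual in pairs
            if expected[column] != actual[column]
        )
        if sum(mappings.values()) / len(pairs) >= threshold:
            systematic[column] = mappings
    return systematic


def strip_columns(pairs, columns):
    """Drop the systematic columns so the per-row diffs only show the rest."""
    return [
        ({k: v for k, v in expected.items() if k not in columns},
         {k: v for k, v in actual.items() if k not in columns})
        for expected, actual in pairs
    ]
```

The keys of each returned Counter are the distinct value mappings (for example true → t), which is the information the banner would display.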
When zero rows match exactly and both tables have equal row counts, try excluding each column one at a time from EXCEPT ALL. If excluding a column recovers >50% of rows as matches, flag it as systematic. Then redo EXCEPT ALL without those columns so that fuzzy matching and row counts work correctly. This handles the case where a formatting difference (e.g. true→t) in a single column causes every row to fail exact matching, which previously exceeded the fuzzy matching cap on large tables and produced an unhelpful "2180 missing / 2180 extra" report. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
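The pre-scan can be illustrated in plain Python, with a multiset difference standing in for SQL EXCEPT ALL. This is a sketch under that assumption, not the code from compare_tables.py; the function names are hypothetical, and rows are assumed to be tuples ordered like `columns`.

```python
from collections import Counter


def except_all(left, right):
    """Multiset difference, mirroring SQL EXCEPT ALL semantics."""
    return list((Counter(left) - Counter(right)).elements())


def drop_column(rows, index):
    """Return the rows with the column at `index` removed."""
    return [row[:index] + row[index + 1:] for row in rows]


def prescan_systematic_columns(expected, actual, columns, recover_fraction=0.5):
    """Flag columns whose exclusion recovers more than half the rows as matches."""
    # Only run when row counts are equal and no row matches exactly.
    if not expected or len(expected) != len(actual):
        return []
    if len(except_all(expected, actual)) != len(expected):
        return []
    flagged = []
    for index, column in enumerate(columns):
        unmatched = except_all(drop_column(expected, index), drop_column(actual, index))
        matched = len(expected) - len(unmatched)
        if matched / len(expected) > recover_fraction:
            flagged.append(column)
    return flagged
```

After flagging, the comparison would be repeated without the flagged columns so that row counts and fuzzy matching operate on the remaining data.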
The JOIN-based approach for collecting sample value pairs produced empty results on large tables with many-to-many joins. Replaced with simpler approach: get distinct values from each side independently, pair them positionally, and filter to mismatches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
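As a rough illustration of the simpler approach, the sketch below collects distinct values from each side, pairs them by position, and keeps only the mismatches. The sort order and the `limit` parameter are assumptions; the commit does not say how the distinct values are ordered.

```python
def sample_value_pairs(expected_values, actual_values, limit=10):
    """Pair distinct values from each side positionally; keep mismatches."""
    # Assumption: distinct values are sorted by their string form so that the
    # positional pairing is deterministic.
    expected_distinct = sorted(set(expected_values), key=str)
    actual_distinct = sorted(set(actual_values), key=str)
    pairs = [
        (e, a) for e, a in zip(expected_distinct, actual_distinct) if e != a
    ]
    return pairs[:limit]
```

Positional pairing is a heuristic: it gives sensible samples when both sides sort the same way (as with true/false versus t/f) and avoids the many-to-many join entirely.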
Remove unused imports and variables, apply black formatting to all files changed on this branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Commenting out until dbt-labs/dbt-fusion#1447 is resolved. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…comparison

# Conflicts:
# tasks/airbnb010/solutions/dim_superhost_evolution.sql
# tasks/airbnb010/solutions/snap__hosts.sql
# tasks/airbnb010/solutions/src_hosts.sql
# tasks/airbnb010/task.yaml
# tasks/airbnb011/seeds/solution__snap__hosts.csv
# tasks/airbnb011/solutions/dim_superhost_evolution.sql
# tasks/airbnb011/solutions/snap__hosts.sql
# tasks/airbnb011/solutions/src_hosts.sql
# tasks/airbnb011/task.yaml
- Use parameterized queries for read_parquet() to avoid path injection
- Add 30s timeout to docker exec/cp in _extract_comparison_artifacts
- Wrap DuckDB connections in try/finally for proper cleanup
- Pass relation names via temp file instead of inline shell interpolation
- Cap rendered HTML rows at 250 with truncation notice

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
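The row cap from the last item above could look something like the following sketch. The function name, the table markup, and the wording of the notice are illustrative assumptions; only the 250-row limit comes from the commit message.

```python
from html import escape

MAX_RENDERED_ROWS = 250  # cap stated in the commit message


def render_rows(rows, max_rows=MAX_RENDERED_ROWS):
    """Render at most `max_rows` rows and note how many were left out."""
    lines = ["<table>"]
    for row in rows[:max_rows]:
        # Escape cell values so table data cannot inject markup.
        cells = "".join(f"<td>{escape(str(value))}</td>" for value in row)
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    if len(rows) > max_rows:
        lines.append(f"<p>Showing {max_rows} of {len(rows)} rows (truncated).</p>")
    return "\n".join(lines)
```

Capping the output keeps the report usable on large tables while still reporting the full row count.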
…comparisons

- Add alternate seed snap__hosts_aliased with HOST_ID/HOST_NAME columns for airbnb010 and airbnb011 tasks
- Rename comparisons directory to data_comparisons everywhere
- Rename HTML buttons: "Diffs" → "File Diffs", "Comparisons" → "Data Comparisons"
- Only compare against first solution seed (primary answer key) when multiple alternates exist

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… matching

The old approach rendered dbt_utils.test_equality() for each answer key variant in separate CTEs. When column names didn't match between the actual table and a variant seed, the macro threw a compiler error that killed the entire test, even if another variant would have passed.

Now all equality tests use a unified Jinja loop that:
1. Gets columns via adapter.get_columns_in_relation() for both sides
2. Skips variants where column names don't match (instead of erroring)
3. Runs the EXCEPT-based comparison via run_query() at compile time
4. Short-circuits on first matching variant

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
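The control flow of that loop, translated into Python for illustration, might look like the sketch below. It is not the Jinja macro itself: the function name and data shapes are hypothetical, and a symmetric set difference stands in for the EXCEPT-based query.

```python
def matches_any_answer_key(actual_columns, actual_rows, variants):
    """variants: list of (columns, rows) answer keys; rows are tuples.

    Returns True as soon as one variant matches the actual table.
    """
    for columns, rows in variants:
        # Skip, rather than raise on, variants whose column names differ.
        if sorted(columns) != sorted(actual_columns):
            continue
        # Align the variant's column order with the actual table.
        order = [columns.index(name) for name in actual_columns]
        aligned = {tuple(row[i] for i in order) for row in rows}
        actual = set(actual_rows)
        # Stand-in for "expected EXCEPT actual" and "actual EXCEPT expected".
        if not (aligned - actual) and not (actual - aligned):
            return True  # short-circuit on the first matching variant
    return False
```

Treating a column mismatch as "this variant does not apply" rather than an error is what lets a later variant still pass the test.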
Replace 75-line inline Jinja block in all AUTO_*_equality.sql tests with a call to a new ade_bench_equality_test macro. The macro is generated alongside the test files and copied into /app/macros/ in the container at test runtime.

- Add EQUALITY_MACRO_FILENAME and get_equality_macro_content() to test_generator.py
- generate_equality_test() now emits a single macro call instead of inline logic
- generate_solution_tests() accepts macros_dir and writes the macro there
- harness.py copies generated macros to /app/macros/ and writes them back to tasks/<id>/macros/ for debugging
- Regenerate all 108 existing AUTO_*_equality.sql files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Docker's put_archive requires the target directory to exist. Most dbt project images don't ship a macros/ directory, causing a 404 error when copying the generated ade_bench_equality_test macro. Create the directory with exec_run before the copy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
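A minimal sketch of the fix, assuming the Docker SDK for Python, whose Container objects provide `exec_run` and `put_archive`. The helper name and its parameters are hypothetical and do not reflect the actual harness.py code.

```python
import io
import tarfile


def copy_file_into_container(container, dest_dir, filename, content):
    """Create dest_dir in the container, then upload one file into it.

    `container` is expected to be a docker SDK Container (or any object that
    offers compatible exec_run and put_archive methods).
    """
    # put_archive fails if the target directory is missing, so create it first.
    container.exec_run(["mkdir", "-p", dest_dir])

    # put_archive takes a tar archive, so wrap the file in an in-memory tar.
    data = content.encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        info = tarfile.TarInfo(name=filename)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return container.put_archive(dest_dir, buffer.getvalue())
```

Because `mkdir -p` is idempotent, the same call works for images that already ship a macros/ directory.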
Summary
- … true→t across all rows) via column-exclusion pre-scan, shown as a compact banner instead of per-row noise
- … expected → actual notation

New files
- shared/scripts/detect_failing_equality_tests.py: reads run_results.json + manifest.json to find failing equality test pairs
- shared/scripts/dump_tables.py: exports relations to parquet/CSV (DuckDB + Snowflake)
- shared/scripts/compare_tables.py: column diff, EXCEPT ALL row matching, fuzzy pairing with numeric tolerance, systematic diff detection, HTML rendering
- shared/scripts/run-comparison.sh: container-side orchestration
- tests/scripts/: 34 unit tests

Modified files
- ade_bench/harness.py: extracts /app/comparisons/ from container after tests
- ade_bench/handlers/trial_handler.py: adds comparison scripts to test_scripts list
- shared/scripts/run-dbt-test.sh: calls run-comparison.sh after dbt test
- scripts_python/generate_results_html.py: generates comparisons.html detail page
- scripts_python/summarize_results.py: adds conditional Comparisons link to index
- docker/base/Dockerfile.snowflake-*: adds duckdb pip dependency for comparison engine

Test plan
- … uv run pytest tests/scripts/ -v)
- … True → t)

🤖 Generated with Claude Code