fix(athena): partition tmp table in incremental to reduce batch scan cost#1711

Open
dtaniwaki wants to merge 1 commit into dbt-labs:main from dtaniwaki:feat/athena-incremental-tmp-partition

Conversation

@dtaniwaki (Contributor) commented Mar 4, 2026

resolves #1744
docs dbt-labs/docs.getdbt.com/#

Thank you for maintaining this project! I'd appreciate your review on this fix to the Athena adapter's incremental materialization.

Problem

In safe_create_table_as, the temporary=True branch always created the tmp table (__dbt_tmp) without partitioning (skip_partitioning=True), bypassing TOO_MANY_OPEN_PARTITIONS handling entirely.

This caused an O(N) scan cost for batch incremental inserts: since the tmp table had no partitions, every batch in batch_incremental_insert performed a full scan of __dbt_tmp. With N batches, the total cost was (N+1) × full scan.

Solution

Unified the temporary=True and temporary=False code paths in safe_create_table_as so that tmp tables now go through run_query_with_partitions_limit_catching and fall back to create_table_as_with_partitions on TOO_MANY_OPEN_PARTITIONS.
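The unified control flow can be sketched as follows. This is a minimal illustration, not the adapter's actual code: `TooManyOpenPartitions` stands in for Athena's `TOO_MANY_OPEN_PARTITIONS` error, and the two callback parameters stand in for `run_query_with_partitions_limit_catching` and `create_table_as_with_partitions`.

```python
class TooManyOpenPartitions(Exception):
    """Stand-in for Athena's TOO_MANY_OPEN_PARTITIONS error."""


def safe_create_table_as(run_partitioned_ctas, create_with_partition_batches):
    # Unified path: both temporary=True and temporary=False now attempt a
    # single partitioned CTAS first, and fall back to batched partition
    # creation when Athena rejects the write for having too many open
    # partitions.
    try:
        return run_partitioned_ctas()
    except TooManyOpenPartitions:
        return create_with_partition_batches()
```

In the adapter, the first callback corresponds to the partitioned CTAS attempt and the second to the per-partition-batch fallback; models without partitioned_by never hit the limit, so they keep producing a single unpartitioned CTAS.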

With this change, __dbt_tmp is created with partitions (when partitioned_by is configured), so each batch benefits from partition pruning. Total scan cost drops from (N+1) × full scan to roughly 2 × full scan.
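The claimed savings follow from a simple cost model, measured in units of one full scan of __dbt_tmp. The function below is illustrative only, assuming (as the description above states) that partition pruning lets the N batch inserts collectively read the tmp table roughly once.

```python
def total_scan_cost(num_batches: int, tmp_partitioned: bool) -> int:
    # One full read of the source data to create __dbt_tmp ...
    create_cost = 1
    if tmp_partitioned:
        # ... plus, with partition pruning, the batch inserts together
        # read each partition of __dbt_tmp roughly once: ~1 extra scan.
        insert_cost = 1
    else:
        # Without partitions, every batch re-scans the whole tmp table.
        insert_cost = num_batches
    return create_cost + insert_cost
```

For N = 50 batches this is 51 full scans before the change and about 2 after it.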

Models without partitioned_by are unaffected — they produce an unpartitioned CTAS as before.

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

@cla-bot cla-bot bot added the cla:yes The PR author has signed the CLA label Mar 4, 2026
@dtaniwaki dtaniwaki marked this pull request as ready for review March 4, 2026 07:37
@dtaniwaki dtaniwaki requested a review from a team as a code owner March 4, 2026 07:37
@dtaniwaki dtaniwaki force-pushed the feat/athena-incremental-tmp-partition branch 2 times, most recently from bdd91c6 to 6ee0b04 Compare March 4, 2026 08:36
dtaniwaki added a commit to dtaniwaki/dbt-adapters that referenced this pull request Mar 6, 2026
@dtaniwaki dtaniwaki force-pushed the feat/athena-incremental-tmp-partition branch from 6ee0b04 to 7cde3fb Compare March 11, 2026 05:13
…cost

Signed-off-by: Daisuke Taniwaki <daisuketaniwaki@gmail.com>
@dtaniwaki dtaniwaki force-pushed the feat/athena-incremental-tmp-partition branch from 7cde3fb to 589e727 Compare March 11, 2026 05:14
@nicor88 (Contributor) commented Mar 11, 2026

@dtaniwaki, what you proposed here goes against the original implementation decision: when we create a temporary table, we always write unpartitioned data, and then finally write to the final target location in a partitioned table.
We are aware that the first unpartitioned write leads to a full scan; this is because, as you wrote, we want to bypass TOO_MANY_OPEN_PARTITIONS entirely. Therefore, I discourage proceeding with your approach, as it can lead to other hidden issues.

@dtaniwaki (Contributor, Author) commented
@nicor88 I see. Then how can we avoid massive full-scan queries against huge data with a massive number of partitions? I mistakenly created a model in this situation and wasted a lot of money...

dtaniwaki added a commit to dtaniwaki/dbt-adapters that referenced this pull request Mar 12, 2026
@nicor88 (Contributor) commented Mar 13, 2026

@dtaniwaki how about introducing the approach you suggested, but making it controllable via a "config" flag? The flag can be false by default, and properly documented, but then in your case you can set it to "true".

Doing so, we avoid any regression, and you and other users are covered for those edge cases. I believe that's the best compromise. Think about a good descriptive name for such a configuration flag.
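One possible shape for such a flag, as a model-level config. Note that the name partition_tmp_table is purely a placeholder suggestion, not an existing dbt-athena config option, and it would default to false to preserve current behavior:

```yaml
# models/schema.yml -- hypothetical configuration; flag name not final
models:
  - name: my_incremental_model
    config:
      materialized: incremental
      partitioned_by: ["dt"]
      # Hypothetical opt-in: create __dbt_tmp with partitions so batch
      # inserts can prune. Defaults to false (today's behavior).
      partition_tmp_table: true
```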


Labels

cla:yes The PR author has signed the CLA


Development

Successfully merging this pull request may close these issues.

[Bug] safe_create_table_as skips partition handling for tmp tables, causing O(N) scan cost in batch incremental

2 participants