fix(athena): partition tmp table in incremental to reduce batch scan cost #1711
dtaniwaki wants to merge 1 commit into dbt-labs:main
…cost Signed-off-by: Daisuke Taniwaki <daisuketaniwaki@gmail.com>
@dtaniwaki, what you proposed here goes against the original implementation decision: when we create a temporary table, we always write unpartitioned data, and only then write to the final target location as a partitioned table.
@nicor88 I see. Then how can we avoid massive full-scan queries against huge data with a massive number of partitions? I mistakenly created a model in this situation and wasted a lot of money...
@dtaniwaki how about introducing the approach you suggested, but controlling it via a "config" flag? The flag can be false by default and properly documented, and in your case you can set it to "true". That way we avoid any regression, and you and other users are covered for those edge cases. I believe that's the best compromise. Think about a good descriptive name for such a configuration flag.
resolves #1744
docs dbt-labs/docs.getdbt.com/#
Thank you for maintaining this project! I'd appreciate your review on this fix to the Athena adapter's incremental materialization.
Problem
In `safe_create_table_as`, the `temporary=True` branch always created the tmp table (`__dbt_tmp`) without partitioning (`skip_partitioning=True`), bypassing `TOO_MANY_OPEN_PARTITIONS` handling entirely.
This caused an O(N) scan cost for batch incremental inserts: since the tmp table had no partitions, every batch in `batch_incremental_insert` performed a full scan of `__dbt_tmp`. With N batches, the total cost was (N+1) × full scan.
Solution
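For reference, the scan-cost arithmetic in this description can be sketched as a small model. This is only an illustration, not adapter code: the function names are hypothetical, the "+1" is modeled as a single extra full scan outside the batch loop (the description does not break it down further), and the partitioned case assumes partition pruning lets each batch read only its own partitions.

```python
from typing import Sequence


def unpartitioned_scan_bytes(batch_sizes: Sequence[int]) -> int:
    """Bytes scanned when __dbt_tmp has no partitions (hypothetical model).

    Every batch reads the whole tmp table, plus one extra full scan
    outside the batch loop: (N + 1) x full scan.
    """
    full_scan = sum(batch_sizes)
    return (len(batch_sizes) + 1) * full_scan


def partitioned_scan_bytes(batch_sizes: Sequence[int]) -> int:
    """Bytes scanned when __dbt_tmp is partitioned (hypothetical model).

    Assumes pruning is perfect, so each batch reads only its own
    partitions; the batches together add up to one full scan, plus the
    same extra full scan as above: roughly 2 x full scan.
    """
    full_scan = sum(batch_sizes)
    return full_scan + sum(batch_sizes)


# Illustrative batch sizes in GB, one entry per batch (made-up numbers).
batches = [10, 20, 30, 40]
print(unpartitioned_scan_bytes(batches))  # (4 + 1) * 100
print(partitioned_scan_bytes(batches))    # 2 * 100
```

Under this model the gap grows linearly with the number of batches, which is why the cost is most painful for tables with many partitions.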
Unified the `temporary=True` and `temporary=False` code paths in `safe_create_table_as` so that tmp tables now go through `run_query_with_partitions_limit_catching` and fall back to `create_table_as_with_partitions` on `TOO_MANY_OPEN_PARTITIONS`.
With this change, `__dbt_tmp` is created with partitions (when `partitioned_by` is configured), so each batch benefits from partition pruning. Total scan cost drops from (N+1) × full scan to roughly 2 × full scan.
Models without `partitioned_by` are unaffected: they produce an unpartitioned CTAS as before.
Checklist