SAO doc improvements #8234
Conversation
| You can use the following optional parameters to customize your state-aware orchestration: |
| - `loaded_at_query`: Define a custom freshness condition in SQL to account for partial loading or streaming data. |
| | Parameter | Description | Allowed values | Supports Jinja | |
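For context, a minimal sketch of where these parameters live, assuming the standard dbt sources YAML; the source, table, and column names (`raw_app`, `events`, `_etl_loaded_at`) are hypothetical:

```yaml
# Sketch only: source/table/column names are made up for illustration.
sources:
  - name: raw_app
    tables:
      - name: events
        # Point freshness at a specific timestamp column...
        loaded_at_field: _etl_loaded_at
        # ...or supply custom SQL instead, e.g. for partial loads or streaming:
        # loaded_at_query: select max(_etl_loaded_at) from {{ this }}
```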
Converted the parameter descriptions to a table format
reubenmc left a comment
Thanks @luna-bianca! @evabgood and I just added some feedback. Things are starting to look great!
| - 🕐 For example, if most of our records for `2022-01-30` come in the raw schema of our warehouse on the morning of `2022-01-31`, but a handful don’t get loaded til `2022-02-02`, how might we tackle that? There will already be `max(updated_at)` timestamps of `2022-01-31` in the warehouse, filtering out those late records. **They’ll never make it to our model.** |
| - 🪟 To mitigate this, we can add a **lookback window** to our **cutoff** point. By **subtracting a few days** from the `max(updated_at)`, we would capture any late data within the window of what we subtracted. |
| - 👯 As long as we have a **`unique_key` defined in our config**, we’ll simply update existing rows and avoid duplication. We process more data this way, but in a fixed way, and it keeps our model hewing closer to the source data. |
| - If you're using state-aware orchestration, make sure its freshness detection logic accounts for late-arriving data. By default, dbt uses warehouse metadata, which is updated whenever new rows arrive, even if their event timestamps are in the past. However, if you configure a `loaded_at_field` or `loaded_at_query` that uses an event timestamp (for example, `event_date`), late-arriving data may not increase the `loaded_at` value. In this case, state-aware orchestration may skip rebuilding the incremental model, even though your lookback window would normally pick up those records. To ensure late-arriving data is detected, configure your `loaded_at_field` or `loaded_at_query` to align with the same lookback window used in your incremental filter. |
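As an illustration of the lookback pattern described above, a hedged sketch of an incremental model; the model and column names are hypothetical, and `dateadd` is a warehouse-specific date function:

```sql
-- Sketch of a lookback window on an incremental model; names are hypothetical.
{{ config(materialized='incremental', unique_key='order_id') }}

select order_id, customer_id, updated_at
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- Pull the cutoff back 3 days so late-arriving rows are reprocessed;
  -- with unique_key set, reprocessed rows update in place rather than duplicate.
  where updated_at > (select dateadd(day, -3, max(updated_at)) from {{ this }})
{% endif %}
```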
I feel like this needs to be split into three cases as it's currently confusing. These are suggestions so please edit!
Using state-aware orchestration with incremental models
- By default, SAO uses warehouse metadata to determine source freshness. This means dbt considers a source to have new data whenever a new row arrives, which can lead to running your models more often than ideal.
- To avoid this, you can instead tell dbt exactly which field to look at for freshness by configuring a `loaded_at_field` for a specific column or a `loaded_at_query` with custom SQL (LINK TO DOCS ON LOADED AT OPTIONS).
- Even with a `loaded_at_field` or `loaded_at_query`, late-arriving records may have an earlier event timestamp. To ensure late-arriving data is detected, configure your `loaded_at_field` or `loaded_at_query` to align with the same lookback window used in your incremental filter (a sketch follows below).
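One possible way to express that last bullet, hedged heavily since the right query depends on your load pattern: restrict the freshness signal to the same window the incremental filter scans, so a late row landing inside the window still bumps the `loaded_at` value. The names and the 3-day window are hypothetical:

```yaml
# Sketch: count the source as having new data when rows whose event dates fall
# inside the 3-day lookback window have recently landed; names are hypothetical.
sources:
  - name: raw_app
    tables:
      - name: events
        loaded_at_query: |
          select max(_etl_loaded_at)
          from {{ this }}
          where event_date >= dateadd(day, -3, current_date)
```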
| - Every macro, variable, or templated logic is resolved before state-aware orchestration checks for changes. |
| - If you use dynamic content (for example, `{{ run_started_at }}`), state-aware orchestration may detect that as a change even if the “static” SQL template hasn’t changed. This may result in more frequent model rebuilds. |
| - Any change to a macro definition or templated logic will be treated as a code change, even if the underlying data or SQL structure remains the same. |
| - If you want to leave comments in your source code but don’t want to trigger rebuilds, it is recommended to use regular SQL comments (for example, `-- This is a single-line comment in SQL`) in your query. State-aware orchestration ignores comment-only changes; such annotations will not force model rebuilds across the DAG. |
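To make the distinction concrete, a small sketch (model names hypothetical):

```sql
-- Editing this comment alone does not force a rebuild: state-aware
-- orchestration ignores comment-only changes.
select
    order_id,
    -- This renders to a new literal on every invocation, so the compiled SQL
    -- differs run-to-run and may be detected as a code change:
    '{{ run_started_at }}' as compiled_at
from {{ ref('stg_orders') }}
```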
This is currently true; however, it should change in a couple of weeks, so it's probably not worth updating right now. Instead, this should be added (once it goes out) to reflect the new behavior.
Detecting code changes
- We first look for changes in the pre-rendered SQL (like Mantle/Core does)
- If, and only if, there is a change, we look at the post-compiled SQL (with whitespace and comments stripped out, as we currently do for Fusion)
Removed the "Detecting code changes" section for now
| ### Handling concurrent jobs |
| If two separate jobs both depend on the same downstream model (for example, `model_ab`), and both jobs detect upstream changes (`updates_on = any`), then `model_ab` may run twice, once per job. |
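For reference, a sketch of the config that produces this behavior, assuming the model-level freshness shape from the state-aware orchestration docs; the counts and periods are arbitrary examples:

```yaml
# Sketch: model_ab rebuilds when any upstream has new data, so two concurrent
# jobs that each detect an upstream change may each build it once.
models:
  - name: model_ab
    config:
      freshness:
        build_after:
          count: 1
          period: hour
          updates_on: any  # vs. "all": wait until every upstream has changed
```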
Clarify: only if something has changed, though. If nothing has changed, then the second job will simply reuse `model_ab`.
| Under state-aware orchestration, each job independently evaluates whether a model needs rebuilding based on the model’s compiled code and upstream data state. It does not enforce a single build per model across different jobs. |
I don't like this. This is really more like:
Under state-aware orchestration, all jobs read and write from the same shared state and build a model only when either the code or data state has changed. This means that each job individually evaluates whether a model needs rebuilding based on the model’s compiled code and upstream data state.
Could also add: If you want to prevent a model from being rebuilt too frequently even when the code or data state has changed, you can slow down any model by using the build_after config (LINK TO DOCS ON HOW TO DO THIS).
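A possible shape for that `build_after` throttle, hedged since the exact keys should be confirmed against the linked docs; the values are arbitrary examples:

```yaml
# Sketch: model_ab is not considered due for a rebuild until at least 6 hours
# after its last build, even if code or upstream data changed sooner.
models:
  - name: model_ab
    config:
      freshness:
        build_after:
          count: 6
          period: hour
```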
And added the `build_after` paragraph here: https://github.com/dbt-labs/docs.getdbt.com/pull/8234/changes#diff-ad798a159c003c98c28f29456ba1d0e295b58d33c976f5ed18c07c567f822080R54
| - Upstream data changes at runtime and model-level freshness settings |
| - Shared state across jobs |
| This helps avoid unnecessary rebuilds when underlying source files changed without changing the compiled logic, while still rebuilding when upstream data changes require it. |
To add: While Core did these checks for a single run in a single job, SAO with Fusion does this in real time across every job in the environment to manage state and ensure you're not building any models when things haven't changed, no matter which job a model is built in.
What are you changing in this pull request and why?
Slack thread 1
Slack thread 2