Conversation

@luna-bianca luna-bianca commented Nov 28, 2025

vercel bot commented Nov 28, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Review | Updated (UTC) |
| --- | --- | --- | --- |
| docs-getdbt-com | Ready | Ready (Preview) | Jan 9, 2026 4:40pm |

@github-actions github-actions bot added the `content` label (Improvements or additions to content) Nov 28, 2025
You can use the following optional parameters to customize your state-aware orchestration:

- `loaded_at_query`: Define a custom freshness condition in SQL to account for partial loading or streaming data.
| Parameter | Description | Allowed values | Supports Jinja |
| --- | --- | --- | --- |
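For illustration, a minimal sketch of how `loaded_at_query` might be set on a source table. The YAML nesting, source, table, and column names here are assumptions; check the dbt source freshness docs for the authoritative placement:

```yaml
# Hypothetical sources.yml fragment: derive freshness from a custom SQL
# expression instead of warehouse metadata or a single timestamp column.
sources:
  - name: raw
    tables:
      - name: orders
        config:
          loaded_at_query: "select max(event_date) from {{ this }}"
```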
@luna-bianca (Contributor, Author) commented:

Converted the parameter descriptions to a table format

@luna-bianca luna-bianca marked this pull request as ready for review December 4, 2025 17:10
@luna-bianca luna-bianca requested a review from a team as a code owner December 4, 2025 17:10
@luna-bianca luna-bianca requested a review from reubenmc December 4, 2025 17:11

@reubenmc reubenmc left a comment


Thanks @luna-bianca! @evabgood and I just added some feedback. Things are starting to look great!

- 🕐 For example, if most of our records for `2022-01-30` arrive in the raw schema of our warehouse on the morning of `2022-01-31`, but a handful don’t get loaded until `2022-02-02`, how might we tackle that? There will already be `max(updated_at)` timestamps of `2022-01-31` in the warehouse, filtering out those late records. **They’ll never make it to our model.**
- 🪟 To mitigate this, we can add a **lookback window** to our **cutoff** point. By **subtracting a few days** from the `max(updated_at)`, we would capture any late data within the window of what we subtracted.
- 👯 As long as we have a **`unique_key` defined in our config**, we’ll simply update existing rows and avoid duplication. We process more data this way, but by a fixed, predictable amount, and it keeps our model hewing closer to the source data.
- If you're using state-aware orchestration, make sure its freshness detection logic accounts for late-arriving data. By default, dbt uses warehouse metadata, which is updated whenever new rows arrive, even if their event timestamps are in the past. However, if you configure a `loaded_at_field` or `loaded_at_query` that uses an event timestamp (for example, `event_date`), late-arriving data may not increase the `loaded_at` value. In this case, state-aware orchestration may skip rebuilding the incremental model, even though your lookback window would normally pick up those records. To ensure late-arriving data is detected, configure your `loaded_at_field` or `loaded_at_query` to align with the same lookback window used in your incremental filter.
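The lookback pattern above can be sketched as a hypothetical dbt incremental model. The 3-day window, the `stg_orders` model, and the column names are illustrative assumptions:

```sql
-- Hypothetical incremental model with a 3-day lookback window.
{{ config(materialized='incremental', unique_key='order_id') }}

select *
from {{ ref('stg_orders') }}

{% if is_incremental() %}
-- Subtract 3 days from the cutoff so late-arriving records inside the
-- window are reprocessed; unique_key ensures they update in place
-- instead of duplicating. (dateadd syntax varies by warehouse.)
where updated_at > (select dateadd(day, -3, max(updated_at)) from {{ this }})
{% endif %}
```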

I feel like this needs to be split into three cases as it's currently confusing. These are suggestions so please edit!

Using State-aware orchestration with Incremental Models

  1. By default, SAO uses dbt warehouse metadata to determine source freshness. This means that dbt will consider a source to have new data whenever a new row arrives. This could lead to running your models more often than ideal.
  2. To avoid this issue, you can instead tell dbt exactly which field to look at for freshness by configuring a `loaded_at_field` for a specific column or a `loaded_at_query` with custom SQL (LINK TO DOCS ON LOADED AT OPTIONS).
  3. Even with a loaded_at_field or loaded_at_query, late arriving records may have an earlier event timestamp. To ensure late-arriving data is detected, configure your loaded_at_field or loaded_at_query to align with the same lookback window used in your incremental filter.
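As a sketch of case 2, a `loaded_at_field` pointing at an explicit column. The source, table, and column names are hypothetical, and the exact YAML nesting may differ across dbt versions:

```yaml
# Hypothetical sources.yml fragment: freshness from an explicit column
# rather than warehouse metadata.
sources:
  - name: raw
    tables:
      - name: events
        loaded_at_field: _etl_loaded_at
        freshness:
          warn_after: {count: 24, period: hour}
```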

@luna-bianca (Contributor, Author) commented:

- Every macro, variable, or templated logic is resolved before state-aware orchestration checks for changes.
- If you use dynamic content (for example, `{{ run_started_at }}`), state-aware orchestration may detect that as a change even if the “static” SQL template hasn’t changed. This may result in more frequent model rebuilds.
- Any change to a macro definition or templated logic will be treated as a code change, even if the underlying data or SQL structure remains the same.
- If you want to leave comments in your source code but don’t want to trigger rebuilds, it is recommended to use regular SQL comments (for example, `-- This is a single-line comment in SQL`) in your query. State-aware orchestration ignores comment-only changes; such annotations will not force model rebuilds across the DAG.
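To make the distinction concrete, a hypothetical model showing an edit that is ignored versus one that always registers as a code change (model and column names are assumptions):

```sql
-- Editing or adding a plain SQL comment like this one is ignored by
-- state-aware orchestration, so it will not trigger a rebuild.
select
    order_id,
    amount,
    -- By contrast, embedding dynamic Jinja such as run_started_at in the
    -- query body changes the compiled SQL on every invocation, so every
    -- run would be detected as a code change and force a rebuild.
    current_timestamp as processed_at
from {{ ref('stg_orders') }}
```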

This is currently true, however this should change in a couple of weeks, so it's probably not worth updating right now. Instead, this should (once it goes out) be added to reflect the new behavior.

https://www.notion.so/dbtlabs/Code-changes-for-non-deterministic-SQL-2a4bb38ebda7807386f6ee38e5b0f892?source=copy_link

Detecting code changes

  1. We first look for changes in the pre-rendered SQL (like Mantle/Core does)
  2. only if there is a change do we look at the post-compiled SQL (with whitespace and comments stripped out, as we do for Fusion currently)

@luna-bianca (Contributor, Author) commented:

Removed the "Detecting code changes" section for now.


### Handling concurrent jobs

If two separate jobs both depend on the same downstream model (for example, `model_ab`), and both jobs detect upstream changes (`updates_on = any`), then `model_ab` may run twice — once per job.

Clarify: only if something has changed, though. If nothing has changed, then the second job will simply reuse `model_ab`.

@luna-bianca (Contributor, Author) commented:


If two separate jobs both depend on the same downstream model (for example, `model_ab`), and both jobs detect upstream changes (`updates_on = any`), then `model_ab` may run twice — once per job.

Under state-aware orchestration, each job independently evaluates whether a model needs rebuilding based on the model’s compiled code and upstream data state. It does not enforce a single build per model across different jobs.

I don't like this. This is really more like:

Under state-aware orchestration, all jobs read from and write to the same shared state, and a model is built only when either the code or data state has changed. This means that each job individually evaluates whether a model needs rebuilding based on the model’s compiled code and upstream data state.

Could also add: If you want to prevent a model from being rebuilt too frequently even when the code or data state has changed, you can slow down any model by using the `build_after` config (LINK TO DOCS ON HOW TO DO THIS).
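A hypothetical sketch of throttling rebuilds with `build_after`. The key names and nesting (`count`, `period`, `updates_on`) are assumptions based on the `updates_on = any` mentioned earlier in this page, so defer to the dbt docs for the authoritative syntax:

```yaml
# Hypothetical model config: even if upstream code or data changes,
# rebuild model_ab at most once every 4 hours.
models:
  - name: model_ab
    config:
      freshness:
        build_after:
          count: 4
          period: hour
          updates_on: any
```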

@luna-bianca (Contributor, Author) commented:

- Upstream data changes at runtime and model-level freshness settings
- Shared state across jobs

This helps avoid unnecessary rebuilds when underlying source files changed without changing the compiled logic, while still rebuilding when upstream data changes require it.

To add: While Core did this for a single run in a single job, SAO with Fusion does this in real time across every job in the environment to manage state and ensure you're not building any models when nothing has changed, no matter which job a model is built in.
