
feat(optimizer): annotate types for ORDER BY alias references #7281

Open
doripo wants to merge 3 commits into tobymao:main from doripo:feat-orderby-alias-types


@doripo
Contributor

doripo commented Mar 12, 2026

When ORDER BY references a projection alias, annotate_types leaves the column typed as UNKNOWN:

import sqlglot
from sqlglot import exp
from sqlglot.optimizer import qualify, annotate_types

query = qualify.qualify(sqlglot.parse_one("SELECT x + 1 AS y FROM t ORDER BY y"), schema={"t": {"x": "INT"}})
annotated = annotate_types.annotate_types(query, schema={"t": {"x": "INT"}})
order = annotated.find(exp.Order)
print(order.expressions[0].this.type)  # UNKNOWN — should be INT

This happens because qualify_columns intentionally preserves ORDER BY alias refs (they're valid SQL in all dialects), so the annotator has no table-qualified column to resolve against. Other clauses (GROUP BY, HAVING) don't have this issue because _expand_alias_refs expands their alias references before annotation.

This PR adds a post-pass (_fixup_order_by_aliases) in annotate_scope that runs after _annotate_expression, when projections are fully typed. It:

  1. Builds an alias-to-type map from query.selects
  2. Walks ORDER BY columns, copying types for bare-column alias matches
  3. Clears _visited entries and non-leaf types on the ORDER BY subtree so _annotate_expression can re-derive compound expression types (e.g., ORDER BY y + 1) from the updated leaves

The _visited clearing is necessary because _annotate_expression skips nodes already in _visited, regardless of _overwrite_types. Subquery subtrees are pruned during the clearing walk because they belong to inner scopes already annotated independently.
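The three steps and the subquery pruning can be sketched on a toy tree. This is only an illustration under stated assumptions: `Node`, `derive`, and this `fixup_order_by_aliases` are hypothetical stand-ins written for the example, not sqlglot's classes or the PR's actual code, and the "re-derive" step is reduced to a trivial rule for an `add` node.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

UNKNOWN = "UNKNOWN"


@dataclass
class Node:
    """Toy expression node (hypothetical, not a sqlglot class)."""
    kind: str                     # "column", "literal", "add" or "subquery"
    name: Optional[str] = None
    table: Optional[str] = None   # None marks a bare (unqualified) column
    type: str = UNKNOWN
    children: List["Node"] = field(default_factory=list)


def derive(node: Node) -> str:
    """Toy re-derivation: an "add" takes its children's type if they agree."""
    if node.kind == "add":
        child_types = {derive(child) for child in node.children}
        node.type = child_types.pop() if len(child_types) == 1 else UNKNOWN
    return node.type


def fixup_order_by_aliases(selects: List[Tuple[str, str]], order_exprs: List[Node]) -> None:
    # Step 1: alias -> type map built from the (already typed) projections.
    alias_types = dict(selects)
    for root in order_exprs:
        stack = [root]
        while stack:
            node = stack.pop()
            if node.kind == "subquery":
                continue  # inner scope, annotated independently: prune here
            if node.kind == "column":
                # Step 2: only bare columns are treated as alias references.
                if node.table is None and node.name in alias_types:
                    node.type = alias_types[node.name]
            elif node.kind != "literal":
                node.type = UNKNOWN  # Step 3: clear non-leaf types
            stack.extend(node.children)
        derive(root)  # re-derive compound types from the updated leaves


simple = Node("column", name="y")
compound = Node("add", children=[Node("column", name="y"), Node("literal", type="INT")])
fixup_order_by_aliases([("y", "INT")], [simple, compound])
print(simple.type, compound.type)  # INT INT
```

In the toy, a table-qualified column or anything under a `subquery` node keeps whatever type it already had, which mirrors the pruning described above.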

Test coverage includes basic alias resolution, shadowing/collisions, sort modifiers, compound expressions, set operations, window functions, subquery-as-projection, type coercion, and regression guards.

When ORDER BY references a projection alias (e.g., SELECT x+1 AS y ...
ORDER BY y), the column's type was left as UNKNOWN. qualify_columns
intentionally preserves these alias refs (they're valid SQL in all
dialects), so the single-pass annotator has no table-qualified column
to resolve against.

Add a post-pass (_fixup_order_by_aliases) in annotate_scope that runs
after projections are fully typed. It builds an alias-to-type map, fixes
matching bare columns in ORDER BY, and re-derives parent types on
compound expressions (e.g., ORDER BY y + 1) via _reannotate_subtree.

This approach avoids modifying _annotate_expression, the core annotation
loop. _reannotate_subtree clears non-leaf types (preserving Column/Literal
ground truth), prunes at Subquery boundaries, and re-invokes
_annotate_expression sequentially.
@doripo
Contributor Author

doripo commented Mar 12, 2026

Happy to adjust the approach if you'd prefer this handled differently. A few notes on design choices:

  • Post-pass vs. modifying _annotate_expression: I went with a post-pass to avoid touching the core annotation loop. The tradeoff is two walks over the ORDER BY subtree, but it keeps the change self-contained.
  • _reannotate_subtree as a separate method: Seemed cleaner and potentially reusable for traversal-order gaps, but can be inlined into _fixup_order_by_aliases if preferred.

Replace the post-pass (_fixup_order_by_aliases + _reannotate_subtree)
with _resolve_order_by_alias, called from the column annotation path
in _annotate_expression. When a bare column in ORDER BY matches a
projection alias, it forces the projection's annotation via a recursive
call if needed, then copies the type.

This resolves alias types during the existing annotation pass instead
of walking the ORDER BY subtree twice after the fact.

Signed-off-by: Dori Polotsky <doripo@riverpool.ai>
@doripo
Contributor Author

doripo commented Mar 13, 2026

Note: _resolve_order_by_alias has extra logic to collect the last matching alias rather than returning on the first match, to replicate behavior empirically observed on a local DuckDB and pinned as a test. That said, duplicate alias resolution appears unspecified across dialects, so this could be simplified to an early return and the duplicate alias test dropped if preferred.
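To make the two policies concrete, here is a small sketch contrasting "last matching alias wins" with an early return on the first match. `resolve_alias_type` is a hypothetical helper for illustration only; it is not the PR's `_resolve_order_by_alias`, and which policy a given engine follows is not something this sketch establishes.

```python
from typing import List, Optional, Tuple


def resolve_alias_type(
    selects: List[Tuple[str, str]], name: str, last_wins: bool = True
) -> Optional[str]:
    """Return the type of the projection aliased as `name`, or None if absent."""
    matches = [typ for alias, typ in selects if alias == name]
    if not matches:
        return None
    # last_wins=True keeps the final duplicate; False models an early return.
    return matches[-1] if last_wins else matches[0]


# Toy projection list with a duplicated alias: SELECT a AS y, b AS y ...
selects = [("y", "INT"), ("y", "TEXT")]
print(resolve_alias_type(selects, "y"))                   # TEXT
print(resolve_alias_type(selects, "y", last_wins=False))  # INT
```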

Collaborator

georgesittas left a comment

@doripo I think I misguided you to an even costlier approach. This didn't work out well:

For every Column in every scope of the input query, you're doing an ancestor search and you walk the projection list of said query, again and again. Both of these are very wasteful.
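As a rough back-of-the-envelope model of that cost (an assumption-laden sketch, not a measurement of sqlglot), suppose each column pays for one ancestor climb plus one scan of the projection list, while a post-pass pays once for the projection list and once per ORDER BY column. Both function names and the step counts are hypothetical.

```python
from typing import List


def in_loop_steps(column_depths: List[int], num_projections: int) -> int:
    """Per-column work: climb `depth` ancestors, then scan every projection."""
    return sum(depth + num_projections for depth in column_depths)


def post_pass_steps(num_projections: int, num_order_by_columns: int) -> int:
    """One pass to build the alias map, then one visit per ORDER BY column."""
    return num_projections + num_order_by_columns


# Toy scope: 200 columns at depth 3, 50 projections, 2 ORDER BY columns.
depths = [3] * 200
print(in_loop_steps(depths, 50), post_pass_steps(50, 2))  # 10600 52
```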

We should move the logic out of _annotate_expression entirely and make it a post-pass in annotate_scope. After _annotate_expression finishes, all projections are fully typed, so you can:

  1. Grab the Order node from the query
  2. Build an alias -> type map from query.selects
  3. Walk only the ORDER BY columns, copy types for bare-column alias matches
  4. Re-annotate any compound ORDER BY expressions (e.g., y + 1) whose leaf types changed, by calling _annotate_expression on those subtrees

This is simpler, more targeted, and keeps the hot path (_annotate_expression) untouched. It's essentially what you did in the first commit, but without the complexity of _reannotate_subtree, because you can just re-run _annotate_expression on the individual Ordered node, since all leaf types are now known.

@georgesittas
Collaborator

Are there similar issues during the annotation of GROUP BY, HAVING, etc, or do we avoid it due to expanding the alias references in qualify_columns for these clauses?

@doripo
Contributor Author

doripo commented Mar 16, 2026

@georgesittas Thanks for the detailed guidance. Agreed, the in-loop approach cost more than we bargained for.

Two things I want to confirm before revising:

  1. After the main pass, Ordered subtree nodes are in _visited, so _annotate_expression would skip them. I'll still need to clear _visited entries for non-leaf nodes before re-invoking — is that what you had in mind, or is there a way to avoid that walk?
  2. I read your suggestion as: only re-annotate compound Ordered expressions (e.g., y + 1), as opposed to trivial alias refs (e.g., ORDER BY y) where setting the Column type is sufficient. Is that right?

Re: GROUP BY / HAVING — those indeed don't have this issue because they are expanded. If we added ORDER BY to _expand_alias_refs in qualify_columns, then SELECT x+1 AS y FROM t ORDER BY y would become ORDER BY x+1 and not need a post-pass, but that's a bigger external change. Worth exploring?
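For reference, the expansion alternative can be illustrated on a toy tuple-based expression tree. This is a sketch only: `expand_order_by_aliases` is a hypothetical function, not sqlglot's `_expand_alias_refs`, and it ignores cases such as an alias that shadows a real column.

```python
from typing import Dict, Tuple

Expr = Tuple  # ("col", name) | ("lit", value) | (op, arg1, arg2, ...)


def expand_order_by_aliases(expr: Expr, projections: Dict[str, Expr]) -> Expr:
    """Replace alias references with the projection expressions they name."""
    kind = expr[0]
    if kind == "col":
        # A column whose name matches an alias is swapped for its definition.
        return projections.get(expr[1], expr)
    if kind == "lit":
        return expr
    # Any other node is an operator: rebuild it with expanded arguments.
    return (kind,) + tuple(expand_order_by_aliases(arg, projections) for arg in expr[1:])


# SELECT x + 1 AS y FROM t ORDER BY y  ->  ORDER BY x + 1
projections = {"y": ("add", ("col", "x"), ("lit", 1))}
print(expand_order_by_aliases(("col", "y"), projections))
```

After expansion every leaf is an ordinary table column, so a single annotation pass could type the ORDER BY expression without a fixup step.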

@georgesittas
Collaborator

Hey @doripo, apologies for the delay here. Let me answer your questions:

> After the main pass, Ordered subtree nodes are in _visited, so _annotate_expression would skip them. I'll still need to clear _visited entries for non-leaf nodes before re-invoking — is that what you had in mind, or is there a way to avoid that walk?

I think you can manually set _overwrite_types to True locally, when you're about to annotate the Ordered nodes. Only do this for subtrees whose types you know will change, to avoid messing up existing types. This is just an idea, so make sure you explore whether it has any unintended side effects while you're at it.

> I read your suggestion as: only re-annotate compound Ordered expressions (e.g., y + 1), as opposed to trivial alias refs (e.g., ORDER BY y) where setting the Column type is sufficient. Is that right?

I think we want to annotate all of the Ordered nodes, right? If you skip ORDER BY y, won't y's type remain unknown?

> Re: GROUP BY / HAVING — those indeed don't have this issue because they are expanded. If we added ORDER BY to _expand_alias_refs in qualify_columns, then SELECT x+1 AS y FROM t ORDER BY y would become ORDER BY x+1 and not need a post-pass, but that's a bigger external change. Worth exploring?

Cool, that's what I expected. No, let's not worry about this for now.

@georgesittas
Collaborator

Will be out for a few days, @geooo109 or @VaggelisD can you keep an eye on this PR?

Revert the in-loop approach and restore the post-pass in annotate_scope.
Inline the reannotation logic into _fixup_order_by_aliases instead of
a separate _reannotate_subtree method. Update duplicate alias test
comment to reference _expand_alias_refs consistency.

Signed-off-by: Dori Polotsky <doripo@riverpool.ai>
@doripo
Contributor Author

doripo commented Mar 18, 2026

Thanks for the detailed answers, and no worries!

Re: _overwrite_types — explored it, but _visited is checked first in the skip condition of _annotate_expression, so it blocks re-entry regardless of _overwrite_types (which is already True by default). Clearing _visited entries for non-leaf nodes is still needed for _annotate_expression to re-derive compound expression types. (Worth noting separately — could be that _overwrite_types was intended to bypass _visited as well?)
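A toy model of that skip condition shows why the flag alone does not help. The only detail taken from the discussion is that the visited check comes first; the rest of `ToyAnnotator`, including the exact shape of the condition and the dict-based nodes, is an assumption made for illustration and not sqlglot's implementation.

```python
class ToyAnnotator:
    """Hypothetical annotator: the visited check precedes the overwrite flag."""

    def __init__(self, overwrite_types: bool = True) -> None:
        self.overwrite_types = overwrite_types
        self.visited = set()

    def annotate(self, node: dict) -> dict:
        # A visited node is skipped regardless of overwrite_types.
        if id(node) in self.visited or (not self.overwrite_types and node["type"]):
            return node
        self.visited.add(id(node))
        node["type"] = node["derive"]()
        return node


leaf = {"type": "UNKNOWN"}
parent = {"type": None, "derive": lambda: leaf["type"]}

annotator = ToyAnnotator(overwrite_types=True)
annotator.annotate(parent)            # main pass: parent derived as UNKNOWN
leaf["type"] = "INT"                  # the fixup sets the leaf type afterwards
annotator.annotate(parent)            # skipped: parent is already visited
stale = parent["type"]
annotator.visited.discard(id(parent)) # clear the entry, then re-annotate
annotator.annotate(parent)
print(stale, parent["type"])          # UNKNOWN INT
```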

Re: annotating all Ordered nodes — yes, the post-pass sets leaf types and then clears + re-annotates the full ORDER BY subtree, covering both simple and compound cases.

The third commit brings it back close to the original post-pass, with the reannotation inlined. I've updated the PR description to reflect the current approach.

Happy to refine further from here as necessary.

2 participants