feat(optimizer): annotate types for ORDER BY alias references#7281
feat(optimizer): annotate types for ORDER BY alias references#7281doripo wants to merge 3 commits intotobymao:mainfrom
Conversation
When ORDER BY references a projection alias (e.g., SELECT x+1 AS y ... ORDER BY y), the column's type was left as UNKNOWN. qualify_columns intentionally preserves these alias refs (they're valid SQL in all dialects), so the single-pass annotator has no table-qualified column to resolve against. Add a post-pass (_fixup_order_by_aliases) in annotate_scope that runs after projections are fully typed. It builds an alias-to-type map, fixes matching bare columns in ORDER BY, and re-derives parent types on compound expressions (e.g., ORDER BY y + 1) via _reannotate_subtree. This approach avoids modifying _annotate_expression, the core annotation loop. _reannotate_subtree clears non-leaf types (preserving Column/Literal ground truth), prunes at Subquery boundaries, and re-invokes _annotate_expression sequentially.
|
Happy to adjust the approach if you'd prefer this handled differently. A few notes on design choices:
|
Replace the post-pass (_fixup_order_by_aliases + _reannotate_subtree) with _resolve_order_by_alias, called from the column annotation path in _annotate_expression. When a bare column in ORDER BY matches a projection alias, it forces the projection's annotation via a recursive call if needed, then copies the type. This resolves alias types during the existing annotation pass instead of walking the ORDER BY subtree twice after the fact. Signed-off-by: Dori Polotsky <doripo@riverpool.ai>
|
Note: |
georgesittas
left a comment
There was a problem hiding this comment.
@doripo I think I misguided you to an even costlier approach. This didn't work out well:
For every Column in every scope of the input query, you're doing an ancestor search and you walk the projection list of said query, again and again. Both of these are very wasteful.
We should move the logic out of _annotate_expression entirely and make it a post-pass in annotate_scope. After _annotate_expression finishes, all projections are fully typed, so you can:
- Grab the
Ordernode from the query - Build an
alias -> typemap fromquery.selects - Walk only the
ORDER BYcolumns, copy types for bare-column alias matches - Re-annotate any compound
ORDER BYexpressions (e.g.,y + 1) whose leaf types changed, by calling_annotate_expressionon those subtrees
This is simpler, more targeted, and keeps the hot path (_annotate_expression) untouched. It's essentially what you did in the first commit, but without the complexity of _reannotate_subtree, because you can just re-run _annotate_expression on the individual Ordered node, since all leaf types are now known.
|
Are there similar issues during the annotation of |
|
@georgesittas Thanks for the detailed guidance — agreed, the in-loop approach bought us more than we bargained for. Two things I want to confirm before revising:
Re: GROUP BY / HAVING — those indeed don't have this issue because they are expanded. If we added ORDER BY to |
|
Hey @doripo, apologies for the delay here. Let me answer your questions:
I think you can manually set
I think we want to annotate all of the
Cool, that's what I expected. No, let's not worry about this for now. |
|
Will be out for a few days, @geooo109 or @VaggelisD can you keep an eye on this PR? |
Revert the in-loop approach and restore the post-pass in annotate_scope. Inline the reannotation logic into _fixup_order_by_aliases instead of a separate _reannotate_subtree method. Update duplicate alias test comment to reference _expand_alias_refs consistency. Signed-off-by: Dori Polotsky <doripo@riverpool.ai>
|
Thanks for the detailed answers, and no worries! Re: Re: annotating all The third commit brings it back close to the original post-pass, with the reannotation inlined - updated the PR description to reflect the current approach. Happy to refine further from here as necessary. |
When ORDER BY references a projection alias,
annotate_typesleaves the column typed asUNKNOWN:This happens because
qualify_columnsintentionally preserves ORDER BY alias refs (they're valid SQL in all dialects), so the annotator has no table-qualified column to resolve against. Other clauses (GROUP BY, HAVING) don't have this issue because_expand_alias_refsexpands their alias references before annotation.This PR adds a post-pass (
_fixup_order_by_aliases) inannotate_scopethat runs after_annotate_expression, when projections are fully typed. It:query.selects_visitedentries and non-leaf types on the ORDER BY subtree so_annotate_expressioncan re-derive compound expression types (e.g.,ORDER BY y + 1) from the updated leavesThe
_visitedclearing is necessary because_annotate_expressionskips nodes already in_visited, regardless of_overwrite_types. Subquery subtrees are pruned during the clearing walk because they belong to inner scopes already annotated independently.Test coverage includes basic alias resolution, shadowing/collisions, sort modifiers, compound expressions, set operations, window functions, subquery-as-projection, type coercion, and regression guards.