-
Notifications
You must be signed in to change notification settings - Fork 15.4k
fix(sqllab): ensure SQL_QUERY_MUTATOR is called when MUTATE_AFTER_SPLIT=True #34111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
fix(sqllab): ensure SQL_QUERY_MUTATOR is called when MUTATE_AFTER_SPLIT=True #34111
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review by Korbit AI
Korbit automatically attempts to detect when you fix issues in new commits.
Category | Issue | Status |
---|---|---|
Vague inline comment for SQL mutation ▹ view | 🧠 Not in standard |
Files scanned
File Path | Reviewed |
---|---|
superset/sql_lab.py | ✅ |
Explore our documentation to understand the languages and file types we support and the files we ignore.
Check out our docs on how you can make Korbit work best for you and your team.
superset/sql_lab.py
Outdated
# Hook to allow environment-specific mutation (usually comments) to the SQL | ||
query.executed_sql = database.mutate_sql_based_on_config(block) | ||
query.executed_sql = database.mutate_sql_based_on_config(block, is_split=config["MUTATE_AFTER_SPLIT"]) |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #34111 +/- ##
===========================================
+ Coverage 0 72.98% +72.98%
===========================================
Files 0 559 +559
Lines 0 40520 +40520
Branches 0 4268 +4268
===========================================
+ Hits 0 29575 +29575
- Misses 0 9836 +9836
- Partials 0 1109 +1109
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
@@ -475,7 +475,9 @@ def execute_sql_statements( # noqa: C901 | |||
db.session.commit() | |||
|
|||
# Hook to allow environment-specific mutation (usually comments) to the SQL | |||
query.executed_sql = database.mutate_sql_based_on_config(block) | |||
query.executed_sql = database.mutate_sql_based_on_config( | |||
block, is_split=config["MUTATE_AFTER_SPLIT"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I understand the logic in place, but shouldn't the is_split
argument here be related to whether the sql-to-be-mutated is currently split (as in based on the context of where the function is called), and not be the config flag that means "I want to mutate before/after the split"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my understanding the code calling mutate_sql_based_on_config
should call it twice, once before it's split with is_split=False
and once after it's split with is_split=True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's the case, it might be worth changing the function definition to not set a default and, forcing all calls to mutate_sql_based_on_config
to be explicit at every call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole thing seems brittle and a bit of a hack. Ideally we'd be consistent here, say always call before split and make it easy for mutators to apply a split if they want, but that would break backwards compatibility ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your assessment. To clarify my approach: the MUTATE_AFTER_SPLIT
flag was introduced specifically to control when the query mutator is applied for chart queries, as discussed in #21645. This behavior was already established in SQL Lab, so the flag was not originally intended to affect SQL Lab queries.
From my understanding, the is_split
parameter is also primarily meant for chart queries, to ensure backward compatibility.
It would be possible to create a separate mutate_sql
function that always applies the mutator, just for SQL Lab (if semantic clarity is the issue with my fix), or to add a comment clarifying why the config flag is passed. However, I believe always applying the mutator after splitting the query in SQL Lab is the intended and correct approach - the "before or after splitting" logic was only ever meant for chart queries, not for SQL Lab.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I'm thinking is that is_split
on the line right above these comments should probably hard coded to True or False, depending on whether the SQL at that time is split or not.
Looking at where it executes ->
superset/superset/models/core.py
Line 646 in 5a32777
if sql_mutator and (is_split == config["MUTATE_AFTER_SPLIT"]): |
If merging the code as is, that method would always execute since config["MUTATE_AFTER_SPLIT"] == config["MUTATE_AFTER_SPLIT"]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sql lab shouldn't even be using the mutate_sql_based_on_config
method, it always wants to mutate the sql.
I think the confusion comes from a refactor you did here.
- sql = SQL_QUERY_MUTATOR(
- sql,
- security_manager=security_manager,
- database=database,
- )
+ sql = database.mutate_sql_based_on_config(sql)
The sql lab was refactored to use mutate_sql_based_on_config
, but that method was only written for a fix in chart queries (#21645) . In the sql lab we don't care about is_split
.
I propose that we either create a new mutate_sql
function that always mutates or just use the mutator like before the refactor.
Again, we don't want to use the before or after split logic, that was never intended for the sql lab. Hope this makes more sense now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I remember some of the context now. Generally we're trying to bring the SQL-execution code paths back together, and there's more work to be done here.
I think SQL_QUERY_MUTATOR
is a super useful config hook, but seems it has grown a bit complex. It's tempting to clarify/redesign, but we have to be cautious around backwards compatibility. Dilemma is probably around change management here.
Let's try to understand the ideal scenario here, and see if we can preserve backwards compatibility.
First, let's start with trying to itemize common use cases for SQL_QUERY_MUTATOR
:
- Add pandas-highcharts as requirements #1 is clearly logging, where we add comments to all the SQL executed with information about the query provenance, and the user who triggered it. This allows database admins to get more visibility into who's behind the service account. Many organizations use this to parse user/tool/context information and use it for analytics
- some security/data-access-policy stuff, maybe you have complex data access policy where you want to intercept some things, or double check some rules, and fail queries if something that shouldn't be allowed is submitted, simply
raise
an exception from here - some redirection / rewiring, maybe redirecting to a similar dataset for things like aggregate awareness, ...
- ?
Thinking about this method and keeping it backwards compatible, one thing we could do is passing in more context information, like a source
parameter that would inform whether this SQL comes from SQL Lab, Explore, alerts, populating filters' list of values ... Adding net new args to the function should be backwards compatible as long as user catch/disregard *args **kwargs.
About the is_split
flag, looking at the code, it looks like in explore we call mutate_sql_based_on_config
both before and after the SQL is broken into chunks for execution. Seems this can be confusing for people authoring SQL_QUERY_MUTATOR
, where you have to assume something like:
- in explore, this method will be called twice, once with is_split=False first, and once more for each statement with is_split=True
- in SQL Lab, will be called once for the block (could probably be called twice too following the same pattern as for explore)
- throw in some session/env settings, stuff like calling
SET
and things like that. Kind of a hack but who knows what people might be using this for ...
Knowing all this, it's unclear to me what to do, especially given the backwards-compatibility concerns.
Best "mostly backwards"-compatible way forward: add a source
args for clarity, call twice for SQL Lab for consistency (?) Add some notes in UPGRADING.md about the changes to notify upgraders.
Breaking backwards compatibility? Flag SQL_QUERY_MUTATOR
for deprecation and redesign the hook(s), maybe something like:
SQL_BLOCK_MUTATOR(source) # always called even if block has a single statement
SQL_STATEMENT_MUTATOR(source) # called for each sub-statement
Curious on what @betodealmeida or others might thing is best here. This could be slated along with overdue work around bringing all SQL execution codepaths together, and fix/improve other related issues around code and logic duplication across explore and SQL Lab.
SUMMARY
Very simple fix for bug described in: #30169
Currently all queries in sqllab are mutated like this:
query.executed_sql = database.mutate_sql_based_on_config(block)
This means
is_split
is alwaysFalse
-> We always use the sql_mutator whenMUTATE_AFTER_SPLIT
is False, and never when it's True.The fix passes the value of
MUTATE_AFTER_SPLIT
to themutate_sql_based_on_config
function, sois_split == config["MUTATE_AFTER_SPLIT"]
always evaluates to True.This does not break current implementations and fix the bug.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TESTING INSTRUCTIONS
In config.py:
Run query in sql lab -> Now prints message when
MUTATE_AFTER_SPLIT
= True or FalseADDITIONAL INFORMATION
Fixes #30169