[SPARK-52394][PS] Fix autocorr divide-by-zero error under ANSI mode #51192
Conversation
else:
    lag_scol = F.lag(scol, lag).over(Window.orderBy(NATURAL_ORDER_COLUMN_NAME))
    lag_col_name = verify_temp_column_name(sdf, "__autocorr_lag_tmp_col__")
    corr = (
        sdf.withColumn(lag_col_name, lag_scol)
        .select(F.corr(scol, F.col(lag_col_name)))
How is corr affected by ANSI mode?
The example below shows how F.corr(col1, col2) fails with a DIVIDE_BY_ZERO error when ANSI mode is on:
>>> df = spark.createDataFrame(
... [(1, None), (0, 1), (0, 0), (0, 0)],
... ["val", "lag"]
... )
>>> df.show()
+---+----+
|val| lag|
+---+----+
| 1|NULL|
| 0| 1|
| 0| 0|
| 0| 0|
+---+----+
>>> df.select(F.corr("val", "lag")).show()
...
org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"corr" was called from
...
>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>> df.select(F.corr("val", "lag")).show()
+--------------+
|corr(val, lag)|
+--------------+
| NULL|
+--------------+
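For context, Pearson correlation divides the sample covariance by the product of the two sample standard deviations. Over the non-NULL pairs above, the val column is constant, so that denominator is 0, which ANSI mode surfaces as DIVIDE_BY_ZERO. A minimal sketch (reusing the df defined above; this snippet is illustrative and not part of the original discussion) that exposes the zero denominator:

from pyspark.sql import functions as F

# Pearson correlation = covar_samp(val, lag) / (stddev_samp(val) * stddev_samp(lag)).
# Over the non-NULL pairs, "val" is [0, 0, 0], so stddev_samp(val) is 0.0 and the
# denominator of corr is 0, which ANSI mode reports as DIVIDE_BY_ZERO.
df.dropna().select(
    F.covar_samp("val", "lag"),
    F.stddev_samp("val"),
    F.stddev_samp("lag"),
).show()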
May I get a review @zhengruifeng @ueshin? Thank you!
python/pyspark/pandas/series.py
Outdated
sdf_lag = sdf.withColumn(lag_col_name, lag_scol)
if is_ansi_mode_enabled(sdf.sparkSession):
    cov_value = sdf_lag.select(F.covar_samp(scol, F.col(lag_col_name))).first()[0]
nit: Why is this using first() instead of head()?
Used head for consistency.
python/pyspark/pandas/series.py
Outdated
sdf_lag = sdf.withColumn(lag_col_name, lag_scol)
if is_ansi_mode_enabled(sdf.sparkSession):
    cov_value = sdf_lag.select(F.covar_samp(scol, F.col(lag_col_name))).first()[0]
I'm not sure about the relationship between these functions and corr. Could you add a comment to explain why it works?
Good point, added comments.
sdf_lag = sdf.withColumn(lag_col_name, lag_scol)
if is_ansi_mode_enabled(sdf.sparkSession):
    # Compute covariance between the original and lagged columns.
    # If the covariance is None or zero (indicating no linear relationship),
I am not sure about the relationship between corr and covar_samp now. If the ANSI failure throws DIVIDE_BY_ZERO, is it possible to try-catch the error?
We might not be able to catch the error because it’s a Spark runtime error thrown from the executor. But based on my testing, the calculation of corr fails specifically at the covariance step for the test case ps.Series([1, 0, 0, 0]). According to the formula, when covar_samp is 0, the correlation should be 0 anyway. WDYT?
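For reference, a rough sketch of how the pieces above fit together, inferred from the diff context in this thread (names such as sdf, scol, lag_scol, and lag_col_name come from series.py; the exact merged code may differ):

sdf_lag = sdf.withColumn(lag_col_name, lag_scol)
if is_ansi_mode_enabled(sdf.sparkSession):
    # corr divides the sample covariance by the product of the two sample
    # standard deviations. When the covariance is None (not enough pairs) or 0
    # (no linear relationship), return it directly instead of letting corr hit
    # a zero denominator under ANSI mode.
    cov_value = sdf_lag.select(F.covar_samp(scol, F.col(lag_col_name))).head()[0]
    if cov_value is None or cov_value == 0.0:
        return cov_value
corr = sdf_lag.select(F.corr(scol, F.col(lag_col_name))).head()[0]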
merged to master
Thank you!
What changes were proposed in this pull request?
Fix autocorr divide-by-zero error under ANSI mode
Why are the changes needed?
Ensure pandas on Spark works well with ANSI mode on.
Part of https://issues.apache.org/jira/browse/SPARK-52169.
Does this PR introduce any user-facing change?
When ANSI is on,
FROM
TO
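As a hedged illustration of the behavior change, using the ps.Series([1, 0, 0, 0]) case from the review discussion:

import pyspark.pandas as ps

# With spark.sql.ansi.enabled=true, this call previously failed with
# SparkArithmeticException: [DIVIDE_BY_ZERO]; with this fix it completes
# and returns a result instead of raising.
ps.Series([1, 0, 0, 0]).autocorr()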
How was this patch tested?
Unit tests.
Commands below passed
Was this patch authored or co-authored using generative AI tooling?
No.