[SPARK-52394][PS] Fix autocorr divide-by-zero error under ANSI mode #51192
Conversation
else:
    lag_scol = F.lag(scol, lag).over(Window.orderBy(NATURAL_ORDER_COLUMN_NAME))
    lag_col_name = verify_temp_column_name(sdf, "__autocorr_lag_tmp_col__")
    corr = (
        sdf.withColumn(lag_col_name, lag_scol)
        .select(F.corr(scol, F.col(lag_col_name)))
How is corr affected by ANSI mode?
The example below shows how F.corr(col1, col2) fails with a DIVIDE_BY_ZERO error when ANSI mode is on:
>>> df = spark.createDataFrame(
... [(1, None), (0, 1), (0, 0), (0, 0)],
... ["val", "lag"]
... )
>>> df.show()
+---+----+
|val| lag|
+---+----+
| 1|NULL|
| 0| 1|
| 0| 0|
| 0| 0|
+---+----+
>>> df.select(F.corr("val", "lag")).show()
...
org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"corr" was called from
...
>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>> df.select(F.corr("val", "lag")).show()
+--------------+
|corr(val, lag)|
+--------------+
| NULL|
+--------------+
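For context, Pearson correlation divides the sample covariance by the product of the two sample standard deviations. Over the non-NULL pairs above, the val column is constant, so that denominator is 0, which ANSI mode surfaces as DIVIDE_BY_ZERO. A minimal sketch (reusing the df defined above; this snippet is illustrative and not part of the original discussion) that exposes the zero denominator:

from pyspark.sql import functions as F

# Pearson correlation = covar_samp(val, lag) / (stddev_samp(val) * stddev_samp(lag)).
# Over the non-NULL pairs, "val" is [0, 0, 0], so stddev_samp(val) is 0.0 and the
# denominator of corr is 0, which ANSI mode reports as DIVIDE_BY_ZERO.
df.dropna().select(
    F.covar_samp("val", "lag"),
    F.stddev_samp("val"),
    F.stddev_samp("lag"),
).show()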
May I get a review @zhengruifeng @ueshin? Thank you!
python/pyspark/pandas/series.py
Outdated
sdf_lag = sdf.withColumn(lag_col_name, lag_scol)
if is_ansi_mode_enabled(sdf.sparkSession):
    cov_value = sdf_lag.select(F.covar_samp(scol, F.col(lag_col_name))).first()[0]
nit: Why is this using first() instead of head()?
Used head for consistency.
python/pyspark/pandas/series.py
Outdated
sdf_lag = sdf.withColumn(lag_col_name, lag_scol)
if is_ansi_mode_enabled(sdf.sparkSession):
    cov_value = sdf_lag.select(F.covar_samp(scol, F.col(lag_col_name))).first()[0]
I'm not sure about the relationship between these functions and corr. Could you add a comment to explain why it works?
Good point, added comments.
sdf_lag = sdf.withColumn(lag_col_name, lag_scol)
if is_ansi_mode_enabled(sdf.sparkSession):
    # Compute covariance between the original and lagged columns.
    # If the covariance is None or zero (indicating no linear relationship),
I am not sure about the relationship between corr and covar_samp now. If the ANSI failure throws DIVIDE_BY_ZERO, is it possible to try-catch the error?
We might not be able to catch the error because it’s a Spark runtime error thrown from the executor. But based on my testing, the calculation of corr fails specifically at the covariance step for the test case ps.Series([1, 0, 0, 0]). According to the formula, when covar_samp is 0, the correlation should be 0 anyway. WDYT?
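For reference, a rough sketch of how the pieces above fit together, inferred from the diff context in this thread (names such as sdf, scol, lag_scol, and lag_col_name come from series.py; the exact merged code may differ):

sdf_lag = sdf.withColumn(lag_col_name, lag_scol)
if is_ansi_mode_enabled(sdf.sparkSession):
    # corr divides the sample covariance by the product of the two sample
    # standard deviations. When the covariance is None (not enough pairs) or 0
    # (no linear relationship), return it directly instead of letting corr hit
    # a zero denominator under ANSI mode.
    cov_value = sdf_lag.select(F.covar_samp(scol, F.col(lag_col_name))).head()[0]
    if cov_value is None or cov_value == 0.0:
        return cov_value
corr = sdf_lag.select(F.corr(scol, F.col(lag_col_name))).head()[0]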
merged to master
Thank you!
What changes were proposed in this pull request?
Fix autocorr divide-by-zero error under ANSI mode
Why are the changes needed?
Ensure pandas on Spark works well with ANSI mode on.
Part of https://issues.apache.org/jira/browse/SPARK-52169.
Does this PR introduce any user-facing change?
When ANSI is on,
FROM
TO
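As a hedged illustration of the behavior change, using the ps.Series([1, 0, 0, 0]) case from the review discussion:

import pyspark.pandas as ps

# With spark.sql.ansi.enabled=true, this call previously failed with
# SparkArithmeticException: [DIVIDE_BY_ZERO]; with this fix it completes
# and returns a result instead of raising.
ps.Series([1, 0, 0, 0]).autocorr()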
How was this patch tested?
Unit tests.
Commands below passed
Was this patch authored or co-authored using generative AI tooling?
No.