
Commit 19cedec

[SPARK-52751][PYTHON][CONNECT] Don't eagerly validate column name in dataframe['col_name']
### What changes were proposed in this pull request?

Don't eagerly validate the column name in `dataframe['col_name']`.

### Why are the changes needed?

To save an ANALYZE RPC; the query now fails on the Spark Connect server side instead.

### Does this PR introduce _any_ user-facing change?

Yes, `df['bad_column']` now fails on analysis or execution rather than at access time.

### How was this patch tested?

Updated tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #51400 from zhengruifeng/test_fail_col.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
1 parent c7c1021 commit 19cedec
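
The user-facing effect, as a minimal sketch assuming a running Spark Connect server (the remote URL and column names are illustrative, not taken from the commit):

```python
from pyspark.sql import SparkSession

# Connect to a Spark Connect server; the URL is an assumption for this sketch.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
df = spark.range(10).withColumnRenamed("id", "key")

# Previously this lookup triggered an eager ANALYZE RPC and raised
# immediately for an unknown column; it now returns a Column unvalidated.
col = df["bad_key"]

# The bad name surfaces only when the plan is analyzed or executed,
# on the Connect server side.
df.select(col).schema  # raises AnalysisException
```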

File tree

3 files changed: +7, -2 lines


python/docs/source/migration_guide/pyspark_upgrade.rst

Lines changed: 2 additions & 0 deletions
```diff
@@ -22,6 +22,8 @@ Upgrading PySpark
 Upgrading from PySpark 4.0 to 4.1
 ---------------------------------
 
+* In Spark 4.1, ``DataFrame['name']`` on Spark Connect Python Client no longer eagerly validates the column name. To restore the legacy behavior, set the ``PYSPARK_VALIDATE_COLUMN_NAME_LEGACY`` environment variable to ``1``.
+
 * In Spark 4.1, Arrow-optimized Python UDF supports UDT input / output instead of falling back to the regular UDF. To restore the legacy behavior, set ``spark.sql.execution.pythonUDF.arrow.legacy.fallbackOnUDT`` to ``true``.
 
 * In Spark 4.1, unnecessary conversion to pandas instances is removed when ``spark.sql.execution.pythonUDF.arrow.enabled`` is enabled. As a result, the type coercion changes when the produced output has a schema different from the specified schema. To restore the previous behavior, enable ``spark.sql.legacy.execution.pythonUDF.pandas.conversion.enabled``.
```
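
Restoring the legacy behavior described in the new migration note, as a hedged sketch (the flag must be set in the client process before the column access):

```python
import os

# Opt back into eager, client-side column-name validation.
os.environ["PYSPARK_VALIDATE_COLUMN_NAME_LEGACY"] = "1"

df["bad_key"]  # raises at access time again, as in Spark 4.0
```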

python/pyspark/sql/connect/dataframe.py

Lines changed: 4 additions & 1 deletion
```diff
@@ -44,6 +44,7 @@
 )
 
 import copy
+import os
 import sys
 import random
 import pyarrow as pa
@@ -1740,7 +1741,9 @@ def __getitem__(
         # }
 
         # validate the column name
-        if not hasattr(self._session, "is_mock_session"):
+        if os.environ.get("PYSPARK_VALIDATE_COLUMN_NAME_LEGACY") == "1" and not hasattr(
+            self._session, "is_mock_session"
+        ):
             from pyspark.sql.connect.types import verify_col_name
 
             # Try best to verify the column name with cached schema
```
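
The new gating condition, extracted into a standalone sketch for readability (the helper name is hypothetical; in the actual code the check is inline in `__getitem__`):

```python
import os

def _should_validate_eagerly(session) -> bool:
    # Eager validation is now opt-in via the legacy environment flag,
    # and remains disabled for mock sessions used in testing.
    return (
        os.environ.get("PYSPARK_VALIDATE_COLUMN_NAME_LEGACY") == "1"
        and not hasattr(session, "is_mock_session")
    )
```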

python/pyspark/sql/tests/test_column.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -133,8 +133,8 @@ def test_access_column(self):
         self.assertTrue(isinstance(df["key"], Column))
         self.assertTrue(isinstance(df[0], Column))
         self.assertRaises(IndexError, lambda: df[2])
-        self.assertRaises(AnalysisException, lambda: df["bad_key"])
         self.assertRaises(TypeError, lambda: df[{}])
+        self.assertRaises(AnalysisException, lambda: df.select(df["bad_key"]).schema)
 
     def test_column_name_with_non_ascii(self):
         columnName = "数量"
```
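
The moved assertion, runnable outside the test harness as a hedged sketch (assumes a Spark Connect DataFrame `df` without a `bad_key` column):

```python
from pyspark.errors import AnalysisException

# df["bad_key"] alone no longer raises, so the test forces analysis
# by requesting the schema of a plan that references the column.
try:
    df.select(df["bad_key"]).schema
except AnalysisException:
    print("column resolution failed during analysis, as expected")
```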
