[SPARK-52861][PYTHON] Skip Row object creation in Arrow-optimized UDTF execution
### What changes were proposed in this pull request?
Skips `Row` object creation in Arrow-optimized UDTF execution.
### Why are the changes needed?
Arrow-optimized UDTF execution creates `Row` objects, which is expensive and not necessary.
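To illustrate why per-row object creation can be costly, here is a minimal sketch that needs no Spark installation. `FakeRow` is a hypothetical stand-in (a tuple subclass that carries field names) and is not `pyspark.sql.Row`; the column data and sizes are also made up for illustration. It compares building plain tuples against wrapping every row in an extra Python object.

```python
import timeit


class FakeRow(tuple):
    """Hypothetical stand-in for a named row type; NOT pyspark.sql.Row."""

    def __new__(cls, fields, values):
        row = super().__new__(cls, values)
        # Storing the field names costs an extra attribute write per row.
        row.__fields__ = fields
        return row


FIELDS = ("i", "s")
# Column-oriented input, loosely mimicking what a columnar table provides.
columns = [list(range(20_000)), [str(n) for n in range(20_000)]]


def as_tuples(cols):
    # zip already yields plain tuples, so no per-row Python-level call is needed.
    return list(zip(*cols))


def as_rows(cols):
    # One extra constructor call (and attribute write) per row.
    return [FakeRow(FIELDS, values) for values in zip(*cols)]


if __name__ == "__main__":
    t_tuples = timeit.timeit(lambda: as_tuples(columns), number=5)
    t_rows = timeit.timeit(lambda: as_rows(columns), number=5)
    print(f"plain tuples: {t_tuples:.3f}s, row objects: {t_rows:.3f}s")
```

Both functions return rows with the same values; only the wrapping differs. Absolute timings depend on the machine and Python version.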
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests and manual benchmarks.
```py
def profile(f, *args, _n=10, **kwargs):
    import cProfile
    import pstats
    import gc

    st = None
    # Warm up so one-time costs do not skew the profile.
    for _ in range(5):
        f(*args, **kwargs)
    # Profile `_n` runs and accumulate the stats into `st`.
    for _ in range(_n):
        gc.collect()
        with cProfile.Profile() as pr:
            ret = f(*args, **kwargs)
        if st is None:
            st = pstats.Stats(pr)
        else:
            st.add(pstats.Stats(pr))
    st.sort_stats("time", "cumulative").print_stats()
    return ret


from pyspark.sql.conversion import ArrowTableToRowsConversion, LocalDataToArrowConversion
from pyspark.sql.types import *

# One million rows; `i` is None for every multiple of 1000 (including 0).
data = [
    (i if i % 1000 else None, str(i))
    for i in range(1000000)
]
schema = (
    StructType()
    .add("i", IntegerType(), nullable=True)
    .add("s", StringType(), nullable=True)
)


def to_arrow():
    return LocalDataToArrowConversion.convert(data, schema, use_large_var_types=False)


def from_arrow(tbl, return_as_tuples):
    return ArrowTableToRowsConversion.convert(tbl, schema, return_as_tuples=return_as_tuples)


tbl = to_arrow()
# Before: rows are materialized as `Row` objects.
profile(from_arrow, tbl, return_as_tuples=False)
# After: rows are returned as plain tuples.
profile(from_arrow, tbl, return_as_tuples=True)
```
- before (`return_as_tuples=False`)
```
60655810 function calls in 14.112 seconds
```
- after (`return_as_tuples=True`)
```
20328060 function calls in 5.613 seconds
```
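As a quick reading of the numbers above, the following sketch works out the ratios. It assumes the reported totals aggregate the default `_n=10` profiled runs over the 1,000,000-row table, as the `profile` helper suggests; `RUNS` and `ROWS` are names introduced here for illustration only.

```python
# Figures copied from the profiler output above.
before = {"calls": 60_655_810, "seconds": 14.112}
after = {"calls": 20_328_060, "seconds": 5.613}

RUNS = 10         # assumption: the default `_n` of the profile helper
ROWS = 1_000_000  # size of the benchmark data


def ratio(metric):
    """How many times larger the 'before' figure is than the 'after' figure."""
    return before[metric] / after[metric]


time_speedup = ratio("seconds")
call_reduction = ratio("calls")
calls_per_row_before = before["calls"] / (RUNS * ROWS)
calls_per_row_after = after["calls"] / (RUNS * ROWS)

print(f"time: {time_speedup:.2f}x faster, calls: {call_reduction:.2f}x fewer")
print(f"calls per row: {calls_per_row_before:.2f} -> {calls_per_row_after:.2f}")
```

Under those assumptions, the conversion is roughly 2.5x faster and makes about a third as many function calls: around two per row instead of around six.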
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#51546 from ueshin/issues/SPARK-52861/skip_row_creation.
Authored-by: Takuya Ueshin <[email protected]>
Signed-off-by: Takuya Ueshin <[email protected]>