Commit 7251e95
[SPARK-53615][PYTHON] Introduce iterator API for Arrow grouped aggregation UDF
### What changes were proposed in this pull request?
This PR introduces an iterator API for Arrow grouped aggregation UDFs in PySpark. It adds support for two new UDF patterns:
- `Iterator[pa.Array] -> Any` for single column aggregations
- `Iterator[Tuple[pa.Array, ...]] -> Any` for multiple column aggregations
The implementation adds a new Python eval type `SQL_GROUPED_AGG_ARROW_ITER_UDF` with corresponding support in type inference, worker serialization, and Scala execution planning.
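
As a rough illustration of how the two type-hint patterns could be told apart, the sketch below classifies a function by the hint on its first parameter. It is not the PySpark inference code: `classify_iterator_hint` and the `Array` stand-in class (used instead of `pa.Array` so the snippet runs without pyarrow) are hypothetical names introduced here for illustration only.

```python
import collections.abc
from typing import Any, Iterator, Tuple, get_args, get_origin, get_type_hints


class Array:
    """Hypothetical stand-in for pyarrow.Array (assumption, not a PySpark type)."""


def classify_iterator_hint(func) -> str:
    """Return 'single', 'multiple', or 'other' from the first parameter's hint."""
    hints = get_type_hints(func)
    hints.pop("return", None)
    if not hints:
        return "other"
    first = next(iter(hints.values()))
    # Iterator[...] reports collections.abc.Iterator as its origin.
    if get_origin(first) is not collections.abc.Iterator:
        return "other"
    args = get_args(first)
    if len(args) != 1:
        return "other"
    item = args[0]
    if item is Array:
        return "single"
    # Tuple[Array, ...] and Tuple[Array, Array] both report tuple as origin.
    if get_origin(item) is tuple:
        members = [a for a in get_args(item) if a is not Ellipsis]
        if members and all(a is Array for a in members):
            return "multiple"
    return "other"


def single(it: Iterator[Array]) -> Any: ...
def multiple(it: Iterator[Tuple[Array, ...]]) -> Any: ...

print(classify_iterator_hint(single), classify_iterator_hint(multiple))
```

The real implementation additionally has to map the result to `SQL_GROUPED_AGG_ARROW_ITER_UDF`; that step is omitted here because its details are not shown in this description.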
### Why are the changes needed?
The current Arrow grouped aggregation API requires loading all data for a group into memory at once, which can be problematic for groups with large amounts of data. The iterator API allows processing data in batches, providing:
1. **Memory Efficiency**: Processes data incrementally rather than loading the entire group into memory
2. **Consistency**: Aligns with existing iterator APIs (e.g., `SQL_SCALAR_ARROW_ITER_UDF`)
3. **Flexibility**: Allows initialization of expensive state once per group while processing batches iteratively
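
The memory argument above can be sketched without Spark. In the snippet below, plain Python lists stand in for the Arrow batches of one group, and `batched` and `streaming_mean` are hypothetical helper names; the point is only that the aggregate keeps a running sum and count rather than the whole group.

```python
from typing import Iterable, Iterator, List


def batched(values: List[float], batch_size: int) -> Iterator[List[float]]:
    """Yield consecutive slices of `values`, standing in for one group's batches."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


def streaming_mean(batches: Iterable[List[float]]) -> float:
    """Consume batches one at a time, holding only a running sum and count."""
    total = 0.0  # state is initialized once per group
    count = 0
    for batch in batches:
        total += sum(batch)
        count += len(batch)
    return total / count if count > 0 else 0.0


group_values = [1.0, 2.0, 3.0, 4.0]
print(streaming_mean(batched(group_values, batch_size=3)))
```

Only one batch is referenced at a time inside the loop, so peak memory in this sketch depends on the batch size rather than on the size of the group.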
### Does this PR introduce _any_ user-facing change?
Yes. This PR adds a new API pattern for Arrow grouped aggregation UDFs:
**Single column aggregation:**
```python
import pyarrow as pa
import pyarrow.compute as pc
from typing import Iterator
from pyspark.sql.functions import arrow_udf

@arrow_udf("double")
def arrow_mean(it: Iterator[pa.Array]) -> float:
    # Keep a running sum and count across the group's batches.
    sum_val = 0.0
    cnt = 0
    for v in it:
        sum_val += pc.sum(v).as_py()
        cnt += len(v)
    return sum_val / cnt if cnt > 0 else 0.0

# Assumes an existing DataFrame `df` with columns "id" and "v".
df.groupby("id").agg(arrow_mean(df["v"])).show()
```
**Multiple column aggregation:**
```python
import pyarrow as pa
import pyarrow.compute as pc
import numpy as np
from typing import Iterator, Tuple
from pyspark.sql.functions import arrow_udf

@arrow_udf("double")
def arrow_weighted_mean(it: Iterator[Tuple[pa.Array, pa.Array]]) -> float:
    # Keep a running weighted sum and total weight across the group's batches.
    weighted_sum = 0.0
    weight = 0.0
    for v, w in it:
        weighted_sum += np.dot(v.to_numpy(), w.to_numpy())
        weight += pc.sum(w).as_py()
    return weighted_sum / weight if weight > 0 else 0.0

# Assumes an existing DataFrame `df` with columns "id", "v", and "w".
df.groupby("id").agg(arrow_weighted_mean(df["v"], df["w"])).show()
```
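
Aggregates that are not simple sums can follow the same batch-at-a-time shape as long as their state can be merged batch by batch. As a hedged sketch (not part of this PR), the function below computes a sample variance by merging each batch's count, mean, and sum of squared deviations into running totals. Plain lists stand in for Arrow batches, and `streaming_variance` is a hypothetical name.

```python
from typing import Iterable, List, Optional


def streaming_variance(batches: Iterable[List[float]]) -> Optional[float]:
    """Sample variance over all batches, merging per-batch statistics."""
    n = 0        # values seen so far
    mean = 0.0   # running mean
    m2 = 0.0     # running sum of squared deviations from the mean
    for batch in batches:
        nb = len(batch)
        if nb == 0:
            continue
        mean_b = sum(batch) / nb
        m2_b = sum((x - mean_b) ** 2 for x in batch)
        delta = mean_b - mean
        total = n + nb
        # Merge the batch statistics into the running statistics.
        m2 += m2_b + delta * delta * n * nb / total
        mean += delta * nb / total
        n = total
    # The sample variance is undefined for fewer than two values.
    return m2 / (n - 1) if n > 1 else None


print(streaming_variance([[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]))
```

In an actual UDF the loop body would read from `pa.Array` batches instead of lists; that translation is left out so the sketch stays runnable without pyarrow.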
### How was this patch tested?
Added comprehensive unit tests in `python/pyspark/sql/tests/arrow/test_arrow_udf_grouped_agg.py`:
1. `test_iterator_grouped_agg_single_column()` - Tests single column iterator aggregation with `Iterator[pa.Array]`
2. `test_iterator_grouped_agg_multiple_columns()` - Tests multiple column iterator aggregation with `Iterator[Tuple[pa.Array, pa.Array]]`
3. `test_iterator_grouped_agg_eval_type()` - Verifies correct eval type inference from type hints
### Was this patch authored or co-authored using generative AI tooling?
Co-Generated-by: Cursor with Claude Sonnet 4.5
Closes #53035 from Yicong-Huang/SPARK-53615/feat/arrow-grouped-agg-iterator-api.
Authored-by: Yicong-Huang <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>

1 parent d4e34f5 commit 7251e95
File tree: 12 files changed (+456, -3 lines)
- core/src/main/scala/org/apache/spark/api/python
- python/pyspark
  - sql
    - pandas
      - _typing
    - tests/arrow
- sql/core/src/main/scala/org/apache/spark/sql/execution/python