alamb
Contributor

@alamb alamb commented Jul 5, 2025

Which issue does this PR close?

Rationale for this change

There are several non-trivial changes in arrow 56, so I want to start testing soon.

Also, I would like a stable base to test new parquet pushdown code from @XiangpengHao

What changes are included in this PR?

  1. Update to use pre-release version of arrow
  2. Update tests / APIs as necessary

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the common Related to common crate label Jul 5, 2025
@alamb alamb marked this pull request as draft July 5, 2025 11:57
@alamb alamb force-pushed the alamb/update_arrow_56.0.0 branch from 6698376 to 1747605 Compare July 5, 2025 12:20
@github-actions github-actions bot added logical-expr Logical plan and expressions proto Related to proto crate labels Jul 5, 2025
@alamb alamb force-pushed the alamb/update_arrow_56.0.0 branch from 1747605 to fc5bd79 Compare July 5, 2025 13:06
@github-actions github-actions bot added documentation Improvements or additions to documentation sql SQL Planner core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate datasource Changes to the datasource crate labels Jul 5, 2025
@alamb
Contributor Author

alamb commented Jul 5, 2025

I see some failures in the row_group_pruning tests

$ cargo test --package datafusion --test parquet_config parquet::row_group_pruning
failures:
    parquet::row_group_pruning::prune_binary_eq_match
    parquet::row_group_pruning::prune_binary_lt
    parquet::row_group_pruning::prune_binary_neq
    parquet::row_group_pruning::prune_string_eq_match
    parquet::row_group_pruning::prune_string_lt
    parquet::row_group_pruning::prune_string_neq

I am tracking it down

Here is an example failure

thread 'parquet::row_group_pruning::prune_binary_eq_match' panicked at datafusion/core/tests/parquet/row_group_pruning.rs:146:9:
assertion `left == right` failed: Expected 2 rows, got 1: Input:
+---------------+----------------+------------------------------+-------------------+------------------------------+
| name          | service_string | service_binary               | service_fixedsize | service_large_binary         |
+---------------+----------------+------------------------------+-------------------+------------------------------+
| all frontends | frontend one   | 66726f6e74656e64206f6e65     | 666531            | 66726f6e74656e64206f6e65     |
| all frontends | frontend two   | 66726f6e74656e642074776f     | 666532            | 66726f6e74656e642074776f     |
| all frontends | frontend three | 66726f6e74656e64207468726565 | 666533            | 66726f6e74656e64207468726565 |
| all frontends | frontend seven | 66726f6e74656e6420736576656e | 666537            | 66726f6e74656e6420736576656e |
| all frontends | frontend five  | 66726f6e74656e642066697665   | 666535            | 66726f6e74656e642066697665   |
| mixed         | frontend six   | 66726f6e74656e6420736978     | 666536            | 66726f6e74656e6420736978     |
| mixed         | frontend four  | 66726f6e74656e6420666f7572   | 666534            | 66726f6e74656e6420666f7572   |
| mixed         | backend one    | 6261636b656e64206f6e65       | 626531            | 6261636b656e64206f6e65       |
| mixed         | backend two    | 6261636b656e642074776f       | 626532            | 6261636b656e642074776f       |
| mixed         | backend three  | 6261636b656e64207468726565   | 626533            | 6261636b656e64207468726565   |
| all backends  | backend four   | 6261636b656e6420666f7572     | 626534            | 6261636b656e6420666f7572     |
| all backends  | backend five   | 6261636b656e642066697665     | 626535            | 6261636b656e642066697665     |
| all backends  | backend six    | 6261636b656e6420736978       | 626536            | 6261636b656e6420736978       |
| all backends  | backend seven  | 6261636b656e6420736576656e   | 626537            | 6261636b656e6420736576656e   |
| all backends  | backend eight  | 6261636b656e64206569676874   | 626538            | 6261636b656e64206569676874   |
+---------------+----------------+------------------------------+-------------------+------------------------------+
Query:
SELECT name, service_binary FROM t WHERE service_binary = CAST('backend one' AS bytea)
Output:
+-------+------------------------+
| name  | service_binary         |
+-------+------------------------+
| mixed | 6261636b656e64206f6e65 |
| mixed | 6261636b656e642074776f |
+-------+------------------------+
Metrics:
time_elapsed_opening{partition=0}=51.348875ms, time_elapsed_scanning_until_data{partition=0}=179.834µs, time_elapsed_scanning_total{partition=0}=223.75µs, time_elapsed_processing{partition=0}=46.345583ms, file_open_errors{partition=0}=0, file_scan_errors{partition=0}=0, start_timestamp{partition=0}=2025-07-05 14:32:15.489561 UTC, end_timestamp{partition=0}=2025-07-05 14:32:15.541149 UTC, elapsed_compute{partition=0}=NOT RECORDED, output_rows{partition=0}=5, predicate_evaluation_errors{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=0, row_groups_matched_bloom_filter{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=1, row_groups_pruned_bloom_filter{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=1, row_groups_matched_statistics{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=2, row_groups_pruned_statistics{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=1, bytes_scanned{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=0, pushdown_rows_pruned{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=0, pushdown_rows_matched{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=0, row_pushdown_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=NOT RECORDED, statistics_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=59.625µs, bloom_filter_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=46.298417ms, page_index_rows_pruned{partition=0, 
filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=0, page_index_rows_matched{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=5, page_index_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=46.333µs, metadata_load_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=4.802583ms, files_ranges_pruned_statistics{partition=0}=0, predicate_evaluation_errors{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=0, row_groups_matched_bloom_filter{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=0, row_groups_pruned_bloom_filter{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=0, row_groups_matched_statistics{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=0, row_groups_pruned_statistics{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=0, bytes_scanned{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=2098163, pushdown_rows_pruned{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=0, pushdown_rows_matched{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=0, row_pushdown_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=NOT RECORDED, statistics_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=NOT RECORDED, bloom_filter_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=NOT RECORDED, 
page_index_rows_pruned{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=0, page_index_rows_matched{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=0, page_index_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=NOT RECORDED, metadata_load_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruningn7W55n.parquet}=NOT RECORDED, files_ranges_pruned_statistics{partition=0}=0, num_predicate_creation_errors=0
  left: 2
 right: 1
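
For anyone reading the metrics above, here is a minimal sketch of how min/max statistics pruning followed by a bloom-filter-style check could narrow down row groups for this query. It is not DataFusion's implementation. It assumes the 15 input rows are written as three row groups of five rows each, in the order shown in the table, and it uses an exact set as a stand-in for a bloom filter (a real bloom filter can report false positives). The names `ROW_GROUPS`, `stats_may_match`, `membership_may_match`, and `prune` are hypothetical.

```python
# Illustrative sketch only; not DataFusion's pruning code.
# Assumption: three row groups of five rows each, in table order.
ROW_GROUPS = {
    "all frontends": [b"frontend one", b"frontend two", b"frontend three",
                      b"frontend seven", b"frontend five"],
    "mixed": [b"frontend six", b"frontend four", b"backend one",
              b"backend two", b"backend three"],
    "all backends": [b"backend four", b"backend five", b"backend six",
                     b"backend seven", b"backend eight"],
}


def stats_may_match(values, needle):
    """Min/max check: True if `needle` lies within [min, max] of the row group."""
    return min(values) <= needle <= max(values)


def membership_may_match(values, needle):
    """Stand-in for a bloom filter: an exact set, so it never gives false positives."""
    return needle in set(values)


def prune(needle):
    """Return the row groups surviving the statistics check, then the membership check."""
    after_stats = [name for name, vals in ROW_GROUPS.items()
                   if stats_may_match(vals, needle)]
    after_bloom = [name for name in after_stats
                   if membership_may_match(ROW_GROUPS[name], needle)]
    return after_stats, after_bloom


after_stats, after_bloom = prune(b"backend one")
print(after_stats)  # row groups kept by min/max statistics
print(after_bloom)  # row groups kept after the membership check
```

Under these assumptions, statistics keep two row groups and prune one, and the membership check then keeps one and prunes one. That lines up with row_groups_matched_statistics=2, row_groups_pruned_statistics=1, row_groups_matched_bloom_filter=1 and row_groups_pruned_bloom_filter=1 in the first set of metrics.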

@alamb
Contributor Author

alamb commented Jul 7, 2025

ok, I now have a clean run!

@zhuqi-lucas

Thank you @alamb. I am curious about the benchmark results compared with the main branch, because this PR will include apache/arrow-rs#7850.

We have already ported part of that improvement to datafusion, but we will also benefit from the dependency changes on the arrow side, such as the sort phase (the merge compare is the part we have ported), compare, etc.

Could we trigger the benchmarks for this PR? Thanks!

@alamb
Contributor Author

alamb commented Jul 8, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/update_arrow_56.0.0 (76bc1b2) to ebb8e95 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Jul 8, 2025

Could we trigger the benchmarks for this PR? Thanks!

Done!

@alamb
Contributor Author

alamb commented Jul 8, 2025

🤖: Benchmark completed

Details

Comparing HEAD and alamb_update_arrow_56.0.0
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_update_arrow_56.0 ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  1968.78 ms │              1943.30 ms │    no change │
│ QQuery 1     │   671.98 ms │               752.37 ms │ 1.12x slower │
│ QQuery 2     │  1322.34 ms │              1460.81 ms │ 1.10x slower │
│ QQuery 3     │   679.14 ms │               685.65 ms │    no change │
│ QQuery 4     │  1379.83 ms │              1357.79 ms │    no change │
│ QQuery 5     │ 15240.77 ms │             15204.35 ms │    no change │
│ QQuery 6     │  2055.68 ms │              2095.06 ms │    no change │
│ QQuery 7     │  1832.29 ms │              1860.29 ms │    no change │
└──────────────┴─────────────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 25150.80ms │
│ Total Time (alamb_update_arrow_56.0)   │ 25359.62ms │
│ Average Time (HEAD)                    │  3143.85ms │
│ Average Time (alamb_update_arrow_56.0) │  3169.95ms │
│ Queries Faster                         │          0 │
│ Queries Slower                         │          2 │
│ Queries with No Change                 │          6 │
│ Queries with Failure                   │          0 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_update_arrow_56.0 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.50 ms │                 2.22 ms │ +1.13x faster │
│ QQuery 1     │    34.30 ms │                34.87 ms │     no change │
│ QQuery 2     │    80.44 ms │                82.83 ms │     no change │
│ QQuery 3     │    99.02 ms │                98.66 ms │     no change │
│ QQuery 4     │   604.00 ms │               591.53 ms │     no change │
│ QQuery 5     │   865.09 ms │               853.85 ms │     no change │
│ QQuery 6     │     2.22 ms │                 2.35 ms │  1.06x slower │
│ QQuery 7     │    38.66 ms │                39.53 ms │     no change │
│ QQuery 8     │   863.08 ms │               865.34 ms │     no change │
│ QQuery 9     │  1170.02 ms │              1198.50 ms │     no change │
│ QQuery 10    │   256.55 ms │               235.52 ms │ +1.09x faster │
│ QQuery 11    │   282.16 ms │               273.37 ms │     no change │
│ QQuery 12    │   856.42 ms │               879.42 ms │     no change │
│ QQuery 13    │  1251.28 ms │              1228.51 ms │     no change │
│ QQuery 14    │   810.79 ms │               801.54 ms │     no change │
│ QQuery 15    │   792.79 ms │               780.00 ms │     no change │
│ QQuery 16    │  1598.21 ms │              1612.46 ms │     no change │
│ QQuery 17    │  1600.07 ms │              1600.00 ms │     no change │
│ QQuery 18    │  2922.80 ms │              2859.57 ms │     no change │
│ QQuery 19    │    91.60 ms │                88.65 ms │     no change │
│ QQuery 20    │  1156.47 ms │              1178.00 ms │     no change │
│ QQuery 21    │  1291.31 ms │              1323.34 ms │     no change │
│ QQuery 22    │  2100.41 ms │              2199.67 ms │     no change │
│ QQuery 23    │  7358.55 ms │              7637.05 ms │     no change │
│ QQuery 24    │   438.16 ms │               415.34 ms │ +1.05x faster │
│ QQuery 25    │   299.70 ms │               287.79 ms │     no change │
│ QQuery 26    │   446.98 ms │               415.91 ms │ +1.07x faster │
│ QQuery 27    │  1534.24 ms │              1572.63 ms │     no change │
│ QQuery 28    │ 12754.23 ms │             11991.17 ms │ +1.06x faster │
│ QQuery 29    │   537.38 ms │               518.72 ms │     no change │
│ QQuery 30    │   777.26 ms │               778.63 ms │     no change │
│ QQuery 31    │   805.02 ms │               758.54 ms │ +1.06x faster │
│ QQuery 32    │  2408.21 ms │              2360.51 ms │     no change │
│ QQuery 33    │  3161.89 ms │              3169.50 ms │     no change │
│ QQuery 34    │  3181.61 ms │              3197.43 ms │     no change │
│ QQuery 35    │  1255.44 ms │              1256.22 ms │     no change │
│ QQuery 36    │   119.55 ms │               119.48 ms │     no change │
│ QQuery 37    │    49.89 ms │                50.74 ms │     no change │
│ QQuery 38    │   120.47 ms │               121.13 ms │     no change │
│ QQuery 39    │   199.39 ms │               200.24 ms │     no change │
│ QQuery 40    │    41.00 ms │                41.43 ms │     no change │
│ QQuery 41    │    38.10 ms │                38.31 ms │     no change │
│ QQuery 42    │    32.07 ms │                31.78 ms │     no change │
└──────────────┴─────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 54329.35ms │
│ Total Time (alamb_update_arrow_56.0)   │ 53792.25ms │
│ Average Time (HEAD)                    │  1263.47ms │
│ Average Time (alamb_update_arrow_56.0) │  1250.98ms │
│ Queries Faster                         │          6 │
│ Queries Slower                         │          1 │
│ Queries with No Change                 │         36 │
│ Queries with Failure                   │          0 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ alamb_update_arrow_56.0 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  98.26 ms │                95.79 ms │     no change │
│ QQuery 2     │  19.98 ms │                20.60 ms │     no change │
│ QQuery 3     │  32.60 ms │                32.88 ms │     no change │
│ QQuery 4     │  18.59 ms │                18.52 ms │     no change │
│ QQuery 5     │  48.91 ms │                49.15 ms │     no change │
│ QQuery 6     │  11.62 ms │                11.80 ms │     no change │
│ QQuery 7     │  90.21 ms │                87.97 ms │     no change │
│ QQuery 8     │  23.40 ms │                25.01 ms │  1.07x slower │
│ QQuery 9     │  53.14 ms │                54.06 ms │     no change │
│ QQuery 10    │  42.78 ms │                40.78 ms │     no change │
│ QQuery 11    │  11.41 ms │                11.28 ms │     no change │
│ QQuery 12    │  34.36 ms │                31.87 ms │ +1.08x faster │
│ QQuery 13    │  26.55 ms │                26.12 ms │     no change │
│ QQuery 14    │   9.73 ms │                 9.52 ms │     no change │
│ QQuery 15    │  18.69 ms │                18.55 ms │     no change │
│ QQuery 16    │  17.99 ms │                18.22 ms │     no change │
│ QQuery 17    │  96.57 ms │                95.77 ms │     no change │
│ QQuery 18    │ 186.92 ms │               198.21 ms │  1.06x slower │
│ QQuery 19    │  24.15 ms │                23.11 ms │     no change │
│ QQuery 20    │  31.03 ms │                31.32 ms │     no change │
│ QQuery 21    │ 142.83 ms │               144.34 ms │     no change │
│ QQuery 22    │  14.82 ms │                14.18 ms │     no change │
└──────────────┴───────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 1054.52ms │
│ Total Time (alamb_update_arrow_56.0)   │ 1059.05ms │
│ Average Time (HEAD)                    │   47.93ms │
│ Average Time (alamb_update_arrow_56.0) │   48.14ms │
│ Queries Faster                         │         1 │
│ Queries Slower                         │         2 │
│ Queries with No Change                 │        19 │
│ Queries with Failure                   │         0 │
└────────────────────────────────────────┴───────────┘
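
As a reading aid for the "Change" column, here is a minimal sketch of how such a label can be derived from the two timings. The 5% noise threshold is an assumption inferred from the rows above (for example, 438.16 ms to 415.34 ms is reported as faster while 299.70 ms to 287.79 ms is "no change"); the function name `change_label` is hypothetical and this is not claimed to be the benchmark script's own code.

```python
# Sketch under stated assumptions; the threshold is inferred from the tables,
# not taken from the benchmark tooling.
NOISE_THRESHOLD = 0.05


def change_label(baseline_ms, candidate_ms, threshold=NOISE_THRESHOLD):
    """Classify a candidate timing against a baseline timing."""
    ratio = candidate_ms / baseline_ms
    if 1.0 - threshold <= ratio <= 1.0 + threshold:
        return "no change"
    if ratio < 1.0:
        # Candidate is faster: report how many times faster it is.
        return f"+{1.0 / ratio:.2f}x faster"
    # Candidate is slower: report how many times slower it is.
    return f"{ratio:.2f}x slower"


# Timings copied from the tables above.
print(change_label(671.98, 752.37))    # clickbench_extended Query 1
print(change_label(2.50, 2.22))        # clickbench_partitioned Query 0
print(change_label(1968.78, 1943.30))  # clickbench_extended Query 0
```

With this threshold the three calls reproduce the labels shown in the tables: "1.12x slower", "+1.13x faster" and "no change".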

@zhuqi-lucas

zhuqi-lucas commented Jul 8, 2025

Thank you @alamb. It seems we have some improvement for clickbench, though not much: most of our gain is in sorting string views, which is exercised by sort_tpch rather than clickbench.

@alamb
Contributor Author

alamb commented Jul 8, 2025

Thank you @alamb. It seems we have some improvement for clickbench, though not much: most of our gain is in sorting string views, which is exercised by sort_tpch rather than clickbench.

I will start those as well

@alamb

This comment was marked as outdated.

@alamb

This comment was marked as outdated.

@zhuqi-lucas

🤖: Benchmark completed

Details

Comparing HEAD and alamb_update_arrow_56.0.0
--------------------
Benchmark sort_tpch1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ alamb_update_arrow_56.0 ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ Q1           │  330.29 ms │               350.13 ms │ 1.06x slower │
│ Q2           │  308.58 ms │               312.59 ms │    no change │
│ Q3           │ 1051.46 ms │              1076.53 ms │    no change │
│ Q4           │  444.92 ms │               462.29 ms │    no change │
│ Q5           │  410.39 ms │               412.36 ms │    no change │
│ Q6           │  449.16 ms │               452.76 ms │    no change │
│ Q7           │  783.43 ms │               833.85 ms │ 1.06x slower │
│ Q8           │  696.74 ms │               712.94 ms │    no change │
│ Q9           │  732.56 ms │               732.64 ms │    no change │
│ Q10          │ 1077.29 ms │              1090.28 ms │    no change │
│ Q11          │  547.35 ms │               563.97 ms │    no change │
└──────────────┴────────────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 6832.16ms │
│ Total Time (alamb_update_arrow_56.0)   │ 7000.33ms │
│ Average Time (HEAD)                    │  621.11ms │
│ Average Time (alamb_update_arrow_56.0) │  636.39ms │
│ Queries Faster                         │         0 │
│ Queries Slower                         │         2 │
│ Queries with No Change                 │         9 │
│ Queries with Failure                   │         0 │
└────────────────────────────────────────┴───────────┘

Thank you @alamb. It seems the updated sort dependencies bring no improvement here (which suggests the merge takes most of the time), so most of our gain comes from the PRs already ported to the merge phase:

#16509
#16630

@alamb

This comment was marked as outdated.

@alamb

This comment was marked as outdated.

@zhuqi-lucas

zhuqi-lucas commented Jul 9, 2025

Thank you @alamb @Dandandan. We could also try the sort_tpch10 benchmark, but it may not show much improvement either; the ported PR is already 1.4x faster for sort_tpch Q11 (inlined string view sort).

@alamb
Contributor Author

alamb commented Jul 30, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/update_arrow_56.0.0 (0a0c7d7) to cca9d4c diff using: sort_tpch
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Jul 30, 2025

🤖: Benchmark completed

Details

Comparing HEAD and alamb_update_arrow_56.0.0
--------------------
Benchmark sort_tpch1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ alamb_update_arrow_56.0 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  349.61 ms │               323.45 ms │ +1.08x faster │
│ Q2           │  308.94 ms │               317.36 ms │     no change │
│ Q3           │ 1112.50 ms │              1014.45 ms │ +1.10x faster │
│ Q4           │  437.16 ms │               431.15 ms │     no change │
│ Q5           │  418.20 ms │               416.31 ms │     no change │
│ Q6           │  459.93 ms │               453.85 ms │     no change │
│ Q7           │  817.52 ms │               817.33 ms │     no change │
│ Q8           │  705.87 ms │               698.17 ms │     no change │
│ Q9           │  734.30 ms │               724.44 ms │     no change │
│ Q10          │ 1084.13 ms │              1057.24 ms │     no change │
│ Q11          │  569.35 ms │               552.48 ms │     no change │
└──────────────┴────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 6997.49ms │
│ Total Time (alamb_update_arrow_56.0)   │ 6806.22ms │
│ Average Time (HEAD)                    │  636.14ms │
│ Average Time (alamb_update_arrow_56.0) │  618.75ms │
│ Queries Faster                         │         2 │
│ Queries Slower                         │         0 │
│ Queries with No Change                 │         9 │
│ Queries with Failure                   │         0 │
└────────────────────────────────────────┴───────────┘
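For anyone reading the Change column above: it appears to be the ratio of the HEAD time to the branch time, with small differences reported as "no change". The following is a minimal Rust sketch of that classification, not the benchmark script's actual code. The function name `describe_change` and the 5% noise threshold are assumptions; the threshold is only consistent with the rows above, where ratios up to about 1.03 read "no change" and 1.08 reads "faster". The wording of the "slower" label is also a guess, since no slower row appears in the table.

```rust
/// Assumed noise band: differences within 5% are treated as "no change".
/// This value is inferred from the table, not read from the script.
const NOISE_THRESHOLD: f64 = 0.05;

/// Describe how the candidate time compares to the baseline time.
/// Non-positive timings (for example a failed run) are "incomparable".
fn describe_change(baseline_ms: f64, candidate_ms: f64) -> String {
    if baseline_ms <= 0.0 || candidate_ms <= 0.0 {
        return "incomparable".to_string();
    }
    // ratio > 1 means the candidate finished in less time than the baseline
    let ratio = baseline_ms / candidate_ms;
    if ratio > 1.0 + NOISE_THRESHOLD {
        format!("+{:.2}x faster", ratio)
    } else if ratio < 1.0 - NOISE_THRESHOLD {
        // Label format for regressions is hypothetical
        format!("{:.2}x slower", 1.0 / ratio)
    } else {
        "no change".to_string()
    }
}

fn main() {
    // (query, HEAD ms, branch ms) taken from the sort_tpch table above
    let rows = [
        ("Q1", 349.61, 323.45),
        ("Q2", 308.94, 317.36),
        ("Q3", 1112.50, 1014.45),
        ("Q11", 569.35, 552.48),
    ];
    for (query, head_ms, branch_ms) in rows {
        println!("{query}: {}", describe_change(head_ms, branch_ms));
    }
}
```

With these inputs Q1 and Q3 come out as faster and Q2 and Q11 as no change, matching the table.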

@alamb
Contributor Author

alamb commented Jul 30, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/update_arrow_56.0.0 (0a0c7d7) to cca9d4c diff using: sort_tpch10
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Jul 30, 2025

🤖: Benchmark completed

Details

Comparing HEAD and alamb_update_arrow_56.0.0
--------------------
Benchmark sort_tpch10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃ HEAD ┃ alamb_update_arrow_56.0 ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ Q1           │ FAIL │                    FAIL │ incomparable │
│ Q2           │ FAIL │                    FAIL │ incomparable │
│ Q3           │ FAIL │                    FAIL │ incomparable │
│ Q4           │ FAIL │                    FAIL │ incomparable │
│ Q5           │ FAIL │                    FAIL │ incomparable │
│ Q6           │ FAIL │                    FAIL │ incomparable │
│ Q7           │ FAIL │                    FAIL │ incomparable │
│ Q8           │ FAIL │                    FAIL │ incomparable │
│ Q9           │ FAIL │                    FAIL │ incomparable │
│ Q10          │ FAIL │                    FAIL │ incomparable │
│ Q11          │ FAIL │                    FAIL │ incomparable │
└──────────────┴──────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ Benchmark Summary                      ┃        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ Total Time (HEAD)                      │ 0.00ms │
│ Total Time (alamb_update_arrow_56.0)   │ 0.00ms │
│ Average Time (HEAD)                    │ 0.00ms │
│ Average Time (alamb_update_arrow_56.0) │ 0.00ms │
│ Queries Faster                         │      0 │
│ Queries Slower                         │      0 │
│ Queries with No Change                 │      0 │
│ Queries with Failure                   │     11 │
└────────────────────────────────────────┴────────┘

@zhuqi-lucas

zhuqi-lucas commented Jul 31, 2025

Thank you @alamb, the result is very good. We get further sort improvements beyond the code already ported into datafusion: even Q1, which is not a StringView sort, benefits from the new partition null validation, and Q3 benefits from the new GC for large string sizes. The sort_tpch10 benchmark seems broken, though. 🤔

@alamb
Contributor Author

alamb commented Jul 31, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/update_arrow_56.0.0 (0a0c7d7) to cca9d4c diff using: sort_tpch10
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Jul 31, 2025

🤖: Benchmark completed

Details

Comparing HEAD and alamb_update_arrow_56.0.0
--------------------
Benchmark sort_tpch10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃ HEAD ┃ alamb_update_arrow_56.0 ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ Q1           │ FAIL │                    FAIL │ incomparable │
│ Q2           │ FAIL │                    FAIL │ incomparable │
│ Q3           │ FAIL │                    FAIL │ incomparable │
│ Q4           │ FAIL │                    FAIL │ incomparable │
│ Q5           │ FAIL │                    FAIL │ incomparable │
│ Q6           │ FAIL │                    FAIL │ incomparable │
│ Q7           │ FAIL │                    FAIL │ incomparable │
│ Q8           │ FAIL │                    FAIL │ incomparable │
│ Q9           │ FAIL │                    FAIL │ incomparable │
│ Q10          │ FAIL │                    FAIL │ incomparable │
│ Q11          │ FAIL │                    FAIL │ incomparable │
└──────────────┴──────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ Benchmark Summary                      ┃        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ Total Time (HEAD)                      │ 0.00ms │
│ Total Time (alamb_update_arrow_56.0)   │ 0.00ms │
│ Average Time (HEAD)                    │ 0.00ms │
│ Average Time (alamb_update_arrow_56.0) │ 0.00ms │
│ Queries Faster                         │      0 │
│ Queries Slower                         │      0 │
│ Queries with No Change                 │      0 │
│ Queries with Failure                   │     11 │
└────────────────────────────────────────┴────────┘

@alamb alamb changed the title from "DRAFT: Update arrow/parquet to 56.0.0" to "Upgrade arrow/parquet to 56.0.0" Aug 1, 2025
@alamb alamb force-pushed the alamb/update_arrow_56.0.0 branch from 0a0c7d7 to 60b8577 Compare August 1, 2025 19:48
@alamb alamb marked this pull request as ready for review August 2, 2025 10:26
@alamb
Contributor Author

alamb commented Aug 2, 2025

This one is now ready for review

/// default parquet writer setting
/// max_statistics_size is deprecated, currently it is not being used
// TODO: remove once deprecated
#[deprecated(since = "45.0.0", note = "Setting does not do anything")]
Contributor Author

these were removed from the underlying parquet library as well

@@ -2192,6 +2191,16 @@ impl ScalarValue {
}

let array: ArrayRef = match &data_type {
DataType::Decimal32(_precision, _scale) => {
Contributor Author

Decimal32 and Decimal64 types were added to arrow
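As a rough illustration of what the narrower decimal types change: each storage width caps the precision (number of decimal digits) it can always represent, so code that matches on decimal types needs an arm per width. The sketch below is stdlib-only and uses a hypothetical `DecimalWidth` enum and a hypothetical `smallest_width_for` helper instead of arrow's `DataType`; it is not how `ScalarValue` is implemented. The digit limits are derived from the signed integer width, which gives 9, 18, 38 and 76 digits for 32, 64, 128 and 256 bits.

```rust
/// Hypothetical stand-in for the decimal storage widths; not an arrow type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DecimalWidth {
    Bits32,
    Bits64,
    Bits128,
    Bits256,
}

impl DecimalWidth {
    fn bits(self) -> u32 {
        match self {
            DecimalWidth::Bits32 => 32,
            DecimalWidth::Bits64 => 64,
            DecimalWidth::Bits128 => 128,
            DecimalWidth::Bits256 => 256,
        }
    }

    /// Number of decimal digits that always fit in a signed integer of this
    /// width: floor((bits - 1) * log10(2)).
    fn max_precision(self) -> u32 {
        ((self.bits() - 1) as f64 * 2f64.log10()).floor() as u32
    }
}

/// Pick the narrowest width able to hold `precision` digits (illustrative only).
fn smallest_width_for(precision: u32) -> Option<DecimalWidth> {
    [
        DecimalWidth::Bits32,
        DecimalWidth::Bits64,
        DecimalWidth::Bits128,
        DecimalWidth::Bits256,
    ]
    .into_iter()
    .find(|width| precision >= 1 && precision <= width.max_precision())
}

fn main() {
    for precision in [5, 10, 30, 50, 80] {
        println!("precision {precision}: {:?}", smallest_width_for(precision));
    }
}
```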

@zhuqi-lucas zhuqi-lucas left a comment

LGTM, thank you @alamb! The failed CI seems unrelated to this PR.

@adriangb
Contributor

adriangb commented Aug 4, 2025

I merged in main again, which will also re-run the tests. I agree, @zhuqi-lucas, that the CI failure looks like a flake totally unrelated to this PR.

@alamb
Contributor Author

alamb commented Aug 4, 2025

Thanks @adriangb and @zhuqi-lucas !

@adriangb adriangb merged commit fa1f8c1 into apache:main Aug 4, 2025
30 checks passed
@alamb alamb deleted the alamb/update_arrow_56.0.0 branch August 4, 2025 18:26
@alamb
Contributor Author

alamb commented Aug 4, 2025

🚀

@adriangb
Contributor

adriangb commented Aug 4, 2025

This seems to have caused some minor failures in main: https://github.com/apache/datafusion/actions/runs/16730613287/job/47357522119

@alamb
Contributor Author

alamb commented Aug 4, 2025

> This seems to have caused some minor failures in main: https://github.com/apache/datafusion/actions/runs/16730613287/job/47357522119

There were some timeouts reported that may be related:

driver stderr:
[1754331308.317][SEVERE]: Timed out receiving message from renderer: 300.000
[1754331308.615][SEVERE]: Timed out receiving message from renderer: 300.000

Error: failed to find element reference in response
error: test failed, to rerun pass --lib

I restarted the test -- hopefully it will pass the second time

Labels
common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate documentation Improvements or additions to documentation logical-expr Logical plan and expressions proto Related to proto crate sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate

4 participants