zhuqi-lucas
Contributor

…default (set metadata_size_hint)

Which issue does this PR close?

Rationale for this change

Reduce number of object store requests when reading parquet files by default (set metadata_size_hint)

What changes are included in this PR?

        /// Default setting to 512 KB, which should be sufficient for most parquet files,
        /// it can reduce one I/O operation per parquet file. If the metadata is larger than
        /// the hint, two reads will still be performed.
        pub metadata_size_hint: Option<usize>, default = Some(512 * 1024)
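
As a rough illustration of the doc comment above, the following sketch plans which byte ranges a reader would request with and without a size hint. It assumes the standard Parquet layout, where the file ends with a 4-byte little-endian metadata length followed by the `PAR1` magic. The helper `metadata_read_ranges` is hypothetical and is not part of DataFusion or the parquet crate; a real reader may choose its second range differently.

```rust
use std::ops::Range;

/// Fixed Parquet footer: 4-byte little-endian metadata length + b"PAR1".
const FOOTER_SIZE: u64 = 8;

/// Hypothetical helper (not a DataFusion API): the byte ranges requested to
/// load the metadata of a file of `file_size` bytes whose encoded metadata is
/// `metadata_len` bytes long. Assumes `file_size >= metadata_len + 8`.
fn metadata_read_ranges(
    file_size: u64,
    metadata_len: u64,
    hint: Option<u64>,
) -> Vec<Range<u64>> {
    // First read: the last `hint` bytes, or only the 8-byte footer when no
    // hint is set. Never read more than the file contains.
    let first_len = hint.unwrap_or(FOOTER_SIZE).max(FOOTER_SIZE).min(file_size);
    let first = (file_size - first_len)..file_size;

    let needed = metadata_len + FOOTER_SIZE;
    if needed <= first_len {
        // The suffix already contains the footer and the whole metadata.
        vec![first]
    } else {
        // The hint was too small: a second read is needed. This sketch
        // re-reads the whole metadata region for simplicity.
        let metadata_start = file_size - needed;
        vec![first, metadata_start..(file_size - FOOTER_SIZE)]
    }
}
```

For a 1,000,000-byte file with 10,000 bytes of metadata, this yields two ranges with no hint and one range with a 512 KB hint. With 600,000 bytes of metadata the 512 KB hint is too small and two ranges are requested again.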

Are these changes tested?

Yes

Are there any user-facing changes?

No

@zhuqi-lucas zhuqi-lucas requested a review from alamb October 20, 2025 03:20
@github-actions github-actions bot added the documentation, sqllogictest, and common labels Oct 20, 2025
@zhuqi-lucas zhuqi-lucas requested a review from xudong963 October 20, 2025 03:24
@github-actions github-actions bot added the core label Oct 20, 2025
Contributor

@adriangb adriangb left a comment

Approving on my end, but let's wait to see what one more reviewer thinks. I may not be representing all viewpoints on this.

Comment on lines 623 to 626
- pub metadata_size_hint: Option<usize>, default = None
+ /// Default setting to 512 KB, which should be sufficient for most parquet files,
+ /// it can reduce one I/O operation per parquet file. If the metadata is larger than
+ /// the hint, two reads will still be performed.
+ pub metadata_size_hint: Option<usize>, default = Some(512 * 1024)
Contributor

FWIW having some prefetch on as a default makes a ton of sense to me. I'd like to run the benchmarks to make sure it doesn't have a big impact. I'd guess no positive or negative impact, since benchmarks run against local disk.

Contributor Author

FWIW having some prefetch on as a default makes a ton of sense to me. I'd like to run the benchmarks to make sure it doesn't have a big impact. I'd guess no positive or negative impact, since benchmarks run against local disk.

Thank you @adriangb for the review. I agree, it should not affect local disk performance.

Contributor Author

cc @alamb to double-check.

Contributor

Will kick off benchmarks.

I think the potential downside of this approach is that it will make larger requests to the object store / local disk by default and use slightly more memory for small files (it will always fetch and buffer 512K, even if the actual footer is much smaller).
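
To put a number on that downside, here is a small back-of-the-envelope sketch of how many bytes are requested to load one file's metadata. The function name and the example sizes are assumptions made for illustration, not measurements from this PR. It also assumes that a file smaller than the hint is simply read in full.

```rust
/// Fixed Parquet footer: 4-byte metadata length + 4-byte magic.
const FOOTER_SIZE: u64 = 8;

/// Hypothetical estimate of the bytes requested to load one file's metadata.
/// Assumes `file_size >= metadata_len + 8`.
fn metadata_bytes_fetched(file_size: u64, metadata_len: u64, hint: Option<u64>) -> u64 {
    let needed = metadata_len + FOOTER_SIZE;
    match hint {
        // No hint: an 8-byte footer read followed by an exact metadata read.
        None => needed,
        Some(h) => {
            // With a hint, the first read always takes the last `h` bytes
            // (capped at the file size), however small the metadata is.
            let first = h.max(FOOTER_SIZE).min(file_size);
            if needed <= first {
                first
            } else {
                // Hint too small: the metadata is fetched in a second read.
                first + metadata_len
            }
        }
    }
}
```

Under these assumptions, a 10 MiB file with 2 KiB of metadata costs 2,056 bytes without a hint and 524,288 bytes with the 512 KB hint. A 100 KiB file is read in full (102,400 bytes), so the extra memory is bounded by the file size.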

Contributor Author

@zhuqi-lucas zhuqi-lucas Oct 21, 2025

Thanks, I agree this is a downside if we have many small files.

will kick off benchmarks.

I think the potential downside of this approach is that it will make larger requests to objectstore / local disk by default and use slightly more memory for small files (it will always fetch / buffer 512K even if the actual footer is much smaller)

@alamb
Contributor

alamb commented Oct 21, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue_18118 (d7172b5) to 35b2e35 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb alamb changed the title from "Reduce number of object store requests when reading parquet files by …" to "Change default prefetch_hint to 512Kb to reduce number of object store requests when reading parquet files" Oct 21, 2025
Contributor

@alamb alamb left a comment

I think this change makes sense, but we should do a few more things (I can help with this):

  1. Add a note in the upgrade guide saying the behavior changed.
  2. Add some end-to-end tests that show the actual object store calls when reading parquet (something similar to #18112 (comment)), and then test with the defaults as well as with a changed prefetch_hint.
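
The kind of test described in item 2 could look roughly like the sketch below. It is self-contained and does not use the real object store or DataFusion APIs: `CountingStore`, `fake_parquet`, and `load_metadata` are all hypothetical stand-ins that only mimic the footer layout (metadata bytes, 4-byte little-endian length, `PAR1` magic) and record each range request.

```rust
use std::cell::RefCell;
use std::ops::Range;

/// Hypothetical in-memory stand-in for an object store that records
/// every range request so a test can assert on them.
struct CountingStore {
    data: Vec<u8>,
    requests: RefCell<Vec<Range<usize>>>,
}

impl CountingStore {
    fn new(data: Vec<u8>) -> Self {
        Self { data, requests: RefCell::new(Vec::new()) }
    }

    fn get_range(&self, range: Range<usize>) -> &[u8] {
        self.requests.borrow_mut().push(range.clone());
        &self.data[range]
    }

    fn request_count(&self) -> usize {
        self.requests.borrow().len()
    }
}

/// Builds a fake file: filler body, filler metadata, then the 8-byte footer.
fn fake_parquet(body_len: usize, metadata_len: usize) -> Vec<u8> {
    let mut file = vec![0u8; body_len + metadata_len];
    file.extend_from_slice(&(metadata_len as u32).to_le_bytes());
    file.extend_from_slice(b"PAR1");
    file
}

/// Loads the raw metadata bytes, using `hint` as the size of the first read.
fn load_metadata(store: &CountingStore, hint: Option<usize>) -> Vec<u8> {
    let file_size = store.data.len();
    let suffix_len = hint.unwrap_or(8).max(8).min(file_size);
    let suffix = store.get_range(file_size - suffix_len..file_size);

    // The last 8 bytes are the footer: metadata length + magic.
    let footer = &suffix[suffix_len - 8..];
    assert_eq!(&footer[4..], b"PAR1");
    let metadata_len =
        u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]) as usize;

    if metadata_len + 8 <= suffix_len {
        // Metadata was already covered by the first read.
        suffix[suffix_len - 8 - metadata_len..suffix_len - 8].to_vec()
    } else {
        // Hint too small (or absent): issue a second request.
        let start = file_size - 8 - metadata_len;
        store.get_range(start..file_size - 8).to_vec()
    }
}
```

A test built on this would load the same fake file once with `None` and once with `Some(512 * 1024)`, then assert that `request_count()` is 2 and 1 respectively.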

@alamb
Contributor

alamb commented Oct 21, 2025

🤖: Benchmark completed

Details

Comparing HEAD and issue_18118
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ issue_18118 ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2660.79 ms │  2663.08 ms │ no change │
│ QQuery 1     │  1278.19 ms │  1270.35 ms │ no change │
│ QQuery 2     │  2378.56 ms │  2432.55 ms │ no change │
│ QQuery 3     │  1183.79 ms │  1147.93 ms │ no change │
│ QQuery 4     │  2257.26 ms │  2222.80 ms │ no change │
│ QQuery 5     │ 27613.76 ms │ 27307.95 ms │ no change │
│ QQuery 6     │  4120.87 ms │  4159.21 ms │ no change │
│ QQuery 7     │  3538.80 ms │  3623.46 ms │ no change │
└──────────────┴─────────────┴─────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 45032.00ms │
│ Total Time (issue_18118)   │ 44827.32ms │
│ Average Time (HEAD)        │  5629.00ms │
│ Average Time (issue_18118) │  5603.42ms │
│ Queries Faster             │          0 │
│ Queries Slower             │          0 │
│ Queries with No Change     │          8 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ issue_18118 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.08 ms │     2.44 ms │  1.17x slower │
│ QQuery 1     │    50.67 ms │    50.74 ms │     no change │
│ QQuery 2     │   137.10 ms │   141.97 ms │     no change │
│ QQuery 3     │   155.37 ms │   163.15 ms │  1.05x slower │
│ QQuery 4     │  1030.04 ms │  1063.64 ms │     no change │
│ QQuery 5     │  1449.14 ms │  1457.47 ms │     no change │
│ QQuery 6     │     2.25 ms │     2.13 ms │ +1.06x faster │
│ QQuery 7     │    53.39 ms │    56.36 ms │  1.06x slower │
│ QQuery 8     │  1420.49 ms │  1450.88 ms │     no change │
│ QQuery 9     │  1786.32 ms │  1799.07 ms │     no change │
│ QQuery 10    │   376.32 ms │   381.83 ms │     no change │
│ QQuery 11    │   428.31 ms │   434.13 ms │     no change │
│ QQuery 12    │  1327.53 ms │  1373.14 ms │     no change │
│ QQuery 13    │  2078.57 ms │  2173.64 ms │     no change │
│ QQuery 14    │  1244.16 ms │  1278.07 ms │     no change │
│ QQuery 15    │  1200.95 ms │  1201.58 ms │     no change │
│ QQuery 16    │  2597.14 ms │  2642.62 ms │     no change │
│ QQuery 17    │  2631.81 ms │  2654.99 ms │     no change │
│ QQuery 18    │  5589.64 ms │  4913.85 ms │ +1.14x faster │
│ QQuery 19    │   128.04 ms │   128.14 ms │     no change │
│ QQuery 20    │  1962.48 ms │  2009.32 ms │     no change │
│ QQuery 21    │  2298.54 ms │  2302.72 ms │     no change │
│ QQuery 22    │  9232.14 ms │  3876.95 ms │ +2.38x faster │
│ QQuery 23    │ 23385.33 ms │ 12558.52 ms │ +1.86x faster │
│ QQuery 24    │   223.35 ms │   202.93 ms │ +1.10x faster │
│ QQuery 25    │   483.34 ms │   502.14 ms │     no change │
│ QQuery 26    │   226.79 ms │   213.12 ms │ +1.06x faster │
│ QQuery 27    │  2895.97 ms │  2790.24 ms │     no change │
│ QQuery 28    │ 24237.75 ms │ 24021.54 ms │     no change │
│ QQuery 29    │   997.85 ms │   987.09 ms │     no change │
│ QQuery 30    │  1319.34 ms │  1305.97 ms │     no change │
│ QQuery 31    │  1331.19 ms │  1327.00 ms │     no change │
│ QQuery 32    │  5025.88 ms │  4718.44 ms │ +1.07x faster │
│ QQuery 33    │  5870.75 ms │  5730.65 ms │     no change │
│ QQuery 34    │  6056.12 ms │  5917.27 ms │     no change │
│ QQuery 35    │  1988.03 ms │  1998.50 ms │     no change │
│ QQuery 36    │   120.72 ms │   120.39 ms │     no change │
│ QQuery 37    │    51.24 ms │    51.21 ms │     no change │
│ QQuery 38    │   121.32 ms │   120.68 ms │     no change │
│ QQuery 39    │   194.35 ms │   196.77 ms │     no change │
│ QQuery 40    │    42.17 ms │    41.52 ms │     no change │
│ QQuery 41    │    40.17 ms │    38.57 ms │     no change │
│ QQuery 42    │    32.91 ms │    31.74 ms │     no change │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 111827.04ms │
│ Total Time (issue_18118)   │  94433.10ms │
│ Average Time (HEAD)        │   2600.63ms │
│ Average Time (issue_18118) │   2196.12ms │
│ Queries Faster             │           7 │
│ Queries Slower             │           3 │
│ Queries with No Change     │          33 │
│ Queries with Failure       │           0 │
└────────────────────────────┴─────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ issue_18118 ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 172.88 ms │   171.01 ms │    no change │
│ QQuery 2     │  25.00 ms │    26.96 ms │ 1.08x slower │
│ QQuery 3     │  40.22 ms │    39.88 ms │    no change │
│ QQuery 4     │  27.54 ms │    28.01 ms │    no change │
│ QQuery 5     │  76.90 ms │    77.15 ms │    no change │
│ QQuery 6     │  19.43 ms │    19.33 ms │    no change │
│ QQuery 7     │ 211.91 ms │   214.86 ms │    no change │
│ QQuery 8     │  31.48 ms │    32.67 ms │    no change │
│ QQuery 9     │  97.45 ms │   103.98 ms │ 1.07x slower │
│ QQuery 10    │  60.06 ms │    59.55 ms │    no change │
│ QQuery 11    │  16.72 ms │    16.17 ms │    no change │
│ QQuery 12    │  50.27 ms │    51.00 ms │    no change │
│ QQuery 13    │  45.02 ms │    46.07 ms │    no change │
│ QQuery 14    │  13.30 ms │    13.27 ms │    no change │
│ QQuery 15    │  24.08 ms │    24.37 ms │    no change │
│ QQuery 16    │  23.98 ms │    24.59 ms │    no change │
│ QQuery 17    │ 144.63 ms │   150.76 ms │    no change │
│ QQuery 18    │ 321.21 ms │   321.15 ms │    no change │
│ QQuery 19    │  36.44 ms │    36.29 ms │    no change │
│ QQuery 20    │  46.61 ms │    47.90 ms │    no change │
│ QQuery 21    │ 325.99 ms │   333.34 ms │    no change │
│ QQuery 22    │  20.75 ms │    20.26 ms │    no change │
└──────────────┴───────────┴─────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary          ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 1831.87ms │
│ Total Time (issue_18118)   │ 1858.56ms │
│ Average Time (HEAD)        │   83.27ms │
│ Average Time (issue_18118) │   84.48ms │
│ Queries Faster             │         0 │
│ Queries Slower             │         2 │
│ Queries with No Change     │        20 │
│ Queries with Failure       │         0 │
└────────────────────────────┴───────────┘
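
For reading the "Change" column above, the following sketch reproduces the labels from the two timings. The 5% noise threshold is an assumption inferred from the rows in these tables (for example, 137.10 ms to 141.97 ms is reported as "no change" while 155.37 ms to 163.15 ms is "1.05x slower"). It is not taken from the benchmark script itself, and `classify_change` is a hypothetical name.

```rust
/// Assumed noise threshold: ratios within 5% of 1.0 count as "no change".
const NOISE_THRESHOLD: f64 = 0.05;

/// Hypothetical reimplementation of the "Change" column.
/// `baseline_ms` is the HEAD timing, `comparison_ms` the branch timing.
fn classify_change(baseline_ms: f64, comparison_ms: f64) -> String {
    let change = comparison_ms / baseline_ms;
    if ((1.0 - NOISE_THRESHOLD)..=(1.0 + NOISE_THRESHOLD)).contains(&change) {
        "no change".to_string()
    } else if change < 1.0 {
        // Branch is faster: report how many times faster than the baseline.
        format!("+{:.2}x faster", 1.0 / change)
    } else {
        // Branch is slower: report the slowdown factor.
        format!("{:.2}x slower", change)
    }
}
```

Under this assumption, `classify_change(9232.14, 3876.95)` gives "+2.38x faster" and `classify_change(2.08, 2.44)` gives "1.17x slower", matching the clickbench_partitioned rows for Q22 and Q0. The Q0 case shows how a sub-millisecond difference on a 2 ms query can still be flagged.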

@zhuqi-lucas
Contributor Author

🤖: Benchmark completed

Details

Interesting, it improved a lot in some cases, even on local disk.

@alamb
Contributor

alamb commented Oct 21, 2025

🤖: Benchmark completed
Details

Interesting, it improved a lot in some cases, even on local disk.

Yeah, that is unexpected. Let me rerun the benchmark and see if I can reproduce it locally.

@alamb
Contributor

alamb commented Oct 21, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue_18118 (d7172b5) to 35b2e35 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Contributor

alamb commented Oct 21, 2025

🤖: Benchmark completed

Details

Comparing HEAD and issue_18118
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ issue_18118 ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2621.53 ms │  2676.97 ms │ no change │
│ QQuery 1     │  1245.30 ms │  1306.64 ms │ no change │
│ QQuery 2     │  2418.54 ms │  2391.75 ms │ no change │
│ QQuery 3     │  1194.39 ms │  1154.42 ms │ no change │
│ QQuery 4     │  2198.57 ms │  2202.82 ms │ no change │
│ QQuery 5     │ 27624.59 ms │ 27556.71 ms │ no change │
│ QQuery 6     │  4126.40 ms │  4129.59 ms │ no change │
│ QQuery 7     │  3383.00 ms │  3311.29 ms │ no change │
└──────────────┴─────────────┴─────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 44812.32ms │
│ Total Time (issue_18118)   │ 44730.18ms │
│ Average Time (HEAD)        │  5601.54ms │
│ Average Time (issue_18118) │  5591.27ms │
│ Queries Faster             │          0 │
│ Queries Slower             │          0 │
│ Queries with No Change     │          8 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ issue_18118 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.20 ms │     2.16 ms │     no change │
│ QQuery 1     │    50.15 ms │    50.46 ms │     no change │
│ QQuery 2     │   133.84 ms │   135.97 ms │     no change │
│ QQuery 3     │   161.81 ms │   164.63 ms │     no change │
│ QQuery 4     │  1151.67 ms │  1044.25 ms │ +1.10x faster │
│ QQuery 5     │  1570.43 ms │  1449.88 ms │ +1.08x faster │
│ QQuery 6     │     2.13 ms │     2.23 ms │     no change │
│ QQuery 7     │    54.71 ms │    55.95 ms │     no change │
│ QQuery 8     │  1513.89 ms │  1435.50 ms │ +1.05x faster │
│ QQuery 9     │  1879.06 ms │  1801.97 ms │     no change │
│ QQuery 10    │   393.90 ms │   388.31 ms │     no change │
│ QQuery 11    │   437.86 ms │   438.10 ms │     no change │
│ QQuery 12    │  1498.20 ms │  1336.94 ms │ +1.12x faster │
│ QQuery 13    │  2207.79 ms │  2119.97 ms │     no change │
│ QQuery 14    │  1341.25 ms │  1250.13 ms │ +1.07x faster │
│ QQuery 15    │  1314.53 ms │  1156.66 ms │ +1.14x faster │
│ QQuery 16    │  2688.71 ms │  2603.43 ms │     no change │
│ QQuery 17    │  2658.89 ms │  2594.40 ms │     no change │
│ QQuery 18    │  5262.96 ms │  4884.43 ms │ +1.08x faster │
│ QQuery 19    │   127.33 ms │   127.93 ms │     no change │
│ QQuery 20    │  1966.44 ms │  1954.50 ms │     no change │
│ QQuery 21    │  2308.92 ms │  2310.14 ms │     no change │
│ QQuery 22    │  3908.92 ms │  3901.61 ms │     no change │
│ QQuery 23    │ 14038.86 ms │ 12566.35 ms │ +1.12x faster │
│ QQuery 24    │   229.12 ms │   206.90 ms │ +1.11x faster │
│ QQuery 25    │   501.17 ms │   502.60 ms │     no change │
│ QQuery 26    │   223.25 ms │   211.17 ms │ +1.06x faster │
│ QQuery 27    │  2894.88 ms │  2816.34 ms │     no change │
│ QQuery 28    │ 24333.00 ms │ 24158.06 ms │     no change │
│ QQuery 29    │   962.24 ms │   969.41 ms │     no change │
│ QQuery 30    │  1390.54 ms │  1290.22 ms │ +1.08x faster │
│ QQuery 31    │  1360.31 ms │  1303.87 ms │     no change │
│ QQuery 32    │  4358.47 ms │  4570.94 ms │     no change │
│ QQuery 33    │  5655.31 ms │  5590.03 ms │     no change │
│ QQuery 34    │  5824.96 ms │  5867.01 ms │     no change │
│ QQuery 35    │  2144.07 ms │  1963.55 ms │ +1.09x faster │
│ QQuery 36    │   123.88 ms │   120.14 ms │     no change │
│ QQuery 37    │    52.65 ms │    52.88 ms │     no change │
│ QQuery 38    │   122.74 ms │   121.33 ms │     no change │
│ QQuery 39    │   194.39 ms │   198.62 ms │     no change │
│ QQuery 40    │    42.62 ms │    41.77 ms │     no change │
│ QQuery 41    │    37.79 ms │    37.69 ms │     no change │
│ QQuery 42    │    32.30 ms │    31.47 ms │     no change │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 97158.13ms │
│ Total Time (issue_18118)   │ 93829.86ms │
│ Average Time (HEAD)        │  2259.49ms │
│ Average Time (issue_18118) │  2182.09ms │
│ Queries Faster             │         12 │
│ Queries Slower             │          0 │
│ Queries with No Change     │         31 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ issue_18118 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 169.82 ms │   168.96 ms │     no change │
│ QQuery 2     │  25.41 ms │    25.01 ms │     no change │
│ QQuery 3     │  41.22 ms │    39.13 ms │ +1.05x faster │
│ QQuery 4     │  27.79 ms │    28.34 ms │     no change │
│ QQuery 5     │  76.28 ms │    76.93 ms │     no change │
│ QQuery 6     │  19.73 ms │    19.36 ms │     no change │
│ QQuery 7     │ 214.22 ms │   203.70 ms │     no change │
│ QQuery 8     │  31.28 ms │    32.27 ms │     no change │
│ QQuery 9     │  96.93 ms │   104.37 ms │  1.08x slower │
│ QQuery 10    │  58.74 ms │    67.98 ms │  1.16x slower │
│ QQuery 11    │  16.09 ms │    19.02 ms │  1.18x slower │
│ QQuery 12    │  49.64 ms │    65.36 ms │  1.32x slower │
│ QQuery 13    │  46.26 ms │    52.72 ms │  1.14x slower │
│ QQuery 14    │  13.43 ms │    14.86 ms │  1.11x slower │
│ QQuery 15    │  23.83 ms │    29.00 ms │  1.22x slower │
│ QQuery 16    │  23.94 ms │    29.01 ms │  1.21x slower │
│ QQuery 17    │ 142.84 ms │   178.68 ms │  1.25x slower │
│ QQuery 18    │ 319.56 ms │   363.75 ms │  1.14x slower │
│ QQuery 19    │  36.28 ms │    47.48 ms │  1.31x slower │
│ QQuery 20    │  47.33 ms │    47.55 ms │     no change │
│ QQuery 21    │ 303.87 ms │   316.45 ms │     no change │
│ QQuery 22    │  20.39 ms │    20.87 ms │     no change │
└──────────────┴───────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary          ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 1804.88ms │
│ Total Time (issue_18118)   │ 1950.79ms │
│ Average Time (HEAD)        │   82.04ms │
│ Average Time (issue_18118) │   88.67ms │
│ Queries Faster             │         1 │
│ Queries Slower             │        11 │
│ Queries with No Change     │        10 │
│ Queries with Failure       │         0 │
└────────────────────────────┴───────────┘

@alamb
Contributor

alamb commented Oct 21, 2025

🤔 Looks like a rough day on the benchmark farm. In the second run, the results for Q22 are the same on both branches, so the large speedup from the first run did not reproduce.

@alamb
Contributor

alamb commented Oct 21, 2025

FYI @BlakeOrth -- I think this should also have a significant effect on the number of object store requests made by default (should reduce the number of requests by one for each file)

@BlakeOrth
Contributor

@alamb Yes, agreed this should be a positive performance improvement on most datasets when using high latency storage, especially since fetching the parquet footer followed by the parquet metadata is a strictly sequential operation for each file.

The benchmark results here are a bit curious and look inconsistent (perhaps for reasons out of everyone's control). However, I wouldn't be too surprised to see minor performance improvements in some local-disk-backed queries. The 8-byte fetch for the parquet footer is smaller than the block size of pretty much any reasonable storage device or file system, so the local disk and file system are probably doing the same amount of work in either case, and this PR eliminates one extra call to disk and any internal runtime scheduling around managing that call.
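
The point about strictly sequential reads can be made concrete with a simple latency model. Everything in this sketch is an assumption for illustration: the function name, the idea that files are processed in fixed-size concurrent waves, and the example numbers. None of it was measured in this PR.

```rust
/// Hypothetical model of the time spent waiting on metadata reads.
/// Reads for one file are sequential, so each file costs
/// `reads_per_file * round_trip_ms`; `concurrency` files are processed at a
/// time. `concurrency` must be greater than zero.
fn metadata_wait_ms(
    files: u64,
    round_trip_ms: u64,
    reads_per_file: u64,
    concurrency: u64,
) -> u64 {
    // Number of waves needed to cover all files (ceiling division).
    let waves = (files + concurrency - 1) / concurrency;
    waves * reads_per_file * round_trip_ms
}
```

With an assumed 1,000 files, a 50 ms round trip, and 16 files in flight at once, two reads per file cost 6,300 ms of waiting while one read per file costs 3,150 ms. On local disk the round trip is tiny, which is consistent with the small and noisy differences in the benchmarks above.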

Labels

common (Related to common crate), core (Core DataFusion crate), documentation (Improvements or additions to documentation), sqllogictest (SQL Logic Tests (.slt))

Development

Successfully merging this pull request may close these issues.

Reduce number of object store requests when reading parquet files by default (set metadata_size_hint)

6 participants