fix(parquet): write single file if option is set #17009

Open
wants to merge 21 commits into
base: main

Conversation

@hknlof hknlof commented Aug 1, 2025

Which issue does this PR close?

#13323

Rationale for this change

DF.write_parquet writes multiple files / one directory even if options.single_file_output is set.

What changes are included in this PR?

Introduce an internal .single extension.

Are these changes tested?

Yes, tests are part of this PR.

Are there any user-facing changes?

Not in this implementation. There might be if we decide to move to a FileSinkConfig-based solution.

Quoting: #13323 (comment)

It seems hard to control the behavior of write_parquet through single_file_output (and I've noticed that it's never used). What really controls whether a single file is generated is the path suffix, which is checked in start_demuxer_task(). There are several ways I can think of to handle this issue:

  1. We can add a suffix like .single to the paths that require generating a single file, and then recognize this suffix in start_demuxer_task().
  2. Give up single_file_output in DataFrameWriteOptions, use FileSinkConfig instead to control single file behavior.
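
A minimal sketch of option 1, using only the standard library. It is not the PR's actual code: `SINGLE_FILE_SUFFIX`, `OutputLayout`, `mark_single_file`, and `resolve_layout` are hypothetical names introduced for illustration, and the fallback rule (a path with a file extension means one file, otherwise a directory) is an assumption based on the quoted comment's remark that the suffix decides the layout.

```rust
use std::path::Path;

// Hypothetical internal marker, modelled on the `.single` extension the PR describes.
const SINGLE_FILE_SUFFIX: &str = ".single";

// Hypothetical type: the layout a demuxer-like step would choose for a path.
#[derive(Debug, PartialEq)]
enum OutputLayout {
    SingleFile(String),
    Directory(String),
}

// Writer side: tag the path when the caller asked for single-file output.
fn mark_single_file(path: &str, single_file_output: bool) -> String {
    if single_file_output {
        format!("{}{}", path, SINGLE_FILE_SUFFIX)
    } else {
        path.to_string()
    }
}

// Demuxer side: recognise the marker, strip it, and otherwise fall back to
// the assumed suffix rule (an extension means a single file).
fn resolve_layout(marked_path: &str) -> OutputLayout {
    if let Some(stripped) = marked_path.strip_suffix(SINGLE_FILE_SUFFIX) {
        OutputLayout::SingleFile(stripped.to_string())
    } else if Path::new(marked_path).extension().is_some() {
        OutputLayout::SingleFile(marked_path.to_string())
    } else {
        OutputLayout::Directory(marked_path.to_string())
    }
}
```

With this shape, `resolve_layout(&mark_single_file("out/data", true))` resolves to a single file at `out/data`, while an unmarked `out/data` still resolves to a directory. The marker never reaches the final file name because it is stripped before the path is used.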

@github-actions github-actions bot added the core, common, and datasource labels Aug 1, 2025
@hknlof hknlof force-pushed the fix/write-single-parquet branch from 1c05032 to f1d58c5 Compare August 4, 2025 14:32
@alamb
Contributor

alamb commented Aug 7, 2025

I restarted the checks

pepijnve and others added 19 commits August 20, 2025 18:48
…ec of the correct size (apache#16995)

* apache#16994 Ensure CooperativeExec#maintains_input_order returns a Vec of the correct size

* apache#16994 Extend default ExecutionPlan invariant checks

Add checks that verify the length of the vectors returned by methods that need to return a value per child.
* fix error result in execute&pre_selection

* fix clippy

* Optimize implementation

* more efficiency impl

* fix CI
)

* Docs: Update the crate configuration / build settings page

* Update docs/source/user-guide/crate-configuration.md

Co-authored-by: Oleks V <[email protected]>

---------

Co-authored-by: Oleks V <[email protected]>
`ScalarValue` can be created from a `&str`, an `Option<&str>`, or a
`String`. `Option<String>` was the missing alternative.
…onfigOptions` on each query (apache#16970)

* Add `ConfigOptions` to ExecutionProps when execution is started

* Add ConfigOptions to ScalarFunctionArgs, refactor AsyncScalarUDF.invoke_async_with_args to remove ConfigOptions arg.

* Updated OptimizerConfig.options() -> Arc<ConfigOptions> to eliminate clone() calls. Fixed an issue with SimplifyExpressions.rewrite(..) not adding config options to execution_props. Added test to verify it works

* Test update.

* Add note in upgrade guide

* Use Arc all the way down

* start_execution -> mark_start_execution

* Update datafusion/expr/src/execution_props.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update comments

* Avoid API breakage via #deprecated

* Update upgrade guide for Arc<ConfigOptions> change

* Apply suggestions from code review

* fmt

---------

Co-authored-by: Andrew Lamb <[email protected]>
Bumps [tokio-util](https://github.com/tokio-rs/tokio) from 0.7.15 to 0.7.16.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](tokio-rs/tokio@tokio-util-0.7.15...tokio-util-0.7.16)

---
updated-dependencies:
- dependency-name: tokio-util
  dependency-version: 0.7.16
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
)

* Add missing function mappings

* Added roundtrip test

* Quick fix

* Tests for all mappings, refactored logic

* Quick fix to log test

* Removed logb and sign from name mappings changes
…alarUDFImpl::invoke_with_args` (apache#16902)

* Change AsyncScalarUDFImpl::invoke_async_with_args return type to ColumnarValue

* fix docs

* cargo fmt

* cargo clippy

* Add a note in the upgrade guide

* Fix merge blunder

---------

Co-authored-by: Andrew Lamb <[email protected]>
* feat(spark): implement spark array function array

Signed-off-by: Alan Tang <[email protected]>

* chore: add license header

Signed-off-by: Alan Tang <[email protected]>

* chore: fix clippy error

Signed-off-by: Alan Tang <[email protected]>

* feat: add with_list_field_name method and more tests

Signed-off-by: Alan Tang <[email protected]>

* feat: add name field to SparkArray structure

Signed-off-by: Alan Tang <[email protected]>

* chore: hardcode field name

Signed-off-by: Alan Tang <[email protected]>

* chore: fix clippy error

Signed-off-by: Alan Tang <[email protected]>

---------

Signed-off-by: Alan Tang <[email protected]>
* Use get_slice_memory_size() instead of get_array_memory_size() for measuring array_agg accumulator size

* Add comment explaining the rationale for using `.get_slice_memory_size()`
…pache#17003)

* Support centroids config for `approx_percentile_cont_with_weight`

* Match two functions' signature

* Update docs

* Address comments and unify centroids config
@github-actions github-actions bot added the documentation, sql, development-process, and logical-expr labels Aug 20, 2025
@github-actions github-actions bot added the physical-expr, optimizer, sqllogictest, substrait, execution, proto, functions, ffi, physical-plan, and spark labels Aug 20, 2025
@alamb
Contributor

alamb commented Aug 23, 2025

Sorry for the delay -- this PR seems to have quite a few conflicts.

Labels
common, core, datasource, development-process, documentation, execution, ffi, functions, logical-expr, optimizer, physical-expr, physical-plan, proto, spark, sql, sqllogictest, substrait
Development

Successfully merging this pull request may close these issues.

Regression: DataFrameWriteOptions::with_single_file_output produces a directory