[DISCUSS] DataFusion less frequent major / breaking releases (ease using multiple third-party extensions (like delta, or iceberg) ) #16622

@alamb

Is your feature request related to a problem or challenge?

One of the dreams of the composable data ecosystem is to quickly assemble a system from various components (DataFusion, data formats

DataFusion still releases once a month, which lets changes flow quickly but also causes at least two challenges:

  1. Upgrading downstream projects takes non-trivial work, as mentioned in [Discuss] Release cadence / patch releases / Long Term Supported (lts) minor releases #5269
  2. Upgrading and using downstream third-party extensions becomes hard

Third-party extensions like delta-rs and iceberg provide TableProviders for DataFusion, which is really nice. However, to use those packages together, the version of DataFusion each one depends on must match exactly.

This means an application that relies on multiple downstream packages must wait until ALL of them have upgraded to the new version before it can upgrade DataFusion. Any delay in one downstream library delays the application's upgrade as well.
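The "wait for ALL of them" constraint can be sketched as a small check. This is only an illustration: the function name `blocking_extensions`, the extension names used as keys, and the version numbers are hypothetical and are not part of DataFusion or any extension's API.

```rust
use std::collections::HashMap;

/// Returns the (sorted) names of extensions that still block an upgrade
/// to the DataFusion major version `target`.
///
/// `supported` maps a hypothetical extension name to the DataFusion major
/// version its latest release is built against. Because the versions must
/// match exactly, any extension on a different version is a blocker.
pub fn blocking_extensions(target: u32, supported: &HashMap<&str, u32>) -> Vec<String> {
    let mut blockers: Vec<String> = supported
        .iter()
        .filter(|(_, version)| **version != target)
        .map(|(name, _)| name.to_string())
        .collect();
    // HashMap iteration order is unspecified, so sort for stable output.
    blockers.sort();
    blockers
}
```

The application can upgrade only when the returned list is empty, so a single lagging extension holds back the whole stack.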

For example, an application that wants to use delta-rs, iceberg, and the table-providers crate faces a race after each upgrade of DataFusion.

Let's take a release timeline like the following:

  1. +0 days: DataFusion version X released
  2. +7 days: New delta-rs release upgraded to DataFusion X
  3. +11 days: New iceberg crate release upgraded to DataFusion X
  4. +12 days: New table-providers version released
  5. +13-30 days: End user app can upgrade DataFusion, delta-rs, and iceberg
  6. +31 days: New DataFusion is released again
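The arithmetic behind this timeline can be sketched as follows. The helper `upgrade_window` is a hypothetical name introduced here for illustration, and it assumes the application can upgrade the day after the last downstream crate ships.

```rust
/// Given the day (relative to a DataFusion release) on which each downstream
/// crate shipped support for that release, and the day the next DataFusion
/// release lands, returns `(first_day, window_len)`:
/// the first day the application can upgrade, and how many days it can stay
/// current before the next release. Returns `None` if the slice is empty or
/// the next release arrives before every downstream crate has caught up.
pub fn upgrade_window(downstream_days: &[u32], next_release_day: u32) -> Option<(u32, u32)> {
    // The slowest downstream crate determines when the app can move.
    let last = *downstream_days.iter().max()?;
    // Assumption: the app upgrades the day after the last crate is released.
    let first_day = last + 1;
    if first_day >= next_release_day {
        return None;
    }
    Some((first_day, next_release_day - first_day))
}
```

With the days from the timeline above (7, 11, 12, and a next release at day 31), this gives a window that starts on day 13 and lasts 18 days. If any one crate slips past day 30, the window disappears entirely.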

Describe the solution you'd like

I would like downstream libraries to have more time and schedule flexibility when upgrading DataFusion and other dependent crates, so that it is easier to construct a system from different components.

Describe alternatives you've considered

Option 1: Switch to major/minor release cadence

We could follow the model of arrow-rs, which releases monthly but makes breaking releases only quarterly. Here is how it works in arrow-rs: https://github.com/apache/arrow-rs?tab=readme-ov-file#release-versioning-and-schedule

This would mean continuing to release every month, but only allowing breaking API changes every 3rd release (or some other cadence)
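As a rough sketch of what "breaking changes only every 3rd release" would look like in version numbers: the `Version` type, the `release_schedule` function, and the starting version used below are all hypothetical, and the sketch assumes the first release in the sequence is itself a major release.

```rust
/// A semver-style version; the patch component is omitted for brevity.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Version {
    pub major: u32,
    pub minor: u32,
}

/// Produces the versions of `releases` consecutive monthly releases,
/// starting at `start`, where only every `cadence`-th release may contain
/// breaking API changes (a major bump). All others are minor bumps.
pub fn release_schedule(start: Version, releases: usize, cadence: usize) -> Vec<Version> {
    assert!(cadence > 0, "cadence must be at least 1");
    let mut out = Vec::with_capacity(releases);
    let mut current = start;
    for i in 0..releases {
        if i > 0 {
            if i % cadence == 0 {
                // Breaking release: bump major, reset minor.
                current = Version { major: current.major + 1, minor: 0 };
            } else {
                // Non-breaking monthly release: bump minor only.
                current.minor += 1;
            }
        }
        out.push(current);
    }
    out
}
```

Starting from a hypothetical 50.0 with a cadence of 3, six monthly releases would be 50.0, 50.1, 50.2, 51.0, 51.1, 51.2. Downstream crates that depend on `50` would then stay compatible for three months instead of one.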

The major cost here is that maintainers and contributors would have to be diligent about not merging breaking API changes until a major release

This is possible to automate somewhat:

Option 2: LTS and feature branch

Keep (at least) two branches going: LTS and main, as proposed by @andygrove in #5269.

In this model we would likely backport changes to the LTS branch and make releases from there. The downside of this approach is that there is extra work to backport changes to LTS.

Additional context

No response
