Skip to content

Conversation

AaronTacke
Copy link
Contributor

@AaronTacke AaronTacke commented Jun 7, 2025

Motivation

We want Pandas to be an optional dependency, especially once a Polars Comparator (see #15) will be added.

Changes

Use native python data structures to replace Pandas in the parts of the comparator implementation that are used by multiple comparators.

TODOS

  • Check for other usages of Pandas in general (non-Pandas specific) files.
  • Remove Pandas as dependency from pixi, create an environment with Pandas, change tests which require Pandas accordingly.
    • why does docs env need sql (in pixi.toml)?
  • Benchark comparisons with many changes (large _values in tabulardelta_dataclasses)

@AaronTacke AaronTacke self-assigned this Jun 7, 2025
@AaronTacke
Copy link
Contributor Author

benchmark
@DominikZuercherQC This PR currently doubles the execution time for comparisons with a very large number of changes.
I'd suggest that the PandasComparator should shorten the list of changes before they are converted into non-Pandas data structures.
Do you think execution time is relevant atm? Otherwise I'd just do it in some later PR :)

@DominikZuercherQC
Copy link
Collaborator

benchmark @DominikZuercherQC This PR currently doubles the execution time for comparisons with a very large number of changes. I'd suggest that the PandasComparator should shorten the list of changes before they are converted into non-Pandas data structures. Do you think execution time is relevant atm? Otherwise I'd just do it in some later PR :)

Unfortunately, performance is quite important (at least for my usecase). Where is the large increase in runtime coming from?
Also, why did you decide to use lists/dicts for _values internally? Would polars Dataframes not be more performant?

@AaronTackeQC
Copy link
Collaborator

@DominikZuercherQC

Where is the large increase in runtime coming from?

I assume copying the changed values from Pandas into lists/dicts.

Also, why did you decide to use lists/dicts for _values internally? Would polars Dataframes not be more performant?

If we want to keep one tabulardelta_dataclasses for all the comparators, I want neither Pandas nor Polars to be a mandatory dependency.
So we can either

  1. Reduce the amount of data we put in this dataclass (e.g. limit the number of changed values we can show as an example), or
  2. Use the previous dataclass for the pandas_comparator only, and add a corresponding version for Polars

@DominikZuercherQC
Copy link
Collaborator

@DominikZuercherQC

Where is the large increase in runtime coming from?

I assume copying the changed values from Pandas into lists/dicts.

Also, why did you decide to use lists/dicts for _values internally? Would polars Dataframes not be more performant?

If we want to keep one tabulardelta_dataclasses for all the comparators, I want neither Pandas nor Polars to be a mandatory dependency. So we can either

1. Reduce the amount of data we put in this dataclass (e.g. limit the number of changed values we can show as an example), or

2. Use the previous dataclass for the pandas_comparator only, and add a corresponding version for Polars

I would vote for option 2 in that case

@AaronTacke
Copy link
Contributor Author

I introduced two versions of the dataclasses: one with Pandas (to make sure the performance of PandasComparator doesn't degrade), and one without (so that TabularDelta can be used without relying on Pandas). It's not pretty, but embraces the modularity of TabularDelta :D

@DominikZuercherQC You can decide whether you want the PolarsComparator to use the native_dataclasses or whether you want to adapt the pandas_dataclasses to create a new polars_dataclasses.

@DominikZuercherQC
Copy link
Collaborator

Thanks @AaronTacke. I guess I will create a polars version of the pandas_dataclasses for the polars comparator then in a follow up PR.

@AaronTackeQC AaronTackeQC marked this pull request as ready for review June 18, 2025 16:29
@AaronTackeQC AaronTackeQC self-requested a review as a code owner June 18, 2025 16:29
@AaronTackeQC AaronTackeQC removed their request for review June 18, 2025 16:30
@AaronTackeQC
Copy link
Collaborator

@DominikZuercherQC feel free to review this, I think it should be fine :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants