[SPARK-52850][PYTHON] Skip calling conversions if identity function #51542

Status: Open, 2 commits to merge into base: master

@ueshin (Member) commented Jul 17, 2025

What changes were proposed in this pull request?

Skips calling a conversion function when it is an identity function.

Why are the changes needed?

Function calls are usually expensive. On the critical path, we should avoid the call entirely when the function is an identity function.
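
As a rough illustration of the idea (not the code in this patch), the sketch below returns `None` instead of an identity lambda, so the caller can skip the per-value call. The names `make_converter`, `convert_column`, and the `"string_upper"` type tag are hypothetical and exist only for this example.

```python
from typing import Any, Callable, List, Optional, Sequence

# Hypothetical names for illustration only; this is not PySpark's actual code.
Converter = Callable[[Any], Any]


def make_converter(type_name: str) -> Optional[Converter]:
    """Return a converter for the type, or None if values pass through unchanged."""
    if type_name == "string_upper":
        # A conversion that does real work: uppercase non-null strings.
        return lambda v: None if v is None else v.upper()
    # Identity case: signal "nothing to do" with None instead of `lambda v: v`.
    return None


def convert_column(values: Sequence[Any], conv: Optional[Converter]) -> List[Any]:
    """Apply conv to every value, skipping the per-value call in the identity case."""
    if conv is None:
        return list(values)  # no Python-level function call per element
    return [conv(v) for v in values]
```

With an identity lambda, every element would still cost one Python function call; checking for `None` once per column removes those calls while leaving the results unchanged.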

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests and manual benchmarks.

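def profile(f, *args, _n=10, **kwargs):
    """Profile f over _n runs and print the accumulated cProfile stats."""
    import cProfile
    import pstats
    import gc
    st = None
    # Warm-up runs, not profiled.
    for _ in range(5):
        f(*args, **kwargs)
    for _ in range(_n):
        gc.collect()  # start each profiled run from a clean GC state
        with cProfile.Profile() as pr:
            ret = f(*args, **kwargs)
        # Accumulate the stats of every profiled run into one report.
        if st is None:
            st = pstats.Stats(pr)
        else:
            st.add(pstats.Stats(pr))
    st.sort_stats("time", "cumulative").print_stats()
    return ret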

from pyspark.sql.conversion import ArrowTableToRowsConversion, LocalDataToArrowConversion
from pyspark.sql.types import IntegerType, StringType, StructType

data = [
    (i if i % 1000 else None, str(i), (i, str(i)))
    for i in range(1000000)
]
schema = (
    StructType()
    .add("i", IntegerType(), nullable=True)
    .add("s", StringType(), nullable=True)
    .add("si", StructType().add("i", IntegerType()).add("s", StringType()))
)

def to_arrow():
    return LocalDataToArrowConversion.convert(data, schema, use_large_var_types=False)  # skipping the input check

def from_arrow(tbl):
    return ArrowTableToRowsConversion.convert(tbl, schema)  # skipping creating rows

tbl = profile(to_arrow)
profile(from_arrow, tbl)
  • before
    to_arrow:   140329810 function calls (140329750 primitive calls) in 12.908 seconds
    from_arrow: 180989400 function calls (180989380 primitive calls) in 40.992 seconds
  • after
    to_arrow:   80330750 function calls (80330690 primitive calls) in 10.347 seconds
    from_arrow: 140989380 function calls (140989360 primitive calls) in 35.979 seconds
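
To put the numbers above in perspective, the following sketch computes the calls saved and the relative time reduction. It assumes the first line of each pair comes from `profile(to_arrow)` and the second from `profile(from_arrow, tbl)` (the order in which the script calls them), and that each total is accumulated over the default `_n=10` profiled runs of 1,000,000 rows.

```python
# Totals copied from the before/after output above: (function calls, seconds).
# The mapping of lines to to_arrow/from_arrow is an assumption based on call order.
before = {"to_arrow": (140329810, 12.908), "from_arrow": (180989400, 40.992)}
after = {"to_arrow": (80330750, 10.347), "from_arrow": (140989380, 35.979)}

ROWS = 1_000_000  # len(data) in the benchmark
RUNS = 10         # default _n in profile(); stats are accumulated across runs

summary = {}
for name in before:
    calls_before, secs_before = before[name]
    calls_after, secs_after = after[name]
    saved = calls_before - calls_after
    summary[name] = {
        "calls_saved": saved,
        "calls_saved_per_row": saved / (ROWS * RUNS),
        "time_saved_pct": 100 * (secs_before - secs_after) / secs_before,
    }

for name, s in summary.items():
    print(
        f"{name}: {s['calls_saved']} fewer calls "
        f"({s['calls_saved_per_row']:.2f} per row), "
        f"{s['time_saved_pct']:.1f}% less time"
    )
```

Under these assumptions the patch removes roughly six calls per row in `to_arrow` and four in `from_arrow`. The timings were taken under cProfile, which adds per-call overhead, so the relative speedup without the profiler may differ.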

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun (Member) left a comment

+1, LGTM.
