Performance regression in CSV pl.select(pl.len()) using streaming engine #26675

@kdn36

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Any query such as the following:

import polars as pl
import time

target = "./tt.csv"
n_rows = 100_000_000

df = (
    pl.DataFrame()
    .with_columns(pl.int_range(n_rows).alias("a"))
    .with_columns(pl.col.a.cast(pl.String).alias("b"))
)
ref_schema = pl.Schema({"a": pl.Int64, "b": pl.String})

df.lazy().sink_csv(target)

print("__start__ scan_csv", flush=True)
start = time.perf_counter()

out = (
    pl.scan_csv(target, schema=ref_schema)
    .select(pl.len())
    .collect(engine="streaming")
)
end = time.perf_counter()
print(out)
print(f"duration: {((end - start) * 1000):.3f} milliseconds", flush=True)
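
The example above times a single run, which can be noisy (file cache state, scheduler jitter). A minimal sketch of a repeat-timing helper follows; `time_runs` and its `repeats` parameter are hypothetical names introduced here for illustration and are not part of Polars or of the original report.

```python
import statistics
import time
from typing import Callable


def time_runs(fn: Callable[[], object], repeats: int = 5) -> list[float]:
    """Call fn `repeats` times and return each wall-clock duration in milliseconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000)
    return durations


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without Polars installed.
    runs = time_runs(lambda: sum(range(1_000_000)), repeats=5)
    print(f"min: {min(runs):.3f} ms, median: {statistics.median(runs):.3f} ms")
```

To apply it to the query above, one could pass `lambda: pl.scan_csv(target, schema=ref_schema).select(pl.len()).collect(engine="streaming")` as `fn` and compare the minimum or median across the two versions.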

Log output

$ for v in 1.35.1 1.35.2; do echo $v; pip install polars==$v --upgrade --target /tmp/polars_specific && PYTHONPATH=/tmp/polars_specific python perf_regression_mre.py; done
1.35.1
Collecting polars==1.35.1
  Using cached polars-1.35.1-py3-none-any.whl.metadata (10 kB)
Collecting polars-runtime-32==1.35.1 (from polars==1.35.1)
  Using cached polars_runtime_32-1.35.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.5 kB)
Using cached polars-1.35.1-py3-none-any.whl (783 kB)
Using cached polars_runtime_32-1.35.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (41.3 MB)
Installing collected packages: polars-runtime-32, polars
Successfully installed polars-1.35.1 polars-runtime-32-1.35.1

[notice] A new release of pip is available: 25.2 -> 26.0.1
[notice] To update, run: pip install --upgrade pip
__start__ scan_csv
shape: (1, 1)
┌───────────┐
│ len       │
│ ---       │
│ u32       │
╞═══════════╡
│ 100000000 │
└───────────┘
duration: 52.978 milliseconds
1.35.2
Collecting polars==1.35.2
  Using cached polars-1.35.2-py3-none-any.whl.metadata (10 kB)
Collecting polars-runtime-32==1.35.2 (from polars==1.35.2)
  Using cached polars_runtime_32-1.35.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.5 kB)
Using cached polars-1.35.2-py3-none-any.whl (783 kB)
Using cached polars_runtime_32-1.35.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (41.3 MB)
Installing collected packages: polars-runtime-32, polars
Successfully installed polars-1.35.2 polars-runtime-32-1.35.2

[notice] A new release of pip is available: 25.2 -> 26.0.1
[notice] To update, run: pip install --upgrade pip
__start__ scan_csv
shape: (1, 1)
┌───────────┐
│ len       │
│ ---       │
│ u32       │
╞═══════════╡
│ 100000000 │
└───────────┘
duration: 242.239 milliseconds

Issue description

There is a significant performance regression in the given example: the query takes 53 ms on 1.35.1 and 242 ms on 1.35.2, roughly 4.6x slower. It was bisected to:
#25179

Following this PR, the line count runs single-threaded.
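
To illustrate the difference between a single-threaded and a parallel line count, here is a minimal sketch that splits a file into byte ranges and counts newline bytes in each range on a thread pool. It is not the Polars implementation. The names `count_newlines` and `parallel_line_count` are hypothetical. The sketch assumes `\n` line endings and ignores CSV details such as quoted fields containing newlines, the header row, and a missing trailing newline. It shows only the partitioning idea and makes no claim about speedup in CPython.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_newlines(path: str, start: int, end: int, block: int = 1 << 20) -> int:
    """Count newline bytes in the byte range [start, end) of the file."""
    total = 0
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            data = f.read(min(block, remaining))
            if not data:
                break
            total += data.count(b"\n")
            remaining -= len(data)
    return total


def parallel_line_count(path: str, n_workers: int = 4) -> int:
    """Split the file into up to n_workers byte ranges and sum the per-range counts."""
    size = os.path.getsize(path)
    if size == 0:
        return 0
    step = -(-size // n_workers)  # ceiling division so the ranges cover the whole file
    ranges = [(s, min(s + step, size)) for s in range(0, size, step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        counts = pool.map(lambda r: count_newlines(path, r[0], r[1]), ranges)
        return sum(counts)
```

Because each range only counts newline bytes, no coordination between workers is needed beyond summing the results; calling `parallel_line_count(path, n_workers=1)` gives the single-threaded baseline for comparison.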

Expected behavior

No regression: the query should take about as long as it did on 1.35.1.

Installed versions

See MRE.

Metadata

Assignees

No one assigned

    Labels

      • A-io-csv (Area: reading/writing CSV files)
      • A-streaming (Related to the streaming engine)
      • P-medium (Priority: medium)
      • accepted (Ready for implementation)
      • bug (Something isn't working)
      • performance (Performance issues or improvements)
      • python (Related to Python Polars)
      • regression (Issue introduced by a new release)

    Type

    No type

    Projects

    Status

    Ready

    Milestone

    No milestone

    Relationships

    None yet
