[Bug]: [Python] Python cache directory leaks cached packages between pipeline runs #35717

@hjtran

Description

What happened?

When a requirements file is specified in the Python SDK, the wheels for the extra packages are staged by default in TEMPDIR/dataflow-requirements-cache (we should also rename this directory to something Dataflow-agnostic). This directory is never cleaned up, probably so that the wheels can be reused between runs.

The issue is that we later stage the entire cache directory, even if some of the wheels aren't needed or used by the workflow being submitted.
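To make the leak concrete, here is a minimal sketch. It is not the actual Beam code: `stage_whole_cache`, `stage_only_required`, and `demo` are hypothetical names, and matching wheels by the distribution prefix of the filename is an assumption made for illustration.

```python
import os
import tempfile


def stage_whole_cache(cache_dir):
    """Mimics the leaky behaviour: every file in the cache dir gets staged."""
    return sorted(os.listdir(cache_dir))


def stage_only_required(cache_dir, required_packages):
    """Stages only files whose distribution name matches a required package.

    Assumption: files follow the wheel naming convention, where the
    distribution name comes first and uses "_" in place of "-".
    """
    wanted = {name.lower().replace("-", "_") for name in required_packages}
    staged = []
    for filename in sorted(os.listdir(cache_dir)):
        dist = filename.split("-", 1)[0].lower()
        if dist in wanted:
            staged.append(filename)
    return staged


def demo():
    with tempfile.TemporaryDirectory() as cache_dir:
        # An earlier run cached numpy; the current run only requires requests.
        for name in (
            "numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.whl",
            "requests-2.32.3-py3-none-any.whl",
        ):
            open(os.path.join(cache_dir, name), "w").close()
        leaky = stage_whole_cache(cache_dir)
        filtered = stage_only_required(cache_dir, ["requests"])
        return leaky, filtered
```

A real fix would also have to keep transitive dependencies of the requirements file, which this sketch ignores. Populating a fresh per-run directory and staging only that may be simpler than filtering a shared cache.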

The staging code I'm referring to is under create_job_resources:

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
