Side input datasets must be materialized or else memory leaks #131

@langmore

Users will trigger a memory leak if side inputs such as land_sea_mask or thresholds are not materialized before they are used.

Here is the step-by-step sequence for land_sea_mask:

  1. Inside ComputeStatisticsAggregateAndPrepareForCombine.process, the line aggregation_state = self.aggregator.aggregate_stat_var(stat) produces a non-materialized aggregation_state via
    1. binning_method.create_bin_mask(stat), which creates a non-materialized mask
    2. xr.dot(stat, *weights, *bin_masks, dims=reduce_dims_set), whose result is then also non-materialized
  2. During CombiningSum, the accumulator is never materialized
    1. Memory grows faster than O(N) (possibly because a growing Dask graph is being built?)
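
The suspected mechanism in step 2 can be illustrated with a toy model. This is only a sketch in plain Python, not Dask and not this project's code: LazySum and combine are hypothetical names invented here. It shows how an accumulator that records additions instead of performing them keeps a graph that grows with every element, while an accumulator that is materialized after each step stays at constant size. It does not attempt to reproduce the faster-than-O(N) growth reported above.

```python
class LazySum:
    """Toy stand-in for a lazy array: records additions instead of running them."""

    def __init__(self, value=None, parents=()):
        self.value = value      # set only on leaf (already computed) nodes
        self.parents = parents  # the two operands of a deferred addition

    def __add__(self, other):
        # Nothing is computed here; we only remember what to add later.
        return LazySum(parents=(self, other))

    def graph_size(self):
        # Count every node reachable from this one (iteratively, no recursion).
        count, stack = 0, [self]
        while stack:
            node = stack.pop()
            count += 1
            stack.extend(node.parents)
        return count

    def compute(self):
        # Evaluate the deferred additions and return a single leaf node.
        total, stack = 0, [self]
        while stack:
            node = stack.pop()
            if node.parents:
                stack.extend(node.parents)
            else:
                total += node.value
        return LazySum(value=total)


def combine(values, materialize):
    """Sum values into an accumulator, recording its graph size after each step."""
    acc = LazySum(value=0)
    sizes = []
    for v in values:
        acc = acc + LazySum(value=v)
        if materialize:
            acc = acc.compute()  # collapse the graph back to one node
        sizes.append(acc.graph_size())
    return acc.compute().value, sizes


lazy_total, lazy_sizes = combine([1, 2, 3], materialize=False)
eager_total, eager_sizes = combine([1, 2, 3], materialize=True)
print(lazy_sizes)   # graph held by the lazy accumulator after each step
print(eager_sizes)  # graph held by the materialized accumulator after each step
```

Both variants produce the same total; only the amount of state retained between steps differs.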

I see a few possible solutions:

  1. Document the requirement and emit a logging.warning to encourage users to materialize side inputs
  2. Materialize side inputs (land_sea_mask, thresholds) when they are passed to the user class in __init__
  3. Materialize as late as possible, in case the side input is sliced first
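
A minimal sketch of how options 1 and 2 might be combined. The names materialize_side_input and HypotheticalStatistic are invented for illustration and are not part of this codebase. The helper assumes that a lazy side input exposes a non-empty chunks attribute and a compute() method, as Dask-backed xarray objects do; a zero-dimensional Dask-backed array has empty chunks and would slip past this check.

```python
import logging

logger = logging.getLogger(__name__)


def materialize_side_input(side_input, name):
    """Return an in-memory version of side_input, warning if it was lazy."""
    # Assumption: lazy inputs have non-empty `chunks` and a `compute()` method.
    if getattr(side_input, "chunks", None):
        logger.warning(
            "Side input %r is not materialized; computing it now.", name
        )
        return side_input.compute()
    # Already in memory (or not an array-like we recognize): pass through.
    return side_input


class HypotheticalStatistic:
    """Stand-in for a user class that receives side inputs in __init__."""

    def __init__(self, land_sea_mask):
        # Option 2: materialize as soon as the side input is handed over.
        self.land_sea_mask = materialize_side_input(
            land_sea_mask, "land_sea_mask"
        )
```

For option 3, the same helper could instead be called on the sliced side input at the point of use, so that only the needed slice is loaded into memory.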
