fill_value is not preserved in the rechunked output #133

@flamingbear

Hi,

This is a follow-up to #131 (and an updated version of #132).

While comparing the source and target zarr stores from my regression tests, I noticed that the fill_value differs between them. It appears that fill_value is not preserved by the rechunk, and this can lead to output stores that are much larger than needed.

This is an updated version of my previous test script. It creates a degenerate case in which almost all of the data being rechunked has the same value.

If you run this script you will see that the fill_value in "foo/bar/.zarray" changes from "fill_value": 1.0 to "fill_value": null between the source and target zarr stores. The on-disk sizes of the two stores also differ significantly, by nearly two orders of magnitude (36K versus 3.1M in the du output).
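The metadata difference can also be checked without importing zarr, by reading the two .zarray files directly. This is only a minimal sketch, assuming Zarr v2 directory stores laid out like the testoutput/ stores in this issue; `read_fill_value` and `compare_fill_values` are hypothetical helper names, not part of zarr or rechunker.

```python
import json
from pathlib import Path


def read_fill_value(store_path, array_path):
    """Return the "fill_value" entry from a Zarr v2 array's .zarray file."""
    zarray = Path(store_path) / array_path / ".zarray"
    with zarray.open() as f:
        return json.load(f)["fill_value"]


def compare_fill_values(source_store, target_store, array_path):
    """Return (source_fill, target_fill, preserved) for one array.

    A JSON null fill_value is returned as None.
    """
    source_fill = read_fill_value(source_store, array_path)
    target_fill = read_fill_value(target_store, array_path)
    return source_fill, target_fill, source_fill == target_fill


# Hypothetical usage against the stores from this issue:
# compare_fill_values('testoutput/source.zarr', 'testoutput/target.zarr', 'foo/bar')
```

If the rechunk preserved the fill_value, the third element of the returned tuple would be True.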

Thanks,
Matt

❯ du -hs *
 36K	source.zarr
3.1M	target.zarr
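To see where the extra size comes from, it can help to count how many chunk files each store actually contains. The following is a rough sketch, not a definitive tool: `store_stats` is a hypothetical helper, and it assumes a Zarr v2 directory store in which every file whose name does not start with "." is a chunk (metadata files such as .zarray, .zgroup, .zattrs and .zmetadata are dot-prefixed). The byte total is a sum of file sizes, so it will not match du exactly, since du reports disk blocks.

```python
from pathlib import Path


def store_stats(store_path):
    """Return (total_bytes, chunk_files) for a directory store.

    Assumption: files whose names do not start with "." are chunks.
    """
    total_bytes = 0
    chunk_files = 0
    for path in Path(store_path).rglob("*"):
        if path.is_file():
            total_bytes += path.stat().st_size
            if not path.name.startswith("."):
                chunk_files += 1
    return total_bytes, chunk_files


# Hypothetical usage against the stores from this issue:
# for name in ('testoutput/source.zarr', 'testoutput/target.zarr'):
#     print(name, store_stats(name))
```

A large gap in the chunk count between the two stores would suggest that the target holds chunks the source never had to write.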

Here's a script that demonstrates the issue.

import zarr
from rechunker import rechunk
import shutil


def run_create_input_store():
    # Start from a clean output directory.
    shutil.rmtree('testoutput/', ignore_errors=True)
    store = zarr.DirectoryStore('testoutput/source.zarr')
    root = zarr.group(store=store, overwrite=True)
    foo = root.create_group('foo')
    root.attrs['description'] = 'root description'
    foo.attrs['description'] = 'foo description'
    # Array of ones; its fill_value in foo/bar/.zarray is 1.0.
    bar = foo.ones('bar', shape=(10000, 10000))
    # Change a single element so the data is almost, but not entirely, uniform.
    bar[5000, 5000] = 3
    bar.attrs['description'] = 'foo description'
    zarr.consolidate_metadata(store)


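def rechunkit():
    openstore = zarr.open_consolidated('testoutput/source.zarr')
    # Rechunk foo/bar to (1000, 1000) chunks with a 1GB memory limit.
    array_plan = rechunk(openstore, {'foo/bar': (1000, 1000)},
                         '1GB',
                         'testoutput/target.zarr',
                         temp_store='testoutput/temp.zarr')
    array_plan.execute()
    # Consolidate the target metadata so it can be compared with the source.
    zarr.consolidate_metadata('testoutput/target.zarr')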


if __name__ == '__main__':
    run_create_input_store()
    rechunkit()
    print('Compare the .zmetadata files in both your source.zarr and target.zarr directories')
    print('You will see that the "fill_value" in the source is 1.0 and it is null in the target.')
    source = zarr.open('testoutput/source.zarr')
    target = zarr.open('testoutput/target.zarr')
    print(source['foo']['bar'].fill_value)
    print(target['foo']['bar'].fill_value)
    
