Edge case of decompressing legacy file with top-level metadata #65

@brianzhang01

Back in April 2021, I had some TreeSequence objects with top-level metadata that I wanted to write out using tszip, so I proposed PR #35. I've stuck to a version of tszip with that change for the past year.

PR #42 involved a rewrite of tszip that mostly supports legacy versions, but I have found that the top-level metadata in files I had previously written seems to be missing. Top-level metadata from newly tszipped files works fine, as it is stored in root.items() and is correctly handled, but the legacy metadata is in root.attrs.items(). Another piece of good news is that PR #35 never got its own tszip version on PyPI, so I'm probably the only one who uses it. 😄

After quite a bit of debugging, trying to understand what PR #42 did, I came up with a solution that seems to work locally.

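import json


def decompress_zarr(root):
    coordinates = root["coordinates"][:]
    dict_repr = {"sequence_length": root.attrs["sequence_length"]}

    # Added by brianzhang01: legacy files keep top-level metadata in
    # root.attrs rather than root.items(), so copy it across explicitly.
    for key, value in root.attrs.items():
        if key == "metadata_schema":
            # Serialize the schema back to a JSON string.
            dict_repr[key] = json.dumps(value)
        elif key == "metadata":
            # Serialize the metadata to UTF-8 encoded JSON bytes.
            dict_repr[key] = json.dumps(value).encode("utf-8")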
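
To make the idea easier to test in isolation, here is a minimal sketch of the same attrs handling pulled out into a standalone function. The helper name legacy_top_level_metadata is hypothetical (it is not part of tszip), and a plain dict with made-up values stands in for root.attrs, so this runs without zarr or a legacy file.

```python
import json
from typing import Any, Dict, Mapping


def legacy_top_level_metadata(attrs: Mapping[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: extract legacy top-level metadata from root.attrs."""
    out: Dict[str, Any] = {}
    if "metadata_schema" in attrs:
        # Same conversion as the patch: schema becomes a JSON string.
        out["metadata_schema"] = json.dumps(attrs["metadata_schema"])
    if "metadata" in attrs:
        # Same conversion as the patch: metadata becomes UTF-8 JSON bytes.
        out["metadata"] = json.dumps(attrs["metadata"]).encode("utf-8")
    return out


# A plain dict stands in for root.attrs of a legacy file (values are invented).
legacy_attrs = {
    "sequence_length": 100.0,
    "metadata_schema": {"codec": "json"},
    "metadata": {"name": "example"},
}
extracted = legacy_top_level_metadata(legacy_attrs)

# The bytes decode back to the original metadata object.
round_tripped = json.loads(extracted["metadata"].decode("utf-8"))
```

A file with no legacy metadata in its attrs yields an empty dict, so calling the helper unconditionally would leave newly written files unaffected.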

Would someone with more knowledge of the codebase be able to take a look and modify as necessary? My PR #35 includes a test, test_small_msprime_top_level_metadata, that can be used to construct a TreeSequence with some top-level metadata. When I write such a file out with a version right after PR #35 and then read it in with the latest version, I see the following keys for root.attrs.items():

format_name
format_version
metadata
metadata_schema
provenance
sequence_length

and the following are the keys of root.items():

coordinates
edges
individuals
migrations
mutations
nodes
populations
provenances
sites

I would also suggest checking whether the format_name, format_version, and provenance fields of root.attrs are correctly handled for the legacy versions. It seemed fine to me with the provenance correctly carried over, but I notice that only root.attrs["sequence_length"] is accessed in the current decompress_zarr() function.
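
As a rough way to see which attrs still need checking, the key lists above can be compared with simple set arithmetic. This sketch only uses the keys reported in this issue; it reflects what decompress_zarr() reads according to the description here, and other parts of tszip may well read the remaining fields.

```python
# Keys of root.attrs reported above for a legacy file.
attrs_keys = {
    "format_name",
    "format_version",
    "metadata",
    "metadata_schema",
    "provenance",
    "sequence_length",
}

# The only attribute decompress_zarr() accesses, per the description above.
read_by_decompress = {"sequence_length"}

# Keys the proposed patch would additionally handle.
handled_by_patch = {"metadata", "metadata_schema"}

# Attributes not read by decompress_zarr() today.
unread_now = sorted(attrs_keys - read_by_decompress)

# Attributes still not read by decompress_zarr() with the patch applied.
still_unread = sorted(attrs_keys - read_by_decompress - handled_by_patch)

print(unread_now)
print(still_unread)
```

The second list is the set of fields worth verifying separately for legacy files.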

Thank you very much.
