Back in April 2021, I had some TreeSequence objects with top-level metadata that I wanted to write out using tszip, so I proposed PR #35. I've stuck to a version of tszip with that change for the past year.
PR #42 involved a rewrite of tszip that mostly supports legacy versions, but I have found that the top-level metadata in files I had previously written seems to be missing. Top-level metadata from newly tszipped files works fine, as it is stored in root.items() and is correctly handled, but the previous metadata is in root.attrs.items(). Another piece of good news is that PR #35 never got its own tszip version on PyPI, so I'm probably the only one who uses it. 😄
After quite a bit of debugging, trying to understand what PR #42 did, I came up with a solution that seems to work locally.
def decompress_zarr(root):
    coordinates = root["coordinates"][:]
    dict_repr = {"sequence_length": root.attrs["sequence_length"]}
    # Added by brianzhang01
    for key, value in root.attrs.items():
        if key == "metadata_schema":
            dict_repr[key] = json.dumps(value)
        elif key == "metadata":
            dict_repr[key] = json.dumps(value).encode("utf-8")
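To make the fallback above easier to try in isolation, here is a minimal sketch of the same conversion using a plain dict in place of root.attrs. The helper name legacy_top_level_metadata and the sample values are hypothetical, and the sketch assumes the PR #35 layout stored both fields as JSON-decoded attributes; it is not the actual tszip code.

```python
import json


def legacy_top_level_metadata(attrs):
    """Collect top-level metadata stored as attributes (assumed PR #35 layout).

    `attrs` is any mapping that behaves like root.attrs; a plain dict is
    used below so the sketch runs without zarr or tszip installed.
    """
    recovered = {}
    if "metadata_schema" in attrs:
        # The schema goes back into the table dict as a JSON string.
        recovered["metadata_schema"] = json.dumps(attrs["metadata_schema"])
    if "metadata" in attrs:
        # The metadata itself goes back as UTF-8 encoded JSON bytes.
        recovered["metadata"] = json.dumps(attrs["metadata"]).encode("utf-8")
    return recovered


# Hypothetical stand-in for the attributes of a legacy file.
legacy_attrs = {
    "sequence_length": 100.0,
    "metadata_schema": {"codec": "json"},
    "metadata": {"foo": "bar"},
}
recovered = legacy_top_level_metadata(legacy_attrs)
```

Keeping the conversion in a small helper like this would let it be unit-tested without constructing a real zarr store, though whether that fits the codebase is for the maintainers to judge.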
Would someone with more knowledge of the codebase be able to take a look and modify as necessary? My PR #35 includes a test, test_small_msprime_top_level_metadata, that can be used to construct a TreeSequence with some top-level metadata. If I write it out with a version right after PR #35 and then read it in with the latest version, I see the following keys for root.attrs.items():
format_name
format_version
metadata
metadata_schema
provenance
sequence_length
and the following are the keys of root.items():
coordinates
edges
individuals
migrations
mutations
nodes
populations
provenances
sites
I would also suggest checking whether the format_name, format_version, and provenance fields of root.attrs are correctly handled for the legacy versions. It seemed fine to me with the provenance correctly carried over, but I notice that only root.attrs["sequence_length"] is accessed in the current decompress_zarr() function.
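As a rough way to audit this, the following sketch lists which attribute keys a reader never touches. The helper name unread_attrs is hypothetical, the attribute values are placeholders, and the only key treated as consumed is sequence_length, based on my reading of decompress_zarr() above.

```python
def unread_attrs(attrs, consumed=("sequence_length",)):
    """Return the attribute keys that are present but never read."""
    return sorted(set(attrs) - set(consumed))


# Keys observed in root.attrs for a legacy file; values are placeholders.
legacy_attr_keys = {
    "format_name": None,
    "format_version": None,
    "metadata": None,
    "metadata_schema": None,
    "provenance": None,
    "sequence_length": None,
}
unread = unread_attrs(legacy_attr_keys)
```

Any key that shows up in this list is either handled somewhere else in the code path or silently dropped, so each one is worth checking against a legacy file.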
Thank you very much.