
Conversation

@jbusecke
Contributor

Follow-up on a Slack discussion with @paraseba.

I think that pointing this out can make virtual stores much more attractive to HPC folks.

@rabernat would be curious if you think this all looks correct?

@jbusecke
Contributor Author

Oh that spellcheck hook is clutch!

@TomNicholas
Contributor

The main factor here will be sharding, which effectively packs multiple chunks into one file/object (a shard). Virtual datasets are kind of pre-sharded in the sense that other file formats usually have multiple chunks per file. But it's not really the virtualness that reduces the number of files, it's the use of sharding. Icechunk supports sharding but so does Zarr v3's native format even without Icechunk.

Icechunk has some additional files compared to Native Zarr to implement version control, and I would have to defer to Seba/Deepak on whether your arithmetic is exactly right, but I think the number of those extra files will be a small correction compared to the effect of changing chunk sizes / using sharding.
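Tom's sharding point can be made concrete with some back-of-the-envelope arithmetic. A minimal sketch with purely illustrative numbers (not taken from this PR):

```python
# Illustrative only: sharding, not virtualness, drives the object count.
n_source_files = 365      # hypothetical: one netCDF per day for a year
chunks_per_file = 100     # hypothetical: chunks inside each netCDF

total_chunks = n_source_files * chunks_per_file

# Rewriting as unsharded native Zarr v3: one object per chunk.
unsharded_objects = total_chunks

# Sharding with one shard per original file packs the chunks back together.
sharded_objects = total_chunks // chunks_per_file

print(unsharded_objects, sharded_objects)  # 36500 365
```

A virtual store referencing the original netCDFs lands at the low end of this range because the netCDF files are already acting like shards.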

@jbusecke
Contributor Author

But it's not really the virtualness that reduces the number of files

Oh, then I might have misunderstood things here? To clarify: I am taking the original files out of the equation here (since not many folks will rewrite to Zarr/Icechunk and then delete the original netCDFs, for example).

I would have to defer to Seba/Deepak on whether your arithmetic is exactly right

The text is a 1:1 copy from Slack by @paraseba 😁, I just thought it would be good to expose it in the docs.

@TomNicholas
Contributor

I'm just trying to clarify that:
a) Native Zarr v3 and Icechunk with native chunks have approximately the same total number of files (assuming the same chunk and shard sizes are chosen), the difference being what you listed above.
b) Virtually referencing chunks inside formats like netCDF does mean fewer total files than rewriting the netCDFs as native Zarr/Icechunk chunks, but it's not really because the chunks are virtual, it's because the netCDF files are effectively acting like shards. If you virtually referenced some other exploded file format you would have a large number of virtual chunks. This would happen if you virtually referenced an existing unsharded zarr v3 store, which VirtualiZarr supports doing.

If the question is "how many extra inodes will I take up by adding a fully virtual icechunk store next to the original netCDFs" then the calculation above is correct. I just wanted to clarify because I think the text here could explain this more clearly. (I also think this is a good thing to add to the docs!)
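To put rough numbers on point (b), here is a hypothetical sketch (figures are made up, just to illustrate the decoupling):

```python
# Illustrative only: the number of virtual chunk references is decoupled
# from the number of underlying objects they resolve to.
total_chunks = 36_500
chunks_per_netcdf = 100  # hypothetical

# Virtually referencing netCDFs: each file acts like a shard, so the
# 36,500 virtual references resolve to only 365 underlying objects.
netcdf_objects = total_chunks // chunks_per_netcdf

# Virtually referencing an exploded (unsharded) Zarr v3 store: every
# virtual chunk points at its own object.
exploded_zarr_objects = total_chunks

# Either way, the Icechunk repo itself adds only metadata files
# (snapshots, manifests, refs, config); the per-chunk cost is an entry
# in a manifest file, not an extra file.
print(netcdf_objects, exploded_zarr_objects)  # 365 36500
```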

@jbusecke jbusecke changed the title Add subsection on number of files DOCS: Add subsection on number of files Jul 11, 2025
@jbusecke
Contributor Author

b) Virtually referencing chunks inside formats like netCDF does mean fewer total files than rewriting the netCDFs as native Zarr/Icechunk chunks, but it's not really because the chunks are virtual, it's because the netCDF files are effectively acting like shards. If you virtually referenced some other exploded file format you would have a large number of virtual chunks.

Hmmm this goes against what I thought so far, but I might very very well be wrong.

So in that case, building the same virtual store, but with more files, would increase the number of objects/files stored in the icechunk repo?

I ran a little test based on the docs example:

import fsspec
import icechunk
from virtualizarr import open_virtual_mfdataset
from virtualizarr.parsers import HDFParser
from obstore.store import S3Store

fs = fsspec.filesystem('s3', anon=True)

oisst_files = fs.glob('s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.*.nc')

# 🛠️ Tune the slice and rerun. !!! Make sure to erase the local store before running again.
oisst_files = sorted('s3://' + f for f in oisst_files)[0:3]
print(len(oisst_files))

store = S3Store(
    "noaa-cdr-sea-surface-temp-optimum-interpolation-pds",
    skip_signature=True,
    region='us-east-1',  # just guessed this...
)
parser = HDFParser()

virtual_ds = open_virtual_mfdataset(
    oisst_files,
    object_store=store,
    parser=parser,
    concat_dim='time',
    combine='nested',
    parallel='lithops',
    coords='minimal',
    compat='override',
    combine_attrs='override',
)

storage = icechunk.local_filesystem_storage(
    path='temp_testing_delete/oisst',
)

config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        "s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds",
        icechunk.s3_store(region="us-east-1"),
    )
)
credentials = icechunk.containers_credentials(
    {"s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds": icechunk.s3_credentials(anonymous=True)}
)
repo = icechunk.Repository.create(storage, config, credentials)

session = repo.writable_session("main")
virtual_ds.virtualize.to_icechunk(session.store)
session.commit("My first virtual store!")

fs_local = fsspec.filesystem('local')
print(len(fs_local.find('temp_testing_delete/oisst')))

Whether I use 3 or 30 input files (I made sure to delete the entire local store each time), I get 14 files!

['/home/jovyan/temp_testing_delete/oisst/chunks/A7XFVG1Z9MRNW83WHPK0',
 '/home/jovyan/temp_testing_delete/oisst/chunks/Y2W02VA38HZTN4QXBP5G',
 '/home/jovyan/temp_testing_delete/oisst/config.yaml',
 '/home/jovyan/temp_testing_delete/oisst/manifests/1JVH1WRN30A2E1SMP5P0',
 '/home/jovyan/temp_testing_delete/oisst/manifests/1TQF90293NY045SBEE7G',
 '/home/jovyan/temp_testing_delete/oisst/manifests/F3RS921X2MEZ0W81D760',
 '/home/jovyan/temp_testing_delete/oisst/manifests/G1DS554C2HCFTCEY8NY0',
 '/home/jovyan/temp_testing_delete/oisst/manifests/PC8DG6W8Z1PDGWZQ801G',
 '/home/jovyan/temp_testing_delete/oisst/manifests/QC3XTR2M6V4JK7X3SY40',
 '/home/jovyan/temp_testing_delete/oisst/manifests/Y1WM9X2BWA16ZYX84ZQG',
 '/home/jovyan/temp_testing_delete/oisst/refs/branch.main/ref.json',
 '/home/jovyan/temp_testing_delete/oisst/snapshots/1CECHNKREP0F1RSTCMT0',
 '/home/jovyan/temp_testing_delete/oisst/snapshots/C26JRME761RP8HA4CD7G',
 '/home/jovyan/temp_testing_delete/oisst/transactions/C26JRME761RP8HA4CD7G']
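Tallying the 14 paths above by top-level directory (just counting the listing, no new data):

```python
from collections import Counter

# The 14 relative paths from the listing above.
paths = [
    "chunks/A7XFVG1Z9MRNW83WHPK0",
    "chunks/Y2W02VA38HZTN4QXBP5G",
    "config.yaml",
    "manifests/1JVH1WRN30A2E1SMP5P0",
    "manifests/1TQF90293NY045SBEE7G",
    "manifests/F3RS921X2MEZ0W81D760",
    "manifests/G1DS554C2HCFTCEY8NY0",
    "manifests/PC8DG6W8Z1PDGWZQ801G",
    "manifests/QC3XTR2M6V4JK7X3SY40",
    "manifests/Y1WM9X2BWA16ZYX84ZQG",
    "refs/branch.main/ref.json",
    "snapshots/1CECHNKREP0F1RSTCMT0",
    "snapshots/C26JRME761RP8HA4CD7G",
    "transactions/C26JRME761RP8HA4CD7G",
]

counts = Counter(p.split("/")[0] for p in paths)
print(dict(counts))
# {'chunks': 2, 'config.yaml': 1, 'manifests': 7, 'refs': 1,
#  'snapshots': 2, 'transactions': 1}
```

So the commit machinery (config, refs, snapshots, transactions) is a fixed handful of files, the manifests appear to scale with the number of arrays rather than the number of input files, and the two entries under chunks/ are presumably the small coordinate chunks that were written natively rather than referenced virtually.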

On top of that, each file seems to contain only a single chunk! This is from the 30-file run:

(screenshot of the chunk layout omitted)
