
Conversation

@jbusecke
Contributor

Follow-up on a Slack discussion with @paraseba.

I think that pointing this out can make virtual stores much more attractive to HPC folks.

@rabernat would be curious if you think this all looks correct?

@jbusecke
Contributor Author

Oh that spellcheck hook is clutch!

@TomNicholas
Contributor

The main factor here will be sharding, which effectively packs multiple chunks into one file/object (a shard). Virtual datasets are kind of pre-sharded in the sense that other file formats usually have multiple chunks per file. But it's not really the virtualness that reduces the number of files, it's the use of sharding. Icechunk supports sharding but so does Zarr v3's native format even without Icechunk.

Icechunk has some additional files compared to Native Zarr to implement version control, and I would have to defer to Seba/Deepak on whether your arithmetic is exactly right, but I think the number of those extra files will be a small correction compared to the effect of changing chunk sizes / using sharding.
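Tom's sharding point can be made concrete with some back-of-the-envelope arithmetic. A minimal sketch with purely illustrative numbers (not taken from this PR):

```python
# Illustrative only: sharding, not virtualness, drives the object count.
n_source_files = 365      # hypothetical: one netCDF per day for a year
chunks_per_file = 100     # hypothetical: chunks inside each netCDF

total_chunks = n_source_files * chunks_per_file

# Rewriting as unsharded native Zarr v3: one object per chunk.
unsharded_objects = total_chunks

# Sharding with one shard per original file packs the chunks back together.
sharded_objects = total_chunks // chunks_per_file

print(unsharded_objects, sharded_objects)  # 36500 365
```

A virtual store referencing the original netCDFs lands at the low end of this range because the netCDF files are already acting like shards.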

@jbusecke
Contributor Author

But it's not really the virtualness that reduces the number of files

Oh, then I might have misunderstood things here? To clarify: I am taking the original files out of the equation here (since not many folks will rewrite to Zarr/Icechunk and then delete the original netCDFs, for example).

I would have to defer to Seba/Deepak on whether your arithmetic is exactly right

The text is a 1:1 copy from Slack by @paraseba 😁, I just thought it would be good to expose it in the docs.

@TomNicholas
Contributor

I'm just trying to clarify that:
a) Native Zarr v3 and Icechunk with native chunks have approximately the same total number of files (assuming the same chunk and shard sizes are chosen), the difference being what you listed above.
b) Virtually referencing chunks inside formats like netCDF does mean fewer total files than rewriting the netCDFs as native Zarr/Icechunk chunks, but it's not really because the chunks are virtual, it's because the netCDF files are effectively acting like shards. If you virtually referenced some other exploded file format you would have a large number of virtual chunks. This would happen if you virtually referenced an existing unsharded zarr v3 store, which VirtualiZarr supports doing.

If the question is "how many extra inodes will I take up by adding a fully virtual icechunk store next to the original netCDFs" then the calculation above is correct. I just wanted to clarify because I think the text here could explain this more clearly. (I also think this is a good thing to add to the docs!)
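To put rough numbers on point (b), here is a hypothetical sketch (figures are made up, just to illustrate the decoupling):

```python
# Illustrative only: the number of virtual chunk references is decoupled
# from the number of underlying objects they resolve to.
total_chunks = 36_500
chunks_per_netcdf = 100  # hypothetical

# Virtually referencing netCDFs: each file acts like a shard, so the
# 36,500 virtual references resolve to only 365 underlying objects.
netcdf_objects = total_chunks // chunks_per_netcdf

# Virtually referencing an exploded (unsharded) Zarr v3 store: every
# virtual chunk points at its own object.
exploded_zarr_objects = total_chunks

# Either way, the Icechunk repo itself adds only metadata files
# (snapshots, manifests, refs, config); the per-chunk cost is an entry
# in a manifest file, not an extra file.
print(netcdf_objects, exploded_zarr_objects)  # 365 36500
```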

@jbusecke jbusecke changed the title Add subsection on number of files DOCS: Add subsection on number of files Jul 11, 2025
@jbusecke
Contributor Author

b) Virtually referencing chunks inside formats like netCDF does mean fewer total files than rewriting the netCDFs as native Zarr/Icechunk chunks, but it's not really because the chunks are virtual, it's because the netCDF files are effectively acting like shards. If you virtually referenced some other exploded file format you would have a large number of virtual chunks.

Hmmm this goes against what I thought so far, but I might very very well be wrong.

So in that case, building the same virtual store, but with more files, would increase the number of objects/files stored in the icechunk repo?

I ran a little test based on the docs example:

import fsspec
import icechunk
from virtualizarr import open_virtual_mfdataset
from virtualizarr.parsers import HDFParser
from obstore.store import S3Store

fs = fsspec.filesystem('s3', anon=True)

oisst_files = fs.glob('s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.*.nc')

# 🛠️ Tune the slice and rerun. !!! Make sure to erase the local store before running again.
oisst_files = sorted('s3://' + f for f in oisst_files)[0:3]
print(len(oisst_files))

store = S3Store(
    "noaa-cdr-sea-surface-temp-optimum-interpolation-pds",
    skip_signature=True,
    region='us-east-1',  # just guessed this...
)
parser = HDFParser()

virtual_ds = open_virtual_mfdataset(
    oisst_files,
    object_store=store,
    parser=parser,
    concat_dim='time',
    combine='nested',
    parallel='lithops',
    coords='minimal',
    compat='override',
    combine_attrs='override',
)

storage = icechunk.local_filesystem_storage(
    path='temp_testing_delete/oisst',
)

config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        "s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds",
        icechunk.s3_store(region="us-east-1"),
    )
)
credentials = icechunk.containers_credentials(
    {"s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds": icechunk.s3_credentials(anonymous=True)}
)
repo = icechunk.Repository.create(storage, config, credentials)

session = repo.writable_session("main")
virtual_ds.virtualize.to_icechunk(session.store)
session.commit("My first virtual store!")

fs_local = fsspec.filesystem('local')
print(len(fs_local.find('temp_testing_delete/oisst')))

Whether I use 3 or 30 input files (I made sure to delete the entire local store each time), I get 14 files!

['/home/jovyan/temp_testing_delete/oisst/chunks/A7XFVG1Z9MRNW83WHPK0',
 '/home/jovyan/temp_testing_delete/oisst/chunks/Y2W02VA38HZTN4QXBP5G',
 '/home/jovyan/temp_testing_delete/oisst/config.yaml',
 '/home/jovyan/temp_testing_delete/oisst/manifests/1JVH1WRN30A2E1SMP5P0',
 '/home/jovyan/temp_testing_delete/oisst/manifests/1TQF90293NY045SBEE7G',
 '/home/jovyan/temp_testing_delete/oisst/manifests/F3RS921X2MEZ0W81D760',
 '/home/jovyan/temp_testing_delete/oisst/manifests/G1DS554C2HCFTCEY8NY0',
 '/home/jovyan/temp_testing_delete/oisst/manifests/PC8DG6W8Z1PDGWZQ801G',
 '/home/jovyan/temp_testing_delete/oisst/manifests/QC3XTR2M6V4JK7X3SY40',
 '/home/jovyan/temp_testing_delete/oisst/manifests/Y1WM9X2BWA16ZYX84ZQG',
 '/home/jovyan/temp_testing_delete/oisst/refs/branch.main/ref.json',
 '/home/jovyan/temp_testing_delete/oisst/snapshots/1CECHNKREP0F1RSTCMT0',
 '/home/jovyan/temp_testing_delete/oisst/snapshots/C26JRME761RP8HA4CD7G',
 '/home/jovyan/temp_testing_delete/oisst/transactions/C26JRME761RP8HA4CD7G']
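Tallying the 14 paths above by top-level directory (just counting the listing, no new data):

```python
from collections import Counter

# The 14 relative paths from the listing above.
paths = [
    "chunks/A7XFVG1Z9MRNW83WHPK0",
    "chunks/Y2W02VA38HZTN4QXBP5G",
    "config.yaml",
    "manifests/1JVH1WRN30A2E1SMP5P0",
    "manifests/1TQF90293NY045SBEE7G",
    "manifests/F3RS921X2MEZ0W81D760",
    "manifests/G1DS554C2HCFTCEY8NY0",
    "manifests/PC8DG6W8Z1PDGWZQ801G",
    "manifests/QC3XTR2M6V4JK7X3SY40",
    "manifests/Y1WM9X2BWA16ZYX84ZQG",
    "refs/branch.main/ref.json",
    "snapshots/1CECHNKREP0F1RSTCMT0",
    "snapshots/C26JRME761RP8HA4CD7G",
    "transactions/C26JRME761RP8HA4CD7G",
]

counts = Counter(p.split("/")[0] for p in paths)
print(dict(counts))
# {'chunks': 2, 'config.yaml': 1, 'manifests': 7, 'refs': 1,
#  'snapshots': 2, 'transactions': 1}
```

So the commit machinery (config, refs, snapshots, transactions) is a fixed handful of files, the manifests appear to scale with the number of arrays rather than the number of input files, and the two entries under chunks/ are presumably the small coordinate chunks that were written natively rather than referenced virtually.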

On top of that, each file seems to contain only a single chunk! This is from the 30-file run:

(screenshot of the chunk layout omitted)
