DOCS: Add subsection on number of files #1069
Conversation
Oh that spellcheck hook is clutch!
The main factor here will be sharding, which effectively packs multiple chunks into one file/object (a shard). Virtual datasets are kind of pre-sharded, in the sense that other file formats usually have multiple chunks per file. But it's not really the virtualness that reduces the number of files; it's the use of sharding. Icechunk supports sharding, but so does Zarr v3's native format, even without Icechunk. Icechunk does have some additional files compared to native Zarr to implement version control. I would have to defer to Seba/Deepak on whether your arithmetic is exactly right, but I think the number of those extra files will be a small correction compared to the effect of changing chunk sizes / using sharding.
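To make the sharding point concrete, here is a minimal sketch (not from this PR) of how Zarr v3's native sharding packs many chunks into a single stored object, assuming zarr-python >= 3; the store path and array sizes are made up for illustration:

```python
import zarr

# Hypothetical sharded array: 100 chunks total, but only 4 stored objects,
# because each shard packs a 5x5 block of chunks into one object.
arr = zarr.create_array(
    store="example_sharded.zarr",   # illustrative local path
    shape=(10_000, 10_000),
    shards=(5_000, 5_000),          # one stored object per shard
    chunks=(1_000, 1_000),          # 25 chunks packed inside each shard
    dtype="float32",
)
arr[:] = 1.0
```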
Oh, then I might have misunderstood things here? To clarify: I am taking the original files out of the equation here (since not a lot of folks will rewrite to Zarr/Icechunk and delete the netCDFs, for example).
The text is a 1:1 copy from Slack by @paraseba 😁, I just thought it would be good to expose it in the docs.
I'm just trying to clarify that if the question is "how many extra inodes will I take up by adding a fully virtual Icechunk store next to the original netCDFs?", then the calculation above is correct. I just wanted to flag it because I think the text here could explain this more clearly. (I also think this is a good thing to add to the docs!)
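Just to spell out the question I mean, here is a rough sketch of the inode accounting (both directory paths are hypothetical, not from this PR):

```python
import os

def count_files(root: str) -> int:
    """Count regular files (i.e. inodes used) under a directory tree."""
    return sum(len(files) for _, _, files in os.walk(root))

# The original netCDFs stay where they are; the fully virtual Icechunk repo
# only adds its own metadata/manifest objects on top of them.
n_original = count_files("/data/oisst_netcdfs")      # hypothetical path
n_extra = count_files("/data/oisst_icechunk_repo")   # hypothetical path
print(f"original netCDF files, untouched: {n_original}")
print(f"extra inodes from the virtual Icechunk store: {n_extra}")
```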
Hmmm, this goes against what I thought so far, but I might very well be wrong. So in that case, building the same virtual store, but from more files, would increase the number of objects/files stored in the Icechunk repo? I ran a little test based on the docs example:

```python
import fsspec
import icechunk
from virtualizarr import open_virtual_mfdataset
from virtualizarr.parsers import HDFParser
from obstore.store import S3Store

# Gather the source netCDF files from the public OISST bucket
fs = fsspec.filesystem('s3', anon=True)
oisst_files = fs.glob('s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.*.nc')
oisst_files = sorted(['s3://' + f for f in oisst_files])[0:3]  # 🛠️ Tune this and rerun. !!! Make sure to erase the local store before running again.
print(len(oisst_files))

store = S3Store(
    "noaa-cdr-sea-surface-temp-optimum-interpolation-pds",
    skip_signature=True,
    region='us-east-1',  # just guessed this...
)
parser = HDFParser()
virtual_ds = open_virtual_mfdataset(
    oisst_files,
    object_store=store,
    parser=parser,
    concat_dim='time',
    combine='nested',
    parallel='lithops',
    coords='minimal',
    compat='override',
    combine_attrs='override',
)

# Write the virtual dataset to a local Icechunk repo
storage = icechunk.local_filesystem_storage(
    path='temp_testing_delete/oisst',
)
config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        "s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds",
        icechunk.s3_store(region="us-east-1"),
    )
)
credentials = icechunk.containers_credentials(
    {"s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds": icechunk.s3_credentials(anonymous=True)}
)
repo = icechunk.Repository.create(storage, config, credentials)
session = repo.writable_session("main")
virtual_ds.virtualize.to_icechunk(session.store)
session.commit("My first virtual store!")

# Count how many files the local Icechunk repo actually contains
fs_local = fsspec.filesystem('local')
print(len(fs_local.find('temp_testing_delete/oisst')))
```

No matter whether I use 3 or 30 input files (I made sure to delete the entire local store each time), I get 14 files! On top of that, each file seems to contain only a single chunk! This is from the 30-file run:

Follow-up on a Slack discussion with @paraseba.
I think that pointing this out can make virtual stores much more attractive to HPC folks.
@rabernat, I would be curious if you think this all looks correct?