[moved this initial conversation over from Slack -- we intend to discuss things here openly on github from now on!]
Cory Levinson
I'm spending some time today digging a bit deeper into the Zarr side of things so I have a better grasp of the inner workings of the geospatial side of this stack. Will report back here with some thoughts later.
Cory Levinson
It's what I thought we might use for a virtual Zarr dataset that points to chunks stored in IPLD. VirtualiZarr v1 used fsspec, which makes various storage backends look like a file system, but VirtualiZarr v2 uses obstore, which makes them look like an object store (a rough sketch of the difference is below).
I'll look a bit more into what you're talking about here too.
Originally I thought we were looking at doing something custom at the IPLD layer, to point to pre-existing Zarr chunks, as opposed to something custom at the Zarr layer (e.g. VirtualiZarr) to point to pre-existing IPLD chunks.
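For a concrete feel of the fsspec-vs-obstore distinction, here's a rough sketch of reading one Zarr chunk both ways (the bucket and key names are made up, the anonymous-access flags are assumptions, and this isn't VirtualiZarr's actual internals):

```python
import fsspec

# fsspec view: the backend looks like a file system, so a chunk read is an
# open() on a path followed by read().
fs = fsspec.filesystem("s3", anon=True)
with fs.open("some-bucket/data.zarr/temp/0.0.0") as f:
    chunk_via_fsspec = f.read()

# obstore view: the backend looks like an object store, so a chunk read is a
# keyed GET against a store object.
import obstore
from obstore.store import S3Store

store = S3Store("some-bucket", region="us-east-1", skip_signature=True)
chunk_via_obstore = obstore.get(store, "data.zarr/temp/0.0.0").bytes()
```

And to make the "custom at the Zarr layer" option concrete, one shape it could take is a kerchunk/VirtualiZarr-style reference set whose chunk entries point at IPFS gateway URLs instead of S3 keys (the CID, variable name, and byte lengths below are made up):

```python
# Hypothetical kerchunk-style reference set: Zarr metadata is inlined as JSON
# strings, and each chunk key maps to [url, offset, length]. Here the URL goes
# through an IPFS gateway, so the chunk bytes live as pre-existing IPLD blocks.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "temp/.zarray": '{"shape": [10, 90, 180], "chunks": [1, 90, 180], ...}',  # truncated
        "temp/0.0.0": [
            "https://ipfs.io/ipfs/QmExampleCID/temp/0.0.0",  # placeholder CID
            0,      # byte offset within that object
            64800,  # chunk length in bytes (placeholder)
        ],
    },
}
```

fsspec's reference filesystem (or a VirtualiZarr manifest store) could then present those refs to xarray like any other Zarr store.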
rsignell
oh, okay, well, likely I had the wrong idea...
Cory Levinson
I'm thinking we should probably look at all the approaches and map out their pros and cons.
And maybe write out some desired functionality / expected behavior
Cory Levinson
Highest-level user needs I can think of so far are:
archivers (people or orgs wanting to host / pin / seed the dataset) should easily be able to host a whole dataset without needing to think about individual chunks
consumers (people querying and doing analytics over the dataset) should easily be able to do their normal xarray / Python things without needing to download the entire dataset
I think things get a bit more complicated if we try to handle mutable datasets and want the ability for consumers to write updates to an IPFS x Zarr dataset. So maybe best for us to only focus on read-only datasets initially?
Cory Levinson
Here's some pre-existing work from the author of one of the Pangeo blog posts that may be worth looking at: https://github.com/d70-t/ipldstore
and here's the most recent thread I've found on the topic: https://discuss.ipfs.tech/t/working-with-shards-to-manage-2tb-dataset/19404
IPFS Forums: Working with Shards to Manage 2TB Dataset
Hey There! We are working on speeding up our retrievals with IPFS on zarr data storages. When you have a 2TB dataset, the manifest file of the key value pairs gets to be around 160MB. Meaning just to read the dataset you needed to load 160MB. This was obviously not feasible. So we introduced a hamt structure which divided this manifest into “blocks” and “levels” making the first fetch being 160MB / 256 = 0.625MB. Then you have to traverse the levels. so it might need a couple round trips
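Spelling out the arithmetic from that thread (assuming a flat ~160MB manifest and a fan-out of 256 per HAMT level; real key and CID sizes will shift the exact numbers):

```python
manifest_bytes = 160 * 1024 * 1024   # flat key -> CID manifest for the ~2TB dataset
fan_out = 256                        # entries per HAMT node

# One level of splitting means the first fetch is just the root node rather than
# the whole manifest, at the cost of an extra round trip per level traversed.
first_fetch = manifest_bytes / fan_out
print(f"{first_fetch / 1024 / 1024:.3f} MB")   # 0.625 MB (~640 KB) instead of 160 MB
```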
rsignell
“We are working on speeding up our retrievals with IPFS on zarr data storages” whoa cool!
Cory Levinson
Thoughts on what size datasets we should be targeting our solution for? If we are getting into the terabyte range then maybe we'll need to look in more detail at what this forum poster is doing.
Maybe for first experiments we want to start with datasets in the 1-10GB range to keep things simple?
Cory Levinson
So the simplest case of "what's the current state of all this stuff?" that I thought of, and which I think is a good place to start for general understanding, is to just throw some dummy Zarr stores on IPFS, try to access them in xarray via ipfsspec, and see how that all works.
I got a thing working, but it's just with random dummy Zarr data (random number arrays, etc.) -- a rough sketch is below.
Do either of y'all have an idea of good test Zarr datasets that would be nice for us to play around with? Maybe a few basic geospatial datasets at different sizes for different levels of testing? Maybe a few at <1MB, 10-50MB, 100-500MB, 1GB+?
While we wait for the $$ to come through and figure out proper benchmarking / infra costs, I can just pin these datasets on a personal server I have running already.
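For reference, a rough sketch of that dummy-data experiment (assumes a local IPFS daemon is running, that ipfsspec registers the ipfs:// protocol with fsspec, and that the CID is a placeholder):

```python
import numpy as np
import xarray as xr

# 1. Write a small dummy Zarr store locally (random number arrays).
ds = xr.Dataset(
    {"temp": (("time", "lat", "lon"), np.random.rand(10, 90, 180))},
    coords={"time": np.arange(10)},
)
ds.to_zarr("dummy.zarr", mode="w")

# 2. Add it to IPFS from the shell and note the root CID it prints:
#    $ ipfs add -r --cid-version=1 dummy.zarr

# 3. Read it back through ipfsspec; zarr/xarray resolve the URL via fsspec.
ds_ipfs = xr.open_zarr("ipfs://bafy-placeholder-cid", consolidated=False)
print(ds_ipfs)
```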
Cory Levinson
https://www.frontiersin.org/journals/climate/articles/10.3389/fclim.2021.782909/full
In the Future Outlook section of this journal article on Pangeo Forge:
The current development of Pangeo Forge is supported by a 3 year grant from the National Science Foundation (NSF) EarthCube program. Storage expenses are covered through our partnership with the Open Storage Network (OSN), which provides Pangeo Forge with 100 terabytes of cloud storage space, accessible over the S3 protocol for free (Public cloud storage buckets often implement a “requester-pays” model in which users are responsible for the cost of moving data; our OSN storage does not). All three major cloud providers offer programs for free hosting of public scientific datasets. We anticipate engaging in these programs as our storage needs grow. We have also begun to evaluate distributed, peer-to-peer storage systems such as the InterPlanetary FileSystem (IPFS) and Filecoin as an alternative storage option.
rsignell
Although I think Pangeo Forge slowed down when Ryan left Columbia and then Charles left the project, it looks like Raphael Hagen (@norlandrhagen) and Justus Magin (@keewis) are still involved with the project and might be able to fill us in.
Maybe we can check with them first, then post something to the Pangeo Discourse Forum stating what we believe the current state of the issue to be, and see if the community agrees (or knows more!)
rsignell
Some sample zarr datasets:
183MB:
76GB:
Cory Levinson
wdyt of getting a basic EC2 instance up in a US region (from the ESIP credits)? The server I'm using is a personal one in the Netherlands, and it's a shared box, so I think that may be part of the reason why my IPFS speeds were so horrendously slow yesterday...
rsignell
Would you be okay using Coiled?
Cory Levinson
As long as I can SSH in and we don't have firewalls set up, it should be fine.
Is a Coiled instance OK for me to muck around on and set up something to run in the background (e.g. a screen session or systemd service)?
Cory Levinson
Some very early benchmarking results, pulling the ITS_LIVE dataset from S3 vs IPFS: https://gist.github.com/clevinson/38ac8ce20dcf1be85299b1ed28ef90f0
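For context on what that gist measures, a minimal sketch of this kind of comparison (the S3 path and CID are placeholders; the actual gist may structure things differently):

```python
import time
import xarray as xr

def time_first_read(url, **open_kwargs):
    """Open a Zarr store and time pulling one element of its first data variable."""
    t0 = time.perf_counter()
    ds = xr.open_zarr(url, **open_kwargs)
    var = next(iter(ds.data_vars))
    ds[var].isel({dim: 0 for dim in ds[var].dims}).load()
    return time.perf_counter() - t0

s3_seconds = time_first_read(
    "s3://its-live-data/path/to/cube.zarr",  # placeholder path
    storage_options={"anon": True},
)
ipfs_seconds = time_first_read("ipfs://bafy-placeholder-cid")
print(f"s3: {s3_seconds:.1f}s   ipfs: {ipfs_seconds:.1f}s")
```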