This thread is meant to talk through the different ways one might approach storage of cloud-optimized geospatial datasets (thinking large array data) from the point of view of an IPFS "data provider" / host.
Things we should think about:
- Which best practices, node configuration, and hosting infrastructure make sense?
- What kinds of trade-offs / goals should we be considering when exploring different approaches?
One thing I'm wondering is what configuration we want with respect to IPFS's datastore. In IPFS's default configuration, data is copied in full into the IPFS internal datastore (usually somewhere like ~/.ipfs/....). Two alternative options (both experimental features) are listed below, with a rough usage sketch after the list:
- `ipfs filestore`: files are not copied to the datastore; instead, the existing files on disk are used to deliver content to other nodes on the network
- `ipfs urlstore`: files are not copied to the datastore, but are retrieved from a URL over HTTP
  - maybe this could work with existing cloud-optimized datasets on S3 if our goal is simply to expose pre-existing datasets over the IPFS network with CIDs?
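For concreteness, here's a minimal sketch of enabling and using both modes with the kubo CLI, driven from Python via subprocess. The local file path and S3 URL are placeholders, and the exact commands/flags (e.g. `ipfs urlstore add`) may differ across kubo versions, so treat this as an assumption-laden illustration rather than a recipe:

```python
import subprocess

def ipfs(*args: str) -> str:
    """Run an ipfs CLI command and return its stdout (assumes kubo is installed and the repo is initialized)."""
    return subprocess.run(["ipfs", *args], check=True, capture_output=True, text=True).stdout.strip()

# Filestore: enable the experimental feature, then add a file without copying it into
# the datastore; --nocopy keeps only block references back to the file on disk.
# (If a daemon is running, it needs a restart after the config change.)
ipfs("config", "--json", "Experimental.FilestoreEnabled", "true")
file_cid = ipfs("add", "--nocopy", "--quieter", "/data/example.nc")  # hypothetical local file

# Urlstore: enable the experimental feature, then register a URL so the node serves
# those bytes by fetching them over HTTP on demand instead of storing them locally.
ipfs("config", "--json", "Experimental.UrlstoreEnabled", "true")
url_cid = ipfs("urlstore", "add", "https://example-bucket.s3.amazonaws.com/example.nc")  # hypothetical URL

print(file_cid, url_cid)
```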
I get the sense that the biggest place for impact is actually in the direction of what @rsignell was initially proposing: doing something similar to virtualizarr, so that non-ARCO datasets (e.g. netCDF) can be accessed via range requests when stored on IPFS. These IPFS "range requests" would crawl the IPLD DAG to fetch a subset of IPFS blocks from a dataset, similar to HTTP range requests in existing cloud-optimized geospatial data workflows.
If that ends up being our goal, we probably want to leverage either `ipfs filestore` or `ipfs urlstore` (see above) so that these large datasets don't have to get copied over. If users are comfortable duplicating the data from a large netCDF file into the IPFS datastore, then they might as well reformat it as zarr (which, from my first round of notebook experiments in #1, already plays pretty nicely with IPFS/IPLD). A rough sketch of what such a "range request" might look like follows.
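To make that concrete, here's a minimal sketch assuming a virtualizarr-style chunk manifest (mapping each zarr chunk key to a CID plus byte offset/length, instead of a URL plus offset/length) and a local kubo gateway; the CID, offsets, variable name, and gateway address are all placeholder assumptions:

```python
import requests

# Hypothetical virtualizarr-style manifest: each zarr chunk key points at the CID of the
# original netCDF file plus the byte range of that chunk within it (placeholder values).
chunk_manifest = {
    "temperature/0.0.0": {"cid": "bafy...example", "offset": 4096, "length": 1_048_576},
}

GATEWAY = "http://127.0.0.1:8080"  # local kubo gateway; any trustless gateway should work

def read_chunk(key: str) -> bytes:
    """Fetch one chunk's raw bytes via an HTTP Range request to an IPFS gateway.

    The gateway resolves the CID, walks the UnixFS/IPLD DAG, and returns only the blocks
    covering the requested byte range -- the IPFS analogue of an S3 range read.
    """
    entry = chunk_manifest[key]
    headers = {"Range": f"bytes={entry['offset']}-{entry['offset'] + entry['length'] - 1}"}
    resp = requests.get(f"{GATEWAY}/ipfs/{entry['cid']}", headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.content

# raw = read_chunk("temperature/0.0.0")  # then decode/decompress as the zarr codec pipeline would
```

The same subset read could presumably also be done without a gateway (e.g. `ipfs cat` with `--offset`/`--length`), but going through HTTP keeps the access pattern close to existing S3-style range-request workflows.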