Loading datasets in a reproducible way into the world builder #935

@MFraters

With the inclusion of the litho1.0 dataset and the introduction of loading CPO datasets (#918), I think we need to think a bit more about how to maintain reproducibility with (external) datasets. Up to now, the world builder file has been everything needed to reproduce the full state of the world, but loading other datasets that are not directly in the file or the binary breaks this. I think we need to "solve" this issue before the next release (so luckily we still have a bit of time).

With the litho1.0 dataset, I just included it directly in the world builder, since it is not going to change and I considered it to be of sufficiently general use. This approach can work for published datasets which are not too large, but it increases both the repository and binary size significantly.

The CPO dataset loading is more about using local and possibly rapidly changing datasets which you want to experiment with. @Wang-yijun, please let me know if this is not correct for your case, but even then I can see how this might be a feature requested in the future.

So I think there are two different use cases:

  1. Integrating published datasets like litho1.0, but also topography, tomography, crustal thickness, gplates, etc. datasets.
  2. Integrating experimental datasets.

For case 1 I am thinking we could (also depending on what is allowed by the licenses):

  1. Let users define a URL to the dataset in a known and trusted repository such as Zenodo, which guarantees future availability. The user would supply a DOI or a defined name of the dataset they want to use, and the world builder world would, during the create world phase, check whether the data is there and intact (with a hash), and otherwise download and process it if needed.
  2. A more restrictive version would be that the world builder can only download named, whitelisted datasets (we will probably need to do some processing anyway if it comes straight from Zenodo).
  3. Maintain our own (GitHub?) repository or Git Large File Storage (Git LFS) with preprocessed data.
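To make options 1 and 2 a bit more concrete, here is a minimal Python sketch of the "check if the data is there and intact (with a hash), otherwise download" step, combined with a whitelist of named datasets. Everything named here (`DatasetEntry`, `ensure_dataset`, the registry layout, the cache directory) is hypothetical and only for illustration; it is not existing world builder code, and the real thing would presumably live in C++ and include a processing step.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, NamedTuple
from urllib.request import urlopen


class DatasetEntry(NamedTuple):
    """One whitelisted dataset (hypothetical registry layout)."""
    url: str      # download location, e.g. a file in a Zenodo record
    sha256: str   # expected SHA-256 hex digest of the file


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download(url: str) -> bytes:
    """Default fetcher; replaceable so the logic can be tested offline."""
    with urlopen(url) as response:
        return response.read()


def ensure_dataset(
    name: str,
    registry: Dict[str, DatasetEntry],
    cache_dir: str,
    fetch: Callable[[str], bytes] = download,
) -> Path:
    """Return the path to an intact local copy of a whitelisted dataset."""
    if name not in registry:
        raise KeyError(f"dataset '{name}' is not whitelisted")
    entry = registry[name]
    target = Path(cache_dir) / name

    # Reuse the cached copy only if its hash matches the registry.
    if target.is_file() and sha256_of(target) == entry.sha256:
        return target

    # Missing or corrupted: download again and verify before accepting it.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetch(entry.url))
    if sha256_of(target) != entry.sha256:
        target.unlink()
        raise ValueError(f"hash mismatch for dataset '{name}'")
    return target
```

With something like this, the world builder file would only need to contain the dataset name (or DOI), and the pinned hash in the registry is what keeps the result reproducible even though the data lives outside the file and the binary.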

For case 2 I do not think we can keep the reproducibility of the world builder file if we want to support that use case. We could make sure to always write out the world builder file, all used files and the output to an output folder, which would help, but would not solve the problem. The only solution I can think of is that we would need to

  1. mark the world builder file explicitly as non-reproducible, through a parameter in the world builder file, and
  2. explicitly build a non-reproducible world builder world, through an argument in the construction of the world
    to enable the loading of data in a way that makes the world builder file non-reproducible. This makes sure that there is a clear distinction between reproducible and non-reproducible world builder files, where the user has to actively opt in by setting both the world builder file and the world builder world to be non-reproducible before those features become available.
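A rough Python sketch of that double opt-in, just to illustrate the idea. The parameter name `"non-reproducible"`, the constructor argument `allow_non_reproducible`, and the method `load_local_dataset` are all made up for this example and do not exist in the world builder:

```python
import json


class NonReproducibleError(RuntimeError):
    """Raised when a non-reproducible feature is used without a full opt-in."""


class World:
    def __init__(self, world_file_text: str, allow_non_reproducible: bool = False):
        self.parameters = json.loads(world_file_text)
        # Opt-in 1: a parameter in the world builder file (hypothetical name).
        file_opt_in = self.parameters.get("non-reproducible", False) is True
        # Opt-in 2: the constructor argument. Both must be set; either one
        # alone leaves the world in reproducible mode.
        self.non_reproducible = file_opt_in and allow_non_reproducible
        self.local_datasets = []

    def load_local_dataset(self, path: str) -> None:
        """Register a local, possibly changing dataset (hypothetical feature)."""
        if not self.non_reproducible:
            raise NonReproducibleError(
                "loading local datasets requires opting in through both the "
                "world builder file and the World constructor"
            )
        self.local_datasets.append(path)
```

In this sketch a world that is only half opted in is still constructed and simply refuses the non-reproducible features; an alternative would be to fail already at construction when the file and the argument disagree.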

I would be happy to hear opinions on whether this makes sense or not. I think this may be of interest to (among others) @Wang-yijun, @Minerallo, @danieldouglas92, @tjhei, @alarshi, @lhy11009, @ljhwang, @mibillen, @gassmoeller, @jdannberg, @bangerth, and I would be interested to hear whether these issues have already been dealt with or even solved by others in other contexts.

This is also related to the discussion here: geodynamics/aspect#6866
