Website doesn't provide guidance on how to reference PIDINST within data files

In many (but not all) cases, scientific data, as obtained from an instrument, is written into a set of files.  These data files follow some (typically single) format: the file format.  File formats vary considerably in how well they are defined and in their flexibility, but a rather common feature of scientific file formats is the ability (at some level) to embed metadata; i.e., any ancillary information that is not itself a direct observation.

Scientific data is often managed within dataset catalogues, where a dataset represents some selection (maybe by some temporal, geographical, processing or organisation criteria) of the total available data.  Dataset catalogues typically offer the ability to record metadata against the dataset.  This is a place where it makes sense to include the PIDINST identifier(s) of the instrument(s) responsible for the observations/measurements that the dataset represents.

While scientific data files may be organised into datasets within catalogues, with metadata that includes PIDINST identifiers, the association between the files and that metadata is often fragile, in the sense that:
  * Many scientists use programs that access data through a filesystem, not directly from a catalogue.
  * Filesystems typically have no dataset concept, only files and directories/folders.
  * While some filesystems support user-driven metadata (e.g., extended attributes), such metadata systems are not widely used and are (typically) not populated with catalogue metadata.  Moreover, such metadata may be lost when copying or moving data into a newly created directory/folder.
  * Non-filesystem-metadata approaches for capturing dataset metadata (e.g., sidecar files) may be "lost".

Elaborating somewhat on the last two points, a researcher may freely select any subset of the data files with which to work and discard other "unnecessary" files; any sidecar file may be viewed as unnecessary.  If they then subsequently treat this new set of files as a dataset (e.g., upload it into Zenodo) then any connection with the original dataset would require a manual (and non-trivial) process when entering the metadata for that new dataset.  I suspect that, in practice, this won't happen unless it is somehow automated.

The result is that such subsequent "derived data" datasets (e.g., data uploaded into Zenodo) will no longer have PIDINST identifiers.

One possible approach to address this issue is to embed PIDINST identifiers into the individual files.  While this will create redundancy if multiple data-focused files are present in the dataset, it greatly increases the likelihood of a PIDINST identifier being available from subsequently generated datasets.
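To make the idea concrete, here is a minimal sketch of what embedding might look like.  It is only an illustration under several assumptions: the data files are plain JSON (a stand-in for a real scientific format), the metadata key `instrument_pid` is a name I made up (PIDINST does not currently define one), and the DOI is a made-up placeholder, not a registered identifier.  The sketch writes the identifier into every data file, then shows that it can still be recovered after a researcher keeps a single file and discards the sidecar.

```python
import json
import tempfile
from pathlib import Path

# Made-up placeholder, NOT a registered PIDINST identifier.
INSTRUMENT_PID = "https://doi.org/10.0000/example-instrument"

# Hypothetical key name; no standard key is defined for this purpose.
PID_KEY = "instrument_pid"


def write_data_file(path, observations, instrument_pid):
    """Write observations to a JSON file with the instrument PID embedded."""
    document = {
        "metadata": {PID_KEY: instrument_pid},
        "observations": observations,
    }
    path.write_text(json.dumps(document, indent=2))


def collect_instrument_pids(paths):
    """Return the sorted, de-duplicated instrument PIDs found in the files."""
    pids = set()
    for path in paths:
        metadata = json.loads(path.read_text()).get("metadata", {})
        if PID_KEY in metadata:
            pids.add(metadata[PID_KEY])
    return sorted(pids)


def demo():
    """Simulate a researcher keeping one data file and dropping the sidecar."""
    with tempfile.TemporaryDirectory() as tmp:
        dataset = Path(tmp)
        # Original dataset: three data files, each carrying the identifier.
        for i in range(3):
            write_data_file(dataset / f"scan_{i}.json", [i, i + 1], INSTRUMENT_PID)
        # A dataset-level sidecar file, which the researcher will not keep.
        sidecar = dataset / "dataset_metadata.json"
        sidecar.write_text(json.dumps({PID_KEY: INSTRUMENT_PID}))

        # Derived "dataset": a single data file, no sidecar.
        subset = [dataset / "scan_1.json"]
        return collect_instrument_pids(subset)


if __name__ == "__main__":
    print(demo())
```

The redundancy mentioned above is visible here: all three files repeat the same identifier.  The benefit is that an upload tool could call something like `collect_instrument_pids` on whatever files it is given and pre-fill the instrument metadata of the new dataset, with no manual step.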

I believe that, currently, the PIDINST documentation (e.g., website, white paper) provides no information on embedding identifiers within data files.  I would consider _this_ issue resolved when the website is updated so that there is some mention/discussion (however small) of the merits of embedding PIDINST identifiers within data files and some advice (again, however small) on how to achieve this.

The goal of this issue is to give a "home" (somewhere within the website) where more details may be added as this topic is explored further.
