Website doesn't provide guidance on how to reference PIDINST within data files #6

@paulmillar

In many (but not all) cases, scientific data, as obtained from an instrument, is written into a set of files. These data files follow some (typically single) format: the file format. File formats vary considerably in how well they are defined and in their flexibility, but a rather common feature of scientific file formats is the ability (at some level) to embed metadata; i.e., any ancillary information that is not itself a direct observation.

Scientific data is often managed within dataset catalogues, where a dataset represents some selection (maybe by some temporal, geographical, processing or organisational criteria) of the total available data. Dataset catalogues typically offer the ability to record metadata against the dataset. This is a place where it makes sense to include the PIDINST identifier(s) of the instrument(s) responsible for the observations/measurements that the dataset represents.

While scientific data files may be organised into datasets within catalogues, with metadata that includes PIDINST identifiers, such datasets are often fragile, in the sense that:

  • Many scientists use programs that access data through a filesystem, not directly from a catalogue.
  • Filesystems typically have no dataset concept, only files and directories/folders.
  • While some filesystems support user-driven metadata (e.g., extended attributes), such metadata systems are not widely used and are (typically) not populated with catalogue metadata. Moreover, such metadata may be lost when copying or moving data into a newly created directory/folder.
  • Non-filesystem-metadata approaches for capturing dataset metadata (e.g., sidecar files) may be "lost": left behind or discarded when data files are copied or selected.

Elaborating somewhat on the last two points, a researcher may freely select any subset of the data files with which to work and discard other "unnecessary" files; any sidecar file may be viewed as unnecessary. If they then subsequently treat this new set of files as a dataset (e.g., upload it into Zenodo) then any connection with the original dataset would require a manual (and non-trivial) process when entering the metadata for that new dataset. I suspect that, in practice, this won't happen unless it is somehow automated.

The result is that such subsequent "derived data" datasets (e.g., data uploaded into Zenodo) will no longer have PIDINST identifiers.
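The loss described above can be sketched with a toy example. This is only an illustration under my own assumptions: the file names, the `metadata.json` sidecar layout, and the identifier value are all hypothetical placeholders, not anything prescribed by PIDINST.

```python
import json
import shutil
import tempfile
from pathlib import Path

# Hypothetical placeholder, NOT a real instrument identifier.
INSTRUMENT_PID = "https://doi.org/10.0000/example-instrument"


def make_dataset(root: Path) -> Path:
    """Create a toy dataset: two data files plus a sidecar metadata file."""
    dataset = root / "original"
    dataset.mkdir()
    (dataset / "scan_001.dat").write_text("1.0 2.0 3.0\n")
    (dataset / "scan_002.dat").write_text("4.0 5.0 6.0\n")
    # Assumed sidecar layout: a JSON object with an "instrument" key.
    (dataset / "metadata.json").write_text(json.dumps({"instrument": INSTRUMENT_PID}))
    return dataset


def copy_data_files(src: Path, dst: Path) -> None:
    """Copy only the files a researcher might regard as 'the data'."""
    dst.mkdir()
    for path in sorted(src.glob("*.dat")):
        shutil.copy2(path, dst / path.name)


def find_instrument_pid(dataset: Path):
    """Return the identifier from the sidecar file, or None if it is missing."""
    sidecar = dataset / "metadata.json"
    if not sidecar.exists():
        return None
    return json.loads(sidecar.read_text()).get("instrument")


with tempfile.TemporaryDirectory() as tmp:
    original = make_dataset(Path(tmp))
    derived = Path(tmp) / "derived"
    copy_data_files(original, derived)
    original_pid = find_instrument_pid(original)  # identifier is found
    derived_pid = find_instrument_pid(derived)    # sidecar was left behind
```

Here the derived directory contains perfectly usable data files, but nothing in it records which instrument produced them.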

One possible approach to address this issue is to embed PIDINST identifiers into the individual files. While this will create redundancy if multiple data-focused files are present in the dataset, it greatly increases the likelihood of a PIDINST identifier being available from subsequently generated datasets.
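As a minimal sketch of what embedding could look like, the following writes the identifier into a comment header of a plain-text data file and reads it back after the file has been copied on its own. The `instrument_pid` header key, the `#` comment convention, and the identifier value are assumptions made for illustration; real formats (and any eventual PIDINST advice) may well use different mechanisms.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical placeholder, NOT a real instrument identifier.
INSTRUMENT_PID = "https://doi.org/10.0000/example-instrument"
# Hypothetical header key; no such convention is defined by PIDINST.
HEADER_KEY = "instrument_pid"


def write_data_file(path: Path, rows, instrument_pid: str) -> None:
    """Write comma-separated rows, preceded by a '# key: value' header line."""
    lines = [f"# {HEADER_KEY}: {instrument_pid}"]
    lines += [",".join(str(value) for value in row) for row in rows]
    path.write_text("\n".join(lines) + "\n")


def read_instrument_pid(path: Path):
    """Scan the leading comment lines and return the embedded identifier, if any."""
    with path.open() as handle:
        for line in handle:
            if not line.startswith("#"):
                break  # header ended; the rest is data
            # Split on the first colon only, so the URL's own colon is preserved.
            key, _, value = line[1:].partition(":")
            if key.strip() == HEADER_KEY:
                return value.strip()
    return None


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "scan_001.csv"
    write_data_file(source, [(1.0, 2.0), (3.0, 4.0)], INSTRUMENT_PID)

    # Copy the single file elsewhere, with no catalogue or sidecar alongside it.
    derived_dir = Path(tmp) / "derived"
    derived_dir.mkdir()
    copied = Path(shutil.copy2(source, derived_dir / source.name))

    embedded_pid = read_instrument_pid(copied)
```

Because the identifier travels inside the file, it survives the copy regardless of which other files are kept or discarded.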

I believe that, currently, the PIDINST documentation (e.g., website, white paper) provides no information on embedding identifiers within data files. I would consider this issue resolved when the website is updated so that there is some mention/discussion (however small) of the merits of embedding PIDINST identifiers within data files and some advice (again, however small) on how to achieve this.

The goal of this issue is to give a "home" (somewhere within the website) where more details may be added as this topic is explored further.
