Repository scope? #15

mhpob · 2024-03-08T15:10:01Z

After recent discussions with @jdpye, @chrisholbrook, @benjaminhlina, @franksmithxyz, and others regarding surimi (#1), glatos, VR2 VRL and binary formats, VR3 formats, and others, there seems to be an appetite to create some sort of open repository of characteristic file types. As this repository is still in the concept stage, I want to put forth a few ideas and discussion points related to its intent and scope to see if it can fill this space.

This should be viewed as a draft to be edited, added to, knocked back, or refuted outright -- nothing is critical and things noted here could already be in place elsewhere or merit separate, dedicated support. It will be clear to anyone reading that I am blatantly inserting my own hopes and dreams within the myopic view of my own experiences. I'll update everything below as comments come in if this becomes a worthwhile forum.

Statement of problem

- Multiple acoustic telemetry data formats exist across vendors, networks, and investigators
- There is currently no open clearinghouse of these data formats
- Data formats have proven to be ephemeral, disappearing from common knowledge following industry-standard updates
- Data can be considerable in size, so including them within a package can preclude that package from some repositories
- An open repository of data formats is necessary for reproducible package and workflow testing and development

Intended scope

- From Use as an external data repository? (R-specific) #1: "this ... will serve as a repo for general industry-standard file types and ensuring that we or anyone who cares to can handle them properly"
- Language agnostic: Provide examples of current and legacy data formats, forms, and schemas, plus associated file metadata
- R-specific: Provide a CI/CD-controlled branch for use as a drat/R-Universe repository (see Use as an external data repository? (R-specific) #1)

Data types to be included (high-level)

Raw
- Vemco/Innovasea (Current: VRL, VDAT; Legacy: binary, text)
- Lotek
- ThelmaBiotel
- Sonotronics
Derived
- Vemco/Innovasea ("non-truth" VRL, CSV)
- OTN matched/unmatched/qualified/unqualified/other networks
- GLATOS and glatos
- ETN and etn
- IMOS
- Actel
- Deployment data, various forms
Forms
- OTN/FACT/MATOS/ETN/GLATOS metadata forms (multiple versions)
Schema
- OTN (multiple across Geoserver and exports)
- GLATOS

Possible structure (v0.2, mirroring the above)

demo-data/
  |-- Raw/
     |-- Vendor1/
          |-- InstrumentA/
               |-- version 1.0/
                    |-- file123.ext
                    |-- file123.md (metadata: markdown? XML?)
                    |-- CITATION.cff (how to cite the data source)
          |-- ...
     |-- Vendor2/
     |-- ...
  |-- Derived/
     |-- Network1/
     |-- Network2/
     |-- ...
  |-- Schema/
  |-- Forms/
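
A rough sketch of how the outline above could be stood up and checked, in Python only because it is language agnostic enough for illustration. Every name here (`Vendor1`, `file123.ext`, the `.md` sidecar convention, the helper functions) is a placeholder or an assumption taken from the draft outline, not an agreed layout.

```python
import tempfile
from pathlib import Path

# Placeholder entries copied from the draft outline; a trailing "/" marks a
# directory. None of these are real vendors, networks, or files.
LAYOUT = [
    "Raw/Vendor1/InstrumentA/version 1.0/file123.ext",
    "Raw/Vendor1/InstrumentA/version 1.0/file123.md",
    "Raw/Vendor1/InstrumentA/version 1.0/CITATION.cff",
    "Raw/Vendor2/",
    "Derived/Network1/",
    "Derived/Network2/",
    "Schema/",
    "Forms/",
]


def build_skeleton(root: Path, layout=LAYOUT) -> None:
    """Create empty files and directories for each entry in the layout."""
    for entry in layout:
        path = root / entry
        if entry.endswith("/"):
            path.mkdir(parents=True, exist_ok=True)
        else:
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()


def missing_sidecars(root: Path) -> list:
    """List data files under Raw/ that have no same-named .md metadata file.

    Assumes the (hypothetical) convention that file123.ext is described by
    file123.md; .md and .cff files themselves are not treated as data.
    """
    return sorted(
        p
        for p in (root / "Raw").rglob("*")
        if p.is_file()
        and p.suffix not in {".md", ".cff"}
        and not p.with_suffix(".md").exists()
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "demo-data"
        build_skeleton(root)
        for p in sorted(root.rglob("*")):
            print(p.relative_to(root))
        print("files missing metadata:", missing_sidecars(root))
```

A check like `missing_sidecars` is the kind of thing that could run on pull requests, so that a contributed file without metadata is flagged before it is merged.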

Is it better to organize according to network rather than data type?

jdpye (Maintainer) · 2024-03-08T16:38:18Z

I endorse the structure for the filesystem that you're laying out here. Network is probably correct as the higher-order folder for derived dataset formats: groups like GLATOS combine data in a single omnibus workbook, while other networks split them into separate files, all needing their own examples.

mhpob (Author) · 2024-03-08T17:28:43Z

@jdpye the first draft didn't respect the spacing of the outlined repo structure, so I put it into a code block. I also expanded some sub-directories and changed the note to "v0.2". Is it still agreeable as it's now outlined?

benjaminhlina · 2024-03-08T21:54:59Z

This looks great @mhpob, and the overall repo structure makes sense to me. The little telemetry-workflow repo I made to help students in the Cooke lab follows a similar structure to the one you've put forth here, allowing the user to find things by vendor or network, which I like.

chrisholbrook · 2024-03-11T14:26:53Z

+1 for overall structure (structure around vendor/network). Our early scoping of a "characteristic" raw file set for GLATOS was rather daunting due to the number of possible combinations of receiver model, firmware, code map, transmitter options (e.g., various sensors), receiver options (e.g., internal transmitter settings, various receiver sensors), offload software, preference for small files, and desire to include files with errors/issues. Our next step (WIP) is to create a table/list of desired characteristics, then go out in search of files that meet each.
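
One way the table of desired characteristics could be used once it exists is as a checklist against candidate files. The sketch below is only that: the characteristic labels, the `ExampleFile` class, and the `coverage` helper are all hypothetical, and the real table may well need more structure than a flat set of labels.

```python
from dataclasses import dataclass, field

# Hypothetical wish list: each label is one characteristic that at least one
# example file should exhibit. Labels are illustrative only.
DESIRED = {
    "receiver_model:ModelA",
    "receiver_model:ModelB",
    "sensor:SensorA",
    "issue:IssueA",
}


@dataclass
class ExampleFile:
    name: str
    characteristics: set = field(default_factory=set)


def coverage(desired, files):
    """Return (covered, still_wanted) given the files found so far."""
    have = set().union(*(f.characteristics for f in files))
    return desired & have, desired - have


if __name__ == "__main__":
    found = [
        ExampleFile("a.vrl", {"receiver_model:ModelA", "sensor:SensorA"}),
    ]
    covered, wanted = coverage(DESIRED, found)
    print("covered:     ", sorted(covered))
    print("still wanted:", sorted(wanted))
```

The "still wanted" set is the part worth publishing, since it tells contributors exactly which kinds of files to go looking for.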

mhpob (Author) · 2024-03-11T14:53:25Z

> ...number of possible combinations of receiver model, firmware, code map, transmitter options (e.g., various sensors), receiver options (e.g., internal transmitter settings, various receiver sensors), offload software, preference for small files, and desire to include files with errors/issues

Great idea to include various iterations. Does seem rather daunting as it immediately makes this kind of project quite large... possibly beyond the capability of a GitHub repo. The benefit of an open repo would be the ability to crowd-source some of these files via pull requests while maintaining an open record of the transaction.

> Our next step (WIP) is to create a table/list of desired characteristics, then go out in search for files that meet each.

Currently the "searching" may be the bear that leans a lot on your time and energy. Might there be merit to putting the list out there via something like this and seeing what is submitted to you?

mhpob (Author) · 2024-03-11T14:55:38Z

GitHub recommends keeping repos ideally under 1 GB, and strongly recommends staying under 5 GB, for performance reasons. Going beyond that may also wind up requiring some hands-on management and git-fu:
https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github
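
To keep an eye on this as files come in, something like the following could run locally or in CI. It is a sketch under assumptions: the two thresholds are my reading of the guidance linked above expressed in decimal gigabytes, the function names are made up, and it measures only the working tree (history in `.git` also counts toward the real repository size).

```python
import tempfile
from pathlib import Path

GB = 10 ** 9
IDEAL_LIMIT = 1 * GB   # assumed "ideally under" threshold
STRONG_LIMIT = 5 * GB  # assumed "strongly recommended" ceiling


def tree_size(root: Path) -> int:
    """Total bytes of regular files under root, skipping the .git directory."""
    return sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and ".git" not in p.relative_to(root).parts
    )


def size_status(n_bytes: int, ideal: int = IDEAL_LIMIT, strong: int = STRONG_LIMIT) -> str:
    """Classify a size against the two assumed thresholds."""
    if n_bytes < ideal:
        return "ok"
    if n_bytes < strong:
        return "large"
    return "too large"


if __name__ == "__main__":
    # Tiny self-contained demo on a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "example.bin").write_bytes(b"\0" * 2048)
        size = tree_size(root)
        print(size, "bytes:", size_status(size))
```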

chrisholbrook · 2024-03-11T15:02:07Z

So I guess my question is which files, specifically, do you/we really want here? E.g., the GLATOS data system has >20,000 VRL files. So identifying an optimal set will require first identifying desired characteristics/features.

mhpob (Author) · 2024-03-11T15:23:46Z

That's a really critical question, and highlights that I completely left any transmitter examples off of that repo structure. My initial thought is that a complete data example library (individual files for receiver x transmitter x software options) is the holy grail, but the possibility of that is questionable at best.

- Highly desirable on "history of science" principles
- Having one representative file each already puts us over 5K files using combinations of the off-the-cuff iterations you noted above
- Based on size alone, that may be beyond the scope of something hosted on GitHub, and so beyond the scope of this repo
- However, since transmitters are really only viewed through the lens of the receiver and decoder that logged them, receiver x firmware combinations should drastically reduce the number of files needed, as long as a transmitter type was detected
- That may put us back in the size/scope of a GitHub repo
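
To make the full-grid versus receiver x firmware comparison concrete, here is a small sketch. The option lists are invented and deliberately tiny, so the counts it produces say nothing about the real numbers; only the shape of the argument carries over.

```python
from itertools import product
from math import prod

# Hypothetical options per characteristic; real lists would come from the
# table of desired characteristics, not from this sketch.
OPTIONS = {
    "receiver_model": ["R1", "R2", "R3"],
    "firmware": ["fw1", "fw2"],
    "code_map": ["m1", "m2"],
    "transmitter": ["t1", "t2", "t3"],
    "offload_software": ["s1", "s2"],
}


def n_combinations(options, keys=None):
    """Number of files needed for one representative per combination."""
    keys = list(options) if keys is None else keys
    return prod(len(options[k]) for k in keys)


def combinations(options, keys):
    """Enumerate the combinations as dicts, e.g. to seed a wish list."""
    return [dict(zip(keys, combo)) for combo in product(*(options[k] for k in keys))]


if __name__ == "__main__":
    print("full grid:          ", n_combinations(OPTIONS))
    print("receiver x firmware:", n_combinations(OPTIONS, ["receiver_model", "firmware"]))
    for combo in combinations(OPTIONS, ["receiver_model", "firmware"]):
        print(combo)
```

The full grid grows multiplicatively with every added characteristic, while restricting to two of them keeps the target list short enough to track by hand.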

What I think we're getting at is a metadata question -- can we design applicable metadata to not only log what we do have, but log its deficiencies? I.e., can we stand up something that's good-enough, but "not let the perfect be the enemy of the good"?

- data-repo v0.0.1 has a VRL that contains X and Y but not Z, as noted in the metadata.
- Scientist A provides a PR that has a VRL with X, Y, and Z.
- Cue v0.1.0, or whatever the appropriate versioning is, with the new VRL and metadata.
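
The scenario above could be modelled roughly as follows. This is a sketch with assumptions flagged: the `FileRecord` and `Catalogue` classes are hypothetical, and the rule "closing a known gap is a minor bump, anything else is a patch" is just one possible versioning policy, not something decided here.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    path: str
    contains: set
    lacks: set = field(default_factory=set)  # deficiencies logged explicitly


def bump(version: str, part: str) -> str:
    """Bump a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(x) for x in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part}")


@dataclass
class Catalogue:
    version: str
    wanted: set
    records: list = field(default_factory=list)

    def gaps(self) -> set:
        """Wanted features that no catalogued file contains yet."""
        have = set().union(*(r.contains for r in self.records))
        return self.wanted - have

    def add(self, path: str, contains: set) -> None:
        before = self.gaps()
        self.records.append(FileRecord(path, set(contains), self.wanted - set(contains)))
        # Assumed policy: closing a known gap is a minor bump, otherwise a patch.
        part = "minor" if self.gaps() < before else "patch"
        self.version = bump(self.version, part)


if __name__ == "__main__":
    cat = Catalogue("0.0.1", {"X", "Y", "Z"}, [FileRecord("first.vrl", {"X", "Y"}, {"Z"})])
    print(cat.version, "gaps:", sorted(cat.gaps()))
    cat.add("second.vrl", {"X", "Y", "Z"})  # the hypothetical PR
    print(cat.version, "gaps:", sorted(cat.gaps()))
```

Recording `lacks` alongside `contains` is what lets the metadata state what a file does not cover, which is the "good enough, with known deficiencies" idea.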

jdpye (Maintainer) · 2024-03-11T16:57:03Z

I swear I was authoring a reply that featured the phrase 'don't let the perfect be the enemy of the good' and I let it languish in a tab.

I agree that we should take the files we're currently leaning on for testing glatos / surimi / remora / TelemetryWorkflow / etc., then allow users to supply extra files that don't yet exist on an as-needed basis and roll them into the mix. If we outgrow a regular repo, there's the Git LFS option on GitHub, or there are other WAF-ish things we could try. But for now this has the right mix of 'others can suggest updates' and 'we know how it versions things and generally how it works' to be a reasonable solve.

mhpob (Author) · 2024-03-13T14:30:29Z

@jdpye I know I've already stepped all over your toes here, but would you accept PRs following the guidance above to start fleshing this thing out?

mhpob (Author) · 2024-03-13T14:39:32Z

Also, since the OTNDC is basically one big metadata factory -- any views on what that structure should be? Tabular is human readable but maybe not the most efficient; XML is all over the place but it's not super approachable (at least to me); JSON might be a compromise that would also slot into an API; some CI/CD that takes one and creates the others?
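
As a sketch of the "take one and create the others" idea, the snippet below treats JSON as the source of truth and derives tabular (CSV) and XML views using only the Python standard library. The field names and the `records`/`record` element names are invented for illustration; no metadata schema has been agreed.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical metadata for one example file; field names are placeholders.
RECORDS_JSON = """
[
  {"path": "file123.ext", "vendor": "Vendor1", "instrument": "InstrumentA", "version": "1.0"}
]
"""


def to_csv(records) -> str:
    """Render a list of flat dicts as CSV, using the first record's keys as the header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_xml(records) -> str:
    """Render a list of flat dicts as <records><record>...</record></records>."""
    root = ET.Element("records")
    for rec in records:
        node = ET.SubElement(root, "record")
        for key, value in rec.items():
            ET.SubElement(node, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    records = json.loads(RECORDS_JSON)
    print(to_csv(records))
    print(to_xml(records))
```

This only works cleanly while the metadata stays flat; nested fields would need a decision on how they map to columns.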

jdpye (Maintainer) · 2024-03-13T15:09:24Z

I don't mind a PR one bit, I just picked a poor week to take vacation. :)

I'm also a big YAML fan.

mhpob (Author) · 2024-03-13T18:05:15Z

Possibly useful reference; R-centric: https://music.dataobservatory.eu/documents/open_music_europe/dataset-development/dataset-working-paper.html

mhpob (Author) · 2024-03-14T12:06:47Z

Re: building a package in another repo based on changes in this one
https://medium.com/hostspaceng/triggering-workflows-in-another-repository-with-github-actions-4f581f8e0ceb
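
One mechanism GitHub offers for this is the `repository_dispatch` event, sent through the REST API; whether that matches the approach in the linked post is an assumption on my part. The sketch below only builds the request and does not send it. The owner, repo, event type, and payload values are placeholders, and a real call needs a token with access to the target repository.

```python
import json
import urllib.request


def build_dispatch_request(owner, repo, token, event_type, payload=None):
    """Build (but do not send) a repository_dispatch request for owner/repo."""
    url = f"https://api.github.com/repos/{owner}/{repo}/dispatches"
    body = json.dumps(
        {"event_type": event_type, "client_payload": payload or {}}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    req = build_dispatch_request(
        owner="example-org",
        repo="example-package",
        token="<token>",
        event_type="data-updated",
        payload={"data_version": "0.1.0"},
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # urllib.request.urlopen(req) would actually send it.
```

On the receiving side, the package repo's workflow would need to listen for the same event type under `on: repository_dispatch`.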
