clarify where to put file paths (e.g. ml-25m/ratings.csv)

During the 2024-03-20 Croissant Task Force meeting I asked where to put file paths (e.g. "ml-25m" for "ml-25m/ratings.csv"), and @benjelloun said to go ahead and create an issue to clarify the spec.

I understand that the spec is pretty clear in the case where a zip file is available and `contentUrl` can be used to show the paths to files within the zip. Here's an example from https://github.com/mlcommons/croissant/blob/v1.0.5/datasets/1.0/movielens/metadata.json that shows a file path of "ml-25m/ratings.csv":

```
  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "ml-25m-archive",
      "name": "ml-25m-archive",
      "contentUrl": "https://files.grouplens.org/datasets/movielens/ml-25m.zip",
      "encodingFormat": "application/zip",
      "sha256": "8b21cfb7eb1706b4ec0aac894368d90acf26ebdfb6aced3ebd4ad5bd1eb9c6aa"
    },
    {
      "@type": "cr:FileObject",
      "@id": "ratings-table",
      "name": "ratings-table",
      "containedIn": {
        "@id": "ml-25m-archive"
      },
      "contentUrl": "ml-25m/ratings.csv",
      "encodingFormat": "text/csv"
    },
```

However, while Dataverse can often provide a zip of all files in a dataset, files are increasingly large and zipping is expensive, so we plan to continue using `contentUrl` for direct links to the files. (Besides, why download an entire zip if you only need one file?) I say "continue" because, to support Google Dataset Search, we already provide the following, for example, in our Schema.org output:

```
{
  "@type": "DataDownload",
  "name": "2023-01-03.tab",
  "fileFormat": "text/tab-separated-values",
  "contentSize": 21865,
  "description": "Information on known Harvard repositories on GitHub, such as the number of stars, programming language, day last updated, number of open issues, size, number of forks, repository URL, create date, and description.",
  "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/6867331"
}
```
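
To make the gap concrete, here is a rough Python sketch of how an entry like the one above might be mapped to a Croissant `cr:FileObject` if `contentUrl` keeps the direct download link. The helper name `to_file_object`, the field mapping (`fileFormat` to `encodingFormat`), the use of the file name as `@id`, and the placeholder directory "ml-25m" are all my assumptions for illustration, not something the spec or Dataverse prescribes. Checksums are left out for brevity.

```python
import json

# Schema.org DataDownload entry from the example above (description omitted).
data_download = {
    "@type": "DataDownload",
    "name": "2023-01-03.tab",
    "fileFormat": "text/tab-separated-values",
    "contentSize": 21865,
    "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/6867331",
}


def to_file_object(download, directory_path):
    """Hypothetical mapping from a DataDownload dict to a cr:FileObject dict.

    Assumption: contentUrl stays a direct link, so it cannot also hold
    a relative path such as "ml-25m/ratings.csv".
    """
    file_object = {
        "@type": "cr:FileObject",
        "@id": download["name"],  # assumption: file name reused as the id
        "name": download["name"],
        "contentUrl": download["contentUrl"],
        "encodingFormat": download["fileFormat"],
    }
    # directory_path is intentionally not written anywhere: which field
    # should carry it is exactly the open question in this issue.
    return file_object


# "ml-25m" is only a placeholder directory borrowed from the MovieLens example.
file_object = to_file_object(data_download, directory_path="ml-25m")
print(json.dumps(file_object, indent=2))
```

With this mapping the direct link survives, but the directory part of the path is lost.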

So, if not `contentUrl`, which field should I use for the file path? Thanks!

