Commit dd09763

Merge pull request #299 from justinkadi/main: Updating large dataset method section
2 parents f34cce6 + ad72607

File tree: 4 files changed, +131 −32 lines
Lines changed: 17 additions & 0 deletions
## Example Dataset

Here is an example of a published dataset that uses the large dataset method:

https://arcticdata.io/catalog/view/doi:10.18739/A2X63B701

In the example dataset, you can see:

1. A link in the abstract that directs to the folder containing all of the data.

![](../images/large_dataset_abstract.png)

2. Generalized entities, each representative of a type of file hosted on the server.

![](../images/large_dataset_entities.png)
Lines changed: 31 additions & 0 deletions
## Uploading files to `datateam`

We'll use the large dataset method when a submitter has a dataset with around 700 or more files, since uploading that many files through the website would take too long. Once we confirm with a PI that their submission has 700+ files, we give them instructions for uploading all of their files directly to our `datateam` server. Eventually, we'll move these files to a web-accessible folder so that viewers can access them directly from our server instead of through Metacat.

### Instructions for uploading files to `datateam`

To have submitters upload their files to our server, we give them access through SSH (Secure Shell). Please reach out to a Data Coordinator for the credentials of the `visitor` account on the `datateam` server.

If the submitter is familiar with SSH, we can give them the credentials and ask them to create a folder under the home directory `/home/visitor` for all of their files. If they are not familiar, we give them the following instructions for using a GUI client for SSH:
> Cyberduck is free software that you can download at <https://cyberduck.io/download/>. After installing it, follow the directions below:
>
> 1. Open Cyberduck.
>
> 2. Click "Open Connection".
>
> 3. From the drop-down, choose "SFTP (Secure File Transfer Protocol)".
>
> 4. Enter "datateam.nceas.ucsb.edu" for Server.
>
> 5. Enter your username and password (ask a Data Coordinator).
>
> 6. Click "Connect".
>
> 7. Create the folder `/home/visitor/submitter_name` and upload your files there.
### Moving files within `datateam`

Eventually, we'll move the files from `/home/visitor` to the `/var/data` shoulder. While we are curating the dataset, we'll keep the files in `/var/data/curation`.

Then, when we are ready for the files to be in a web-accessible location (i.e. when the dataset is ready to be published), we'll move them to `/var/data/10.18739/preIssuedDOI/`. Before we can create the `preIssuedDOI` directory, we'll need to mint a pre-issued DOI for the dataset.
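The moves above can be sketched in the shell. This is a minimal sketch only: it runs in a temporary directory standing in for the server filesystem, and the submitter folder name, DOI suffix, and file name are all placeholders, not values from a real dataset.

```shell
# Demo in a temporary directory; on datateam the real roots are
# /home/visitor and /var/data.
ROOT=$(mktemp -d)
SUBMITTER="$ROOT/home/visitor/submitter_name"
CURATION="$ROOT/var/data/curation/submitter_name"
PUBLISH="$ROOT/var/data/10.18739/A2XXXXXXX"   # hypothetical pre-issued DOI

mkdir -p "$SUBMITTER"
touch "$SUBMITTER/lake_2020-06-01.csv"        # stand-in data file

# While curating the dataset:
mkdir -p "$CURATION"
mv "$SUBMITTER"/* "$CURATION"/

# When the dataset is ready to be published (web-accessible location):
mkdir -p "$PUBLISH"
mv "$CURATION"/* "$PUBLISH"/
```

On the real server, substitute the submitter's folder and the minted pre-issued DOI for the placeholders.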
Lines changed: 83 additions & 0 deletions
## Processing

As seen in the example dataset, the majority of the data files will be hosted directly on our server. To still have metadata describing the contents of the data package, we'll create generalized, representative entities that describe the files.

### Understanding the contents of a large data submission

First, we'll need to inspect the files and folders submitted by the PI. Ideally, the PI will have organized the files in a comprehensible folder hierarchy with a file naming convention. Asking the PI to provide the file and folder naming convention for all of their submitted files is helpful, as it allows us to create one representative entity for each type of file they have.

We'll also need the PI to submit a dataset, as normal, through the ADC. Since they'll have uploaded their files directly to our server, they won't need to attach any files to their metadata submission.

### Creating representative entities

There are two ways to create representative entities:
#### 1. Creating entities through R

To create the representative entities through R, we'll use the `EML` package. In this example, we'll create a general `dataTable` entity:

```{r, eval=FALSE}
library(EML)

dataTable1 <- eml$dataTable(
  entityName = "[region_name]/[lake_name]_[YYYY-MM-DD].csv",
  entityDescription = "These CSV files contain lake measurement information. YYYY-MM-DD represents the date of measurement.")

doc$dataset$dataTable[[1]] <- dataTable1
```

Now, we can add an attribute table to this entity to document the files' metadata:

```{r, eval=FALSE}
atts <- EML::shiny_attributes()
doc$dataset$dataTable[[1]]$attributeList <- EML::set_attributes(attributes = atts$attributes, factors = atts$factors)
```
Similarly, we can create representative entities for an `otherEntity`, `spatialVector`, or `spatialRaster`. The required sections for each of these entity types differ slightly, so be sure to check the [EML schema documentation](https://eml.ecoinformatics.org/schema/) to confirm that all of the required sections are present. Otherwise, the EML will not validate.

For example, you'll need to include a coordinate reference system for a `spatialVector` or `spatialRaster`.
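As a sketch of what that looks like, a generalized `spatialVector` entity might be built as below. This is an assumption-laden illustration, not taken from this document: the entity name and description are placeholders, and the element names (`geometry`, `spatialReference`, `horizCoordSysName`) should be checked against the EML schema documentation linked above before use.

```{r, eval=FALSE}
# Sketch: a generalized spatialVector entity with its required
# coordinate reference system. All values here are placeholders.
spatialVector1 <- eml$spatialVector(
  entityName = "[region_name]/lake_boundaries_[YYYY].shp",
  entityDescription = "Shapefiles of lake boundaries. YYYY represents the survey year.",
  geometry = "Polygon",
  spatialReference = eml$spatialReference(horizCoordSysName = "GCS_WGS_1984"))

doc$dataset$spatialVector[[1]] <- spatialVector1
```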
#### 2. Creating entities through the ADC

Another way to create representative entities is to:

1. Download one of each type of file from the dataset.

2. Upload each file into the data package through the ADC web editor.

3. Add file descriptions and attributes as normal.

4. Change the `entityName` of each file to something more general that shows the file and folder naming convention.

5. Before the dataset is published, remove any physicals, then remove the files from the data package using `arcticdatautils::updateResourceMap()`. This removes the file PIDs from the resource map but leaves the entities in the EML document. Please ask a Data Coordinator if you have not used `updateResourceMap()` before.
### Moving files to a web-accessible location

As mentioned in the section "Uploading files to `datateam`", we'll need to move the files to their web-accessible location in `/var/data/10.18739/preIssuedDOI`.
### Adding a Markdown link to the abstract in EML

We'll need to provide a link to the files hosted on our server so that viewers can download them. The link to the dataset will be:

> http://arcticdata.io/data/preIssuedDOI

To add this as a clickable Markdown link in the abstract of an EML document, we'll need to:

1. Open the EML document in a text editor.

2. Navigate to the `<abstract>` section.

3. Inside the `<abstract>` section, add a `<markdown> ... </markdown>` section.

4. Add the Markdown-formatted text without any indentation.

This will look like:
```
<abstract>
<markdown>
### Access
Files can be accessed and downloaded from the directory via: [http://arcticdata.io/data/10.18739/DOI](http://arcticdata.io/data/10.18739/DOI).

### Overview
This is the original abstract overview that the PI submitted.
</markdown>
</abstract>
```
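If you're already editing the document in R, the same result can be sketched from the emld list representation instead of hand-editing the XML. This is an assumption: it relies on `doc` being the EML document object used in the earlier examples, and the DOI path is a placeholder.

```{r, eval=FALSE}
# Sketch: set the abstract as a markdown block from R. Serializing the
# document should then produce an <abstract><markdown>...</markdown></abstract>
# section like the one shown above.
doc$dataset$abstract <- list(
  markdown = paste(
    "### Access",
    "Files can be accessed and downloaded from the directory via: [http://arcticdata.io/data/10.18739/DOI](http://arcticdata.io/data/10.18739/DOI).",
    "",
    "### Overview",
    "This is the original abstract overview that the PI submitted.",
    sep = "\n"))
```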

workflows/large_dataset_method/large_dataset.Rmd

Lines changed: 0 additions & 32 deletions
This file was deleted.
