
Data Strategy

Robert Ladwig edited this page Jun 12, 2025 · 12 revisions

This document outlines the standardized data and code management practices for research and advisory projects within our computational limnology team.

πŸ“ Project File Structure

Each project must follow a consistent folder and documentation structure similar to the following:


├── README.md             # Project overview and objectives
├── data/                 # Organized into raw/, processed/, external/
│   └── README.md         # Data descriptions and sources
├── scripts/              # Analysis and workflow scripts
├── outputs/              # Results: figures, tables, model output
├── reports/              # Reports, manuscripts
├── figures/              # Figures, visualizations
├── _targets.R            # Workflow pipeline (e.g., {targets} in R)
├── renv/ or environment.yml # Environment/dependency management
└── devlog.md             # Log of decisions and progress (optional)
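As a concrete illustration, the data/README.md describing data sources might look like this (file names and sources are hypothetical):

```
## raw/
- buoy_temps_2024.csv — hourly thermistor-chain temperatures, exported
  from the logger without modification (source: lake monitoring buoy)

## processed/
- daily_means.csv — daily mean temperatures derived from
  raw/buoy_temps_2024.csv by scripts/aggregate.R
```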

🔄 Version Control & Backup

  • All projects must have a GitHub repository:

    • Public for research projects

    • Private for advisory/confidential projects

  • Repositories must be mirrored locally on AU's O:\ drive for backup

  • Ideally, use standardized naming (e.g., project-<year>-<topic>)
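As a quick sanity check, a repository name can be validated against this convention with a short shell snippet (the example name is hypothetical):

```shell
# Validate a repository name against the project-<year>-<topic> convention:
# "project-", a four-digit year, then one or more lowercase/numeric words.
name="project-2025-lake-mixing"   # hypothetical example
if echo "$name" | grep -Eq '^project-[0-9]{4}-[a-z0-9]+(-[a-z0-9]+)*$'; then
  echo "valid"
else
  echo "invalid"
fi
```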

🧪 Reproducibility Standards

  • Use {targets} (R) or Snakemake (Python/bash) for all analysis pipelines

  • Final results must be fully reproducible from raw data

  • Use renv (R), or virtualenv, pipenv, or conda (Python) for dependency management

  • Clearly comment all scripts and functions
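To make the pipeline requirement concrete, a minimal `_targets.R` sketch might look like the following (file names, target names, and the processing steps are hypothetical):

```r
library(targets)

# Packages available to every target
tar_option_set(packages = c("readr", "dplyr"))

list(
  # Track the raw input file so the pipeline reruns when it changes
  tar_target(raw_file, "data/raw/temperature.csv", format = "file"),
  tar_target(raw_data, readr::read_csv(raw_file)),
  # Derived daily summary; replace with the project's real processing steps
  tar_target(daily_means,
    raw_data |> dplyr::group_by(date) |> dplyr::summarise(temp = mean(temp_c))
  )
)
```

Running `targets::tar_make()` then rebuilds only the targets whose upstream inputs changed, which is what guarantees that final results stay reproducible from the raw data.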

📦 Data Publishing & Archiving

  • On completion, archive the GitHub repository to Zenodo to mint a DOI

  • Include a CITATION.cff file for machine-readable citation metadata

  • Tag final release version for reproducibility and reference
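A minimal CITATION.cff might look like this (all field values are placeholders to replace before release):

```yaml
# Placeholder values — replace before tagging the release
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "project-2025-lake-mixing"
authors:
  - family-names: "Doe"
    given-names: "Jane"
version: "1.0.0"
doi: "10.5281/zenodo.0000000"
date-released: "2025-06-12"
```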

✅ FAIR Data Principles

All datasets should follow FAIR guidelines:

  • Findable: Documented, discoverable via DOI

  • Accessible: Public repositories or open-access archives

  • Interoperable: Use open formats (CSV, NetCDF); standard units

  • Reusable: Include metadata (metadata.yaml or data_dictionary.csv)

Each dataset must be accompanied by metadata:

  • Description of variables (names, units, meaning)

  • Sampling methods and instruments

  • Collection date, location, and processing steps
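For example, a metadata.yaml recording these items might look like this (all names and values are hypothetical):

```yaml
dataset: buoy_temperature_2024        # hypothetical dataset
description: "Hourly water temperature profiles from the lake buoy"
collected: "2024-05-01 to 2024-10-31"
location: "Lake X, 56.1N 10.2E"       # placeholder coordinates
instrument: "Thermistor chain, 0.5 m sensor spacing"
processing: "Despiked; gaps shorter than 3 h linearly interpolated"
variables:
  - name: temp_c
    unit: degC
    meaning: "Water temperature"
  - name: depth_m
    unit: m
    meaning: "Sensor depth below surface"
```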

🧾 Documentation & Communication

All code must be well commented and follow team style guides:

  • Preferably use {roxygen2} for documenting R functions
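For instance, a {roxygen2} header on a (hypothetical) helper function looks like this:

```r
#' Convert water temperature from Fahrenheit to Celsius
#'
#' @param temp_f Numeric vector of temperatures in degrees Fahrenheit.
#' @return Numeric vector of temperatures in degrees Celsius.
#' @examples
#' fahrenheit_to_celsius(68)  # 20
#' @export
fahrenheit_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}
```

Running `roxygen2::roxygenise()` during package development turns these headers into the package's help pages and NAMESPACE entries.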

Project documentation includes:

  • README.md – Overview

  • devlog.md – Key decisions and changes

  • GitHub Projects – Plan and track decisions and work packages

  • GitHub Issues – Task and discussion tracking (encouraged)

βš™οΈ Automation & Tools

Automate repetitive tasks where possible:

  • Convert frequently used scripts or functions into packages/libraries (R, Python)

  • GitHub repo creation, license, and structure setup

  • Sync to O:\ drive using post-commit hooks or scheduled scripts

  • DOI publishing via GitHub-Zenodo integration
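The O:\ sync can be wired to a git post-commit hook; a sketch is below (all paths are assumptions, and in Git Bash on Windows the O:\ drive is typically visible as /o):

```shell
#!/bin/sh
# .git/hooks/post-commit — mirror the working tree to the O:\ backup
# drive after every commit. Destination path is an assumption.
repo=$(basename "$PWD")      # hooks run from the repository root
dest="/o/backup/$repo"       # O:\ as mounted in Git Bash
mkdir -p "$dest"
rsync -a --delete --exclude '.git' ./ "$dest/"
```

Make the hook executable (`chmod +x .git/hooks/post-commit`); because hooks are not versioned, each team member has to install it in their own clone, so a scheduled script may be the more robust option.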
