util/
├── common/ # shared helpers imported by other scripts
│ ├── gcloud_ops.py # gcloud/storage CLI wrappers + bucket IAM/label ops
│ ├── release_ops.py # Releases-Sheet loading, release constants, slug classifiers
│ ├── data_integrity.py # manifest / MD5 / blob checks for staging→prod
│ ├── bucket_validation_utils.py
│ └── markdown_generator.py
├── raw_bucket_prep/ # prepare a dataset raw bucket for QC & release
│ ├── download_raw_bucket_metadata_to_local
│ ├── transfer_qc_metadata_to_raw_bucket
│ └── transfer_release_resources_to_raw_bucket.py
├── data_promotion/ # promote raw → staging → curated buckets
│ ├── promote_raw_data
│ ├── promote_staging_data
│ ├── clean_wdl_raw_buckets
│ ├── data_promotion_diagram.svg
│ └── archive/transfer_raw_data # deprecated
├── reporting/ # collection summaries & dataset stat tables
│ ├── crn_cloud_collection_summary
│ ├── internal_qc_dataset_collection_summary
│ ├── generate_dataset_summary_table
│ ├── extract_brain_bank_data
│ └── generate_brain_bank_summary
├── workflow_inputs/
│ └── generate_inputs
└── requirements.txt
Scripts in `raw_bucket_prep/`, `data_promotion/`, and `reporting/` import shared helpers from `common/` (`gcloud_ops`, `release_ops`, `data_integrity`) via a small `sys.path` bootstrap, so they still run directly from their subfolder.
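A `sys.path` bootstrap of this kind might look like the sketch below. The helper name `add_util_root_to_path` is hypothetical, and the sketch assumes each script sits exactly one folder below `util/`; the real scripts may do this differently.

```python
import sys
from pathlib import Path


def add_util_root_to_path(script_file: str) -> Path:
    """Put the parent of the script's folder (assumed to be util/) on sys.path."""
    util_root = Path(script_file).resolve().parent.parent
    if str(util_root) not in sys.path:
        sys.path.insert(0, str(util_root))
    return util_root


# A script in util/reporting/ would call this before importing from common/.
util_root = add_util_root_to_path("util/reporting/generate_dataset_summary_table")
# from common import release_ops  # resolvable once util/ is on sys.path
```

Inside a real script the argument would typically be `__file__`, which keeps the import working regardless of the current working directory.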
| Script | Folder | Description | Context | Example usage |
|---|---|---|---|---|
| `gcloud_ops.py` | `common/` | Elementary gcloud storage CLI wrappers (copy/move/remove/rsync/list), bucket IAM and label operations, and bucket/dataset name-parsing helpers. | Centralizes the low-level Cloud Storage calls reused across the promotion and transfer scripts. | NA |
| `release_ops.py` | `common/` | Loads the live Releases Google Sheet (SSOT), derives release/bucket constants, and provides slug-based assay/organism/source classifiers. | Single source of truth for release metadata, with slug-based dataset classification as a fallback when Sheet data isn't available. | NA |
| `data_integrity.py` | `common/` | Manifest reading and MD5 / non-empty / associated-metadata checks, plus staging-vs-curated blob name and hash comparisons. | Used to validate data integrity when promoting staging data to production. | NA |
| `bucket_validation_utils.py` | `common/` | Functions to validate raw bucket and local metadata structure and contents before transferring data. | Checks preceding data transfers. | NA |
| `generate_inputs` | `workflow_inputs/` | Generate inputs JSON for WDL pipelines. | Generates the inputs JSON for WDL pipelines given a project TSV (sample information), inputs JSON template, workflow name, and cohort dataset ID. | `./generate_inputs --project-tsv lee.metadata.tsv --inputs-template inputs.json --workflow-name pmdbs_sc_rnaseq_analysis --release-version v4.0.0 --cohort-dataset-id cohort-pmdbs-sc-rnaseq` |
| `download_raw_bucket_metadata_to_local` | `raw_bucket_prep/` | Validate the raw bucket structure, then sync raw bucket metadata to the local metadata directory. | Once authors have contributed their metadata to the raw bucket, this script first validates the bucket structure/metadata and then downloads the data locally so that QC can be performed. Pass `-V`/`--validate-only` to run just the structure/metadata checks without downloading (this replaces the former standalone `validate_raw_bucket_structure.py`). | `./download_raw_bucket_metadata_to_local -d team-jakobsson-pmdbs-bulk-rnaseq` (add `--validate-only` to check only) |
| `transfer_qc_metadata_to_raw_bucket` | `raw_bucket_prep/` | Sync local metadata directory to the raw bucket. | After receiving author-contributed metadata from a raw bucket, QC/processing steps must be done locally. This script is run after QC is complete, so that the locally changed metadata directories are synced to the raw bucket. If any later changes are made to the metadata, this script must be re-run to ensure that the raw bucket contains the most up-to-date copies of the QC'd metadata. | `./transfer_qc_metadata_to_raw_bucket -d team-jakobsson-pmdbs-bulk-rnaseq -v v4.0.0` |
| `promote_raw_data` | `data_promotion/` | Transfer QC'ed metadata, CRN Team contributed artifacts, and other CRN Team contributed data (e.g., spatial) from raw data buckets to production buckets (for Urgent/Minor releases) or staging buckets (for Minor/Major releases). | Run for all releases: Urgent, Minor, and Major. It also removes the `internal-qc-data` label from the released raw buckets for Urgent/Minor releases. For Urgent/Minor releases, this data moves straight to production buckets (i.e., CURATED) because there are no pipeline/curated outputs, so the staging buckets are not used. For Minor/Major releases, it moves to staging buckets (i.e., DEV/UAT) because there are pipeline/curated outputs; `promote_staging_data` later copies the data over to production buckets. Minor releases appear in both cases because some datasets are only platformed in a Minor release, while others are run through existing pipelines. Note: this script must be run before `promote_staging_data`. | `./promote_raw_data --type-of-release urgent --all-datasets --release-version v4.0.0` |
| `promote_staging_data` | `data_promotion/` | Promote staging data to production data buckets and apply the appropriate permissions. | Runs data integrity tests when promoting data from staging (i.e., DEV/UAT) to production buckets (i.e., CURATED). This script is only run for Minor and Major releases. It also applies the appropriate permissions to the buckets (e.g., adding Verily's ASAP Cloud Readers to released raw buckets) and removes the `internal-qc-data` label from the released raw buckets. The buckets/datasets are detected based on the workflow name provided and the workflow/pipeline version that's used to store current curated outputs in the raw `workflow_execution` bucket. This dict, `unembargoed_dev_buckets_and_workflow_version_outputs`, is in `release_ops.py`. | `./promote_staging_data -w pmdbs_sc_rnaseq --release-version v4.0.0 --collection-version v3.1.0` |
| `markdown_generator.py` | `common/` | Functions that generate a Markdown report. | Used in the `promote_staging_data` script to generate a Markdown report that contains data integrity results when promoting data from staging (i.e., DEV/UAT) to production buckets (i.e., CURATED). | NA |
| `crn_cloud_collection_summary` | `reporting/` | Track the ASAP raw/curated buckets, size, sample breakdown, and subject breakdown in the CRN Cloud. | See CRN Cloud Statistics below for more details. | `./crn_cloud_collection_summary` |
| `internal_qc_dataset_collection_summary` | `reporting/` | Track datasets in internal QC by getting their ASAP raw buckets, size, sample, and subject breakdown in GCP. | See CRN Cloud Statistics below for more details. | `./internal_qc_dataset_collection_summary` |
| `generate_dataset_summary_table` | `reporting/` | Generate pivot tables of unique subject/sample counts and subject diagnosis counts by organism × sample source × assay from CRN Cloud or internal QC summary outputs. | Run after `crn_cloud_collection_summary` or `internal_qc_dataset_collection_summary` to produce summary tables for reporting. Auto-detects input source from the filename and prefixes outputs accordingly. Reads dataset metadata from the Google Releases Sheet via `get_releases_df()` when available; falls back to slug-name classification otherwise. | `python3 generate_dataset_summary_table <prefix>.<date>.tsv <prefix>.subject_dataset_membership.<date>.tsv <prefix>.sample_dataset_membership.<date>.tsv <prefix>.subject_diagnosis_membership.<date>.tsv` |
| `extract_brain_bank_data` | `reporting/` | Extract brain bank (`biobank_name`) metadata for every PMDBS sample across CRN curated and/or internal QC raw buckets. | Walks `asap-curated-team-*` and `asap-raw-team-*` buckets, reads `SUBJECT.csv` + `SAMPLE.csv`, joins on `subject_id`, and emits one row per sample with its associated brain bank. Tracks attempted datasets and flags those skipped in both curated and internal QC. | `./extract_brain_bank_data` |
| `generate_brain_bank_summary` | `reporting/` | Generate brain-bank-centric summary tables (matrix + long format) from the brain bank membership TSV. | Run after `extract_brain_bank_data` to produce brain-bank-focused summaries useful for identifying well-characterized samples vs. data gaps across data types. | `python3 generate_brain_bank_summary brain_bank_membership.<date>.tsv` |
| `transfer_release_resources_to_raw_bucket.py` | `raw_bucket_prep/` | Sync local release-resources `config/`, `release_stats/` and `publisher_cards/` to dataset ASAP raw buckets. | After producing Publisher card text and summary figures, this script syncs locally stored files (presumably living at `asap-crn-cloud-dataset-metadata/`) into each dataset `gs://` raw bucket. If any later changes are made to the release-resources, this script must be re-run to ensure that the raw bucket contains the most up-to-date copies. | `./transfer_release_resources_to_raw_bucket.py -i /path/to/release_<release_version>.json -p` |
| `clean_wdl_raw_buckets` | `data_promotion/` | Cleanup script for GCP raw bucket workflow execution timestamp cohort analysis and downstream folders. | Removes outdated timestamp folder contents across all raw buckets in the cohort analysis and downstream folders while preserving versions. | `./clean_wdl_raw_buckets -p` |
| Script | Folder | Description | Context |
|---|---|---|---|
| `transfer_raw_data` | `data_promotion/archive/` | Transfer data in generic raw buckets to dataset-specific raw buckets (e.g., `gs://asap-raw-data-team-lee` vs. `gs://asap-dev-team-lee-pmdbs-sn-rnaseq`). | Originally, "generic" raw buckets were created because we only had one data type (i.e., sc RNAseq). Later on, we started implementing new data types (e.g., bulk RNAseq, spatial transcriptomics, etc.) and restructured the bucket naming and organization. Therefore, this script is used to move raw data from the generic raw buckets to data-specific raw buckets. It is not applicable to new datasets where we collaborate with the CRN Teams to determine the dataset name. |
This section describes the workflow for processing contributor submissions, from initial upload through QC and back to the raw bucket.
See documentation in the asap-crn-cloud-dataset-metadata repo for more granular information on the steps pertaining to releasing a contributed dataset.
Contributor data workflow diagram:
Script: download_raw_bucket_metadata_to_local
Validates the raw bucket structure and metadata, then downloads the metadata from the raw bucket to your local workspace for QC. Handles both initial submissions (loose CSV files) and post-QC structures (organized directories).
To run only the structure/metadata validation without downloading (the former standalone validate_raw_bucket_structure.py, now removed), pass -V/--validate-only:
# validate only — no download
./download_raw_bucket_metadata_to_local -d team-jakobsson-pmdbs-bulk-rnaseq --validate-only
# validate + download
./download_raw_bucket_metadata_to_local -d team-jakobsson-pmdbs-bulk-rnaseq

What it does:
- Initial submission: Downloads `metadata/*.csv` → local `metadata/original/`
- Re-sync: Downloads entire `metadata/` tree plus `file_metadata/` and `DOI/` if present
- Optional: Also downloads `file_metadata/` and `DOI/` if present in bucket
Quality control is performed locally in the asap-crn-cloud-dataset-metadata repository.
QC outputs:
metadata/
├── original/ # Contributor submission
├── cde/ # CDE-versioned copies
├── release/ # Release-versioned metadata (e.g., v4.0.0/)
└── latest/ # Copy of the latest release version
Script: transfer_qc_metadata_to_raw_bucket
Syncs the local metadata directory (including all QC'd subdirectories) back to the raw bucket.
./transfer_qc_metadata_to_raw_bucket -d team-jakobsson-pmdbs-bulk-rnaseq -v v4.0.0 -p

What it transfers:
- Entire `metadata/` directory tree
- `file_metadata/` (if present)
- `DOI/` (if present)
Note: Use -p flag to execute (defaults to dry-run for safety).
Build Publisher collection cards text and figures using:
Script: make_release.py in the asap-crn-cloud-dataset-metadata repository.
/path/to/make_release -i /path/to/release_<release_version>.json -p

release-resources outputs:
release-resources/
└─ {release_version}/
├─ cde/
├─ release_stats/
│ └─ {dataset_name}/
└─ publisher_cards/
└─ {dataset_name}/
├─ figures/
└─ text/
Script: transfer_release_resources_to_raw_bucket.py
Syncs the local release-resources directory (including all QC'd subdirectories) back to the raw bucket.
./transfer_release_resources_to_raw_bucket.py -i /path/to/config/release_<release_version>.json -p

What it transfers:
- Entire `config/release_<release_version>.json`
- `publisher_cards/` html files
- `release_stats/` final svg files
Note: Use -p flag to execute (defaults to dry-run for safety).
- Dry-run by default: Most scripts require the `-p` (promote) flag to actually execute transfers
- Structure migration: The first transfer after QC from local to the raw bucket establishes the new directory structure (`original/`, `cde/`, `release/`, `latest/`) in the bucket
- Re-running scripts: Safe to re-run download/transfer scripts. The rsync command will replace changed files and add new source files to the destination, but will not remove files that exist in the destination but not in the source
- Missing files: Scripts warn about missing CORE metadata tables but allow incomplete submissions (for flexibility during initial upload)
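The dry-run default and the non-destructive rsync behaviour described above can be sketched as follows. This is a minimal illustration, not the scripts' actual code: the function name `build_rsync_command`, the long option `--promote`, and the example bucket are assumptions, and the sketch only builds the command list without running it. It relies on the `--dry-run` and `--recursive` flags of `gcloud storage rsync` and deliberately omits `--delete-unmatched-destination-objects`, so destination-only files are left alone.

```python
import argparse


def build_rsync_command(source: str, destination: str, promote: bool) -> list:
    """Return a gcloud storage rsync command; preview-only unless promote is set."""
    command = ["gcloud", "storage", "rsync", "--recursive", source, destination]
    if not promote:
        # Without -p the transfer is only previewed, never executed.
        command.insert(3, "--dry-run")
    return command


parser = argparse.ArgumentParser()
parser.add_argument("-p", "--promote", action="store_true")
args = parser.parse_args([])  # no -p given, so this is a dry run
command = build_rsync_command("metadata", "gs://example-raw-bucket/metadata", args.promote)
print(" ".join(command))
```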
Contributors are expected to deposit their data and metadata in a structured manner in the given dataset's raw bucket. The bucket structure is organized into required, recommended, and optional directories. A contribution consists of the metadata and the deposited data, but the form of this data (processed outputs or raw data) varies by assay. A submission should therefore at minimum have `metadata/` and a data directory such as `raw/` or `fastqs/`, and preferably include the author's own processed data in `artifacts/`.
- `metadata/` - Contains 'core' and 'supplemental' metadata tables (see Metadata Files below)
- `artifacts/` - Processed outputs of data pipelines
- `fastqs/` - FASTQ files for relevant sequencing assays
- `spatial/` - Outputs of spatial transcriptomic assays
- `scripts/` - Analysis and processing code used by the contributors
- `raw/` - Catch-all for raw/unprocessed data for non-sequencing-based assays
- `workflow_execution/` - Created by DNAstack during pipeline execution
Metadata tables are grouped into two categories:
The CDE metadata schema can be found here: CDE Google Sheet
Expected for every submission (CDE 4.0+):
- `ASSAY.csv`
- `CONDITION.csv`
- `DATA.csv`
- `PROTOCOL.csv`
- `SAMPLE.csv`
- `STUDY.csv`
- `SUBJECT.csv`
Context-specific information or tables from releases prior to CDE 4.0 (which consolidated some tables, e.g., MOUSE + CELL → SUBJECT):
- `PMDBS.csv`
- `CLINPATH.csv`
- `MOUSE.csv`
- `CELL.csv`
- `PROTEOMICS.csv`
- `ASSAY_RNAseq.csv`
- `SPATIAL.csv`
- `SDRF.csv`
After receiving a contribution, the metadata/ directory is reorganized and versioned during QC. See asap-crn-cloud-dataset-metadata for details on the QC process and final structure:
metadata/
├── original/ # Original contributor submission
├── cde/ # CDE-versioned copies
├── release/ # Release-versioned metadata
└── latest/ # Copy of the latest release version
| Data Release Scenario | Script Used |
|---|---|
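| Urgent | `promote_raw_data` |
| Minor | `promote_raw_data`, followed by `promote_staging_data` when the dataset has pipeline/curated outputs |
| Major | `promote_raw_data`, followed by `promote_staging_data` |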
Scripts used in different Data Release Scenarios diagram:
Note: Previous Minor Releases did not contain pipeline/curated outputs (SOW 2); however, moving forward there will be outputs (SOW 3 onwards) [06/12/2025]. Minor Releases apply to both diagrams, as a dataset may or may not include pipeline/curated outputs depending on the data assay/modality. If a dataset was previously released in an Urgent or Minor Release and is later scheduled for a Major Release, the curated buckets will be overwritten with the most recent version of the data.
asap-{dev,uat,curated}-{cohort,team-xxyy}-{source}-{assay}-{context}
├── <raw_data>
├── artifacts
├── file_metadata
├── metadata
│ └── release
│ └── ${release_version}
│ ├── *.csv
│ └── cde_version # plain text file, no extension
└── ${workflow_name}
└── release
└── ${release_version}
├── <curated_outputs>
│ ├── ...
│ └── MANIFEST.tsv
├── VERSION # plain text file, no extension
└── workflow_metadata
└── ${timestamp}
├── MANIFEST.tsv # combined
└── data_promotion_report.md
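The bucket naming scheme at the top of this tree can be checked with a small parser. The sketch below is illustrative only: `parse_bucket_name` is a hypothetical helper (the real name-parsing helpers live in `gcloud_ops.py`), and it assumes team names contain no hyphens. The `{source}-{assay}-{context}` remainder is returned unsplit because assay names such as `sn-rnaseq` contain hyphens themselves.

```python
import re

BUCKET_PATTERN = re.compile(
    r"^asap-(?P<env>dev|uat|curated)-(?P<owner>cohort|team-[a-z0-9]+)-(?P<dataset>.+)$"
)


def parse_bucket_name(bucket: str) -> dict:
    """Split a staging/production bucket name into env, owner, and dataset parts."""
    name = bucket[len("gs://"):] if bucket.startswith("gs://") else bucket
    match = BUCKET_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized bucket name: {bucket}")
    return match.groupdict()


team_bucket = parse_bucket_name("gs://asap-curated-team-lee-pmdbs-sn-rnaseq")
cohort_bucket = parse_bucket_name("asap-dev-cohort-pmdbs-sc-rnaseq")
print(team_bucket, cohort_bucket)
```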
The `VERSION` plain-text file lists the versions associated with the ASAP CRN Cloud Release (which can be found on Zenodo), following a similar structure to the metadata/release/<release_version>/VERSION file:
WORKFLOW_VERSION=
COLLECTION_VERSION=
RELEASE_VERSION=
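Reading such a `KEY=VALUE` file takes only a few lines. The sketch below uses a hypothetical helper name, `parse_version_file`, and placeholder version values chosen purely for illustration.

```python
def parse_version_file(text: str) -> dict:
    """Parse KEY=VALUE lines, ignoring blank lines and lines without '='."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        versions[key.strip()] = value.strip()
    return versions


# Placeholder contents; real files carry the versions of the actual release.
example = "WORKFLOW_VERSION=v1.0.0\nCOLLECTION_VERSION=v3.1.0\nRELEASE_VERSION=v4.0.0\n"
versions = parse_version_file(example)
print(versions["RELEASE_VERSION"])
```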
Set up for pulling data from live Google Spreadsheets using gspread
- Grant the SA asap-gcs-admin@dnastack-asap-parkinsons.iam.gserviceaccount.com Viewer access to the Google Spreadsheet.
- Download the SA credentials:
gcloud iam service-accounts keys create ~/.config/gspread/credentials.json \
--iam-account=asap-gcs-admin@dnastack-asap-parkinsons.iam.gserviceaccount.com
- The credentials file will be picked up automatically by `get_releases_df()`; no additional configuration is needed.
Utility scripts for tracking ASAP dataset statistics across the CRN Cloud and internal GCP infrastructure. Reports on bucket sizes, sample/subject counts, brain bank coverage, and breakdowns by data assay/modality and biological origin.
Reporting pipeline diagram:
Queries the CRN Cloud via the DNAstack CLI to report on published individual datasets and harmonized collections. For each dataset, it retrieves the associated GCP raw and curated buckets, their sizes, sample/subject counts, brain-specific statistics, and subject diagnosis breakdown.
Output:
- `crn_cloud_collection_summary.<date>.tsv`
- `crn_cloud_collection_summary.subject_dataset_membership.<date>.tsv`: one row per subject-dataset pair (excludes cohorts)
- `crn_cloud_collection_summary.sample_dataset_membership.<date>.tsv`: one row per sample-dataset pair (excludes cohorts)
- `crn_cloud_collection_summary.brain_donor_dataset_membership.<date>.tsv`: one row per brain donor-dataset pair (excludes cohorts)
- `crn_cloud_collection_summary.subject_diagnosis_membership.<date>.tsv`: one row per subject-diagnosis-dataset pair, human datasets only (CLINPATH → SUBJECT → SAMPLE `condition_id` priority order)
- `crn_cloud_collection_summary.sample_region_dataset_membership.<date>.tsv`: one row per sample-dataset pair with brain region info (excludes cohorts; columns: `subject_id`, `asap_sample_id`, `region_level_1`, `region_level_2`, `publisher_slug`). Source priority: `SAMPLE.region_level_1`/`region_level_2` → `PMDBS.brain_region` (legacy CDE, populates `region_level_1` only). Samples without any region info are not emitted.
| Column | Description |
|---|---|
| `publisher_slug` | Dataset slug name in the CRN Cloud |
| `gcp_raw_bucket` | GCS raw bucket URI |
| `gcp_raw_bucket_size` | Raw bucket size in bytes |
| `gcp_curated_bucket` | GCS curated bucket URI |
| `gcp_curated_bucket_size` | Curated bucket size in bytes |
| `team_name` | Contributing team name parsed from slug |
| `n_samples` | Distinct `asap_sample_id` + `modality` count from ASSAY table; falls back to `COUNT(DISTINCT asap_sample_id)` from SAMPLE |
| `n_subjects_unique` | `COUNT(DISTINCT <subject-id-col>)` from team SAMPLE table where `<subject-id-col>` is `asap_subject_id`, `asap_mouse_id`, `asap_cell_id`, or `subject_id` (probed in that order); deduplicated subject count |
| `n_samples_unique` | `COUNT(DISTINCT asap_sample_id)` from team SAMPLE table (deduplicated samples) |
| `n_samples_total` | `COUNT(*)` of team SAMPLE table; raw row count, captures replicates of the same `asap_sample_id` |
| `n_brain_samples` | Brain sample count from PMDBS table, or from `tissue` column in SAMPLE if no PMDBS table |
| `n_brain_regions` | Distinct brain regions in PMDBS table |
| `n_brain_donors` | Distinct donors in CLINPATH table |
| `n_subjects_<diagnosis>` | Subject count per primary diagnosis category (25 columns); sourced from CLINPATH or SUBJECT `primary_diagnosis` column; 0 if not applicable |
| `condition_counts` | Raw condition value counts serialized as `condition:count\|...`; populated from SAMPLE `condition_id` or CONDITION `condition` when `primary_diagnosis` is not available |
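The `condition:count|...` serialization used by `condition_counts` can be produced and read back as in this sketch. The function names are hypothetical, and sorting the entries alphabetically is an assumption; the real script may emit them in another order.

```python
from collections import Counter


def serialize_condition_counts(conditions) -> str:
    """Turn raw condition values into 'condition:count|condition:count'."""
    counts = Counter(conditions)
    return "|".join(f"{name}:{count}" for name, count in sorted(counts.items()))


def parse_condition_counts(serialized: str) -> dict:
    """Inverse of serialize_condition_counts; splits on the last ':' of each item."""
    result = {}
    for item in serialized.split("|"):
        if not item:
            continue
        name, _, count = item.rpartition(":")
        result[name] = int(count)
    return result


serialized = serialize_condition_counts(["control", "treated", "control"])
print(serialized)
print(parse_condition_counts(serialized))
```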
Usage:
./crn_cloud_collection_summary [OPTIONS]
OPTIONS
-h Display this message and exit
-s Grab no. of samples and subjects only (skip bucket size queries)
-i A previously generated TSV to append to, skipping already-processed datasets (Note: Use only if certain that earlier datasets have not been updated)
-l A file containing a list of dataset_ids to process, one per line (e.g. team-hafler-pmdbs-sn-rnaseq-pfc, cohort-pmdbs-sc-rnaseq).
Slug is inferred by prepending "prod-" to query the CRN Cloud.
team-* and cohort-* prefixes are used to classify individual vs. harmonized collections respectively.
If not provided, all datasets in the CRN Cloud are processed.

Notes:
- Requires `dnastack` CLI authenticated to `cloud.parkinsonsroadmap.org` and `gcloud` with appropriate permissions
- Raw bucket sizes include files used for development and may exceed what is strictly part of a release
- Cohort collections (`cohort-*`) have their bucket derived from the slug (`gs://asap-raw-cohort-*`) rather than from the DATA table, which points to individual team buckets
- `n_subjects` is sourced from the SUBJECT table (`asap_subject_id`), MOUSE table (`asap_mouse_id`), or CELL table (`asap_cell_id`), whichever applies; falls back to `COUNT(DISTINCT subject_id)` from SAMPLE
- `n_samples` uses `COUNT(DISTINCT asap_sample_id, modality)` from ASSAY table if available, otherwise falls back to `COUNT(DISTINCT asap_sample_id)` from SAMPLE
- `n_brain_donors` counts subjects in CLINPATH who also appear in PMDBS (via SAMPLE join) or have a non-null `region_level_1` in SAMPLE
- Diagnosis counts (`n_subjects_*`) are sourced in priority order: CLINPATH → SUBJECT → SAMPLE `condition_id` → CONDITION `condition`; values not matching the fixed diagnosis vocabulary are captured in `condition_counts` instead
- Subject and sample membership files contain one row per ID-dataset pair; global deduplication is performed by `generate_dataset_summary_table`
- Use `-i` to incrementally update an existing summary file rather than reprocessing everything from scratch
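Two of the rules above, the subject-ID column probing order and the slug inference for `-l` entries, are easy to mirror in plain Python. The sketch below is only an illustration over in-memory rows; the actual script queries the CRN Cloud through the DNAstack CLI, and all function names here are hypothetical.

```python
SUBJECT_ID_COLUMNS = ["asap_subject_id", "asap_mouse_id", "asap_cell_id", "subject_id"]


def pick_subject_id_column(columns):
    """Return the first subject-ID column present, probing in the documented order."""
    for candidate in SUBJECT_ID_COLUMNS:
        if candidate in columns:
            return candidate
    return None


def count_unique_subjects(rows) -> int:
    """Deduplicated subject count over a list of SAMPLE-like dict rows."""
    if not rows:
        return 0
    column = pick_subject_id_column(rows[0].keys())
    if column is None:
        return 0
    return len({row[column] for row in rows if row[column]})


def dataset_id_to_slug(dataset_id: str) -> str:
    """Slug used to query the CRN Cloud: the dataset ID with 'prod-' prepended."""
    return f"prod-{dataset_id}"


def collection_kind(dataset_id: str) -> str:
    """team-* IDs are individual datasets, cohort-* IDs are harmonized collections."""
    if dataset_id.startswith("cohort-"):
        return "harmonized"
    if dataset_id.startswith("team-"):
        return "individual"
    return "unknown"


rows = [
    {"asap_sample_id": "S1", "asap_mouse_id": "M1"},
    {"asap_sample_id": "S2", "asap_mouse_id": "M1"},
    {"asap_sample_id": "S3", "asap_mouse_id": "M2"},
]
print(count_unique_subjects(rows), dataset_id_to_slug("cohort-pmdbs-sc-rnaseq"))
```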
Scans GCP directly for asap-raw-team-* buckets labelled internal-qc-data and reports sample, subject, brain sample, brain donor, and diagnosis breakdowns by reading SAMPLE.csv, PMDBS.csv, and CLINPATH.csv from each bucket's metadata path. Intended for tracking datasets currently in internal QC that are not yet published to the CRN Cloud. Output column names mirror those of crn_cloud_collection_summary so the same generate_dataset_summary_table pivot script works on either source.
Output:
- `internal_qc_dataset_collection_summary.<date>.tsv`: sample breakdown and bucket size if selected
- `internal_qc_dataset_collection_summary.subject_dataset_membership.<date>.tsv`: one row per subject-dataset pair
- `internal_qc_dataset_collection_summary.sample_dataset_membership.<date>.tsv`: one row per sample-dataset pair
- `internal_qc_dataset_collection_summary.brain_donor_dataset_membership.<date>.tsv`: one row per brain donor-dataset pair
- `internal_qc_dataset_collection_summary.subject_diagnosis_membership.<date>.tsv`: one row per subject-diagnosis-dataset pair, human datasets only
- `internal_qc_dataset_collection_summary.sample_region_dataset_membership.<date>.tsv`: one row per sample-dataset pair with brain region info (columns: `subject_id`, `asap_sample_id`, `region_level_1`, `region_level_2`, `publisher_slug`). Source priority: `SAMPLE.csv` `region_level_1`/`region_level_2` → `PMDBS.csv` `region_level_1`/`region_level_2` → `PMDBS.csv` `brain_region` (legacy CDE). Samples without any region info are not emitted.
| Column | Description |
|---|---|
| `publisher_slug` | Dataset slug name, synthesized from bucket name as `prod-team-...` |
| `team` | Team name parsed from bucket name |
| `gcp_raw_bucket` | GCS raw bucket URI |
| `gcp_raw_bucket_size` | Raw bucket size in bytes |
| `n_subjects_unique` | Distinct `subject_id` count from `SAMPLE.csv` (deduplicated subjects) |
| `n_samples_unique` | Distinct `sample_id` count from `SAMPLE.csv` (deduplicated samples) |
| `n_samples_total` | Raw row count of `SAMPLE.csv` (header excluded); captures replicates of the same `sample_id` |
| `n_brain_samples` | Brain sample count from `PMDBS.csv` if present, else count of rows in `SAMPLE.csv` where `tissue ~ /brain/i` |
| `n_brain_donors` | Distinct donors in `CLINPATH.csv` that also appear in `PMDBS.csv` (or `SAMPLE.region_level_1` if PMDBS not present) |
| `n_subjects_<diagnosis>` | Per-diagnosis subject counts pulled from `CLINPATH.csv` (priority order: `primary_diagnosis` → `last_diagnosis` → `path_autopsy_dx_main` → `path_autopsy_second_dx`; any column with numeric-only values is skipped) |
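The diagnosis-column priority rule in the last row can be sketched as below. This is an illustration under assumptions: `pick_diagnosis_column` is a hypothetical name, and treating absent or entirely empty columns as skippable is a guess about how the script behaves.

```python
DIAGNOSIS_COLUMNS = [
    "primary_diagnosis",
    "last_diagnosis",
    "path_autopsy_dx_main",
    "path_autopsy_second_dx",
]


def is_numeric(value: str) -> bool:
    """True when the value parses as a number (e.g. a coded diagnosis)."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def pick_diagnosis_column(rows):
    """Return the first priority column holding non-numeric values, else None."""
    for column in DIAGNOSIS_COLUMNS:
        values = [(row.get(column) or "").strip() for row in rows]
        values = [value for value in values if value]
        if not values:
            continue  # assumption: missing or empty columns are skipped too
        if all(is_numeric(value) for value in values):
            continue  # numeric-only columns are skipped, per the table above
        return column
    return None


clinpath_rows = [
    {"primary_diagnosis": "1", "last_diagnosis": "PD"},
    {"primary_diagnosis": "2", "last_diagnosis": "Control"},
]
print(pick_diagnosis_column(clinpath_rows))
```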
Usage:
./internal_qc_dataset_collection_summary [OPTIONS]
OPTIONS
-h Display this message and exit
-s Grab no. of samples and subjects only (skip bucket size queries)
-l Only grab info for a list of datasets, usually those included in the upcoming Release

Notes:
- Requires `gcloud` authenticated with access to `asap-raw-team-*` buckets, plus `python3` for CSV parsing
- Looks for `SAMPLE.csv`, `PMDBS.csv`, and `CLINPATH.csv` first at `metadata/release/<name>.csv`, then searches the full `metadata/` prefix as a fallback
- Datasets with no `SAMPLE.csv` found will report `NA` for sample and subject counts
- All `gcloud storage cat` output is piped through a CSV normalizer that flattens multi-line quoted fields before parsing, so awk-based downstream processing is safe
- Membership file column names match the CRN script's output so `generate_dataset_summary_table` accepts either source
- The `internal-qc-data` label is checked on each bucket and the script skips buckets not currently labelled as such (so released buckets fall out of internal-QC reporting once promoted)
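A CSV normalizer of the kind mentioned above could be written with the standard `csv` module, as in this sketch. The function name `flatten_multiline_fields` is hypothetical, and collapsing every whitespace run (not only newlines) into a single space is a simplifying assumption.

```python
import csv
import io


def flatten_multiline_fields(raw_csv: str) -> str:
    """Rewrite CSV text so each record sits on one physical line."""
    reader = csv.reader(io.StringIO(raw_csv, newline=""))
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    for row in reader:
        # Newlines inside quoted fields become single spaces.
        writer.writerow([" ".join(field.split()) for field in row])
    return output.getvalue()


raw = 'sample_id,tissue\n"S1\nrep1",brain\n'
flat = flatten_multiline_fields(raw)
print(flat)
```

Once every record occupies one line, line-oriented tools such as awk can process the output without miscounting rows.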
Generates pivot tables of unique subject/sample counts and subject diagnosis counts by organism × sample source × assay. Reads dataset metadata (organism, sample source, assay) from the Google Releases Sheet via get_releases_df() when available, falling back to slug-name pattern matching for datasets not in the Sheet. Joins with the membership files output by crn_cloud_collection_summary or internal_qc_dataset_collection_summary to deduplicate subjects and samples globally across datasets.
Input files (outputs of `crn_cloud_collection_summary` or `internal_qc_dataset_collection_summary`, used as inputs here):
- `<prefix>.<date>.tsv`
- `<prefix>.subject_dataset_membership.<date>.tsv`
- `<prefix>.sample_dataset_membership.<date>.tsv`
- `<prefix>.subject_diagnosis_membership.<date>.tsv`
- `<prefix>.sample_region_dataset_membership.<date>.tsv` (optional; auto-discovered from the subject-membership path if omitted)
Output:
- `dataset_summary_table.<timestamp>.tsv`: table of unique subjects/samples by organism × sample source × assay
- `subject_diagnosis_table.<timestamp>.tsv`: table of unique subject diagnosis counts, human datasets only
- `dataset_region_table.<timestamp>.tsv`: long-format table, one row per (`dataset`, `region_level_1`, `region_level_2`) with distinct `subject_count` and `sample_count`. Cohort slugs are excluded. Produced only when the sample-region membership file is present.
Usage:
python3 generate_dataset_summary_table \
<prefix>.<date>.tsv \
<prefix>.subject_dataset_membership.<date>.tsv \
<prefix>.sample_dataset_membership.<date>.tsv \
<prefix>.subject_diagnosis_membership.<date>.tsv \
[<prefix>.sample_region_dataset_membership.<date>.tsv]

Notes:
- Cohort slugs are excluded from all tables
- Output filenames are prefixed based on the input filename: `crn_cloud_*` → `crn_*`, `internal_qc_*` → `internal_qc_*`, anything else → no prefix
- Sample source classification uses `sample_source` and `organism` from the Releases Sheet when available; datasets not present in the Releases Sheet (typical for internal QC) fall back to slug-name pattern matching
- Datasets with non-standard `sample_source`/`organism` values are flagged as warnings
- `prod-team-scherzer-pmdbs-genetics` is hard-coded as Human / Brain tissue / Genetics due to a non-standard `assay` value in the sheet
- Mass-spec data type patterns are kept mutually exclusive: `ms-p` → Proteomics, `ms-mb` → Metabolomics, `ms-l` → Lipidomics (the slug classifier requires the pattern to appear with a separator, so `ms-p` won't accidentally match `ms-mb`)
- `mefs` in a slug is classified as Mouse / Embryonic fibroblast (the bucket-name classifier; the Releases Sheet path uses the Sheet's `organism` field directly)
- The Releases Sheet fetch is wrapped in try/except: if `gspread` credentials are missing or the fetch fails, the script logs a warning and proceeds with slug-name classification only
- Requires `gspread` credentials at `~/.config/gspread/credentials.json` (see gspread setup); optional for slug-only classification
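The filename-based output prefix and the try/except fallback around the Sheet fetch can be illustrated as follows. Both helper names are hypothetical, and `loader` stands in for whatever callable fetches the Releases Sheet (for example `get_releases_df`); the warning text is invented for the example.

```python
import sys
from pathlib import Path


def output_prefix(summary_path: str) -> str:
    """Map the input filename to the prefix applied to output files."""
    name = Path(summary_path).name
    if name.startswith("crn_cloud_"):
        return "crn_"
    if name.startswith("internal_qc_"):
        return "internal_qc_"
    return ""


def load_releases_or_none(loader):
    """Call loader(); on any failure, warn and return None so callers fall back."""
    try:
        return loader()
    except Exception as error:
        print(
            f"WARNING: Releases Sheet unavailable ({error}); "
            "using slug-name classification only",
            file=sys.stderr,
        )
        return None


def failing_loader():
    # Simulates missing gspread credentials.
    raise FileNotFoundError("credentials.json not found")


print(output_prefix("out/crn_cloud_collection_summary.2025-01-01.tsv"))
print(load_releases_or_none(failing_loader))
```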
Walks both CRN curated (asap-curated-team-*) and internal QC raw (asap-raw-team-*) buckets, locates SUBJECT.csv and SAMPLE.csv for each PMDBS dataset, joins them on subject_id, and emits one row per sample with its biobank_name. Designed to support brain-bank-centric reporting (which samples came from which bank, across which datasets and data types).
Output:
brain_bank_membership.<date>.tsv— one row per sample-dataset pair
| Column | Description |
|---|---|
| `subject_id` | Subject identifier from `SUBJECT.csv` |
| `sample_id` | Sample identifier from `SAMPLE.csv` |
| `biobank_name` | Brain bank name from `SUBJECT.csv` |
| `publisher_slug` | Dataset slug name, synthesized from bucket name as `prod-team-...` |
| `source` | `curated` (from curated bucket) or `raw` (from raw bucket) |
Usage:
./extract_brain_bank_data [OPTIONS]
OPTIONS
-h Display this message and exit
-l FILE Restrict to datasets listed in FILE (one slug per line, without the asap-{raw,curated}- prefix)
-s SRC Which source(s) to scan: curated, raw, or both (default: both)

Notes:
- Requires `gcloud` authenticated with access to both `asap-curated-team-*` and `asap-raw-team-*` buckets, plus `python3` for CSV parsing
- Only PMDBS buckets are scanned: `biobank_name` is meaningful only for postmortem brain tissue datasets
- Cohort buckets (`cohort-*`) are skipped because they aggregate across teams and do not have a per-team `SUBJECT.csv`
- `SAMPLE.csv` and `SUBJECT.csv` are searched first at `metadata/release/<name>.csv`, then recursively across `metadata/`
- Each `gcloud storage cat` output is piped through a CSV normalizer that flattens multi-line quoted fields (e.g., GeoMX-style sample IDs) before parsing
- After all buckets are processed, the script summarizes counts per source (CRN curated vs. internal QC) and deduplicated totals, then lists any datasets attempted in both CRN curated and internal QC that produced no output rows (missing `SUBJECT.csv`, missing `biobank_name`, empty join, etc.)
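The core join, `SAMPLE.csv` against `SUBJECT.csv` on `subject_id`, can be sketched with the standard library alone. The function name and the example slug below are hypothetical, and dropping samples whose subject has no `biobank_name` is an assumption based on the "no output rows" cases listed above.

```python
import csv
import io


def join_samples_to_brain_banks(subject_csv, sample_csv, publisher_slug, source):
    """Emit one row per sample with the brain bank of its subject."""
    subjects = {
        row["subject_id"]: (row.get("biobank_name") or "").strip()
        for row in csv.DictReader(io.StringIO(subject_csv))
    }
    rows = []
    for sample in csv.DictReader(io.StringIO(sample_csv)):
        biobank = subjects.get(sample["subject_id"])
        if not biobank:
            continue  # assumption: unmatched or bank-less samples are not emitted
        rows.append(
            {
                "subject_id": sample["subject_id"],
                "sample_id": sample["sample_id"],
                "biobank_name": biobank,
                "publisher_slug": publisher_slug,
                "source": source,
            }
        )
    return rows


subject_csv = "subject_id,biobank_name\nSUB1,Bank A\nSUB2,\n"
sample_csv = "sample_id,subject_id\nSAM1,SUB1\nSAM2,SUB1\nSAM3,SUB2\nSAM4,SUB9\n"
membership = join_samples_to_brain_banks(
    subject_csv, sample_csv, "prod-team-example-pmdbs-sn-rnaseq", "curated"
)
print(membership)
```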
Generates two brain-bank-centric TSVs from the brain_bank_membership.<date>.tsv output of extract_brain_bank_data. Each unique biobank_name string is treated as a separate bank (no alias normalization) — sort the output to spot near-duplicate spellings manually.
Input:
- `brain_bank_membership.<date>.tsv` (output of `extract_brain_bank_data`)

Output:
- `brain_bank_summary_matrix.<date>.tsv`: matrix view with rows = brain banks and columns = data types. Each cell shows `subjects / samples (team)`. Right-side summary columns: `total_samples`, `total_subjects`, `n_data_types`, `n_subjects_multi_modality`, `sources`.
- `brain_bank_summary_long.<date>.tsv`: long format, one row per (bank, team, data_type) for filtering or pivoting in Excel.
| Column (long format) | Description |
|---|---|
| `brain_bank` | Brain bank name as it appears in `biobank_name` |
| `team` | Team name parsed from the dataset slug |
| `data_type` | Assay category derived from the slug (sc/snRNA-seq, Spatial Transcriptomics, etc.) |
| `in_curated` / `in_raw` | Whether this (bank, team, data_type) cell has any rows from each source |
| `n_samples` | Distinct sample count for this cell |
| `n_subjects` | Distinct subject count for this cell |
| `n_subjects_multi_modality` | Subjects in this cell that also appear in ≥1 other data type for the same (bank, team) |
| `datasets` | Semicolon-separated list of contributing datasets |
Usage:
python3 generate_brain_bank_summary brain_bank_membership.<date>.tsv

Notes:
- Data-type classification uses the same slug-name patterns as `generate_dataset_summary_table` so categories stay consistent across reports
- The matrix sheet is sorted by `total_samples` descending, so banks with the most data appear first; banks with sparse coverage and empty cells highlight where data gaps exist
- `n_subjects_multi_modality` answers the "well-characterized samples" question: high values mean the same subject was profiled with multiple assays at the same bank+team
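The `n_subjects_multi_modality` definition can be made concrete with a small sketch. It assumes the membership rows have already been reduced to `(bank, team, data_type, subject_id)` tuples; the function name and the example values are hypothetical.

```python
from collections import defaultdict


def multi_modality_counts(records):
    """Per (bank, team, data_type) cell, count subjects seen in another data type too."""
    types_by_subject = defaultdict(set)
    subjects_by_cell = defaultdict(set)
    for bank, team, data_type, subject_id in records:
        types_by_subject[(bank, team, subject_id)].add(data_type)
        subjects_by_cell[(bank, team, data_type)].add(subject_id)
    return {
        cell: sum(
            1
            for subject_id in subjects
            if len(types_by_subject[(cell[0], cell[1], subject_id)]) > 1
        )
        for cell, subjects in subjects_by_cell.items()
    }


records = [
    ("Bank A", "lee", "sc/snRNA-seq", "SUB1"),
    ("Bank A", "lee", "Spatial Transcriptomics", "SUB1"),
    ("Bank A", "lee", "sc/snRNA-seq", "SUB2"),
]
counts = multi_modality_counts(records)
print(counts)
```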
crn_cloud_collection_summary and internal_qc_dataset_collection_summary print a breakdown to stdout after writing the TSV, including counts grouped by:
Data assay/modality
| Group | Matched bucket patterns |
|---|---|
| sc/sn RNAseq | sc-rnaseq, sn-rnaseq |
| sc/sn ATACseq | sc-atacseq, sn-atacseq |
| sc/sn Multiome | multimodal, multiome, multiomics |
| bulk RNAseq | bulk-rnaseq |
| spatial transcriptomics | spatial |
| proteomics | ms-p |
| metabolomics | ms-mb |
| lipidomics | ms-l |
| genetics | genetics |
| wgs | wgs |
| metagenomics | metagenome |
| other | anything else |
Biological origin
| Group | Matched bucket patterns |
|---|---|
| human | human, pmdbs |
| mouse | mouse, sulzer-fecal-metagenome-fp-spf |
| cell | cell, invitro, ipsc, mef |
| other | anything else |
Buckets that fall into the other category are listed explicitly in the output to flag gaps in the grouping logic.
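The two grouping tables can be expressed as ordered pattern lists, as in the sketch below. The patterns are copied from the tables above, but the matching rule is an assumption: each pattern must be bounded by hyphens or the ends of the name, so `ms-p` cannot match inside `ms-mb`. The summary scripts may instead use plain substring matching, and the bucket names in the example are made up.

```python
import re

ASSAY_GROUPS = [
    ("sc/sn RNAseq", ["sc-rnaseq", "sn-rnaseq"]),
    ("sc/sn ATACseq", ["sc-atacseq", "sn-atacseq"]),
    ("sc/sn Multiome", ["multimodal", "multiome", "multiomics"]),
    ("bulk RNAseq", ["bulk-rnaseq"]),
    ("spatial transcriptomics", ["spatial"]),
    ("proteomics", ["ms-p"]),
    ("metabolomics", ["ms-mb"]),
    ("lipidomics", ["ms-l"]),
    ("genetics", ["genetics"]),
    ("wgs", ["wgs"]),
    ("metagenomics", ["metagenome"]),
]
ORIGIN_GROUPS = [
    ("human", ["human", "pmdbs"]),
    ("mouse", ["mouse", "sulzer-fecal-metagenome-fp-spf"]),
    ("cell", ["cell", "invitro", "ipsc", "mef"]),
]


def contains_token(name: str, pattern: str) -> bool:
    """True when pattern appears in name bounded by hyphens or string ends."""
    return re.search(rf"(?:^|-){re.escape(pattern)}(?:-|$)", name) is not None


def classify(name: str, groups) -> str:
    """Return the first group with a matching pattern, or 'other'."""
    for group, patterns in groups:
        if any(contains_token(name, pattern) for pattern in patterns):
            return group
    return "other"


bucket = "asap-raw-team-lee-pmdbs-sn-rnaseq"
print(classify(bucket, ASSAY_GROUPS), classify(bucket, ORIGIN_GROUPS))
```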
- `gcloud` CLI: required for `crn_cloud_collection_summary`, `internal_qc_dataset_collection_summary`, and `extract_brain_bank_data`
- `dnastack` CLI: required for `crn_cloud_collection_summary` only
- `jq`: required for `crn_cloud_collection_summary`
- `python3` (≥ 3.8): required for `internal_qc_dataset_collection_summary`, `extract_brain_bank_data`, `generate_dataset_summary_table`, and `generate_brain_bank_summary`
- `pandas`: required for `generate_dataset_summary_table` and `generate_brain_bank_summary`
- `gspread` (optional): enables Releases-sheet-based classification in `generate_dataset_summary_table`
- `openpyxl` (optional): enables xlsx output in `generate_dataset_summary_table`