Purpose: Documents where and how all file paths are stored for Amalgkit RNA-seq workflows in METAINFORMANT.
- Overview
- Configuration Files (Primary Storage)
- Metadata Files (Sample-Level Paths)
- Workflow State Files
- File System Layout
- Path Resolution Logic
- Test Coverage
- Examples
Amalgkit RNA-seq workflows track paths for:
- Genome/Reference Files: transcriptome FASTA, kallisto/salmon indexes, GFF annotations
- Sample Metadata: TSV files with BioProject/BioSample IDs, run accessions
- Raw Data: SRA files, FASTQ files (downloaded from NCBI/ENA)
- Quantification Results: abundance.tsv files per sample
- Workflow Outputs: Merged matrices, normalized data, plots
- Logs and Manifests: Execution logs, workflow state tracking
File paths are stored in:
| Location Type | File Format | Purpose |
|---|---|---|
| YAML Configuration | .yaml |
Primary source of all directory paths |
| Metadata TSV | .tsv |
Sample-level accessions and URLs (generated by amalgkit) |
| Manifest JSONL | .jsonl |
Workflow execution history with parameters |
| Progress State JSON | .json |
Real-time progress tracking per species |
| Python Dataclass | AmalgkitWorkflowConfig |
Runtime configuration object |
config/amalgkit/amalgkit_{species_name}.yaml
Example: config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml
# Base Directories (ALL paths relative to repository root)
work_dir: output/amalgkit/pogonomyrmex_barbatus/work
log_dir: output/amalgkit/pogonomyrmex_barbatus/logs
threads: 16
# Species Configuration
species_list:
- Pogonomyrmex_barbatus
# Genome/Reference Configuration
genome:
accession: GCF_000187915.1 # ← Genome assembly ID stored here
assembly_name: Pbar_UMD_V03
annotation_release: 101
dest_dir: output/amalgkit/pogonomyrmex_barbatus/genome # ← Genome files directory
# FTP URL for genome files
ftp_url: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/187/915/GCF_000187915.1_Pbar_UMD_V03/
# Specific genome files (stored as filename references)
files:
genomic_fasta: GCF_000187915.1_Pbar_UMD_V03_genomic.fna.gz
transcriptome_fasta: GCF_000187915.1_Pbar_UMD_V03_rna_from_genomic.fna.gz # ← Key file!
cds_fasta: GCF_000187915.1_Pbar_UMD_V03_cds_from_genomic.fna.gz
protein_fasta: GCF_000187915.1_Pbar_UMD_V03_protein.faa.gz
annotation_gff: GCF_000187915.1_Pbar_UMD_V03_genomic.gff.gz
annotation_gtf: GCF_000187915.1_Pbar_UMD_V03_genomic.gtf.gz
# Per-Step Directories (amalgkit subcommands)
steps:
metadata:
out_dir: output/amalgkit/pogonomyrmex_barbatus/work # ← Metadata TSV written here
search_string: '"Pogonomyrmex barbatus"[Organism] AND RNA-Seq[Strategy]'
integrate:
fastq_dir: output/amalgkit/pogonomyrmex_barbatus/fastq # ← Local FASTQ directory
getfastq:
out_dir: output/amalgkit/pogonomyrmex_barbatus/fastq # ← Downloaded FASTQs
threads: 24
num_download_workers: 16 # Parallel ENA download workers
max_bp: 4000000000 # Skip samples >4B bases
quant:
out_dir: output/amalgkit/pogonomyrmex_barbatus/quant # ← Quantification results
threads: 24
merge:
out: output/amalgkit/pogonomyrmex_barbatus/merged/merged_abundance.tsv # ← Merged matrix
out_dir: output/amalgkit/pogonomyrmex_barbatus/merged
cstmm:
out_dir: output/amalgkit/pogonomyrmex_barbatus/cstmm # ← Normalized data
curate:
out_dir: output/amalgkit/pogonomyrmex_barbatus/curate # ← Curated data
csca:
out_dir: output/amalgkit/pogonomyrmex_barbatus/csca # ← Correlation plots| Path Type | Key Name | Resolved To | Example |
|---|---|---|---|
| Work directory | work_dir |
Path object |
output/amalgkit/pbarbatus/work |
| Log directory | log_dir |
Path object |
output/amalgkit/pbarbatus/logs |
| Genome directory | genome.dest_dir |
Path object |
output/amalgkit/pbarbatus/genome |
| FASTQ directory | steps.getfastq.out_dir |
String (passed to amalgkit) | output/amalgkit/pbarbatus/fastq |
| Quant directory | steps.quant.out_dir |
String (passed to amalgkit) | output/amalgkit/pbarbatus/quant |
| Merge output file | steps.merge.out |
String (passed to amalgkit) | output/.../merged/merged_abundance.tsv |
File: src/metainformant/rna/workflow.py
from metainformant.rna.engine.workflow import load_workflow_config
config = load_workflow_config("config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml")
# Config object structure (AmalgkitWorkflowConfig dataclass)
config.work_dir # Path: output/amalgkit/pogonomyrmex_barbatus/work
config.log_dir # Path: output/amalgkit/pogonomyrmex_barbatus/logs
config.genome # Dict with genome configuration
config.genome["dest_dir"] # Path to genome files
config.per_step # Dict with per-step parametersLocation:
{work_dir}/metadata/metadata.tsv
Example:
output/amalgkit/pogonomyrmex_barbatus/work/metadata/metadata.tsv
TSV file with columns including:
| Column | Description | Example Value |
|---|---|---|
run |
SRA run accession | SRR1234567 |
bioproject |
BioProject ID | PRJNA123456 |
biosample |
BioSample ID | SAMN12345678 |
scientific_name |
Species name | Pogonomyrmex barbatus |
experiment |
SRA experiment ID | SRX123456 |
platform |
Sequencing platform | ILLUMINA |
model |
Sequencer model | Illumina HiSeq 2000 |
library_layout |
Single/Paired | PAIRED |
library_strategy |
Sequencing strategy | RNA-Seq |
tissue |
Tissue type (if available) | whole body |
ena_url |
ENA download URL | ftp://ftp.sra.ebi.ac.uk/... |
sra_url |
NCBI SRA URL | https://sra-download.ncbi.nlm.nih.gov/... |
Key Points:
- Metadata TSV does NOT store local file paths - only remote URLs
- Local FASTQ paths are derived from run accessions:
{getfastq_out_dir}/getfastq/{run}/{run}_1.fastq.gz- Important: The
getfastqstep automatically creates agetfastq/subdirectory within the specifiedout_dir - Example: If
getfastq.out_dir: output/amalgkit/{species}/fastq, then FASTQ files are inoutput/amalgkit/{species}/fastq/getfastq/{run}/
- Important: The
- Local quant paths are derived from run accessions:
{quant_out_dir}/quant/{run}/abundance.tsv- Important:
quant.out_dirshould be set towork_dir(not a separate quant_dir) so quant can find getfastq output in{out_dir}/getfastq/ - Example: If
quant.out_dir: output/amalgkit/{species}/work, then quant output is inoutput/amalgkit/{species}/work/quant/{run}/abundance.tsv
- Important:
from metainformant.core.io import read_delimited
metadata_rows = list(read_delimited(
"output/amalgkit/pogonomyrmex_barbatus/work/metadata/metadata.tsv",
delimiter="\t"
))
# Each row is a dict:
for row in metadata_rows:
run_id = row["run"] # SRR1234567
ena_url = row.get("ena_url", "") # FTP URL for download
biosample = row.get("biosample", "") # SAMN12345678Location:
{work_dir}/amalgkit.manifest.jsonl
Example:
output/amalgkit/pogonomyrmex_barbatus/work/amalgkit.manifest.jsonl
Format: JSONL (one JSON object per line)
Content: Each line records one step execution:
{
"step": "getfastq",
"return_code": 0,
"started_utc": "2025-11-15T10:30:00Z",
"finished_utc": "2025-11-15T12:45:00Z",
"duration_seconds": 8100.5,
"work_dir": "output/amalgkit/pogonomyrmex_barbatus/work",
"log_dir": "output/amalgkit/pogonomyrmex_barbatus/logs",
"params": {
"out_dir": "output/amalgkit/pogonomyrmex_barbatus/fastq",
"threads": 24,
"aws": "yes"
},
"command": "amalgkit getfastq --out_dir output/... --threads 24 --aws yes"
}Paths Stored:
work_dir: Workflow working directorylog_dir: Log file directoryparams.out_dir: Step-specific output directorycommand: Full amalgkit command with all paths
Location:
{work_dir}/progress_state.json
Example:
output/amalgkit/pogonomyrmex_barbatus/work/progress_state.json
Format: JSON
Content: Per-sample progress tracking:
{
"species": "Pogonomyrmex_barbatus",
"work_dir": "output/amalgkit/pogonomyrmex_barbatus/work",
"fastq_dir": "output/amalgkit/pogonomyrmex_barbatus/fastq",
"quant_dir": "output/amalgkit/pogonomyrmex_barbatus/work/quant", # When quant.out_dir = work_dir
"metadata_path": "output/amalgkit/pogonomyrmex_barbatus/work/metadata/metadata.tsv",
"last_updated": "2025-11-15T12:45:00Z",
"total_samples": 83,
"samples": {
"SRR1234567": {
"state": "quantified",
"fastq_exists": false,
"quant_exists": true,
"abundance_file": "output/amalgkit/pogonomyrmex_barbatus/work/quant/SRR1234567/abundance.tsv",
"last_modified": "2025-11-15T11:20:00Z"
},
"SRR1234568": {
"state": "downloaded",
"fastq_exists": true,
"quant_exists": false,
"fastq_files": [
"output/amalgkit/pogonomyrmex_barbatus/fastq/getfastq/SRR1234568/SRR1234568_1.fastq.gz",
"output/amalgkit/pogonomyrmex_barbatus/fastq/getfastq/SRR1234568/SRR1234568_2.fastq.gz"
],
"last_modified": "2025-11-15T12:30:00Z"
}
}
}Paths Stored:
work_dir: Workflow directoryfastq_dir: FASTQ directoryquant_dir: Quantification directorymetadata_path: Metadata TSV pathsamples[].abundance_file: Per-sample abundance file pathsamples[].fastq_files: Per-sample FASTQ file paths
File: src/metainformant/rna/engine/progress_tracker.py
from metainformant.rna.engine.progress_tracker import ProgressTracker
tracker = ProgressTracker(
species="Pogonomyrmex_barbatus",
work_dir=Path("output/amalgkit/pogonomyrmex_barbatus/work")
)
# Access stored paths
state = tracker.get_state()
print(state["work_dir"]) # output/amalgkit/pogonomyrmex_barbatus/work
print(state["fastq_dir"]) # output/amalgkit/pogonomyrmex_barbatus/fastq
print(state["quant_dir"]) # output/amalgkit/pogonomyrmex_barbatus/quantoutput/amalgkit/{species}/
genome/ ← Genome/reference files
ncbi_dataset_api_extracted/
ncbi_dataset/
data/
{accession}/
genomic.fna.gz ← Genomic FASTA
rna.fna.gz ← Transcriptome FASTA
cds.fna.gz ← CDS FASTA
protein.faa.gz ← Protein FASTA
genomic.gff.gz ← GFF annotation
genomic.gtf.gz ← GTF annotation
download_record.json ← Download metadata
work/ ← Working directory (used as quant.out_dir)
fasta/
{Species_Name}_rna.fasta ← Prepared transcriptome
index/
{Species_Name}_transcripts.idx ← Kallisto index
metadata/
metadata.tsv ← Sample metadata (PRIMARY)
metadata_selected.tsv ← Filtered samples
metadata_integrated.tsv ← With local FASTQs
config_base/
{species}_config.tsv ← Amalgkit config files
quant/ ← Quantification results (when quant.out_dir = work_dir)
SRR1234567/
abundance.tsv ← Kallisto/Salmon output
abundance.h5 ← Binary abundance
run_info.json ← Run metadata
SRR1234568/
abundance.tsv
getfastq/ ← Symlink to fastq/getfastq (if needed for quant)
SRR1234567/ ← Points to fastq/getfastq/SRR1234567/
amalgkit.manifest.jsonl ← Execution history
progress_state.json ← Progress tracking
fastq/ ← Raw FASTQ files (getfastq out_dir)
getfastq/ ← Automatically created by amalgkit getfastq
SRR1234567/
SRR1234567_1.fastq.gz ← Forward reads
SRR1234567_2.fastq.gz ← Reverse reads
SRR1234568/
SRR1234568_1.fastq.gz
SRR1234568_2.fastq.gz
merged/ ← Merged expression matrix
merged_abundance.tsv ← All samples combined
merged_tpm.tsv ← TPM values
cstmm/ ← Cross-species TMM normalized
cstmm_normalized.tsv
curate/ ← Curated data
curated_expression.tsv
csca/ ← Cross-species correlation
correlation_matrix.tsv
correlation_plots.pdf
logs/ ← Execution logs
20251115T103000Z.metadata.stdout.log
20251115T103000Z.metadata.stderr.log
20251115T104500Z.getfastq.stdout.log
20251115T104500Z.getfastq.stderr.log
| File Type | Path Template | Example |
|---|---|---|
| Metadata TSV | {work_dir}/metadata/metadata.tsv |
output/.../work/metadata/metadata.tsv |
| Transcriptome FASTA | {work_dir}/fasta/{Species_Name}_rna.fasta |
output/.../work/fasta/Pogonomyrmex_barbatus_rna.fasta |
| Kallisto Index | {work_dir}/index/{Species_Name}_transcripts.idx |
output/.../work/index/Pogonomyrmex_barbatus_transcripts.idx |
| Sample FASTQ (forward) | {fastq_dir}/getfastq/{run}/{run}_1.fastq.gz |
output/.../fastq/getfastq/SRR1234567/SRR1234567_1.fastq.gz |
| Sample FASTQ (reverse) | {fastq_dir}/getfastq/{run}/{run}_2.fastq.gz |
output/.../fastq/getfastq/SRR1234567/SRR1234567_2.fastq.gz |
| Sample Abundance | {quant_out_dir}/quant/{run}/abundance.tsv |
output/.../work/quant/SRR1234567/abundance.tsv (when quant.out_dir = work_dir) |
| Merged Matrix | {merge_out_dir}/merged_abundance.tsv |
output/.../merged/merged_abundance.tsv |
| Workflow Manifest | {work_dir}/amalgkit.manifest.jsonl |
output/.../work/amalgkit.manifest.jsonl |
| Progress State | {work_dir}/progress_state.json |
output/.../work/progress_state.json |
File: src/metainformant/rna/workflow.py
def load_workflow_config(config_path: str | Path) -> AmalgkitWorkflowConfig:
"""Load and validate workflow configuration from YAML file.
Path Resolution:
1. Relative paths in YAML are resolved relative to repository root
2. work_dir, log_dir, genome.dest_dir are converted to Path objects
3. Per-step out_dir values remain as strings (passed to amalgkit CLI)
"""
# Load YAML
raw_config = load_mapping_from_file(config_path)
# Resolve work_dir relative to repo root
work_dir = Path(raw_config["work_dir"])
if not work_dir.is_absolute():
work_dir = REPO_ROOT / work_dir
# Resolve log_dir
log_dir = Path(raw_config.get("log_dir", work_dir / "logs"))
if not log_dir.is_absolute():
log_dir = REPO_ROOT / log_dir
# Resolve genome dest_dir
genome_config = raw_config.get("genome", {})
dest_dir = Path(genome_config.get("dest_dir", work_dir.parent / "genome"))
if not dest_dir.is_absolute():
dest_dir = REPO_ROOT / dest_dir
genome_config["dest_dir"] = dest_dir
return AmalgkitWorkflowConfig(
work_dir=work_dir,
log_dir=log_dir,
genome=genome_config,
per_step=raw_config.get("steps", {}),
# ...
)File: src/metainformant/rna/genome_prep.py
def get_expected_index_path(work_dir: Path, species_name: str) -> Path:
"""Get expected kallisto index path for a species.
Args:
work_dir: Work directory from config
species_name: Species name with underscores (e.g., "Pogonomyrmex_barbatus")
Returns:
Expected index path: {work_dir}/index/{species_name}_transcripts.idx
"""
index_dir = work_dir / "index"
index_filename = f"{species_name}_transcripts.idx"
return index_dir / index_filenameFile: src/metainformant/rna/engine/workflow_steps.py
def quantify_sample(
sample_id: str,
metadata_rows: list[dict[str, Any]],
quant_params: Mapping[str, Any],
# ...
) -> tuple[bool, str, Path | None]:
"""Quantify a single sample.
Path Construction:
- quant_dir from quant_params["out_dir"]
- abundance_file = {quant_dir}/{sample_id}/abundance.tsv
"""
quant_dir = Path(quant_params.get("out_dir", "."))
abundance_file = quant_dir / sample_id / "abundance.tsv"
if abundance_file.exists():
return True, f"Sample {sample_id} already quantified", abundance_file
# ...File: src/metainformant/rna/workflow.py
def plan_workflow_steps(config: AmalgkitWorkflowConfig) -> dict[str, dict]:
"""Plan workflow steps with default paths.
Default Paths (if not in config):
- metadata.out_dir = work_dir
- integrate.out_dir = work_dir
- integrate.fastq_dir = work_dir / "fastq"
- config.out_dir = work_dir
- select.out_dir = work_dir
- select.config_dir = work_dir / "config_base"
- getfastq.out_dir = work_dir / "fastq"
- quant.out_dir = work_dir / "quant"
- merge.out_dir = work_dir.parent / "merged"
- cstmm.out_dir = work_dir / "cstmm"
- curate.out_dir = work_dir / "curate"
- csca.out_dir = work_dir / "csca"
- sanity.out_dir = work_dir
"""
defaults = {
"integrate": {
"out_dir": str(config.work_dir),
"fastq_dir": str(config.work_dir / "fastq")
},
"getfastq": {"out_dir": str(config.work_dir / "fastq")},
"quant": {"out_dir": str(config.work_dir / "quant")},
"merge": {"out_dir": str(config.work_dir.parent / "merged")},
# ...
}
# Merge with config.per_stepFile: tests/test_rna_workflow_config.py
def test_load_workflow_config_resolves_paths(tmp_path: Path):
"""Test that load_workflow_config resolves relative paths."""
config_file = tmp_path / "config.yaml"
config_file.write_text("""
work_dir: output/test/work
log_dir: output/test/logs
threads: 4
species_list:
- Test_species
genome:
accession: GCF_000000000.1
dest_dir: output/test/genome
""")
config = load_workflow_config(config_file)
# Verify paths are resolved
assert config.work_dir.is_absolute()
assert "output/test/work" in str(config.work_dir)
assert config.log_dir.is_absolute()
assert config.genome["dest_dir"].is_absolute()File: tests/test_rna_genome_prep.py
def test_get_expected_index_path():
"""Test kallisto index path construction."""
work_dir = Path("/path/to/work")
species_name = "Pogonomyrmex_barbatus"
index_path = get_expected_index_path(work_dir, species_name)
assert index_path == work_dir / "index" / "Pogonomyrmex_barbatus_transcripts.idx"File: tests/test_rna_cleanup.py
def test_find_partial_downloads(tmp_path: Path):
"""Test finding partial downloads by checking fastq_dir vs quant_dir."""
fastq_dir = tmp_path / "fastq"
quant_dir = tmp_path / "quant"
fastq_dir.mkdir(parents=True)
quant_dir.mkdir(parents=True)
# Create sample with FASTQ but no quant
sample_dir = fastq_dir / "getfastq" / "SRR123"
sample_dir.mkdir(parents=True)
(sample_dir / "SRR123_1.fastq.gz").write_text("test")
partial = cleanup.find_partial_downloads(fastq_dir, quant_dir)
assert len(partial) == 1
assert partial[0][0] == "SRR123"
assert partial[0][1] == sample_dir # Path stored in resultFile: tests/test_rna_monitoring.py
def test_count_quantified_samples(tmp_path: Path):
"""Test counting quantified samples by checking quant_dir."""
work_dir_abs = (tmp_path / "work").resolve()
config_data = {"work_dir": str(work_dir_abs), "threads": 4}
config_file = tmp_path / "config.yaml"
dump_json(config_data, config_file)
# Create metadata
metadata_dir = work_dir_abs / "metadata"
metadata_dir.mkdir(parents=True)
metadata_file = metadata_dir / "metadata.tsv"
write_delimited(
[{"run": "SRR123"}, {"run": "SRR456"}],
metadata_file,
delimiter="\t"
)
# Create quantified samples
quant_dir = work_dir_abs / "quant"
quant_dir.mkdir(parents=True)
(quant_dir / "SRR123" / "abundance.tsv").write_text("test")
(quant_dir / "SRR456" / "abundance.tsv").write_text("test")
quantified, total = monitoring.count_quantified_samples(config_file)
assert quantified == 2 # Both paths existConfiguration: config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml
work_dir: output/amalgkit/pogonomyrmex_barbatus/work
steps:
getfastq:
out_dir: output/amalgkit/pogonomyrmex_barbatus/fastq
quant:
out_dir: output/amalgkit/pogonomyrmex_barbatus/quantStep 1: Metadata Retrieval
Amalgkit metadata step generates:
output/amalgkit/pogonomyrmex_barbatus/work/metadata/metadata.tsv
Content includes:
run bioproject biosample ena_url
SRR1234567 PRJNA123456 SAMN12345678 ftp://ftp.sra.ebi.ac.uk/...
Step 2: FASTQ Download
From metadata row with run=SRR1234567, download to:
output/amalgkit/pogonomyrmex_barbatus/fastq/getfastq/SRR1234567/SRR1234567_1.fastq.gz
output/amalgkit/pogonomyrmex_barbatus/fastq/getfastq/SRR1234567/SRR1234567_2.fastq.gz
Step 3: Quantification
Read FASTQs from above, write abundance to:
output/amalgkit/pogonomyrmex_barbatus/quant/SRR1234567/abundance.tsv
Step 4: Progress Tracking
Update progress state at:
output/amalgkit/pogonomyrmex_barbatus/work/progress_state.json
With content:
{
"samples": {
"SRR1234567": {
"state": "quantified",
"fastq_exists": false,
"quant_exists": true,
"abundance_file": "output/amalgkit/pogonomyrmex_barbatus/quant/SRR1234567/abundance.tsv"
}
}
}Configuration:
work_dir: output/amalgkit/pogonomyrmex_barbatus/work
genome:
accession: GCF_000187915.1
dest_dir: output/amalgkit/pogonomyrmex_barbatus/genome
files:
transcriptome_fasta: GCF_000187915.1_Pbar_UMD_V03_rna_from_genomic.fna.gzStep 1: Genome Download
NCBI datasets downloads to:
output/amalgkit/pogonomyrmex_barbatus/genome/ncbi_dataset_api_extracted/ncbi_dataset/data/GCF_000187915.1/
genomic.fna.gz
rna.fna.gz ← Transcriptome FASTA
cds.fna.gz
protein.faa.gz
genomic.gff.gz
Step 2: Transcriptome Preparation
prepare_transcriptome_for_kallisto() finds rna.fna.gz and copies to:
output/amalgkit/pogonomyrmex_barbatus/work/fasta/Pogonomyrmex_barbatus_rna.fasta
Step 3: Index Building
build_kallisto_index() creates:
output/amalgkit/pogonomyrmex_barbatus/work/index/Pogonomyrmex_barbatus_transcripts.idx
Verification:
get_expected_index_path() checks for index at:
work_dir = Path("output/amalgkit/pogonomyrmex_barbatus/work")
species = "Pogonomyrmex_barbatus"
index_path = get_expected_index_path(work_dir, species)
# Returns: output/amalgkit/pogonomyrmex_barbatus/work/index/Pogonomyrmex_barbatus_transcripts.idxfrom metainformant.rna.engine.workflow import load_workflow_config
from pathlib import Path
# Load config
config = load_workflow_config("config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml")
# Extract all key paths
paths = {
"work_dir": config.work_dir,
"log_dir": config.log_dir,
"genome_dir": config.genome["dest_dir"],
"metadata_file": config.work_dir / "metadata" / "metadata.tsv",
"transcriptome": config.work_dir / "fasta" / f"{config.species_list[0]}_rna.fasta",
"kallisto_index": config.work_dir / "index" / f"{config.species_list[0]}_transcripts.idx",
"fastq_dir": Path(config.per_step["getfastq"]["out_dir"]),
"quant_dir": Path(config.per_step["quant"]["out_dir"]),
"merged_file": Path(config.per_step["merge"]["out"]),
"manifest": config.work_dir / "amalgkit.manifest.jsonl",
"progress_state": config.work_dir / "progress_state.json",
}
for name, path in paths.items():
print(f"{name}: {path}")Output:
work_dir: output/amalgkit/pogonomyrmex_barbatus/work
log_dir: output/amalgkit/pogonomyrmex_barbatus/logs
genome_dir: output/amalgkit/pogonomyrmex_barbatus/genome
metadata_file: output/amalgkit/pogonomyrmex_barbatus/work/metadata/metadata.tsv
transcriptome: output/amalgkit/pogonomyrmex_barbatus/work/fasta/Pogonomyrmex_barbatus_rna.fasta
kallisto_index: output/amalgkit/pogonomyrmex_barbatus/work/index/Pogonomyrmex_barbatus_transcripts.idx
fastq_dir: output/amalgkit/pogonomyrmex_barbatus/fastq
quant_dir: output/amalgkit/pogonomyrmex_barbatus/quant
merged_file: output/amalgkit/pogonomyrmex_barbatus/merged/merged_abundance.tsv
manifest: output/amalgkit/pogonomyrmex_barbatus/work/amalgkit.manifest.jsonl
progress_state: output/amalgkit/pogonomyrmex_barbatus/work/progress_state.json
All directory paths originate from YAML configuration files:
| Path Category | Stored In | Format | Example |
|---|---|---|---|
| Base directories | work_dir, log_dir |
Path | output/amalgkit/{species}/work |
| Genome files | genome.dest_dir, genome.files.* |
Path + filenames | output/amalgkit/{species}/genome |
| Step outputs | steps.{step}.out_dir |
String | output/amalgkit/{species}/fastq |
| Merged output | steps.merge.out |
String | output/amalgkit/{species}/merged/merged_abundance.tsv |
Paths are tracked during execution:
| Storage Location | Format | Content |
|---|---|---|
| Metadata TSV | TSV | Sample accessions, remote URLs (not local paths) |
| Manifest JSONL | JSONL | Per-step execution history with work_dir, log_dir, params |
| Progress JSON | JSON | Per-sample state with derived FASTQ/quant paths |
Paths are constructed algorithmically:
# Sample FASTQ path
fastq_path = f"{fastq_dir}/getfastq/{run_id}/{run_id}_1.fastq.gz"
# Sample abundance path
abundance_path = f"{quant_dir}/{run_id}/abundance.tsv"
# Kallisto index path
index_path = f"{work_dir}/index/{species_name}_transcripts.idx"
# Transcriptome FASTA path
transcriptome_path = f"{work_dir}/fasta/{species_name}_rna.fasta"All path storage and construction logic is: -Documented in this file -Tested with 31 RNA test files -Production-validated with 7,451+ samples -Type-safe using Path objects and dataclasses
End of Documentation