Core Infrastructure: Overview

Shared infrastructure and utilities used by all METAINFORMANT domain modules. The core package provides components for I/O, configuration, logging, parallel execution, caching, database connectivity, and workflow orchestration.

Quick Navigation

  • Getting Started — 5-minute tutorial with complete pipeline example
  • Architecture — System design, component interactions, and principles

Core Components

| Component | Description | Documentation |
|---|---|---|
| I/O Operations | File I/O, JSON/CSV/TSV/YAML, downloads, atomic writes | core.io |
| Configuration | Config loading, environment overrides, merging | core.utils.config |
| Path Handling | Path resolution, security, sanitization | core.io.paths |
| Logging | Structured logging, metadata, environment config | core.utils.logging |
| Caching | JSON cache with TTL, thread-safe operations | core.io.cache |
| Download | Robust HTTP/FTP downloads, retry, resume, heartbeat | core.io.download |
| Parallel Execution | Thread/process pools, resource-aware workers | core.execution.parallel |
| Database | PostgreSQL connectivity, connection pooling | core.data.db |
| Hashing | SHA256 file and content hashing | core.utils.hash |
| Text Processing | Text cleaning, slugify, gene name standardization | core.utils.text |
| Workflow | DAG orchestration, config-driven pipelines | core.execution.workflow |

Module Structure

src/metainformant/core/
├── io/                    # Input/Output operations
│   ├── io.py             # Core file I/O (JSON, CSV, YAML, Parquet)
│   ├── paths.py          # Path utilities and security
│   ├── cache.py          # JSON caching with TTL
│   ├── download.py       # Download with retry/resume/heartbeat
│   ├── atomic.py         # Atomic file operations
│   ├── checksums.py      # Checksum verification
│   └── disk.py           # Disk space management
├── utils/                # Utility functions
│   ├── logging.py        # Structured logging
│   ├── config.py         # Configuration loader
│   ├── hash.py           # SHA256 hashing
│   ├── text.py           # Text processing
│   ├── errors.py         # Error hierarchy
│   └── timing.py         # Performance timing
├── execution/            # Execution engines
│   ├── parallel.py       # Parallel execution utilities
│   ├── workflow.py       # Workflow orchestration
│   └── discovery.py      # Symbol discovery
├── data/                 # Data layer
│   ├── db.py             # PostgreSQL integration
│   └── validation.py     # Validation utilities
├── engine/               # Pipeline engines
│   └── workflow_manager.py
└── ui/                   # User interfaces
    └── tui.py            # Terminal UI

Core Design Principles

1. Zero Mocking

All tests use real implementations, with no mock objects, so the test suite exercises the same code paths that run in production.

2. Atomicity

All file writes use atomic replacement (temp file → rename) to prevent corruption.
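The temp file → rename pattern can be sketched with the standard library alone. This is a minimal illustration, not the code in atomic.py; the function name `atomic_write_text` is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write text to a temp file next to `path`, then rename it into place.

    Hypothetical helper for illustration; not the METAINFORMANT API.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file is created in the target directory so that the final
    # rename stays on one filesystem, which is what makes it atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)  # readers see the old or the new file, never a partial one
    except BaseException:
        # Remove the partial temp file if anything failed before the rename.
        Path(tmp_name).unlink(missing_ok=True)
        raise
```

A reader that opens the target while a write is in progress still sees the previous complete file, because the new content only becomes visible at the `os.replace` call.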

3. Observability

  • Consistent log format: TIMESTAMP | LEVEL | MODULE | MESSAGE
  • Optional structured metadata via log_with_metadata()
  • Download heartbeats for progress tracking
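The log format above maps directly onto a standard-library `logging.Formatter`. The sketch below assumes that MODULE corresponds to the logger name; `make_logger` and `LOG_FORMAT` are hypothetical names, not the signatures in core.utils.logging.

```python
import logging

# TIMESTAMP | LEVEL | MODULE | MESSAGE, using standard LogRecord attributes.
LOG_FORMAT = "%(asctime)s | %(levelname)s | %(name)s | %(message)s"


def make_logger(name: str, level: str = "INFO") -> logging.Logger:
    """Return a logger that writes pipe-delimited lines to stderr.

    Hypothetical helper for illustration only.
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Attach the handler once so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logger.addHandler(handler)
    return logger
```

Keeping a single delimiter between fields makes the output easy to split with standard text tools when scanning pipeline logs.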

4. Security

  • Path traversal prevention (is_safe_path())
  • Filename sanitization
  • SQL injection protection (sanitize_connection_params())
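Path traversal prevention usually comes down to resolving a candidate path and checking that it still sits under an allowed base directory. The sketch below shows that idea under stated assumptions: `is_within_base` is a hypothetical name, and the real `is_safe_path()` may take different arguments or apply additional rules.

```python
from pathlib import Path


def is_within_base(candidate: Path, base: Path) -> bool:
    """Return True if `candidate`, interpreted relative to `base`, stays inside it.

    Hypothetical helper for illustration; not the signature of is_safe_path().
    """
    base_resolved = base.resolve()
    # Joining an absolute candidate discards the base, so absolute paths
    # outside the base are rejected by the containment check below.
    target = (base_resolved / candidate).resolve()
    return target == base_resolved or base_resolved in target.parents
```

Resolving before comparing matters: a plain string-prefix check would accept `data/../../etc/passwd`, while the resolved form collapses the `..` segments (and any symlinks) before the containment test.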

5. Portability

  • Pure pathlib.Path (no os.path)
  • UTF-8 everywhere
  • Minimal required dependencies; heavier external packages are optional

Usage Pattern

# Standard import pattern
from pathlib import Path

from metainformant.core import io, cache, paths
from metainformant.core.utils import logging, config

# Placeholders for this example; replace with real values
url = "https://example.org/data.json"


def process(item):
    """Placeholder for per-record work."""
    print(item)


# Get logger
logger = logging.get_logger(__name__)

# Ensure directories
output = paths.ensure_directory(Path("output"))
cache_dir = paths.ensure_directory(output / "cache")

# Load configuration
cfg = config.load_mapping_from_file("config.yaml")

# Download with caching: reuse the cached copy if it is still fresh
data = cache.load_cached_json(cache_dir, "key", ttl_seconds=3600)
if data is None:
    data = io.download_json(url)
    cache.cache_json(cache_dir, "key", data)

# Process files
for item in io.read_jsonl("data.jsonl"):
    process(item)

logger.info("Pipeline complete")
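To make the TTL behaviour in the caching step concrete, here is a minimal standard-library sketch of a JSON cache keyed by file name and expired by modification time. It is an assumption about how such a cache can work, not the code in cache.py; `store_json` and `load_json_if_fresh` are hypothetical names, and thread safety is left out.

```python
import json
import time
from pathlib import Path
from typing import Any, Optional


def store_json(cache_dir: Path, key: str, value: Any) -> Path:
    """Write `value` as JSON under cache_dir/<key>.json (illustrative only)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{key}.json"
    # A production cache would use an atomic write here; kept simple on purpose.
    path.write_text(json.dumps(value), encoding="utf-8")
    return path


def load_json_if_fresh(cache_dir: Path, key: str, ttl_seconds: float) -> Optional[Any]:
    """Return the cached value, or None if it is missing or older than the TTL."""
    path = cache_dir / f"{key}.json"
    if not path.exists():
        return None
    age = time.time() - path.stat().st_mtime
    if age > ttl_seconds:
        return None  # expired: the caller should refetch and store again
    return json.loads(path.read_text(encoding="utf-8"))
```

Returning `None` for both "missing" and "expired" lets the caller handle both cases with a single refetch branch, as in the usage pattern above.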

Environment Variables

| Variable | Purpose | Default |
|---|---|---|
| CORE_LOG_LEVEL | Logging level (DEBUG, INFO, WARNING, ERROR) | INFO |
| AK_THREADS | Override default thread count | CPU-dependent |
| AK_WORK_DIR | Working directory for outputs | output/ |
| AK_LOG_DIR | Directory for log files | logs/ |
| PG_HOST | PostgreSQL host | localhost |
| PG_PORT | PostgreSQL port | 5432 |
| PG_DATABASE | Database name | metainformant |
| PG_USER | Database user | postgres |
| PG_PASSWORD | Database password | (empty) |
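As an illustration of how the PostgreSQL variables and their defaults could be resolved, the sketch below reads them into a small settings object. The variable names and defaults come from the table above; the `PostgresSettings` class and `postgres_settings_from_env` function are hypothetical and are not the API of core.data.db.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PostgresSettings:
    """Connection settings resolved from the environment (illustrative only)."""

    host: str
    port: int
    database: str
    user: str
    password: str


def postgres_settings_from_env(env: Mapping[str, str] = os.environ) -> PostgresSettings:
    """Read PG_* variables, falling back to the documented defaults."""
    return PostgresSettings(
        host=env.get("PG_HOST", "localhost"),
        port=int(env.get("PG_PORT", "5432")),  # env values are strings; convert once here
        database=env.get("PG_DATABASE", "metainformant"),
        user=env.get("PG_USER", "postgres"),
        password=env.get("PG_PASSWORD", ""),
    )
```

Passing the environment as a mapping keeps the function testable with a plain dict, without mocking `os.environ`.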

Sphinx Documentation

Build with:

bash scripts/package/uv_docs.sh

Or manually:

cd docs
sphinx-build -b html . _build

Contributing

When modifying core components:

  1. Add tests in tests/test_core_*.py (no mocks!)
  2. Update documentation in docs/core/*.md
  3. Follow conventions: pathlib.Path, type hints, get_logger(__name__)
  4. Check AGENTS.md: src/metainformant/core/AGENTS.md has agent-specific rules

Related Resources