Doppelganger is a package for building digital twin models and generating synthetic data from clinical data sources.
├── LICENSE
├── README.md          <- The top-level README for developers using this project.
├── config             <- JSON files for preprocessing and training configurations
├── doppelganger       <- Source code for use in this project.
│   ├── etl            <- Luigi pipeline steps for data processing
│   ├── models         <- Custom PyTorch model classes
│   ├── scores         <- Heart failure scoring functions
│   ├── train          <- PyTorch Lightning training modules
│   └── utils          <- Utility functions
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment,
│                         e.g. generated with `pip freeze > requirements.txt`
│
├── results            <- Outputs from training and processing (model .pth files, ROCs, AUCs, etc.)
│
├── scripts            <- Scripts for running the pipeline (e.g. run_amc.py)
│
└── setup.py           <- Makes the project pip-installable (`pip install -e .`) so the package can be imported
Step 1: Install uv
# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or with pip
pip install uv

Step 2: Clone and set up the project
git clone <repository-url>
cd Doppelganger
uv venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
uv pip install -r requirements.txt
uv pip install -e .
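A quick smoke test can confirm the editable install worked. Note that the importable package name used throughout the examples below is dpplgngr (this snippet assumes that name):

```python
# Quick import smoke test after an editable install.
# Assumes the package's import name is dpplgngr, as used in the examples below.
import dpplgngr
print("Doppelganger installed at:", dpplgngr.__file__)
```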
If you have a pre-built container image (doppelganger.sif), you can skip the installation and use it directly:
# Assumes doppelganger.sif is available in your working directory
apptainer shell doppelganger.sif
source /opt/python/bin/activate

The pipeline requires two main configuration files:
The ETL configuration file controls data preprocessing and transformation steps:
{
    "name": "name",
    "description": "dataset description",
    "absolute_path": "INPUT DATA FOLDER",
    "preprocessing": "WHERE WILL THE INTERMEDIATE AND OUTPUT FILES BE",
    "filename_type1.parquet": ["pseudo_id", "col1", "col2", "ref_date"],
    "filename_type2.parquet": ["pseudo_id", "value", {"type_id": {"type1_id": "name1", "type2_id": "name2"}}, ["date_column_name"]],
    "final_cols": ["pseudo_id", "col1", "col2", "name1", "name2", "ref_date"],
    "tuple_vals_after": ["name1", "name2"],
    "tuple_vals_anybefore": [],
    "ref_date": "ref_date",
    "InitTransforms": {},
    "PreTransforms": {},
    "MergedTransforms": {}
}
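One plausible reading of the `filename_type2.parquet` entry above is a long-format table (one row per `pseudo_id` and `type_id`) whose type IDs are pivoted into the wide columns `name1`/`name2` listed in `final_cols`. The sketch below illustrates that reading in pandas; it is an assumption about the spec, not documented behavior (the actual logic lives in `dpplgngr.etl`):

```python
# Hypothetical sketch: pivot long-format type_id rows into wide columns.
# The mapping mirrors the {"type_id": {...}} entry in the config above.
import pandas as pd

df = pd.read_parquet("filename_type2.parquet")  # pseudo_id, type_id, value, ...
mapping = {"type1_id": "name1", "type2_id": "name2"}
wide = (
    df.assign(name=df["type_id"].map(mapping))
      .pivot_table(index="pseudo_id", columns="name", values="value", aggfunc="last")
      .reset_index()
)
```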
Note that three types of transformations can be specified:

- InitTransforms: Applied to individual files on load
- PreTransforms: Applied to individual files after initial processing
- MergedTransforms: Applied to the final merged dataframe

See the functions in the `dpplgngr.utils.functions` module for available transformations, as sketched below.
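For example, a transform named in one of these blocks might be resolved against that module roughly as follows (a minimal sketch; the exact format of the transform entries is an assumption):

```python
# Minimal sketch: resolve a transform name from the ETL config against
# dpplgngr.utils.functions. The lookup scheme here is an assumption.
import dpplgngr.utils.functions as transform_fns

def resolve_transform(name: str):
    """Return the transform function registered under `name`."""
    fn = getattr(transform_fns, name, None)
    if fn is None:
        raise KeyError(f"No transform named {name!r} in dpplgngr.utils.functions")
    return fn
```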
The generation configuration file controls synthetic data generation parameters:
{
    "input_file": "FILE LOCATION",
    "working_dir": "OUTPUT DIRECTORY",
    "columns": ["col1", "col2", "name1", "name2"],
    "num_points": 10000,
    "synth_type": "GC"
}

Available synthesis types:
- GC: Gaussian Copula
- CTGAN: Conditional Tabular GAN
- TVAE: Tabular Variational Autoencoder
- RTF: RealTabFormer (GPT-2 based)
- GNN: Graph Neural Network (for relational data), coming soon
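The first three map onto single-table synthesizers from the SDV library, which the package uses for generation. As a rough illustration (not the package's actual internals), the `synth_type` value could be wired up like this in SDV 1.x:

```python
# Illustrative sketch (not dpplgngr's internals): mapping synth_type values
# from the generation config onto SDV 1.x single-table synthesizers.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import (
    GaussianCopulaSynthesizer,
    CTGANSynthesizer,
    TVAESynthesizer,
)

SYNTHESIZERS = {
    "GC": GaussianCopulaSynthesizer,
    "CTGAN": CTGANSynthesizer,
    "TVAE": TVAESynthesizer,
}

def synthesize(real_df: pd.DataFrame, synth_type: str, num_points: int) -> pd.DataFrame:
    """Fit the chosen synthesizer on real data and sample num_points rows."""
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real_df)
    model = SYNTHESIZERS[synth_type](metadata)
    model.fit(real_df)
    return model.sample(num_rows=num_points)
```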
Option 1: Using uv environment
# Activate your environment
source venv/bin/activate
# Run the complete pipeline
python scripts/run_amc.py --etlconfig config/etl_amc_v3.json --genconfig config/synth.json

Option 2: Using the container
# Run complete pipeline in container
apptainer run doppelganger.sif etl config/etl_amc_v3.json config/synth.json

You can also run individual Luigi tasks:
1. Data Conversion and Preprocessing
import luigi
from dpplgngr.etl.prep_dataset_tabular import ConvertLargeFiles, PreProcess
# Convert and preprocess data
luigi.build([
    ConvertLargeFiles(etl_config="config/etl_amc_v3.json"),
    PreProcess(etl_config="config/etl_amc_v3.json")
], local_scheduler=True)

2. Synthetic Data Generation
import luigi
from dpplgngr.train.sdv import SDVGen
# Generate synthetic data
luigi.build([
    SDVGen(
        gen_config="config/synth.json",
        etl_config="config/etl_amc_v3.json"
    )
], local_scheduler=True)

3. Data Quality Audit
import luigi
from dpplgngr.scores.audit import SyntheticDataAudit
# Audit synthetic data quality
luigi.build([
    SyntheticDataAudit(
        gen_config="config/synth.json",
        etl_config="config/etl_amc_v3.json"
    )
], local_scheduler=True)

Individual steps using container:
# Data preprocessing only
apptainer exec doppelganger.sif python -c "
import luigi
from dpplgngr.etl.prep_dataset_tabular import ConvertLargeFiles, PreProcess
luigi.build([ConvertLargeFiles(etl_config='config/etl_amc_v3.json'), PreProcess(etl_config='config/etl_amc_v3.json')], local_scheduler=True)
"
# Synthetic data generation only
apptainer run doppelganger.sif synth config/synth.json config/etl_amc_v3.json
# Data audit only
apptainer run doppelganger.sif audit config/synth.json config/etl_amc_v3.json
# Run tests
apptainer run doppelganger.sif test

When using containers, bind your data directories:
# Bind local data directory to container
apptainer exec -B /local/data:/data -B /local/output:/output doppelganger.sif \
python scripts/run_amc.py --etlconfig /data/config/etl_amc_v3.json --genconfig /data/config/synth.json

The package includes a comprehensive auditing system for evaluating synthetic data quality using the SyntheticDataAudit Luigi task.
- SDV Quality Evaluation: Uses the Synthetic Data Vault (SDV) library to compute overall quality scores
- Missing Data Analysis: Compares missing value patterns between original and synthetic data
- Distribution Analysis: Visual comparison of numeric and categorical distributions
- Statistical Summary: Side-by-side comparison of descriptive statistics
- Correlation Analysis: Heatmaps showing correlation matrices and differences
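For instance, the overall SDV quality score could be computed along these lines (a sketch using SDV's public evaluation API; the audit task's actual implementation may differ):

```python
# Sketch: compute an overall quality score with SDV's evaluation API.
# Illustrative only; not necessarily the SyntheticDataAudit task's code.
from sdv.evaluation.single_table import evaluate_quality
from sdv.metadata import SingleTableMetadata

def sdv_quality_score(real_df, synth_df) -> float:
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real_df)
    report = evaluate_quality(real_data=real_df, synthetic_data=synth_df, metadata=metadata)
    return report.get_score()  # overall score between 0 and 1
```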
The audit functionality is integrated as a Luigi task that automatically depends on the synthetic data generation step:
from dpplgngr.scores.audit import SyntheticDataAudit
import luigi
# Run the audit task
luigi.build([
    SyntheticDataAudit(
        gen_config="config/synth.json",
        etl_config="config/etl.json"
    )
], local_scheduler=True)

Or use the provided script:
python run_audit.py

The audit task produces:
- `audit_report_{synth_type}.json`: Comprehensive quality metrics and statistics
- `audit_plots_{synth_type}/`: Directory containing visualization plots:
  - `missing_values_comparison.png`
  - `numeric_distributions.png`
  - `categorical_distribution_{column}.png`
  - `correlation_matrices.png`
  - `correlation_difference.png`
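As a rough illustration, the comparison behind `missing_values_comparison.png` could look like the following (hypothetical sketch, not the task's exact code):

```python
# Sketch: compare per-column missing-value rates between original and
# synthetic data and save a bar chart. Illustrative, not the task's code.
import pandas as pd
import matplotlib.pyplot as plt

def plot_missing_comparison(real_df: pd.DataFrame, synth_df: pd.DataFrame, out_path: str):
    rates = pd.DataFrame({
        "original": real_df.isna().mean(),
        "synthetic": synth_df.isna().mean(),
    })
    rates.plot.bar(figsize=(10, 4))
    plt.ylabel("fraction missing")
    plt.tight_layout()
    plt.savefig(out_path)
```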
The audit task uses the same configuration files as the synthetic data generation task:
- `gen_config`: Points to the synthesis configuration JSON
- `etl_config`: Points to the ETL configuration JSON
If you need to build the container from scratch:
# Build the container image (requires Apptainer/Singularity)
./build_container.sh
# Or specify a custom build directory
./build_container.sh /path/to/build/directory
# Force rebuild if container exists
./build_container.sh --force
# Skip tests during build
./build_container.sh --skip-tests

The build script (build_container.sh) provides several options:
- Automatic dependency checking
- Build directory specification
- Force rebuild option
- Container testing after build
- Usage examples and help
The Doppelganger package can be containerized using Apptainer (Singularity) for reproducible execution across different environments. The container uses uv for fast Python package management instead of conda, resulting in faster builds and smaller images.
# Build the container image (requires Apptainer/Singularity)
apptainer build doppelganger.sif doppelganger.def

The container build process:
- Uses Ubuntu 22.04 as the base image
- Installs system dependencies and Python 3.10
- Installs uv for fast Python package management
- Creates a virtual environment at `/opt/python`
- Installs all Python packages from `requirements-container.txt` using uv
- Copies the Doppelganger package and configurations
The container provides several convenient commands:
# Interactive shell
apptainer shell doppelganger.sif
# Run ETL pipeline
apptainer run doppelganger.sif etl [etl_config] [gen_config]
# Run synthetic data generation
apptainer run doppelganger.sif synth [gen_config] [etl_config]
# Run data audit
apptainer run doppelganger.sif audit [gen_config] [etl_config]
# Run test suite
apptainer run doppelganger.sif test
# Direct execution
apptainer exec doppelganger.sif python /opt/doppelganger/scripts/run_amc.py

Key directories inside the container:

- `/opt/doppelganger/`: Main application directory
- `/opt/python/`: Python virtual environment (managed by uv)
- `/opt/doppelganger/config/`: Configuration files
- `/opt/doppelganger/data/`: Data directory
- `/opt/doppelganger/output/`: Output directory
- `/opt/doppelganger/logs/`: Log files