Skip to content

MiraldiLab/InferelatorJL

Repository files navigation

InferelatorJL

A Julia package for inference of transcriptional regulatory networks (GRNs) from gene expression data and prior chromatin accessibility or binding information, using the mLASSO-StARS algorithm.

Pipeline diagram


Overview

InferelatorJL infers genome-scale gene regulatory networks (GRNs) by combining gene expression data with a prior network (e.g., from ATAC-seq or ChIP-seq) to identify transcription factor (TF) → target gene regulatory relationships. The method uses multi-scale LASSO with Stability Approach to Regularization Selection (mLASSO-StARS), a penalized regression framework in which the prior network biases edge selection and stability selection across bootstrap subsamples determines a data-driven sparsity level.

Key features:

  • Supports bulk RNA-seq, pseudobulk, and single-cell expression data
  • Estimates TF activity (TFA) via least squares as an alternative to raw TF mRNA
  • Builds separate networks for TFA and TF mRNA predictors, then combines them
  • Prior-weighted LASSO penalties reduce false positives for prior-supported edges
  • Network-level or gene-level instability thresholding
  • Built-in PR / ROC evaluation against gold-standard interaction sets
  • Fully multithreaded subsampling loop

Installation

Requirements: Julia ≥ 1.7, Python (for PyPlot/matplotlib)

# From the Julia REPL:
] add https://github.com/miraldilab/InferelatorJL.jl

# Install all dependencies:
] instantiate

For development / local installation:

] dev /path/to/InferelatorJL
] instantiate

Quick start

using InferelatorJL

# Step 1 — load expression data
data = loadData(exprFile, targFile, regFile)

# Steps 2–3 — merge degenerate TFs, build prior, estimate TFA
priorData, mergedTFs = loadPrior(data, priorFile)
estimateTFA(priorData, data; outputDir = dirOut)

# Step 3b (optional) — apply time-lag correction (time-series data only)
# applyTimeLag(priorData, data, timeLagFile, timeLag)

# Step 4 — infer GRN (TFA predictors)
buildNetwork(data, priorData;
             tfaMode   = true,
             outputDir = joinpath(dirOut, "TFA"))

# Step 5 — aggregate TFA + mRNA networks
aggregateNetworks([tfaEdges, mrnaEdges];
                  method    = :max,
                  outputDir = joinpath(dirOut, "Combined"))

# Step 6 — refine TFA using the consensus network as a new prior
refineTFA(combinedNetFile, data, mergedTFs; outputDir = dirOut)

# Step 7 — evaluate against a gold standard
evaluateNetwork(networkFile, goldStandard;
                gsRegsFile = regFile, targGeneFile = targFile, saveDir = dirPR)

See examples/interactive_pipeline.jl for a fully annotated step-by-step example.


Pipeline

The full pipeline has seven steps. Each step is a single function call.

Step Function Description
1 loadData Load expression matrix; filter to target genes and regulators
2–3 loadPrior + estimateTFA Process prior, merge degenerate TFs, estimate TF activity via least squares
3b (optional) applyTimeLag Adjust TFA and TF mRNA for the delay between TF transcription and protein activity (time-series data only)
4 buildNetwork Run mLASSO-StARS for one predictor mode (TFA or TF mRNA)
5 aggregateNetworks Combine TFA and mRNA edge lists into a consensus network
6 refineTFA Re-estimate TFA using the consensus network as a data-driven prior
7 evaluateNetwork Evaluate any network against gold standards (PR/ROC curves)

Steps 4–6 are typically run twice (TFA mode and TF mRNA mode) and the results aggregated in Step 5.


Input files

File Format Description
Expression matrix Tab-delimited TSV or Apache Arrow Genes × samples; genes in rows, samples in columns; first column is gene names
Target gene list Plain text, one gene per line Genes to model as regression targets
Regulator list Plain text, one TF per line Candidate transcription factors
Prior network Wide-matrix TSV (gene × TF, empty first header) Prior regulatory connections; rows = target genes, columns = TFs; values are edge weights (binary or continuous)
Prior penalties Same format as prior network Prior(s) used to set per-edge LASSO penalties; can be the same as prior network
TFA gene list Plain text, one gene per line Optional: restrict TFA estimation to a gene subset

Output files

All outputs are written under the directory specified by outputDir.

File Description
edges.tsv Full ranked edge table: TF, gene, signed quantile, stability, partial correlation, inPrior flag
edges_subset.tsv Top edges after applying the meanEdgesPerGene cap
instabOutMat.jld Serialised GrnData struct with full instability arrays across the λ grid
instability_diagnostic_network.png Two-panel plot: model size vs λ (left) and bStARS instability bounds vs λ (right)
instability_selection_network.png |Instability − target| vs λ with dot at the chosen λ
combined_<method>_sp.tsv Aggregated network — long/edge list format (TF, gene, score; no zeros)
combined_<method>.tsv Aggregated network — wide/matrix format (genes × TFs; used as prior input to refineTFA)

Key parameters

Parameter Default Description
totSS 80 Total bootstrap subsamples for instability estimation
subsampleFrac 0.63 Fraction of samples drawn per subsample
targetInstability 0.05 Instability threshold for λ selection
lambdaBias [0.5] Penalty reduction factor for prior-supported edges (0 = no prior, 1 = uniform)
meanEdgesPerGene 20 Maximum retained edges per target gene
instabilityLevel "Network" "Network": single λ for all genes; "Gene": per-gene λ
zScoreTFA true Z-score expression before TFA estimation
zScoreLASSO true Z-score expression before LASSO regression
method :max Network aggregation rule: :max, :mean, or :min stability
timeLagFile "" Path to 4-column TSV (sampleQ, timeQ, sampleP, timeP) for time-lag correction; "" skips step 3b
timeLag 0.0 Lag between TF mRNA and protein activity, in the same time units as timeLagFile
leaveOutSampleList "" Path to a text file (one sample name per line) listing samples to exclude from subsampling; held-out samples form the test set for calcR2predFromStabilities

API reference

Data loading

data                    = loadData(exprFile, targFile, regFile; tfaGeneFile="", epsilon=0.01)
priorData, mergedTFs    = loadPrior(data, priorFile; minTargets=3)

Core pipeline

estimateTFA(priorData, data; edgeSS=0, zScoreTFA=true, outputDir=".")
applyTimeLag(priorData, data, timeLagFile, timeLag)          # optional step 3b
buildNetwork(data, priorData; tfaMode=true, totSS=80, lambdaBias=[0.5], ...)
aggregateNetworks(netFiles; method=:max, meanEdgesPerGene=20, outputDir=".")
refineTFA(combinedNetFile, data, mergedTFs; zScoreTFA=true, outputDir=".")

Evaluation

evaluateNetwork(networkFile, goldStandard;
                gsRegsFile="", targGeneFile="", doPerTF=false, saveDir="")

# Out-of-sample R² prediction: sweep model sizes 1:maxModSize TFs/gene,
# fit OLS on training samples, evaluate on held-out test samples.
# Requires a leave-out set (passed via leaveOutSampleList in buildNetwork).
calcR2predFromStabilities(grnData, buildGrn, data, maxModSize, outputDir;
                          xaxisStepSize=5)

Data preparation (pseudobulk)

# Aggregate single-cell counts from an AnnData (.h5ad) object into pseudobulk
pbRaw = pseudoBulk(adata, "celltype_rep")   # returns features × samples DataFrame

# Normalize a features × samples count matrix
# methods: "none", "tpm", "log2tpm", "zscore", "log2tpm_zscore", "log2fc", "sizefactor"
countsNorm = normalizeMatrix(countsMat, "log2tpm")

# Build a one-hot biological covariate matrix (drop-first encoding)
bioDesign = buildDesignMatrix(meta, "celltype")

# Remove additive batch effects (pure Julia OLS, equivalent to limma)
corrected = removeBatchEffect(countsNorm, meta[!, "replicate"]; designMat=bioDesign)

# Sum replicate columns into per-celltype aggregate columns
# Column names must follow {celltype}_{replicate} convention
limaAgg = aggregateReplicates(limaDf, celltypeOrder, replicateOrder)

For DESeq2 VST normalization and exact limma batch correction, use R directly. See examples/prepareData/pseudobulkWorkflow.jl for a complete workflow using RCall.jl.

Utilities

matrixToEdgeList(df)                         # wide prior matrix → edge list (long format, no zeros)
edgeListToMatrix(df; indices=(2,1,3))        # edge list → wide matrix
frobeniusNormalize(M, :column)               # normalize matrix columns or rows
binarizeNumeric!(df)                         # continuous prior → binary
mergeDFs(dfs, :Gene, "sum")                  # merge prior DataFrames
completeDF(df, :Gene, allGenes, allTFs)      # align to full gene/TF universe
writeTSVWithEmptyFirstHeader(df, path)       # write sparse prior format
check_column_norms(M)                        # verify unit column norms
writeNetworkTable!(buildGrn; outputDir=".")  # write edges.tsv from BuildGrn struct
saveData(data, tfaData, grnData, buildGrn, dir, "checkpoint.jld2")

Examples

File Description
interactive_pipeline.jl Full 7-step pipeline, step-by-step in the REPL, public API only
interactive_pipeline_dev.jl Same pipeline using module-qualified internal calls (for development/debugging)
run_pipeline.jl Full pipeline wrapped in runInferelator() for batch/cluster use, public API
run_pipeline_dev.jl Same wrapped function using internal calls
dataUtils.jl Data utility and partial-correlation functions demonstrated with synthetic data; no input files needed
networkUtils.jl Network I/O and aggregation utilities demonstrated with synthetic data; no input files needed
plotPR.jl Standalone script to load saved PR results and generate PR curve plots
prepareData/pseudobulkWorkflow.jl End-to-end pseudobulk workflow: h5ad → normalize → batch-correct → save TSV/Arrow for InferelatorJL
prepareData/h5adToArrow.jl Convert an h5ad expression matrix directly to the Arrow format expected by InferelatorJL
r2PredWorkflow.jl Leave-out R² prediction workflow: build GRN with held-out samples, compute R²_pred vs model size to guide TF-per-gene selection

Data structures

Struct Populated by Key fields
GeneExpressionData loadData geneExpressionMat, geneNames, cellLabels, targGenes, potRegs
PriorTFAData loadPrior, estimateTFA priorMatrix, medTfas, pRegs, pTargs
mergedTFsResult loadPrior mergedPrior, mergedTFMap
GrnData buildNetwork internals predictorMat, penaltyMat, allPredictors, subsamps, trainInds, stabilityMat, targGenes
BuildGrn buildNetwork networkMat, networkMatSubset

Citation

If you use InferelatorJL in your work, please cite:

Miraldi ER, Pokrovskii M, Watters A, et al. Leveraging chromatin accessibility for transcriptional regulatory network inference in T Helper 17 Cells. PLOS Computational Biology, 2019. https://doi.org/10.1371/journal.pcbi.1006979


License

See LICENSE for details.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors