Skip to content

Commit f9c4a1e

Browse files
authored
Add otter-lioness and cli (#342)
* fixed lioness CLI and added files for otterlioness * addedd otter and lioness otter to CLI. For the moment the data filtering step is different from PANDA, as it does not have all preprocessing modes. It needs to converge * fixing test lioness. We need to remove the save_memory flag * testing preprocessing with PANDA * fixing test * fixed otter test * adding warnings --------- Co-authored-by: violafanfani <[email protected]>
1 parent be0fb96 commit f9c4a1e

File tree

10 files changed

+1786
-2
lines changed

10 files changed

+1786
-2
lines changed

README.md

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,10 @@ netZooPy is a python package to reconstruct, analyse, and plot biological networ
1919
**WARNING**: for macos arm64 architectures you might have to manually install pytables. We are only testing macos-13
2020
intel architecture for the moment
2121

22+
23+
**WARNING**: the OTTER CLI and class are still relying on a simple approach for reading and merging. Please be careful
24+
if you have NAs and want a non-intersection between W,P,C please rely on PANDA or on your own filtering.
25+
2226
## Features
2327

2428
netZooPy currently integrates
@@ -44,6 +48,9 @@ genes by miRNA.
4448
* **SAMBAR** (Subtyping Agglomerated Mutations By Annotation Relations) [[Kuijjer et al.]](https://www.nature.com/articles/s41416-018-0109-7): SAMBAR is a tool for studying cancer subtypes based on patterns of somatic mutations in curated biological pathways. Rather than characterize cancer according to mutations at the gene level, SAMBAR agglomerates mutations within pathways to define a pathway mutation score. To avoid bias based on pathway representation, these pathway mutation scores correct for the number of genes in each pathway as well as the number of times each gene is represented in the universe of pathways. By taking a pathway rather than gene-by-gene lens, SAMBAR both de-sparsifies somatic mutation data and incorporates important prior biological knowledge. Kuijjer et al. (2018) demonstrate that SAMBAR is capable of outperforming other methods for cancer subtyping, producing subtypes with greater between-subtype distances; the authors use SAMBAR for a pan-cancer subtyping analysis that identifies four diverse pan-cancer subtypes linked to distinct molecular processes.
4549

4650
* **OTTER** (Optimization to Estimate Regulation) [[Weighill et al.]](https://www.biorxiv.org/content/10.1101/2020.06.23.167999v2.abstract): OTTER is a GRN inference method based on the idea that observed biological data (PPI data and gene co-expression data) are projections of a bipartite GRN between TFs and genes. Specifically, PPI data represent the projection of the GRN onto the TF-TF space and gene co-expression data represent the projection of the GRN onto the gene-gene space. OTTER reframes the problem of GRN inference as a problem of relaxed graph matching and finds a GRN that has optimal agreement with the observed PPI and coexpression data. The OTTER objective function is tunable in two ways: first, one can prioritize matching the PPI data or the coexpression data more heavily depending on one's confidence in the data source; second, there is a regularization parameter that can be applied to induce sparsity on the estimated GRN. The OTTER objective function can be solved using spectral decomposition techniques and gradient descent; the latter is shown to be closely related to the PANDA message-passing approach (Glass et al. 2013).
51+
52+
**WARNING**: the OTTER CLI and class are still relying on a simple approach for reading and merging. Please be careful
53+
if you have NAs and want a non-intersection between W,P,C please rely on PANDA or on your own filtering.
4754

4855
* **DRAGON** (Determining Regulatory Associations using Graphical models on Omics Networks) [[Shutta et al.]](https://arxiv.org/abs/2104.01690) is a method for estimating multiomic Gaussian graphical models (GGMs, also known as partial correlation networks) that incorporate two different omics data types. DRAGON builds off of the popular covariance shrinkage method of Ledoit and Wolf with an optimization approach that explicitly accounts for the differences in two separate omics "layers" in the shrinkage estimator. The resulting sparse covariance matrix is then inverted to obtain a precision matrix estimate and a corresponding GGM. Although GGMs assume normally distributed data, DRAGON can be used on any type of continuous data by transforming data to approximate normality prior to network estimation. Currently, DRAGON can be applied to estimate networks with two different types of omics data. Investigators interested in applying DRAGON to more than two types of omics data can consider estimating pairwise networks and "chaining" them together.
4956

netZooPy/cli.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,3 +12,5 @@ def cli():
1212
cli.add_command(cl.lioness)
1313
cli.add_command(cl.condor)
1414
cli.add_command(cl.bonobo)
15+
cli.add_command(cl.otterlioness)
16+
cli.add_command(cl.otter)

netZooPy/command_line.py

Lines changed: 118 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -158,7 +158,9 @@ def panda(expression, motif, ppi, output, computing='cpu', precision='double',wi
158158
help='If true, the final PANDA is saved as an adjacency matrix. Works only when save_memory is false')
159159
@click.option('--old_compatible', is_flag=True, show_default=True,
160160
help='If true, PANDA is saved without headers. Pass this if you want the same results of netzoopy before v0.9.11')
161+
161162
def lioness(expression, motif, ppi, output_panda, output_lioness, el, fmt, computing, precision, ncores, save_tmp, rm_missing, mode_process,output_type, alpha, panda_start, panda_end, start, end, subset_numbers='', subset_names='',with_header=False, save_single_lioness=False,ignore_final=False, as_adjacency=False, old_compatible=False):
163+
162164
"""Run Lioness to extract single-sample networks.
163165
First runs panda using expression, motif and ppi data.
164166
Then runs lioness and puts results in the output_lioness folder.
@@ -341,3 +343,119 @@ def bonobo(
341343
precision = precision,
342344
sample_names=sample_names)
343345

346+
347+
348+
#####################################################################################
349+
############## OTTER LIONESS ########################################################
350+
#####################################################################################
351+
352+
from netZooPy.lioness.lioness_for_otter import LionessOtter
353+
354+
@click.command()
355+
@click.option('-e', '--expression', 'expression', type=str, required=True,
356+
help='Path to file containing the gene expression data. By default, \
357+
the expression file does not have a header, and the cells are separated by a tab.')
358+
@click.option('-m', '--motif', 'motif', type=str, required=True,
359+
help='Path to pair file containing the transcription factor DNA binding motif edges in the form of TF-gene-weight(0/1). If not provided, the gene coexpression matrix is returned as a result network.')
360+
@click.option('-p', '--ppi', 'ppi', type=str, required=True,
361+
help='Path to pair file containing the PPI edges. The PPI can be symmetrical, if not, it will be transformed into a symmetrical adjacency matrix.')
362+
@click.option('-of', '--out-folder', 'output_folder', type=str, required=True,
363+
help='Output lioness otter folder')
364+
@click.option('--fmt', type=str, show_default=True, default='h5',
365+
help='Lioness network files output format. Choose one between .npy,.txt,.mat')
366+
@click.option('--computing', type=str, show_default=True, default='cpu',
367+
help='computing option, choose one between cpu and gpu')
368+
@click.option('--precision', type=str, show_default=True, default='double',
369+
help='precision option')
370+
@click.option('--mode_process', type=str, default='intersection', show_default=True,
371+
help='panda option for input data processing. Choose between union(default), \
372+
legacy and intersection')
373+
@click.option('--iterations', type=int, default=60, show_default=True,
374+
help='otter iterations, Iter')
375+
@click.option('--lam', type=float, default=0.035, show_default=True,
376+
help='lambda parameter')
377+
@click.option('--gamma', type=float, default=0.335, show_default=True,
378+
help='gamma parameter')
379+
@click.option('--eta', type=float, default=0.00001, show_default=True,
380+
help='eta parameter')
381+
@click.option('--bexp', type=int, default=1., show_default=True,
382+
help='bexp parameter')
383+
def otterlioness(expression, motif, ppi, output_folder, fmt, computing, precision, mode_process='intersection', iterations=60, lam=0.035, gamma=0.335, Iter=60, eta=0.00001, bexp=1):
384+
"""Run Lioness otter to extract single-sample networks.
385+
First runs otter using expression, motif and ppi data.
386+
Then runs lioness and puts results in the output_lioness folder.
387+
WARNING: the OTTER CLI and class are still relying on a simple approach for reading and merging. Please be careful
388+
if you have NAs and want a non-intersection between W,P,C please rely on PANDA or on your own filtering.
389+
390+
Example:
391+
392+
netzoopy otterlioness -e tests/puma/ToyData/ToyExpressionData.txt -m tests/puma/ToyData/ToyMotifData.txt -p tests/puma/ToyData/ToyPPIData.txt -of lioness_otter/
393+
394+
"""
395+
# Run PANDA
396+
print('Start Otter run ...')
397+
398+
# First we create the LIONESS OTTER instance with the expression, motif, ppi files
399+
lioobj = LionessOtter(expression, motif, ppi, mode_process=mode_process)
400+
401+
print('Starting Otter Lioness')
402+
lioobj.run_lioness_otter(output_folder, save_fmt = fmt, save_single=True, precision = precision, computing = computing, Iter = iterations, lam=lam, gamma=gamma, eta=eta, bexp=bexp)
403+
404+
405+
406+
407+
#####################################################################################
408+
############## OTTER ################################################################
409+
#####################################################################################
410+
411+
from netZooPy.lioness.lioness_for_otter import LionessOtter
412+
413+
@click.command()
414+
@click.option('-e', '--expression', 'expression', type=str, required=True,
415+
help='Path to file containing the gene expression data. By default, \
416+
the expression file does not have a header, and the cells are separated by a tab.')
417+
@click.option('-m', '--motif', 'motif', type=str, required=True,
418+
help='Path to pair file containing the transcription factor DNA binding motif edges in the form of TF-gene-weight(0/1). If not provided, the gene coexpression matrix is returned as a result network.')
419+
@click.option('-p', '--ppi', 'ppi', type=str, required=True,
420+
help='Path to pair file containing the PPI edges. The PPI can be symmetrical, if not, it will be transformed into a symmetrical adjacency matrix.')
421+
@click.option('-o', '--out-file', 'output_file', type=str, default = 'otter.txt',
422+
help='Output otter file. Use one of the extensions between .npy,.txt,.mat')
423+
@click.option('--computing', type=str, show_default=True, default='cpu',
424+
help='computing option, choose one between cpu and gpu')
425+
@click.option('--precision', type=str, show_default=True, default='double',
426+
help='precision option')
427+
@click.option('--mode_process', type=str, default='intersection', show_default=True,
428+
help='panda option for input data processing. Choose between union(default), \
429+
legacy and intersection')
430+
@click.option('--iterations', type=int, default=60, show_default=True,
431+
help='otter iterations, Iter')
432+
@click.option('--lam', type=float, default=0.035, show_default=True,
433+
help='lambda parameter')
434+
@click.option('--gamma', type=float, default=0.335, show_default=True,
435+
help='gamma parameter')
436+
@click.option('--eta', type=float, default=0.00001, show_default=True,
437+
help='eta parameter')
438+
@click.option('--bexp', type=int, default=1., show_default=True,
439+
help='bexp parameter')
440+
def otter(expression, motif, ppi, output_file='otter.txt', computing='cpu', precision='double', mode_process='intersection', iterations=60, lam=0.035, gamma=0.335, Iter=60, eta=0.00001, bexp=1):
441+
"""Run Lioness otter to extract single-sample networks.
442+
First runs otter using expression, motif and ppi data.
443+
Then runs lioness and puts results in the output_lioness folder.
444+
445+
WARNING: the OTTER CLI and class are still relying on a simple approach for reading and merging. Please be careful
446+
if you have NAs and want a non-intersection between W,P,C please rely on PANDA or on your own filtering.
447+
Example:
448+
449+
netzoopy otterlioness -e tests/puma/ToyData/ToyExpressionData.txt -m tests/puma/ToyData/ToyMotifData.txt -p tests/puma/ToyData/ToyPPIData.txt -of lioness_otter/
450+
451+
"""
452+
# Run PANDA
453+
print('Start Otter run ...')
454+
455+
# First we create the LIONESS OTTER instance with the expression, motif, ppi files
456+
lioobj = LionessOtter(expression, motif, ppi, mode_process=mode_process)
457+
458+
print('Starting Otter Lioness')
459+
460+
lioobj.run_otter(output_file, precision = precision, computing = computing, Iter = iterations, lam=lam, gamma=gamma, eta=eta, bexp=bexp )
461+

netZooPy/lioness/io.py

Lines changed: 148 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,148 @@
1+
from __future__ import print_function
2+
import math
3+
from random import sample
4+
import time
5+
import pandas as pd
6+
from scipy.stats import zscore
7+
from .timer import Timer
8+
import numpy as np
9+
from netZooPy.panda import calculations as calc
10+
import sys
11+
12+
def check_expression_integrity(df):
13+
"""Check data integrity
14+
- Number of NA
15+
16+
Args:
17+
df (dataframe): gene expression dataframe
18+
"""
19+
20+
# check that for each
21+
if (df.isna().sum(axis = 1)>(len(df.columns)-3)).any():
22+
sys.exit('Too many nan in gene expression (need more than 1 sample to compute coexpression)')
23+
24+
def read_ppi(ppi_fn, tf_list = None):
25+
"""Read PPI network
26+
27+
Args:
28+
ppi_fn (str): ppi network filename
29+
"""
30+
with open(ppi_fn, 'r') as f:
31+
ppi_data = pd.read_csv(f, sep="\t", header=None)
32+
ppi_data.columns = ['tf1','tf2','exists']
33+
34+
# get all tfs from first and second column
35+
if tf_list:
36+
ppi_tfs = tf_list
37+
ppi_data = ppi_data[(ppi_data.tf1.isin(ppi_tfs)) & (ppi_data.tf2.isin(ppi_tfs))]
38+
else:
39+
ppi_tfs = sorted(set(ppi_data.iloc[:,0].values.tolist()).union(set(ppi_data.iloc[:,1].values.tolist())))
40+
41+
# create adjacency matrix
42+
df = pd.DataFrame(np.eye(len(ppi_tfs)), index=ppi_tfs, columns=ppi_tfs)
43+
z = ppi_data.pivot_table(columns='tf2',index = 'tf1',values = 'exists', fill_value=0)
44+
df = df.add(z, fill_value=0).add(z.T, fill_value=0)
45+
df = 1*(df>0)
46+
# return adjacency matrix and tfs list
47+
return(df, ppi_tfs)
48+
49+
def read_motif(motif_fn, pivot = True):
50+
""" Read a motif edgelist, generates
51+
52+
Args:
53+
motif_fn (_type_): filename of the motif edgelist
54+
pivot (bool): if true returns a pivot tfs X genes table. Otherwise keeps the edgelist
55+
Returns:
56+
piv/df: motif as edgelist or pivot table
57+
tfs: list of tfs
58+
genes: list of genes
59+
"""
60+
61+
with open(motif_fn, 'r') as f:
62+
df = pd.read_csv(f, sep= '\t', header = None)
63+
64+
presenttf = df.iloc[:,0].unique()
65+
presentgene = df.iloc[:,1].unique()
66+
67+
if pivot:
68+
piv = df.pivot_table(values=2, index=0, columns=1, fill_value=0)
69+
return(piv, list(presenttf), list(presentgene))
70+
else:
71+
return(df, list(presenttf), list(presentgene))
72+
73+
def read_expression(expression_fn, header = 0, usecols = None, nrows = None):
74+
"""Read expression data.
75+
76+
Parameters
77+
-----------
78+
expression_fn: str
79+
filename of the expression file
80+
header: str or int
81+
header row
82+
usecols:list
83+
pass a list of the columns that need to be read
84+
"""
85+
with open(expression_fn, 'r') as f:
86+
if expression_fn.endswith('.txt'):
87+
df = pd.read_csv(f, sep = '\t', usecols = usecols, index_col=0, nrows=nrows)
88+
elif expression_fn.endswith('.csv'):
89+
df = pd.read_csv(f, sep = ' ', usecols = usecols, index_col=0, nrows=nrows)
90+
elif expression_fn.endswith('.tsv'):
91+
df = pd.read_csv(f, sep = '\t', usecols = usecols, index_col=0, nrows=nrows)
92+
else:
93+
sys.exit("Format of expression filename not recognised %s" %str(expression_fn))
94+
95+
return(df)
96+
97+
98+
def prepare_expression(expression_filename, samples = None):
99+
100+
""" Prepare main coexpression network by reading the expression file.
101+
102+
Parameters
103+
----------
104+
expression_filename :str
105+
A table (tsv, csv, or txt) where each column is a sample
106+
and each row is a gene. Values are expression.
107+
samples: list
108+
list of sample names. If None all samples are read (default: None)
109+
110+
Returns
111+
---------
112+
expression_data: pd.DataFrame
113+
expression_genes:set
114+
115+
"""
116+
# expression file is properly annotated with the sample name and
117+
# a list of sample of interest is passed
118+
print(samples)
119+
if type(expression_filename) is str:
120+
columns = read_expression(expression_filename, nrows = 1)
121+
if (isinstance(samples, list)):
122+
usecols = samples.copy()
123+
usecols.insert(0,columns.index.name)
124+
expression_data = read_expression(expression_filename, usecols = usecols)
125+
else:
126+
expression_data = read_expression(expression_filename)
127+
128+
elif isinstance(expression_filename, pd.DataFrame):
129+
if (isinstance(samples, list)):
130+
usecols = samples.copy()
131+
usecols.insert(0,columns.index.name)
132+
expression_data = expression_filename.loc[:,usecols]
133+
else:
134+
expression_data = expression_filename
135+
else:
136+
sys.exit('Expression filename needs to be either a table string or a panda dataframe')
137+
138+
# keep names of expression genes
139+
expression_genes = set(expression_data.index.tolist())
140+
141+
if len(expression_data) != len(expression_genes):
142+
print(
143+
"Duplicate gene symbols detected. Consider averaging before running PANDA"
144+
)
145+
146+
check_expression_integrity(expression_data)
147+
148+
return(expression_data, expression_genes)

0 commit comments

Comments
 (0)