netZoo
diff --git a/README.md b/README.md
Lines changed: 7 additions & 0 deletions
diff --git a/netZooPy/cli.py b/netZooPy/cli.py
Lines changed: 2 additions & 0 deletions
diff --git a/netZooPy/command_line.py b/netZooPy/command_line.py
Lines changed: 118 additions & 0 deletions
diff --git a/netZooPy/lioness/io.py b/netZooPy/lioness/io.py
Lines changed: 148 additions & 0 deletions
@@ -19,6 +19,10 @@ netZooPy is a python package to reconstruct, analyse, and plot biological networ
 **WARNING**: for macos arm64 architectures you might have to manually install pytables. We are only testing macos-13
 intel architecture for the moment
 
+
+**WARNING**: the OTTER CLI and class still rely on a simple approach for reading and merging the input files. Be careful
+if your data contain NAs. If you need anything other than the intersection of W, P and C, rely on PANDA or on your own filtering.
+
 ## Features
 
 netZooPy currently integrates
@@ -44,6 +48,9 @@ genes by miRNA.
 * **SAMBAR** (Subtyping Agglomerated Mutations By Annotation Relations) [[Kuijjer et al.]](https://www.nature.com/articles/s41416-018-0109-7): SAMBAR is a tool for studying cancer subtypes based on patterns of somatic mutations in curated biological pathways. Rather than characterize cancer according to mutations at the gene level, SAMBAR agglomerates mutations within pathways to define a pathway mutation score. To avoid bias based on pathway representation, these pathway mutation scores correct for the number of genes in each pathway as well as the number of times each gene is represented in the universe of pathways. By taking a pathway rather than gene-by-gene lens, SAMBAR both de-sparsifies somatic mutation data and incorporates important prior biological knowledge. Kuijjer et al. (2018) demonstrate that SAMBAR is capable of outperforming other methods for cancer subtyping, producing subtypes with greater between-subtype distances; the authors use SAMBAR for a pan-cancer subtyping analysis that identifies four diverse pan-cancer subtypes linked to distinct molecular processes. 
 
 * **OTTER** (Optimization to Estimate Regulation) [[Weighill et al.]](https://www.biorxiv.org/content/10.1101/2020.06.23.167999v2.abstract): OTTER is a GRN inference method based on the idea that observed biological data (PPI data and gene co-expression data) are projections of a bipartite GRN between TFs and genes. Specifically, PPI data represent the projection of the GRN onto the TF-TF space and gene co-expression data represent the projection of the GRN onto the gene-gene space. OTTER reframes the problem of GRN inference as a problem of relaxed graph matching and finds a GRN that has optimal agreement with the observed PPI and coexpression data. The OTTER objective function is tunable in two ways: first, one can prioritize matching the PPI data or the coexpression data more heavily depending on one's confidence in the data source; second, there is a regularization parameter that can be applied to induce sparsity on the estimated GRN. The OTTER objective function can be solved using spectral decomposition techniques and gradient descent; the latter is shown to be closely related to the PANDA message-passing approach (Glass et al. 2013).
+  
+**WARNING**: the OTTER CLI and class still rely on a simple approach for reading and merging the input files. Be careful
+if your data contain NAs. If you need anything other than the intersection of W, P and C, rely on PANDA or on your own filtering.
 
 * **DRAGON** (Determining Regulatory Associations using Graphical models on Omics Networks) [[Shutta et al.]](https://arxiv.org/abs/2104.01690) is a method for estimating multiomic Gaussian graphical models (GGMs, also known as partial correlation networks) that incorporate two different omics data types. DRAGON builds off of the popular covariance shrinkage method of Ledoit and Wolf with an optimization approach that explicitly accounts for the differences in two separate omics "layers" in the shrinkage estimator. The resulting sparse covariance matrix is then inverted to obtain a precision matrix estimate and a corresponding GGM.  Although GGMs assume normally distributed data, DRAGON can be used on any type of continuous data by transforming data to approximate normality prior to network estimation. Currently, DRAGON can be applied to estimate networks with two different types of omics data. Investigators interested in applying DRAGON to more than two types of omics data can consider estimating pairwise networks and "chaining" them together.
 
 
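The OTTER warning added to the README asks users to do their own filtering when the default reading and merging is not enough. A minimal sketch of such a pre-filter follows, using only the standard library. It assumes W is the motif edge list, P the PPI edge list and C the expression table; the function name `intersect_inputs` and the toy data are hypothetical and not part of netZooPy.

```python
import math


def intersect_inputs(motif_edges, ppi_edges, expression):
    """Keep only TFs and genes shared by the inputs; drop genes with missing values.

    motif_edges: list of (tf, gene, weight) tuples
    ppi_edges:   list of (tf1, tf2, weight) tuples
    expression:  dict mapping gene -> list of sample values (NaN marks a missing value)
    """
    # Genes with at least one NaN are removed before intersecting
    complete = {
        gene: values
        for gene, values in expression.items()
        if not any(math.isnan(v) for v in values)
    }

    # TFs must appear in both the motif and the PPI edge lists
    motif_tfs = {tf for tf, _, _ in motif_edges}
    ppi_tfs = {tf for a, b, _ in ppi_edges for tf in (a, b)}
    tfs = motif_tfs & ppi_tfs

    # Genes must be targeted by a retained TF and have complete expression
    motif_genes = {gene for tf, gene, _ in motif_edges if tf in tfs}
    genes = motif_genes & set(complete)

    motif = [(tf, g, w) for tf, g, w in motif_edges if tf in tfs and g in genes]
    ppi = [(a, b, w) for a, b, w in ppi_edges if a in tfs and b in tfs]
    expr = {gene: complete[gene] for gene in sorted(genes)}
    return motif, ppi, expr


# Hypothetical toy data
toy_motif = [("TF1", "G1", 1), ("TF1", "G2", 1), ("TF2", "G1", 0), ("TF3", "G3", 1)]
toy_ppi = [("TF1", "TF2", 1), ("TF2", "TF4", 1)]
toy_expression = {
    "G1": [1.0, 2.0, 3.0],
    "G2": [0.5, float("nan"), 1.5],
    "G3": [2.0, 2.5, 3.0],
}

motif, ppi, expr = intersect_inputs(toy_motif, toy_ppi, toy_expression)
print(motif)
print(ppi)
print(sorted(expr))
```

The filtered edge lists and expression table could then be written back to tab-separated files and passed to the CLI, so that the simple merge only ever sees consistent inputs.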
@@ -12,3 +12,5 @@ def cli():
 cli.add_command(cl.lioness)
 cli.add_command(cl.condor)
 cli.add_command(cl.bonobo)
+cli.add_command(cl.otterlioness)
+cli.add_command(cl.otter)
@@ -158,7 +158,9 @@ def panda(expression, motif, ppi, output, computing='cpu', precision='double',wi
               help='If true, the final PANDA is saved as an adjacency matrix. Works only when save_memory is false')
 @click.option('--old_compatible', is_flag=True, show_default=True,
               help='If true, PANDA is saved without headers. Pass this if you want the same results of netzoopy before v0.9.11')
+
 def lioness(expression, motif, ppi, output_panda, output_lioness, el, fmt, computing, precision, ncores, save_tmp, rm_missing, mode_process,output_type, alpha, panda_start, panda_end, start, end, subset_numbers='', subset_names='',with_header=False, save_single_lioness=False,ignore_final=False, as_adjacency=False, old_compatible=False):    
+
     """Run Lioness to extract single-sample networks.
     First runs panda using expression, motif and ppi data. 
     Then runs lioness and puts results in the output_lioness folder.
@@ -341,3 +343,119 @@ def bonobo(
                                  precision = precision, 
                                  sample_names=sample_names)
 
+
+
+#####################################################################################
+############## OTTER LIONESS ########################################################
+#####################################################################################
+    
+from netZooPy.lioness.lioness_for_otter import LionessOtter
+
+@click.command()
+@click.option('-e', '--expression', 'expression', type=str, required=True,
+              help='Path to file containing the gene expression data. By default, \
+                  the expression file does not have a header, and the cells are separated by a tab.')
+@click.option('-m', '--motif', 'motif', type=str, required=True,
+              help='Path to pair file containing the transcription factor DNA binding motif edges in the form of TF-gene-weight(0/1).')
+@click.option('-p', '--ppi', 'ppi', type=str, required=True,
+              help='Path to pair file containing the PPI edges. The PPI can be symmetrical; if not, it will be transformed into a symmetrical adjacency matrix.')
+@click.option('-of', '--out-folder', 'output_folder', type=str, required=True,
+              help='Output lioness otter folder')
+@click.option('--fmt', type=str, show_default=True, default='h5',
+              help='Lioness network files output format. Choose one between .npy,.txt,.mat')
+@click.option('--computing', type=str, show_default=True, default='cpu',
+              help='computing option, choose one between cpu and gpu')
+@click.option('--precision', type=str, show_default=True, default='double',
+              help='precision option')
+@click.option('--mode_process', type=str, default='intersection', show_default=True,
+              help='panda option for input data processing. Choose between union, \
+                  legacy and intersection (default)')
+@click.option('--iterations', type=int, default=60, show_default=True,
+              help='otter iterations, Iter')
+@click.option('--lam', type=float, default=0.035, show_default=True,
+              help='lambda parameter')
+@click.option('--gamma', type=float, default=0.335, show_default=True,
+              help='gamma parameter')
+@click.option('--eta', type=float, default=0.00001, show_default=True,
+              help='eta parameter')
+@click.option('--bexp', type=int, default=1, show_default=True,
+              help='bexp parameter')
+def otterlioness(expression, motif, ppi, output_folder, fmt, computing, precision, mode_process='intersection', iterations=60, lam=0.035, gamma=0.335, eta=0.00001, bexp=1):
+    """Run Lioness otter to extract single-sample networks.
+    First runs otter using expression, motif and ppi data. 
+    Then runs lioness and puts results in the output folder (-of/--out-folder).
+
+    WARNING: the OTTER CLI and class still rely on a simple approach for reading and merging the input files. Be careful
+    if your data contain NAs. If you need anything other than the intersection of W, P and C, rely on PANDA or on your own filtering.
+
+    Example:
+
+            netzoopy otterlioness -e tests/puma/ToyData/ToyExpressionData.txt -m tests/puma/ToyData/ToyMotifData.txt -p tests/puma/ToyData/ToyPPIData.txt -of lioness_otter/
+    
+    """
+    # Run OTTER
+    print('Start Otter run ...')
+
+    # First we create the LIONESS OTTER instance with the expression, motif, ppi files        
+    lioobj = LionessOtter(expression, motif, ppi, mode_process=mode_process)
+    
+    print('Starting Otter Lioness')
+    lioobj.run_lioness_otter(output_folder, save_fmt = fmt, save_single=True, precision = precision, computing = computing, Iter = iterations, lam=lam, gamma=gamma, eta=eta, bexp=bexp)
+
+
+
+
+#####################################################################################
+############## OTTER ################################################################
+#####################################################################################
+    
+from netZooPy.lioness.lioness_for_otter import LionessOtter
+
+@click.command()
+@click.option('-e', '--expression', 'expression', type=str, required=True,
+              help='Path to file containing the gene expression data. By default, \
+                  the expression file does not have a header, and the cells are separated by a tab.')
+@click.option('-m', '--motif', 'motif', type=str, required=True,
+              help='Path to pair file containing the transcription factor DNA binding motif edges in the form of TF-gene-weight(0/1).')
+@click.option('-p', '--ppi', 'ppi', type=str, required=True,
+              help='Path to pair file containing the PPI edges. The PPI can be symmetrical; if not, it will be transformed into a symmetrical adjacency matrix.')
+@click.option('-o', '--out-file', 'output_file', type=str, default = 'otter.txt',
+              help='Output otter file. Use one of the extensions between .npy,.txt,.mat')
+@click.option('--computing', type=str, show_default=True, default='cpu',
+              help='computing option, choose one between cpu and gpu')
+@click.option('--precision', type=str, show_default=True, default='double',
+              help='precision option')
+@click.option('--mode_process', type=str, default='intersection', show_default=True,
+              help='panda option for input data processing. Choose between union, \
+                  legacy and intersection (default)')
+@click.option('--iterations', type=int, default=60, show_default=True,
+              help='otter iterations, Iter')
+@click.option('--lam', type=float, default=0.035, show_default=True,
+              help='lambda parameter')
+@click.option('--gamma', type=float, default=0.335, show_default=True,
+              help='gamma parameter')
+@click.option('--eta', type=float, default=0.00001, show_default=True,
+              help='eta parameter')
+@click.option('--bexp', type=int, default=1, show_default=True,
+              help='bexp parameter')
+def otter(expression, motif, ppi, output_file='otter.txt', computing='cpu', precision='double', mode_process='intersection', iterations=60, lam=0.035, gamma=0.335, eta=0.00001, bexp=1):
+    """Run otter to infer a gene regulatory network.
+    Runs otter using expression, motif and ppi data
+    and saves the resulting network to the output file (-o/--out-file).
+    
+    WARNING: the OTTER CLI and class still rely on a simple approach for reading and merging the input files. Be careful
+    if your data contain NAs. If you need anything other than the intersection of W, P and C, rely on PANDA or on your own filtering.
+
+    Example:
+
+            netzoopy otter -e tests/puma/ToyData/ToyExpressionData.txt -m tests/puma/ToyData/ToyMotifData.txt -p tests/puma/ToyData/ToyPPIData.txt -o otter.txt
+    
+    """
+    # Run OTTER
+    print('Start Otter run ...')
+
+    # First we create the LIONESS OTTER instance with the expression, motif, ppi files        
+    lioobj = LionessOtter(expression, motif, ppi, mode_process=mode_process)
+    
+    print('Starting Otter')
+    
+    lioobj.run_otter(output_file, precision = precision, computing = computing, Iter = iterations, lam=lam, gamma=gamma, eta=eta, bexp=bexp )
+
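click passes every declared option to the command function as a keyword argument, so the parameter names in the function signature need to line up with the option names. The sketch below shows one way to check that with the standard library. The helper names and the toy function are hypothetical, and the option names are listed by hand rather than read from a click command.

```python
import inspect


def parameters_without_option(func, option_names):
    """Return the parameters of func that no declared option would fill."""
    parameters = inspect.signature(func).parameters
    return [name for name in parameters if name not in option_names]


def options_without_parameter(func, option_names):
    """Return the declared options that func cannot accept as keyword arguments."""
    parameters = inspect.signature(func).parameters
    return [name for name in option_names if name not in parameters]


# Hypothetical command function with one parameter that has no matching option
def toy_command(expression, output_file="out.txt", iterations=60, Iter=60):
    return expression, output_file, iterations


declared_options = ["expression", "output_file", "iterations"]

print(parameters_without_option(toy_command, declared_options))  # ['Iter']
print(options_without_parameter(toy_command, declared_options))  # []
```

A parameter reported by the first helper always keeps its default value when the command runs from the CLI, which is usually a sign that it can be dropped or that an option is missing.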
@@ -0,0 +1,148 @@
+from __future__ import print_function
+import math
+from random import sample
+import time
+import pandas as pd
+from scipy.stats import zscore
+from .timer import Timer
+import numpy as np
+from netZooPy.panda import calculations as calc
+import sys
+
+def check_expression_integrity(df):
+    """Check data integrity
+    - Number of NA
+
+    Args:
+        df (dataframe): gene expression dataframe
+    """
+
+    # check that each gene (row) has at least 3 non-NaN samples
+    if (df.isna().sum(axis = 1)>(len(df.columns)-3)).any():
+        sys.exit('Too many nan in gene expression (need more than 1 sample to compute coexpression)')
+
+def read_ppi(ppi_fn, tf_list = None):
+    """Read PPI network
+
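+    Args:
+        ppi_fn (str): ppi network filename
+        tf_list (list, optional): if given, only edges between these TFs are kept
+    Returns:
+        df: symmetric 0/1 adjacency matrix with ones on the diagonal
+        ppi_tfs: list of the TFs that index the matrix
+    """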
+    with open(ppi_fn, 'r') as f:
+        ppi_data = pd.read_csv(f, sep="\t", header=None)
+        ppi_data.columns = ['tf1','tf2','exists']
+
+    # get all tfs from first and second column
+    if tf_list:
+        ppi_tfs = tf_list
+        ppi_data = ppi_data[(ppi_data.tf1.isin(ppi_tfs)) & (ppi_data.tf2.isin(ppi_tfs))]
+    else:
+        ppi_tfs = sorted(set(ppi_data.iloc[:,0].values.tolist()).union(set(ppi_data.iloc[:,1].values.tolist())))
+    
+    # create adjacency matrix
+    df = pd.DataFrame(np.eye(len(ppi_tfs)), index=ppi_tfs, columns=ppi_tfs)
+    z = ppi_data.pivot_table(columns='tf2',index = 'tf1',values = 'exists', fill_value=0)
+    df = df.add(z, fill_value=0).add(z.T, fill_value=0)
+    df = 1*(df>0)
+    # return adjacency matrix and tfs list
+    return(df, ppi_tfs)
+
+def read_motif(motif_fn, pivot = True):
+    """ Read a motif edgelist and optionally pivot it into a tfs X genes table
+
+    Args:
+        motif_fn (str): filename of the motif edgelist
+        pivot (bool): if true returns a pivot tfs X genes table. Otherwise keeps the edgelist
+    Returns:
+        piv/df: motif as edgelist or pivot table
+        tfs: list of tfs
+        genes: list of genes
+    """
+
+    with open(motif_fn, 'r') as f:
+        df = pd.read_csv(f, sep= '\t', header = None)
+
+    presenttf = df.iloc[:,0].unique()
+    presentgene = df.iloc[:,1].unique()
+
+    if pivot:
+        piv = df.pivot_table(values=2, index=0, columns=1, fill_value=0)
+        return(piv, list(presenttf), list(presentgene))
+    else:
+        return(df, list(presenttf), list(presentgene))
+
+def read_expression(expression_fn, header = 0, usecols = None, nrows = None):
+    """Read expression data.
+
+    Parameters
+    -----------
+        expression_fn: str
+            filename of the expression file
+        header: str or int
+            header row
+        usecols: list
+            pass a list of the columns that need to be read
+        nrows: int
+            number of rows to read. If None, all rows are read
+    """
+    with open(expression_fn, 'r') as f:
+        if expression_fn.endswith('.txt'):
+            df = pd.read_csv(f, sep = '\t', header = header, usecols = usecols, index_col=0, nrows=nrows)
+        elif expression_fn.endswith('.csv'):
+            df = pd.read_csv(f, sep = ' ', header = header, usecols = usecols, index_col=0, nrows=nrows)
+        elif expression_fn.endswith('.tsv'):
+            df = pd.read_csv(f, sep = '\t', header = header, usecols = usecols, index_col=0, nrows=nrows)
+        else:
+            sys.exit("Format of expression filename not recognised %s" %str(expression_fn))
+    
+    return(df)
+
+
+def prepare_expression(expression_filename, samples = None):
+
+    """ Prepare main coexpression network by reading the expression file.
+    
+    Parameters
+    ----------
+        expression_filename :str
+            A table (tsv, csv, or txt) where each column is a sample 
+            and each row is a gene. Values are expression.
+        samples: list
+            list of sample names. If None all samples are read (default: None)
+
+    Returns
+    ---------
+        expression_data: pd.DataFrame
+        expression_genes: set
+
+    """    
+    # expression file is properly annotated with the sample name and 
+    # a list of sample of interest is passed
+    print(samples)
+    if type(expression_filename) is str:
+        columns = read_expression(expression_filename, nrows = 1)
+        if (isinstance(samples, list)):
+            usecols = samples.copy()
+            usecols.insert(0,columns.index.name)
+            expression_data = read_expression(expression_filename, usecols = usecols)
+        else:
+            expression_data = read_expression(expression_filename)
+
+    elif isinstance(expression_filename, pd.DataFrame):
+        if (isinstance(samples, list)):
+            # a dataframe already carries the genes as its index, so only the sample columns are selected
+            expression_data = expression_filename.loc[:,samples]
+        else:
+            expression_data = expression_filename
+    else: 
+        sys.exit('Expression filename needs to be either a path string or a pandas dataframe')
+    
+    # keep names of expression genes
+    expression_genes = set(expression_data.index.tolist())
+
+    if len(expression_data) != len(expression_genes):
+        print(
+            "Duplicate gene symbols detected. Consider averaging before running PANDA"
+        )
+
+    check_expression_integrity(expression_data)
+
+    return(expression_data, expression_genes)
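The adjacency construction in `read_ppi` (identity matrix, plus the edge table, plus its transpose, then thresholded at zero) can be hard to follow in pandas form. Below is a dependency-free sketch of the same idea. The name `ppi_adjacency` and the toy edges are hypothetical. Unlike the pandas version, it treats each edge independently instead of averaging duplicate edges, and it checks `tf_list is not None` rather than truthiness.

```python
def ppi_adjacency(edges, tf_list=None):
    """Build a symmetric 0/1 adjacency matrix with ones on the diagonal.

    edges:   list of (tf1, tf2, weight) tuples
    tf_list: optional list of TFs; edges touching other TFs are dropped
    """
    if tf_list is not None:
        tfs = list(tf_list)
        keep = set(tfs)
        edges = [(a, b, w) for a, b, w in edges if a in keep and b in keep]
    else:
        # All TFs seen in either column, in sorted order
        tfs = sorted({tf for a, b, _ in edges for tf in (a, b)})

    position = {tf: i for i, tf in enumerate(tfs)}
    size = len(tfs)

    # Start from the identity: every TF interacts with itself
    adjacency = [[1 if i == j else 0 for j in range(size)] for i in range(size)]

    # Mirror every positive edge so the matrix is symmetric
    for a, b, weight in edges:
        if weight > 0:
            i, j = position[a], position[b]
            adjacency[i][j] = 1
            adjacency[j][i] = 1
    return adjacency, tfs


# Hypothetical toy edge list: one real interaction, one zero-weight pair
toy_edges = [("TF1", "TF2", 1), ("TF2", "TF3", 0)]
toy_adjacency, toy_tfs = ppi_adjacency(toy_edges)

print(toy_tfs)
for row in toy_adjacency:
    print(row)
```

Passing `tf_list` fixes both the row order and the set of TFs, which is how the PPI matrix can be aligned with the TFs of the motif prior.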