Commit d2032aa

add joss paper draft (#331)
* add joss paper
* update metadata
* add figure 2 to paper
* Update .gitignore
* Update paper.md Clarified and added funding information
* Create joss-pdf.yml
* Update .gitignore
* Commit from GitHub Actions (Create JOSS PDF)

---------

Co-authored-by: Justin van der Hooft <[email protected]>
Co-authored-by: CunliangGeng <[email protected]>
1 parent a7f9da0 commit d2032aa

7 files changed: +337 -1 lines changed

.github/workflows/joss-pdf.yml

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
name: Create JOSS PDF

on:
  push:
    paths:
      - joss/**
      - .github/workflows/joss-pdf.yml

jobs:
  paper:
    runs-on: ubuntu-latest
    name: Paper Draft
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Build draft PDF
        uses: openjournals/openjournals-draft-action@master
        with:
          journal: joss
          # This should be the path to the paper within your repo.
          paper-path: joss/paper.md
      - name: Upload
        uses: actions/upload-artifact@v4
        with:
          name: paper
          # This is the output path where Pandoc will write the compiled
          # PDF. Note, this should be the same directory as the input
          # paper.md
          path: joss/paper.pdf
      - name: Commit PDF to repository
        uses: EndBug/add-and-commit@v9
        with:
          # This should be the path to the paper within your repo.
          add: 'joss/paper.pdf'

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -66,4 +66,5 @@ tests/integration/data/nplinker_local_mode_example.zip
 strain_mappings.json

 # docs
-docs/webapp/readme.md
+docs/webapp/readme.md
+joss/jats/*

joss/fig1.png

120 KB

joss/fig2.png

83.6 KB

joss/paper.bib

Lines changed: 170 additions & 0 deletions
Large diffs are not rendered by default.

joss/paper.md

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
---
title: 'NPLinker 2: a modular and customizable framework for paired omics analyses'
tags:
  - Python
  - bioinformatics
  - natural products
  - genomics
  - metabolomics
  - multi-omics
  - biosynthetic gene clusters
  - mass spectrometry
authors:
  - name: Cunliang Geng
    orcid: 0000-0002-1409-8358
    corresponding: true
    affiliation: 1
  - name: Giulia Crocioni
    orcid: 0000-0002-0823-0121
    affiliation: 1
  - name: Helge Hecht
    orcid: 0000-0001-6744-996X
    affiliation: 2
  - name: Arjan Draisma
    orcid: 0009-0004-0503-6261
    affiliation: 3
  - name: Annette Lien
    orcid: 0009-0006-0578-9225
    affiliation: 3
  - name: Laura Rosina Torres Ortega
    orcid: 0000-0003-4439-6740
    affiliation: 3
  - name: Dora Ferreira
    orcid: 0000-0002-5823-4219
    affiliation: 4
  - name: Pablo Lopez-Tarifa
    orcid: 0000-0002-4136-1860
    affiliation: 1
  - name: Katherine R. Duncan
    orcid: 0000-0002-3670-4849
    affiliation: 5
  - name: Marnix H. Medema
    orcid: 0000-0002-2191-2821
    affiliation: 3
  - name: Justin J.J. van der Hooft
    orcid: 0000-0002-9340-5511
    corresponding: true
    affiliation: "3, 6"
affiliations:
  - name: Netherlands eScience Center, Netherlands
    index: 1
  - name: RECETOX, Faculty of Science, Masaryk University, Kotlářská 2, 60200, Brno, Czech Republic
    index: 2
  - name: Bioinformatics Group, Wageningen University & Research, Netherlands
    index: 3
  - name: Naicons Srl, Milan, Italy
    index: 4
  - name: Newcastle University, Biosciences Institute, Newcastle upon Tyne, UK
    index: 5
  - name: Department of Biochemistry, University of Johannesburg, 2006 Johannesburg, South Africa
    index: 6
date: 8 July 2025
bibliography: paper.bib
---

# Summary
Natural product discovery increasingly relies on the integration of multi-omics data to explore and prioritize biochemical diversity. To advance these efforts, we present NPLinker 2, a redesigned Python framework for paired omics analyses that prioritizes genomics-metabolomics links. It provides a modular workflow that allows users to define custom modules for data preparation, data loading and scoring methods. In addition, NPLinker 2 includes a web application for the interactive analysis and visualisation of promising links.

# Statement of need
Omics datasets have become a key resource for natural products discovery, enabling the systematic exploration of specialized metabolites, the refinement of knowledge of known natural products, and the identification of novel bioactive compounds or metabolic enzymes. Paired omics analyses combine complementary genomics (e.g., biosynthetic gene clusters (BGCs)) and metabolomics (e.g., mass spectra) datasets to elucidate gene-metabolite relationships, accelerating the discovery process [@goering_metabologenomics_2016; @leao_npomix_2022; @hooft_linking_2020]. However, omics data structures, preprocessing pipelines, resources, and annotation tools are constantly being improved. For example, newer releases of MIBiG contain more validated BGCs and new annotation fields [@zdouc_mibig_2025], while mass spectral libraries are growing in size and information as well [@wang_sharing_2016]. In addition, newer versions of omics clustering tools have different output file formats. Together with the constant expansion of available experimental datasets, this puts a strain on downstream frameworks that integrate the data and results. Hence, natural products discovery would benefit from up-to-date and user-friendly software packages that parse processed omics data and connect it with algorithms that return ranked, queryable links between gene clusters and mass spectra, so that the most promising links can be prioritized for manual investigation. Here, we redesigned NPLinker to provide such an integrative omics tool that guides both users and developers in paired omics mining with its modular setup. For example, recent developments in omics processing, annotation tools, and ranking metrics could be added to the framework [@louwen_enhanced_2023; @louwen_ipresto_2023]. Moreover, several such linking scores could then be used together with the currently implemented strain correlation score to further improve ranking results.

![The NPLinker 2 framework. The current pipeline consists of five main components: 1. Initiating an analysis with an input block that includes a configuration file and optional input data; 2. Preparing the dataset by automatically downloading or generating data; 3. Loading and parsing data from data files; 4. Scoring and linking data; 5. Creating an output for analysis and visualization of results.\label{fig:1}](fig1.png)

# Features of NPLinker 2
NPLinker 2 is redesigned based on NPLinker version 1.x [@eldjarn_ranking_2021] to provide a more flexible, modular and extensible framework for linking BGCs to mass spectra. The pipeline is shown in \autoref{fig:1}, and the key features are highlighted below.

## Installation ease
NPLinker 2 is distributed as a Python package, but it relies on several third-party tools and databases that are not available via PyPI, which can make the installation more complex. To simplify the setup process, NPLinker 2 includes an installation script that automatically installs the required non-PyPI dependencies and databases.

To install NPLinker 2 and its dependencies, users can run the following commands:

```bash
# Install the NPLinker package
pip install --pre nplinker

# Install non-PyPI dependencies and required databases
install-nplinker-deps
```

## Configurable
NPLinker 2 is configured through the file `nplinker.yaml`, which is required to customise the pipeline according to users' needs, e.g., by selecting the run mode or choosing scoring methods. A [friendly template](https://nplinker.github.io/nplinker/latest/concepts/config_file/) is available on the documentation website to help users create and fill in their configuration file from scratch.

## Local mode and PODP mode
NPLinker 2 supports both local and remote data sources through two operational modes: **local mode** and **PODP mode**. In local mode, users provide their local data files as input, e.g., antiSMASH [@blin_antismash_2025] output files and mass spectral data [@wang_sharing_2016] supported by matchms [@huber_matchms_2020]. In contrast, the Paired Omics Data Platform (PODP) [@schorn_community_2021] mode requires no input data files from the user, as the pipeline automatically downloads the necessary data files from the [PODP server](https://pairedomicsdata.bioinformatics.nl/) using the PODP ID specified in the `nplinker.yaml` file. This dual-mode support enables private and public data analysis using the same pipeline.

## Modular and extensible
Modularity and extensibility are key features of NPLinker 2, which provides a set of interfaces and data models that users can extend.

**"Prepare Data" component:** The core class of this component, `DatasetArranger`, orchestrates various downloaders and runners to automatically download and generate the required data files. These files are then stored in the local working directory specified in the `nplinker.yaml` configuration file. Users can extend this component by adding new downloaders or runners, e.g., to download BGC data from a new source or generate data files using a different method or tool.

**"Load Data" component:** The `DatasetLoader` class manages data loaders responsible for loading and parsing genomics, metabolomics, and strain data files. Users can add new data loaders to support additional sources or formats. For example, to load BGC data from a new source, one can define a Python class `NewBGCLoader` that inherits from the `BGCLoaderBase` interface and implements the `get_files` and `get_bgcs` methods, then register it within the `DatasetLoader` class.
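
The loader pattern described above can be sketched as follows. This is a minimal illustration and not NPLinker's actual code: the `BGCLoaderBase` defined here is a hypothetical stand-in whose constructor, method signatures and return types are assumptions, and the `.gbk` file pattern and dictionary-based BGC records are chosen only for the example. Registration within `DatasetLoader` is not shown, since its mechanism is not described here.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BGCLoaderBase(ABC):
    """Hypothetical stand-in for the BGCLoaderBase interface (signatures assumed)."""

    def __init__(self, data_dir: str) -> None:
        self.data_dir = Path(data_dir)

    @abstractmethod
    def get_files(self) -> dict[str, str]:
        """Return a mapping from BGC identifier to file path."""

    @abstractmethod
    def get_bgcs(self) -> list[dict[str, str]]:
        """Return the parsed BGC records."""


class NewBGCLoader(BGCLoaderBase):
    """Illustrative loader that treats every *.gbk file as one BGC."""

    def get_files(self) -> dict[str, str]:
        # Use the file name without its extension as the BGC identifier.
        return {p.stem: str(p) for p in sorted(self.data_dir.glob("*.gbk"))}

    def get_bgcs(self) -> list[dict[str, str]]:
        # A real loader would parse each file; here only minimal records are built.
        return [{"id": bgc_id, "file": path} for bgc_id, path in self.get_files().items()]
```

Because the two methods are abstract, instantiating a subclass that omits either of them raises a `TypeError`, which is one way an interface of this kind can guard against incomplete loaders.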

**"Scoring" component:** This component handles the linking of data and the scoring of those links. An undirected graph is used to store the linked data, with nodes corresponding to genomics or metabolomics data items and edges representing the links between them with scoring values, as illustrated in \autoref{fig:2}. The `ScoringBase` interface is provided to allow the implementation of custom scoring methods.
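
To make the graph representation concrete, the sketch below stores scored links in a small undirected graph built from plain dictionaries. It is an illustration only and does not reproduce NPLinker's own classes: `LinkGraph`, its methods, the node identifiers and the `"strain_correlation"` method label are all hypothetical names introduced for this example.

```python
from collections import defaultdict


class LinkGraph:
    """Illustrative undirected graph of scored links (hypothetical, not NPLinker's class)."""

    def __init__(self) -> None:
        # node -> neighbour -> {scoring method name: score}
        self._adj: dict[str, dict[str, dict[str, float]]] = defaultdict(dict)

    def add_link(self, u: str, v: str, method: str, score: float) -> None:
        # Both directions share one score dict, so the edge is undirected.
        scores = self._adj[u].setdefault(v, {})
        self._adj[v][u] = scores
        scores[method] = score

    def links(self, node: str) -> dict[str, dict[str, float]]:
        # Return the neighbours of a node with their scores.
        return dict(self._adj.get(node, {}))

    def ranked(self, node: str, method: str) -> list[tuple[str, float]]:
        # Rank the neighbours of a node by one scoring method, best first.
        scored = [(other, s[method]) for other, s in self._adj.get(node, {}).items() if method in s]
        return sorted(scored, key=lambda item: item[1], reverse=True)


graph = LinkGraph()
graph.add_link("GCF_1", "spectrum_42", "strain_correlation", 0.8)
graph.add_link("GCF_1", "MF_7", "strain_correlation", 0.3)
```

Keeping one score dictionary per edge lets several scoring methods annotate the same link, and the link can be queried from either the genomics or the metabolomics side.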

![Graph representation of links. \label{fig:2}](fig2.png){ width=40%}

## New documentation website
A dedicated documentation website is available to help users and developers understand how to use and extend NPLinker 2. It includes tutorials, conceptual overviews, pipeline diagrams, and an API reference. The documentation is available at [https://nplinker.github.io/nplinker](https://nplinker.github.io/nplinker).

## New unit tests and integration tests
Unit and integration tests are included in NPLinker 2 to verify that the codebase and the overall pipeline behave correctly. The tests can be run in parallel to speed up the testing process.

## Enforced static typing
NPLinker 2 is developed with enforced static typing, which means that all functions and methods have type hints that specify their input and output types. This helps developers understand the code and catch type errors early when dealing with complex genomic and metabolomic data and the processed and annotated data derived from them.
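
As a small illustration of what fully type-hinted code looks like, consider the sketch below. The `Spectrum` dataclass and the `filter_by_precursor_mz` function are hypothetical names invented for this example and are not part of NPLinker's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spectrum:
    """Hypothetical minimal spectrum record used only for this example."""

    spectrum_id: str
    precursor_mz: float


def filter_by_precursor_mz(
    spectra: list[Spectrum], min_mz: float, max_mz: float
) -> list[Spectrum]:
    # Keep only the spectra whose precursor m/z lies within the closed interval.
    return [s for s in spectra if min_mz <= s.precursor_mz <= max_mz]


spectra = [Spectrum("s1", 150.2), Spectrum("s2", 420.7), Spectrum("s3", 899.9)]
selected = filter_by_precursor_mz(spectra, 200.0, 900.0)
```

With hints like these, a static type checker such as mypy can report a call like `filter_by_precursor_mz(spectra, "200", 900.0)` before the code is ever run.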

## User-friendly webapp

The [NPLinker web application](https://github.com/NPLinker/nplinker-webapp) (webapp) is an interactive dashboard built with [Plotly Dash](https://plotly.com/), designed to make NPLinker’s linking results accessible through a user-friendly web interface. A [publicly hosted demo](https://nplinker-webapp.onrender.com/) allows immediate testing: users can load a sample dataset with a single click and start exploring. To enable full functionality with larger datasets, the webapp can be installed locally or run via Docker (using the [nplinker-webapp image](https://github.com/nplinker/nplinker-webapp/pkgs/container/nplinker-webapp)). Notably, the webapp focuses on visualization and post-analysis; link scoring between genomic and metabolomic entries is performed beforehand by the NPLinker backend.

Links can currently be explored starting from either omics view: from Gene Cluster Families (GCFs) clustered by BiG-SCAPE [@navarro-munoz_computational_2020] to mass spectra or molecular families (MFs) clustered by molecular networking [@wang_sharing_2016], or vice versa. Once the data is loaded, the interface provides two complementary views: genomics-to-metabolomics and metabolomics-to-genomics. This dual-tab layout allows users to begin from either data type and inspect associated links in the other domain. Each view presents the input data and predicted links in sortable, filterable tables, with support for multiple filtering criteria (e.g., GCF, MF, spectrum IDs, BGC classes, score thresholds). This enables rapid prioritization of promising BGC–metabolite links. Results can also be exported as Excel files for downstream analysis and record-keeping, allowing smooth integration into existing workflows.

# Acknowledgements

This work was supported by the Netherlands eScience Center under grant number NLESC.OEC.2021.002 (M.H. Medema and J.J.J. van der Hooft). A. Lien, L. R. Torres Ortega, D. Ferreira, K.R. Duncan, M.H. Medema and J. J. J. van der Hooft acknowledge the MAGic-MOLFUN Doctoral Training Network that has received funding from the European Union's Horizon Europe programme under the Marie Skłodowska-Curie grant agreement No. 101072485 (M.H. Medema and J.J.J. van der Hooft) and UKRI EP/X03142X/2 (K.R. Duncan).

# Conflict of Interest
JJJvdH is a member of the Scientific Advisory Board of NAICONS Srl., Milano, Italy and consults for Corteva Agriscience, Indianapolis, IN, USA. MHM is a member of the scientific advisory boards of Hexagon Bio and Hothouse Therapeutics Ltd. All other authors declare that they have no competing interests.

# References

joss/paper.pdf

359 KB
Binary file not shown.
