70 commits
5bbac56
aist new license
trieuhl Mar 21, 2022
230cf36
setup environment via conda
trieuhl Mar 21, 2022
8bbf011
update requirements
trieuhl Mar 21, 2022
e1a812d
torch version
trieuhl Mar 21, 2022
1428e99
prepare for training cg
trieuhl Mar 21, 2022
e5202ca
event structures
trieuhl Mar 21, 2022
c889d8a
generate training configs
trieuhl Mar 21, 2022
9fbf3a2
configs for debug mode
trieuhl Mar 21, 2022
a83ddaa
fix bug
trieuhl Mar 21, 2022
a1950b1
fix bug
trieuhl Mar 21, 2022
d9c7bcb
fix bug
trieuhl Mar 21, 2022
f5ca6f6
fix bugs
trieuhl Mar 21, 2022
63e73b6
cg debug data
trieuhl Mar 21, 2022
dc8c535
training model
trieuhl Mar 21, 2022
b3f7e61
training
trieuhl Mar 21, 2022
b566412
readme
trieuhl Mar 21, 2022
b2734fb
training scripts
trieuhl Mar 21, 2022
7458407
fix bug
trieuhl Mar 21, 2022
fb2a898
script name
trieuhl Mar 21, 2022
c12486d
missing script
trieuhl Mar 21, 2022
59808e3
epochs for debug mode
trieuhl Mar 21, 2022
025690f
scripts path
trieuhl Mar 21, 2022
d2197b0
predict
trieuhl Mar 21, 2022
cc7eaa4
prediction script
trieuhl Mar 21, 2022
ff8f6db
readme
trieuhl Mar 21, 2022
d18e2d7
readme
trieuhl Mar 21, 2022
22fac68
prediction configs
trieuhl Mar 21, 2022
0103c80
bionlp prediction configs
trieuhl Mar 21, 2022
ba13b99
prediction path
trieuhl Mar 21, 2022
3f98bcf
prediction path
trieuhl Mar 21, 2022
7b9116e
prediction configs
trieuhl Mar 21, 2022
3c63c93
fix bug
trieuhl Mar 21, 2022
0af84d6
fix bug
trieuhl Mar 21, 2022
fed491a
fix bug
trieuhl Mar 21, 2022
0d453c4
fix path
trieuhl Mar 21, 2022
1fa3839
output path
trieuhl Mar 21, 2022
5495e83
update config path
trieuhl Mar 21, 2022
1c41fda
load saved parameters
trieuhl Mar 21, 2022
a21989c
savaed params config
trieuhl Mar 21, 2022
19ceca7
prediction scripts
trieuhl Mar 21, 2022
d66972e
fix bug
trieuhl Mar 21, 2022
7a87507
python path
trieuhl Mar 21, 2022
babfeb3
bionlp prediction
trieuhl Mar 21, 2022
c44d387
fix path
trieuhl Mar 21, 2022
246a105
setup
trieuhl Mar 21, 2022
5eec912
sklearn version
trieuhl Mar 21, 2022
7018136
sklearn
trieuhl Mar 21, 2022
c803f2c
fix bug
trieuhl Mar 21, 2022
12205e3
fix bug
trieuhl Mar 21, 2022
d740223
process input
trieuhl Mar 21, 2022
9666202
fix bug
trieuhl Mar 21, 2022
e3b6763
fix bug
trieuhl Mar 21, 2022
a4c5dea
nested events in prediction
trieuhl Mar 21, 2022
f879ca7
modality in prediction
trieuhl Mar 21, 2022
2eecdb1
fix bug
trieuhl Mar 21, 2022
98f5fe8
write annotations
trieuhl Mar 21, 2022
4c913d0
write output
trieuhl Mar 21, 2022
1ff2145
fix bug
trieuhl Mar 21, 2022
aad6562
write output for prediction
trieuhl Mar 21, 2022
3c2eca0
fix bug
trieuhl Mar 21, 2022
653de58
write output
trieuhl Mar 21, 2022
6ba0689
fix bug
trieuhl Mar 21, 2022
1150ab2
fix bug
trieuhl Mar 21, 2022
46a0f93
event prediction
trieuhl Mar 21, 2022
f0c2d08
data path
trieuhl Mar 21, 2022
94abab3
install pubmed requirements
trieuhl Mar 21, 2022
c468db2
pubmed configs
trieuhl Mar 21, 2022
2a4c359
predict on pubmed for raw-text
trieuhl Mar 21, 2022
bdd0c9b
raw text config
trieuhl Mar 21, 2022
be6f613
brat path
trieuhl Mar 21, 2022
2 changes: 1 addition & 1 deletion LICENSE
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright 2019 National Institute of Advanced Industrial Science and Technology (AIST)
Copyright National Institute of Advanced Industrial Science and Technology (AIST), AIST-Product-ID: 2022PRO-2776

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
203 changes: 159 additions & 44 deletions README.md
@@ -1,110 +1,200 @@
# 1. DeepEventMine
# DeepEventMine

A deep learning model to predict named entities, triggers, and nested events from biomedical texts.

- The model and results are reported in our paper:

[DeepEventMine: End-to-end Neural Nested Event Extraction from Biomedical Texts](https://doi.org/10.1093/bioinformatics/btaa540), Bioinformatics, 2020.
[DeepEventMine: End-to-end Neural Nested Event Extraction from Biomedical Texts](https://doi.org/10.1093/bioinformatics/btaa540)
, Bioinformatics, 2020.

## Overview

1. Features

## 1.1. Features
- Based on [pre-trained BERT](https://github.com/allenai/scibert)
- Predict nested entities and nested events
- Provide our trained models on the seven biomedical tasks
- End-to-end event extraction, fine-tuned on [pre-trained BERT](https://github.com/allenai/scibert)
- Train and predict nested entities and nested events
- Provide our pre-trained models on seven biomedical tasks
- Reproduce the results reported in our [Bioinformatics](https://doi.org/10.1093/bioinformatics/btaa540) paper
- Predict for new data given raw text input or PubMed ID
- Visualize the predicted entities and events in [brat](http://brat.nlplab.org)

## 1.2. Tasks
2. Tasks

- DeepEventMine has been trained and evaluated on the following tasks (six BioNLP shared tasks and MLEE).

1. cg: [Cancer Genetics (CG), 2013](http://2013.bionlp-st.org/tasks/cancer-genetics)
2. ge11: [GENIA Event Extraction (GENIA), 2011](http://2011.bionlp-st.org/home/genia-event-extraction-genia)
3. ge13: [GENIA Event Extraction (GENIA), 2013](http://bionlp.dbcls.jp/projects/bionlp-st-ge-2013/wiki/Overview)
4. id: [Infectious Diseases (ID), 2011](http://2011.bionlp-st.org/home/infectious-diseases)
5. epi: [Epigenetics and Post-translational Modifications (EPI), 2011](http://2011.bionlp-st.org/home/epigenetics-and-post-translational-modifications)
5.
epi: [Epigenetics and Post-translational Modifications (EPI), 2011](http://2011.bionlp-st.org/home/epigenetics-and-post-translational-modifications)
6. pc: [Pathway Curation (PC), 2013](http://2013.bionlp-st.org/tasks/pathway-curation)
7. mlee: [Multi-Level Event Extraction (MLEE)](http://nactem.ac.uk/MLEE/)

## 1.3. Our trained models and scores
# 1. Preparation

- [Our trained models](https://b2share.eudat.eu/records/80d2de0c57d64419b722dc1afa375f28)
- [Our scores](https://b2share.eudat.eu/api/files/3cf6c1f4-5eed-4ee3-99c5-d99f5f011be3/scores.tar.gz)
1. Install conda environment

```bash
sh setup/conda-install.sh
```

2. Create a conda environment (for the first time)

```bash
. setup/conda-create.sh
```

3. Activate the conda environment

- Activate it each time before you run anything: before installing packages, before running evaluation scripts, etc.

```bash
. setup/conda-activate.sh
```

4. Install requirements

# 2. Preparation
## 2.1. Requirements
- Python 3.6.5
- PyTorch (torch==1.1.0 torchvision==0.3.0, cuda92)
- Python dependencies

```bash
virtualenv -p python3 pytorch-env
source pytorch-env/bin/activate
export CUDA_VISIBLE_DEVICES=0
CUDA_PATH=/usr/local/cuda pip install torch==1.1.0 torchvision==0.3.0
pip install -r setup/requirements.txt
```

- Install Python packages
5. [Brat](https://github.com/nlplab/brat) for visualization

- [brat instructions](http://brat.nlplab.org/installation.html)

```bash
sh install.sh
sh setup/install-brat.sh
python2 standalone.py
```

## 2.2. BERT
- Download SciBERT BERT model from PyTorch AllenNLP
# 2. Training CG

1. Download data and process

- Download data
- Process data to appropriate format
- Tokenize texts and retrieve offsets
- Data statistics
- Download the processed event structures
- The [original BioNLP 2013](http://2013.bionlp-st.org/tasks/cancer-genetics) site (for downloading the CG data) seems to have become unavailable. We found an alternative link
for the [CG13 task](https://sites.google.com/site/bionlpst2013/tasks/cancer-genetics-cg-task), from which you can download the data
yourself. We are not sure that the data is identical to the original, so please verify it yourself or contact the workshop organizers.

```bash
sh download.sh bert
sh run/prepare-cg.sh
```

## 2.3. DeepEventMine
- Download pre-trained DeepEventMine model on a given task
- [task] = cg (or pc, ge11, epi, etc)
2. Download models

- Download SciBERT model from PyTorch AllenNLP

```bash
sh download.sh deepeventmine [task]
sh run/download-bert.sh
```

## 2.4 Brat
- To visualize the output using the [brat](http://brat.nlplab.org)
- Download [brat v1.3](http://brat.nlplab.org)
3. Generate configs

- Configs for training CG task

```bash
sh download.sh brat
sh run/generate_configs.sh cg basic
```

- Install brat based on the [brat instructions](http://brat.nlplab.org/installation.html)
- Experiment name: basic, exp1, exp2, etc
- Or run in debug mode (on a small dataset, for several epochs)

```bash
cd brat/brat-v1.3_Crunchy_Frog/
./install.sh -u
python2 standalone.py
sh run/generate_configs-debug.sh cg debug
```

4. Training

- Pretrain layers (this needs to be done before training the joint model)
- Replace "basic" by "debug" to quickly try experiments on the small data (debug mode)

```bash
sh run/train.sh experiments/cg/basic/configs/train-ner.yaml
sh run/train.sh experiments/cg/basic/configs/train-rel.yaml
sh run/train.sh experiments/cg/basic/configs/train-ev.yaml
```

- Train joint model: given gold entities

```bash
sh run/train.sh experiments/cg/basic/configs/train-joint-gold.yaml
```

- Train joint end-to-end model

```bash
sh run/train.sh experiments/cg/basic/configs/train-joint-e2e.yaml
```

5. Predict

- Given gold entities

```bash
sh run/predict.sh experiments/cg/basic/configs/predict-gold-dev.yaml
sh run/predict.sh experiments/cg/basic/configs/predict-gold-test.yaml
```

- End-to-end

```bash
sh run/predict.sh experiments/cg/basic/configs/predict-e2e-dev.yaml
sh run/predict.sh experiments/cg/basic/configs/predict-e2e-test.yaml
```
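
The order of the training and prediction steps above can be summarized in a small dry-run sketch. It only prints the commands for a given task and experiment name and does not run anything. `print_pipeline` is a hypothetical helper, not a script in this repository, and it assumes the config file names follow the pattern shown in the commands above.

```bash
# Hypothetical helper (not part of the repository): print the training and
# prediction commands in the order described above, without running them.
print_pipeline() {
    task="$1"   # task name, e.g. cg
    exp="$2"    # experiment name, e.g. basic or debug
    cfg="experiments/$task/$exp/configs"

    # Pretrain the layers first, then train the joint models.
    for stage in ner rel ev joint-gold joint-e2e; do
        echo "sh run/train.sh $cfg/train-$stage.yaml"
    done

    # Predict on the dev and test sets: gold entities first, then end-to-end.
    for mode in gold e2e; do
        for split in dev test; do
            echo "sh run/predict.sh $cfg/predict-$mode-$split.yaml"
        done
    done
}

print_pipeline cg basic
```

Printing the commands first makes it easy to check the config paths before starting a long training run.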

# 3. Predict (BioNLP tasks)

## 3.1. Prepare data

1. Download corpora

- Download the original data sets from the BioNLP shared tasks.
- [task] = cg, pc, ge11, etc

```bash
sh download.sh bionlp [task]
```

2. Preprocess data
2. Download our pre-trained DeepEventMine model on a given task

- [Our trained models](https://b2share.eudat.eu/records/80d2de0c57d64419b722dc1afa375f28)
- [Our scores](https://b2share.eudat.eu/api/files/3cf6c1f4-5eed-4ee3-99c5-d99f5f011be3/scores.tar.gz)
- [task] = cg (or pc, ge11, epi, etc)

```bash
sh download.sh deepeventmine [task]
```

3. Preprocess data

- Tokenize texts and prepare data for prediction

```bash
sh preprocess.sh bionlp
```

3. Generate configs
4. Generate configs

- If using GPU: [gpu] = 0, otherwise: [gpu] = -1
- [task] = cg, pc, etc

```bash
sh run.sh config [task] [gpu]
```
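
To generate configs for several tasks at once, a loop along the following lines may help. This is only a sketch: `print_config_cmds` is a hypothetical helper that prints one `sh run.sh config [task] [gpu]` command per task so the commands can be reviewed before they are run.

```bash
# Hypothetical helper (not part of the repository): print one config
# command per task. The first argument is the gpu value described above
# (0 when using a GPU, -1 otherwise); the remaining arguments are task names.
print_config_cmds() {
    gpu="$1"
    shift
    for task in "$@"; do
        echo "sh run.sh config $task $gpu"
    done
}

# Example: CPU-only configs for three of the tasks listed in the overview.
print_config_cmds -1 cg pc ge11
```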

## 3.2. Predict

1. For development and test sets (given gold entities)

- CG task: [task] = cg
- PC task: [task] = pc
- Similarly for: ge11, ge13, epi, id, mlee
@@ -113,7 +203,9 @@ sh run.sh config [task] [gpu]
sh run.sh predict [task] gold dev
sh run.sh predict [task] gold test
```

- Check the output in the path

```bash
experiments/[task]/predict-gold-dev/
experiments/[task]/predict-gold-test/
@@ -122,6 +214,7 @@ experiments/[task]/predict-gold-test/
## 3.3. Evaluate

1. Retrieve the original offsets and create zip format

```bash
sh run.sh offset [task] gold dev
sh run.sh offset [task] gold test
@@ -130,10 +223,14 @@ sh run.sh offset [task] gold test
2. Submit the zipped file to the shared task evaluation sites:

- [CG Test](http://weaver.nlplab.org/~bionlp-st/BioNLP-ST-2013/CG/submission/)
- [GE11 Test](http://bionlp-st.dbcls.jp/GE/2011/eval-test/), [GE11 Devel](http://bionlp-st.dbcls.jp/GE/2011/eval-development/)
- [GE13 Test](http://bionlp-st.dbcls.jp/GE/2013/eval-test/), [GE13 Devel](http://bionlp-st.dbcls.jp/GE/2013/eval-development/)
- [ID Test](http://weaver.nlplab.org/~bionlp-st/BioNLP-ST/ID/test-eval.html), [ID Devel](http://weaver.nlplab.org/~bionlp-st/BioNLP-ST/ID/devel-eval.htm)
- [EPI Test](http://weaver.nlplab.org/~bionlp-st/BioNLP-ST/EPI/test-eval.html), [EPI Devel](http://weaver.nlplab.org/~bionlp-st/BioNLP-ST/EPI/devel-eval.htm)
- [GE11 Test](http://bionlp-st.dbcls.jp/GE/2011/eval-test/)
, [GE11 Devel](http://bionlp-st.dbcls.jp/GE/2011/eval-development/)
- [GE13 Test](http://bionlp-st.dbcls.jp/GE/2013/eval-test/)
, [GE13 Devel](http://bionlp-st.dbcls.jp/GE/2013/eval-development/)
- [ID Test](http://weaver.nlplab.org/~bionlp-st/BioNLP-ST/ID/test-eval.html)
, [ID Devel](http://weaver.nlplab.org/~bionlp-st/BioNLP-ST/ID/devel-eval.htm)
- [EPI Test](http://weaver.nlplab.org/~bionlp-st/BioNLP-ST/EPI/test-eval.html)
, [EPI Devel](http://weaver.nlplab.org/~bionlp-st/BioNLP-ST/EPI/devel-eval.htm)
- [PC Test](http://weaver.nlplab.org/~bionlp-st/BioNLP-ST-2013/PC/submission/)

3. Evaluate events
@@ -148,18 +245,23 @@ sh run.sh eval [task] gold dev sp
# 4. End-to-end

## 4.1. Input: a single PMID or PMCID

- Abstract

```bash
sh pubmed.sh e2e pmid 1370299 cg 0
```

- Full text

```bash
sh pubmed.sh e2e pmcid PMC4353630 cg 0
```

- Input: [PMID: 1370299](https://pubmed.ncbi.nlm.nih.gov/1370299/), [PMCID: PMC4353630](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4353630/) (a single PubMed ID to get raw text)
- Model to predict: DeepEventMine trained on [cg (Cancer Genetics 2013)](http://2013.bionlp-st.org/tasks/cancer-genetics), (other options: pc, ge11, etc)
- Input: [PMID: 1370299](https://pubmed.ncbi.nlm.nih.gov/1370299/)
, [PMCID: PMC4353630](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4353630/) (a single PubMed ID to get raw text)
- Model to predict: DeepEventMine trained
on [cg (Cancer Genetics 2013)](http://2013.bionlp-st.org/tasks/cancer-genetics), (other options: pc, ge11, etc)
- GPU: 0 (if CPU: -1)
- Output: in brat format and [brat visualization](http://brat.nlplab.org)

@@ -193,6 +295,7 @@ E24 Positive_regulation:T61 Theme:E10

- Choose an arbitrary name for your raw text data, for example "my-pubmed"
- Prepare a list of PMIDs and PMCIDs in the path

```bash
data/my-pubmed/pmid.txt
```
@@ -205,6 +308,7 @@ sh pubmed.sh e2e pmids my-pubmed cg 0

- Choose an arbitrary name for your raw text data, for example "my-pubmed"
- Prepare your raw text files in the path

```bash
data/my-pubmed/text/PMID-*.txt
data/my-pubmed/text/PMC-*.txt
@@ -236,6 +340,7 @@ data/my-pubmed/text/PMC-*.txt
### Get raw text

1. PubMed ID list

- To get the full text for a PMC ID, the text must be available in ePub (in our current version).
- Prepare your list of PubMed IDs and PMC IDs in the path

@@ -244,12 +349,15 @@ data/my-pubmed/pmid.txt
```

- Get text from the PubMed ID

```bash
sh pubmed.sh pmids my-pubmed
```

2. PubMed ID

- You can also get text by directly inputting a PubMed or PMC ID

```bash
sh pubmed.sh pmid 1370299
sh pubmed.sh pmcid PMC4353630
@@ -264,6 +372,7 @@ sh pubmed.sh preprocess my-pubmed
## 5.3. Predict

1. Generate config

- Generate config for prediction
- The data name to predict: my-pubmed
- The trained model used for prediction: cg (or pc, ge11, etc)
@@ -286,6 +395,7 @@ sh pubmed.sh offset my-pubmed
```

- Check the output in

```bash
experiments/my-pubmed/results/ev-last/my-pubmed-brat
```
@@ -296,11 +406,13 @@ experiments/my-pubmed/results/ev-last/my-pubmed-brat

- Copy the predicted data into the brat folder to visualize
- For the raw text prediction:

```bash
sh pubmed.sh brat my-pubmed cg
```

- Or for the shared task

```bash
sh run.sh brat [task] gold dev
sh run.sh brat [task] gold test
@@ -316,10 +428,13 @@ brat/brat-v1.3_Crunchy_Frog/data/[task]-brat
```

# 7. Acknowledgements
This work is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).
This work is also supported by PRISM (Public/Private R&D Investment Strategic Expansion PrograM).

This work is based on results obtained from a project commissioned by the New Energy and Industrial Technology
Development Organization (NEDO). This work is also supported by PRISM (Public/Private R&D Investment Strategic Expansion
PrograM).

# 8. Citation

```bash
@article{10.1093/bioinformatics/btaa540,
author = {Trieu, Hai-Long and Tran, Thy Thy and Duong, Khoa N A and Nguyen, Anh and Miwa, Makoto and Ananiadou, Sophia},