
Add XLSum evaluation / unify eval script #12


Open

wants to merge 166 commits into base: sentence_retrieval_eval

166 commits

4a8cd34
Merge pull request #8 from bigscience-workshop/sentence_retrieval_eval
vnikouliNLE Apr 3, 2022
347d22f
added script to train tokenizer only on a subset of the dataset
vnikouliNLE Apr 22, 2022
7069276
added script to train tokenizer only on a subset of the dataset
vnikouliNLE Apr 22, 2022
ad6d511
updated instructions for samples tokenizer
vnikouliNLE Apr 22, 2022
4e1f137
updated training script: added some extra parameters in the running s…
vnikouliNLE Apr 22, 2022
7ff1c18
added overlap-replace parameter, added possibility to save embedding …
vnikouliNLE Apr 22, 2022
888f49f
Merge branch 'sentence_retrieval_eval' of https://github.com/bigscien…
vnikouliNLE Apr 22, 2022
0967f07
Merge remote-tracking branch 'remotes/origin/sentence_retrieval_eval'…
vnikouliNLE May 5, 2022
afb108d
update madx_run_clm
yongzx May 6, 2022
a79bfd0
adapted xnli script to properly load wte, wpe and adapters
vnikouliNLE May 6, 2022
c9a8cec
updated the way we save the model; added fp16 training
vnikouliNLE May 6, 2022
3e8bd62
add xlsum script (version #1)
May 9, 2022
2cd27a3
add unified eval script
May 9, 2022
fac77dc
xlsum separate script
May 9, 2022
96339f4
script bugfixes
May 9, 2022
043ece7
change zero_shot to cross_lingual
yongzx May 11, 2022
6ac743a
load language adapters during inference setting
yongzx May 11, 2022
d4b0e30
updated tokenizer training script
vnikouliNLE May 11, 2022
f3a165e
added xnli zero shot training and eval scripts
vnikouliNLE May 11, 2022
5ed40b9
added xnli zero shot training and eval scripts
vnikouliNLE May 11, 2022
f69d167
Merge branch 'ext_exp' of https://github.com/bigscience-workshop/mult…
vnikouliNLE May 11, 2022
2497ba3
merged with current version
vnikouliNLE May 11, 2022
f35b984
added script to get stats about different tokenizers
vnikouliNLE May 11, 2022
dbf3f0e
added script to get stats about different tokenizers
vnikouliNLE May 11, 2022
639c4da
added script to get stats about different tokenizers
vnikouliNLE May 11, 2022
685f402
added script to get stats about different tokenizers
vnikouliNLE May 11, 2022
04f9fab
fixed tokenizer training with unk token
vnikouliNLE May 11, 2022
04f893f
add num_classes arg to model init
May 12, 2022
448035e
rename pretrained_model to adapted_model
yongzx May 13, 2022
7a8f899
use updated eval_xnli/adapters_xnli_de_vn.py
yongzx May 13, 2022
ac86e1c
update XNLI
yongzx May 13, 2022
f25c8b8
Merge pull request #11 from bigscience-workshop/ext_exp
yongzx May 13, 2022
b31f805
add seq2seq training and fix compute_metrics
May 13, 2022
f6f6bd5
Merge branch 'bigscience-workshop:master' into eval_xlsum
haileyschoelkopf May 27, 2022
1dbf727
merge
yongzx May 30, 2022
3b59222
exp-001: finetune gpt-2 model with new tokenizer on fr
Sep 13, 2021
3335209
add run_clm.py
yongzx Sep 13, 2021
b58c999
update run_clm and run_clm_no_tok
yongzx Sep 15, 2021
2dd82ad
reduce per_device_{train, eval}_batch_size and increase {gradient, ev…
yongzx Sep 16, 2021
27d5675
update exp-001 exp-002
yongzx Oct 6, 2021
5e81cbe
update README
yongzx Oct 6, 2021
19ac169
Update README.md
yongzx Oct 13, 2021
c3620d2
Update README.md
yongzx Oct 13, 2021
049960c
update exp-002 and exp-004
yongzx Oct 15, 2021
d611b2e
requirements.txt
yongzx Oct 18, 2021
5fe1e58
update
yongzx Oct 27, 2021
d5beb9c
remove experiments folder
yongzx Oct 27, 2021
f751bfb
korean exp-001
yongzx Oct 27, 2021
ef390fc
update
yongzx Oct 27, 2021
2809dd6
upload scripts
yongzx Dec 21, 2021
4975573
remove exp-001
yongzx Dec 21, 2021
f003286
remove README
yongzx Dec 21, 2021
aa5f308
added scripts to evaluate each layer of pretrained LM (encoder only) …
vnikouliNLE Feb 9, 2022
75bf06e
fix due to diff tokenization for ted dataset
vnikouliNLE Feb 11, 2022
6b200d5
added scripts for training model with adapters and embedding layer FT
vnikouliNLE Feb 23, 2022
481ee2c
eval_xnli_de
yongzx Mar 5, 2022
fbb3cf3
update
yongzx Mar 29, 2022
2574ac9
update
yongzx Mar 29, 2022
895cd76
update
yongzx Mar 29, 2022
2bef084
update README
yongzx Mar 29, 2022
6c5c05a
update
yongzx Mar 29, 2022
1c701c5
update xnli evaluation
yongzx Apr 4, 2022
76815fa
added script to train tokenizer only on a subset of the dataset
vnikouliNLE Apr 22, 2022
5012f7f
added script to train tokenizer only on a subset of the dataset
vnikouliNLE Apr 22, 2022
55f62fd
updated instructions for samples tokenizer
vnikouliNLE Apr 22, 2022
4ce7678
updated training script: added some extra parameters in the running s…
vnikouliNLE Apr 22, 2022
5cadd43
added overlap-replace parameter, added possibility to save embedding …
vnikouliNLE Apr 22, 2022
e619e73
update eval
yongzx May 2, 2022
7d906f6
update eval
yongzx May 2, 2022
067a2a7
clean
yongzx May 2, 2022
7659d05
update madx_run_clm
yongzx May 6, 2022
a47e65b
adapted xnli script to properly load wte, wpe and adapters
vnikouliNLE May 6, 2022
e9ca92f
updated the way we save the model; added fp16 training
vnikouliNLE May 6, 2022
91f739a
change zero_shot to cross_lingual
yongzx May 11, 2022
aa5256e
load language adapters during inference setting
yongzx May 11, 2022
630f2f6
updated tokenizer training script
vnikouliNLE May 11, 2022
c744531
added xnli zero shot training and eval scripts
vnikouliNLE May 11, 2022
045c32d
added xnli zero shot training and eval scripts
vnikouliNLE May 11, 2022
3496bec
merged with current version
vnikouliNLE May 11, 2022
cc4b113
added script to get stats about different tokenizers
vnikouliNLE May 11, 2022
4a2d031
added script to get stats about different tokenizers
vnikouliNLE May 11, 2022
4dd8cea
added script to get stats about different tokenizers
vnikouliNLE May 11, 2022
286706a
added script to get stats about different tokenizers
vnikouliNLE May 11, 2022
cf376c6
fixed tokenizer training with unk token
vnikouliNLE May 11, 2022
089071d
rename pretrained_model to adapted_model
yongzx May 13, 2022
d969c5c
use updated eval_xnli/adapters_xnli_de_vn.py
yongzx May 13, 2022
a13f082
update XNLI
yongzx May 13, 2022
c96f5fd
Merge branch 'sentence_retrieval_eval' of https://github.com/bigscien…
yongzx May 30, 2022
ab01bce
update
yongzx May 31, 2022
f98bb12
update
yongzx Jun 1, 2022
fe81c52
Merge commit 'refs/pull/12/head' of https://github.com/bigscience-wor…
yongzx Jun 1, 2022
614ca0d
update tokenizer for BLOOM
yongzx Jun 2, 2022
1fdd535
language adaptation BLOOM
yongzx Jun 2, 2022
ff9c0a2
update
yongzx Jun 2, 2022
4c55c95
updated scripts for sentence retrieval eval
vnikouliNLE Jun 9, 2022
ef0b4ac
added prefix tuning option
vnikouliNLE Jun 13, 2022
dbea748
backcompatibility with original BS model
vnikouliNLE Jun 13, 2022
8f4815a
Debugging XNLI
yongzx Jun 15, 2022
17aae41
seems like embedding+adapter wasn't working as expected in the previo…
vnikouliNLE Jun 16, 2022
eb8c73f
Merge pull request #13 from bigscience-workshop/backcompatibility
yongzx Jun 20, 2022
86b3046
add baseline arguments to eval
yongzx Jun 21, 2022
5c9501b
xnli adappters
yongzx Jun 21, 2022
8a8b9ca
remove legacy code adapter_lang_name
yongzx Jun 21, 2022
7eca399
add wikiann
yongzx Jun 22, 2022
e232c7b
Merge pull request #15 from bigscience-workshop/eval_wikiann
yongzx Jun 22, 2022
088eb5a
scripts for wikiann
yongzx Jun 22, 2022
c775e96
Merge pull request #16 from bigscience-workshop/eval_wikiann
yongzx Jun 22, 2022
790c561
add do_train
yongzx Jun 22, 2022
1f7119b
add comment about tokenizer training
yongzx Jun 23, 2022
2be6fb5
use register hook to update subelements
yongzx Jun 23, 2022
5412d2e
don't use mask to zero out grad
Jun 23, 2022
2d9d1de
Merge pull request #20 from haileyschoelkopf/extend-vocab-hailey
yongzx Jun 24, 2022
1bbc8d5
added argument for enabling finetuning mode
Jun 25, 2022
7042e62
added logic to freeze all param except bias if bitfit is set to true
Jun 25, 2022
2ab63ff
switch order
Jun 25, 2022
a7f7df5
added script
Jun 26, 2022
729bd67
minor changes in script
Jun 26, 2022
a63ccb6
hardcoded DATA_SAMPLES
Jun 26, 2022
2fe1d40
changed lline from freezing base_model to transformer
Jun 26, 2022
41f0f02
overlap-replace support for BLOOM
yongzx Jun 26, 2022
b8e1985
Merge branch 'extend-vocab'
yongzx Jun 26, 2022
da5bdd3
added script to calculate bias changes, wip
Jun 27, 2022
88ee6cb
Merge branch 'master' into bitfit
Jun 27, 2022
84d0880
support for WikiANN
yongzx Jun 28, 2022
de9ae89
replace seed magic number with args.seed
yongzx Jun 28, 2022
9524065
fix unintended bugs arising from cached tokenized data
yongzx Jun 28, 2022
c2722a7
add last-layer finetuning for tasks
yongzx Jun 28, 2022
9224957
Merge pull request #32 from bigscience-workshop/eval_last_layer
yongzx Jun 28, 2022
c9b7773
remove assert False
yongzx Jun 28, 2022
662700f
Merge pull request #33 from bigscience-workshop/eval_last_layer
yongzx Jun 28, 2022
e1079c1
supprt BERT training
yongzx Jun 28, 2022
7c2b034
moved bitfit scripts
Jun 28, 2022
625a43b
support LoRA
yongzx Jun 29, 2022
6d7a59b
config changes
lintangsutawika Jun 29, 2022
1fb6504
refactor repo
yongzx Jun 29, 2022
8c811bb
WIP
yongzx Jun 30, 2022
a8486d4
update eval/scripts_* directory
yongzx Jul 1, 2022
127adf1
Update README.md
yongzx Jul 1, 2022
f3223e4
Update README.md
yongzx Jul 1, 2022
153f7da
Delete calculate_bias_changes.py
Jul 1, 2022
7c48c20
removed finetune_strategies in favor of lang_adapt_strategis
Jul 1, 2022
820cbc5
Merge branch 'bitfit' of https://github.com/bigscience-workshop/multi…
Jul 1, 2022
cbe45bd
change
Jul 2, 2022
1c49adf
Update README.md
Jul 2, 2022
e955acf
Update README.md
Jul 2, 2022
2484b22
fixed logic
Jul 2, 2022
81ace49
jz
yongzx Jul 5, 2022
0be97d4
Merge branch 'master' of https://github.com/bigscience-workshop/multi…
yongzx Jul 5, 2022
3edfc06
Merge pull request #41 from bigscience-workshop/jz
yongzx Jul 5, 2022
d3feb31
update README
yongzx Jul 5, 2022
2c89ef5
Merge branch 'master' of https://github.com/bigscience-workshop/multi…
yongzx Jul 5, 2022
1731027
update
yongzx Jul 5, 2022
4c9d075
Merge pull request #22 from bigscience-workshop/bitfit
yongzx Jul 5, 2022
3fd568a
uncomment pip install
yongzx Jul 5, 2022
d1faa64
Merge branch 'master' of https://github.com/bigscience-workshop/multi…
yongzx Jul 5, 2022
2f806eb
update tokenizer training
yongzx Jul 5, 2022
2e94a9e
update bigs_model
yongzx Jul 5, 2022
d5209a7
update tokenizer_dir
yongzx Jul 5, 2022
5fac29a
load_best using eval metrics
yongzx Jul 5, 2022
7e0feca
load best model
yongzx Jul 5, 2022
d4a887e
missing output_dir
yongzx Jul 6, 2022
585cb5d
scripts for wikiann
yongzx Jul 7, 2022
9cca893
tok_strategy adds overlap-replace
yongzx Jul 7, 2022
031a14a
remove outdated code and support continual pretraining
yongzx Jul 7, 2022
b0a23c5
update xlsum
yongzx Jul 7, 2022
c7e1e6f
fix tokenization
yongzx Jul 7, 2022
16 changes: 0 additions & 16 deletions README.md
@@ -1,16 +0,0 @@
### Previous Experiments
- `exp-001`: train gpt-2's tokenizer and finetune gpt-2's embedding layers `wte` and `wpe` on HF's OSCAR `unshuffled_deduplicated_fr` and `unshuffled_dudplicated_kr`.
- `exp-002`: evaluate gpt-2 on FLUE's tasks (CLS, XNLI, PAWS)
- `exp-003`: TODO: evaluate on multiatis
- `exp-004`: Does the embedding layer learn anything useful? Take a dataset in English for PAWS-X, finetune GPT-2 on this dataset, evaluate it on English test set T_e. Then, take the same test-set T_e translated in French (T_f), take GPT-2 parameters fine-tuned for the task X, replace English embeddings with French embeddings and evaluate thus obtained model on French test set.

# Experiment folders below after Conversation with Vassilina, Hady, Iz, and Maruf [Link](https://huggingface.slack.com/archives/C020G6A9KHQ/p1637023149074800)
- `exp-005`: cleaned from `exp-001` for finetuning GPT-2 embedding layers for DE and KO on Oscar.
- `exp-006`: run zero-shot and finetuned evaluation setting for XNLI ✅, PAWS ❌, and XQuAD ❌. (❌ means not done. ✅ means done.)
- `exp-007`: apply MAD-X adapter method. [Paper link](https://arxiv.org/abs/2005.00052)
- `exp-008`: from exp-006, but using mBERT on the zero-shot and finetuning setting.


# Carbon Tracking
Do not forget to log your experiments [in this spreadsheet](https://docs.google.com/spreadsheets/d/1Mk8mYCOF_WxMv-Uv5ThkFs5Ak5B9s9EnRUh1CpykEJ0/edit#gid=0)

64 changes: 64 additions & 0 deletions jz/README.md
@@ -0,0 +1,64 @@
# Run on JZ

## Getting Started
Clone the GitHub repository and `cd` into it, then run commands such as `sbatch jz/emb.sh my 100000 24000 extend`.

```
git clone https://github.com/bigscience-workshop/multilingual-modeling.git
cd multilingual-modeling/
```

## Change Configuration
### SLURM Configuration
We need to adjust the SLURM settings below to match JZ and request the necessary compute.
```
# use a single V100 for each run
#SBATCH --partition=gpu-he --gres=gpu:1

# output/error files for tracking pip installation
#SBATCH -o /users/zyong2/data/zyong2/bigscience/logs/misc/lang-adapt-env_jz_lang_adapter.out
#SBATCH -e /users/zyong2/data/zyong2/bigscience/logs/misc/lang-adapt-env_jz_lang_adapter.err
```

### Directory configuration (Line 22 - 28 in jz/emb.sh)
We also need to change the six directory variables in this configuration.
```
# virtual environment folder for `python3 -m venv $env_dir`
env_dir="/users/zyong2/data/zyong2/bigscience/gh/multilingual-modeling/jz/env_jz_lang_adapter"

# cache directory for HuggingFace datasets
cache_dir="/users/zyong2/data/zyong2/huggingface"

# cloned GitHub directory
mm_dir="/users/zyong2/data/zyong2/bigscience/gh/multilingual-modeling"

# directory to save adapted models and trained tokenizers
output_dir="/users/zyong2/data/zyong2/bigscience/data/processed/misc/"

# folder for storing error and output logging text files
logging_txt_dir="/users/zyong2/data/zyong2/bigscience/logs/misc"

# folder for storing all tensorboard logging
logging_tb_dir="/users/zyong2/data/zyong2/bigscience/reports/misc/"
```

## Runs
### 07/05/2022 (Language Adaptation - Embedding-only)
Run the following commands to perform language adaptation for four languages, varying the number of training samples (the positional arguments are summarized after this block).
```
sbatch jz/emb.sh my 100000 24000 extend
sbatch jz/emb.sh my 10000 5000 extend
sbatch jz/emb.sh my 1000 5000 extend

sbatch jz/emb.sh si 100000 24000 extend
sbatch jz/emb.sh si 10000 5000 extend
sbatch jz/emb.sh si 1000 5000 extend

sbatch jz/emb.sh az 100000 24000 extend
sbatch jz/emb.sh az 10000 5000 extend
sbatch jz/emb.sh az 1000 5000 extend

sbatch jz/emb.sh de 100000 24000 extend
sbatch jz/emb.sh de 10000 5000 extend
sbatch jz/emb.sh de 1000 5000 extend
```
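
For reference, the four positional arguments map to the variables defined at the top of `jz/emb.sh`; the annotated call below simply restates the script's own comments:
```
# sbatch jz/emb.sh <lang> <sample_size> <vocab_size> <tok_strategy>
#   lang          OSCAR language code (my, si, az, de in the runs above)
#   sample_size   number of training samples (also used as --max_train_samples)
#   vocab_size    vocabulary size of the newly trained tokenizer
#   tok_strategy  one of: extend, replace, overlap-replace
```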
99 changes: 99 additions & 0 deletions jz/emb.sh
@@ -0,0 +1,99 @@
#!/bin/bash

# Request just under three days of runtime:
#SBATCH --time=2-23:59:00

# Ask for the GPU partition and 1 GPU
#SBATCH --partition=gpu-he --gres=gpu:1

# Default resources are 1 core with 2.8GB of memory.
#SBATCH --ntasks=8

# Use more memory (200GB of CPU RAM):
#SBATCH --mem=200g

# Specify a job name:
#SBATCH -J lang-adapt-env_jz_lang_adapter

# Specify an output file
#SBATCH -o /users/zyong2/data/zyong2/bigscience/logs/misc/lang-adapt-env_jz_lang_adapter.out
#SBATCH -e /users/zyong2/data/zyong2/bigscience/logs/misc/lang-adapt-env_jz_lang_adapter.err

env_dir="/users/zyong2/data/zyong2/bigscience/gh/multilingual-modeling/jz/env_jz_lang_adapter"
cache_dir="/users/zyong2/data/zyong2/huggingface"
mm_dir="/users/zyong2/data/zyong2/bigscience/gh/multilingual-modeling"

output_dir="/users/zyong2/data/zyong2/bigscience/data/processed/misc/" # adapted model and trained tokenizer directory
logging_txt_dir="/users/zyong2/data/zyong2/bigscience/logs/misc" # error and output logging
logging_tb_dir="/users/zyong2/data/zyong2/bigscience/reports/misc/" # tensorboard logging

mkdir -p $output_dir
mkdir -p $logging_tb_dir
mkdir -p $logging_txt_dir

lang=$1 # language
sample_size=$2 # training sample size
vocab_size=$3 # vocab size of tokenizer
tok_strategy=$4 # extend, replace, overlap-replace
bigs_model="bigscience/bloom-1b3"
adpt_strategy="emb"

tokenizer_dir="${output_dir}/tok_$(basename $bigs_model)_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_${tok_strategy}"
logging_tb_dir="${logging_tb_dir}/$(basename $bigs_model)_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_tok-${tok_strategy}_adpt-${adpt_strategy}"

# setup environment
module load python/3.7.4
[ -d $env_dir ] || python3 -m venv $env_dir
source "${env_dir}/bin/activate"
pip3 install --upgrade pip
pip3 install -r "${mm_dir}/requirements.txt"

# train tokenizer
python "${mm_dir}/scripts/lang_adapt/tokenized4clm_sampled.py" \
--lang $lang \
--model $bigs_model \
--tokenizer_dir $tokenizer_dir \
--hf_cache_dir $cache_dir \
--vocab_size $vocab_size \
--sample_size $sample_size \
--use_auth_token \
--tok_strategy $tok_strategy \
> "${logging_txt_dir}/tok_$(basename $bigs_model)_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_${tok_strategy}.txt" \
2> "${logging_txt_dir}/tok_$(basename $bigs_model)_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_${tok_strategy}.err"


# finetune language model for language adaptation
python "${mm_dir}/scripts/lang_adapt/madx_run_clm.py" \
--seed 0 \
--fp16 \
--model_name_or_path $bigs_model \
--tokenizer_name $tokenizer_dir \
--dataset_name oscar \
--cache_dir $cache_dir \
--dataset_config_name "unshuffled_deduplicated_${lang}" \
--logging_dir $logging_tb_dir \
--report_to "tensorboard" \
--learning_rate 0.001 \
--do_train \
--do_eval \
--output_dir $output_dir \
--preprocessing_num_workers 8 \
--overwrite_output_dir \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 4 \
--per_device_eval_batch_size 2 \
--eval_accumulation_steps 4 \
--eval_steps 1000 \
--evaluation_strategy "steps" \
--max_eval_samples 5000 \
--save_steps 5000 \
--save_strategy "steps" \
--max_train_samples $sample_size \
--max_steps 50000 \
--logging_steps 1000 \
--lang_adapt_strategies $adpt_strategy \
--embedding_strategies $tok_strategy \
--load_best_model_at_end \
--use_auth_token \
> "${logging_txt_dir}/$(basename $bigs_model)_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_tok-${tok_strategy}_adpt-${adpt_strategy}.txt" \
2> "${logging_txt_dir}/$(basename $bigs_model)_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_tok-${tok_strategy}_adpt-${adpt_strategy}.err"
5 changes: 5 additions & 0 deletions requirements.txt
@@ -0,0 +1,5 @@
git+https://github.com/yongzx/adapter-transformers.git@f55ab013599088a35c87a880ba13a6d912e27ef4
--extra-index-url https://download.pytorch.org/whl/cu113
torch
datasets
tensorboardX
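
These are the dependencies `jz/emb.sh` installs into its virtualenv (an adapter-transformers fork plus PyTorch from the CUDA 11.3 wheel index). To set up the same environment by hand, the following mirrors the steps in the script; the venv name is a placeholder:
```
python3 -m venv env_lang_adapter   # placeholder venv path
source env_lang_adapter/bin/activate
pip3 install --upgrade pip
pip3 install -r requirements.txt
```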
6 changes: 6 additions & 0 deletions scripts/README.md
@@ -0,0 +1,6 @@
### README

This folder contains everything we need for running BigScience language adaptation experiments.

Google Doc: [BigScience - Extending BLOOM to New Languages](https://docs.google.com/document/d/1OEJq2max5kLPF4mnnb9nyoodqR_z_UVQlw4tVx9TvTc/edit#heading=h.kk1966kbedef)
