diff --git a/README.md b/README.md
index 11f32dd..e69de29 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +0,0 @@
-### Previous Experiments
-- `exp-001`: train gpt-2's tokenizer and finetune gpt-2's embedding layers `wte` and `wpe` on HF's OSCAR `unshuffled_deduplicated_fr` and `unshuffled_deduplicated_kr`.
-- `exp-002`: evaluate gpt-2 on FLUE's tasks (CLS, XNLI, PAWS)
-- `exp-003`: TODO: evaluate on MultiATIS
-- `exp-004`: Does the embedding layer learn anything useful? Take the English PAWS-X dataset, finetune GPT-2 on it, and evaluate on the English test set T_e. Then take the same test set translated into French (T_f), take the GPT-2 parameters fine-tuned for the task, replace the English embeddings with the French embeddings, and evaluate the resulting model on the French test set.
-
-# Experiment folders below, after conversation with Vassilina, Hady, Iz, and Maruf [Link](https://huggingface.slack.com/archives/C020G6A9KHQ/p1637023149074800)
-- `exp-005`: cleaned up from `exp-001` for finetuning GPT-2 embedding layers for DE and KO on OSCAR.
-- `exp-006`: run the zero-shot and finetuned evaluation settings for XNLI ✅, PAWS ❌, and XQuAD ❌. (❌ means not done. ✅ means done.)
-- `exp-007`: apply the MAD-X adapter method. [Paper link](https://arxiv.org/abs/2005.00052)
-- `exp-008`: same as exp-006, but using mBERT in the zero-shot and finetuning settings.
-
-
-# Carbon Tracking
-Do not forget to log your experiments [in this spreadsheet](https://docs.google.com/spreadsheets/d/1Mk8mYCOF_WxMv-Uv5ThkFs5Ak5B9s9EnRUh1CpykEJ0/edit#gid=0)
-
diff --git a/jz/README.md b/jz/README.md
new file mode 100644
index 0000000..684755c
--- /dev/null
+++ b/jz/README.md
@@ -0,0 +1,64 @@
+# Run on JZ
+
+## Getting Started
+Clone the GitHub repository and `cd` into it to run commands like `sbatch jz/emb.sh my 100000 24000 extend`.
+
+```
+git clone https://github.com/bigscience-workshop/multilingual-modeling.git
+cd multilingual-modeling/
+```
+
+## Change Configuration
+### SLURM Configuration
+We need to change the SLURM settings for JZ to request the necessary compute.
+```
+# use a single V100 for each run
+#SBATCH --partition=gpu-he --gres=gpu:1
+
+# output/error files for tracking pip installation
+#SBATCH -o /users/zyong2/data/zyong2/bigscience/logs/misc/lang-adapt-env_jz_lang_adapter.out
+#SBATCH -e /users/zyong2/data/zyong2/bigscience/logs/misc/lang-adapt-env_jz_lang_adapter.err
+```
+
+### Directory Configuration (Lines 22-28 in jz/emb.sh)
+We also need to change the six lines of directory configuration below.
+```
+# virtual environment folder for `python3 -m venv $env_dir`
+env_dir="/users/zyong2/data/zyong2/bigscience/gh/multilingual-modeling/jz/env_jz_lang_adapter"
+
+# cache directory for HuggingFace datasets
+cache_dir="/users/zyong2/data/zyong2/huggingface"
+
+# cloned GitHub directory
+mm_dir="/users/zyong2/data/zyong2/bigscience/gh/multilingual-modeling"
+
+# directory to save adapted models and trained tokenizers
+output_dir="/users/zyong2/data/zyong2/bigscience/data/processed/misc/"
+
+# folder for storing error and output logging text files
+logging_txt_dir="/users/zyong2/data/zyong2/bigscience/logs/misc"
+
+# folder for storing all tensorboard logging
+logging_tb_dir="/users/zyong2/data/zyong2/bigscience/reports/misc/"
+```
+
+## Runs
+### 07/05/2022 (Language Adaptation - Embedding-only)
+Run the following commands to perform language adaptation for four languages, varying the number of training samples.
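+Each command passes four positional arguments to `jz/emb.sh` (as read from the script further below): the OSCAR language code, the number of training samples, the tokenizer vocabulary size, and the tokenizer strategy (`extend`, `replace`, or `overlap-replace`). The script then trains a tokenizer with `tokenized4clm_sampled.py` and adapts `bigscience/bloom-1b3` on OSCAR `unshuffled_deduplicated_<lang>` using the embedding-only (`emb`) strategy. A minimal annotated sketch of the first invocation:
+
+```
+# arguments: <lang> <sample_size> <vocab_size> <tok_strategy>
+sbatch jz/emb.sh my 100000 24000 extend  # 100k OSCAR samples, 24k-token vocab, "extend" tokenizer strategy
+```
+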
+``` +sbatch jz/emb.sh my 100000 24000 extend +sbatch jz/emb.sh my 10000 5000 extend +sbatch jz/emb.sh my 1000 5000 extend + +sbatch jz/emb.sh si 100000 24000 extend +sbatch jz/emb.sh si 10000 5000 extend +sbatch jz/emb.sh si 1000 5000 extend + +sbatch jz/emb.sh az 100000 24000 extend +sbatch jz/emb.sh az 10000 5000 extend +sbatch jz/emb.sh az 1000 5000 extend + +sbatch jz/emb.sh de 100000 24000 extend +sbatch jz/emb.sh de 10000 5000 extend +sbatch jz/emb.sh de 1000 5000 extend +``` \ No newline at end of file diff --git a/jz/emb.sh b/jz/emb.sh new file mode 100644 index 0000000..06a34d3 --- /dev/null +++ b/jz/emb.sh @@ -0,0 +1,99 @@ +#!/bin/bash + +# Request half an hour of runtime: +#SBATCH --time=2-23:59:00 + +# Ask for the GPU partition and 1 GPU +#SBATCH --partition=gpu-he --gres=gpu:1 + +# Default resources are 1 core with 2.8GB of memory. +#SBATCH --ntasks=8 + +# Use more memory (10GB) (CPU RAM): +#SBATCH --mem=200g + +# Specify a job name: +#SBATCH -J lang-adapt-env_jz_lang_adapter + +# Specify an output file +#SBATCH -o /users/zyong2/data/zyong2/bigscience/logs/misc/lang-adapt-env_jz_lang_adapter.out +#SBATCH -e /users/zyong2/data/zyong2/bigscience/logs/misc/lang-adapt-env_jz_lang_adapter.err + +env_dir="/users/zyong2/data/zyong2/bigscience/gh/multilingual-modeling/jz/env_jz_lang_adapter" +cache_dir="/users/zyong2/data/zyong2/huggingface" +mm_dir="/users/zyong2/data/zyong2/bigscience/gh/multilingual-modeling" + +output_dir="/users/zyong2/data/zyong2/bigscience/data/processed/misc/" # adapted model and trained tokenizer directory +logging_txt_dir="/users/zyong2/data/zyong2/bigscience/logs/misc" # error and output logging +logging_tb_dir="/users/zyong2/data/zyong2/bigscience/reports/misc/" # tensorboard logging + +mkdir -p $output_dir +mkdir -p $logging_tb_dir +mkdir -p $logging_txt_dir + +lang=$1 # language +sample_size=$2 # training sample size +vocab_size=$3 # vocab size of tokenizer +tok_strategy=$4 # extend, replace, overlap-replace +bigs_model="bigscience/bloom-1b3" +adpt_strategy="emb" + +tokenizer_dir="${output_dir}/tok_$(basename $bigs_model)_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_${tok_strategy}" +logging_tb_dir="${logging_tb_dir}/$(basename $bigs_model)_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_tok-${tok_strategy}_adpt-${adpt_strategy}" + +# setup environment +module load python/3.7.4 +[ -d $env_dir ] || python3 -m venv $env_dir +source "${env_dir}/bin/activate" +pip3 install --upgrade pip +pip3 install -r "${mm_dir}/requirements.txt" + +# train tokenizer +python "${mm_dir}/scripts/lang_adapt/tokenized4clm_sampled.py" \ +--lang $lang \ +--model $bigs_model \ +--tokenizer_dir $tokenizer_dir \ +--hf_cache_dir $cache_dir \ +--vocab_size $vocab_size \ +--sample_size $sample_size \ +--use_auth_token \ +--tok_strategy $tok_strategy \ +> "${logging_txt_dir}/tok_$(basename $bigs_model)_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_${tok_strategy}.txt" \ +2> "${logging_txt_dir}/tok_$(basename $bigs_model)_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_${tok_strategy}.err" + + +# finetune language model for langauge adaptation +python "${mm_dir}/scripts/lang_adapt/madx_run_clm.py" \ + --seed 0 \ + --fp16 \ + --model_name_or_path $bigs_model \ + --tokenizer_name $tokenizer_dir \ + --dataset_name oscar \ + --cache_dir $cache_dir \ + --dataset_config_name "unshuffled_deduplicated_${lang}" \ + --logging_dir $logging_tb_dir \ + --report_to "tensorboard" \ + --learning_rate 0.001 \ + --do_train \ + --do_eval \ + --output_dir $output_dir \ + 
--preprocessing_num_workers 8 \ + --overwrite_output_dir \ + --per_device_train_batch_size 2 \ + --gradient_accumulation_steps 4 \ + --per_device_eval_batch_size 2 \ + --eval_accumulation_steps 4 \ + --eval_steps 1000 \ + --evaluation_strategy "steps" \ + --max_eval_samples 5000 \ + --save_steps 5000 \ + --save_strategy "steps" \ + --max_train_samples $sample_size \ + --max_steps 50000 \ + --logging_steps 1000 \ + --lang_adapt_strategies $adpt_strategy \ + --embedding_strategies $tok_strategy \ + --load_best_model_at_end \ + --use_auth_token \ + > "${logging_txt_dir}/$(basename $bigs_model)_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_tok-${tok_strategy}_adpt-${adpt_strategy}.txt" \ + 2> "${logging_txt_dir}/$(basename $bigs_model)_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_tok-${tok_strategy}_adpt-${adpt_strategy}.err" diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..f3f0d9b --- /dev/null +++ b/requirements.txt @@ -0,0 +1,5 @@ +git+https://github.com/yongzx/adapter-transformers.git@f55ab013599088a35c87a880ba13a6d912e27ef4 +--extra-index-url https://download.pytorch.org/whl/cu113 +torch +datasets +tensorboardX \ No newline at end of file diff --git a/scripts/README.md b/scripts/README.md new file mode 100644 index 0000000..2955ab5 --- /dev/null +++ b/scripts/README.md @@ -0,0 +1,6 @@ +### README + +This folder contains everything we need for running BigScience language adaptation experiments. + +Google Doc: [BigScience - Extending BLOOM to New Languages](https://docs.google.com/document/d/1OEJq2max5kLPF4mnnb9nyoodqR_z_UVQlw4tVx9TvTc/edit#heading=h.kk1966kbedef) + diff --git a/scripts/eval_xnli/adapters_xnli_de.py b/scripts/archive/eval/adapters_xnli_de.py similarity index 59% rename from scripts/eval_xnli/adapters_xnli_de.py rename to scripts/archive/eval/adapters_xnli_de.py index 46140aa..3e29ddd 100644 --- a/scripts/eval_xnli/adapters_xnli_de.py +++ b/scripts/archive/eval/adapters_xnli_de.py @@ -27,18 +27,18 @@ parser.add_argument("--learning_rate", type=float, default=1e-5) parser.add_argument("--per_device_train_batch_size", type=int, default=4) parser.add_argument("--gradient_accumulation_steps", type=int, default=4) -parser.add_argument("--pretrained_model") +parser.add_argument("--adapted_model") parser.add_argument("--original_model") parser.add_argument("--tokenizer") parser.add_argument("--do_train", default=False, action="store_true") parser.add_argument("--do_eval_after_train", default=False, action="store_true") parser.add_argument("--do_predict", default=False, action="store_true") parser.add_argument("--use_partial_data", default=False, action="store_true") -parser.add_argument("--zero_shot", default=False, action="store_true") +parser.add_argument("--cross_lingual", default=False, action="store_true") finetune_strategies = ["whole", "lang_adapters", "task_adapters"] parser.add_argument("--madx_lang_adapter") -parser.add_argument("--adapter_lang_name", required=True) +#parser.add_argument("--adapter_lang_name", required=True) -- why is this required?? 
parser.add_argument("--finetune_strategies", choices=finetune_strategies, required=True) args = parser.parse_args() @@ -46,21 +46,20 @@ args.do_predict = True if args.original_model is None: - # here: because the wpe is not saved, pretrained_model is the original bigsciece model - args.original_model = args.pretrained_model + # here: because the wpe is not saved, adapted_model is the original bigsciece model + args.original_model = args.adapted_model print("Arguments: ========") print(args) # load dataset -if args.zero_shot: +if args.cross_lingual: print("0️⃣ 0-Shot") # 0-shot: use english as train and validation xnli_en_dataset = load_dataset("xnli", "en", cache_dir=args.cache_dir) xnli_dataset = load_dataset("xnli", args.lang, cache_dir=args.cache_dir) assert args.lang != "en" - train_dataset = xnli_en_dataset['train'] val_dataset = xnli_en_dataset['validation'] test_dataset = xnli_dataset['test'] @@ -76,7 +75,7 @@ # load tokenizer tokenizer = AutoTokenizer.from_pretrained(args.tokenizer, cache_dir=args.cache_dir) tokenizer.pad_token = tokenizer.eos_token # tokenizer.encode(tokenizer.eos_token) = [0] -if args.zero_shot: +if args.cross_lingual: en_tokenizer = AutoTokenizer.from_pretrained(args.original_model, cache_dir=args.cache_dir) # has to use AutoTokenizer instead of GPT2Tokenizer en_tokenizer.pad_token = en_tokenizer.eos_token @@ -88,21 +87,23 @@ def en_tokenize_function(examples): logger.info("Tokenizing the dataset...") -if args.zero_shot: - full_train_dataset = train_dataset.map(en_tokenize_function, batched=False) - full_val_dataset = val_dataset.map(en_tokenize_function, batched=False) -else: - full_train_dataset = train_dataset.map(tokenize_function, batched=False) - full_val_dataset = val_dataset.map(tokenize_function, batched=False) +if args.do_train: + if args.cross_lingual: + full_train_dataset = train_dataset.map(en_tokenize_function, batched=False) + full_val_dataset = val_dataset.map(en_tokenize_function, batched=False) + else: + full_train_dataset = train_dataset.map(tokenize_function, batched=False) + full_val_dataset = val_dataset.map(tokenize_function, batched=False) + + + small_train_dataset = full_train_dataset.shuffle(seed=42).select(range(100)) + small_val_dataset = full_val_dataset.shuffle(seed=42).select(range(100)) + logger.info(full_train_dataset[0]) + logger.info(full_train_dataset[100]) full_test_dataset = test_dataset.map(tokenize_function, batched=False) -small_train_dataset = full_train_dataset.shuffle(seed=42).select(range(100)) -small_val_dataset = full_val_dataset.shuffle(seed=42).select(range(100)) small_test_dataset = full_test_dataset.shuffle(seed=42).select(range(100)) -logger.info(full_train_dataset[0]) -logger.info(full_train_dataset[100]) - from datasets import load_metric metric = load_metric("xnli") @@ -132,51 +133,40 @@ def compute_metrics(eval_pred): ) def load_model(args, inference=False): - # FIXME: if we load with GPT2ForSequenceClassification, the embeddings are the original one # even when we call load_adapter - if args.zero_shot and not inference: - model = GPT2ForSequenceClassification.from_pretrained(args.pretrained_model, - num_labels=3, - pad_token_id=en_tokenizer.pad_token_id, - cache_dir=args.cache_dir) - else: - model = GPT2ForSequenceClassification.from_pretrained(args.pretrained_model, - num_labels=3, - pad_token_id=tokenizer.pad_token_id, - cache_dir=args.cache_dir) - - if not args.zero_shot or (args.zero_shot and inference): - # if not zero shot, that means that we need to replace the embedding layers during training - # 
we also need to replace embedding layers during inference - causal_lm_model = AutoModelForCausalLM.from_pretrained(args.original_model) + if not args.original_model == args.adapted_model and not args.cross_lingual: + wte = torch.load(f'{args.adapted_model}/embedding.pt') + wpe = torch.load(f'{args.adapted_model}/positional_embedding.pt') + + model = GPT2ForSequenceClassification.from_pretrained(args.original_model, + num_labels=3, + pad_token_id=en_tokenizer.pad_token_id, + cache_dir=args.cache_dir) - # change the embedding layer of the original big science model - # by loading the adapters (which has saved lm_head) + if inference or not args.cross_lingual: + # need to load embedding/adapters from the model adapted to the new language + causal_lm_model = AutoModelForCausalLM.from_pretrained(args.original_model) causal_lm_model.resize_token_embeddings(len(tokenizer)) + if not args.original_model == args.adapted_model: + causal_lm_model.transformer.wte = wte + causal_lm_model.transformer.wpe = wpe if args.madx_lang_adapter: - causal_lm_model.load_adapter(args.madx_lang_adapter, config="pfeiffer+inv") - - # model has original bigscience embedding so replace it. - model.resize_token_embeddings(len(tokenizer)) - model._modules['transformer']._modules['wte'] = causal_lm_model._modules['transformer']._modules['wte'] + adapter_name = causal_lm_model.load_adapter(args.madx_lang_adapter, config="pfeiffer+inv") + model.transformer = causal_lm_model.transformer + model.set_active_adapters(adapter_name) if not inference: - if not args.zero_shot: - if args.madx_lang_adapter: - adapter_name = model.load_adapter(args.madx_lang_adapter, - config="pfeiffer+inv", - load_as=args.adapter_lang_name) - if args.finetune_strategies == "whole": - model.set_active_adapters(adapter_name) - elif args.finetune_strategies == "lang_adapters": - model.train_adapter([args.adapter_lang_name]) - elif args.finetune_strategies == "task_adapters": - model.add_adapter("xnli-task-adapter") - model.train_adapter("xnli-task-adapter") - else: - raise ValueError("Lack configuration") - + #if not args.cross_lingual: normally need to add adapter in any case + # normally this is already done, why use adapter_lang_name here? 
+ #if args.madx_lang_adapter: + # adapter_name = model.load_adapter(args.madx_lang_adapter, + # config="pfeiffer+inv", + # load_as=args.adapter_lang_name) + model.add_adapter("xnli-task-adapter") + model.train_adapter("xnli-task-adapter") + + print("🔥 ==================== Training: ==================== 🔥") for name, param in model.named_parameters(): if not param.requires_grad: @@ -185,24 +175,19 @@ def load_model(args, inference=False): print(f"🚀 Trainable layer '{name}'") print(model) else: - print("🔥 ==================== Inference: ==================== 🔥") - if args.finetune_strategies == "lang_adapters": - assert args.pretrained_adapters_dir - adapter_name = model.load_adapter(f"{args.pretrained_adapters_dir}/{args.adapter_lang_name}") - model.set_active_adapters(adapter_name) - elif args.finetune_strategies == "task_adapters": - if args.madx_lang_adapter: - assert args.pretrained_adapters_dir - adapter_name = model.load_adapter(args.madx_lang_adapter) - model.set_active_adapters(adapter_name) - adapter_name = model.load_adapter(f"{args.pretrained_adapters_dir}/xnli-task-adapter") - model.set_active_adapters(adapter_name) - else: - # adapter_name = model.load_adapter("/users/zyong2/data/zyong2/bigscience/data/processed/013/xnli_de_de_100K_adpt_16_0shot/checkpoint-24544/xnli-task-adapter") - - # for TGT -> TGT supervised finetuning setting, change adapter_name - adapter_name = model.load_adapter("/users/zyong2/data/zyong2/bigscience/data/processed/exp-013/task_xnli_de_ft_100000_ori/checkpoint-24544/xnli-task-adapter") - model.set_active_adapters(adapter_name) + #if args.madx_lang_adapter: + assert args.pretrained_adapters_dir + # normally this is done in any case + #adapter_name = model.load_adapter(args.madx_lang_adapter) + #model.set_active_adapters(adapter_name) + adapter_name = model.load_adapter(f"{args.pretrained_adapters_dir}/xnli-task-adapter") + model.set_active_adapters(adapter_name) + #else: + # # adapter_name = model.load_adapter("/users/zyong2/data/zyong2/bigscience/data/processed/013/xnli_de_de_100K_adpt_16_0shot/checkpoint-24544/xnli-task-adapter") + # # not sure what happens here + # # for TGT -> TGT supervised finetuning setting, change adapter_name + # adapter_name = model.load_adapter("/users/zyong2/data/zyong2/bigscience/data/processed/exp-013/task_xnli_de_ft_100000_ori/checkpoint-24544/xnli-task-adapter") + # model.set_active_adapters(adapter_name) print(model) return model @@ -241,4 +226,4 @@ def load_model(args, inference=False): compute_metrics=compute_metrics ) - print("Evaluate on Test:", trainer.evaluate()) \ No newline at end of file + print("Evaluate on Test:", trainer.evaluate()) diff --git a/scripts/archive/eval/adapters_xnli_de_vn.py b/scripts/archive/eval/adapters_xnli_de_vn.py new file mode 100644 index 0000000..3e29ddd --- /dev/null +++ b/scripts/archive/eval/adapters_xnli_de_vn.py @@ -0,0 +1,229 @@ +import logging +import argparse +import os + +from datasets import load_dataset +from datasets import load_metric +from collections import namedtuple + +import torch +import numpy as np +from transformers import TrainingArguments, Trainer, AdapterTrainer +from transformers import AutoTokenizer, GPT2Tokenizer, GPT2ForSequenceClassification, AutoModelForCausalLM + +# setup logging +import sys +from loguru import logger +logger.remove() +logger.add(sys.stderr, format="{level} {level.icon} | [{time}] - {message}") + + +# parser +parser = argparse.ArgumentParser() +parser.add_argument("output_dir") +parser.add_argument("--lang", type=str, default="de") 
+parser.add_argument("--cache_dir") +parser.add_argument("--num_train_epochs", type=int, default=30) +parser.add_argument("--learning_rate", type=float, default=1e-5) +parser.add_argument("--per_device_train_batch_size", type=int, default=4) +parser.add_argument("--gradient_accumulation_steps", type=int, default=4) +parser.add_argument("--adapted_model") +parser.add_argument("--original_model") +parser.add_argument("--tokenizer") +parser.add_argument("--do_train", default=False, action="store_true") +parser.add_argument("--do_eval_after_train", default=False, action="store_true") +parser.add_argument("--do_predict", default=False, action="store_true") +parser.add_argument("--use_partial_data", default=False, action="store_true") +parser.add_argument("--cross_lingual", default=False, action="store_true") + +finetune_strategies = ["whole", "lang_adapters", "task_adapters"] +parser.add_argument("--madx_lang_adapter") +#parser.add_argument("--adapter_lang_name", required=True) -- why is this required?? +parser.add_argument("--finetune_strategies", choices=finetune_strategies, required=True) + +args = parser.parse_args() +if args.do_eval_after_train: + args.do_predict = True + +if args.original_model is None: + # here: because the wpe is not saved, adapted_model is the original bigsciece model + args.original_model = args.adapted_model + +print("Arguments: ========") +print(args) + + +# load dataset +if args.cross_lingual: + print("0️⃣ 0-Shot") + # 0-shot: use english as train and validation + xnli_en_dataset = load_dataset("xnli", "en", cache_dir=args.cache_dir) + xnli_dataset = load_dataset("xnli", args.lang, cache_dir=args.cache_dir) + assert args.lang != "en" + train_dataset = xnli_en_dataset['train'] + val_dataset = xnli_en_dataset['validation'] + test_dataset = xnli_dataset['test'] +else: + print("👀 Supervised Training") + xnli_dataset = load_dataset("xnli", args.lang, cache_dir=args.cache_dir) + + train_dataset = xnli_dataset['train'] + val_dataset = xnli_dataset['validation'] + test_dataset = xnli_dataset['test'] + + +# load tokenizer +tokenizer = AutoTokenizer.from_pretrained(args.tokenizer, cache_dir=args.cache_dir) +tokenizer.pad_token = tokenizer.eos_token # tokenizer.encode(tokenizer.eos_token) = [0] +if args.cross_lingual: + en_tokenizer = AutoTokenizer.from_pretrained(args.original_model, cache_dir=args.cache_dir) # has to use AutoTokenizer instead of GPT2Tokenizer + en_tokenizer.pad_token = en_tokenizer.eos_token + +def tokenize_function(examples): + return tokenizer(f'{examples["premise"]} {tokenizer.eos_token} {examples["hypothesis"]}', max_length=128, padding="max_length", truncation=True) + +def en_tokenize_function(examples): + return en_tokenizer(f'{examples["premise"]} {tokenizer.eos_token} {examples["hypothesis"]}', max_length=128, padding="max_length", truncation=True) + + +logger.info("Tokenizing the dataset...") +if args.do_train: + if args.cross_lingual: + full_train_dataset = train_dataset.map(en_tokenize_function, batched=False) + full_val_dataset = val_dataset.map(en_tokenize_function, batched=False) + else: + full_train_dataset = train_dataset.map(tokenize_function, batched=False) + full_val_dataset = val_dataset.map(tokenize_function, batched=False) + + + small_train_dataset = full_train_dataset.shuffle(seed=42).select(range(100)) + small_val_dataset = full_val_dataset.shuffle(seed=42).select(range(100)) + logger.info(full_train_dataset[0]) + logger.info(full_train_dataset[100]) + +full_test_dataset = test_dataset.map(tokenize_function, batched=False) 
+small_test_dataset = full_test_dataset.shuffle(seed=42).select(range(100)) + +from datasets import load_metric +metric = load_metric("xnli") + +def compute_metrics(eval_pred): + logits, labels = eval_pred + predictions = np.argmax(logits, axis=-1) + return metric.compute(predictions=predictions, references=labels) + + +training_args = TrainingArguments( + args.output_dir, + overwrite_output_dir=True, + do_train=True, + do_eval=True, + eval_steps=500 if not args.use_partial_data else 10, + num_train_epochs=args.num_train_epochs, + per_device_train_batch_size=args.per_device_train_batch_size, + gradient_accumulation_steps=args.gradient_accumulation_steps, + learning_rate=args.learning_rate, + evaluation_strategy="epoch", + save_strategy="epoch", + logging_strategy="epoch", + logging_steps=500, + report_to="tensorboard", + logging_dir=f"{args.output_dir}/logs", + load_best_model_at_end=True, +) + +def load_model(args, inference=False): + # FIXME: if we load with GPT2ForSequenceClassification, the embeddings are the original one + # even when we call load_adapter + if not args.original_model == args.adapted_model and not args.cross_lingual: + wte = torch.load(f'{args.adapted_model}/embedding.pt') + wpe = torch.load(f'{args.adapted_model}/positional_embedding.pt') + + model = GPT2ForSequenceClassification.from_pretrained(args.original_model, + num_labels=3, + pad_token_id=en_tokenizer.pad_token_id, + cache_dir=args.cache_dir) + + if inference or not args.cross_lingual: + # need to load embedding/adapters from the model adapted to the new language + causal_lm_model = AutoModelForCausalLM.from_pretrained(args.original_model) + causal_lm_model.resize_token_embeddings(len(tokenizer)) + if not args.original_model == args.adapted_model: + causal_lm_model.transformer.wte = wte + causal_lm_model.transformer.wpe = wpe + if args.madx_lang_adapter: + adapter_name = causal_lm_model.load_adapter(args.madx_lang_adapter, config="pfeiffer+inv") + model.transformer = causal_lm_model.transformer + model.set_active_adapters(adapter_name) + + if not inference: + #if not args.cross_lingual: normally need to add adapter in any case + # normally this is already done, why use adapter_lang_name here? 
+ #if args.madx_lang_adapter: + # adapter_name = model.load_adapter(args.madx_lang_adapter, + # config="pfeiffer+inv", + # load_as=args.adapter_lang_name) + model.add_adapter("xnli-task-adapter") + model.train_adapter("xnli-task-adapter") + + + print("🔥 ==================== Training: ==================== 🔥") + for name, param in model.named_parameters(): + if not param.requires_grad: + print(f"🥶 Frozen layer '{name}'") + else: + print(f"🚀 Trainable layer '{name}'") + print(model) + else: + #if args.madx_lang_adapter: + assert args.pretrained_adapters_dir + # normally this is done in any case + #adapter_name = model.load_adapter(args.madx_lang_adapter) + #model.set_active_adapters(adapter_name) + adapter_name = model.load_adapter(f"{args.pretrained_adapters_dir}/xnli-task-adapter") + model.set_active_adapters(adapter_name) + #else: + # # adapter_name = model.load_adapter("/users/zyong2/data/zyong2/bigscience/data/processed/013/xnli_de_de_100K_adpt_16_0shot/checkpoint-24544/xnli-task-adapter") + # # not sure what happens here + # # for TGT -> TGT supervised finetuning setting, change adapter_name + # adapter_name = model.load_adapter("/users/zyong2/data/zyong2/bigscience/data/processed/exp-013/task_xnli_de_ft_100000_ori/checkpoint-24544/xnli-task-adapter") + # model.set_active_adapters(adapter_name) + print(model) + + return model + +if args.do_train: + logger.info("Start Training") + model = load_model(args) + trainer = AdapterTrainer( + model=model, + args=training_args, + train_dataset=small_train_dataset if args.use_partial_data else full_train_dataset, + eval_dataset=small_val_dataset if args.use_partial_data else full_val_dataset, + compute_metrics=compute_metrics + ) + + trainer.train() + +if args.do_predict: + if args.do_eval_after_train: + evaluation_dirs = list(sorted([ + checkpoint_dir + for checkpoint_dir in os.listdir(args.output_dir) + if checkpoint_dir.startswith('checkpoint-') + ], key=lambda x: int(x[len('checkpoint-'):]))) + if args.madx_lang_adapter: + args.pretrained_adapters_dir = f"{args.output_dir}/{evaluation_dirs[-1]}" + logger.info(f"[Evaluation] Loading trained model from {evaluation_dirs[-1]}") + + model = load_model(args, inference=True) + training_args.report_to = list() + + trainer = AdapterTrainer( + model=model, + args=training_args, + eval_dataset=small_test_dataset if args.use_partial_data else full_test_dataset, + compute_metrics=compute_metrics + ) + + print("Evaluate on Test:", trainer.evaluate()) diff --git a/scripts/archive/eval_xnli/adapters_eval.py b/scripts/archive/eval_xnli/adapters_eval.py new file mode 100644 index 0000000..5284539 --- /dev/null +++ b/scripts/archive/eval_xnli/adapters_eval.py @@ -0,0 +1,355 @@ +import argparse +import os +import sys +from loguru import logger + +from datasets import load_dataset +from datasets import load_metric + +import torch +import numpy as np +import nltk +from transformers import TrainingArguments, AdapterTrainer, Seq2SeqAdapterTrainer, Seq2SeqTrainingArguments +from transformers import AutoTokenizer, GPT2LMHeadModel, GPT2ForSequenceClassification, AutoModelForCausalLM +from transformers import DataCollatorForSeq2Seq + + +logger.remove() +logger.add(sys.stderr, format="{level} {level.icon} | [{time}] - {message}") + +# parser +parser = argparse.ArgumentParser() +parser.add_argument("output_dir") +parser.add_argument("--lang", type=str, default="german") #xlsum requires a language name, not language code + +tasks = ["xnli", "xlsum"] +parser.add_argument("--dataset", choices=tasks, required=True) + 
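+# the chosen dataset selects the model class, trainer class, and training-arguments class via the mappings defined below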
+parser.add_argument("--cache_dir") +parser.add_argument("--num_train_epochs", type=int, default=30) +parser.add_argument("--learning_rate", type=float, default=1e-5) +parser.add_argument("--per_device_train_batch_size", type=int, default=4) +parser.add_argument("--gradient_accumulation_steps", type=int, default=4) +parser.add_argument("--per_device_eval_batch_size", type=int, default=1) +parser.add_argument("--pretrained_model") +parser.add_argument("--original_model") +parser.add_argument("--tokenizer") +parser.add_argument("--do_train", default=False, action="store_true") +parser.add_argument("--do_eval_after_train", default=False, action="store_true") +parser.add_argument("--do_predict", default=False, action="store_true") +parser.add_argument("--use_partial_data", default=False, action="store_true") +parser.add_argument("--zero_shot", default=False, action="store_true") +parser.add_argument("--revision", type=str, default="main") +parser.add_argument("--local_rank", type=int) + +finetune_strategies = ["whole", "lang_adapters", "task_adapters"] +parser.add_argument("--madx_lang_adapter") +parser.add_argument("--adapter_lang_name", required=True) +parser.add_argument("--finetune_strategies", choices=finetune_strategies, required=True) + +parser.add_argument("--deepspeed", required=False) + +# mapping of tasks to model/trainer classes +model_class_mapping = {"xnli": GPT2ForSequenceClassification, "xlsum": GPT2LMHeadModel} +trainer_class_mapping = {"xnli": AdapterTrainer, "xlsum": Seq2SeqAdapterTrainer} +trainer_args_mapping = {"xnli": TrainingArguments, "xlsum": Seq2SeqTrainingArguments} + + +args = parser.parse_args() +if args.do_eval_after_train: + args.do_predict = True + +# additional args to pass to the model init. task-dependent +optional_model_kwargs = {} +optional_trainer_args = {} +if args.dataset == "xnli": + optional_model_kwargs = {"num_labels": 3} +elif args.dataset == "xlsum": + optional_trainer_args = {"generation_max_length": 128, "predict_with_generate":True} + + +if args.local_rank: + torch.cuda.set_device(args.local_rank) + +if args.original_model is None: + # here: because the wpe is not saved, pretrained_model is the original bigscience model + args.original_model = args.pretrained_model + +print("Arguments: ========") +print(args) + +# load appropriate dataset +logger.info("Loading dataset...") + +# will need to rename splits if the dataset has different name for validation set +if args.zero_shot: + print("0️⃣ Cross Lingual") + # cross lingual: use english as train and validation set + en_dataset = load_dataset(args.dataset, "english" if args.dataset == "xlsum" else "en", cache_dir=args.cache_dir) + dataset = load_dataset(args.dataset, args.lang, cache_dir=args.cache_dir) + + train_dataset = en_dataset["train"] + val_dataset = en_dataset["validation"] + test_dataset = dataset["test"] +else: + print("👀 Supervised training") + dataset = load_dataset(args.dataset, args.lang, cache_dir=args.cache_dir) + + train_dataset = dataset["train"] + val_dataset = dataset["validation"] + test_dataset = dataset["test"] + +logger.info("Loading tokenizer...") +# load tokenizer + +tokenizer = AutoTokenizer.from_pretrained(args.tokenizer, cache_dir=args.cache_dir, revision=args.revision) +tokenizer.pad_token = tokenizer.eos_token + +if args.dataset == "xnli": + def tokenize_function(examples): + return tokenizer(f'{examples["premise"]} {tokenizer.eos_token} {examples["hypothesis"]}', max_length=128, padding="max_length", truncation=True) + +elif args.dataset == "xlsum": + def 
tokenize_function(example): + inputs = tokenizer(f'summarize this article: {example["text"]}', max_length=96, padding="max_length", truncation=True) + + with tokenizer.as_target_tokenizer(): + summaries = tokenizer(f'{example["summary"]}', max_length=96, padding="max_length", truncation=True) + + inputs["labels"] = summaries["input_ids"] + + return inputs + +if args.zero_shot: + en_tokenizer = AutoTokenizer.from_pretrained(args.original_model, cache_dir=args.cache_dir, revision=args.revision) + en_tokenizer.pad_token = en_tokenizer.eos_token + + if args.dataset == "xnli": + def en_tokenize_function(examples): + return en_tokenizer(f'{examples["premise"]} {tokenizer.eos_token} {examples["hypothesis"]}', max_length=128, padding="max_length", truncation=True) + + elif args.dataset == "xlsum": + def en_tokenize_function(example): + inputs = en_tokenizer(f'summarize this article: {example["text"]}', max_length=96, padding="max_length", truncation=True) + + with en_tokenizer.as_target_tokenizer(): + summaries = en_tokenizer(f'{example["summary"]}', max_length=96, padding="max_length", truncation=True) + + inputs["labels"] = summaries["input_ids"] + + return inputs + + + +if args.zero_shot: + full_train_dataset = train_dataset.map(en_tokenize_function, batched=False) + full_val_dataset = val_dataset.map(en_tokenize_function, batched=False) +else: + full_train_dataset = train_dataset.map(tokenize_function, batched=False) + full_val_dataset = val_dataset.map(tokenize_function, batched=False) + +full_test_dataset = test_dataset.map(tokenize_function, batched=False) +small_train_dataset = full_train_dataset.shuffle(seed=42).select(range(100)) +small_val_dataset = full_val_dataset.shuffle(seed=42).select(range(100)) +small_test_dataset = full_test_dataset.shuffle(seed=42).select(range(100)) + +logger.info(full_train_dataset[0]) + + +# load metric +logger.info("Loading metric...") + +if args.dataset == "xnli": + metric = load_metric("xnli") + + def compute_metrics(eval_pred): + logits, labels = eval_pred + predictions = np.argmax(logits, axis=-1) + return metric.compute(predictions=predictions, references=labels) + +elif args.dataset == "xlsum": + metric = load_metric("rouge", cache_dir=args.cache_dir) + + def compute_metrics(eval_preds): + preds, labels = eval_preds + + preds = tokenizer.batch_decode(preds, skip_special_tokens=True) + labels = np.where(labels != -100, labels, tokenizer.pad_token_id) + labels = tokenizer.batch_decode(labels, skip_special_tokens=True) + + preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in preds] + labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in labels] + + result = metric.compute(predictions=preds, references=labels) + # TODO: need to confirm these are the right rouge values to report. Can report more ROUGE metrics if needed. 
+ result = {key: value.mid.fmeasure * 100 for key, value in result.items()} + + return {k: round(v, 4) for k, v in result.items()} + +else: + raise ValueError("Unknown dataset provided") + + +training_args = trainer_args_mapping[args.dataset]( + output_dir=args.output_dir, + overwrite_output_dir=True, + do_train=True, + do_eval=True, + eval_steps=500 if not args.use_partial_data else None, + num_train_epochs=args.num_train_epochs, + per_device_train_batch_size=args.per_device_train_batch_size, + per_device_eval_batch_size=args.per_device_eval_batch_size, + gradient_accumulation_steps=args.gradient_accumulation_steps, + learning_rate=args.learning_rate, + evaluation_strategy="epoch", + save_strategy="epoch", + logging_strategy="epoch", + logging_steps=500, + report_to="tensorboard", + logging_dir=f"{args.output_dir}/logs", + load_best_model_at_end=True, + deepspeed=args.deepspeed, + **optional_trainer_args, +) + +# TODO: double-check the adapter loading logic here +def load_model(args, inference=False): + + # Hack for loading wte module not needed here, since using a causal language model class + if args.zero_shot and not inference: + # only pass in num_labels if using a seq. classification model + model = model_class_mapping[args.dataset].from_pretrained(args.pretrained_model, + pad_token_id=en_tokenizer.pad_token_id, + cache_dir=args.cache_dir, + revision=args.revision, + **optional_model_kwargs) + else: + model = model_class_mapping[args.dataset].from_pretrained(args.pretrained_model, + pad_token_id=tokenizer.pad_token_id, + cache_dir=args.cache_dir, + revision=args.revision, + **optional_model_kwargs) + if not args.zero_shot or (args.zero_shot and inference): + # if not zero shot, that means that we need to replace the embedding layers during training + # we also need to replace embedding layers during inference + causal_lm_model = AutoModelForCausalLM.from_pretrained(args.original_model, revision=args.revision) + + # change the embedding layer of the original big science model + # by loading the adapters (which has saved lm_head) + causal_lm_model.resize_token_embeddings(len(tokenizer)) + if args.madx_lang_adapter: + causal_lm_model.load_adapter(args.madx_lang_adapter, config="pfeiffer+inv") + + # model has original bigscience embedding so replace it. 
+ model.resize_token_embeddings(len(tokenizer)) + model._modules['transformer']._modules['wte'] = causal_lm_model._modules['transformer']._modules['wte'] + + if not inference: + if not args.zero_shot: + if args.madx_lang_adapter: + adapter_name = model.load_adapter(args.madx_lang_adapter, + config="pfeiffer+inv", + load_as=args.adapter_lang_name) + if args.finetune_strategies == "whole": + model.set_active_adapters(adapter_name) + elif args.finetune_strategies == "lang_adapters": + model.train_adapter([args.adapter_lang_name]) + elif args.finetune_strategies == "task_adapters": + model.add_adapter(f"{args.dataset}-task-adapter") + model.train_adapter(f"{args.dataset}-task-adapter") + else: + raise ValueError("invalid configuration") + + print("🔥 ==================== Training: ==================== 🔥") + # for name, param in model.named_parameters(): + # if not param.requires_grad: + # print(f"🥶 Frozen layer '{name}'") + # else: + # print(f"🚀 Trainable layer '{name}'") + # print(model) + else: + print("🔥 ==================== Inference: ==================== 🔥") + if args.finetune_strategies == "lang_adapters": + assert args.pretrained_adapters_dir + adapter_name = model.load_adapter(f"{args.pretrained_adapters_dir}/{args.adapter_lang_name}") + model.set_active_adapters(adapter_name) + elif args.finetune_strategies == "task_adapters": + if args.madx_lang_adapter: + assert args.pretrained_adapters_dir + adapter_name = model.load_adapter(args.madx_lang_adapter) + model.set_active_adapters(adapter_name) + adapter_name = model.load_adapter(f"{args.pretrained_adapters_dir}/{args.dataset}-task-adapter") + model.set_active_adapters(adapter_name) + else: + adapter_name = model.load_adapter(f"{args.pretrained_adapters_dir}/{args.dataset}-task-adapter") #TODO: change the argument to this + model.set_active_adapters(adapter_name) + # print(model) + + + return model + + +if args.do_train: + logger.info("Starting training...") + model = load_model(args) + + + # only use seq2seq collator if doing seq2seq task + if args.dataset == "xlsum": + data_collator = DataCollatorForSeq2Seq( + tokenizer, + model=model, + label_pad_token_id=-100, + ) + + + trainer = trainer_class_mapping[args.dataset]( + model=model, + args=training_args, + train_dataset=small_train_dataset if args.use_partial_data else full_train_dataset, + eval_dataset=small_val_dataset if args.use_partial_data else full_val_dataset, + compute_metrics=compute_metrics, + # args for xlsum only + **{"data_collator": data_collator} if args.dataset == "xlsum" else {}, + ) + + trainer.train() + + + +if args.do_predict: + if args.do_eval_after_train: + evaluation_dirs = list(sorted([ + checkpoint_dir + for checkpoint_dir in os.listdir(args.output_dir) + if checkpoint_dir.startswith("checkpoint-")], + key=lambda x: int(x[len('checkpoint-'):]))) + assert len(evaluation_dirs) > 0 + logger.info(f"Found {len(evaluation_dirs)} checkpoints") + + # load the last checkpoint. 
+ args.pretrained_adapters_dir = f"{args.output_dir}/{evaluation_dirs[-1]}" + logger.info(f"[Evaluation] Loading trained model from {evaluation_dirs[-1]}") + + model = load_model(args, inference=True) + training_args.report_to = list() + + if args.dataset == "xlsum": + data_collator = DataCollatorForSeq2Seq( + tokenizer, + model=model, + label_pad_token_id=-100, + pad_to_multiple_of=8 if training_args.fp16 else None, + ) + + trainer = trainer_class_mapping[args.dataset]( + model=model, + args=training_args, + eval_dataset=small_test_dataset if args.use_partial_data else full_test_dataset, + compute_metrics=compute_metrics, + # args for xlsum only + **{"data_collator": data_collator} if args.dataset == "xlsum" else {} + + ) + + print("Evaluating on test set...", trainer.evaluate()) diff --git a/scripts/archive/eval_xnli/adapters_xlsum_de.py b/scripts/archive/eval_xnli/adapters_xlsum_de.py new file mode 100644 index 0000000..6d434e3 --- /dev/null +++ b/scripts/archive/eval_xnli/adapters_xlsum_de.py @@ -0,0 +1,265 @@ +import argparse +import os +import sys +from loguru import logger + +from datasets import load_dataset +from datasets import load_metric + +import torch +import numpy as np +from transformers import TrainingArguments, Trainer, AdapterTrainer +from transformers import AutoTokenizer, GPT2Tokenizer, GPT2LMHeadModel, AutoModelForCausalLM + + +logger.remove() +logger.add(sys.stderr, format="{level} {level.icon} | [{time}] - {message}") + +# parser +parser = argparse.ArgumentParser() +parser.add_argument("output_dir") +parser.add_argument("--lang", type=str, default="german") #xlsum requires a language name, not language code +parser.add_argument("--cache_dir") +parser.add_argument("--num_train_epochs", type=int, default=30) +parser.add_argument("--learning_rate", type=float, default=1e-5) +parser.add_argument("--per_device_train_batch_size", type=int, default=4) +parser.add_argument("--gradient_accumulation_steps", type=int, default=4) +parser.add_argument("--pretrained_model") +parser.add_argument("--original_model") +parser.add_argument("--tokenizer") +parser.add_argument("--do_train", default=False, action="store_true") +parser.add_argument("--do_eval_after_train", default=False, action="store_true") +parser.add_argument("--do_predict", default=False, action="store_true") +parser.add_argument("--use_partial_data", default=False, action="store_true") +parser.add_argument("--zero_shot", default=False, action="store_true") +parser.add_argument("--revision", type=str, default="main") +parser.add_argument("--local_rank", type=int, default=0) + +finetune_strategies = ["whole", "lang_adapters", "task_adapters"] +parser.add_argument("--madx_lang_adapter") +parser.add_argument("--adapter_lang_name", required=True) +parser.add_argument("--finetune_strategies", choices=finetune_strategies, required=True) + +parser.add_argument("--deepspeed", required=False) + +args = parser.parse_args() +if args.do_eval_after_train: + args.do_predict = True + +torch.cuda.set_device(args.local_rank) + +if args.original_model is None: + # here: because the wpe is not saved, pretrained_model is the original bigscience model + args.original_model = args.pretrained_model + +print("Arguments: ========") +print(args) + + +# load xlsum dataset +if args.zero_shot: + print("Cross Lingual") + en_dataset = load_dataset("xlsum", "english", cache_dir=args.cache_dir) + dataset = load_dataset("xlsum", args.lang, cache_dir=args.cache_dir) + + train_dataset = en_dataset["train"] + val_dataset = en_dataset["validation"] + 
test_dataset = dataset["test"] +else: + print("Supervised training") + dataset = load_dataset("xlsum", args.lang, cache_dir=args.cache_dir) + + train_dataset = dataset["train"] + val_dataset = dataset["validation"] + test_dataset = dataset["test"] + + +# load tokenizer + +# if args.revision is not None: +# print("revision: ", args.revision) +# tokenizer = AutoTokenizer.from_pretrained(args.tokenizer, cache_dir=args.cache_dir, revision=args.revision) + +tokenizer = AutoTokenizer.from_pretrained(args.tokenizer, cache_dir=args.cache_dir, revision=args.revision) + +tokenizer.pad_token = tokenizer.eos_token + +if args.zero_shot: + en_tokenizer = AutoTokenizer.from_pretrained(args.original_model, cache_dir=args.cache_dir, revision=args.revision) + + en_tokenizer.pad_token = en_tokenizer.eos_token + +def tokenize_function(example): + inputs = tokenizer(f'summarize this article: {example["text"]}', max_length=256, padding="max_length", truncation=True) + + with tokenizer.as_target_tokenizer(): + summaries = tokenizer(f'{example["summary"]}', max_length=256, padding="max_length", truncation=True) + + inputs["labels"] = summaries["input_ids"] + + return inputs + +if args.zero_shot: + def en_tokenize_function(example): + inputs = en_tokenizer(f'summarize this article: {example["text"]}', max_length=256, padding="max_length", truncation=True) + + with en_tokenizer.as_target_tokenizer(): + summaries = en_tokenizer(f'{example["summary"]}', max_length=256, padding="max_length", truncation=True) + + inputs["labels"] = summaries["input_ids"] + + return inputs + +logger.info("tokenizing dataset...") + +full_train_dataset = train_dataset.map(tokenize_function, batched=False) #TODO: unbatch this? +full_val_dataset = val_dataset.map(tokenize_function, batched=False) +full_test_dataset = test_dataset.map(tokenize_function, batched=False) + +small_train_dataset = full_train_dataset.shuffle(seed=42).select(range(100)) +small_val_dataset = full_val_dataset.shuffle(seed=42).select(range(100)) +small_test_dataset = full_test_dataset.shuffle(seed=42).select(range(100)) + + +logger.info(full_train_dataset[0]) +logger.info(full_val_dataset[0]) + +metric = load_metric("rouge", cache_dir=args.cache_dir) + +def compute_metrics(eval_preds): ##TODO: implement this + preds, labels = eval_preds + + return metric(preds, labels) + + +training_args = TrainingArguments( + output_dir=args.output_dir, + overwrite_output_dir=True, + do_train=True, + do_eval=True, + eval_steps=500 if not args.use_partial_data else None, + num_train_epochs=args.num_train_epochs, + per_device_train_batch_size=args.per_device_train_batch_size, + gradient_accumulation_steps=args.gradient_accumulation_steps, + learning_rate=args.learning_rate, + evaluation_strategy="epoch", + save_strategy="epoch", + logging_strategy="epoch", + logging_steps=500, + report_to="tensorboard", + logging_dir=f"{args.output_dir}/logs", + load_best_model_at_end=True, + deepspeed=args.deepspeed, +) + +def load_model(args, inference=False): + + # Hack for loading wte module not needed here, since using a causal language model class + if args.zero_shot and not inference: + model = GPT2LMHeadModel.from_pretrained(args.pretrained_model, + pad_token_id=en_tokenizer.pad_token_id, + cache_dir=args.cache_dir, + revision=args.revision) + else: + model = GPT2LMHeadModel.from_pretrained(args.pretrained_model, + pad_token_id=tokenizer.pad_token_id, + cache_dir=args.cache_dir, + revision=args.revision) + if not args.zero_shot or (args.zero_shot and inference): + # if not zero shot, that means 
that we need to replace the embedding layers during training + # we also need to replace embedding layers during inference + causal_lm_model = AutoModelForCausalLM.from_pretrained(args.original_model, revision=args.revision) + + # change the embedding layer of the original big science model + # by loading the adapters (which has saved lm_head) + causal_lm_model.resize_token_embeddings(len(tokenizer)) + if args.madx_lang_adapter: + causal_lm_model.load_adapter(args.madx_lang_adapter, config="pfeiffer+inv") + + # model has original bigscience embedding so replace it. + model.resize_token_embeddings(len(tokenizer)) + model._modules['transformer']._modules['wte'] = causal_lm_model._modules['transformer']._modules['wte'] + + # TODO: change the logic here for loading/training the adapters + if not inference: + if not args.zero_shot: + if args.madx_lang_adapter: + adapter_name = model.load_adapter(args.madx_lang_adapter, + config="pfeiffer+inv", + load_as=args.adapter_lang_name) + if args.finetune_strategies == "whole": + model.set_active_adapters(adapter_name) + elif args.finetune_strategies == "lang_adapters": + model.train_adapter([args.adapter_lang_name]) + elif args.finetune_strategies == "task_adapters": + model.add_adapter("xlsum-task-adapter") + model.train_adapter("xlsum-task-adapter") + else: + raise ValueError("invalid configuration") + + print("🔥 ==================== Training: ==================== 🔥") + # for name, param in model.named_parameters(): + # if not param.requires_grad: + # print(f"🥶 Frozen layer '{name}'") + # else: + # print(f"🚀 Trainable layer '{name}'") + # print(model) + else: + print("🔥 ==================== Inference: ==================== 🔥") + if args.finetune_strategies == "lang_adapters": + assert args.pretrained_adapters_dir + adapter_name = model.load_adapter(f"{args.pretrained_adapters_dir}/{args.adapter_lang_name}") + model.set_active_adapters(adapter_name) + elif args.finetune_strategies == "task_adapters": + if args.madx_lang_adapter: + assert args.pretrained_adapters_dir + adapter_name = model.load_adapter(args.madx_lang_adapter) + model.set_active_adapters(adapter_name) + adapter_name = model.load_adapter(f"{args.pretrained_adapters_dir}/xlsum-task-adapter") + model.set_active_adapters(adapter_name) + else: + adapter_name = model.load_adapter(f"{args.pretrained_adapters_dir}/xlsum-task-adapter") #TODO: change the argument to this + model.set_active_adapters(adapter_name) + # print(model) + + + return model + +if args.do_train: + logger.info("Starting training...") + model = load_model(args) + trainer = AdapterTrainer( + model=model, + args=training_args, + train_dataset=small_train_dataset if args.use_partial_data else full_train_dataset, + eval_dataset=small_val_dataset if args.use_partial_data else full_val_dataset, + compute_metrics=compute_metrics, + ) + + trainer.train() + +if args.do_predict: + if arg.do_eval_after_train: + evaluation_dirs = list(sorted([ + checkpoint_dir + for checkpoint_dir in os.listdir(args.output_dir) + if checkpoint_dir.startswith("checkpoint-")], + key=lambda x: int(x[len('checkpoint-'):]))) + assert len(evaluation_dirs) > 0 + logger.info(f"Found {len(evaluation_dirs)} checkpoints") + + if args.madx_lang_adapter: + args.pretrained_adapters_dir = f"{args.output_dir}/{evaluation_dirs[-1]}" + logger.info(f"[Evaluation] Loading trained model from {evaluation_dirs[-1]}") + + model = load_model(args, inference=True) + training_args.report_to = list() + + trainer = AdapterTrainer( + model=model, + args=training_args, + 
eval_dataset=small_test_dataset if args.use_partial_data else full_test_dataset, + compute_metrics=compute_metrics, + ) + + print("Evaluating on test set...", trainer.evaluate()) diff --git a/scripts/archive/eval_xnli/crosslingual_exp.sh b/scripts/archive/eval_xnli/crosslingual_exp.sh new file mode 100644 index 0000000..8f1bafe --- /dev/null +++ b/scripts/archive/eval_xnli/crosslingual_exp.sh @@ -0,0 +1,40 @@ +OUTPUT_DIR=./xlsum_ckpts # where to save checkpoints +LANG="thai" # language name, e.g. "thai" not "th" for xlsum. language code e.g. "de" for xnli. +TASK="xlsum" # xlsum or xnli +CACHE_DIR=~/.cache/huggingface/ # cache dir for saving/loading HF models and datasets +LR=1e-5 +MODEL_NAME="bigscience/tr5b-1B3-multilingual-alpha-checkpoints" +TOKENIZER_NAME="bigscience/tr5b-1B3-multilingual-alpha-checkpoints" +REVISION="global_step118500" # branch name, e.g. "global_step118500", if applicable + +DEEPSPEED_CONFIG="./deepspeed_config.json" # deepspeed config file, if using deepspeed +# language adapters checkpoint folder +MADX_LANG_ADAPTER_NAME="" + +# only finetune task adapters +FT_STRATEGIES="task_adapters" + + +mkdir -p $OUTPUT_DIR +deepspeed --include localhost:0 adapters_xlsum_de.py \ +$OUTPUT_DIR \ +--lang $LANG \ +--dataset $TASK \ +--cache_dir $CACHE_DIR \ +--num_train_epochs 2 \ +--learning_rate $LR \ +--per_device_train_batch_size 1 \ +--gradient_accumulation_steps 1 \ +--pretrained_model $MODEL_NAME \ +--tokenizer $TOKENIZER_NAME \ +--do_train \ +--do_eval_after_train \ +--use_partial_data \ +--zero_shot \ +--revision "$REVISION" \ +--adapter_lang_name "xlsum-de" \ +--finetune_strategies $FT_STRATEGIES \ +# --use_partial_data +# --deepspeed $DEEPSPEED_CONFIG + +# --madx_lang_adapter $MADX_LANG_ADAPTER_NAME \ \ No newline at end of file diff --git a/scripts/archive/exp_sentence_retrievale_eval/compute_retrieval_acc.sh b/scripts/archive/exp_sentence_retrievale_eval/compute_retrieval_acc.sh new file mode 100644 index 0000000..a0afcd8 --- /dev/null +++ b/scripts/archive/exp_sentence_retrievale_eval/compute_retrieval_acc.sh @@ -0,0 +1,22 @@ +#!/bin/bash +#SBATCH -p gpu +#SBATCH --gres="gpu:1" +#SBATCH --ntasks=16 +#SBATCH --mem=50g + +# Specify a job name: +#SBATCH -J eval_retrieval_acc + +# Specify an output file +#SBATCH -o /tmp-network/user/vnikouli/Projects/bigscience/logs/eval_retrieval_acc-%j.out +#SBATCH -e /tmp-network/user/vnikouli/Projects/bigscience/logs/eval_retrieval_acc-%j.err + +#SBATCH --mail-type=BEGIN,END,FAIL +#SBATCH --mail-user=vassilina.nikoulina@naverlabs.com + + +model=$1 +dataset=$2 +outdir=retrieval_acc_${model}-${dataset} +mkdir $outdir +python eval_sentence_retrieval.py $outdir --pretrained_model $model --tokenizer $model --dataset $dataset diff --git a/scripts/exp_sentence_retrievale_eval/compute_retrieval_acc_bs.sh b/scripts/archive/exp_sentence_retrievale_eval/compute_retrieval_acc_bs.sh similarity index 100% rename from scripts/exp_sentence_retrievale_eval/compute_retrieval_acc_bs.sh rename to scripts/archive/exp_sentence_retrievale_eval/compute_retrieval_acc_bs.sh diff --git a/scripts/archive/exp_sentence_retrievale_eval/eval_sentence_retrieval.py b/scripts/archive/exp_sentence_retrievale_eval/eval_sentence_retrieval.py new file mode 100644 index 0000000..3fdf4e3 --- /dev/null +++ b/scripts/archive/exp_sentence_retrievale_eval/eval_sentence_retrieval.py @@ -0,0 +1,222 @@ +import logging +import argparse +import os +from datasets import load_dataset +from collections import namedtuple +import torch +import numpy as np +from transformers import 
BertTokenizer, BertModel +from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForMaskedLM +import matplotlib +import matplotlib.pyplot as plt +import seaborn as sns +import pandas as pd +import os.path +import sys +from loguru import logger +import random +logger.remove() +logger.add(sys.stderr, format="{level} {level.icon} | [{time}] - {message}") + + +# parser +parser = argparse.ArgumentParser() +parser.add_argument("output_dir") +parser.add_argument("--pretrained_model", default="bert-base-multilingual-cased") +parser.add_argument("--tokenizer", default="bert-base-multilingual-cased") +parser.add_argument("--dataset", default="ted_multi") +parser.add_argument("--device", default="cuda") +args = parser.parse_args() + +tokenizer = AutoTokenizer.from_pretrained(args.tokenizer) +ted_lngs = ['am', 'ar', 'bn', 'ca', 'en', 'es', 'fr', 'hi', 'id', 'ja', 'pt', 'zh-cn', 'zh-tw', 'pt-br'] +flores_lng = ["amh", "bos", "cat", "eng", "spa", "fra", "hin", "ind", "jpn", "por", "swh", "vie", "urd"] +bs_languages = ["id", "eu", "vi", "zh", "ur", "es", "ca", "pt", "fr", "en", "hi", "ar", "bn"] +lngcode_map = {"am":"amh", "bn":"bos", "ca":"cat", "en":"eng", "es":"spa", "fr": "fra", "hi": "hin", "id": "ind", "ja": "jpn", "pt": "por", "ur":"urd", "vi":"vie" } + + +print("Arguments: ========") +print(args) + + +def load_dataset_(args): + if args.dataset == "ted_multi": + return load_dataset_ted(args) + if args.dataset == "flores": + return load_dataset_flores(args) + + +def load_dataset_flores_for_lng(args, lng): + dataset = load_dataset("gsarti/flores_101", lngcode_map[lng])['dev'] + return dataset + +def load_dataset_flores(args): + dataset = {} + for lng in bs_languages: + if lng in lngcode_map: + load_dataset_flores_for_lng(args, lng) + return dataset + +def load_dataset_ted(args): + dataset = load_dataset("ted_multi")['validation'] + return dataset + +def get_talks(dataset, nb_talks): + talk_names = [] + for t in dataset['talk_name']: + if len(talk_names) < nb_talks and not t in talk_names: + talk_names.append(t) + + + print([(t1, len([t for t in dataset['talk_name'] if t == t1])) for t1 in talk_names]) + return talk_names + +def load_model(args): + if "xlm" in args.pretrained_model or "bert" in args.pretrained_model: + model = AutoModelForMaskedLM.from_pretrained(args.pretrained_model) + else: + model = AutoModelForCausalLM.from_pretrained(args.pretrained_model) + model.config.output_hidden_states=True + return model.to(args.device) + +Sample = namedtuple( + "Sample", + ("id", "hidden_state") +) + +def load_from_file(fname): + return torch.load(fname) + + +def get_hidden_states(args, model): + if args.dataset == "ted_multi": + dataset = load_dataset_(args) + nb_talks = 2 + talks = get_talks(dataset, nb_talks) + + emb = get_hidden_states_for_talks(dataset, model, talks, args.pretrained_model) + + outname = f"{args.output_dir}/{args.pretrained_model.replace('/','-')}-talks-valid-{len(talks)}" + + elif args.dataset == "flores": + nb_samples = 200 + emb = get_hidden_states_for_flores(args, model, args.pretrained_model, nb_samples = nb_samples) + outname = f"{args.output_dir}/{args.pretrained_model.replace('/','-')}-flores-{nb_samples}" + + retrieval_acc = {} + nb_states = model.config.num_hidden_layers + fig, ax = plt.subplots(1, int(nb_states/step), figsize=(12*int(nb_states/step), 10)) + + + with open(f"{outname}.log", 'w') as fout: + for state in range(0, nb_states, step): + plot_retrieval_acc(state, emb, ax[int(state/step)], fout) + + fig.tight_layout() + 
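+    # save the grid of per-layer sentence-retrieval-accuracy heatmaps to a single figure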
plt.savefig(f'{outname}-heatmap.png') + + +def get_hidden_states_for_flores(args, model, mname, nb_samples=50): + emb = {} + hidden_state_size = model.config.num_hidden_layers + for lng in bs_languages: + if lng in lngcode_map: + fname = f"{args.output_dir}/flores-{lng}-{nb_samples}-{mname.replace('/','-')}.pt" + if os.path.isfile(fname): + emb[lng] = load_from_file(fname) + else: + dataset = load_dataset_flores_for_lng(args, lng) + emb[lng] = {} + for state in range(hidden_state_size): + emb[lng][state] = [] + for i, sid in enumerate(dataset['id'][:nb_samples]): + t = dataset['sentence'][i] + x = tokenizer(t, return_tensors="pt").input_ids.to(model.device) + out = model(x) + for state in range(hidden_state_size): + hs = torch.mean(out.hidden_states[state][0][1:-1], dim=0).detach() + emb[lng][state].append(Sample(sid, hs)) + torch.save(emb[lng], fname) + return emb + + +def get_hidden_states_for_talks(dataset, model, talks, mname): + emb = {} + hidden_state_size = model.config.num_hidden_layers + fname = f"{args.output_dir}/ted_multi-{mname.replace('/','-')}-ted_multi-{len(talks)}.pt" + if os.path.isfile(fname): + emb = load_from_file(fname) + return emb + for sid, sample in enumerate(dataset): + if sample['talk_name'] in talks: + tsample = sample['translations'] + for i, lng in enumerate(tsample['language']): + if lng in bs_languages: + t = tsample['translation'][i] + x = tokenizer(t, return_tensors="pt").input_ids.to(model.device) + if not lng in emb: + emb[lng] = {} + for state in range(hidden_state_size): + emb[lng][state] = [] + out = model(x) + for state in range(hidden_state_size): + hs = torch.mean(out.hidden_states[state][0], dim=0).detach() + emb[lng][state].append(Sample(sid, hs)) + torch.save(emb, fname) + return emb + + +def compute_sent_retrieval_acc(lng1, lng2, emb, state, out): + cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6) + E1 = torch.stack([s[1] for s in emb[lng1][state]]) + E2 = torch.stack([s[1] for s in emb[lng2][state]]) + #cos_matrix = [[cos(E2[i],E1[j]) for i in range(E2.shape[0]) ] for j in range(E1.shape[0])] + match = 0 + intersection_ids = set([emb[lng1][state][i][0] for i in range(E1.shape[0])]).intersection( + set([emb[lng2][state][i][0] for i in range(E2.shape[0])]) + ) + if len(intersection_ids)>0: + random_acc = 1/len(intersection_ids) + for i in range(E1.shape[0]): + if emb[lng1][state][i][0] in intersection_ids: + cos_sim = [cos(E2[j], E1[i]) for j in range(E2.shape[0])] + best_match = torch.argmax(torch.stack(cos_sim)) + if emb[lng2][state][best_match][0] == emb[lng1][state][i][0]: + match +=1 + acc = match/len(intersection_ids) + out.write(f"{lng1}-{lng2} = {acc} (random {random_acc} )\n") + return acc, len(intersection_ids) + else: + return 0, 0 + +def plot_retrieval_acc(state, emb, ax, out): + cmap="RdYlBu" + mean_per_state = 0 + for lng1 in emb: + if not lng1 in retrieval_acc: + retrieval_acc[lng1] = {} + for lng2 in emb: + lng2_chance = 1.0/len(emb[lng2][0]) + #if not lng1 == lng2: + acc, random_acc = compute_sent_retrieval_acc(lng1, lng2, emb, state, out) + retrieval_acc[lng1][lng2] = acc + #retrieval_acc[lng1]["random"] = lng2_chance + mean_acc = np.mean([v for v in retrieval_acc[lng1].values()]) + out.write(f"ACC per {lng1}, layer {state} = {mean_acc} \n" ) + mean_per_state +=mean_acc + mean_per_state = mean_per_state/len(emb.keys()) + out.write(f"ACC overall, layer {state} = {mean_per_state}\n" ) + m_res = pd.DataFrame(retrieval_acc) + m_res.columns=emb.keys() + m_res.index=emb.keys()#[e for e in emb.keys()]+["random"] + 
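+    # one heatmap panel per probed layer (state): rows/columns are the languages present in `emb`, and each cell is the sentence-retrieval accuracy in [0, 1]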
ax.set_title(f"state {state}") + sns.heatmap(m_res, ax=ax, annot=False, vmin=0, vmax=1.0, center=0, cmap=cmap) + + + +lngs2consider = ['am', 'ar', 'bn', 'ca', 'en', 'es', 'fr', 'hi', 'id', 'ja', 'pt', 'zh-cn', 'zh-tw', 'pt-br'] +samples = 10 +model = load_model(args) +retrieval_acc = {} +step=1 +get_hidden_states(args, model) diff --git a/scripts/madx_exp/madx_lngembft_clm.py b/scripts/archive/madx_exp/madx_lngembft_clm.py similarity index 100% rename from scripts/madx_exp/madx_lngembft_clm.py rename to scripts/archive/madx_exp/madx_lngembft_clm.py diff --git a/scripts/archive/madx_exp/madxlastlayer_lngembft_clm.py b/scripts/archive/madx_exp/madxlastlayer_lngembft_clm.py new file mode 100644 index 0000000..7234cea --- /dev/null +++ b/scripts/archive/madx_exp/madxlastlayer_lngembft_clm.py @@ -0,0 +1,618 @@ +""" +Source: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/language-modeling/run_clm.py +""" + +import logging +import math +import os +import sys +from dataclasses import dataclass, field +from typing import Optional + +import torch +import pathlib + +import datasets +from datasets import load_dataset + +import transformers +import transformers.adapters.composition as ac +from transformers import ( + CONFIG_MAPPING, + MODEL_FOR_CAUSAL_LM_MAPPING, + AdapterTrainer, + AutoConfig, + AutoModelForCausalLM, + AutoTokenizer, + HfArgumentParser, + MultiLingAdapterArguments, + Trainer, + TrainingArguments, + default_data_collator, + set_seed, +) +from transformers.adapters.configuration import AdapterConfig +from transformers.testing_utils import CaptureLogger +from transformers.trainer_utils import get_last_checkpoint +from transformers.utils import check_min_version +from transformers.utils.versions import require_version + + +# Will error if the minimal version of Transformers is not installed. Remove at your own risks. +check_min_version("4.11.0") + +require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/language-modeling/requirements.txt") + +logger = logging.getLogger(__name__) + + +MODEL_CONFIG_CLASSES = list(MODEL_FOR_CAUSAL_LM_MAPPING.keys()) +MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch. + """ + + model_name_or_path: Optional[str] = field( + default=None, + metadata={ + "help": "The model checkpoint for weights initialization." + "Don't set if you want to train a model from scratch." + }, + ) + model_type: Optional[str] = field( + default=None, + metadata={"help": "If training from scratch, pass a model type from the list: " + ", ".join(MODEL_TYPES)}, + ) + config_overrides: Optional[str] = field( + default=None, + metadata={ + "help": "Override some existing default config settings when a model is trained from scratch. 
Example: " + "n_embd=10,resid_pdrop=0.2,scale_attn_weights=false,summary_type=cls_index" + }, + ) + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + tokenizer_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + cache_dir: Optional[str] = field( + default=None, + metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"}, + ) + use_fast_tokenizer: bool = field( + default=True, + metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."}, + ) + model_revision: str = field( + default="main", + metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, + ) + use_auth_token: bool = field( + default=False, + metadata={ + "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script " + "with private models)." + }, + ) + + def __post_init__(self): + if self.config_overrides is not None and (self.config_name is not None or self.model_name_or_path is not None): + raise ValueError( + "--config_overrides can't be used in combination with --config_name or --model_name_or_path" + ) + + +@dataclass +class DataTrainingArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + """ + + dataset_name: Optional[str] = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + dataset_config_name: Optional[str] = field( + default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."} + ) + train_file: Optional[str] = field(default=None, metadata={"help": "The input training data file (a text file)."}) + validation_file: Optional[str] = field( + default=None, + metadata={"help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."}, + ) + max_train_samples: Optional[int] = field( + default=None, + metadata={ + "help": "For debugging purposes or quicker training, truncate the number of training examples to this " + "value if set." + }, + ) + max_eval_samples: Optional[int] = field( + default=None, + metadata={ + "help": "For debugging purposes or quicker training, truncate the number of evaluation examples to this " + "value if set." + }, + ) + + block_size: Optional[int] = field( + default=None, + metadata={ + "help": "Optional input sequence length after tokenization. " + "The training dataset will be truncated in block of this size for training. " + "Default to the model max input length for single sentence inputs (take into account special tokens)." 
+ }, + ) + overwrite_cache: bool = field( + default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} + ) + validation_split_percentage: Optional[int] = field( + default=5, + metadata={ + "help": "The percentage of the train set used as validation set in case there's no validation split" + }, + ) + preprocessing_num_workers: Optional[int] = field( + default=None, + metadata={"help": "The number of processes to use for the preprocessing."}, + ) + keep_linebreaks: bool = field( + default=True, metadata={"help": "Whether to keep line breaks when using TXT files or not."} + ) + + def __post_init__(self): + if self.dataset_name is None and self.train_file is None and self.validation_file is None: + raise ValueError("Need either a dataset name or a training/validation file.") + else: + if self.train_file is not None: + extension = self.train_file.split(".")[-1] + assert extension in ["csv", "json", "txt"], "`train_file` should be a csv, a json or a txt file." + if self.validation_file is not None: + extension = self.validation_file.split(".")[-1] + assert extension in ["csv", "json", "txt"], "`validation_file` should be a csv, a json or a txt file." + + +def load_tokenizer(model_args): + tokenizer_kwargs = { + "cache_dir": model_args.cache_dir, + "use_fast": model_args.use_fast_tokenizer, + "revision": model_args.model_revision, + "use_auth_token": True if model_args.use_auth_token else None, + } + + if model_args.tokenizer_name: + tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs) + elif model_args.model_name_or_path: + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs) + else: + raise ValueError( + "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You can do it from another script, save it, and load it from here, using --tokenizer_name." + ) + return tokenizer + + + +def load_data(data_args, model_args): + # Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below) + # or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/ + # (the dataset will be downloaded automatically from the datasets Hub). + # + # For CSV/JSON files, this script will use the column called 'text' or the first column if no column called + # 'text' is found. You can easily tweak this behavior (see below). + # + # In distributed training, the load_dataset function guarantee that only one local process can concurrently + # download the dataset. + if data_args.dataset_name is not None: + # Downloading and loading a dataset from the hub. 
+ raw_datasets = load_dataset( + data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir + ) + + else: + data_files = {} + dataset_args = {} + if data_args.train_file is not None: + data_files["train"] = data_args.train_file + if data_args.validation_file is not None: + data_files["validation"] = data_args.validation_file + extension = ( + data_args.train_file.split(".")[-1] + if data_args.train_file is not None + else data_args.validation_file.split(".")[-1] + ) + if extension == "txt": + extension = "text" + dataset_args["keep_linebreaks"] = data_args.keep_linebreaks + raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir, **dataset_args) + + if "validation" not in raw_datasets.keys(): + if data_args.max_eval_samples is not None and data_args.max_train_samples is not None: + raw_datasets = raw_datasets['train'].train_test_split(train_size = data_args.max_train_samples, test_size = data_args.max_eval_samples) + elif data_args.max_eval_samples is not None : + raw_datasets = raw_datasets['train'].train_test_split(test_size = data_args.max_eval_samples) + else: + raw_datasets = raw_datasets['train'].train_test_split(test_size = data.args.validation_split_percentage/100.0) + + raw_datasets['validation'] = raw_datasets['test'] + # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at + # https://huggingface.co/docs/datasets/loading_datasets.html. + + # Load pretrained model and tokenizer + # + # Distributed training: + # The .from_pretrained methods guarantee that only one local process can concurrently + # download model & vocab. + + return raw_datasets + +def load_model(model_args, tokenizer): + config_kwargs = { + "cache_dir": model_args.cache_dir, + "revision": model_args.model_revision, + "use_auth_token": True if model_args.use_auth_token else None, + } + if model_args.config_name: + config = AutoConfig.from_pretrained(model_args.config_name, **config_kwargs) + elif model_args.model_name_or_path: + config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs) + else: + config = CONFIG_MAPPING[model_args.model_type]() + logger.warning("You are instantiating a new config instance from scratch.") + if model_args.config_overrides is not None: + logger.info(f"Overriding config: {model_args.config_overrides}") + config.update_from_string(model_args.config_overrides) + if model_args.model_name_or_path: + model = AutoModelForCausalLM.from_pretrained( + model_args.model_name_or_path, + from_tf=bool(".ckpt" in model_args.model_name_or_path), + config=config, + cache_dir=model_args.cache_dir, + revision=model_args.model_revision, + use_auth_token=True if model_args.use_auth_token else None, + ) + else: + model = AutoModelForCausalLM.from_config(config) + n_params = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values()) + logger.info(f"Training new model from scratch - Total size={n_params/2**20:.2f}M params") + + #TODO: remap embedding parameters + #if not tokenizer.name_or_path == model_args.model_name_or_path: + # orig_tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + model.resize_token_embeddings(len(tokenizer)) + return model + +def preprocess_data(training_args, data_args, model_args, tokenizer): + with training_args.main_process_first(desc="dataset map tokenization"): + saved_tokenized_datasets_fp = pathlib.Path(f"{training_args.data_dir}/tokenized_datasets.pt") + if not tokenizer.name_or_path == 
model_args.model_name_or_path: + saved_tokenized_datasets_fp = pathlib.Path(f"{training_args.data_dir}/lngemb_tokenized_datasets.pt") + + saved_tokenized_datasets_fp.parent.mkdir(parents=True, exist_ok=True) + if saved_tokenized_datasets_fp.exists() and saved_tokenized_datasets_fp.is_file(): + tokenized_datasets = torch.load(str(saved_tokenized_datasets_fp)) + logger.info("Sanity check: loaded tokenized_datasets") + else: + raw_datasets = load_data(data_args, model_args) + # First we tokenize all the texts. + if training_args.do_train: + column_names = raw_datasets["train"].column_names + else: + column_names = raw_datasets["validation"].column_names + + text_column_name = "text" if "text" in column_names else column_names[0] + # since this will be pickled to avoid _LazyModule error in Hasher force logger loading before tokenize_function + tok_logger = transformers.utils.logging.get_logger("transformers.tokenization_utils_base") + + def tokenize_function(examples): + + with CaptureLogger(tok_logger) as cl: + output = tokenizer(examples[text_column_name]) + # clm input could be much much longer than block_size + if "Token indices sequence length is longer than the" in cl.out: + tok_logger.warning( + "^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits before being passed to the model." + ) + return output + tokenized_datasets = raw_datasets.map( + tokenize_function, + batched=True, + num_proc=data_args.preprocessing_num_workers, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on dataset", + ) + torch.save(tokenized_datasets, saved_tokenized_datasets_fp) + logger.info("Sanity check: saved tokenized_datasets") + if "train" not in tokenized_datasets and training_args.do_train: + raise ValueError("--do_train requires a train dataset") + if "validation" not in tokenized_datasets and training_args.do_eval: + raise ValueError("--do_eval requires a validation dataset") + return tokenized_datasets + + +def get_lm_dataset(training_args, data_args, model_args, tokenizer): + if data_args.block_size is None: + block_size = tokenizer.model_max_length + if block_size > 1024: + logger.warning( + f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). " + "Picking 1024 instead. You can change that default value by passing --block_size xxx." + ) + block_size = 1024 + else: + if data_args.block_size > tokenizer.model_max_length: + logger.warning( + f"The block_size passed ({data_args.block_size}) is larger than the maximum length for the model" + f"({tokenizer.model_max_length}). Using block_size={tokenizer.model_max_length}." + ) + block_size = min(data_args.block_size, tokenizer.model_max_length) + # Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size. + def group_texts(examples): + # Concatenate all texts. + concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()} + total_length = len(concatenated_examples[list(examples.keys())[0]]) + # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can + # customize this part to your needs. + if total_length >= block_size: + total_length = (total_length // block_size) * block_size + # Split by chunks of max_len. 
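+        # e.g. with block_size=1024, a 2,500-token stream yields two 1,024-token blocks and the trailing 452 tokens are dropped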
+ result = { + k: [t[i : i + block_size] for i in range(0, total_length, block_size)] + for k, t in concatenated_examples.items() + } + result["labels"] = result["input_ids"].copy() + return result + + # Note that with `batched=True`, this map processes 1,000 texts together, so group_texts throws away a remainder + # for each of those groups of 1,000 texts. You can adjust that batch_size here but a higher value might be slower + # to preprocess. + # + # To speed up this part, we use multiprocessing. See the documentation of the map method for more information: + # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map + + with training_args.main_process_first(desc="grouping texts together"): + saved_lm_datasets_fp = pathlib.Path(f"{training_args.data_dir}/lm_datasets.pt") + if not tokenizer.name_or_path == model_args.model_name_or_path: + saved_lm_datasets_fp = pathlib.Path(f"{training_args.data_dir}/lngemb_lm_datasets.pt") + if saved_lm_datasets_fp.exists() and saved_lm_datasets_fp.is_file(): + lm_datasets = torch.load(str(saved_lm_datasets_fp)) + logger.info("Sanity check: loaded lm_datasets") + else: + + tokenized_datasets = preprocess_data(training_args, data_args, model_args, tokenizer) + lm_datasets = tokenized_datasets.map( + group_texts, + batched=True, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + desc=f"Grouping texts in chunks of {block_size}", + ) + torch.save(lm_datasets, saved_lm_datasets_fp) + logger.info("Sanity check: saved lm_datasets") + return lm_datasets + +def add_adapters(adapter_args, data_args, model): + # Setup adapters + if adapter_args.train_adapter: + task_name = data_args.dataset_name or "clm" + task_name += f"_{adapter_args.language}" + # check if adapter already exists, otherwise add it + if task_name not in model.config.adapters: + # resolve the adapter config + adapter_config = AdapterConfig.load( + adapter_args.adapter_config, + non_linearity=adapter_args.adapter_non_linearity, + reduction_factor=adapter_args.adapter_reduction_factor, + leave_out = [i for i in range(0,23)] + ) + # load a pre-trained from Hub if specified + if adapter_args.load_adapter: + model.load_adapter( + adapter_args.load_adapter, + config=adapter_config, + load_as=task_name, + ) + # otherwise, add a fresh adapter + else: + model.add_adapter(task_name, config=adapter_config) + # optionally load a pre-trained language adapter + if adapter_args.load_lang_adapter: + # resolve the language adapter config + lang_adapter_config = AdapterConfig.load( + adapter_args.lang_adapter_config, + non_linearity=adapter_args.lang_adapter_non_linearity, + reduction_factor=adapter_args.lang_adapter_reduction_factor, + ) + # load the language adapter from Hub + lang_adapter_name = model.load_adapter( + adapter_args.load_lang_adapter, + config=lang_adapter_config, + load_as=adapter_args.language, + ) + else: + lang_adapter_name = None + # Freeze all model weights except of those of this adapter + model.train_adapter([task_name]) + # Set the adapters to be used in every forward pass + if lang_adapter_name: + model.set_active_adapters(ac.Stack(lang_adapter_name, task_name)) + else: + model.set_active_adapters(task_name) + else: + if adapter_args.load_adapter or adapter_args.load_lang_adapter: + raise ValueError( + "Adapters can only be loaded in adapters training mode." 
+ "Use --train_adapter to enable adapter training" + ) + trainable_params = 0 + frozen_params = 0 + emb_params = 0 + for name, param in model.named_parameters(): + if not param.requires_grad: + if not "wte" in name and not "lm_head" in name: + print(f"🥶 Frozen layer '{name}'") + frozen_params +=param.numel() + else: + param.requires_grad = True + print(f"🚀 Trainable layer '{name}'") + emb_params += param.numel() + else: + print(f"🚀 Trainable layer '{name}'") + trainable_params += param.numel() + print(f"Total frozen parameters: {frozen_params}") + print(f"Total emb parameters: {emb_params}") + print(f"Total trainable parameters: {trainable_params}") + +def main(): + # See all possible arguments in src/transformers/training_args.py + # or by passing the --help flag to this script. + # We now keep distinct sets of args, for a cleaner separation of concerns. + + parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments, MultiLingAdapterArguments)) + + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + # If we pass only one argument to the script and it's the path to a json file, + # let's parse it to get our arguments. + model_args, data_args, training_args, adapter_args = parser.parse_json_file( + json_file=os.path.abspath(sys.argv[1]) + ) + else: + model_args, data_args, training_args, adapter_args = parser.parse_args_into_dataclasses() + training_args.data_dir = f'{training_args.output_dir}/../' + # Setup logging + logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + handlers=[logging.StreamHandler(sys.stdout)], + ) + + log_level = training_args.get_process_log_level() + logger.setLevel(log_level) + datasets.utils.logging.set_verbosity(log_level) + transformers.utils.logging.set_verbosity(log_level) + transformers.utils.logging.enable_default_handler() + transformers.utils.logging.enable_explicit_format() + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + logger.info(f"model_args {model_args}") + logger.info(f"data_args {data_args}") + logger.info(f"Training/evaluation parameters {training_args}") + logger.info(f"Adapter parameters {adapter_args}") + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + pass + #raise ValueError( + # f"Output directory ({training_args.output_dir}) already exists and is not empty. " + # "Use --overwrite_output_dir to overcome." + #) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + # Set seed before initializing model. + set_seed(training_args.seed) + + tokenizer = load_tokenizer(model_args) + model = load_model(model_args, tokenizer) + + add_adapters(adapter_args, data_args, model) + # Preprocessing the datasets. 
+ lm_datasets = get_lm_dataset(training_args, data_args, model_args, tokenizer) + if training_args.do_train: + train_dataset = lm_datasets["train"] + + if training_args.do_eval: + + eval_dataset = lm_datasets["validation"] + + + # Initialize our Trainer + trainer_class = AdapterTrainer if adapter_args.train_adapter else Trainer + trainer = trainer_class( + model=model, + args=training_args, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + tokenizer=tokenizer, + # Data collator will default to DataCollatorWithPadding, so we change it. + data_collator=default_data_collator, + ) + + logger.info(model) + + # Training + if training_args.do_train: + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + train_result = trainer.train(resume_from_checkpoint=checkpoint) + trainer.save_model() # Saves the tokenizer too for easy upload + + metrics = train_result.metrics + + max_train_samples = ( + data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset) + ) + metrics["train_samples"] = min(max_train_samples, len(train_dataset)) + + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluation + if training_args.do_eval: + logger.info("*** Evaluate ***") + + metrics = trainer.evaluate() + + max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset) + metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset)) + try: + perplexity = math.exp(metrics["eval_loss"]) + except OverflowError: + perplexity = float("inf") + metrics["perplexity"] = perplexity + + trainer.log_metrics("eval", metrics) + trainer.save_metrics("eval", metrics) + + kwargs = {"finetuned_from": model_args.model_name_or_path, "tasks": "text-generation"} + if data_args.dataset_name is not None: + kwargs["dataset_tags"] = data_args.dataset_name + if data_args.dataset_config_name is not None: + kwargs["dataset_args"] = data_args.dataset_config_name + kwargs["dataset"] = f"{data_args.dataset_name} {data_args.dataset_config_name}" + else: + kwargs["dataset"] = data_args.dataset_name + +# if training_args.push_to_hub: +# trainer.push_to_hub(**kwargs) +# else: +# trainer.create_model_card(**kwargs) + + +def _mp_fn(index): + # For xla_spawn (TPUs) + main() + + +if __name__ == "__main__": + main() diff --git a/scripts/madx_exp/run_clm_madx_lngemb.sh b/scripts/archive/madx_exp/run_clm_madx_lngemb.sh similarity index 100% rename from scripts/madx_exp/run_clm_madx_lngemb.sh rename to scripts/archive/madx_exp/run_clm_madx_lngemb.sh diff --git a/scripts/xnli/README.md b/scripts/archive/xnli/README.md similarity index 100% rename from scripts/xnli/README.md rename to scripts/archive/xnli/README.md diff --git a/scripts/xnli/archive_xnli.py b/scripts/archive/xnli/archive_xnli.py similarity index 100% rename from scripts/xnli/archive_xnli.py rename to scripts/archive/xnli/archive_xnli.py diff --git a/scripts/xnli/xnli_v2.py b/scripts/archive/xnli/xnli_v2.py similarity index 100% rename from scripts/xnli/xnli_v2.py rename to scripts/archive/xnli/xnli_v2.py diff --git a/scripts/eval_xnli/README.md b/scripts/eval/README.md similarity index 84% rename from scripts/eval_xnli/README.md rename to scripts/eval/README.md index 17fc051..f7c1195 100644 --- a/scripts/eval_xnli/README.md +++ 
b/scripts/eval/README.md @@ -30,13 +30,16 @@ $OUTPUT_DIR \ --do_train \ --do_eval_after_train \ --madx_lang_adapter $MADX_LANG_ADAPTER_NAME \ ---adapter_lang_name "xnli-de" \ --finetune_strategies $FT_STRATEGIES \ --zero_shot ``` Remove `--zero_shot` for supervised finetuning setting. +Notes: +- `adapters_xnli_de_vn.py` is Vassilina's forked of `adapters_xnli_de.py`. +- `train_xnli_zero_shot.sh` is the batch script for XNLI training, and `run_eval_xnli_zero_shot.sh` is for evaluating trained XNLI task adapters. + ### Zero-shot Prompt-based Setting See branch [`bigscience-lm-adapt`](https://github.com/yongzx/lm-evaluation-harness/tree/bigscience-lm-adapt) of yongzx/lm-evaluation-harness (forked repo). \ No newline at end of file diff --git a/scripts/eval/eval.py b/scripts/eval/eval.py new file mode 100644 index 0000000..f08ea1a --- /dev/null +++ b/scripts/eval/eval.py @@ -0,0 +1,547 @@ +import logging +import argparse +import os +import json +from tqdm import tqdm + +from datasets import load_dataset +from datasets import load_metric +from collections import namedtuple + +import nltk +import torch +import numpy as np +from transformers import TrainingArguments, Trainer, Seq2SeqTrainer, AdapterTrainer, Seq2SeqAdapterTrainer, Seq2SeqTrainingArguments +from transformers import AutoTokenizer, AutoModelWithLMHead, AutoModelForSequenceClassification, AutoModelForCausalLM, AutoModelForTokenClassification +from transformers import DataCollatorForSeq2Seq +from transformers import ( + get_linear_schedule_with_warmup, + LogitsProcessorList, + BeamSearchScorer, + ForcedEOSTokenLogitsProcessor +) + +# setup logging +import sys +from loguru import logger +logger.remove() +logger.add(sys.stderr, format="{level} {level.icon} | [{time}] - {message}") + + +# AVAILABLE TASKS +XNLI = "xnli" +XLSUM = "csebuetnlp/xlsum" +WIKIANN = "wikiann" + +# parser +parser = argparse.ArgumentParser() +parser.add_argument("output_dir") +parser.add_argument("--train_lang", type=str) +parser.add_argument("--lang", type=str) #xlsum requires a language name, not language code + +tasks = [XNLI, XLSUM, WIKIANN] +parser.add_argument("--dataset", choices=tasks, required=True) + +parser.add_argument("--cache_dir") +parser.add_argument("--num_train_epochs", type=int, default=30) +parser.add_argument("--max_steps", type=int, default=-1) +parser.add_argument("--seed", type=int, default=42) +parser.add_argument("--learning_rate", type=float, default=1e-5) +parser.add_argument("--per_device_train_batch_size", type=int, default=4) +parser.add_argument("--gradient_accumulation_steps", type=int, default=4) +parser.add_argument("--per_device_eval_batch_size", type=int, default=1) +parser.add_argument("--adapted_model_dir") +parser.add_argument("--original_model") +parser.add_argument("--tokenizer") +parser.add_argument("--do_train", default=False, action="store_true") +parser.add_argument("--do_predict", default=False, action="store_true") +parser.add_argument("--use_partial_data", default=False, action="store_true") +parser.add_argument("--use_partial_train_data", type=int, default=100) +parser.add_argument("--use_partial_val_data", type=int, default=-1) +parser.add_argument("--use_partial_test_data", type=int, default=-1) +parser.add_argument("--cross_lingual", default=False, action="store_true") +parser.add_argument("--revision", type=str, default="main") +parser.add_argument("--local_rank", type=int) + +parser.add_argument("--madx_lang_adapter", default=None) +parser.add_argument("--baseline", default=False, action="store_true") 
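+# --baseline evaluates the unadapted original model: no adapted embeddings or MAD-X language adapters are loaded (enforced by the assert below), and only the layers selected via --task_layers are trained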
+parser.add_argument("--deepspeed", required=False) + +task_layers = ["task-adapters", "last-layer", "full-model"] +parser.add_argument("--task_layers", choices=task_layers, required=True) + + +# mapping of tasks to model/trainer classes +model_class_mapping = { + XNLI: AutoModelForSequenceClassification, + XLSUM: AutoModelWithLMHead, + WIKIANN: AutoModelForTokenClassification +} +trainer_no_task_adpt_class_mapping = {XNLI: Trainer, XLSUM: Seq2SeqTrainer, WIKIANN: Trainer} +trainer_class_mapping = {XNLI: AdapterTrainer, XLSUM: Seq2SeqAdapterTrainer, WIKIANN: AdapterTrainer} +trainer_args_mapping = {XNLI: TrainingArguments, XLSUM: Seq2SeqTrainingArguments, WIKIANN: TrainingArguments} +task_eval_metric_best_model = {XNLI: 'eval_accuracy', XLSUM: 'eval_loss', WIKIANN: 'eval_overall_f1'} + +args = parser.parse_args() + +# XLSUM +XLSUM_INPUT_LEN = 512 +XLSUM_OUTPUT_LEN = 64 +XLSUM_NUM_BEAMS = 1 +XLSUM_LEN_PENALTY = 0.6 + +#### Process args +if not args.cross_lingual and not args.train_lang: + args.train_lang = args.lang +# ensure that only when cross_lingual, train_lang is not the same as lang +assert not ((args.train_lang != args.lang) ^ args.cross_lingual) + +if args.baseline: + logger.warning("❗️ No 'madx_lang_adapter' loaded. This should be the baseline performance.") + assert not args.madx_lang_adapter + +# additional args to pass to the model init. task-dependent +optional_model_kwargs = {} +optional_trainer_args = {} +optional_eval_args = {} +if args.dataset == XNLI: + optional_model_kwargs = {"num_labels": 3} +elif args.dataset == WIKIANN: + optional_model_kwargs = {"num_labels": 7} +elif args.dataset == XLSUM: + optional_trainer_args = {"generation_max_length": XLSUM_INPUT_LEN + XLSUM_OUTPUT_LEN, + "predict_with_generate":True, + "optim": "adafactor", + "lr_scheduler_type": "linear", + "warmup_ratio": 0.1} + +if args.local_rank: + torch.cuda.set_device(args.local_rank) + +if args.original_model is None: + # here: because the wpe is not saved, adapted_model_dir is the original bigsciece model + args.original_model = args.adapted_model_dir + +print("Arguments: ========") +print(args) + +# load appropriate dataset +logger.info("Loading dataset...") + +# will need to rename splits if the dataset has different name for validation set +if args.cross_lingual: + logger.info(f"0️⃣ Cross Lingual setting") + logger.info(f"train lang: {args.train_lang}; inference lang: {args.lang}") + # cross lingual: use english as train and validation set + en_dataset = load_dataset(args.dataset, args.train_lang, cache_dir=args.cache_dir) + dataset = load_dataset(args.dataset, args.lang, cache_dir=args.cache_dir) + + train_dataset = en_dataset["train"] + val_dataset = en_dataset["validation"] + test_dataset = dataset["test"] +else: + logger.info(f"👀 Supervised training setting") + logger.info(f"language: {args.lang})") + dataset = load_dataset(args.dataset, args.lang, cache_dir=args.cache_dir) + + train_dataset = dataset["train"] + val_dataset = dataset["validation"] + test_dataset = dataset["test"] + +if args.use_partial_data: + train_dataset = train_dataset.shuffle(seed=args.seed).select(range(args.use_partial_train_data)) + if args.use_partial_val_data != -1: + val_dataset = val_dataset.shuffle(seed=args.seed).select(range(args.use_partial_val_data)) + if args.use_partial_test_data != -1: + test_dataset = test_dataset.shuffle(seed=args.seed).select(range(args.use_partial_test_data)) + logger.warning("🚨 Loading partial data!") + +if args.do_train: + logger.info(f"train = {len(train_dataset)} samples") +else: + 
logger.info(f"args.do_train = False") +logger.info(f"val = {len(val_dataset)} samples") +logger.info(f"test = {len(test_dataset)} samples") + +# load tokenizer +logger.info(f"Loading tokenizer from {args.tokenizer}...") +tokenizer = AutoTokenizer.from_pretrained(args.tokenizer, cache_dir=args.cache_dir, revision=args.revision, add_prefix_space=args.dataset in [WIKIANN]) +if tokenizer.pad_token is None: + tokenizer.pad_token = tokenizer.eos_token +if tokenizer.sep_token is None: + tokenizer.sep_token = tokenizer.bos_token + +# TODO: we probably need better code for this than multiple if-else statements +en_tokenizer = AutoTokenizer.from_pretrained(args.original_model, cache_dir=args.cache_dir, revision=args.revision, add_prefix_space=args.dataset in [WIKIANN]) +if en_tokenizer.pad_token is None: + en_tokenizer.pad_token = en_tokenizer.eos_token +if en_tokenizer.sep_token is None: + en_tokenizer.sep_token = en_tokenizer.bos_token + # en_tokenizer.add_special_tokens({'sep_token':'<|sep|>'}) + +if args.dataset == XNLI: + if tokenizer.eos_token is None: + tokenizer.eos_token = tokenizer.sep_token + if en_tokenizer.eos_token is None: + en_tokenizer.eos_token = en_tokenizer.sep_token + + def tokenize_function(examples): + return tokenizer(f'{examples["premise"]} {tokenizer.eos_token} {examples["hypothesis"]}', max_length=128, padding="max_length", truncation=True) + + def en_tokenize_function(examples): + return en_tokenizer(f'{examples["premise"]} {tokenizer.eos_token} {examples["hypothesis"]}', max_length=128, padding="max_length", truncation=True) + +elif args.dataset == XLSUM: + # for decoder only structure, input and target needs to have the same length + # also, unlike enc-dec model, we cannot feed the model some text and expect the model to generate only summary + # we need to train the model with [text] + [sep] + [summary]. + def tokenize_function(example): + inputs = tokenizer(f'{example["text"]}', max_length=XLSUM_INPUT_LEN, padding="max_length", truncation=True) + inputs['input_ids'][-1] = tokenizer.sep_token_id + + with tokenizer.as_target_tokenizer(): + summaries = tokenizer(f'{example["summary"]}', max_length=XLSUM_OUTPUT_LEN, padding="max_length", truncation=True) + + inputs['input_ids'] += summaries['input_ids'] + inputs['attention_mask'] += summaries['attention_mask'] + inputs['labels'] = inputs['input_ids'] + + return inputs + + def en_tokenize_function(example): + ... + # inputs = en_tokenizer(f'{example["text"]}', max_length=512, padding="max_length", truncation=True) + + # with en_tokenizer.as_target_tokenizer(): + # summaries = en_tokenizer(f'{example["summary"]}', max_length=512, padding="max_length", truncation=True) + + # inputs["labels"] = summaries["input_ids"] + + # return inputs + +elif args.dataset == WIKIANN: + def tokenize_function(examples): + tokenized_inputs = tokenizer(examples['tokens'], is_split_into_words=True, max_length=128, padding="max_length", truncation=True) + + word_ids = tokenized_inputs.word_ids() # Map tokens to their respective word. + previous_word_idx = None + label_ids = [] + for word_idx in word_ids: # Set the special tokens to -100. + if word_idx is None: + label_ids.append(-100) + elif word_idx != previous_word_idx: # Only label the first token of a given word. 
+ label_ids.append(examples[f"ner_tags"][word_idx]) + else: + label_ids.append(-100) + previous_word_idx = word_idx + + tokenized_inputs["labels"] = label_ids + return tokenized_inputs + + def en_tokenize_function(examples): + return en_tokenizer(examples['tokens'], is_split_into_words=True, max_length=128, padding="max_length", truncation=True) + + +# tokenizing the dataset +logger.info("Tokenizing the dataset...") +if args.do_train: + if args.cross_lingual: + train_dataset = train_dataset.map(en_tokenize_function, batched=False) + val_dataset = val_dataset.map(en_tokenize_function, batched=False) + else: + train_dataset = train_dataset.map(tokenize_function, batched=False) + val_dataset = val_dataset.map(tokenize_function, batched=False) + + logger.info("Print one tokenized dataset example ...") + logger.info(train_dataset[0]) + +test_dataset = test_dataset.map(tokenize_function, batched=False) + +# TODO: same as above, we probably need a better way than if-else statements. +# load metric +logger.info("Loading metric...") + +if args.dataset == XNLI: + metric = load_metric("xnli") + + def compute_metrics(eval_pred): + logits, labels = eval_pred + predictions = np.argmax(logits, axis=-1) + return metric.compute(predictions=predictions, references=labels) + +elif args.dataset == WIKIANN: + metric = load_metric("seqeval") + idx2labelname = {i: label for i, label in enumerate(dataset["train"].features[f"ner_tags"].feature.names)} + + def compute_metrics(eval_pred): + logits, golds = eval_pred + predictions = np.argmax(logits, axis=-1) + + converted_golds = list() + converted_preds = list() + + for i in range(golds.shape[0]): + gold, pred = list(), list() + for j in range(golds.shape[1]): + if golds[i][j] != -100: + gold.append(idx2labelname[golds[i][j]]) + pred.append(idx2labelname[predictions[i][j]]) + converted_golds.append(gold) + converted_preds.append(pred) + + return metric.compute(predictions=converted_preds, references=converted_golds) + +elif args.dataset == XLSUM: + metric = load_metric('rouge') + + def compute_metrics(eval_preds): + return {} + + def compute_xlsum_beam_search_metrics(model, dataset): + # get input sentences + # print(torch.Tensor(dataset['input_ids']).type(torch.IntTensor)) + input_ids = torch.Tensor(dataset['input_ids']).type(torch.IntTensor)[:, :XLSUM_INPUT_LEN] + bsz = args.per_device_eval_batch_size + + # get generated summaries + preds = list() + for i in tqdm(range(0, input_ids.shape[0], bsz), desc="Summarization task: generation"): + outputs = model.generate(input_ids[i:i+bsz], max_length=XLSUM_INPUT_LEN+XLSUM_OUTPUT_LEN, length_penalty=XLSUM_LEN_PENALTY, num_beams=XLSUM_NUM_BEAMS) + preds += tokenizer.batch_decode(outputs[:, XLSUM_INPUT_LEN:], skip_special_tokens=True) + + # get gold summaries + labels = np.array(dataset['input_ids'])[:, XLSUM_INPUT_LEN:] + labels = np.where(labels != -100, labels, tokenizer.pad_token_id) + labels = tokenizer.batch_decode(labels, skip_special_tokens=True) + + print(preds) + print(labels) + + # compute ROUGE metrics + preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in preds] + labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in labels] + result = metric.compute(predictions=preds, references=labels) + result = {key: value.mid.fmeasure * 100 for key, value in result.items()} + + return {k: round(v, 4) for k, v in result.items()} + +else: + raise ValueError("Unknown dataset provided") + + +training_args = trainer_args_mapping[args.dataset]( + output_dir=args.output_dir, + overwrite_output_dir=True, 
+ do_train=True, + do_eval=True, + eval_steps=500 if not args.use_partial_data else None, + num_train_epochs=args.num_train_epochs, + max_steps=args.max_steps, + per_device_train_batch_size=args.per_device_train_batch_size, + per_device_eval_batch_size=args.per_device_eval_batch_size, + gradient_accumulation_steps=args.gradient_accumulation_steps, + learning_rate=args.learning_rate, + evaluation_strategy="epoch", + save_strategy="epoch", + logging_strategy="epoch", + logging_steps=500, + report_to="tensorboard", + logging_dir=f"{args.output_dir}/logs", + load_best_model_at_end=True, + metric_for_best_model=task_eval_metric_best_model[args.dataset], + deepspeed=args.deepspeed, + **optional_trainer_args, +) + +def print_model_trainable_layers(model): + for name, param in model.named_parameters(): + if not param.requires_grad: + print(f"🥶 Frozen layer '{name}'") + else: + print(f"🚀 Trainable layer '{name}'") + +def load_model(args, inference=False): + def make_last_layer_trainable(args, model, inference=False): + if model is None: + if not inference: + model_path = args.original_model + else: + model_path = args.pretrained_adapters_dir + print(f"Loaded model from {model_path}") + model = model_class_mapping[args.dataset].from_pretrained(model_path, + pad_token_id=pad_token_id, + cache_dir=args.cache_dir, + revision=args.revision, + **optional_model_kwargs) + model.freeze_model(freeze=True) + return model + + def make_base_model_trainable(args, model, inference=False): + if model is None: + if not inference: + model_path = args.original_model + else: + model_path = args.pretrained_adapters_dir + print(f"Loaded model from {model_path}") + model = model_class_mapping[args.dataset].from_pretrained(model_path, + pad_token_id=pad_token_id, + cache_dir=args.cache_dir, + revision=args.revision, + **optional_model_kwargs) + model.freeze_model(freeze=False) + return model + + def load_task_specific_adapters(args, model, inference=False): + if model is None: + model = model_class_mapping[args.dataset].from_pretrained(args.original_model, + pad_token_id=pad_token_id, + cache_dir=args.cache_dir, + revision=args.revision, + **optional_model_kwargs) + + if not inference: + model.add_adapter(f"{args.dataset.split('/')[-1]}-task-adapter") + model.train_adapter(f"{args.dataset.split('/')[-1]}-task-adapter") + return model + + else: + print(f"[Evaluation] Load task adapters from {args.pretrained_adapters_dir}/{args.dataset.split('/')[-1]}-task-adapter") + adapter_name = model.load_adapter(f"{args.pretrained_adapters_dir}/{args.dataset.split('/')[-1]}-task-adapter") + model.set_active_adapters(adapter_name) + return model + + def load_embedding_layers(args, tokenizer, model): + if "tr5b-1B3" in args.original_model: # previous 1.3B bigsience model + token_embedding = torch.load(f'{args.adapted_model_dir}/embedding_wte.pt') + add_embedding = torch.load(f'{args.adapted_model_dir}/embedding_wpe.pt') + model.transformer.wte = token_embedding + model.transformer.wpe = add_embedding + + elif "bloom" in args.original_model: + token_embedding = torch.load(f'{args.adapted_model_dir}/word_embeddings.pt') + add_embedding = torch.load(f'{args.adapted_model_dir}/word_embeddings_layernorm.pt') + model.transformer.word_embeddings = token_embedding + model.transformer.word_embeddings_layernorm = add_embedding + + logger.info(f"Replaced embeddings with {token_embedding} and {add_embedding}...") + return model + + def load_language_adapters(args, model): + adapter_name = model.load_adapter(args.madx_lang_adapter, 
config="pfeiffer+inv") + model.set_active_adapters(adapter_name) + logger.info(f"Added Adapter {args.madx_lang_adapter}...") + return model + + pad_token_id = en_tokenizer.pad_token_id if (not inference and args.cross_lingual) else tokenizer.pad_token_id + + # baseline: only need to add task-specific adapters + # (keeps separated for now for easier debugging) + if args.baseline: + model = None + if args.task_layers == "task-adapters": + model = load_task_specific_adapters(args, model, inference) + elif args.task_layers == "last-layer": + model = make_last_layer_trainable(args, model, inference) + elif args.task_layers == "full-model": + model = make_base_model_trainable(args, model, inference) + return model + + # load unadapted model + model = model_class_mapping[args.dataset].from_pretrained(args.original_model, + pad_token_id=pad_token_id, + cache_dir=args.cache_dir, + revision=args.revision, + **optional_model_kwargs) + + # load adapted model + if not args.cross_lingual or inference: + model = load_embedding_layers(args, tokenizer, model) + if args.madx_lang_adapter: + model = load_language_adapters(args, model) + + if args.task_layers == "task-adapters": + model = load_task_specific_adapters(args, model, inference) + elif args.task_layers == "last-layer": + model = make_last_layer_trainable(args, model, inference) + return model + + +if args.do_train: + logger.info("Starting training...") + model = load_model(args) + print("🔥 ==================== Training: ==================== 🔥") + print_model_trainable_layers(model) + + # only use seq2seq collator if doing seq2seq task + if args.dataset == XLSUM: + data_collator = DataCollatorForSeq2Seq( + tokenizer, + model=model, + label_pad_token_id=-100, + ) + + if model.active_adapters is None: + logger.info("No active adapters") + trainer_class = trainer_no_task_adpt_class_mapping[args.dataset] + else: + trainer_class = trainer_class_mapping[args.dataset] + logger.info(f"Using {trainer_class_mapping[args.dataset]} for training") + + trainer = trainer_class( + model=model, + args=training_args, + train_dataset=train_dataset, + eval_dataset=val_dataset, + compute_metrics=compute_metrics, + # args for xlsum only + **{"data_collator": data_collator} if args.dataset == XLSUM else {}, + ) + + trainer.train() + + +if args.do_predict: + evaluation_dirs = list(sorted([ + checkpoint_dir + for checkpoint_dir in os.listdir(args.output_dir) + if checkpoint_dir.startswith("checkpoint-")], + key=lambda x: int(x[len('checkpoint-'):]))) + assert len(evaluation_dirs) > 0 + print(f"Found {len(evaluation_dirs)} checkpoints") + + # load the best checkpoint. 
+ with open(f"{args.output_dir}/{evaluation_dirs[-1]}/trainer_state.json") as rf: + args.pretrained_adapters_dir = json.load(rf)['best_model_checkpoint'] + + print(f"[Evaluation] Loading trained model (best checkpoint) from {args.pretrained_adapters_dir}") + + model = load_model(args, inference=True) + model.eval() + training_args.report_to = list() + + if args.dataset == XLSUM: + # use beam search to get the results following the XLSUM paper + print(f"Evaluating on test set ({XLSUM})...") + result = compute_xlsum_beam_search_metrics(model, test_dataset) + print(result) + + else: + if model.active_adapters is None: + logger.info("No active adapters") + trainer_class = trainer_no_task_adpt_class_mapping[args.dataset] + else: + trainer_class = trainer_class_mapping[args.dataset] + + eval_trainer = trainer_class( + model=model, + args=training_args, + eval_dataset=test_dataset, + compute_metrics=compute_metrics, + # args for xlsum only + **{"data_collator": data_collator} if args.dataset == XLSUM else {} + + ) + + print("Evaluating on test set...") + print(eval_trainer.evaluate()) + diff --git a/scripts/eval/scripts_wikiann/baseline_wikiann_de.sh b/scripts/eval/scripts_wikiann/baseline_wikiann_de.sh new file mode 100644 index 0000000..a6b78d4 --- /dev/null +++ b/scripts/eval/scripts_wikiann/baseline_wikiann_de.sh @@ -0,0 +1,67 @@ +#!/bin/bash + +# Request half an hour of runtime: +#SBATCH --time=2-23:59:00 + +# Ask for the GPU partition and 1 GPU +#SBATCH --partition=gpu-he --gres=gpu:1 + +# Default resources are 1 core with 2.8GB of memory. +#SBATCH --ntasks=4 + +# Use more memory (10GB) (CPU RAM): +#SBATCH --mem=100g + +# Specify a job name: +#SBATCH -J exp-021-wikiann-baseline_wikiann_de_task_adapters + +# Specify an output file +#SBATCH -o /users/zyong2/data/zyong2/bigscience/logs/log-021-wikiann/baseline_wikiann_de_task_adapters.out +#SBATCH -e /users/zyong2/data/zyong2/bigscience/logs/log-021-wikiann/baseline_wikiann_de_task_adapters.err + +# Set up the environment by loading modules +set -a # automatically export all variables +source ~/.env +set +a + +module load python/3.7.4 +module load gitlfs/2.7.1 +source $FP_BIGS/env_try_lang_adapter/bin/activate + + +LR=1e-5 + +BIGS_MODEL="bigscience/bloom-1b3" +MODEL_NAME="bigscience/bloom-1b3" +TOKENIZER_NAME="bigscience/bloom-1b3" + +# task-specific arguments +TASK_DATASET="wikiann" +TASK_LAYER="task-adapters" +LANG="de" +OUTPUT_DIR="/users/zyong2/data/zyong2/bigscience/data/processed/021-wikiann/$(basename $BIGS_MODEL)-baseline-${LANG}-FT-${TASK_LAYER}" # where you want to save checkpoints at +CACHE_DIR="/users/zyong2/data/zyong2/huggingface" # cache dir for saving/loading HF models and XNLI datasets. 
+ + +mkdir -p $OUTPUT_DIR + +python /users/zyong2/data/zyong2/bigscience/gh/multilingual-modeling/scripts/eval/eval.py \ +$OUTPUT_DIR \ +--lang $LANG \ +--cache_dir $CACHE_DIR \ +--dataset $TASK_DATASET \ +--num_train_epochs 5 \ +--learning_rate $LR \ +--per_device_train_batch_size 8 \ +--gradient_accumulation_steps 4 \ +--original_model $BIGS_MODEL \ +--adapted_model_dir $MODEL_NAME \ +--tokenizer $TOKENIZER_NAME \ +--do_train \ +--do_predict \ +--task_layers $TASK_LAYER \ +--baseline +# --use_partial_data \ +# --use_partial_train_data 100 \ +# --use_partial_val_data 100 \ +# --use_partial_test_data 100 diff --git a/scripts/eval/scripts_wikiann/wikiann_de_task_adpters.sh b/scripts/eval/scripts_wikiann/wikiann_de_task_adpters.sh new file mode 100644 index 0000000..d80eff5 --- /dev/null +++ b/scripts/eval/scripts_wikiann/wikiann_de_task_adpters.sh @@ -0,0 +1,67 @@ +#!/bin/bash + +# Request half an hour of runtime: +#SBATCH --time=2-23:59:00 + +# Ask for the GPU partition and 1 GPU +#SBATCH --partition=gpu-he --gres=gpu:1 + +# Default resources are 1 core with 2.8GB of memory. +#SBATCH --ntasks=4 + +# Use more memory (10GB) (CPU RAM): +#SBATCH --mem=100g + +# Specify a job name: +#SBATCH -J exp-021-wikiann-bloom1b3_extend_wikiann_de_task_adapters + +# Specify an output file +#SBATCH -o /users/zyong2/data/zyong2/bigscience/logs/log-021-wikiann/bloom1b3_extend_wikiann_de_task_adapters.out +#SBATCH -e /users/zyong2/data/zyong2/bigscience/logs/log-021-wikiann/bloom1b3_extend_wikiann_de_task_adapters.err + +# Set up the environment by loading modules +set -a # automatically export all variables +source ~/.env +set +a + +module load python/3.7.4 +module load gitlfs/2.7.1 +source $FP_BIGS/env_try_lang_adapter/bin/activate + + +LR=1e-5 + +BIGS_MODEL="bigscience/bloom-1b3" +MODEL_NAME="/users/zyong2/data/zyong2/bigscience/data/processed/020/bloom-1b3_de_emb_100000samples_24000vocab_extend" +TOKENIZER_NAME="/users/zyong2/data/zyong2/bigscience/data/processed/020/bloom-1b3_de_emb_100000samples_24000vocab_extend" + +# task-specific arguments +TASK_DATASET="wikiann" +TASK_LAYER="task-adapters" +LANG="de" +OUTPUT_DIR="/users/zyong2/data/zyong2/bigscience/data/processed/021-wikiann/$(basename $MODEL_NAME)-${LANG}-FT-${TASK_LAYER}" # where you want to save checkpoints at +CACHE_DIR="/users/zyong2/data/zyong2/huggingface" # cache dir for saving/loading HF models and XNLI datasets. 
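+# Adapted run: eval.py loads the German embedding-adapted checkpoint (extend strategy, 24000 vocab, 100000 samples) on top of bloom-1b3 and trains WikiAnn task adapters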
+ + +mkdir -p $OUTPUT_DIR + +python ./scripts/eval/eval.py \ +$OUTPUT_DIR \ +--lang $LANG \ +--cache_dir $CACHE_DIR \ +--dataset $TASK_DATASET \ +--num_train_epochs 100 \ +--learning_rate $LR \ +--per_device_train_batch_size 8 \ +--gradient_accumulation_steps 4 \ +--original_model $BIGS_MODEL \ +--adapted_model_dir $MODEL_NAME \ +--tokenizer $TOKENIZER_NAME \ +--do_train \ +--do_predict \ +--task_layers $TASK_LAYER + +# --use_partial_data \ +# --use_partial_train_data 100 \ +# --use_partial_val_data 100 \ +# --use_partial_test_data 100 diff --git a/scripts/eval/scripts_xnli/run_eval_xnli_zero_shot.sh b/scripts/eval/scripts_xnli/run_eval_xnli_zero_shot.sh new file mode 100644 index 0000000..855cde9 --- /dev/null +++ b/scripts/eval/scripts_xnli/run_eval_xnli_zero_shot.sh @@ -0,0 +1,67 @@ +#!/bin/bash +#SBATCH -p gpu +#SBATCH --gres="gpu:1" +#SBATCH --mem=100g + +#SBATCH --mail-type=BEGIN,END,FAIL +#SBATCH --mail-user=vassilina.nikoulina@naverlabs.com +#SBATCH --constraint="gpu_v100&gpu_32g" + +FP_BIGS=/tmp-network/user/vnikouli/Projects/bigscience +# Set up the environment by loading modules +source $FP_BIGS/multilingual-modeling/scripts/env/bin/activate + +# XNLI (Cross-Lingual and Supervised Setting) + +LANG=$1 +data_sample=$2 +vocabsize=$3 +adapter_reduction_factor=$4 + +ch=118500 + + +adapter_config="pfeiffer+inv" +model_name="tr5b-1B3-multilingual-alpha-checkpoints/ch${ch}" +ORIGINAL_MODEL=${FP_BIGS}/multilingual-modeling/scripts/exp-009/$model_name +TOKENIZER_DIR="${FP_BIGS}/tokenizers/${LANG}_oscar_${data_sample}_tokenizer_${vocabsize}" #default tok settings with vocab size = 24k +CACHE_DIR="${FP_BIGS}/data/" +data_dir="${FP_BIGS}/exp-ext-${LANG}/madx-bs1b3-multi-ch${ch}-${LANG}-sample${data_sample}-$( basename $TOKENIZER_DIR )" +data_tok_dir=${data_dir}/lng_tok + +MODEL_DIR="${data_dir}/bs1.3B${ch}-${adapter_config}-${adapter_reduction_factor}-es5" +XNLI_ZH_DIR=$ORIGINAL_MODEL/xnli_task_adapter_full # output directory +LR=1e-5 + +# language adapters checkpoint folder +MADX_LANG_ADAPTER_NAME="$MODEL_DIR/oscar_${LANG}" + +# we finetune task adapters for XNLI +FT_STRATEGIES="task_adapters" + +outdir=$MODEL_DIR/xnli_eval_zero_shot +# evaluate zero-shot training +python adapters_xnli_de_vn.py \ +$XNLI_ZH_DIR \ +--lang $LANG \ +--cache_dir $CACHE_DIR \ +--num_train_epochs 2 \ +--learning_rate $LR \ +--per_device_train_batch_size 8 \ +--gradient_accumulation_steps 4 \ +--pretrained_model $MODEL_DIR \ +--original_model $ORIGINAL_MODEL \ +--tokenizer $TOKENIZER_DIR \ +--do_eval_after_train \ +--madx_lang_adapter $MADX_LANG_ADAPTER_NAME \ +--finetune_strategies "task_adapters" \ +--zero_shot &> $XNLI_ZH_DIR/$( basename $data_dir )-$( basename $MODEL_DIR )_eval.log + + + + +#Remove `--zero_shot` for supervised finetuning setting. + +### Zero-shot Prompt-based Setting + +#See branch [`bigscience-lm-adapt`](https://github.com/yongzx/lm-evaluation-harness/tree/bigscience-lm-adapt) of yongzx/lm-evaluation-harness (forked repo). 
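For reference, a minimal submission sketch for the evaluation script above. The four positional arguments map to `LANG`, `data_sample`, `vocabsize`, and `adapter_reduction_factor` as read at the top of the script; the concrete values below are placeholders, not taken from the original runs.
```
# hypothetical invocation; adjust the placeholder values to your trained adapters
sbatch scripts/eval/scripts_xnli/run_eval_xnli_zero_shot.sh de 100000 24000 16
```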
diff --git a/scripts/eval/scripts_xnli/train_xnli_zero_shot.sh b/scripts/eval/scripts_xnli/train_xnli_zero_shot.sh new file mode 100644 index 0000000..8a9445c --- /dev/null +++ b/scripts/eval/scripts_xnli/train_xnli_zero_shot.sh @@ -0,0 +1,66 @@ +#!/bin/bash + +# Ask for the GPU partition and 1 GPU +#SBATCH -p gpu +#SBATCH --gres="gpu:1" + +# Use more memory (10GB) (CPU RAM): +#SBATCH --mem=100g + +# Specify a job name: +#SBATCH -J run_clm_madx + +#SBATCH --mail-type=BEGIN,END,FAIL +#SBATCH --mail-user=vassilina.nikoulina@naverlabs.com +#SBATCH --constraint="gpu_v100&gpu_32g" + +# XNLI (Cross-Lingual and Supervised Setting) + +FP_BIGS=/tmp-network/user/vnikouli/Projects/bigscience +# Set up the environment by loading modules +source $FP_BIGS/multilingual-modeling/scripts/env/bin/activate + +LANG=$1 +data_sample=$2 +vocabsize=$3 +adapter_reduction_factor=$4 + +ch=118500 + + +adapter_config="pfeiffer+inv" +model_name="tr5b-1B3-multilingual-alpha-checkpoints/ch${ch}" +ORIGINAL_MODEL=${FP_BIGS}/multilingual-modeling/scripts/exp-009/$model_name +TOKENIZER_DIR="${FP_BIGS}/tokenizers/${LANG}_oscar_${data_sample}_tokenizer_${vocabsize}" #default tok settings with vocab size = 24k +CACHE_DIR="${FP_BIGS}/data/" +data_dir="${FP_BIGS}/exp-ext-${LANG}/madx-bs1b3-multi-ch${ch}-${LANG}-sample${data_sample}-$( basename $TOKENIZER_DIR )" +data_tok_dir=${data_dir}/lng_tok + +MODEL_DIR="${data_dir}/bs1.3B${ch}-${adapter_config}-${adapter_reduction_factor}-es5" +OUTPUT_DIR=$ORIGINAL_MODEL/xnli_task_adapter_full +LR=1e-5 + +# language adapters checkpoint folder +MADX_LANG_ADAPTER_NAME="$MODEL_DIR/oscar_de" + +# we finetune task adapters for XNLI +FT_STRATEGIES="task_adapters" + +mkdir -p $OUTPUT_DIR +python adapters_xnli_de_vn.py \ +$OUTPUT_DIR \ +--lang $LANG \ +--cache_dir $CACHE_DIR \ +--num_train_epochs 2 \ +--learning_rate $LR \ +--per_device_train_batch_size 8 \ +--gradient_accumulation_steps 4 \ +--pretrained_model $MODEL_DIR \ +--original_model $ORIGINAL_MODEL \ +--tokenizer $TOKENIZER_DIR \ +--do_train \ +--do_eval_after_train \ +--madx_lang_adapter $MADX_LANG_ADAPTER_NAME \ +--finetune_strategies "task_adapters" \ +--zero_shot &> $OUTPUT_DIR/train.log + diff --git a/scripts/exp_sentence_retrievale_eval/compute_retrieval_acc.sh b/scripts/exp_sentence_retrievale_eval/compute_retrieval_acc.sh index a0afcd8..b100574 100644 --- a/scripts/exp_sentence_retrievale_eval/compute_retrieval_acc.sh +++ b/scripts/exp_sentence_retrievale_eval/compute_retrieval_acc.sh @@ -1,22 +1,16 @@ #!/bin/bash #SBATCH -p gpu #SBATCH --gres="gpu:1" -#SBATCH --ntasks=16 -#SBATCH --mem=50g - -# Specify a job name: -#SBATCH -J eval_retrieval_acc - +#SBATCH --mem=200g +#SBATCH --constraint="gpu_v100&gpu_32g" # Specify an output file -#SBATCH -o /tmp-network/user/vnikouli/Projects/bigscience/logs/eval_retrieval_acc-%j.out -#SBATCH -e /tmp-network/user/vnikouli/Projects/bigscience/logs/eval_retrieval_acc-%j.err - #SBATCH --mail-type=BEGIN,END,FAIL #SBATCH --mail-user=vassilina.nikoulina@naverlabs.com +source /tmp-network/user/vnikouli/Projects/bigscience/multilingual-modeling/env_bloom/bin/activate model=$1 dataset=$2 -outdir=retrieval_acc_${model}-${dataset} -mkdir $outdir -python eval_sentence_retrieval.py $outdir --pretrained_model $model --tokenizer $model --dataset $dataset +outdir=$model/retrieval_acc-${dataset} +mkdir -p $outdir +python eval_sentence_retrieval.py $outdir --pretrained_model $model --tokenizer $model --dataset $dataset --pooling "max_min" \ No newline at end of file diff --git 
a/scripts/exp_sentence_retrievale_eval/eval_sentence_retrieval.py b/scripts/exp_sentence_retrievale_eval/eval_sentence_retrieval.py index 3fdf4e3..fbe33d3 100644 --- a/scripts/exp_sentence_retrievale_eval/eval_sentence_retrieval.py +++ b/scripts/exp_sentence_retrievale_eval/eval_sentence_retrieval.py @@ -26,6 +26,7 @@ parser.add_argument("--tokenizer", default="bert-base-multilingual-cased") parser.add_argument("--dataset", default="ted_multi") parser.add_argument("--device", default="cuda") +parser.add_argument("--pooling", default="mean") args = parser.parse_args() tokenizer = AutoTokenizer.from_pretrained(args.tokenizer) @@ -94,14 +95,14 @@ def get_hidden_states(args, model): nb_talks = 2 talks = get_talks(dataset, nb_talks) - emb = get_hidden_states_for_talks(dataset, model, talks, args.pretrained_model) - - outname = f"{args.output_dir}/{args.pretrained_model.replace('/','-')}-talks-valid-{len(talks)}" + emb = get_hidden_states_for_talks(dataset, model, talks, args.pretrained_model, pooling=args.pooling) + + outname = f"{args.output_dir}/{args.pretrained_model.replace('/','-')}-talks-valid-{len(talks)}-{args.pooling}" elif args.dataset == "flores": nb_samples = 200 - emb = get_hidden_states_for_flores(args, model, args.pretrained_model, nb_samples = nb_samples) - outname = f"{args.output_dir}/{args.pretrained_model.replace('/','-')}-flores-{nb_samples}" + emb = get_hidden_states_for_flores(args, model, args.pretrained_model, nb_samples = nb_samples, pooling=args.pooling) + outname = f"{args.output_dir}/{args.pretrained_model.replace('/','-')}-flores-{nb_samples}-{args.pooling}" retrieval_acc = {} nb_states = model.config.num_hidden_layers @@ -116,12 +117,12 @@ def get_hidden_states(args, model): plt.savefig(f'{outname}-heatmap.png') -def get_hidden_states_for_flores(args, model, mname, nb_samples=50): +def get_hidden_states_for_flores(args, model, mname, nb_samples=50, pooling=""): emb = {} hidden_state_size = model.config.num_hidden_layers for lng in bs_languages: if lng in lngcode_map: - fname = f"{args.output_dir}/flores-{lng}-{nb_samples}-{mname.replace('/','-')}.pt" + fname = f"{args.output_dir}/flores-{lng}-{nb_samples}-{mname.replace('/','-')}-{pooling}.pt" if os.path.isfile(fname): emb[lng] = load_from_file(fname) else: @@ -134,16 +135,20 @@ def get_hidden_states_for_flores(args, model, mname, nb_samples=50): x = tokenizer(t, return_tensors="pt").input_ids.to(model.device) out = model(x) for state in range(hidden_state_size): - hs = torch.mean(out.hidden_states[state][0][1:-1], dim=0).detach() + if "max_min" in fname: + hs = torch.cat([torch.max(out.hidden_states[state][0][1:-1], dim=0).values, torch.min(out.hidden_states[state][0][1:-1], dim=0).values]).detach() + else: + hs = torch.mean(out.hidden_states[state][0][1:-1], dim=0).detach() emb[lng][state].append(Sample(sid, hs)) torch.save(emb[lng], fname) return emb -def get_hidden_states_for_talks(dataset, model, talks, mname): +def get_hidden_states_for_talks(dataset, model, talks, mname, pooling=""): emb = {} hidden_state_size = model.config.num_hidden_layers - fname = f"{args.output_dir}/ted_multi-{mname.replace('/','-')}-ted_multi-{len(talks)}.pt" + + fname = f"{args.output_dir}/ted_multi-{mname.replace('/','-')}-ted_multi-{len(talks)}-{pooling}.pt" if os.path.isfile(fname): emb = load_from_file(fname) return emb @@ -160,7 +165,10 @@ def get_hidden_states_for_talks(dataset, model, talks, mname): emb[lng][state] = [] out = model(x) for state in range(hidden_state_size): - hs = torch.mean(out.hidden_states[state][0], 
dim=0).detach() + if "max_min" in fname: + hs = torch.cat([torch.max(out.hidden_states[state][0], dim=0).values, torch.min(out.hidden_states[state][0], dim=0).values]).detach() + else: + hs = torch.mean(out.hidden_states[state][0], dim=0).detach() emb[lng][state].append(Sample(sid, hs)) torch.save(emb, fname) return emb diff --git a/scripts/lang_adapt/README.md b/scripts/lang_adapt/README.md index afc084d..0c3c6a7 100644 --- a/scripts/lang_adapt/README.md +++ b/scripts/lang_adapt/README.md @@ -1,13 +1,110 @@ # README ### Tokenizer and Tokenization of Dataset -Run `tokenized4clm.py` to train the tokenizer on OSCAR dataset. +Run `tokenized4clm_sampled.py` to train the tokenizer on the subset of OSCAR dataset. - `lang`: language name (e.g., "de", "th") -- `tokenizer_dir`: path directory to save the tokenizer. The tokenizer will be saved as `{lang}_oscar_tokenizer_{vocab_size}` -- `hf_cache_dir` (default is "~/.cache/huggingface/transformers"): cache directory for downloading the OSCAR dataset and GPT2 tokenizer. +- `model`: original tokenizer (e.g., "bigscience/bloom-1b3") +- `tokenizer_dir`: path directory to save the tokenizer. The tokenizer will be saved as `tok_${model}_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_{replace/extend}` +- `cache_dir` (default is "~/.cache/huggingface/transformers"): cache directory for downloading the OSCAR dataset and GPT2 tokenizer. - `vocab_size`: vocab size of the tokenizer +- `sample_size`: the amount of samples to use to train the tokenizer (randomly selected) +- `use_auth_token`: must be used for BLOOM model +- `tok_strategy`: extend, replace or overlap-replace -### Language Adaptation (6 Combinations) -- use `sbatch run_clm_emb.sh` to perform language adaptation with (emb-only, replace-vocab) strategies. Replace the LANG variable for the desired language (currently is `th`). Currently, the script uses slurm-job-array to control the size of the oscar training corpora. Note: remember to change the SLURM logging output files, `tokenizer_dir`, `cache_dir`, `output_dir`, and `logging_dir` in `run_clm_emb.sh`. -- use `sbatch run_clm_adpt.sh` to perform language adaptation with (emb-and-adpt, replace-vocab) strategies. Replace the LANG variable for the desired language (currently is `th`). Currently, the script uses slurm-job-array to control the size of the oscar training corpora and `ADPT_REDUCTION_FACTOR` to control the reduction factor of adapter modules. Note: remember to change the SLURM logging output files, `tokenizer_dir`, `cache_dir`, `output_dir`, and `logging_dir` in `run_clm_adpt.sh`. - - Hack: after `trainer.save_model()`, manually save the `wte` and `wpe` weights. \ No newline at end of file +``` +cache_dir=... +output_dir=... +lang=... # language +sample_size=... # training sample size +vocab_size=... # vocab size of tokenizer +tok_strategy=... # extend, replace, overlap-replace +bigs_model="bigscience/bloom-1b3" + +tokenizer_dir="${output_dir}/tok_$(basename $bigs_model)_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_${tok_strategy}" + +python ./scripts/lang_adapt/tokenized4clm_sampled.py \ +--lang $lang \ +--model $bigs_model \ +--tokenizer_dir $tokenizer_dir \ +--hf_cache_dir $cache_dir \ +--vocab_size $vocab_size \ +--sample_size $sample_size \ +--use_auth_token \ +--tok_strategy $tok_strategy +``` +--- + +### Language Adaptation +Run `madx_run_clm.py` to finetune language model on a new language. 
+- `LANG`: language name (e.g., "de", "th") on OSCAR +- `DATA_SAMPLES`: training sample size +- `VOCAB_SIZE`: vocab size of the tokenizer +- `BIGS_MODEL`: bigscience model +- `ADPT_STRATEGY`: language adaptation strategy (train only embedding for now: `"emb"`) +- `EMBD_SRATEGY`: embedding strategy. Either `"replace"` (replace the embedding layer entirely), `"overlap-replace"` (replace but initialize seen vocab with pretrained embedding), or `"extend"` (freeze seen vocab embeddings and add trainable embeddings for unseen vocab) +- `TOK_STRATEGY`: tokenization strategy (either `"replace"` (for embedding strategy of "replace" and "overlap-replace") or `"extend"`) +- `tokenizer_dir`: saved tokenizer directory (used in the tokenization script above) +- `cache_dir`: (as above) +- `output_dir`: directory to save adapted model +- `logging_dir`: directory to log loss curves to tensorboard +- `MAX_STEPS`: training steps +- `EVAL_STEPS`: number of training steps between two evaluations +- `SAVE_STEPS`: number of training steps between saving the checkpoints. +``` +LANG=... # language +DATA_SAMPLES=... # training sample size +VOCAB_SIZE=... # vocab size of newly trained tokenizer +BIGS_MODEL="bigscience/bloom-1b3" +ADPT_STRATEGY="emb" # language adaptation strategy (train only embedding for now) +EMBD_SRATEGY=... # either "replace", "overlap-replace", or "extend" +TOK_STRATEGY=... # either "replace" (for embedding strategy of "replace" and "overlap-replace") or "extend" + +tokenizer_dir=... # as above +tokenizer_dir="${tokenizer_dir}/tok_${BIGS_MODEL##*/}_${LANG}_oscar_${DATA_SAMPLES}samples_${VOCAB_SIZE}vocab_${TOK_STRATEGY}" +cache_dir=... # as above + +output_dir=... # directory to save adapted model +output_dir="${output_dir}/${BIGS_MODEL##*/}_${LANG}_${ADPT_STRATEGY}_${DATA_SAMPLES}samples_${VOCAB_SIZE}vocab_${EMBD_SRATEGY}" +logging_dir=... 
# directory to log loss curves to tensorboard +logging_dir="${logging_dir}/${BIGS_MODEL##*/}_${LANG}_${ADPT_STRATEGY}_${DATA_SAMPLES}samples_${VOCAB_SIZE}vocab_${EMBD_SRATEGY}" + +mkdir -p $output_dir +mkdir -p $logging_dir + +MAX_STEPS=50000 +EVAL_STEPS=5000 +SAVE_STEPS=5000 + +python ./scripts/lang_adapt/madx_run_clm.py \ + --seed 0 \ + --fp16 \ + --model_name_or_path $BIGS_MODEL \ + --tokenizer_name $tokenizer_dir \ + --dataset_name oscar \ + --cache_dir $cache_dir \ + --dataset_config_name "unshuffled_deduplicated_${LANG}" \ + --logging_dir $logging_dir \ + --report_to "tensorboard" \ + --learning_rate 0.001 \ + --do_train \ + --do_eval \ + --output_dir $output_dir \ + --preprocessing_num_workers 8 \ + --overwrite_output_dir \ + --per_device_train_batch_size 2 \ + --gradient_accumulation_steps 4 \ + --per_device_eval_batch_size 2 \ + --eval_accumulation_steps 4 \ + --eval_steps $EVAL_STEPS \ + --evaluation_strategy "steps" \ + --max_eval_samples 5000 \ + --save_steps $SAVE_STEPS \ + --save_strategy "steps" \ + --max_train_samples $DATA_SAMPLES \ + --max_steps $MAX_STEPS \ + --logging_steps 1000 \ + --lang_adapt_strategies $ADPT_STRATEGY \ + --embedding_strategies $EMBD_SRATEGY \ + --load_best_model_at_end \ + --use_auth_token +``` diff --git a/scripts/lang_adapt/bitfit/run_clm_bitfit_my.sh b/scripts/lang_adapt/bitfit/run_clm_bitfit_my.sh new file mode 100644 index 0000000..dce196c --- /dev/null +++ b/scripts/lang_adapt/bitfit/run_clm_bitfit_my.sh @@ -0,0 +1,56 @@ +#!/bin/bash + +# axis +LANG="my" +DATA_SAMPLES=100000 #$(($SLURM_ARRAY_TASK_ID * 1000)) +VOCAB_SIZE=5000 +CH=118500 +BIGS_MODEL="bigscience/bloom-350m" +ADPT_STRATEGY="emb-and-adpt" +EMBD_SRATEGY="extend" +FTNE_STRATEGY="bitfit" + +tokenizer_dir="checkpoint/tokenizer_ext_my/" +cache_dir="checkpoint/cache/" +output_dir="checkpoint/${BIGS_MODEL##*/}_${LANG}_${ADPT_STRATEGY}_${EMBD_SRATEGY}_${FTNE_STRATEGY}_${DATA_SAMPLES}samples" +logging_dir="logs/${BIGS_MODEL##*/}_${LANG}_${ADPT_STRATEGY}_${EMBD_SRATEGY}_${FTNE_STRATEGY}_${DATA_SAMPLES}samples" + +mkdir -p $output_dir +mkdir -p $logging_dir + +python madx_run_clm.py \ + --seed 0 \ + --fp16 \ + --model_name_or_path $BIGS_MODEL \ + --tokenizer_name $tokenizer_dir \ + --dataset_name oscar \ + --cache_dir $cache_dir \ + --dataset_config_name "unshuffled_deduplicated_${LANG}" \ + --logging_dir $logging_dir \ + --logging_first_step True \ + --logging_steps 8 \ + --report_to "tensorboard" \ + --learning_rate 1e-4 \ + --lr_scheduler_type "constant" \ + --do_train \ + --do_eval \ + --output_dir $output_dir \ + --preprocessing_num_workers 8 \ + --overwrite_output_dir \ + --per_device_train_batch_size 2 \ + --gradient_accumulation_steps 4 \ + --per_device_eval_batch_size 2 \ + --eval_accumulation_steps 1 \ + --eval_steps 1000 \ + --evaluation_strategy "epoch" \ + --max_eval_samples 5000 \ + --save_steps 10000 \ + --save_strategy "epoch" \ + --save_total_limit 3 \ + --max_train_samples ${DATA_SAMPLES}\ + --max_steps 6250 \ + --load_best_model_at_end \ + --lang_adapt_strategies $ADPT_STRATEGY \ + --embedding_strategies $EMBD_SRATEGY \ + --finetuning_strategies $FTNE_STRATEGY \ + --language $LANG &> $output_dir/train.log diff --git a/scripts/lang_adapt/bitfit/train_tokenizer_update.sh b/scripts/lang_adapt/bitfit/train_tokenizer_update.sh new file mode 100644 index 0000000..47d8fa0 --- /dev/null +++ b/scripts/lang_adapt/bitfit/train_tokenizer_update.sh @@ -0,0 +1,17 @@ +#!/bin/bash + +# Ask for the GPU partition and 1 GPU +#SBATCH --partition=cpu + +# Use more memory (10GB) (CPU RAM): 
+#SBATCH --mem=50g + + + +lng=$1 +model=$2 +tokenizer_dir=$3 +vocab_size=$4 +sample_size=$5 +python tokenized4clm_sampled.py --lang $lng --model $model --tokenizer_dir $tokenizer_dir --vocab_size $vocab_size --sample_size $sample_size --extend_vocab + diff --git a/scripts/lang_adapt/compute_tok_overlap.py b/scripts/lang_adapt/compute_tok_overlap.py new file mode 100644 index 0000000..8f95394 --- /dev/null +++ b/scripts/lang_adapt/compute_tok_overlap.py @@ -0,0 +1,94 @@ +import sys +import json +import datasets +from datasets import load_dataset +from transformers import AutoTokenizer +import numpy as np +from collections import defaultdict +import math +import argparse +import matplotlib.pyplot as plt + +def get_en_tokenizer(): + en_tok = AutoTokenizer.from_pretrained('/tmp-network/user/vnikouli/Projects/bigscience/multilingual-modeling/scripts/exp-009/tr5b-1B3-multilingual-alpha-checkpoints/') + return en_tok + +def getdata(lng): + flores_path="/tmp-network/user/vnikouli/Projects/NLE-NMT/data/test_sets/" + with open(f'{flores_path}/FLORES-valid.{lng}') as f: + dataset = f.readlines() + return dataset + +def gettokens(tok, dataset): + from collections import defaultdict + seq_lengths = [] + toks_occ = defaultdict(int) + for i,l in enumerate(dataset): + toks = tok.tokenize(l.strip()) + seq_lengths.append(len(toks)) + toks_occ.update({t:toks_occ[t]+1 for t in toks }) + return np.array(seq_lengths), toks_occ + + + +def plot_histogram(tokoccs, name, ax, nb_bins): + ax.hist(tokoccs, nb_bins, histtype='bar', label=name) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument('--lang', type=str, required=True) + parser.add_argument('--tokenizers', type=str, nargs='+', + help='list of the tokenizers for which you want to get statstics') + parser.add_argument('--plot_name', type=str, default=None, help="If set generate plots containing tokens distribution across different axes (frequency, length, etc)") + args = parser.parse_args() + lng = args.lang + tokenizers = args.tokenizers + vocabs = {} + dataset=getdata(lng) + en_dataset = getdata("en") + seq_lengths = {} + tok_occs = {} + en_tok = get_en_tokenizer() + sl, to = gettokens(en_tok, en_dataset) + seq_lengths['en'] = sl + + for t in tokenizers: + tok = AutoTokenizer.from_pretrained(t) + sl, to = gettokens(tok, dataset) + seq_lengths[t] = sl + tok_occs[t] = to + with open(f'{t}/vocab.json') as jsonFile: + vocab = json.load(jsonFile) + vocabs[t] = set(vocab.keys()) + + + print("Print tokenization stats") + print("===============================") + fig, ax = plt.subplots(1, 4, figsize=(40, 10)) + for t in tokenizers: + print(f'Tokenizer {t}, avg tokenized seq length: {np.mean(seq_lengths[t])} (shorter sequences are better)') + #we want to decompose sentence in {lng} in approximately the same nb of tokens as in English hoping that it will favour knowledge transfer + x = seq_lengths[t]/seq_lengths["en"] + print(f'Tokenizer {t}, avg ratio with En tokenized sentence length: {np.mean(x)}+/- {np.std(x)}') + baseline_overlap = vocabs[t].intersection(set(en_tok.vocab.keys())) + print(f"Overlap with original tokenizer vocab : {len(baseline_overlap)} ") + overlap_vocab_toks = vocabs[t].intersection(set(tok_occs[t].keys())) + print(f"Which portion of new tokenizer was used? 
: {len(overlap_vocab_toks)}, represents {100.0*len(overlap_vocab_toks)/len(vocabs[t])}% of learnt vocab ") + + + if args.plot_name: + print("Do plotting") + fig, ax = plt.subplots(1, 4, figsize=(40, 10)) + ax[0].set_title("Token occ distribution") + plot_histogram([[math.log(v) for v in tok_occs[t].values()] for t in tokenizers], tokenizers, ax[0], 10) + ax[1].set_title("Seq length distribution") + plot_histogram([seq_lengths[t] for t in tokenizers], tokenizers, ax[1], 10) + ax[2].set_title("Diff wtih en seq length distribution") + plot_histogram([seq_lengths[t]/seq_lengths["en"] for t in tokenizers], tokenizers, ax[2], 10) + ax[3].set_title("Tok length distribution") + plot_histogram([[len(v) for v in vocabs[t] for i in range(tok_occs[t][v])] for t in tokenizers], tokenizers, ax[3], 10) + ax[1].legend() + fig.savefig(f"{args.plot_name}.png") + + diff --git a/scripts/lang_adapt/madx_run_clm.py b/scripts/lang_adapt/madx_run_clm.py index bcea14c..0d73be4 100644 --- a/scripts/lang_adapt/madx_run_clm.py +++ b/scripts/lang_adapt/madx_run_clm.py @@ -16,6 +16,8 @@ from datasets import load_dataset import transformers +from transformers import EarlyStoppingCallback + import transformers.adapters.composition as ac from transformers import ( CONFIG_MAPPING, @@ -32,6 +34,8 @@ set_seed, ) from transformers.adapters.configuration import AdapterConfig +from transformers.adapters import PrefixTuningConfig, LoRAConfig + from transformers.testing_utils import CaptureLogger from transformers.trainer_utils import get_last_checkpoint from transformers.utils import check_min_version @@ -99,15 +103,23 @@ class ModelArguments: "with private models)." }, ) - lang_adapt_strategies: str = field( - default="", + reinit_weights: bool = field( + default=False, metadata={"help": "choose one of the three strategies - 'emb', 'emb-and-adpt', 'emb-then-adpt'"}, ) + lang_adapt_strategies: str = field( + default=None, + metadata={"help": "language adaptation strategies"}, + ) embedding_strategies: str = field( default="", - metadata={"help": "choose one of the two strategies - 'replace', 'extend'"}, + metadata={"help": "choose one of the two strategies - 'replace', 'extend', 'overlap-replace'"}, ) - + adapter_placement: str = field( + default="all", + metadata={"help": "list of layers where to place the adapters: all: use all layers, '17,24': list layers id separated by ','"}, + ) + def __post_init__(self): if self.config_overrides is not None and (self.config_name is not None or self.model_name_or_path is not None): raise ValueError( @@ -185,6 +197,40 @@ def __post_init__(self): extension = self.validation_file.split(".")[-1] assert extension in ["csv", "json", "txt"], "`validation_file` should be a csv, a json or a txt file." +@dataclass +class ParamEfficientArguments(MultiLingAdapterArguments): + """ + Arguments pertaining to other parameter efficient techniques such as (LoRA, BitFit, etc.) + """ + # lora + selfattn_lora: bool = field( + default=True, + metadata={"help": "If True, add LoRA to the self-attention weights of a model. Defaults to True."}, + ) + intermediate_lora: bool = field( + default=False, + metadata={"help": "If True, add LoRA to the intermediate MLP weights of a model. Defaults to False."}, + ) + output_lora: bool = field( + default=False, + metadata={"help": "If True, add LoRA to the output MLP weights of a model. Defaults to False."}, + ) + r_lora: Optional[int] = field( + default=8, + metadata={"help": "If True, add LoRA to the output MLP weights of a model. 
Defaults to False."}, + ) + alpha_lora: Optional[int] = field( + default=8, + metadata={"help": "If True, add LoRA to the output MLP weights of a model. Defaults to False."}, + ) + dropout_lora: Optional[float] = field( + default=0.0, + metadata={"help": "If True, add LoRA to the output MLP weights of a model. Defaults to False."}, + ) + init_weights_lora: Optional[str] = field( + default='lora', + metadata={"help": "If True, add LoRA to the output MLP weights of a model. Defaults to False."}, + ) def load_tokenizer(model_args): tokenizer_kwargs = { @@ -193,11 +239,12 @@ def load_tokenizer(model_args): "revision": model_args.model_revision, "use_auth_token": True if model_args.use_auth_token else None, } - if model_args.tokenizer_name: tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs) + print(f"✅ load tokenizer from: {model_args.tokenizer_name}") elif model_args.model_name_or_path: tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs) + print(f"✅ load tokenizer from: {model_args.model_name_or_path}") else: raise ValueError( "You are instantiating a new tokenizer from scratch. This is not supported by this script." @@ -256,14 +303,18 @@ def load_data(data_args, model_args): # Distributed training: # The .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. - - if data_args.max_train_samples is not None: + if data_args.max_train_samples is not None and len(raw_datasets['train']) > data_args.max_train_samples: # FIXME: currently assume the loaded checkpoint is trained with the first data_args.max_train_samples number of samples - raw_datasets["train"] = raw_datasets["train"].filter(lambda example, indice: indice < data_args.max_train_samples, with_indices=True) + #raw_datasets["train"] = raw_datasets["train"].filter(lambda example, indice: indice < data_args.max_train_samples, with_indices=True) + print(raw_datasets["train"]) raw_datasets["train"] = raw_datasets["train"].select(range(data_args.max_train_samples)) - if data_args.max_eval_samples is not None: + print(raw_datasets["train"]) + + if data_args.max_eval_samples is not None and len(raw_datasets['validation']) > data_args.max_eval_samples: raw_datasets["validation"] = raw_datasets["validation"].select(range(data_args.max_eval_samples)) + print("✅ Loaded Raw Dataset:") + print(raw_datasets) return raw_datasets def load_model(model_args, tokenizer): @@ -291,30 +342,34 @@ def load_model(model_args, tokenizer): revision=model_args.model_revision, use_auth_token=True if model_args.use_auth_token else None, ) + print(f"✅ load model from: {model_args.model_name_or_path}") else: model = AutoModelForCausalLM.from_config(config) n_params = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values()) logger.info(f"Training new model from scratch - Total size={n_params/2**20:.2f}M params") + print(f"✅ load model from config: ") + print(config) #TODO: remap embedding parameters - #if not tokenizer.name_or_path == model_args.model_name_or_path: - # orig_tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) - - model.resize_token_embeddings(len(tokenizer)) + return model def preprocess_data(training_args, data_args, model_args, tokenizer): with training_args.main_process_first(desc="dataset map tokenization"): - saved_tokenized_datasets_fp = pathlib.Path(f"{training_args.data_dir}/tokenized_data.pt") + # cache tokenized data + base_cache_dir = 
f"{model_args.cache_dir}/{data_args.dataset_name}/{data_args.dataset_config_name}" + saved_tokenized_datasets_fp = pathlib.Path(f"{base_cache_dir}/tokenized_data_{data_args.max_train_samples}train_{data_args.max_eval_samples}eval_{len(tokenizer)}vocab.pt") - if saved_tokenized_datasets_fp.exists() and saved_tokenized_datasets_fp.is_file(): + if not data_args.overwrite_cache and saved_tokenized_datasets_fp.exists() and saved_tokenized_datasets_fp.is_file(): tokenized_datasets = torch.load(str(saved_tokenized_datasets_fp)) - logger.info(f"✅ loaded tokenized_data") + logger.info(f"✅ loaded tokenized_data from {saved_tokenized_datasets_fp}") else: raw_datasets = load_data(data_args, model_args) assert len(raw_datasets['train']) == data_args.max_train_samples - logger.info(f"🧠 Sanity check: loaded raw datasets have {data_args.max_train_samples} samples") - + assert len(raw_datasets['validation']) == data_args.max_eval_samples + assert len(raw_datasets['test']) == data_args.max_eval_samples + print(f"✅ Sanity check: loaded raw datasets have {data_args.max_train_samples} training samples and {data_args.max_eval_samples} eval samples") + # First we tokenize all the texts. if training_args.do_train: column_names = raw_datasets["train"].column_names @@ -334,7 +389,7 @@ def tokenize_function(examples): "^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits before being passed to the model." ) return output - + tokenized_datasets = raw_datasets.map( tokenize_function, batched=True, @@ -343,12 +398,15 @@ def tokenize_function(examples): load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on dataset", ) + torch.save(tokenized_datasets, saved_tokenized_datasets_fp) - logger.info(f"✅ saved tokenized_data") + logger.info(f"✅ saved tokenized_data to {saved_tokenized_datasets_fp}") + if "train" not in tokenized_datasets and training_args.do_train: raise ValueError("--do_train requires a train dataset") if "validation" not in tokenized_datasets and training_args.do_eval: raise ValueError("--do_eval requires a validation dataset") + return tokenized_datasets @@ -393,10 +451,12 @@ def group_texts(examples): # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map with training_args.main_process_first(desc="grouping texts together"): - saved_lm_datasets_fp = pathlib.Path(f"{training_args.data_dir}/lm_data.pt") - if saved_lm_datasets_fp.exists() and saved_lm_datasets_fp.is_file(): + base_cache_dir = f"{model_args.cache_dir}/{data_args.dataset_name}/{data_args.dataset_config_name}" + saved_lm_datasets_fp = pathlib.Path(f"{base_cache_dir}/lm_data_{data_args.max_train_samples}train_{data_args.max_eval_samples}eval_{len(tokenizer)}vocab.pt") + + if not data_args.overwrite_cache and saved_lm_datasets_fp.exists() and saved_lm_datasets_fp.is_file(): lm_datasets = torch.load(str(saved_lm_datasets_fp)) - logger.info("✅ loaded lm_data") + logger.info(f"✅ loaded lm_data from {saved_lm_datasets_fp}") else: tokenized_datasets = preprocess_data(training_args, data_args, model_args, tokenizer) lm_datasets = tokenized_datasets.map( @@ -407,28 +467,57 @@ def group_texts(examples): desc=f"Grouping texts in chunks of {block_size}", ) torch.save(lm_datasets, saved_lm_datasets_fp) - logger.info("✅ saved lm_data") - print(lm_datasets) + logger.info(f"✅ saved lm_data to {saved_lm_datasets_fp}") return lm_datasets -def modify_model(adapter_args, data_args, model_args, model): - if model_args.lang_adapt_strategies == "emb": - for 
name, param in model.named_parameters(): - if "wte" not in name and "wpe" not in name: - param.requires_grad = False +def modify_model(adapter_args, data_args, model_args, tokenizer, model): + def get_adapter_config(adapter_args, model_args): + # modify here for new parameter efficient techniques associated with adapter-hub + if adapter_args.adapter_config == "prefix_tuning": + if model_args.adapter_placement == "all": + adapter_config = PrefixTuningConfig(bottleneck_size = 800) + else: + adapters2use = set([int(i) for i in model_args.adapter_placement.split(",")]) + adapter_config = PrefixTuningConfig(bottleneck_size = 800, + leave_out = [i for i in range(0,24) if not i in adapters2use] + ) + + elif adapter_args.adapter_config == "lora": + adapter_config = LoRAConfig( + selfattn_lora = adapter_args.selfattn_lora, + intermediate_lora = adapter_args.intermediate_lora, + output_lora = adapter_args.output_lora, + r = adapter_args.r_lora, + alpha = adapter_args.alpha_lora, + dropout = adapter_args.dropout_lora, + init_weights = adapter_args.init_weights_lora, + ) + + else: + # TODO: confirm with Vassilina what goes into this condition + if model_args.adapter_placement == "all": + adapter_config = AdapterConfig.load( + adapter_args.adapter_config, + non_linearity=adapter_args.adapter_non_linearity, + reduction_factor=adapter_args.adapter_reduction_factor + ) + else: + adapters2use = set([int(i) for i in model_args.adapter_placement.split(",")]) + adapter_config = AdapterConfig.load( + adapter_args.adapter_config, + non_linearity=adapter_args.adapter_non_linearity, + reduction_factor=adapter_args.adapter_reduction_factor, + leave_out = [i for i in range(0,24) if not i in adapters2use] + ) + return adapter_config # Setup adapters - elif adapter_args.train_adapter: + if adapter_args.train_adapter: task_name = data_args.dataset_name or "clm" - task_name += f"_{adapter_args.language}" + task_name += f"_{adapter_args.adapter_config}_{adapter_args.language}" # check if adapter already exists, otherwise add it if task_name not in model.config.adapters: - # resolve the adapter config - adapter_config = AdapterConfig.load( - adapter_args.adapter_config, - non_linearity=adapter_args.adapter_non_linearity, - reduction_factor=adapter_args.adapter_reduction_factor, - ) + adapter_config = get_adapter_config(adapter_args, model_args) # load a pre-trained from Hub if specified if adapter_args.load_adapter: model.load_adapter( @@ -436,7 +525,6 @@ def modify_model(adapter_args, data_args, model_args, model): config=adapter_config, load_as=task_name, ) - # otherwise, add a fresh adapter else: model.add_adapter(task_name, config=adapter_config) # optionally load a pre-trained language adapter @@ -455,32 +543,102 @@ def modify_model(adapter_args, data_args, model_args, model): ) else: lang_adapter_name = None + # Freeze all model weights except of those of this adapter - model.train_adapter([task_name]) + model.train_adapter(task_name, train_embeddings=True) + # Set the adapters to be used in every forward pass - if lang_adapter_name: - model.set_active_adapters(ac.Stack(lang_adapter_name, task_name)) - else: - model.set_active_adapters(task_name) + #if lang_adapter_name: + # model.set_active_adapters(ac.Stack(lang_adapter_name, task_name)) + #else: + # model.set_active_adapters(task_name) + else: if adapter_args.load_adapter or adapter_args.load_lang_adapter: raise ValueError( "Adapters can only be loaded in adapters training mode." 
"Use --train_adapter to enable adapter training" ) + + print(f"✅ Use Embedding Strategy: {model_args.embedding_strategies}") + + if model_args.embedding_strategies == "overlap-replace": + if not tokenizer.name_or_path == model_args.model_name_or_path: + orig_tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + else: + raise Exception("Same tokenizer so overlap-replace doesn't make sense.") + + if hasattr(model.transformer, "wte"): + # gpt-2 + ref_embedding = model.transformer.wte + elif hasattr(model.transformer, "word_embeddings"): + # bloom + ref_embedding = model.transformer.word_embeddings + else: + raise Exception("Unsupported Model") + + model.resize_token_embeddings(len(tokenizer)) + overlap = set(tokenizer.vocab).intersection(set(orig_tokenizer.vocab)) + print(f"{len(overlap)} tokens overlapped") + curr_vocab = tokenizer.vocab + orig_vocab = orig_tokenizer.vocab + for t in overlap: + if hasattr(model.transformer, "wte"): + model.transformer.wte.weight.data[curr_vocab[t]] = ref_embedding.weight[orig_vocab[t]] + elif hasattr(model.transformer, "word_embeddings"): + model.transformer.word_embeddings.weight.data[curr_vocab[t]] = ref_embedding.weight[orig_vocab[t]] + else: + raise Exception("Unsupported Model") + model.tie_weights() + + elif model_args.embedding_strategies == "replace": + model.resize_token_embeddings(len(tokenizer)) + model.tie_weights() + + elif model_args.embedding_strategies == "extend": + original_embedding_layer = model.get_input_embeddings() + original_vocab_size = original_embedding_layer.weight.shape[0] + print(f"Tokens for new languages: {len(tokenizer) - original_vocab_size}") + model.resize_token_embeddings(len(tokenizer)) + model.tie_weights() + + embedding_layer = model.get_input_embeddings() + # erases gradients for the original embedding layer, without using extra CUDA memory + def zero_grad(grad): + grad[:original_vocab_size, :] = 0 + return grad + + embedding_layer.weight.register_hook(lambda grad: zero_grad(grad)) + + if model_args.reinit_weights: + print(f"❗️ Reinitialize model's weights") + model.init_weights() + trainable_params = 0 frozen_params = 0 emb_params = 0 for name, param in model.named_parameters(): + if "word_embeddings" in name or "wte" in name or "wpe" in name or "lm_head" in name: + param.requires_grad = True + emb_params += param.numel() + + elif model_args.lang_adapt_strategies is not None: + if model_args.lang_adapt_strategies == "emb": + param.requires_grad = False + elif model_args.lang_adapt_strategies == "bitfit": + if 'bias' not in name: + param.requires_grad = False + else: + param.requires_grad = True + elif model_args.lang_adapt_strategies == "continual-pretrain": + param.requires_grad = True + if not param.requires_grad: print(f"🥶 Frozen layer '{name}'") frozen_params += param.numel() else: print(f"🚀 Trainable layer '{name}'") trainable_params += param.numel() - - if "wte" and "wpe" in name: - emb_params += param.numel() print(f"Total frozen parameters: {frozen_params}") print(f"Total emb parameters (wte, wpe): {emb_params}") @@ -491,7 +649,7 @@ def main(): # or by passing the --help flag to this script. # We now keep distinct sets of args, for a cleaner separation of concerns. 
- parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments, MultiLingAdapterArguments)) + parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments, ParamEfficientArguments)) if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): # If we pass only one argument to the script and it's the path to a json file, @@ -501,10 +659,11 @@ def main(): ) else: model_args, data_args, training_args, adapter_args = parser.parse_args_into_dataclasses() - training_args.data_dir = f'{training_args.output_dir}' + + training_args.data_dir = f'{training_args.output_dir}' - assert model_args.lang_adapt_strategies in ('emb', 'emb-and-adpt', 'emb-then-adpt') - assert model_args.embedding_strategies in ('replace', 'extend') + assert model_args.lang_adapt_strategies in ('continual-pretrain', 'emb', 'madx', 'emb-then-adpt', 'lora', 'bitfit') + assert model_args.embedding_strategies in ('replace', 'extend', 'overlap-replace') # Setup logging logging.basicConfig( @@ -551,8 +710,8 @@ def main(): tokenizer = load_tokenizer(model_args) model = load_model(model_args, tokenizer) - - modify_model(adapter_args, data_args, model_args, model) + modify_model(adapter_args, data_args, model_args, tokenizer, model) + # Preprocessing the datasets. lm_datasets = get_lm_dataset(training_args, data_args, model_args, tokenizer) if training_args.do_train: @@ -560,8 +719,7 @@ def main(): if training_args.do_eval: eval_dataset = lm_datasets["validation"] - - + # Initialize our Trainer trainer_class = AdapterTrainer if adapter_args.train_adapter else Trainer trainer = trainer_class( @@ -571,11 +729,15 @@ def main(): eval_dataset=eval_dataset if training_args.do_eval else None, tokenizer=tokenizer, # Data collator will default to DataCollatorWithPadding, so we change it. - data_collator=default_data_collator, + data_collator=default_data_collator ) - logger.info(model) + print("Model: 👇") + print(model) + + # print("Embeddings at start of run:", model.get_input_embeddings().weight[250880:,:]) # get original weight for embedding layer + # orig_embeddings = model.get_input_embeddings().weight.detach().clone() # clone original weight for embedding layer # Training if training_args.do_train: checkpoint = None @@ -583,8 +745,22 @@ def main(): checkpoint = training_args.resume_from_checkpoint elif last_checkpoint is not None: checkpoint = last_checkpoint + trainer.add_callback(EarlyStoppingCallback(3)) train_result = trainer.train(resume_from_checkpoint=checkpoint) - trainer.save_model() # Saves the tokenizer too for easy upload + trainer.save_model() # Saves the tokenizer too for easy upload # normally this part only saves the adapters? 
(TODO: check) + + # save embedding and positional embedding (which is not saved by trainer) + + # This part is used if we use initial BS 1b3 model (the one used for experiments reported in the paper) + if hasattr(trainer.model.transformer, "wte"): + torch.save(trainer.model.transformer.wte, f'{trainer.args.output_dir}/embedding_wte.pt') # for sanity check + if hasattr(trainer.model.transformer, "wpe"): + torch.save(trainer.model.transformer.wpe, f'{trainer.args.output_dir}/embedding_wpe.pt') + + # this part is used for BLOOM models + if hasattr(trainer.model.transformer, "word_embeddings"): + torch.save(trainer.model.transformer.word_embeddings, f'{trainer.args.output_dir}/word_embeddings.pt') + torch.save(trainer.model.transformer.word_embeddings_layernorm, f'{trainer.args.output_dir}/word_embeddings_layernorm.pt') metrics = train_result.metrics @@ -596,6 +772,18 @@ def main(): trainer.log_metrics("train", metrics) trainer.save_metrics("train", metrics) trainer.save_state() + + # uncomment to test whether extending vocab gradient masking is working correctly. + # if model_args.embedding_strategies == "extend": + # print("Unsliced, post-training:", model.get_input_embeddings().weight) # get updated weight + # if not torch.equal(orig_embeddings[:250880, :], model.get_input_embeddings().weight[:250880, :]): + # raise ValueError("embedding layer is updated where it shouldn't....") + + # if torch.equal(orig_embeddings[250880:, :], model.get_input_embeddings().weight[250880:, :]): + # print("original embeddings:", orig_embeddings[250880:, :]) + # print("updated embeddings:", model.get_input_embeddings().weight[250880:, :]) + # raise ValueError("embedding layer is not updated where it should....") + # Evaluation if training_args.do_eval: @@ -635,4 +823,4 @@ def _mp_fn(index): if __name__ == "__main__": - main() \ No newline at end of file + main() diff --git a/scripts/lang_adapt/run_clm_adpt.sh b/scripts/lang_adapt/scripts/run_clm_adpt.sh similarity index 100% rename from scripts/lang_adapt/run_clm_adpt.sh rename to scripts/lang_adapt/scripts/run_clm_adpt.sh diff --git a/scripts/lang_adapt/scripts/run_clm_adpt_vn.sh b/scripts/lang_adapt/scripts/run_clm_adpt_vn.sh new file mode 100644 index 0000000..c585c22 --- /dev/null +++ b/scripts/lang_adapt/scripts/run_clm_adpt_vn.sh @@ -0,0 +1,83 @@ +#!/bin/bash + +# Ask for the GPU partition and 1 GPU +#SBATCH -p gpu +#SBATCH --gres="gpu:1" + +# Use more memory (10GB) (CPU RAM): +#SBATCH --mem=100g + +# Specify a job name: +#SBATCH -J run_clm_madx + +# Specify an output file +#SBATCH -o /tmp-network/user/vnikouli/Projects/bigscience/logs/run_clm_madx-%j.out +#SBATCH -e /tmp-network/user/vnikouli/Projects/bigscience/logs/run_clm_madx-%j.err + +#SBATCH --mail-type=BEGIN,END,FAIL +#SBATCH --mail-user=vassilina.nikoulina@naverlabs.com +#SBATCH --constraint="gpu_v100&gpu_32g" + +FP_BIGS=/tmp-network/user/vnikouli/Projects/bigscience +# Set up the environment by loading modules +source $FP_BIGS/multilingual-modeling/scripts/env/bin/activate + +data_sample=$1 +ch=118500 +lng=$2 +adapter_reduction_factor=$3 +dataset=oscar +adapter_config="pfeiffer+inv" +vocabsize=1000 +model_name="tr5b-1B3-multilingual-alpha-checkpoints/ch${ch}" +tokenizer_dir="${FP_BIGS}/tokenizers/${lng}_oscar_${data_sample}_tokenizer_${vocabsize}" #default tok settings with vocab size = 24k +cache_dir="${FP_BIGS}/data/" +data_dir="${FP_BIGS}/exp-ext-${lng}/madx-bs1b3-multi-ch${ch}-${lng}-sample${data_sample}-$( basename $tokenizer_dir )" +data_tok_dir=${data_dir}/lng_tok + 
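+# output_dir and logging_dir encode the checkpoint id (ch), adapter config, and reduction factor so that
+# runs with different settings are kept in separate folders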
+output_dir="${data_dir}/bs1.3B${ch}-${adapter_config}-${adapter_reduction_factor}-es5" +logging_dir="${FP_BIGS}/logs/exp-ext-${lng}/madx-bs1b3-multi-ch${ch}-${lng}-sample${data_sample}-$( basename $tokenizer_dir )/bs1.3B${ch}-${adapter_config}-${adapter_reduction_factor}-es5" +echo $output_dir + +BIGS_MODEL=${FP_BIGS}/multilingual-modeling/scripts/exp-009/$model_name + + +mkdir -p $output_dir +mkdir -p $logging_dir + +adapter_config="pfeiffer+inv" +python $FP_BIGS/multilingual-modeling/scripts/lang_adapt/madx_run_clm.py \ + --seed 0 \ + --fp16 \ + --model_name_or_path $BIGS_MODEL \ + --tokenizer_name $tokenizer_dir \ + --dataset_name oscar \ + --cache_dir $cache_dir \ + --dataset_config_name "unshuffled_deduplicated_${lng}" \ + --logging_dir $logging_dir \ + --report_to "tensorboard" \ + --learning_rate 0.001 \ + --do_train \ + --do_eval \ + --output_dir $output_dir \ + --preprocessing_num_workers 8 \ + --overwrite_output_dir \ + --per_device_train_batch_size 2 \ + --gradient_accumulation_steps 4 \ + --per_device_eval_batch_size 2 \ + --eval_accumulation_steps 4 \ + --eval_steps 1000 \ + --evaluation_strategy "epoch" \ + --max_eval_samples 5000 \ + --save_steps 10000 \ + --save_strategy "epoch" \ + --save_total_limit 3 \ + --max_train_samples ${data_sample}\ + --max_steps 50000 \ + --train_adapter \ + --load_best_model_at_end \ + --lang_adapt_strategies "emb-and-adpt" \ + --embedding_strategies "overlap-replace" \ + --adapter_reduction_factor $adapter_reduction_factor \ + --adapter_config ${adapter_config} \ + --language $lng &> $output_dir/train.log diff --git a/scripts/lang_adapt/run_clm_emb.sh b/scripts/lang_adapt/scripts/run_clm_emb.sh similarity index 63% rename from scripts/lang_adapt/run_clm_emb.sh rename to scripts/lang_adapt/scripts/run_clm_emb.sh index cb397ff..6936acc 100644 --- a/scripts/lang_adapt/run_clm_emb.sh +++ b/scripts/lang_adapt/scripts/run_clm_emb.sh @@ -5,7 +5,7 @@ # Ask for the GPU partition and 1 GPU #SBATCH --partition=gpu-he --gres=gpu:1 -#SBATCH --array=100,200,500 +#SBATCH --array=1 # Default resources are 1 core with 2.8GB of memory. 
#SBATCH --ntasks=4 @@ -27,28 +27,34 @@ set +a module load python/3.7.4 module load gitlfs/2.7.1 -source $FP_BIGS/env_lang_adapter/bin/activate +source $FP_BIGS/env_try_lang_adapter/bin/activate # axis -LANG="th" -MAX_TRAIN_SAMPLES=$(($SLURM_ARRAY_TASK_ID * 1000)) -BIGS_MODEL="/users/zyong2/data/zyong2/huggingface/bigscience/tr5b-1B3-multilingual-alpha-checkpoints" +LANG="my" +DATA_SAMPLES=$(($SLURM_ARRAY_TASK_ID * 1000)) +VOCAB_SIZE=5000 +CH=118500 +BIGS_MODEL="bigscience/bloom-1b3" +ADPT_STRATEGY="emb" +EMBD_SRATEGY="replace" - -tokenizer_dir="/users/zyong2/data/zyong2/bigscience/data/processed/020/th_oscar_tokenizer_full" +tokenizer_dir="/users/zyong2/data/zyong2/bigscience/data/processed/020/tok_${BIGS_MODEL##*/}_${LANG}_oscar_${DATA_SAMPLES}samples_${VOCAB_SIZE}vocab_${EMBD_SRATEGY}" cache_dir="/users/zyong2/data/zyong2/huggingface/" -output_dir="/users/zyong2/data/zyong2/bigscience/data/processed/020/${LANG}_emb_${MAX_TRAIN_SAMPLES}samples" -logging_dir="/users/zyong2/data/zyong2/bigscience/data/reports/020/${LANG}_emb_${MAX_TRAIN_SAMPLES}samples" +output_dir="/users/zyong2/data/zyong2/bigscience/data/processed/020/${BIGS_MODEL##*/}_${LANG}_${ADPT_STRATEGY}_${DATA_SAMPLES}samples" +logging_dir="/users/zyong2/data/zyong2/bigscience/reports/020/${BIGS_MODEL##*/}_${LANG}_${ADPT_STRATEGY}_${DATA_SAMPLES}samples" + mkdir -p $output_dir mkdir -p $logging_dir -python $FP_BIGS/scripts/exp-020/madx_run_clm.py \ +python /users/zyong2/data/zyong2/bigscience/gh/multilingual-modeling/scripts/lang_adapt/madx_run_clm.py \ + --seed 0 \ + --fp16 \ --model_name_or_path $BIGS_MODEL \ --tokenizer_name $tokenizer_dir \ --dataset_name oscar \ --cache_dir $cache_dir \ - --dataset_config_name "unshuffled_deduplicated_$LANG" \ + --dataset_config_name "unshuffled_deduplicated_${LANG}" \ --logging_dir $logging_dir \ --report_to "tensorboard" \ --learning_rate 0.001 \ @@ -66,7 +72,9 @@ python $FP_BIGS/scripts/exp-020/madx_run_clm.py \ --max_eval_samples 5000 \ --save_steps 25000 \ --save_strategy "steps" \ - --max_train_samples $MAX_TRAIN_SAMPLES \ + --max_train_samples $DATA_SAMPLES \ --max_steps 50000 \ - --lang_adapt_strategies "emb" \ - --embedding_strategies "replace" \ No newline at end of file + --lang_adapt_strategies $ADPT_STRATEGY \ + --embedding_strategies $EMBD_SRATEGY \ + --load_best_model_at_end \ + --use_auth_token \ No newline at end of file diff --git a/scripts/lang_adapt/scripts/train_tokenizer_scratch.sh b/scripts/lang_adapt/scripts/train_tokenizer_scratch.sh new file mode 100644 index 0000000..8187c2c --- /dev/null +++ b/scripts/lang_adapt/scripts/train_tokenizer_scratch.sh @@ -0,0 +1,46 @@ +#!/bin/bash + +# Request half an hour of runtime: +#SBATCH --time=1:59:00 + +# Ask for the GPU partition and 1 GPU +#SBATCH --partition=gpu-he --gres=gpu:1 + +# Default resources are 1 core with 2.8GB of memory. 
+#SBATCH --ntasks=4 + +# Use more memory (10GB) (CPU RAM): +#SBATCH --mem=50g + +# Specify a job name: +#SBATCH -J exp-020-tokenized4clm_sampled + +# Specify an output file +#SBATCH -o /users/zyong2/data/zyong2/bigscience/logs/log-020/tokenized4clm_sampled_scratch.out +#SBATCH -e /users/zyong2/data/zyong2/bigscience/logs/log-020/tokenized4clm_sampled_scratch.err + +# Set up the environment by loading modules +set -a # automatically export all variables +source ~/.env +set +a + +module load python/3.7.4 +module load gitlfs/2.7.1 +source $FP_BIGS/env_try_lang_adapter/bin/activate + + +# call by `sbatch train_tokenizer_scratch.sh my 1000 5000` +cache_dir="/users/zyong2/data/zyong2/huggingface/" +lng=$1 +sample_size=$2 +vocab_size=$3 +MODEL="bigscience/bloom-1b3" + +python /users/zyong2/data/zyong2/bigscience/gh/multilingual-modeling/scripts/lang_adapt/tokenized4clm_sampled.py \ +--lang $lng \ +--model $MODEL \ +--tokenizer_dir /users/zyong2/data/zyong2/bigscience/data/processed/020 \ +--hf_cache_dir $cache_dir \ +--vocab_size $vocab_size \ +--sample_size $sample_size \ +--use_auth_token diff --git a/scripts/lang_adapt/scripts/train_tokenizer_update.sh b/scripts/lang_adapt/scripts/train_tokenizer_update.sh new file mode 100644 index 0000000..4c08242 --- /dev/null +++ b/scripts/lang_adapt/scripts/train_tokenizer_update.sh @@ -0,0 +1,17 @@ +#!/bin/bash + +# Ask for the GPU partition and 1 GPU +#SBATCH --partition=cpu + +# Use more memory (10GB) (CPU RAM): +#SBATCH --mem=50g + + + +bs_dir=/tmp-network/user/vnikouli/Projects/bigscience +lng=$1 +sample_size=$2 +vocab_size=$3 +source $bs_dir/multilingual-modeling/scripts/env/bin/activate +python tokenized4clm_sampled.py --lang $lng --tokenizer_dir $bs_dir/tokenizers --hf_cache_dir $bs_dir/data --vocab_size $vocab_size --sample_size $sample_size --extend_vocab + diff --git a/scripts/lang_adapt/tokenized4clm_sampled.py b/scripts/lang_adapt/tokenized4clm_sampled.py new file mode 100644 index 0000000..abab025 --- /dev/null +++ b/scripts/lang_adapt/tokenized4clm_sampled.py @@ -0,0 +1,87 @@ +import torch +import datasets +import transformers +from transformers import AutoTokenizer +from datasets import load_dataset +import pathlib + +import argparse +import sys + +import logging +logger = logging.getLogger(__name__) +logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + handlers=[logging.StreamHandler(sys.stdout)], + ) +log_level = -1 +logger.setLevel(log_level) +datasets.utils.logging.set_verbosity(log_level) +transformers.utils.logging.set_verbosity(log_level) +transformers.utils.logging.enable_default_handler() +transformers.utils.logging.enable_explicit_format() +tok_logger = transformers.utils.logging.get_logger("transformers.tokenization_utils_base") + + +parser = argparse.ArgumentParser() +parser.add_argument('--lang', type=str, required=True) +parser.add_argument('--model', type=str, required=True) +parser.add_argument('--tokenizer_dir', type=str, required=True) +parser.add_argument('--tok_strategy', type=str, choices=["replace", "extend", "overlap-replace"] ,required=True) +parser.add_argument('--hf_cache_dir', default="~/.cache/huggingface/transformers", type=str) +parser.add_argument('--vocab_size', default=24_000, type=int) +parser.add_argument('--sample_size', default=100_000, type=int) +parser.add_argument("--use_auth_token", default=False, action="store_true") +parser.add_argument("--seed", default=42, type=int) + +args = parser.parse_args() +lang = args.lang + +if 
args.sample_size: + raw_datasets = load_dataset( + "oscar", + f"unshuffled_deduplicated_{lang}", + cache_dir=args.hf_cache_dir + )["train"].shuffle(seed=args.seed).select(range(args.sample_size)) + +else: + raw_datasets = load_dataset( + "oscar", + f"unshuffled_deduplicated_{lang}", + cache_dir=args.hf_cache_dir + )["train"] + +print(f"✅ Loaded raw_datasets OSCAR language {lang}") + +def batch_iterator(): + global unique_toks + batch_size = 1000 + for i in range(0, len(raw_datasets), batch_size): + sample = raw_datasets[i : i + batch_size]["text"] + unique_toks = unique_toks.union(set(" ".join(sample).split(" "))) + yield sample + +unique_toks = set() +model_name = pathlib.Path(args.model).parts[-1] + +if args.tok_strategy == 'extend': + # Yong: have checked that added tokens would have indices after the original vocab size. + tokenizer = AutoTokenizer.from_pretrained(args.model) + assert tokenizer.is_fast + new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=args.vocab_size) + print("✅ Trained tokenizer with len ", len(new_tokenizer)) + added = tokenizer.add_tokens([tok for tok in new_tokenizer.vocab.keys()]) + print([tok for tok in new_tokenizer.vocab.keys()]) + print(f"Overlap with previous vocab: {args.vocab_size - added}") + tokenizer.save_pretrained(f"{args.tokenizer_dir}") + print(f"Saved tokenizer to {args.tokenizer_dir}") + +elif args.tok_strategy in ('replace', 'overlap-replace'): + tokenizer = AutoTokenizer.from_pretrained(args.model, use_auth_token=args.use_auth_token) + assert tokenizer.is_fast + new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=args.vocab_size) + print("Unique toks, ", len(unique_toks)) + print("✅ Trained tokenizer with len ", len(new_tokenizer)) + new_tokenizer.save_pretrained(f"{args.tokenizer_dir}") + print(f"Saved tokenizer to {args.tokenizer_dir}") diff --git a/scripts/requirements.txt b/scripts/requirements.txt deleted file mode 100644 index a4e486a..0000000 --- a/scripts/requirements.txt +++ /dev/null @@ -1,133 +0,0 @@ -absl-py==0.14.0 -anyio==3.3.1 -argcomplete==1.12.3 -argon2-cffi==21.1.0 -attrs==21.2.0 -Babel==2.9.1 -backcall==0.2.0 -bleach==4.1.0 -cachetools==4.2.2 -certifi==2021.5.30 -cffi==1.14.6 -charset-normalizer==2.0.4 -click==8.0.1 -configparser==5.0.2 -datasets==1.11.0 -debugpy==1.4.3 -decorator==5.0.9 -defusedxml==0.7.1 -dill==0.3.4 -docker-pycreds==0.4.0 -entrypoints==0.3 -filelock==3.0.12 -fsspec==2021.8.1 -gitdb==4.0.7 -GitPython==3.1.24 -google-auth==1.35.0 -google-auth-oauthlib==0.4.6 -grpcio==1.41.0 -huggingface-hub==0.0.16 -idna==3.2 -importlib-metadata==4.8.1 -ipykernel==6.4.1 -ipython==7.27.0 -ipython-genutils==0.2.0 -ipywidgets==7.6.4 -jedi==0.18.0 -Jinja2==3.0.1 -joblib==1.0.1 -json5==0.9.6 -jsonschema==3.2.0 -jupyter==1.0.0 -jupyter-client==7.0.2 -jupyter-console==6.4.0 -jupyter-core==4.7.1 -jupyter-server==1.11.0 -jupyterlab==3.1.11 -jupyterlab-pygments==0.1.2 -jupyterlab-server==2.8.1 -jupyterlab-widgets==1.0.1 -lxml==4.6.3 -Markdown==3.3.4 -MarkupSafe==2.0.1 -matplotlib-inline==0.1.3 -mistune==0.8.4 -multiprocess==0.70.12.2 -nbclassic==0.3.1 -nbclient==0.5.4 -nbconvert==6.1.0 -nbformat==5.1.3 -nest-asyncio==1.5.1 -notebook==6.4.3 -numpy==1.21.2 -oauthlib==3.1.1 -packaging==21.0 -pandas==1.3.2 -pandocfilters==1.4.3 -parso==0.8.2 -pathtools==0.1.2 -pexpect==4.8.0 -pickleshare==0.7.5 -Pillow==8.3.2 -prometheus-client==0.11.0 -promise==2.3 -prompt-toolkit==3.0.20 -protobuf==3.18.0 -psutil==5.8.0 -ptyprocess==0.7.0 -pyarrow==5.0.0 -pyasn1==0.4.8 -pyasn1-modules==0.2.8 
-pycparser==2.20 -Pygments==2.10.0 -pyparsing==2.4.7 -pyrsistent==0.18.0 -python-dateutil==2.8.2 -python-dotenv==0.19.0 -pytz==2021.1 -PyYAML==5.4.1 -pyzmq==22.2.1 -qtconsole==5.1.1 -QtPy==1.11.0 -regex==2021.8.28 -requests==2.26.0 -requests-oauthlib==1.3.0 -requests-unixsocket==0.2.0 -rsa==4.7.2 -sacremoses==0.0.45 -scikit-learn==0.24.2 -scipy==1.7.1 -Send2Trash==1.8.0 -sentry-sdk==1.4.2 -shortuuid==1.0.1 -six==1.16.0 -sklearn==0.0 -smmap==4.0.0 -sniffio==1.2.0 -subprocess32==3.5.4 -tensorboard==2.6.0 -tensorboard-data-server==0.6.1 -tensorboard-plugin-wit==1.8.0 -termcolor==1.1.0 -terminado==0.12.1 -testpath==0.5.0 -threadpoolctl==2.2.0 -tokenizers==0.10.3 -torch==1.9.0+cu111 -torchaudio==0.9.0 -torchvision==0.10.0+cu111 -tornado==6.1 -tqdm==4.62.2 -traitlets==5.1.0 -transformers @ git+https://github.com/huggingface/transformers@010965dcde8ce9526f6a7e6e2c3f36276c153708 -typing-extensions==3.10.0.2 -urllib3==1.26.6 -wandb==0.12.2 -wcwidth==0.2.5 -webencodings==0.5.1 -websocket-client==1.2.1 -Werkzeug==2.0.1 -widgetsnbextension==3.5.1 -xxhash==2.0.2 -yaspin==2.1.0 -zipp==3.5.0