Set up your environment:
conda create -n abe python=3.11
conda activate abe
pip install -r requirements.txt
Input streams are tab-separated JSON dictionaries, one dictionary per model.
We can create an input stream with a few simple bash commands:
echo "This is a test." | python ensembling/build/bilingual-no-tags > input.1
echo "This is a test." | python ensembling/build/src-tgt eng_Latn deu_Latn > input.2
Then we can paste these files together and pipe them into our ensembling code:
paste input.1 input.2 | python ensembling/ensemble.py -m rewicks/baseline_en-de_64k_ep25 facebook/nllb-200-distilled-600M -l 256 beam
You can run with the flag -d to see the beams at each time step.
paste input.1 input.2 | python ensembling/ensemble.py -m rewicks/baseline_en-de_64k_ep25 facebook/nllb-200-distilled-600M -l 256 -d beam
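Each line of the pasted stream carries one tab-separated JSON dictionary per model (one column per model passed to `-m`). As a rough, hypothetical sketch of how such a stream can be parsed (the field names inside each dictionary depend on which build script produced it):

```python
import json
import sys

# Read one line per input segment; each line holds one JSON dictionary per model,
# separated by tabs. Column i corresponds to the i-th model passed via -m.
for line in sys.stdin:
    columns = line.rstrip("\n").split("\t")
    inputs = [json.loads(column) for column in columns]
    # `inputs` is now a list of dicts, one per model; the keys inside each dict
    # depend on which ensembling/build/ script produced that column.
    print(inputs)
```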
- A small note on running: you need a flag between `-m` (a `nargs='*'` argument) and `beam`, or you will get an `argparse` error:
ensemble.py: error: the following arguments are required: command
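This is standard `argparse` behavior: a positional argument that immediately follows a `nargs='*'` option gets swallowed by that option. A minimal reproduction of the behavior (not the actual `ensemble.py` argument parser; the option names below are illustrative):

```python
import argparse

parser = argparse.ArgumentParser(prog="ensemble.py")
parser.add_argument("-m", "--models", nargs="*")   # greedy: consumes everything after -m
parser.add_argument("-l", "--max-length", type=int, default=256)
parser.add_argument("command")                      # positional, e.g. "beam"

# Fails: "beam" is consumed by -m, so the positional `command` is left unfilled:
# parser.parse_args(["-m", "model_a", "model_b", "beam"])
# -> ensemble.py: error: the following arguments are required: command

# Works: another flag (here -l) terminates the -m list before the positional.
args = parser.parse_args(["-m", "model_a", "model_b", "-l", "256", "beam"])
print(args.command)  # -> "beam"
```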
All code for our ensembling method can be found in the `ensembling` directory. The important files are:
- `ensemble.py`, which contains the main function;
- `models.py`, which has the model wrappers that let each model maintain its own hidden state;
- `search.py`, which has our cube-pruning-esque search algorithm;
- `utils.py`, which has helper functions for tokenization.
For all our experiments, we use WMT24 data (en-XX, but mostly en-de).
The raw inputs can be found in `refs`. These were made via commands such as:
sacrebleu -t wmt24 -l en-de --echo src > wmt24.en-de.en
sacrebleu -t wmt24 -l en-de --echo ref > wmt24.en-de.de
These inputs are unsegmented (multiple sentences per line), which can cause some machine translation models to add or remove content. To circumvent this, we first segment the files into sentences, translate, and then reconcatenate. This requires an intermediate file (the sentences with their associated line numbers), which we create using `ersatz`:
cat wmt24.en-de.en | awk '{print NR "\t" $0}' | ersatz -m en -C 1 > wmt24.en-de.en.sentences
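The line numbers in the intermediate file are what make reconcatenation possible after translation. A minimal sketch of that step, assuming a tab-separated `line_number<TAB>sentence` format on the translated side (the file names below are hypothetical):

```python
from collections import defaultdict

def reconcatenate(tsv_path: str, output_path: str) -> None:
    """Group translated sentences by their original line number and rejoin
    them so the output aligns line-for-line with the unsegmented reference."""
    lines = defaultdict(list)
    with open(tsv_path, encoding="utf-8") as f:
        for row in f:
            line_number, sentence = row.rstrip("\n").split("\t", 1)
            lines[int(line_number)].append(sentence)
    with open(output_path, "w", encoding="utf-8") as out:
        for line_number in sorted(lines):
            out.write(" ".join(lines[line_number]) + "\n")

# e.g. reconcatenate("wmt24.en-de.de.sentences.translated", "wmt24.en-de.de.hyp")
```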
Our ensembling code requires jsonl inputs. We provide several scripts to automatically create these from plain text inputs. All scripts are in `ensembling/build/`:
- `bilingual-no-tags` creates inputs for a traditional encoder-decoder model, which takes the input line as encoder input and has no additional special tags. We use these for our Marian `en-de` models.
- `empty` creates an empty input. This would be used for a traditional decoder-only model that does not take prompts.
- `prompt` creates input for both LLAMA and Tower, specifically for translation. This is highly constrained to the set of languages we cover, but we provide both 0-shot and 3-shot options. Calling looks like `echo "This is a test." | python ensembling/build/prompt llama3-0-shot English German`
- `src-tgt` creates input for both M2M and NLLB by taking the source language token and the target language token. Calling looks like `echo "This is a test." | python ensembling/build/src-tgt eng_Latn deu_Latn`
The processed inputs (jsonl) can be found in `input_data`. They are labelled by model and language pair.
Outputs that were generated by our ensembling method can be found at `translations/wmt24/$LANGUAGE_PAIR`. The `sentences` directory contains the sentence-level translations. The `targets` directory contains the concatenated translations that align to the original reference file. They were created by calling `translation.sh`.
Outputs that were generated natively (as if the model were run alone) can be found at `baselines/simple-translations/outputs`. As above, the `sentences` directory contains the sentence-level translations, while the `targets` directory contains the concatenated translations that align to the original reference file. These files were created by calling `baselines/simple-translations/scripts/*.py` (the exact script depends on the model).
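The native baselines amount to running each model on its own with standard beam search. As a rough illustration of that for one of the Hugging Face models named above (this is only a sketch, not the repo's actual scripts):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("This is a test.", return_tensors="pt")
# NLLB selects the target language by forcing its language token as the
# first decoder token.
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"),
    num_beams=5,
    max_new_tokens=256,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```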
Outputs that were generated using a more traditional linear interpolation of the log probabilities (only for our models which guarantee the same vocabulary) can be found at `baselines/interpolation/outputs`. Again, the `sentences` directory contains the sentence-level translations, while the `targets` directory contains the concatenated translations that align to the original reference file. These files were created by calling `baselines/interpolation/interpolate-translate.py`.
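Conceptually, this baseline combines the models' next-token distributions at each decoding step; because the models share a vocabulary, their log probabilities can be interpolated directly. A minimal sketch of that combination step (not the repo's `interpolate-translate.py`; the renormalization at the end is an implementation choice):

```python
import torch

def interpolate_log_probs(log_probs_a: torch.Tensor,
                          log_probs_b: torch.Tensor,
                          weight: float = 0.5) -> torch.Tensor:
    """Linearly interpolate two next-token log-probability distributions
    over a *shared* vocabulary, then renormalize to a proper distribution."""
    combined = weight * log_probs_a + (1.0 - weight) * log_probs_b
    return torch.log_softmax(combined, dim=-1)

# At each decoding step, beam search would then score candidate tokens with
# interpolate_log_probs(step_log_probs_model_a, step_log_probs_model_b)
# instead of a single model's distribution.
```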
All scores are handled in the scoring directory. We score both BLEU and COMET.
- `bleu-scores` is generated by `bleu.py` and creates a file of BLEU scores of our ensembled outputs. The format is tsv, where the columns are `MODEL_ONE`, `MODEL_TWO`, and `BLEU_SCORE` respectively.
- `comet-scores` is generated by `comet.py` and creates a file of COMET scores of our ensembled outputs. The format is tsv, where the columns are `MODEL_ONE`, `MODEL_TWO`, and `COMET_SCORE` respectively.
- `simple-bleu-scores` is generated by `simple-bleu.py` and creates a file of BLEU scores of the individual model outputs. The format is tsv, where the columns are `MODEL` and `BLEU_SCORE`.
- `simple-comet-scores` is generated by `simple-comet.py` and creates a file of COMET scores of the individual model outputs. The format is tsv, where the columns are `MODEL` and `COMET_SCORE`.
- `interpolate-bleu-scores` is generated by `interpolate-bleu.py` and creates a file of BLEU scores of the models ensembled via linear interpolation of the log probs. The format is tsv, where the columns are `MODEL_ONE`, `MODEL_TWO`, and `BLEU_SCORE` respectively.
- `interpolate-comet-scores` is generated by `interpolate-comet.py` and creates a file of COMET scores of the models ensembled via linear interpolation of the log probs. The format is tsv, where the columns are `MODEL_ONE`, `MODEL_TWO`, and `COMET_SCORE` respectively.
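Both metrics come from their standard Python packages. As a sketch of how one row of those TSVs could be computed, assuming lists of sources, hypotheses, and references are already loaded (the COMET checkpoint name below is an assumption, not necessarily what `comet.py` uses):

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

sources = ["This is a test."]
hypotheses = ["Das ist ein Test."]
references = ["Das ist ein Test."]

# BLEU via sacrebleu's Python API.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU\t{bleu.score:.2f}")

# COMET needs source, hypothesis, and reference for each segment.
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
comet = comet_model.predict(
    [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)],
    batch_size=8,
    gpus=0,  # set to 1+ if a GPU is available
)
print(f"COMET\t{comet.system_score:.4f}")
```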
The main file to edit is `utils.py`, which has a variable called `TOKENIZER_CONFIG`. This is a set of variables that tells our code how tokenization is handled. As long as the new model is reasonably similar to these standard tokenization modes, it should slot in seamlessly. Recall that we detokenize to byte strings for agreement comparison, so getting [de-]tokenization right is extremely important.
For example:
"facebook/nllb-200-distilled-600M": {
"lstrip": False,
"special_character": SPIECE_UNDERLINE,
"begin_word": True,
"byte_map": BYTE_MAP,
"add_space": True
},
is the tokenization scheme for NLLB. To add a new model, add a new key with the Hugging Face model ID.
- `lstrip`: Does the tokenizer use `lstrip` on word beginnings? I.e., if we decode `▁Hello`, does it come out as `Hello` or `[SPACE]Hello`?
- `special_character`: The whitespace special character. Common examples are `▁` and `Ġ`.
- `begin_word`: Does this special character begin the word?
- `byte_map`: The mapping from how the vocabulary stores bytes to the underlying bytes. For example, many SPM models store a byte as the string `<0xBYTE>`.
- `add_space`: Do we need to add a space to the beginning of the string? This is typically needed because the model removes spaces at the beginning of sentences.
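If you are unsure how a new model's tokenizer behaves, you can probe it directly with `transformers` before filling in these fields. A small, hypothetical inspection script (not part of the repository):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

pieces = tokenizer.tokenize("Hello world")
# Which whitespace marker is used, and does it *begin* words?
# (e.g. ['▁Hello', '▁world'] suggests special_character='▁' and begin_word=True)
print(pieces)

# Does detokenizing a word-initial piece keep its leading space or strip it (lstrip)?
print(repr(tokenizer.convert_tokens_to_string(pieces[-1:])))

# Are raw bytes stored as pieces like '<0x..>' (relevant for byte_map)?
print(tokenizer.tokenize("🙂"))
```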
If you run into problems, please contact the authors (e.g., [email protected]) or file an issue for assistance.