Improving Language Models by Jointly Modelling Language as Distributions over Characters and Bigrams
More information on this project is available on my website and in the exposé.
This is a work in progress; PRs will not be accepted for now.
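The repository implements the model itself; as a rough illustration of the idea in the title, here is a minimal sketch of a character-level LSTM with two softmax heads, one over characters (unigrams) and one over bigrams, trained with a summed cross-entropy. The class and function names and the exact loss wiring are my assumptions, not the repo's API.

```python
import torch.nn as nn

class JointCharBigramLM(nn.Module):
    """Minimal sketch: one LSTM trunk, two output heads.

    The unigram head predicts the next character; the bigram head
    predicts the next character bigram. Default sizes mirror the
    small Wikitext-2 setup below (embedding 128, hidden 128, 2 layers).
    """

    def __init__(self, n_chars, n_bigrams, emsize=128, nhid=128, nlayers=2):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emsize)
        self.lstm = nn.LSTM(emsize, nhid, nlayers)
        self.unigram_head = nn.Linear(nhid, n_chars)
        self.bigram_head = nn.Linear(nhid, n_bigrams)

    def forward(self, chars, hidden=None):
        # chars: (bptt, batch) tensor of character indices
        output, hidden = self.lstm(self.embed(chars), hidden)
        return self.unigram_head(output), self.bigram_head(output), hidden

def joint_loss(uni_logits, bi_logits, uni_targets, bi_targets):
    """Assumed joint objective: plain sum of the two cross-entropies."""
    ce = nn.CrossEntropyLoss()
    return (ce(uni_logits.reshape(-1, uni_logits.size(-1)), uni_targets.reshape(-1))
            + ce(bi_logits.reshape(-1, bi_logits.size(-1)), bi_targets.reshape(-1)))
```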
| Model | Embedding Size | Hidden Size | BPTT | Batch Size | Epochs | Layers | Dataset | LR | N-Gram | Test PPL | Test BPC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CL LSTM | 128 | 128 | 35 | 50 | 30 | 2 | Wikitext-2 | 20 (1/4 decay) | 1 | 3.76 | 1.91 |
| N-Gram CL LSTM | 128 | 128 | 35 | 50 | 30 | 2 | Wikitext-2 | 20 (1/4 decay) | 1 | 3.72 | 1.89 |
| N-Gram CL LSTM | 128 | 128 | 35 | 50 | 30 | 2 | Wikitext-2 | 20 (1/4 decay) | 2 | 11.72 | 8.12 |
| N-Gram CL LSTM | 128 | 128 | 35 | 50 | 30 | 2 | Wikitext-2 | 20 (1/4 decay) | 2 | 1.96 (only unigrams) | 0.47 (only unigrams) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N-Gram CL LSTM | 512 | 512 | 200 | 50 | 34 | 2 | Wikitext-103 | 20 (1/4 decay) | 2 | 7.96 | 2.98 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N-Gram CL LSTM | 400 | 1840 | 200 | 128 | 23 | 3 | enwik8 | 10 (1/4 decay) | 2 | 1.63 | 0.69 |
| Paper LSTM 1 2 | 400 | 1840 | 200 | 128 | 50 | 3 | enwik8 | 0.001 (1/10 decay) | 1 | - | 1.232 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N-Gram CL LSTM | 200 | 1000 | 150 | 128 | 500 | 3 | ptb | 4 (1/4 decay) | 2 | 8.01 | 3.00 |
| N-Gram CL LSTM | 200 | 1000 | 150 | 128 | 500 | 3 | ptb | 4 (1/4 decay) | 2 | 1.60 (only unigrams) | 0.68 (only unigrams) (no optimizer) |
| N-Gram CL LSTM | 200 | 1000 | 150 | 128 | 69 | 3 | ptb | 0.002 | | 1.56 (only unigrams) | 0.64 (only unigrams) |
| Paper LSTM 1 2 | 400 | 1000 | 150 | 128 | 500 | 3 | ptb | 0.001 (1/10 decay) | 1 | - | 1.232 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N-Gram CL LSTM | 256 | 1024 | 250 | 100 | 500 | 2 | obw | 0.001 (1/10 decay) | 1 | | |
| N-Gram CL LSTM | 256 | 1024 | 250 | 100 | 500 | 2 | obw | 0.001 (1/10 decay) | 2 | | |
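For the character-level (N-Gram 1) rows, Test BPC and Test PPL are related by BPC = log2(PPL), since the perplexity is per character; a quick sanity check of that identity against the first Wikitext-2 row (my arithmetic, not a repo output):

```python
import math

# BPC = log2(PPL) when the perplexity is per character.
# First Wikitext-2 row: PPL 3.76 -> BPC ~1.91, as reported above.
print(math.log2(3.76))  # 1.9108...
```

The identity does not line up for the N-Gram 2 rows, which presumably use a different normalization.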
Investigate the dictionary size, model size, memory usage, and training time for different n-gram orders and UNK thresholds (a sketch of the dictionary construction follows the table below).
Hyperparameters:
- Hidden size: 128
- Embedding size (emsize): 128
- Layers (nlayers): 2
- Batch size: 50
- Learning rate (lr): 20
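Read together with the sketch at the top of this README, these settings would wire up roughly as follows; this is hypothetical, with the dictionary sizes taken from the Wiki-2 rows of the table below:

```python
import torch

# Hypothetical wiring of the hyperparameters above into the sketch model
# defined at the top of this README (JointCharBigramLM). Dictionary sizes
# come from the table below: 1156 characters at UNK 0, 1751 bigrams at UNK 20.
model = JointCharBigramLM(n_chars=1156, n_bigrams=1751,
                          emsize=128, nhid=128, nlayers=2)
# Plain SGD at lr 20; the "(1/4 decay)" in the results table suggests the
# learning rate is divided by 4 on some schedule (exact trigger unknown).
optimizer = torch.optim.SGD(model.parameters(), lr=20)
```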
| Dataset | UNK Threshold | N-Gram | Sec/Epoch | Dictionary Size | GPU Memory Usage | Params |
|---|---|---|---|---|---|---|
| Wiki-2 | 0 | 1 | 43 | 1156 | 2553 MiB | 561284 |
| Wiki-2 | 5 | 2 | 67 | 2754 | 2817 MiB | 971713 |
| Wiki-2 | 20 | 2 | 53 | 1751 | 2761 MiB | 714199 |
| Wiki-2 | 40 | 2 | 48 | 1444 | 2763 MiB | 635300 |
| Wiki-2 | 40 | 3 | 143 | 9827 | 3301 MiB | 2789721 |
| Wiki-2 | 40 | 4 | 412 | 34124 | 4723 MiB | 9034060 |
| Wiki-2 | 40 | 5 | 782 | 72745 | 6765 MiB | 18959657 |
| Wiki-2 | 100 | 10 | 965 | 90250 | 8635 MiB | 23458442 |
| ptb | 20 | 2 | 31 | 704 | 4893 MiB | 21669504 |
| enwik8 | 20 | 2 | 4493 | | 13 GiB | 120M |
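The dictionary growth in this table (e.g. 1444 bigrams vs. 72745 5-grams on Wiki-2 at UNK 40) follows directly from counting character n-grams and cutting at the UNK threshold. A sketch of that construction, assuming a plain frequency cutoff (function name and exact cutoff semantics are mine, not the repo's):

```python
from collections import Counter

def build_ngram_dict(text, n, unk_threshold):
    """Collect all character n-grams and keep those seen more often than
    the UNK threshold; everything else maps to <unk>.

    Higher n or a lower threshold yields a larger dictionary, which in
    turn grows the embedding and softmax layers (the Params column)."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    vocab = {"<unk>": 0}
    for gram, count in counts.items():
        if count > unk_threshold:
            vocab[gram] = len(vocab)
    return vocab

# e.g. on the Wikitext-2 training text, len(build_ngram_dict(train_text, 2, 40))
# should be on the order of the 1444 entries reported above.
```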
TODO:
- Remove Corpus from generate.py
- Make the C++ extension work
- Build a Flair language model
- Compare epoch times between the bigram and unigram models
