| Text source | Information | 
|---|---|
| "Alice in Wonderland" | Alice in Wonderland (Ch.1) | 
| "Romeo and Juliet" | Romeo and Juliet | 
| "Bhagavad Gita" | Bhagavad Gita | 
| "Memento screenplay" | Memento screenplay | 
| "100K tweets" | 100,000 tweets from the Sentiment140 dataset (training data) | 
| "20K tweets" | 20,000 tweets from Gender Classifier Data | 
| "MASC tweets" | MASC tweets (cleaned of HTML markup) | 
| "MASC spoken" | MASC spoken transcripts (phone and face-to-face: 25,783 words) | 
| "COCA blogs" | Corpus of Contemporary American English blog samples | 
| "Google website" | Google homepage (accessed 10/20/2020) | 
| "Software languages" | "Tower of Hanoi" (programming languages A-Z from Rosetta Code) | 
| "Monkey text" | Ian Douglas's English-generated monkey0-7.txt corpus | 
| "Coder text" | Ian Douglas's software-generated coder0-7.txt corpus | 
| "iweb cleaned corpus" | First 150,000 lines of Shai Coleman's iweb-corpus-samples-cleaned.txt | 

Reference for Monkey and Coder texts: Douglas, Ian. (2021, March 28). Keyboard Layout Analysis: Creating the Corpus, Bigram Chains, and Shakespeare's Monkeys (Version 1.0.0). Zenodo. http://doi.org/10.5281/zenodo.4642460