+We have added to the content of this corpus by selecting a subset of over 100k Welsh sentences from Facebook's CoVoST corpus of machine-translated English Common Voice sentences. This subset (originally intended to serve as recording prompts) was created by filtering out sentences that exceeded 15 words, contained digits, acronyms or abbreviations, or contained words not found in the Bangor Welsh Lexicon (with some exceptions). See https://github.com/techiaith/brawddegau-adnabod-lleferydd/blob/master/data/covost/README.md for more details. As these sentences were not originally written in Welsh, we have kept them separate in a second file, cy_covost_subset.txt, so you may decide whether or not to use them depending on your specific aims. Although these are machine-translated sentences, a sample of the texts was reviewed by human editors, who found that fewer than 5% of the sentences were problematic (a figure that compares well with original Welsh texts found on the web). We have found these sentences useful because they contain a selection of topics, grammatical tenses and persons that are otherwise difficult to find in freely licensed texts. As a result, whilst we do not recommend using the cy_covost_subset.txt texts for cultural and sociolinguistic analysis of the Welsh language, we believe that they are valuable for training monolingual Welsh language models where there would otherwise be insufficient original Welsh text available.
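The filtering criteria described above can be illustrated with a minimal sketch. This is not the script used to build cy_covost_subset.txt (see the linked README for that); the function name `keep_sentence`, the heuristics for spotting acronyms and abbreviations, and the representation of the lexicon as a plain Python set of lowercase words are all assumptions made for illustration.

```python
import re

MAX_WORDS = 15  # sentences longer than this are dropped

# Runs of letters, optionally joined by apostrophes (e.g. "mae'n").
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def keep_sentence(sentence, lexicon, exceptions=frozenset()):
    """Return True if the sentence passes all four (assumed) filters.

    lexicon and exceptions are sets of lowercase words; in practice the
    lexicon would be loaded from the Bangor Welsh Lexicon.
    """
    sentence = sentence.strip()

    # 1. Length: no more than MAX_WORDS whitespace-separated tokens.
    if len(sentence.split()) > MAX_WORDS:
        return False

    # 2. Digits: reject any sentence containing a digit.
    if any(ch.isdigit() for ch in sentence):
        return False

    words = WORD_RE.findall(sentence)

    # 3a. Acronyms (heuristic): a multi-letter word written in capitals.
    if any(len(w) > 1 and w.isupper() for w in words):
        return False

    # 3b. Abbreviations (heuristic): a full stop that is not
    #     sentence-final punctuation, as in "Dr. Jones".
    if "." in sentence.rstrip(".!?"):
        return False

    # 4. Lexicon: every word must be known, or explicitly excepted.
    return all(
        w.lower() in lexicon or w.lower() in exceptions for w in words
    )


if __name__ == "__main__":
    toy_lexicon = {"mae", "hi", "yn", "bwrw", "glaw", "heddiw"}
    for s in ["Mae hi yn bwrw glaw heddiw.", "Mae hi yn bwrw glaw yn 2020."]:
        print(keep_sentence(s, toy_lexicon), s)
```

The heuristics are deliberately strict: a sentence that merely looks as if it contains an abbreviation is discarded, which suits prompt selection, where losing some good sentences costs less than keeping hard-to-read ones.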