Abstract
This paper describes the methods behind the systems submitted by the University of Groningen for the WMT 2020 Unsupervised Machine Translation task for German–Upper Sorbian. We investigate the usefulness of data selection in the unsupervised setting. We find that we can perform data selection using a pretrained model and show that the quality of a set of sentences or documents can have a great impact on the performance of the UNMT system trained on it. Furthermore, we show that document-level data selection should be preferred for training the XLM model when possible. Finally, we show that there is a trade-off between quality and quantity of the data used to train UNMT systems.
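The abstract refers to selecting training data with a pretrained model. As a minimal illustrative sketch only (the model name, the perplexity-based scoring, and the keep-ratio below are assumptions for illustration, not the recipe used in the paper), sentence-level selection could rank candidate sentences by language-model perplexity and keep the most fluent portion:

```python
# Hedged sketch: rank sentences by pretrained-LM perplexity and keep the best ones.
# "gpt2" is a stand-in model; a German or multilingual LM would be needed in practice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(sentence: str, model, tokenizer) -> float:
    """Score one sentence by LM perplexity (lower = more fluent / in-domain)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()


def select_sentences(sentences, keep_ratio=0.5, model_name="gpt2"):
    """Keep the keep_ratio fraction of sentences with the lowest perplexity."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    scored = sorted(sentences, key=lambda s: perplexity(s, model, tokenizer))
    return scored[: max(1, int(len(scored) * keep_ratio))]


if __name__ == "__main__":
    corpus = [
        "Dies ist ein gut geformter deutscher Satz.",
        "asdf qwer 1234 !!!",
    ]
    print(select_sentences(corpus, keep_ratio=0.5))
```

The same scoring could be averaged over all sentences of a document to select whole documents instead, which would be in the spirit of the paper's finding that document-level selection is preferable for XLM pretraining; the exact selection criterion used by the authors is described in the paper itself.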
Original language | English |
---|---|
Title | Proceedings of the Fifth Conference on Machine Translation (WMT) |
Publisher | Association for Computational Linguistics, ACL Anthology |
Pages | 1099-1103 |
Number of pages | 5 |
Status | Published - Nov 2020 |
Event | Fifth Conference on Machine Translation - Online. Duration: 19 Nov 2020 → 20 Nov 2020 |
Conference
Conference | Fifth Conference on Machine Translation |
---|---|
Abbreviated title | WMT20 |
Period | 19/11/2020 → 20/11/2020 |