Abstract
This paper describes the methods behind the systems submitted by the University of Groningen for the WMT 2020 Unsupervised Machine Translation task for German--Upper Sorbian. We investigate the usefulness of data selection in the unsupervised setting. We find that data selection can be performed using a pretrained model, and we show that the quality of a set of sentences or documents can have a great impact on the performance of the unsupervised neural machine translation (UNMT) system trained on it. Furthermore, we show that document-level data selection should be preferred for training the XLM model when possible. Finally, we show that there is a trade-off between the quality and the quantity of the data used to train UNMT systems.
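The abstract mentions selecting data by scoring it with a pretrained model. As a minimal, hypothetical sketch of this general idea (the paper's actual models and scoring criteria are not detailed here), one can rank candidate sentences by their perplexity under a small language model trained on an in-domain sample and keep only the best-scoring fraction; all function names below are illustrative:

```python
import math
from collections import Counter

def train_unigram_lm(corpus):
    """Estimate add-one-smoothed unigram log-probabilities from an in-domain sample."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen tokens
    return lambda tok: math.log((counts[tok] + 1) / (total + vocab))

def perplexity(logprob, sentence):
    """Per-token perplexity of a sentence under the unigram model."""
    toks = sentence.split()
    return math.exp(-sum(logprob(t) for t in toks) / max(len(toks), 1))

def select_top(candidates, logprob, fraction=0.5):
    """Keep the lowest-perplexity (most in-domain-like) fraction of candidates."""
    ranked = sorted(candidates, key=lambda s: perplexity(logprob, s))
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]
```

In practice a much stronger pretrained model (e.g. a masked language model such as XLM) would replace the toy unigram model, but the quality/quantity trade-off the abstract describes is already visible here: shrinking `fraction` raises average data quality while reducing the amount of training data kept.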
Original language | English |
---|---|
Title of host publication | Proceedings of the Fifth Conference on Machine Translation (WMT) |
Publisher | Association for Computational Linguistics, ACL Anthology |
Pages | 1099-1103 |
Number of pages | 5 |
Publication status | Published - Nov-2020 |
Event | Fifth Conference on Machine Translation, Online, 19-Nov-2020 → 20-Nov-2020 |
Conference
Conference | Fifth Conference on Machine Translation |
---|---|
Abbreviated title | WMT20 |
Period | 19/11/2020 → 20/11/2020 |