This paper describes the methods behind the systems submitted by the University of Groningen for the WMT 2020 Unsupervised Machine Translation task for German--Upper Sorbian. We investigate the usefulness of data selection in the unsupervised setting. We find that we can perform data selection using a pretrained model and show that the quality of a set of sentences or documents can have a great impact on the performance of the UNMT system trained on it. Furthermore, we show that document-level data selection should be preferred for training the XLM model when possible. Finally, we show that there is a trade-off between quality and quantity of the data used to train UNMT systems.
|Title||Proceedings of the Fifth Conference on Machine Translation (WMT)|
|Publisher||Association for Computational Linguistics, ACL Anthology|
|Status||Published - Nov 2020|
|Event||Fifth Conference on Machine Translation - Online|
Duration: 19 Nov 2020 → 20 Nov 2020
|Conference||Fifth Conference on Machine Translation|
|Period||19/11/2020 → 20/11/2020|