Data Selection for Unsupervised Translation of German--Upper Sorbian

    Research output: Conference contribution, Academic, peer-reviewed

    This paper describes the methods behind the systems submitted by the University of Groningen for the WMT 2020 Unsupervised Machine Translation task for German--Upper Sorbian. We investigate the usefulness of data selection in the unsupervised setting. We find that we can perform data selection using a pretrained model and show that the quality of a set of sentences or documents can have a great impact on the performance of the UNMT system trained on it. Furthermore, we show that document-level data selection should be preferred for training the XLM model when possible. Finally, we show that there is a trade-off between quality and quantity of the data used to train UNMT systems.
    Original language: English
    Title: Proceedings of the Fifth Conference on Machine Translation (WMT)
    Publisher: Association for Computational Linguistics, ACL Anthology
    Number of pages: 5
    Status: Published - Nov 2020
    Event: Fifth Conference on Machine Translation - Online
    Duration: 19 Nov 2020 to 20 Nov 2020


    Conference: Fifth Conference on Machine Translation
    Abbreviated title: WMT20
