Data Selection for Unsupervised Translation of German--Upper Sorbian

    Research output: Conference contribution, Academic, peer-reviewed

    Abstract

    This paper describes the methods behind the systems submitted by the University of Groningen for the WMT 2020 Unsupervised Machine Translation task for German--Upper Sorbian. We investigate the usefulness of data selection in the unsupervised setting. We find that we can perform data selection using a pretrained model and show that the quality of a set of sentences or documents can have a great impact on the performance of the UNMT system trained on it. Furthermore, we show that document-level data selection should be preferred for training the XLM model when possible. Finally, we show that there is a trade-off between quality and quantity of the data used to train UNMT systems.
    Original language: English
    Title: Proceedings of the Fifth Conference on Machine Translation (WMT)
    Publisher: Association for Computational Linguistics, ACL Anthology
    Pages: 1099-1103
    Number of pages: 5
    Status: Published - Nov 2020
    Event: Fifth Conference on Machine Translation - Online
    Duration: 19 Nov 2020 to 20 Nov 2020

    Conference

    Conference: Fifth Conference on Machine Translation
    Abbreviated title: WMT20
    Period: 19/11/2020 to 20/11/2020
