Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

Lukas Edman*, Antonio Toral Ruiz, Gertjan van Noord

*Corresponding author for this work

    Research output: Academic, peer-reviewed

    Abstract

    Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard WORD2VEC when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.
    Original language: English
    Title: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
    Editors: André Martins, Helena Moniz, Sara Fumega, Bruno Martins, Fernando Batista, Luisa Coheur, Carla Parra, Isabel Trancoso, Marco Turchi, Arianna Bisazza, Joss Moorkens, Ana Guerberof, Mary Nurminen, Lena Marg, Mikel L. Forcada
    Place of publication: Lisbon
    Pages: 81-90
    Number of pages: 10
    ISBN (electronic): 978-989-33-0589-8
    Status: Published - 2020
    Event: 22nd Annual Conference of the European Association for Machine Translation - Online Conference
    Duration: 3 Nov 2020 to 5 Nov 2020
    https://eamt2020.inesc-id.pt/

    Conference

    Conference: 22nd Annual Conference of the European Association for Machine Translation
    Abbreviated title: EAMT 2020
    Period: 03/11/2020 to 05/11/2020