Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

Lukas Edman*, Antonio Toral Ruiz, Gertjan van Noord

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

    Abstract

    Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard WORD2VEC when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.
    Original language: English
    Title of host publication: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
    Editors: André Martins, Helena Moniz, Sara Fumega, Bruno Martins, Fernando Batista, Luisa Coheur, Carla Parra, Isabel Trancoso, Marco Turchi, Arianna Bisazza, Joss Moorkens, Ana Guerberof, Mary Nurminen, Lena Marg, Mikel L. Forcada
    Place of Publication: Lisbon
    Pages: 81-90
    Number of pages: 10
    ISBN (Electronic): 978-989-33-0589-8
    Publication status: Published - 2020
    Event: 22nd Annual Conference of the European Association for Machine Translation - Online Conference
    Duration: 3-Nov-2020 to 5-Nov-2020
    https://eamt2020.inesc-id.pt/

    Conference

    Conference: 22nd Annual Conference of the European Association for Machine Translation
    Abbreviated title: EAMT 2020
    Period: 03/11/2020 to 05/11/2020