Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

Lukas Edman*, Antonio Toral Ruiz, Gertjan van Noord

*Corresponding author for this work

    Research output: Academic, peer-reviewed

    1 Citation (Scopus)
    1140 Downloads (Pure)


    Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard WORD2VEC when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.
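The dependency-based embeddings mentioned in the abstract differ from standard WORD2VEC in which (word, context) pairs are fed to the training objective: instead of a linear window, contexts come from dependency arcs. A minimal, self-contained sketch of this contrast (a hypothetical illustration in the style of Levy and Goldberg's dependency-based embeddings, not the paper's actual code) follows; the toy sentence, `heads`, and `labels` are hand-annotated assumptions:

```python
def window_contexts(tokens, window=2):
    """(word, context) pairs from a symmetric linear window,
    as used by standard word2vec."""
    pairs = []
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs

def dependency_contexts(tokens, heads, labels):
    """(word, context) pairs from dependency arcs: each word pairs with
    its head (typed by the relation label), and the head pairs with the
    dependent via the inverse relation."""
    pairs = []
    for i, word in enumerate(tokens):
        h = heads[i]
        if h is None:  # the root has no head
            continue
        pairs.append((word, f"{tokens[h]}/{labels[i]}"))
        pairs.append((tokens[h], f"{word}/{labels[i]}-1"))
    return pairs

# Toy sentence with a hand-made dependency parse (assumed annotation).
tokens = ["scientist", "discovers", "star"]
heads = [1, None, 1]              # "discovers" is the root
labels = ["nsubj", "root", "obj"]

print(window_contexts(tokens))
print(dependency_contexts(tokens, heads, labels))
```

With a linear window, "scientist" and "star" become contexts of each other purely by proximity; with dependency contexts, each word only pairs with words it is syntactically related to, which is one intuition for why the two representations are complementary.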
    Original language: English
    Title: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
    Editors: André Martins, Helena Moniz, Sara Fumega, Bruno Martins, Fernando Batista, Luisa Coheur, Carla Parra, Isabel Trancoso, Marco Turchi, Arianna Bisazza, Joss Moorkens, Ana Guerberof, Mary Nurminen, Lena Marg, Mikel L. Forcada
    Place of publication: Lisbon
    Number of pages: 10
    Electronic ISBN: 978-989-33-0589-8
    Status: Published - 2020
    Event: 22nd Annual Conference of the European Association for Machine Translation - Online Conference
    Duration: 3 Nov 2020 – 5 Nov 2020


    Conference: 22nd Annual Conference of the European Association for Machine Translation
    Short title: EAMT 2020
