Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review


    Abstract

    Dravidian languages, such as Kannada and Tamil, are notoriously difficult for state-of-the-art neural models to translate. This stems from the fact that these languages are both morphologically very rich and low-resourced. In this paper, we focus on subword segmentation and evaluate Linguistically Motivated Vocabulary Reduction (LMVR) against the more commonly used SentencePiece (SP) for the task of translating from English into four different Dravidian languages. Additionally, we investigate the optimal subword vocabulary size for each language. We find that SP is the overall best choice for segmentation, and that larger dictionary sizes lead to higher translation quality.
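    To illustrate the kind of subword segmentation the abstract refers to: SentencePiece's BPE mode repeatedly merges the most frequent adjacent symbol pair, and the number of merges is what determines the subword vocabulary size discussed in the paper. The following is a minimal, self-contained sketch of that merge procedure on a toy corpus (the corpus, the `</w>` end-of-word marker, and the function name are illustrative assumptions, not the paper's actual pipeline):

    ```python
    from collections import Counter

    def bpe_merges(corpus, num_merges):
        """Toy byte-pair-encoding learner: repeatedly merge the most
        frequent adjacent symbol pair. More merges -> a larger subword
        vocabulary with longer units, fewer merges -> a smaller,
        more character-like vocabulary. Illustrative only; not the
        SentencePiece/LMVR setup used in the paper."""
        # Represent each word as a tuple of symbols plus an end-of-word marker,
        # weighted by its frequency in the corpus.
        vocab = Counter()
        for word in corpus.split():
            vocab[tuple(word) + ("</w>",)] += 1

        merges = []
        for _ in range(num_merges):
            # Count all adjacent symbol pairs across the (weighted) vocabulary.
            pairs = Counter()
            for symbols, freq in vocab.items():
                for a, b in zip(symbols, symbols[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break  # every word is a single symbol; nothing left to merge
            best = max(pairs, key=pairs.get)
            merges.append(best)

            # Apply the merge everywhere it occurs.
            new_vocab = Counter()
            for symbols, freq in vocab.items():
                out, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                        out.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        out.append(symbols[i])
                        i += 1
                new_vocab[tuple(out)] += freq
            vocab = new_vocab
        return merges
    ```

    In this sketch the `num_merges` parameter plays the role of the vocabulary-size knob the paper tunes: the finding that "larger dictionary sizes lead to higher translation quality" corresponds to allowing more merges, hence longer, more word-like subword units.
    
    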
    Original language: English
    Title of host publication: Proceedings of the 8th Workshop on Asian Translation (WAT2021)
    Editors: Toshiaki Nakazawa, Hideki Nakayama, Isao Goto, Hideya Mino, Chenchen Ding, Raj Dabre, Anoop Kunchukuttan, Shohei Higashiyama, Hiroshi Manabe, Win Pa Pa, Shantipriya Parida, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, Katsuhito Sudoh, Sadao Kurohashi, Pushpak Bhattacharyya
    Publisher: Association for Computational Linguistics (ACL)
    Pages: 181-190
    Number of pages: 10
    Publication status: Published - 5-Aug-2021
