Crawl and crowd to bring machine translation to under-resourced languages

    Research output: Contribution to journal › Article › Academic › peer-review

    5 Citations (Scopus)

    Abstract

    We present a widely applicable methodology to bring machine translation
    (MT) to under-resourced languages in a cost-effective and rapid manner. Our
    proposal relies on web crawling to automatically acquire parallel data to train
    statistical MT systems if any such data can be found for the language pair and
    domain of interest. If that is not the case, we resort to (1) crowdsourcing to translate
    small amounts of text (hundreds of sentences), which are then used to tune statistical
    MT models, and (2) web crawling of vast amounts of monolingual data (millions of
    sentences), which are then used to build language models for MT. We apply these to
    two respective use-cases for Croatian, an under-resourced language that has gained
    relevance since it recently attained official status in the European Union. The first
    use-case regards tourism, given the importance of this sector to Croatia’s economy,
    while the second has to do with tweets, due to the growing importance of social
    media. For tourism, we crawl parallel data from 20 web domains using two
    state-of-the-art crawlers and explore how to combine the crawled data with bigger amounts
    of general-domain data. Our domain-adapted system is evaluated on a set of three
    additional tourism web domains and it outperforms the baseline in terms of
    automatic metrics and/or vocabulary coverage. In the social media use-case, we deal
    with tweets from the 2014 edition of the soccer World Cup. We build
    domain-adapted systems by (1) translating small amounts of tweets to be used for tuning by
    means of crowdsourcing and (2) crawling vast amounts of monolingual tweets.
    These systems outperform the baseline (Microsoft Bing) by 7.94 BLEU points (5.11
    TER) for Croatian-to-English and by 2.17 points (1.94 TER) for English-to-Croatian
    on a test set translated by means of crowdsourcing. A complementary manual
    analysis sheds further light on these results.
    Original language: English
    Pages (from-to): 1019-1051
    Number of pages: 33
    Journal: Language Resources and Evaluation
    Volume: 51
    Issue number: 4
    Publication status: Published - Dec 2017

    Keywords

    • WEB
    • CORPORA
    • CORPUS
