The Little Data That Could: Making the Most of Low-Resource Natural Language Processing

Lukas Edman




    The Little Data That Could concerns low-resource natural language processing: processing text in various ways (for example, translation) when little data (such as annotated text available online) exists for the language in question. Natural language processing essentially applies computer algorithms to natural languages (the languages we speak, as opposed to programming languages), giving rise to systems such as spelling and grammar checkers, automatic translation systems, and question-answering systems. These tasks typically require a lot of data, but for many languages around the world there simply are not enough resources available. In the thesis, we explore various ways to make the most of the little data available and find that two techniques can be very useful in this setting. The first transfers information from higher-resource languages to lower-resource ones. The second uses models that take individual characters (i.e. letters) as building blocks, rather than the more standard approach of using words as the basic unit. These two methods can substantially improve results for languages for which very little data is available.
    Original language: English
    Qualification: Doctor of Philosophy
    Awarding institution
    • Rijksuniversiteit Groningen
    • van Noord, Gertjan, Supervisor
    • Toral Ruiz, Antonio, Co-supervisor
    Date of award: 15 Feb 2024
    Place of publication: [Groningen]
    Status: Published - 2024
