The Little Data That Could: Making the Most of Low-Resource Natural Language Processing

Lukas Edman

    Research output: Thesis; Thesis fully internal (DIV)


    The Little Data That Could concerns low-resource natural language processing, that is, processing text in various ways (for example, translation) when little data (i.e. annotated text available online) exists for the language in question. Natural language processing essentially means applying computer algorithms to natural languages (the languages we speak, as opposed to programming languages); it gives rise to systems such as spelling and grammar checkers, automatic translation systems, and question-answering systems. These tasks typically require a lot of data, but for many languages around the world there simply are not enough resources available. In the thesis, we explore various ways to make the most of the little data available, and we find that two techniques can be very useful in this setting. The first involves transferring information from higher-resource languages to lower-resource ones. The second concerns models that use individual characters (i.e. letters) as building blocks, rather than the more standard approach of using words as the basic unit. These two methods can substantially improve results for languages for which very little data is available.
    Original language: English
    Qualification: Doctor of Philosophy
    Awarding Institution
    • University of Groningen
    Supervisors
    • van Noord, Gertjan, Supervisor
    • Toral Ruiz, Antonio, Co-supervisor
    Award date: 15-Feb-2024
    Place of Publication: [Groningen]
    Publication status: Published - 2024
