Abstract
The Little Data That Could is about low-resource natural language processing: processing text in various ways (for example, translation) when little data (i.e., annotated text) is available. Natural language processing is, in essence, the application of computer algorithms to natural languages (the languages we speak, as opposed to programming languages), which gives rise to systems such as spelling and grammar checkers, automatic translation systems, and question-answering systems. These tasks typically require a lot of data, but for many languages around the world there simply are not enough resources available. In this thesis, we explore various ways to make the most of the little data that is available. We find that two techniques are particularly useful in this setting. The first transfers information from higher-resource languages to lower-resource ones. The second uses models that take individual characters (i.e., letters) as building blocks, rather than the more standard approach of using words as the basic unit. Both methods can substantially improve results for languages for which very little data is available.
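To make the character-level idea concrete, the following minimal Python sketch (a generic illustration, not the models developed in the thesis) contrasts a word-level vocabulary with a character-level one: with very little training text, unseen words fall back to an unknown token at the word level, whereas the character-level vocabulary can still encode them.

```python
# Hypothetical illustration: word-level vs. character-level vocabularies
# built from a tiny amount of "training" text.

training_text = "the little data that could"
test_text = "little models could help"

# Word-level: any word not seen during training maps to a single <unk> token.
word_vocab = {w: i for i, w in enumerate(sorted(set(training_text.split())), start=1)}
word_vocab["<unk>"] = 0

def encode_words(text):
    return [word_vocab.get(w, word_vocab["<unk>"]) for w in text.split()]

# Character-level: the vocabulary is tiny and covers almost any new word.
char_vocab = {c: i for i, c in enumerate(sorted(set(training_text)), start=1)}
char_vocab["<unk>"] = 0

def encode_chars(text):
    return [char_vocab.get(c, char_vocab["<unk>"]) for c in text]

print(encode_words(test_text))  # "models" and "help" both become <unk>
print(encode_chars(test_text))  # every character is still representable
```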
Original language | English
---|---
Qualification | Doctor of Philosophy
Awarding Institution |
Supervisors/Advisors |
Award date | 15-Feb-2024
Place of Publication | [Groningen]
Publisher |
DOIs |
Publication status | Published - 2024