Developing Infrastructure for Low-Resource Language Corpus Building

Hedwig Sekeres, Wilbert Heeringa, Wietse De Vries, Oscar Yde Zwagers, Martijn Wieling, Goffe T.H. Jensma

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > Academic > peer-review

Abstract

For many of the world’s small languages, few resources are available. In this project, a written, online-accessible corpus was created for the minority language variant Gronings, which serves both researchers interested in language change and variation and a general audience of (new) speakers interested in finding real-life examples of language use. The corpus was created using a combination of volunteer work and automation, which together formed an efficient pipeline for converting printed text to Key Words in Context (KWICs), annotated with lemmas and part-of-speech tags. In the creation of the corpus, we have taken into account several of the challenges that can occur when creating resources for minority languages, such as a lack of standardisation and limited (financial) resources. As the solutions we offer are applicable to other small languages as well, each step of the corpus creation process is discussed, and resources will be made available to benefit future projects on other low-resource languages.

Original language: English
Title of host publication: Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
Editors: Maite Melero, Sakriani Sakti, Claudia Soria
Publisher: European Language Resources Association (ELRA)
Pages: 72-78
Number of pages: 7
ISBN (Electronic): 9782493814296
Publication status: Published - 2024
Event: 3rd Annual Meeting of the ELRA-ISCA Special Interest Group on Under-Resourced Languages, SIGUL 2024 - Turin, Italy
Duration: 21-May-2024 to 22-May-2024

Conference

Conference: 3rd Annual Meeting of the ELRA-ISCA Special Interest Group on Under-Resourced Languages, SIGUL 2024
Country/Territory: Italy
City: Turin
Period: 21/05/2024 to 22/05/2024

Keywords

  • corpus creation
  • low-resource language
  • online corpus
