Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining

Nikola Ljubešić, Vít Suchomel, Peter Rupnik, Taja Kuzman, Rik van Noord

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

The world of language models is going through turbulent times: better and ever larger models are coming out at an unprecedented speed. However, we argue that, especially for the scientific community, encoder models of up to 1 billion parameters are still very much needed, their primary usage being the enrichment of large collections of data with metadata necessary for downstream research. We investigate the best way to ensure the existence of such encoder models for a set of very closely related languages (Croatian, Serbian, Bosnian and Montenegrin) by setting up a diverse benchmark for these languages and comparing models trained from scratch with new models constructed via additional pretraining of existing multilingual models. We show that performance comparable to that of dedicated from-scratch models can be obtained by additionally pretraining available multilingual models, even with a limited amount of computation. We also show that neighboring languages, in our case Slovenian, can be included in the additional pretraining with little to no loss in the performance of the final model.
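The recipe the abstract refers to, additional (continued) pretraining of an existing multilingual encoder on target-language text, can be illustrated with a minimal sketch using the Hugging Face transformers library. The base checkpoint (xlm-roberta-base), the corpus file name, and all hyperparameters below are illustrative assumptions for a generic continued-pretraining run, not the configuration used in the paper.

```python
# Minimal sketch: continued masked-language-model (MLM) pretraining of an
# existing multilingual encoder on target-language text. Checkpoint, corpus
# file, and hyperparameters are assumptions, not the paper's actual setup.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "xlm-roberta-base"  # assumed multilingual base encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical plain-text corpus in the target languages, one document per line.
dataset = load_dataset("text", data_files={"train": "hbs_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard 15% random token masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="xlmr-hbs-continued",
    per_device_train_batch_size=8,
    num_train_epochs=1,       # "limited amount of computation"
    learning_rate=5e-5,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

A checkpoint produced this way would then be fine-tuned and evaluated on downstream tasks such as those named in the keywords below (named entity recognition, sentiment analysis, causal commonsense reasoning).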

Original language: English
Title of host publication: Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
Editors: Maite Melero, Sakriani Sakti, Claudia Soria
Publisher: European Language Resources Association (ELRA)
Pages: 189-203
Number of pages: 15
ISBN (Electronic): 9782493814296
Publication status: Published - 2024
Event: 3rd Annual Meeting of the ELRA-ISCA Special Interest Group on Under-Resourced Languages, SIGUL 2024 - Turin, Italy
Duration: 21 May 2024 - 22 May 2024

Keywords

  • additional pretraining
  • causal commonsense reasoning
  • Croatian
  • named entity recognition
  • sentiment analysis
  • Serbian
