Abstract
An important goal of the MaCoCu project is to improve EU-specific NLP systems that concern the EU's Digital Service Infrastructures (DSIs). In this paper we aim to boost the creation of such domain-specific NLP systems. To do so, we explore the feasibility of building an automatic classifier that identifies which segments in a generic (potentially parallel) corpus are relevant for a particular DSI. We create an evaluation data set by crawling DSI-specific web domains and then compare different strategies for building our DSI classifier for text in three languages: English, Spanish and Dutch. We use pre-trained (multilingual) language models to perform the classification, with zero-shot classification for Spanish and Dutch. The results are promising, as we are able to classify DSIs with between 70% and 80% accuracy, even without in-language training data. A manual annotation of the data revealed that we can also find DSI-specific data in crawled texts from general web domains with reasonable accuracy. We publicly release all data, predictions and code to allow future investigations into whether exploiting this DSI-specific data actually leads to improved performance on particular applications, such as machine translation.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the BUCC Workshop within LREC 2022 |
| Editors | Reinhard Rapp, Pierre Zweigenbaum, Serge Sharoff |
| Place of Publication | Marseille, France |
| Publisher | European Language Resources Association (ELRA) |
| Pages | 23-32 |
| Number of pages | 10 |
| Publication status | Published - 1-Jun-2022 |