Building Domain-specific Corpora from the Web: the Case of European Digital Service Infrastructures

Rik van Noord, Cristian Garcia-Romero, Miquel Esplà-Gomis, Leopoldo Pla Sempere, Antonio Toral

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > Academic > peer-review



An important goal of the MaCoCu project is to improve EU-specific NLP systems that concern the EU's Digital Service Infrastructures (DSIs). In this paper, we aim to boost the creation of such domain-specific NLP systems. To do so, we explore the feasibility of building an automatic classifier that identifies which segments in a generic (potentially parallel) corpus are relevant for a particular DSI. We create an evaluation data set by crawling DSI-specific web domains and then compare different strategies for building our DSI classifier for text in three languages: English, Spanish and Dutch. We use pre-trained (multilingual) language models to perform the classification, with zero-shot classification for Spanish and Dutch. The results are promising: we are able to classify DSIs with between 70% and 80% accuracy, even without in-language training data. A manual annotation of the data revealed that we can also find DSI-specific data in crawled texts from general web domains with reasonable accuracy. We publicly release all data, predictions and code to allow future investigations into whether exploiting this DSI-specific data actually leads to improved performance on particular applications, such as machine translation.
Original language: English
Title of host publication: Proceedings of the BUCC Workshop within LREC 2022
Editors: Reinhard Rapp, Pierre Zweigenbaum, Serge Sharoff
Place of Publication: Marseille, France
Publisher: European Language Resources Association (ELRA)
Number of pages: 10
Publication status: Published - 1-Jun-2022

