MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

Marta Bañón, Miquel Esplà-Gomis*, Mikel L. Forcada*, Cristian García-Romero*, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere*, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, Jaume Zaragoza

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, Academic, peer-review

Abstract

We introduce MaCoCu (Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages), a project funded by the Connecting Europe Facility that aims to build monolingual and parallel corpora for under-resourced European languages. The approach consists of crawling large amounts of textual data from selected top-level domains of the Internet and then applying a curation and enrichment pipeline. In addition to the corpora, the project will release the free/open-source web crawling and curation software used.

Original language: English
Title of host publication: EAMT 2022 - Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
Editors: Lieve Macken, Andrew Rufener, Joachim Van den Bogaert, Joke Daems, Arda Tezcan, Bram Vanroy, Margot Fonteyne, Loic Barrault, Marta R. Costa-Jussa, Ellie Kemp, Spyridon Pilos, Christophe Declercq, Maarit Koponen, Mikel L. Forcada, Carolina Scarton, Helena Moniz
Publisher: European Association for Machine Translation
Pages: 303-304
Number of pages: 2
ISBN (Electronic): 9789464597622
Publication status: Published - 2022
Event: 23rd Annual Conference of the European Association for Machine Translation, EAMT 2022 - Ghent, Belgium
Duration: 1-Jun-2022 to 3-Jun-2022

Conference

Conference: 23rd Annual Conference of the European Association for Machine Translation, EAMT 2022
Country/Territory: Belgium
City: Ghent
Period: 01/06/2022 to 03/06/2022
