MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

Marta Bañón, Miquel Esplà-Gomis*, Mikel L. Forcada*, Cristian García-Romero*, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere*, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, Jaume Zaragoza

*Corresponding author for this work

Research output: Academic, peer-reviewed

14 Citations (Scopus)
111 Downloads (Pure)

Abstract

We introduce the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages. The approach followed consists of crawling large amounts of textual data from selected top-level domains of the Internet, and then applying a curation and enrichment pipeline. In addition to corpora, the project will release the free/open-source web crawling and curation software used.

Original language: English
Title of host publication: EAMT 2022 - Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
Editors: Lieve Macken, Andrew Rufener, Joachim Van den Bogaert, Joke Daems, Arda Tezcan, Bram Vanroy, Margot Fonteyne, Loic Barrault, Marta R. Costa-Jussa, Ellie Kemp, Spyridon Pilos, Christophe Declercq, Maarit Koponen, Mikel L. Forcada, Carolina Scarton, Helena Moniz
Publisher: European Association for Machine Translation
Pages: 303-304
Number of pages: 2
ISBN (electronic): 9789464597622
Status: Published - 2022
Event: 23rd Annual Conference of the European Association for Machine Translation, EAMT 2022 - Ghent, Belgium
Duration: 1 Jun 2022 to 3 Jun 2022

Conference

Conference: 23rd Annual Conference of the European Association for Machine Translation, EAMT 2022
Country/Region: Belgium
City: Ghent
Period: 01/06/2022 to 03/06/2022
