Quality Beyond A Glance: Revealing Large Quality Differences Between Web-Crawled Parallel Corpora

Rik van Noord, Miquel Esplà-Gomis*, Malina Chichirau, Gema Ramírez-Sánchez, Antonio Toral

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Parallel corpora play a vital role in advanced multilingual natural language processing tasks, notably in machine translation (MT). The recent emergence of numerous large parallel corpora, often extracted from multilingual documents on the Internet, has expanded the available resources. Nevertheless, the quality of these corpora remains largely unexplored, while there are large differences in how the corpora are constructed. Moreover, how the potential differences affect the performance of neural MT (NMT) systems has received only limited attention. This study addresses this gap by manually and automatically evaluating four well-known publicly available parallel corpora across eleven language pairs. Our findings are quite concerning: all corpora contain a substantial amount of noisy sentence pairs, with CCMatrix and CCAligned having well below 50% reasonably clean pairs. MaCoCu and ParaCrawl generally have higher-quality texts, though around a third of the texts still have clear issues. While corpus size impacts NMT models' performance, our study highlights the critical role of quality: higher-quality corpora consistently yield better-performing NMT models when controlling for size.

Original language: English
Title of host publication: Proceedings of the 31st International Conference on Computational Linguistics
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Publisher: Association for Computational Linguistics, ACL Anthology
Pages: 1824-1838
Number of pages: 15
ISBN (Electronic): 9798891761964
Publication status: Published - 2025
Event: 31st International Conference on Computational Linguistics, COLING 2025 - Abu Dhabi, United Arab Emirates
Duration: 19-Jan-2025 → 24-Jan-2025

Publication series

Name: Proceedings - International Conference on Computational Linguistics, COLING
Volume: Part F206484-1
ISSN (Print): 2951-2093

Conference

Conference: 31st International Conference on Computational Linguistics, COLING 2025
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 19/01/2025 → 24/01/2025
