TY - GEN
T1 - Quality Beyond A Glance: Revealing Large Quality Differences Between Web-Crawled Parallel Corpora
T2 - 31st International Conference on Computational Linguistics, COLING 2025
AU - van Noord, Rik
AU - Esplà-Gomis, Miquel
AU - Chichirau, Malina
AU - Ramírez-Sánchez, Gema
AU - Toral, Antonio
N1 - Publisher Copyright:
© 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
N2 - Parallel corpora play a vital role in advanced multilingual natural language processing tasks, notably in machine translation (MT). The recent emergence of numerous large parallel corpora, often extracted from multilingual documents on the Internet, has expanded the available resources. Nevertheless, the quality of these corpora remains largely unexplored, even though there are large differences in how they are constructed. Moreover, how these differences affect the performance of neural MT (NMT) systems has received only limited attention. This study addresses this gap by manually and automatically evaluating four well-known publicly available parallel corpora across eleven language pairs. Our findings are quite concerning: all corpora contain a substantial number of noisy sentence pairs, with CCMatrix and CCAligned having well below 50% reasonably clean pairs. MaCoCu and ParaCrawl generally have higher-quality texts, though around a third of the texts still have clear issues. While corpus size impacts NMT models' performance, our study highlights the critical role of quality: higher-quality corpora consistently yield better-performing NMT models when controlling for size.
AB - Parallel corpora play a vital role in advanced multilingual natural language processing tasks, notably in machine translation (MT). The recent emergence of numerous large parallel corpora, often extracted from multilingual documents on the Internet, has expanded the available resources. Nevertheless, the quality of these corpora remains largely unexplored, even though there are large differences in how they are constructed. Moreover, how these differences affect the performance of neural MT (NMT) systems has received only limited attention. This study addresses this gap by manually and automatically evaluating four well-known publicly available parallel corpora across eleven language pairs. Our findings are quite concerning: all corpora contain a substantial number of noisy sentence pairs, with CCMatrix and CCAligned having well below 50% reasonably clean pairs. MaCoCu and ParaCrawl generally have higher-quality texts, though around a third of the texts still have clear issues. While corpus size impacts NMT models' performance, our study highlights the critical role of quality: higher-quality corpora consistently yield better-performing NMT models when controlling for size.
UR - http://www.scopus.com/inward/record.url?scp=85218505359&partnerID=8YFLogxK
UR - https://aclanthology.org/2025.coling-main.124/
M3 - Conference contribution
AN - SCOPUS:85218505359
T3 - Proceedings - International Conference on Computational Linguistics, COLING
SP - 1824
EP - 1838
BT - Proceedings of the 31st International Conference on Computational Linguistics
A2 - Rambow, Owen
A2 - Wanner, Leo
A2 - Apidianaki, Marianna
A2 - Al-Khalifa, Hend
A2 - Di Eugenio, Barbara
A2 - Schockaert, Steven
PB - Association for Computational Linguistics
Y2 - 19 January 2025 through 24 January 2025
ER -