IT5: Large-scale Text-to-text Pretraining for Italian Language Understanding and Generation

Gabriele Sarti*, Malvina Nissim

*Corresponding author for this work

Research output: Preprint › Academic


Abstract

The T5 model and its unified text-to-text paradigm contributed to advancing the state-of-the-art for many natural language processing tasks. While some multilingual variants of the T5 model have recently been introduced, their performance on languages other than English was found to be suboptimal compared to monolingual variants. Motivated by these findings, we introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian. We perform a thorough cleaning of a web-crawled Italian corpus comprising more than 40 billion words and use it to pretrain three IT5 models of different sizes. The performance of the IT5 models and their multilingual counterparts is then evaluated on a broad range of natural language understanding and generation benchmarks for Italian. We find the monolingual IT5 models to provide the best scale-to-performance ratio across tested models, consistently outperforming their multilingual counterparts and setting a new state-of-the-art for most Italian conditional language generation tasks.
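In the text-to-text paradigm described above, every task is framed as mapping an input string to an output string with the same encoder-decoder model. Below is a minimal sketch (Python, using the Hugging Face transformers library) of how such a checkpoint could be loaded and queried; the checkpoint identifier and the Italian summarization prompt are illustrative assumptions, not details taken from the abstract.

# Minimal sketch of querying an IT5-style seq2seq checkpoint, assuming it
# exposes the standard T5 interface in Hugging Face transformers.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "it5-base"  # hypothetical checkpoint identifier, replace with the released one
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Text-to-text usage: the task is expressed entirely in the input string,
# and the model generates the answer as a string.
text = "riassumi: Il modello T5 e il suo paradigma text-to-text ..."  # illustrative prompt
inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))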
Original language: English
Publisher: arXiv
Status: Submitted - 9 Mar 2022

Publication series

Name: ArXiv
Publisher: Cornell University Press
ISSN (Print): 2331-8422
