IT5: Large-scale Text-to-text Pretraining for Italian Language Understanding and Generation

Gabriele Sarti*, Malvina Nissim

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic


Abstract

The T5 model and its unified text-to-text paradigm have contributed to advancing the state of the art for many natural language processing tasks. While multilingual variants of the T5 model have recently been introduced, they were found to provide suboptimal performance for languages other than English compared to monolingual variants. Motivated by these findings, we introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian. We perform a thorough cleaning of a web-crawled Italian corpus comprising more than 40 billion words and use it to pretrain three IT5 models of different sizes. We then evaluate the IT5 models and their multilingual counterparts on a broad range of natural language understanding and generation benchmarks for Italian. We find that the monolingual IT5 models provide the best scale-to-performance ratio across tested models, consistently outperforming their multilingual counterparts and setting a new state of the art for most Italian conditional language generation tasks.
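The unified text-to-text paradigm mentioned in the abstract casts every task, from summarization to question answering, as mapping an input string to a target string, often distinguished by a task prefix. A minimal sketch of this formulation follows; the prefixes and example texts are illustrative, not the exact ones used to train IT5.

```python
# Sketch of the text-to-text formulation: every task becomes a
# (prefixed input string -> target string) pair. Prefixes here are
# hypothetical examples, not IT5's actual task prefixes.

def to_text2text(task: str, source: str, target: str) -> dict:
    """Cast a labeled example as a prefixed input/target string pair."""
    return {"input": f"{task}: {source}", "target": target}

# Two different tasks expressed in the same uniform format:
examples = [
    to_text2text(
        "riassumi",  # "summarize" (hypothetical Italian prefix)
        "IT5 è una famiglia di modelli encoder-decoder preaddestrati sull'italiano.",
        "IT5: modelli preaddestrati per l'italiano.",
    ),
    to_text2text(
        "rispondi",  # "answer" (hypothetical prefix for QA)
        "Chi ha introdotto IT5? contesto: IT5 è stato introdotto da Sarti e Nissim.",
        "Sarti e Nissim",
    ),
]

for ex in examples:
    print(ex["input"], "->", ex["target"])
```

Because both tasks share one input/output format, a single encoder-decoder model can be fine-tuned on all of them without task-specific heads, which is what allows one IT5 checkpoint to serve many Italian generation benchmarks.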
Original language: English
Journal: ArXiv
Publication status: Submitted - 9-Mar-2022
Event: Huggingface JAX/Flax Community Week - Online
Duration: 7-Jul-2021 – 14-Jul-2021
https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104

Keywords

  • italian language
  • pre-training
  • t5
  • deep learning
  • question answering systems
  • style transfer
  • natural language processing
  • summarization
  • headline generation
  • news summarization
  • wikipedia summarization
