Abstract
In this paper, we describe the creation of a diachronic corpus for Italian by exploiting the digital archive of the newspaper “L’Unità”. We automatically clean and annotate the corpus with PoS tags, lemmas, named entities and syntactic dependencies. Moreover, we compute frequency-based time series for tokens, lemmas and entities. We report corpus statistics that take the temporal dimension into account and describe some examples of how the time series can be used.
Original language | English |
---|---|
Title of host publication | CLiC-it 2020 Italian Conference on Computational Linguistics 2020 |
Subtitle of host publication | Proceedings of the Seventh Italian Conference on Computational Linguistics |
Place of Publication | Bologna |
Publisher | CEUR Workshop Proceedings (CEUR-WS.org) |
Number of pages | 6 |
Volume | 2769 |
Publication status | Published - 2020 |
Event | Italian Conference on Computational Linguistics 2020 - Bologna, Italy. Duration: 1-Mar-2021 → 3-Mar-2021 |
Conference
Conference | Italian Conference on Computational Linguistics 2020 |
---|---|
Abbreviated title | CLiC-it 2020 |
Country/Territory | Italy |
City | Bologna |
Period | 01/03/2021 → 03/03/2021 |
Keywords
- diachronic corpus
- lexical semantics
- concept shifts
- Italian
- written corpus