How to harvest Word Combinations from corpora: Methods, evaluation and perspectives

Alessandro Lenci*, Francesca Masini, Malvina Nissim, Sara Castagnoli, Gianluca E. Lebani, Lucia C. Passaro, Marco S. G. Senaldi

*Corresponding author for this work

    Research output: Contribution to journal, Article, Academic, peer-reviewed

    Abstract

    This paper reports on work carried out within the CombiNet project on the automatic extraction of word combinations from large corpora, with a view to representing the full distributional profile of selected lemmas. We describe two extraction methods, based on part-of-speech sequences (P-method) and syntactic patterns (S-method), respectively. We evaluate their performance, both contrastively and with reference to external benchmarks, and discuss the relevance of automatic knowledge acquisition for lexicographic purposes. Our results indicate that both approaches provide valuable data and confirm previous claims that P-methods and S-methods are largely complementary, as they tend to retrieve different types of word combinations. In the second part of the paper, we present SYMPAThy, a data representation format devised to merge the two methods by leveraging their respective strengths. To explore SYMPAThy's potential, we also describe a preliminary investigation of a small set of Italian idioms, focusing on their degree of fixedness/productivity.

    Original language: English
    Pages (from-to): 45-68
    Number of pages: 24
    Journal: Studi e saggi linguistici
    Volume: 55
    Issue number: 2
    Publication status: Published - 2017

    Keywords

    • word combinations
    • computational methods
    • idiomatic expressions
    • idiomaticity
