TY - JOUR
T1 - How to harvest Word Combinations from corpora
T2 - Methods, evaluation and perspectives
AU - Lenci, Alessandro
AU - Masini, Francesca
AU - Nissim, Malvina
AU - Castagnoli, Sara
AU - Lebani, Gianluca E.
AU - Passaro, Lucia C.
AU - Senaldi, Marco S. G.
PY - 2017
Y1 - 2017
AB - This paper reports on work, carried out in the framework of the CombiNet project, focusing on the automatic extraction of word combinations from large corpora, with a view to representing the full distributional profile of selected lemmas. We describe two extraction methods, based on part-of-speech sequences (P-method) and syntactic patterns (S-method), respectively, evaluating their performance - contrastively, and with reference to external benchmarks - and discussing the relevance of automatic knowledge acquisition for lexicographic purposes. Our results indicate that both approaches provide valuable data and confirm previous claims that P-methods and S-methods are largely complementary, as they tend to retrieve different types of word combinations. In the second part of the paper, we present SYMPAThy, a data representation format devised to merge the two methods by leveraging their respective strengths. To explore SYMPAThy's potential, we also describe a preliminary investigation of a small set of Italian idioms, and specifically of their degree of fixedness/productivity.
KW - word combinations
KW - computational methods
KW - idiomatic expressions
KW - idiomaticity
M3 - Article
SN - 0085-6827
VL - 55
SP - 45
EP - 68
JO - Studi e saggi linguistici
JF - Studi e saggi linguistici
IS - 2
ER -