A Bigger Fish to Fry: Scaling up the Automatic Understanding of Idiomatic Expressions

Hessel Haagsma




    In this thesis, we are concerned with idiomatic expressions and how to handle them within NLP. Idiomatic expressions are a type of multiword phrase whose meaning is not a direct combination of the meanings of its parts, e.g. 'at a crossroads' and 'move the goalposts'.

    In Part I, we provide a general introduction to idiomatic expressions and an overview of observations regarding idioms based on corpus data. In addition, we discuss existing research on idioms from an NLP perspective, providing an overview of existing tasks, approaches, and datasets. In Part II, we focus on building a large idiom corpus, which involves developing a system for the automatic extraction of potentially idiomatic expressions and building a large corpus of idioms using crowdsourced annotation. Finally, in Part III, we improve an existing unsupervised classifier and compare it to other existing classifiers. Given the relatively poor performance of this unsupervised classifier, we also develop a supervised deep neural network-based system and find that a model involving two separate modules looking at different information sources yields the best performance, surpassing previous state-of-the-art approaches.

    In conclusion, this work shows the feasibility of building a large corpus of sense-annotated potentially idiomatic expressions (PIEs), and the benefits such a corpus provides for further research. It allows quick testing of hypotheses about the distribution and usage of idioms, it enables the training of data-hungry machine learning methods for PIE disambiguation systems, and it permits fine-grained, reliable evaluation of such systems.
    Original language: English
    Qualification: Doctor of Philosophy
    Awarding institution
    • Rijksuniversiteit Groningen
    • Bos, Johan, Supervisor
    • Nissim, Malvina, Supervisor
    Date of award: 3 Sep 2020
    Place of publication: [Groningen]
    Print ISBNs: 9789403425269
    Electronic ISBNs: 9789403425252
    Status: Published - 2020
