MAGPIE: A Large Corpus of Potentially Idiomatic Expressions

Hessel Haagsma, Johan Bos, Malvina Nissim

    Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, Academic, peer-review

    Given the limited size of existing idiom corpora, we aim to enable progress in automatic idiom processing and linguistic analysis by creating the largest-to-date corpus of idioms for English. Using a fixed idiom list, automatic pre-extraction, and a strictly controlled crowdsourced annotation procedure, we show that it is feasible to build a high-quality corpus comprising more than 50K instances, an order of magnitude larger than previous resources. Crucial ingredients of the crowdsourcing were the selection of crowdworkers, clear and comprehensive instructions, and an interface that breaks the task down into small, manageable steps. Analysis of the resulting corpus revealed strong effects of genre on idiom distribution, providing new evidence for existing theories on what influences idiom usage. The corpus also contains rich metadata and is made publicly available.
    Original language: English
    Title of host publication: Proceedings of The 12th Language Resources and Evaluation Conference
    Subtitle of host publication: LREC 2020
    Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
    Publisher: European Language Resources Association (ELRA)
    Number of pages: 9
    ISBN (Electronic): 9791095546344
    Publication status: Published - 2020
    Event: 12th Language Resources and Evaluation Conference: LREC 2020 - Marseille, France
    Duration: 11-May-2020 to 16-May-2020

    Publication series

    Name: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

