Infrequent forms: noise or not?

Martijn Wieling, Simonetta Montemagni

    Research output: Chapter in Book/Report/Conference proceeding, Chapter, Academic


    In this study we ask whether simplifying the data in dialectometrical studies by
    removing infrequent forms helps to uncover the geographical structure in dialect
    data. By investigating lexical variation in a large corpus of Tuscan dialect data
    via hierarchical bipartite spectral graph partitioning, we are able to identify the
    main geographical areas together with their linguistic basis. To assess the
    influence of infrequent forms, we conduct two analyses: one which includes only
    lexical variants used by at least 0.5% of the informants, and another which
    includes all lexical variants in the data. We show that using all data yields a
    geographical characterization with a more adequate linguistic basis than using
    the trimmed data.
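
The trimming step described above (keeping only lexical variants used by at least 0.5% of the informants) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `trim_variants`, the `(informant_id, variant)` record layout, and the reading of "at least 0.5%" as a share of distinct informants are all assumptions made for illustration.

```python
from collections import defaultdict


def trim_variants(records, min_share=0.005):
    """Drop records whose variant is used by fewer than min_share of informants.

    records: iterable of (informant_id, variant) pairs. This layout is a
    hypothetical one chosen for the sketch, not the format of the Tuscan data.
    min_share: minimum fraction of distinct informants that must use a variant
    (0.005 corresponds to the 0.5% threshold mentioned in the abstract).
    """
    records = list(records)
    informants = {informant for informant, _ in records}

    # Collect the set of distinct informants that use each variant.
    users = defaultdict(set)
    for informant, variant in records:
        users[variant].add(informant)

    # A variant survives if enough distinct informants use it.
    threshold = min_share * len(informants)
    kept = {variant for variant, who in users.items() if len(who) >= threshold}

    return [(informant, variant) for informant, variant in records if variant in kept]
```

Calling `trim_variants(records, min_share=0.0)` keeps every variant, which corresponds to the second analysis that uses all of the data.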
    Original language: English
    Title of host publication: The Future of Dialects
    Subtitle of host publication: Selected papers from Methods in Dialectology XV
    Editors: Marie-Hélène Côté, Remco Knooihuizen, John Nerbonne
    Publisher: Language Science Press
    Number of pages: 10
    ISBN (Print): 9783946234197
    Publication status: Published - 2016
