Normalization and parsing algorithms for uncertain input

Rob Matthijs van der Goot

    Abstract

    The automatic analysis (parsing) of natural language is an important ingredient for many natural language processing applications (search engines, automatic translation, speech processing, etc.), as it is the first step towards interpretation. For standard texts, like well-edited news articles, current parsers perform very well. However, for user-generated content, such as tweets, parser performance drops dramatically.

    In this research, we attempt to improve the automatic analysis of spontaneous language by translating it to 'normal' language. For example, the sentence "new pix comming tomorroe" is translated to "new pictures coming tomorrow". This example shows several phenomena: 'pix' is a replacement based on pronunciation, whereas 'comming' is probably a typo. This translation is also referred to as 'normalization'. Based on the observation that the normalization problem actually consists of multiple sub-problems, we developed a modular normalization model: MoNoise. This model reaches a new state-of-the-art performance on a variety of languages.

    Normalizing social media texts leads to a performance increase for syntactic parsers. In the basic setup, we use only the single best normalization candidate for each word, which might lead to error propagation. Hence, we introduce two novel methods that let the parser take multiple normalization candidates into account per position, leading to further improvements in parser performance.
    Original language: English
    Qualification: Doctor of Philosophy
    Awarding institution
    • Rijksuniversiteit Groningen
    Supervisor(s)/advisor
    • van Noord, Gertjan, Supervisor
    • Nissim, Malvina, Supervisor
    Date of award: 4 Apr 2019
    Place of publication: [Groningen]
    Publisher
    Print ISBN: 978-94-034-1458-4
    Electronic ISBN: 978-94-034-1457-7
    Status: Published - 2019
