Normalization and parsing algorithms for uncertain input

Rob Matthijs van der Goot

    Research output: Thesis; Thesis fully internal (DIV)

    Abstract

    The automatic analysis (parsing) of natural language is an important ingredient for many natural language processing applications (search-engines, automatic translation, speech-processing, etc.), as it is the first step towards interpretation. For standard texts, like well-edited news articles, current parsers perform very well. However, for user-generated content, such as tweets, parser performance drops dramatically.

    In this research, we attempt to improve the automatic analysis of spontaneous language by translating it to 'normal' language. For example, the sentence "new pix comming tomorroe" is translated to "new pictures coming tomorrow". Several phenomena occur in this example sentence: 'pix' is a replacement based on pronunciation, whereas 'comming' is probably a typo. This translation is also referred to as 'normalization'. Based on the observation that the normalization problem actually consists of multiple sub-problems, we developed a modular normalization model: MoNoise. This normalization model reaches new state-of-the-art performance on a variety of languages.
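
    The abstract does not describe the internals of MoNoise, so the following is only a toy sketch of the modular idea: separate modules propose candidates for different sub-problems (a replacement list for pronunciation-based forms, string similarity for typos), and one candidate is then selected. The vocabulary, the replacement list, the function names, and the "take the first candidate" selection rule are all assumptions made for illustration; they are not taken from the thesis.

```python
from difflib import get_close_matches

# Hypothetical toy resources, chosen only to cover the example sentence.
VOCAB = ["new", "pictures", "coming", "tomorrow"]
SLANG = {"pix": "pictures"}


def lookup_module(word):
    """Propose candidates from a replacement list (e.g. pronunciation-based forms)."""
    return [SLANG[word]] if word in SLANG else []


def spelling_module(word):
    """Propose candidates by string similarity to known words (aimed at typos)."""
    return get_close_matches(word, VOCAB, n=3, cutoff=0.7)


def normalize(sentence):
    """Normalize each word independently using the candidate modules above."""
    normalized = []
    for word in sentence.split():
        if word in VOCAB:
            # Already a known word: keep it unchanged.
            normalized.append(word)
            continue
        # Collect candidates from every module; a real system would rank them
        # with a learned model, whereas this sketch simply takes the first one.
        candidates = lookup_module(word) + spelling_module(word)
        normalized.append(candidates[0] if candidates else word)
    return " ".join(normalized)


print(normalize("new pix comming tomorroe"))
```

    Keeping the modules separate means each sub-problem can be handled, evaluated, or replaced on its own, which is the motivation the abstract gives for a modular design.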

    Normalizing social media texts leads to a performance increase for syntactic parsers. In the basic setup, we use only the single best normalization candidate for each word, which might lead to error propagation. Hence, we introduce two novel methods that let the parser take multiple normalization candidates into account per position, leading to further improvements in parser performance.
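
    The two methods themselves are not described in the abstract. The sketch below only illustrates the underlying problem, under invented assumptions: committing to the single best candidate per word can lock in an error, whereas keeping several candidates per position lets a downstream scorer pick a better combination. The candidate scores and the `PAIR_BONUS` table (a stand-in for a parser's preferences) are made up for the example.

```python
from itertools import product

# Hypothetical candidate lists per position, with made-up normalization scores.
lattice = [
    [("new", 1.0)],
    [("pictures", 0.6), ("pics", 0.4)],
    [("combing", 0.55), ("coming", 0.45)],
    [("tomorrow", 1.0)],
]

# Hypothetical stand-in for a downstream (parser) preference over adjacent words.
PAIR_BONUS = {("pictures", "coming"): 2.0, ("coming", "tomorrow"): 2.0}


def one_best(lattice):
    """Commit to the highest-scoring candidate at each position independently."""
    return [max(cands, key=lambda c: c[1])[0] for cands in lattice]


def joint_best(lattice):
    """Score every full path, combining normalization and downstream scores."""
    def score(path):
        norm = 1.0
        for _, prob in path:
            norm *= prob
        words = [w for w, _ in path]
        bonus = sum(PAIR_BONUS.get(pair, 0.0) for pair in zip(words, words[1:]))
        return norm * (1.0 + bonus)

    best_path = max(product(*lattice), key=score)
    return [w for w, _ in best_path]


print(one_best(lattice))    # the 1-best choice keeps the wrong word "combing"
print(joint_best(lattice))  # the joint choice recovers "coming"
```

    Exhaustive enumeration is used here only because the example is tiny; it grows exponentially with sentence length and is not meant as a practical search strategy.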
    Original language: English
    Qualification: Doctor of Philosophy
    Awarding Institution
    • University of Groningen
    Supervisors/Advisors
    • van Noord, Gertjan, Supervisor
    • Nissim, Malvina, Supervisor
    Award date: 4-Apr-2019
    Place of Publication: [Groningen]
    Publisher
    Print ISBNs: 978-94-034-1458-4
    Electronic ISBNs: 978-94-034-1457-7
    Publication status: Published - 2019