A Taxonomy for In-depth Evaluation of Normalization for User Generated Content

Rob van der Goot, Rik van Noord, Gertjan van Noord

    Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Academic; peer-review



    In this work we present a taxonomy of error categories for lexical normalization, which is the task of translating user generated content to canonical language. We annotate a recent normalization dataset to test the practical use of the taxonomy and reach near-perfect agreement. This annotated dataset is then used to evaluate how an existing normalization model performs on the different categories of the taxonomy. The results of this evaluation reveal that some of the problematic categories only include minor transformations, whereas most regular transformations are solved quite well.
    Original language: English
    Title of host publication: LREC 2018, Eleventh International Conference on Language Resources and Evaluation
    Place of Publication: Paris
    Publisher: European Language Resources Association (ELRA)
    Number of pages: 5
    ISBN (Print): 979-10-95546-00-9
    Publication status: Published - 2018
    Event: Eleventh International Conference on Language Resources and Evaluation - Phoenix Seagaia Resort, Miyazaki, Japan
    Duration: 7-May-2018 to 12-May-2018


    Conference: Eleventh International Conference on Language Resources and Evaluation
    Abbreviated title: LREC 2018