A Taxonomy for In-depth Evaluation of Normalization for User Generated Content

Rob van der Goot, Rik van Noord, Gertjan van Noord

    Research output: Academic, peer-reviewed

    Abstract

    In this work we present a taxonomy of error categories for lexical normalization, which is the task of translating user generated content to canonical language. We annotate a recent normalization dataset to test the practical use of the taxonomy and reach a near-perfect agreement. This annotated dataset is then used to evaluate how an existing normalization model performs on the different categories of the taxonomy. The results of this evaluation reveal that some of the problematic categories only include minor transformations, whereas most regular transformations are solved quite well.
    Original language: English
    Title: LREC 2018, Eleventh International Conference on Language Resources and Evaluation
    Place of publication: Paris
    Publisher: European Language Resources Association (ELRA)
    Pages: 684-688
    Number of pages: 5
    ISBN (print): 979-10-95546-00-9
    Status: Published - 2018
    Event: Eleventh International Conference on Language Resources and Evaluation - Phoenix Seagaia Resort, Miyazaki, Japan
    Duration: 7 May 2018 to 12 May 2018
    http://lrec2018.lrec-conf.org/en/

    Conference

    Conference: Eleventh International Conference on Language Resources and Evaluation
    Short title: LREC 2018
    Country: Japan
    City: Miyazaki
    Period: 07/05/2018 to 12/05/2018
