Lower Bias, Higher Density Abusive Language Datasets: A Recipe

Juliet van Rosendaal, Tommaso Caselli, Malvina Nissim

    Research output: Academic, peer-reviewed


    Abstract

    Datasets for training abusive language detection models are both necessary and still scarce. One of the reasons for their limited availability is the cost of their creation. Not only is manual annotation expensive, but the phenomenon is also sparse, so human annotators must go through a large number of irrelevant examples to obtain a meaningful amount of data. Strategies used until now to increase the density of abusive language and obtain more meaningful data overall include filtering data on the basis of pre-selected keywords and drawing on hate-rich sources of data. We suggest a recipe that can provide meaningful data with a possibly higher density of abusive language while also reducing the top-down biases imposed by corpus creators when selecting the data to annotate. More specifically, we exploit the controversy channel on Reddit to obtain keywords that are used to filter a Twitter dataset. While the method needs further validation and refinement, our preliminary experiments show a higher density of abusive tweets in the filtered vs. unfiltered dataset, and a more meaningful topic distribution after filtering.
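
The filtering idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say how keywords are derived from Reddit, so a plain frequency count over a few invented posts stands in for that step, and the tweets, labels, stopword list, and function names are hypothetical. No Reddit or Twitter API is called.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z']+")

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "is", "are", "and", "of", "to", "in", "on", "my"}


def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def extract_keywords(posts, top_k):
    """Return the top_k most frequent non-stopword tokens across the posts.

    Stand-in for the keyword-harvesting step; the real method may differ.
    """
    counts = Counter(
        tok for post in posts for tok in tokenize(post) if tok not in STOPWORDS
    )
    return {word for word, _ in counts.most_common(top_k)}


def filter_by_keywords(tweets, keywords):
    """Keep (text, is_abusive) pairs whose text contains at least one keyword."""
    return [(text, label) for text, label in tweets if keywords & set(tokenize(text))]


def abusive_density(tweets):
    """Fraction of tweets labelled abusive; 0.0 for an empty list."""
    if not tweets:
        return 0.0
    return sum(label for _, label in tweets) / len(tweets)


# Invented stand-ins for posts from a controversial Reddit listing.
reddit_posts = [
    "Immigration policy debate: immigration numbers are rising",
    "The immigration vote and the election",
    "Election fraud claims about the election",
]

# Invented tweets with hypothetical gold labels (True = abusive).
tweets = [
    ("lovely weather today", False),
    ("[insult] aimed at immigration supporters", True),
    ("election results come out tonight", False),
    ("my cat is asleep", False),
]

keywords = extract_keywords(reddit_posts, top_k=2)
filtered = filter_by_keywords(tweets, keywords)

print("keywords:", sorted(keywords))
print("density, unfiltered:", abusive_density(tweets))
print("density, filtered:", abusive_density(filtered))
```

Because the keywords come from the source data and are not hand-picked by the corpus creators, the selection step carries less top-down bias. The density comparison at the end mirrors the filtered vs. unfiltered contrast the abstract reports, on toy data only.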
    Original language: English
    Title of host publication: Proceedings of the 1st Workshop on Resources and Techniques for User and Author Profiling in Abusive Language (ResT-UP)
    Editors: Johanna Monti, Valerio Basile, Maria Pia Di Buono, Raffaele Manna, Antonio Pascucci, Sara Tonelli
    Publisher: European Language Resources Association (ELRA)
    Number of pages: 6
    Status: Published - 2020
    Event: Workshop on Resources and Techniques for User and Author Profiling in Abusive Language - Online
    Duration: 12 May 2020 → …
