Lower Bias, Higher Density Abusive Language Datasets: A Recipe

Juliet van Rosendaal, Tommaso Caselli, Malvina Nissim

    Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Academic; peer-review

    Abstract

    Datasets for training abusive language detection models are both necessary and still scarce. One of the reasons for their limited availability is the cost of their creation. Not only is manual annotation expensive, but the phenomenon is also sparse, so human annotators have to go through a large number of irrelevant examples to obtain a meaningful amount of data. Strategies used so far to increase the density of abusive language and obtain more meaningful data overall include filtering data on the basis of pre-selected keywords and drawing on hate-rich sources of data. We suggest a recipe that can both provide meaningful data with possibly higher density of abusive language and reduce the top-down biases imposed by corpus creators in selecting the data to annotate. More specifically, we exploit the controversy channel on Reddit to obtain keywords that are used to filter a Twitter dataset. While the method needs further validation and refinement, our preliminary experiments show a higher density of abusive tweets in the filtered vs. unfiltered dataset, and a more meaningful topic distribution after filtering.
    Original language: English
    Title of host publication: Proceedings of the 1st Workshop on Resources and Techniques for User and Author Profiling in Abusive Language (ResT-UP)
    Editors: Johanna Monti, Valerio Basile, Maria Pia Di Buono, Raffaele Manna, Antonio Pascucci, Sara Tonelli
    Publisher: European Language Resources Association (ELRA)
    Number of pages: 6
    Publication status: Published - 2020
    Event: Workshop on Resources and Techniques for User and Author Profiling in Abusive Language - Online
    Duration: 12-May-2020 → …
