Simply the Best: Minimalist System Trumps Complex Models in Author Profiling

Angelo Basile, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, Malvina Nissim

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

    Abstract

    A simple linear SVM with word and character n-gram features and minimal parameter tuning can identify the gender and the language variety (for English, Spanish, Arabic and Portuguese) of Twitter users with very high accuracy. All our attempts at improving performance by including more data, smarter features, and employing more complex architectures plainly fail. In addition, we experiment with joint and multitask modelling, but find that they are clearly outperformed by single task models. Eventually, our simplest model was submitted to the PAN 2017 shared task on author profiling, obtaining an average accuracy of 0.86 on the test set, with performance on sub-tasks ranging from 0.68 to 0.98. These were the best results achieved at the competition overall.
    To allow lay people to easily use and see the value of machine learning for author profiling, we also built a web application on top of our models.
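
    As a rough illustration of the kind of model the abstract describes, the sketch below trains a linear SVM on word and character n-gram counts using only the Python standard library. It is not the authors' implementation: the n-gram ranges, the hinge-loss subgradient training loop, the hyperparameters, and the toy British/American English sentences are all assumptions made for the example.

```python
import random
from collections import Counter


def ngram_features(text, word_n=(1, 2), char_n=(3, 5)):
    """Count word and character n-grams (the ranges are illustrative assumptions)."""
    text = text.lower()
    words = text.split()
    feats = Counter()
    for n in range(word_n[0], word_n[1] + 1):
        for i in range(len(words) - n + 1):
            feats["w:" + " ".join(words[i:i + n])] += 1
    for n in range(char_n[0], char_n[1] + 1):
        for i in range(len(text) - n + 1):
            feats["c:" + text[i:i + n]] += 1
    return feats


def score(weights, feats):
    """Dot product between a sparse weight dict and a sparse feature dict."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())


def train_linear_svm(texts, labels, epochs=20, lr=0.1, lam=0.01, seed=0):
    """Fit a linear SVM by subgradient descent on the L2-regularised hinge loss.

    Labels must be +1 or -1. All hyperparameters are hypothetical defaults.
    """
    rng = random.Random(seed)
    data = [(ngram_features(t), y) for t, y in zip(texts, labels)]
    weights = {}
    for _ in range(epochs):
        rng.shuffle(data)
        for feats, y in data:
            margin = y * score(weights, feats)
            # L2 regularisation: shrink every weight slightly.
            for f in weights:
                weights[f] *= 1.0 - lr * lam
            # Hinge loss: update only when the example is inside the margin.
            if margin < 1.0:
                for f, v in feats.items():
                    weights[f] = weights.get(f, 0.0) + lr * y * v
    return weights


def predict(weights, text):
    """Return +1 or -1 for a single text."""
    return 1 if score(weights, ngram_features(text)) >= 0 else -1


# Toy language-variety data (made up): +1 = British, -1 = American spelling.
TEXTS = [
    "my favourite colour is grey",
    "the theatre is in the city centre",
    "my favorite color is gray",
    "the theater is in the city center",
]
LABELS = [1, 1, -1, -1]

weights = train_linear_svm(TEXTS, LABELS)
predictions = [predict(weights, t) for t in TEXTS]
```

    In practice a library implementation of the linear SVM would replace the hand-written training loop; the sketch only aims to show how little machinery the approach needs.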
    Original language: English
    Title of host publication: Experimental IR Meets Multilinguality, Multimodality, and Interaction
    Subtitle of host publication: 9th International Conference of the CLEF Association, CLEF 2018, Avignon, France, September 10-14, 2018, Proceedings
    Editors: P. Bellot, C. Trabelsi, J. Mothe, F. Murtagh, J. Y. Nie, L. Soulier, E. SanJuan, L. Cappellato, N. Ferro
    Publisher: Springer
    Pages: 143-156
    Number of pages: 14
    ISBN (Electronic): 978-3-319-98932-7
    Publication status: Published - 10-Sep-2018
    Event: 9th International Conference of the CLEF Association, CLEF 2018 - Avignon, France
    Duration: 10-Sep-2018 → 14-Sep-2018

    Conference

    Conference: 9th International Conference of the CLEF Association, CLEF 2018
    Country: France
    City: Avignon
    Period: 10/09/2018 → 14/09/2018
