ReproHum #0033-3: Comparable Relative Results with Lower Absolute Values in a Reproduction Study

Research output: Academic, peer review


Abstract

In the context of the ReproHum project, which aims to assess the reliability of human evaluation, we replicated the human evaluation conducted in "Generating Scientific Definitions with Controllable Complexity" by August et al. (2022). Specifically, humans were asked to assess the fluency of scientific definitions automatically generated by three different models, with output complexity varying according to the target audience. Evaluation conditions were kept as close as possible to the original study, except for necessary and minor adjustments. Our results, despite yielding lower absolute performance, show that relative performance across the three tested systems remains comparable to what was observed in the original paper. On the basis of the lower inter-annotator agreement and the feedback received from annotators in our experiment, we also observe that the ambiguity of the concept being evaluated may play a substantial role in human assessment.

Original language: English
Title: 4th Workshop on Human Evaluation of NLP Systems, HumEval 2024 at LREC-COLING 2024 - Workshop Proceedings
Editors: Simone Balloccu, Anya Belz, Rudali Huidrom, Ehud Reiter, Joao Sedoc, Craig Thomson
Publisher: European Language Resources Association (ELRA)
Pages: 238-249
Number of pages: 12
Electronic ISBN: 978-249381441-8
Status: Published - 2024
Event: 4th Workshop on Human Evaluation of NLP Systems, HumEval 2024 - Torino, Italy
Duration: 21 May 2024 - 21 May 2024

Conference

Conference: 4th Workshop on Human Evaluation of NLP Systems, HumEval 2024
Country/Region: Italy
City: Torino
Period: 21/05/2024 - 21/05/2024
