TY - GEN
T1 - Can Model Uncertainty Function as a Proxy for Multiple-Choice Question Item Difficulty?
AU - Zotos, Leonidas
AU - van Rijn, Hedderik
AU - Nissim, Malvina
N1 - Publisher Copyright:
© 2025 Association for Computational Linguistics.
PY - 2025/1
Y1 - 2025/1
N2 - Estimating the difficulty of multiple-choice questions would be of great help to educators, who must spend substantial time creating and piloting stimuli for their tests, and to learners who want to practice. Supervised approaches to difficulty estimation have to date yielded mixed results. In this contribution we leverage an aspect of generative large models which might be seen as a weakness when answering questions, namely their uncertainty. Specifically, we exploit model uncertainty by exploring the correlations between two different metrics of uncertainty and the actual student response distribution. While we observe weak but present correlations, we also discover that the models’ behaviour differs between correct and wrong answers, and that the correlations vary substantially across the different question types included in our fine-grained, previously unused dataset of 451 questions from a Biopsychology course. In discussing our findings, we also suggest potential avenues for further leveraging model uncertainty as an additional proxy for item difficulty.
AB - Estimating the difficulty of multiple-choice questions would be of great help to educators, who must spend substantial time creating and piloting stimuli for their tests, and to learners who want to practice. Supervised approaches to difficulty estimation have to date yielded mixed results. In this contribution we leverage an aspect of generative large models which might be seen as a weakness when answering questions, namely their uncertainty. Specifically, we exploit model uncertainty by exploring the correlations between two different metrics of uncertainty and the actual student response distribution. While we observe weak but present correlations, we also discover that the models’ behaviour differs between correct and wrong answers, and that the correlations vary substantially across the different question types included in our fine-grained, previously unused dataset of 451 questions from a Biopsychology course. In discussing our findings, we also suggest potential avenues for further leveraging model uncertainty as an additional proxy for item difficulty.
UR - http://www.scopus.com/inward/record.url?scp=85218500532&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85218500532
T3 - Proceedings - International Conference on Computational Linguistics, COLING
SP - 11304
EP - 11316
BT - Proceedings of the 31st International Conference on Computational Linguistics
A2 - Rambow, Owen
A2 - Wanner, Leo
A2 - Apidianaki, Marianna
A2 - Al-Khalifa, Hend
A2 - Di Eugenio, Barbara
A2 - Schockaert, Steven
PB - Association for Computational Linguistics (ACL)
CY - Abu Dhabi, UAE
T2 - 31st International Conference on Computational Linguistics, COLING 2025
Y2 - 19 January 2025 through 24 January 2025
ER -