TY - JOUR
T1 - Reliability of uncertainty quantification methods for deep learning auto-segmentation in head and neck organs at risk
AU - van Aalst, Joëlle E
AU - Maruccio, Federica C
AU - Simões, Rita
AU - Janssen, Tomas M
AU - Wolterink, Jelmer M
AU - van Ooijen, Peter M A
AU - Brouwer, Charlotte L
N1 - Creative Commons Attribution license.
PY - 2025/10/17
Y1 - 2025/10/17
N2 - Objective. Deep learning auto-segmentation has greatly advanced contouring in radiotherapy. However, quality assurance remains necessary due to performance fluctuations among individual patients. This manual process reintroduces variability and partially reduces the time-saving benefits. As a solution, uncertainty quantification (UQ) is increasingly explored for its ability to estimate output confidence. While numerous methods to quantify uncertainty exist, their comparative reliability remains underexplored. This study compares the reliability of commonly used UQ approaches for auto-segmentation in radiotherapy. Approach. We evaluated the reliability of three popular uncertainty methods (Monte Carlo dropout, deep ensemble modelling and test-time augmentation) and uncertainty metrics (predictive entropy, mutual information and variance). We trained a 3D U-Net within the nnU-Net framework to segment 19 organs at risk (OARs) for head and neck cancer patients. We evaluated the reliability of the UQ methods and metrics on a set of 10 patients using segmentation model accuracy (surface Dice similarity coefficient), confidence calibration (expected calibration error (ECE-label)), and error localisation ability (uncertainty-error (U-E) overlap). Both multi-class and class-specific uncertainty maps were assessed. Main results. Segmentation accuracy remained stable without significant deviations across all UQ methods with respect to the baseline model without UQ. The reliability of different UQ methods in terms of confidence calibration and error localisation was also comparable. In contrast, the choice of UQ metric significantly influenced reliability. Multi-class predictive entropy (ECE-label: 0.06-0.06, U-E overlap: 0.45-0.46) consistently outperformed variance (ECE-label: 0.13-0.14, U-E overlap: 0.32-0.40, p < 0.001) and mutual information (ECE-label: 0.13-0.14, U-E overlap: 0.35-0.40, p < 0.001).
Predictive entropy also demonstrated superior reliability as a class-specific UQ metric, though variability was observed between OARs. Significance. This study demonstrates that the choice of UQ approach substantially impacts the reliability of uncertainty maps. While different UQ methods performed comparably, the specific UQ metric chosen significantly affected reliability. These findings underscore the importance of careful metric selection and evaluation prior to application.
AB - Objective. Deep learning auto-segmentation has greatly advanced contouring in radiotherapy. However, quality assurance remains necessary due to performance fluctuations among individual patients. This manual process reintroduces variability and partially reduces the time-saving benefits. As a solution, uncertainty quantification (UQ) is increasingly explored for its ability to estimate output confidence. While numerous methods to quantify uncertainty exist, their comparative reliability remains underexplored. This study compares the reliability of commonly used UQ approaches for auto-segmentation in radiotherapy. Approach. We evaluated the reliability of three popular uncertainty methods (Monte Carlo dropout, deep ensemble modelling and test-time augmentation) and uncertainty metrics (predictive entropy, mutual information and variance). We trained a 3D U-Net within the nnU-Net framework to segment 19 organs at risk (OARs) for head and neck cancer patients. We evaluated the reliability of the UQ methods and metrics on a set of 10 patients using segmentation model accuracy (surface Dice similarity coefficient), confidence calibration (expected calibration error (ECE-label)), and error localisation ability (uncertainty-error (U-E) overlap). Both multi-class and class-specific uncertainty maps were assessed. Main results. Segmentation accuracy remained stable without significant deviations across all UQ methods with respect to the baseline model without UQ. The reliability of different UQ methods in terms of confidence calibration and error localisation was also comparable. In contrast, the choice of UQ metric significantly influenced reliability. Multi-class predictive entropy (ECE-label: 0.06-0.06, U-E overlap: 0.45-0.46) consistently outperformed variance (ECE-label: 0.13-0.14, U-E overlap: 0.32-0.40, p < 0.001) and mutual information (ECE-label: 0.13-0.14, U-E overlap: 0.35-0.40, p < 0.001).
Predictive entropy also demonstrated superior reliability as a class-specific UQ metric, though variability was observed between OARs. Significance. This study demonstrates that the choice of UQ approach substantially impacts the reliability of uncertainty maps. While different UQ methods performed comparably, the specific UQ metric chosen significantly affected reliability. These findings underscore the importance of careful metric selection and evaluation prior to application.
KW - Deep Learning
KW - Humans
KW - Uncertainty
KW - Head and Neck Neoplasms/radiotherapy
KW - Organs at Risk/radiation effects
KW - Reproducibility of Results
KW - Image Processing, Computer-Assisted/methods
KW - Automation
U2 - 10.1088/1361-6560/ae110c
DO - 10.1088/1361-6560/ae110c
M3 - Article
C2 - 41061736
SN - 0031-9155
VL - 70
JO - Physics in Medicine and Biology
JF - Physics in Medicine and Biology
IS - 20
M1 - 205023
ER -