Reliability of uncertainty quantification methods for deep learning auto-segmentation in head and neck organs at risk

    Research output: Contribution to journal › Article › Academic › peer-review


    Abstract

    Objective. Deep learning auto-segmentation has greatly advanced contouring in radiotherapy. However, quality assurance remains necessary due to performance fluctuation among individual patients. This manual process reintroduces variability and partially reduces time-saving benefits. As a solution, uncertainty quantification (UQ) is increasingly explored for its ability to estimate output confidence. While numerous methods to quantify uncertainty exist, their comparative reliability remains underexplored. This study compares the reliability of commonly used UQ approaches for auto-segmentation in radiotherapy. 

    Approach. We evaluated the reliability of three popular uncertainty methods (Monte Carlo dropout, deep ensemble modelling and test-time augmentation) and three uncertainty metrics (predictive entropy, mutual information and variance). We trained a 3D U-Net within the nnU-Net framework to segment 19 organs at risk (OARs) for head and neck cancer patients. We evaluated the reliability of the UQ methods and metrics on a set of 10 patients using segmentation model accuracy (surface Dice similarity coefficient), confidence calibration (expected calibration error, ECE-label), and error localisation ability (uncertainty-error (U-E) overlap). Both multi-class and class-specific uncertainty maps were assessed.
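
    As an illustration of the three metrics named above, the following is a minimal sketch using their common textbook definitions, not the implementation used in the study. It assumes the T stochastic forward passes (dropout samples, ensemble members or augmented inputs) have already been collected as softmax outputs in an array of shape (T, C, ...). The function name `uncertainty_maps` and the choice to sum the variance over classes are assumptions made for this example.

```python
import numpy as np


def uncertainty_maps(probs, eps=1e-12):
    """Voxel-wise multi-class uncertainty from sampled softmax outputs.

    probs: array of shape (T, C, ...) with T stochastic predictions
           over C classes (hypothetical layout chosen for this sketch).
    Returns (predictive_entropy, mutual_information, variance),
    each with the spatial shape of the input.
    """
    probs = np.asarray(probs, dtype=float)
    mean_p = probs.mean(axis=0)  # (C, ...): average prediction over samples

    # Predictive entropy: entropy of the averaged prediction.
    pred_entropy = -(mean_p * np.log(mean_p + eps)).sum(axis=0)

    # Expected entropy: average of the entropies of the individual samples.
    exp_entropy = -(probs * np.log(probs + eps)).sum(axis=1).mean(axis=0)

    # Mutual information: the part of the entropy caused by disagreement
    # between samples.
    mutual_info = pred_entropy - exp_entropy

    # Variance across samples, summed over classes (one of several
    # possible conventions; an assumption of this sketch).
    variance = probs.var(axis=0).sum(axis=0)

    return pred_entropy, mutual_info, variance
```

    When all samples agree, mutual information and variance are zero while predictive entropy can still be non-zero, which is one reason the metrics can produce different uncertainty maps from the same set of predictions.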

    Main results. Segmentation accuracy remained stable without significant deviations across all UQ methods with respect to the baseline model without UQ. The reliability of different UQ methods in terms of confidence calibration and error localisation was also comparable. In contrast, the choice of UQ metric significantly influenced reliability. Multi-class predictive entropy (ECE-label: 0.06-0.06, U-E overlap: 0.45-0.46) consistently outperformed variance (ECE-label: 0.13-0.14, U-E overlap: 0.32-0.40, p < 0.001) and mutual information (ECE-label: 0.13-0.14, U-E overlap: 0.35-0.40, p < 0.001). Predictive entropy also demonstrated superior reliability as a class-specific UQ metric, though variability was observed between OARs.
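
    For readers unfamiliar with the calibration figures reported above, the following is a minimal sketch of the standard binned expected calibration error. It assumes each voxel already has a confidence score in [0, 1] and a flag saying whether its predicted label was correct. How the study converts each uncertainty metric into such a score is not stated in this abstract, so the function name, the equal-width binning and the default of 10 bins are assumptions of this example.

```python
import numpy as np


def expected_calibration_error(confidence, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin.

    confidence: scores in [0, 1], one per voxel (assumed input).
    correct:    1 where the predicted label matches the reference, else 0.
    """
    confidence = np.asarray(confidence, dtype=float).ravel()
    correct = np.asarray(correct, dtype=float).ravel()

    # Equal-width bins over [0, 1]; interior edges only, right-closed,
    # so a confidence of exactly 1.0 falls into the last bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidence, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of voxels in the bin
    return ece
```

    A value of 0 means that, within every bin, the observed accuracy matches the stated confidence; larger values indicate over- or under-confidence.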

    Significance. This study demonstrates that the choice of UQ approach substantially impacts the reliability of uncertainty maps. While different UQ methods performed comparably, the specific UQ metric chosen significantly affected reliability. These findings underscore the importance of careful metric selection and evaluation prior to application.

    Original language: English
    Article number: 205023
    Number of pages: 16
    Journal: Physics in Medicine and Biology
    Volume: 70
    Issue number: 20
    Publication status: Published - 17-Oct-2025

    Keywords

    • Deep Learning
    • Humans
    • Uncertainty
    • Head and Neck Neoplasms/radiotherapy
    • Organs at Risk/radiation effects
    • Reproducibility of Results
    • Image Processing, Computer-Assisted/methods
    • Automation
