Large language model-based approach for low-cost, rapid and accurate automated extraction of predictive biomarker testing results from Dutch pathology reports

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

National cancer registries and pathology report databases contain valuable real-world information that may be used to assess healthcare quality or to conduct retrospective observational studies. As many of these data are currently unstructured or semi-structured, manual information extraction is needed, which is labor-intensive and costly and significantly delays data analysis. This study aimed to develop a large language model (LLM)-based method to extract biomarker testing data for two routinely tested predictive biomarkers, EGFR and KRAS, from pathology reports of patients with non-small cell lung cancer (NSCLC). Patient cohorts and pathology reports were derived from the Netherlands Cancer Registry (NCR) and the Dutch nationwide pathology databank (Palga). Manually captured data regarding EGFR and KRAS testing in 3887 patients diagnosed with metastatic NSCLC in the Netherlands in 2019 were used for training, testing and validation. Annotated data included biomarker testing status for EGFR and KRAS, the use of next-generation sequencing (NGS), and test results including specific mutation(s). In the test set of the 2019 cohort, the model yielded (micro) F1 scores ≥ 0.98 for all variables (overall biomarker testing status, KRAS/EGFR testing, KRAS/EGFR test results). The trained model was then applied to pathology reports of 4122 patients diagnosed with NSCLC between July 2022 and June 2023, with manual annotation of 410 randomly selected cases to determine model accuracy. In these 410 cases, (micro) F1 scores ≥ 0.95 were achieved. None of the manually annotated positive molecular test results were missed. Standardized notation of reported mutations was correct for 98.7% of KRAS mutations and 100.0% of EGFR mutations. In the entire 2022-2023 cohort, model output revealed overall test rates of 88.1% for KRAS and 86.4% for EGFR. Among the tested patients, the model reported positive test results in 40.5% for KRAS and 11.5% for EGFR.
This study illustrates the feasibility of training and using an LLM-based model to accurately extract biomarker testing results from pathology reports of patients with lung cancer. This application enables rapid, low-cost assessment of biomarker testing, which can be used to evaluate guideline adherence and to support retrospective biomarker follow-up studies at a nationwide level in the Netherlands.
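
For readers unfamiliar with the metric, the (micro) F1 scores reported in the abstract pool true positives, false positives and false negatives across all classes of a variable before computing F1. The sketch below illustrates that calculation only; it is not the authors' evaluation code, and the function name `micro_f1` and the example labels (`"mutated"`, `"wild-type"`, `"not tested"`) are hypothetical.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label classification.

    Counts are pooled over all classes: a correct prediction is one true
    positive; a wrong prediction is one false positive (for the predicted
    class) and one false negative (for the true class).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp += 1
        else:
            fp += 1  # predicted class was wrong
            fn += 1  # true class was missed

    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical annotations for one variable (e.g. a KRAS test result).
manual = ["mutated", "wild-type", "not tested", "mutated"]
model = ["mutated", "wild-type", "mutated", "mutated"]
score = micro_f1(manual, model)  # 3 correct, 1 wrong: 6 / 8 = 0.75
```

When every case receives exactly one label and all classes are counted, as assumed here, micro F1 equals plain accuracy; the two diverge only if some class (for example a "negative" class) is excluded from the pooling.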

Original language: English
Number of pages: 9
Journal: Virchows Archiv : an International Journal of Pathology
Publication status: E-pub ahead of print - 29-Nov-2025
