Publications

Bai, Zilong; Xu, Zihan; Sun, Cong; Zang, Chengxi; Bunnell, H. Timothy; Sinfield, Catherine; Rutter, Jacqueline; Martinez, Aaron Thomas; Bailey, L. Charles; Weiner, Mark G.; Campion, Thomas T.; Carton, Thomas W.; Forrest, Christopher B.; Kaushal, Rainu; Wang, Fei; Peng, Yifan

Extracting post-acute sequelae of SARS-CoV-2 infection symptoms from clinical notes via hybrid natural language processing Journal Article

In: npj Health Systems, vol. 21, iss. 2, 2025.

Tags: COVID-19, long COVID, natural language processing

@article{nokey,

title = {Extracting post-acute sequelae of SARS-CoV-2 infection symptoms from clinical notes via hybrid natural language processing},

author = {Zilong Bai and Zihan Xu and Cong Sun and Chengxi Zang and H. Timothy Bunnell and Catherine Sinfield and Jacqueline Rutter and Aaron Thomas Martinez and L. Charles Bailey and Mark G. Weiner and Thomas T. Campion and Thomas W. Carton and Christopher B. Forrest and Rainu Kaushal and Fei Wang and Yifan Peng},

doi = {10.1038/s44401-025-00033-4},

year  = {2025},

date = {2025-08-21},

journal = {npj Health Systems},

volume = {21},

issue = {2},

abstract = {Accurately and efficiently diagnosing Post-Acute Sequelae of COVID-19 (PASC) remains challenging due to its myriad symptoms that evolve over long and variable time intervals. To address this issue, we developed a hybrid natural language processing pipeline that integrates rule-based named entity recognition with BERT-based assertion detection modules for PASC-symptom extraction and assertion detection from clinical notes. We developed a comprehensive PASC lexicon with clinical specialists. From 11 health systems of the RECOVER initiative network across the U.S., we curated 160 intake progress notes for model development and evaluation, and collected 47,654 progress notes for a population-level prevalence study. We achieved an average F1 score of 0.82 in one-site internal validation and 0.76 in 10-site external validation for assertion detection. Our pipeline processed each note in 2.448 ± 0.812 seconds on average. Spearman correlation tests showed ρ > 0.83 for positive mentions and ρ > 0.72 for negative ones, both with P < 0.0001. These results demonstrate the effectiveness and efficiency of our models and their potential for improving PASC diagnosis.},

keywords = {COVID-19, long COVID, natural language processing},

pubstate = {published},

tppubtype = {article}

}

doi:10.1038/s44401-025-00033-4